# How to Use Python to See Which Movies Have Just Been Released

Published: 2021-11-25 14:36:17 | Author: iii
Source: Yisu Cloud (亿速云)

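With so much information competing for attention, a quick way to see which movies have just been released is useful to film fans and data analysts alike. This article walks through scraping, parsing, and visualizing data on recently released movies with Python, covering the full path from a basic scraper to data analysis.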

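## 1. Technical approach overview

We will use the following stack:
- Data collection: Requests + BeautifulSoup, or Scrapy
- Data processing: Pandas
- Data storage: SQLite/CSV
- Visualization: Matplotlib/Pyecharts
- Optional framework: Scrapy-Selenium (for dynamically rendered pages)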

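## 2. Choosing a data source

### 1. Comparison of mainstream movie data sources

| Data source              | Pros                                          | Cons                                 |
|--------------------------|-----------------------------------------------|--------------------------------------|
| Maoyan (猫眼电影)        | Well-structured data, moderate anti-scraping  | Dynamic loading must be handled      |
| Douban Movies (豆瓣电影) | Comprehensive information, API-friendly       | New releases appear with a delay     |
| IMDb                     | Broad international coverage                  | Slow to access from mainland China   |
| Mtime (时光网)           | Specialized film and TV data                  | Strong anti-scraping measures        |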

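### 2. A solution using Maoyan as the example

We use Maoyan (maoyan.com) as the data source because:
- It has a dedicated "recently released" listing page
- Its page structure is fairly regular
- Its data includes key fields such as ratings and box office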

## 3. Basic scraper implementation

### 1. Environment setup

```bash
pip install requests beautifulsoup4 pandas
```

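### 2. Page analysis

Open Maoyan's "Now Showing" page (https://maoyan.com/films?showType=1) and use the browser's developer tools to:
- Find the HTML structure of the movie list
- Work out the CSS selectors for the movie title, score, lead actors, and other fields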
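To make the selector step concrete, the sketch below runs BeautifulSoup against a small hand-written HTML fragment. The fragment is an assumption modeled on the class names used in this article, not a copy of Maoyan's real markup, which may differ and can change at any time. The helper name `extract_titles_and_scores` is also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration only; the live page may differ.
SAMPLE_HTML = """
<dl class="movie-list">
  <dd>
    <div class="channel-detail movie-item-title" title="Movie A"><a>Movie A</a></div>
    <div class="score">9.1</div>
  </dd>
  <dd>
    <div class="channel-detail movie-item-title" title="Movie B"><a>Movie B</a></div>
  </dd>
</dl>
"""

def extract_titles_and_scores(html):
    """Return (title, score_text_or_None) for each <dd> in the movie list."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".movie-list dd"):
        # One element with two classes: chain the classes without a space.
        title_tag = item.select_one(".channel-detail.movie-item-title")
        score_tag = item.select_one(".score")
        title = title_tag.get("title") if title_tag else None
        score = score_tag.get_text(strip=True) if score_tag else None
        rows.append((title, score))
    return rows

sample_rows = extract_titles_and_scores(SAMPLE_HTML)
```

Testing selectors against a saved fragment like this is a cheap way to catch selector mistakes before sending any request to the site.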

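### 3. Basic scraper code

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    # The User-Agent below is truncated; paste your browser's full string.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Cookie': 'your-cookie-here'
}

def text_of(item, selector, default=""):
    """Return the stripped text of the first match, or `default` if absent."""
    tag = item.select_one(selector)
    return tag.get_text(strip=True) if tag else default

def get_recent_movies():
    url = "https://maoyan.com/films?showType=1"
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # surface HTTP errors such as 403
        soup = BeautifulSoup(response.text, 'html.parser')

        movies = []
        for item in soup.select('.movie-list dd'):
            # The title element carries two classes, so chain them without a space.
            title_tag = item.select_one('.channel-detail.movie-item-title')
            if title_tag is None:
                continue  # skip entries that do not match the expected layout
            movies.append({
                'name': title_tag.get('title'),
                'score': text_of(item, '.score', default="N/A"),
                # Drop any label that precedes the full-width colon
                'actors': text_of(item, '.actor').split('：')[-1],
                'release_date': text_of(item, '.date'),
            })

        return pd.DataFrame(movies)

    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        return pd.DataFrame()

if __name__ == '__main__':
    df = get_recent_movies()
    print(df.head())
    df.to_csv('recent_movies.csv', index=False)
```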

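## 4. Dealing with anti-scraping measures

### 1. Common anti-scraping checks

Maoyan checks:
- Request frequency
- Whether the User-Agent looks legitimate
- Whether the Cookie is valid
- The client IP address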

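### 2. Solutions

```python
import random
import time
from fake_useragent import UserAgent  # pip install fake-useragent

class AntiScrape:
    def __init__(self):
        self.ua = UserAgent()

    def get_random_ua(self):
        # Pick a different browser User-Agent string for each request
        return self.ua.random

    def get_proxies(self):
        # Placeholder addresses: you need to maintain your own proxy pool
        return {
            'http': 'http://xxx.xxx.xxx:xxxx',
            'https': 'https://xxx.xxx.xxx:xxxx'
        }

    def random_delay(self):
        # Sleep 1 to 3 seconds so requests do not arrive at a fixed rhythm
        time.sleep(random.uniform(1, 3))
```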
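Random delays and rotating headers reduce the chance of being blocked, but individual requests can still fail. A minimal sketch of retrying with exponential backoff follows. The function `fetch_with_retry` and its parameters are not part of the original article. `fetch` is an assumed caller-supplied callable (for example, a small wrapper around `requests.get`), which keeps the retry logic independent of any particular HTTP library.

```python
import random
import time

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff plus jitter.

    `fetch` is any callable that returns a result or raises on failure.
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception as exc:  # narrow this to network errors in real use
            last_error = exc
            if attempt < max_attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay, plus up to 1 s of jitter
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise last_error
```

The jitter keeps several scraper instances from retrying in lockstep, and the growing delay backs off when the site is already rejecting requests.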

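## 5. Data storage

### 1. Storing in a SQLite database

```python
import sqlite3
from contextlib import closing

def save_to_db(df):
    # closing() releases the connection even if to_sql raises
    with closing(sqlite3.connect('movies.db')) as conn:
        # if_exists='replace' drops and recreates the table on every run
        df.to_sql('recent_movies', conn, if_exists='replace', index=False)
```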
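Because `if_exists='replace'` overwrites the table on each run, earlier scrapes are lost. If you would rather accumulate history, one option is a table keyed on name and release date, filled with `INSERT OR IGNORE`. The sketch below uses only the standard-library `sqlite3` module. The table name `movie_history`, the choice of key, and the function `save_new_rows` are assumptions made for illustration.

```python
import sqlite3

def save_new_rows(conn, movies):
    """Insert movies that are not stored yet; return how many were added.

    `movies` is an iterable of dicts with the keys used in this article:
    name, score, actors, release_date.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS movie_history (
            name TEXT NOT NULL,
            score TEXT,
            actors TEXT,
            release_date TEXT NOT NULL,
            PRIMARY KEY (name, release_date)
        )
        """
    )
    before = conn.total_changes
    # Rows that collide with the primary key are silently skipped
    conn.executemany(
        "INSERT OR IGNORE INTO movie_history (name, score, actors, release_date) "
        "VALUES (:name, :score, :actors, :release_date)",
        list(movies),
    )
    conn.commit()
    return conn.total_changes - before
```

With a DataFrame from the scraper, the rows can be passed as `df.to_dict('records')`.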

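### 2. Storing in MongoDB (optional)

```python
from pymongo import MongoClient

def save_to_mongo(df):
    records = df.to_dict('records')
    if not records:
        return  # insert_many raises on an empty list
    # The client is a context manager and closes its connections on exit
    with MongoClient('mongodb://localhost:27017/') as client:
        collection = client['movie_db']['recent_movies']
        collection.insert_many(records)
```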

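## 6. Visualization and analysis

### 1. Score distribution

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_score_distribution(df):
    # Convert into a new Series so the caller's DataFrame is not modified;
    # non-numeric scores become NaN and are left out of the histogram.
    scores = pd.to_numeric(df['score'], errors='coerce').dropna()
    plt.figure(figsize=(10, 6))
    scores.hist(bins=10)
    plt.title('Score distribution of recent movies')
    plt.xlabel('Score')
    plt.ylabel('Count')
    plt.savefig('score_dist.png')
    plt.close()  # free the figure once it has been written to disk
```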

### 2. Release date trend

```python
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Bar

def release_trend(df):
    # Unparseable dates become NaT and are dropped before counting
    dates = pd.to_datetime(df['release_date'], errors='coerce').dropna()
    daily_count = dates.groupby(dates.dt.date).size()

    bar = (
        Bar()
        .add_xaxis([str(d) for d in daily_count.index])
        .add_yaxis("Releases", daily_count.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="New releases per day"))
    )
    bar.render("release_trend.html")
```
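`release_trend` needs date strings that pandas can parse, while scraped text often carries extra characters around the date. The sketch below pulls a `YYYY-MM-DD` date out of such text using only the standard library. The example input shape (a date followed by a label such as "上映") is an assumption about what the page returns, and `parse_release_date` is a name made up for this illustration.

```python
import re
from datetime import date

_DATE_RE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")

def parse_release_date(text):
    """Return the first YYYY-MM-DD date found in `text`, or None."""
    match = _DATE_RE.search(text or "")
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits matched but do not form a real date
        return None
```

On a DataFrame this could be applied as `df['release_date'].map(parse_release_date)` before plotting.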

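## 7. Advanced features

### 1. Automatic email notification

```python
import smtplib
from email.mime.text import MIMEText

def send_email(new_movies):
    # new_movies is a DataFrame; to_string() renders it as a plain-text table
    msg = MIMEText(f"New movies today:\n{new_movies.to_string()}")
    msg['Subject'] = 'Movie picks for today'
    msg['From'] = 'your_email@example.com'
    msg['To'] = 'target_email@example.com'

    # Placeholder server and credentials: replace with your own SMTP settings
    with smtplib.SMTP('smtp.example.com', 587) as server:
        server.starttls()  # upgrade the connection to TLS before logging in
        server.login('user', 'password')
        server.send_message(msg)
```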
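`send_email` expects only the movies that are new since the last run, which means remembering what has already been seen. A minimal sketch, assuming titles are tracked in a small JSON file, is shown below. The file format and the helper names `load_seen`, `find_new_movies`, and `save_seen` are hypothetical.

```python
import json
from pathlib import Path

def load_seen(path):
    """Load the set of titles seen on earlier runs (empty if no file yet)."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text(encoding="utf-8")))

def find_new_movies(movies, seen):
    """Return the movies whose name is not in `seen`, preserving order."""
    return [m for m in movies if m["name"] not in seen]

def save_seen(path, seen):
    """Persist the seen titles as a sorted JSON list."""
    Path(path).write_text(
        json.dumps(sorted(seen), ensure_ascii=False), encoding="utf-8"
    )
```

If the scrape is held in a DataFrame, the same filter can be written as `df[~df['name'].isin(seen)]`, and the result passed to `send_email`.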

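### 2. Movie recommender (basic version)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def build_recommender(df):
    # Combine title and cast into one text field per movie
    content = df['name'].fillna('') + " " + df['actors'].fillna('')
    # English stop words do nothing for Chinese titles, and the default word
    # tokenizer does not segment Chinese, so character n-grams are used here.
    # A dedicated tokenizer such as jieba is another option.
    tfidf = TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 2))
    tfidf_matrix = tfidf.fit_transform(content)

    # TF-IDF rows are L2-normalized, so the dot product is the cosine similarity
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    return cosine_sim
```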
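`build_recommender` returns only the similarity matrix, so a lookup step is still needed to turn it into recommendations. The sketch below is one way to do that. `top_similar` is a hypothetical helper, and it assumes `names` lists the titles in the same row order as the matrix (for example `df['name'].tolist()`).

```python
def top_similar(cosine_sim, names, title, n=3):
    """Return up to n (name, score) pairs most similar to `title`.

    `cosine_sim` may be a NumPy array or a nested list of floats.
    """
    idx = names.index(title)  # raises ValueError if the title is unknown
    scored = [
        (names[j], float(cosine_sim[idx][j]))
        for j in range(len(names))
        if j != idx  # skip the movie itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

A typical call would be `top_similar(build_recommender(df), df['name'].tolist(), some_title)`, where `some_title` is a title present in the data.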

## 8. Full project structure

```text
movie_crawler/
├── crawler/            # scraper core
│   ├── base.py         # basic scraper
│   ├── anti_scrape.py  # anti-scraping handling
├── database/           # data storage
│   ├── db_handler.py
├── analysis/           # data analysis
│   ├── visualize.py
├── config.py           # configuration
├── main.py             # entry point
└── requirements.txt
```

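## 9. Notes

### 1. Legal compliance

- Follow the site's robots.txt rules
- Limit the request rate (at least 3 seconds between requests is recommended)
- Use the data for personal study only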
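Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up rule set so that it runs without network access. `EXAMPLE_RULES` and `build_robot_checker` are illustrative assumptions and say nothing about what any real site allows.

```python
from urllib.robotparser import RobotFileParser

def build_robot_checker(robots_lines, user_agent="*"):
    """Return a function that tells whether `user_agent` may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed

# Hypothetical rules for illustration; always read the site's real robots.txt.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]
is_allowed = build_robot_checker(EXAMPLE_RULES)
```

Against a live site, `RobotFileParser` can instead be pointed at the file with `set_url(...)` followed by `read()`, and the scraper can skip any URL for which the check returns `False`.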

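### 2. Exception handling

```python
import requests

try:
    # Scraping code goes here; a single request stands in for it
    response = requests.get("https://maoyan.com/films?showType=1", timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Network error: {e}")
except Exception as e:
    print(f"Unknown error: {e}")
```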

### 3. Performance optimization

- Use aiohttp for asynchronous scraping
- Use Scrapy-Redis for a distributed crawler
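Asynchronous scraping only helps if concurrency stays bounded, otherwise it collides with the rate limits recommended above. The sketch below shows the bounding pattern with the standard-library `asyncio` module. `crawl_all` is a hypothetical helper, and `fetch` is an assumed caller-supplied coroutine function, which in a real crawler would wrap an aiohttp session request.

```python
import asyncio

async def crawl_all(urls, fetch, max_concurrency=3, delay=0.0):
    """Fetch all urls with at most `max_concurrency` requests in flight.

    Results are returned in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            result = await fetch(url)
            if delay:
                # Hold the slot a little longer to space out requests
                await asyncio.sleep(delay)
            return result

    return await asyncio.gather(*(worker(u) for u in urls))
```

From synchronous code it can be started with `asyncio.run(crawl_all(urls, fetch))`. A small `max_concurrency` combined with a non-zero `delay` keeps the overall request rate modest.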

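## 10. Directions for extension

1. Merging multiple data sources (Douban + Maoyan)
2. Box office prediction models
3. Sentiment analysis based on reviews
4. Mobile-friendly presentation (a Flask/Django API)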

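With the methods described in this article you can build a monitoring system for movie news. Start with the basic scraper and add the advanced features step by step. For complete project code, refer to open-source implementations on GitHub.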

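Tip: before running the examples, replace the placeholder cookie, proxy, and other sensitive values with your own, and make sure you comply with the terms of use of the sites involved.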
