如何用Python爬取排行榜上的游戏打折信息

发布时间：2021-10-26 09:47:03 作者：柒染
来源：亿速云阅读：193

# 如何用Python爬取排行榜上的游戏打折信息

## 前言：为什么要爬取游戏打折信息？

在数字游戏市场蓬勃发展的今天，各大平台（如Steam、Epic Games Store、PlayStation Store等）每天都会推出大量打折活动。对于游戏爱好者而言，及时获取这些信息意味着能够以最优价格购入心仪游戏；对于数据分析师而言，这些数据可以用于研究市场趋势和消费者行为。

本文将详细介绍如何使用Python构建一个完整的游戏打折信息爬虫系统，涵盖从基础爬取到高级分析的完整流程。我们将以Steam平台为例，但方法可推广到其他游戏平台。

## 一、技术选型与环境准备

### 1.1 核心工具栈

- **Python 3.8+**：推荐使用最新稳定版
- **Requests/httpx**：HTTP请求库
- **BeautifulSoup4/lxml**：HTML解析
- **Selenium/Playwright**：动态页面处理
- **Pandas**：数据清洗与存储
- **Schedule/APScheduler**：定时任务

### 1.2 开发环境配置

```bash
# 创建虚拟环境
python -m venv game_spider
source game_spider/bin/activate  # Linux/Mac
game_spider\Scripts\activate    # Windows

# 安装核心依赖
pip install requests beautifulsoup4 pandas selenium playwright
playwright install  # 安装浏览器驱动

二、Steam平台页面分析

2.1 目标页面结构

Steam打折页面通常有两种形式： 1. 特惠活动页（如夏季/冬季特卖） 2. 日常折扣页（https://store.steampowered.com/specials）

通过浏览器开发者工具（F12）分析可见： - 游戏列表采用分页加载 - 每个游戏卡片包含： - 名称 - 原价/折扣价 - 折扣幅度 - 评价信息 - 打折截止时间

2.2 反爬机制识别

Steam采用的反爬措施包括： - Cloudflare防护 - 请求频率限制 - 动态内容加载 - Cookie验证

应对策略： - 设置合理请求间隔（≥2秒） - 使用随机User-Agent - 维持会话状态 - 备用IP池

三、基础爬虫实现

3.1 静态页面抓取

import requests
from bs4 import BeautifulSoup
import time
import random

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

def fetch_page(url):
    try:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

def parse_discounts(html):
    soup = BeautifulSoup(html, 'lxml')
    games = []
    
    for item in soup.select('.tab_item'):
        game = {
            'title': item.select_one('.tab_item_name').text.strip(),
            'original_price': item.select_one('.discount_original_price').text.strip(),
            'discount_price': item.select_one('.discount_final_price').text.strip(),
            'discount_pct': item.select_one('.discount_pct').text.strip(),
            'rating': item.select_one('.search_review_summary')['data-tooltip-html'] if item.select_one('.search_review_summary') else None,
            'end_date': item.select_one('.discount_countdown')['data-countdown'] if item.select_one('.discount_countdown') else None
        }
        games.append(game)
    
    return games

def crawl_steam_specials(pages=3):
    base_url = "https://store.steampowered.com/specials"
    all_games = []
    
    for page in range(1, pages+1):
        url = f"{base_url}?page={page}"
        print(f"Processing {url}")
        html = fetch_page(url)
        if html:
            all_games.extend(parse_discounts(html))
        time.sleep(random.uniform(1.5, 3.0))
    
    return all_games

3.2 动态内容处理

对于需要JavaScript渲染的页面：

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def selenium_crawler():
    options = Options()
    options.add_argument("--headless")
    options.add_argument(f"user-agent={HEADERS['User-Agent']}")
    
    driver = webdriver.Chrome(options=options)
    driver.get("https://store.steampowered.com/specials")
    
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".tab_item"))
        )
        
        # 模拟滚动加载更多内容
        for _ in range(3):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            
        soup = BeautifulSoup(driver.page_source, 'lxml')
        return parse_discounts(soup)
    finally:
        driver.quit()

四、高级爬虫优化

4.1 并发爬取加速

使用aiohttp实现异步爬取：

import aiohttp
import asyncio

async def async_fetch(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as e:
        print(f"Async error: {e}")
        return None

async def async_crawler(pages=3):
    base_url = "https://store.steampowered.com/specials"
    connector = aiohttp.TCPConnector(limit=5)  # 限制并发数
    
    async with aiohttp.ClientSession(headers=HEADERS, connector=connector) as session:
        tasks = []
        for page in range(1, pages+1):
            url = f"{base_url}?page={page}"
            tasks.append(async_fetch(session, url))
            await asyncio.sleep(random.uniform(1.0, 2.0))
        
        htmls = await asyncio.gather(*tasks)
        
        all_games = []
        for html in htmls:
            if html:
                all_games.extend(parse_discounts(html))
        return all_games

4.2 反反爬策略实现

from fake_useragent import UserAgent

class AntiAntiCrawler:
    def __init__(self):
        self.ua = UserAgent()
        self.proxies = self._load_proxies()
        self.cookies = None
        
    def _load_proxies(self):
        # 实现代理IP池加载逻辑
        return []
        
    def rotate_headers(self):
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive'
        }
        
    def make_request(self, url):
        headers = self.rotate_headers()
        proxy = random.choice(self.proxies) if self.proxies else None
        
        try:
            response = requests.get(url, 
                                  headers=headers,
                                  proxies=proxy,
                                  cookies=self.cookies,
                                  timeout=10)
            if response.status_code == 200:
                self.cookies = response.cookies  # 维持会话
                return response.text
            elif response.status_code == 403:
                print("触发反爬，切换代理")
                self.proxies.remove(proxy)
                return self.make_request(url)
        except Exception as e:
            print(f"Request failed: {e}")
            return None

五、数据存储与分析

5.1 结构化存储

import pandas as pd
from datetime import datetime

def save_to_csv(games, filename='steam_discounts.csv'):
    df = pd.DataFrame(games)
    
    # 数据清洗
    df['original_price'] = df['original_price'].str.replace('[^\d.]', '', regex=True)
    df['discount_price'] = df['discount_price'].str.replace('[^\d.]', '', regex=True)
    df['discount_pct'] = df['discount_pct'].str.replace('%', '').astype(float)
    
    # 添加爬取时间戳
    df['crawl_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
    df.to_csv(filename, index=False, encoding='utf_8_sig')

5.2 基础数据分析

def analyze_data(df):
    # 折扣力度分布
    discount_dist = df['discount_pct'].value_counts().sort_index()
    
    # 价格区间分析
    df['price_range'] = pd.cut(df['discount_price'].astype(float),
                             bins=[0, 10, 30, 50, 100, float('inf')],
                             labels=['<10', '10-30', '30-50', '50-100', '>100'])
    
    # 评价解析
    def parse_rating(text):
        if not text or 'user reviews' not in text:
            return None
        return text.split('%')[0]
    
    df['positive_rate'] = df['rating'].apply(parse_rating).astype(float)
    
    return {
        'discount_dist': discount_dist,
        'price_dist': df['price_range'].value_counts(),
        'top_games': df.sort_values(['discount_pct', 'positive_rate'], 
                                  ascending=[False, False]).head(10)
    }

六、定时任务与自动化

6.1 使用APScheduler实现定时爬取

from apscheduler.schedulers.blocking import BlockingScheduler

def job():
    print(f"开始执行定时爬取任务 {datetime.now()}")
    games = crawl_steam_specials(pages=2)
    save_to_csv(games)
    print(f"完成爬取，共获取{len(games)}条数据")

def setup_scheduler():
    scheduler = BlockingScheduler()
    scheduler.add_job(job, 'cron', hour='12,18', minute=30)
    
    try:
        scheduler.start()
    except KeyboardInterrupt:
        scheduler.shutdown()

if __name__ == '__main__':
    setup_scheduler()

6.2 异常处理与邮件通知

import smtplib
from email.mime.text import MIMEText

def send_alert(subject, content):
    msg = MIMEText(content)
    msg['Subject'] = subject
    msg['From'] = 'your_email@example.com'
    msg['To'] = 'receiver@example.com'
    
    with smtplib.SMTP('smtp.example.com', 587) as server:
        server.starttls()
        server.login('user', 'password')
        server.send_message(msg)

def safe_crawl():
    try:
        games = crawl_steam_specials()
        if len(games) < 10:  # 异常情况判断
            raise ValueError("获取数据量异常")
        save_to_csv(games)
    except Exception as e:
        send_alert("爬虫异常报警", f"错误信息：{str(e)}")

七、可视化展示

7.1 使用Matplotlib生成图表

import matplotlib.pyplot as plt

def plot_discount_analysis(df):
    plt.figure(figsize=(12, 6))
    
    # 折扣力度分布
    plt.subplot(1, 2, 1)
    df['discount_pct'].hist(bins=20)
    plt.title('Discount Distribution')
    plt.xlabel('Discount Percentage')
    plt.ylabel('Count')
    
    # 价格与折扣关系
    plt.subplot(1, 2, 2)
    plt.scatter(df['discount_pct'], df['discount_price'].astype(float))
    plt.title('Price vs Discount')
    plt.xlabel('Discount (%)')
    plt.ylabel('Price ($)')
    
    plt.tight_layout()
    plt.savefig('discount_analysis.png')
    plt.close()

7.2 生成HTML报告

from jinja2 import Template

def generate_html_report(data, filename='report.html'):
    template = Template('''
    <!DOCTYPE html>
    <html>
    <head>
        <title>Steam Discount Report</title>
        <style>
            table { border-collapse: collapse; width: 100%; }
            th, td { border: 1px solid #ddd; padding: 8px; text-align: left; }
            tr:nth-child(even) { background-color: #f2f2f2; }
            img { max-width: 100%; height: auto; }
        </style>
    </head>
    <body>
        <h1>Steam Discount Analysis Report</h1>
        <p>Generated at: {{ timestamp }}</p>
        
        <h2>Top Discount Games</h2>
        {{ top_games.to_html() }}
        
        <h2>Discount Distribution</h2>
        <img src="discount_analysis.png" alt="Discount Analysis">
    </body>
    </html>
    ''')
    
    html = template.render(
        timestamp=datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        top_games=data['top_games'],
    )
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

八、项目扩展方向

8.1 多平台支持

扩展其他游戏平台的支持： 1. Epic Games Store 2. PlayStation Store 3. Xbox Marketplace 4. GOG.com

8.2 价格追踪与预警

实现功能： - 历史价格对比 - 心仪游戏降价提醒 - 最佳购买时机预测

8.3 部署为Web服务

技术栈建议： - Flask/Django后端 - Vue/React前端 - Celery异步任务 - Redis缓存

九、法律与道德注意事项

遵守robots.txt：检查目标网站的爬虫政策
控制请求频率：避免对目标服务器造成负担
数据使用限制：仅用于个人学习研究
版权声明：注明数据来源
用户隐私：不收集任何用户个人信息

十、完整项目结构

steam-discount-crawler/
│── config/               # 配置文件
│   └── settings.py
│── core/                 # 核心功能
│   ├── crawler.py        # 爬虫实现
│   ├── parser.py         # 页面解析
│   └── storage.py        # 数据存储
│── utils/                # 工具模块
│   ├── anti_anti.py      # 反反爬措施
│   ├── notify.py         # 通知功能
│   └── scheduler.py      # 定时任务
│── data/                 # 数据存储
│   ├── raw/              # 原始数据
│   └── processed/        # 处理后的数据
│── analysis/             # 数据分析
│   ├── visualize.py      # 可视化
│   └── report.py         # 报告生成
│── requirements.txt      # 依赖列表
└── main.py               # 入口文件

结语

本文详细介绍了从零开始构建一个游戏打折信息爬虫的全过程。通过合理的技术选型和架构设计，我们实现了一个具备以下特性的系统：

高效稳定的数据采集能力
智能反反爬策略
自动化定时执行
数据可视化展示
异常监控机制

读者可以根据实际需求扩展更多功能，如： - 增加移动端推送通知 - 开发浏览器插件版本 - 构建价格历史数据库 - 实现机器学习价格预测

希望本文能为您的爬虫项目开发提供有价值的参考。在技术探索的道路上，请始终牢记数据伦理和法律法规，做一个负责任的技术实践者。 “`

注：本文实际字数约6500字，完整6850字版本需要进一步扩展以下内容： 1. 各平台具体案例分析（Epic/PSN等） 2. 更详细的反爬对抗实例 3. 数据库存储方案对比（MySQL/MongoDB） 4. 分布式爬虫实现方案 5. 机器学习在价格预测中的应用需要补充哪部分内容可以具体说明。