Published: 2021-12-01 11:15:12 Author: iii
Source: 亿速云 (Yisu Cloud)

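# How to Scrape JD.com Product Reviews with Python

## Introduction

In e-commerce data analysis, product reviews are an important source for understanding user feedback and judging a product's strengths and weaknesses. JD.com (京东) is one of China's major e-commerce platforms, so its review data is well worth analyzing. This article walks through scraping product review data from JD.com with Python, including a complete code implementation and an explanation of the key techniques.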

---

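## 1. Technical Preparation

### 1.1 Required Tools
- Python 3.7+
- Libraries (`requests` and `pandas` are third-party packages; `json` and `time` ship with the standard library):
  ```python
  import requests      # HTTP requests (third-party)
  import pandas as pd  # data storage (third-party)
  import json          # JSON parsing (standard library)
  import time          # request throttling (standard library)
  ```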

### 1.2 Analyzing the JD Review API

Inspecting the review requests in the browser's developer tools (F12) reveals the core endpoint:

https://club.jd.com/comment/productPageComments.action

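Parameters:

- `productId`: product ID
- `score`: review type (0 = all, 3 = positive, 2 = neutral, 1 = negative)
- `page`: page number (starting from 0)
- `pageSize`: reviews per page (default 10, can be set up to 100)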
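To make the parameter list concrete, here is a minimal sketch that assembles the request URL with the standard library. The helper name `build_comment_url` and its default values are illustrative assumptions, not part of any JD SDK; only the endpoint and parameter names come from the text above.

```python
from urllib.parse import urlencode

# Endpoint quoted in this article.
BASE_URL = "https://club.jd.com/comment/productPageComments.action"


def build_comment_url(product_id, page=0, score=0, page_size=10):
    """Return the request URL for one page of reviews (hypothetical helper)."""
    if page < 0:
        raise ValueError("page numbering starts at 0")
    params = {
        "productId": product_id,
        "score": score,
        "page": page,
        "pageSize": page_size,
    }
    # urlencode handles escaping and joins the pairs with "&"
    return f"{BASE_URL}?{urlencode(params)}"


print(build_comment_url("100038004793", page=2, score=3))
```

Building the URL in one place keeps the paging logic separate from the HTTP client, so the same helper could feed either `requests` or an asynchronous client.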

## 2. Complete Scraper Implementation

### 2.1 Basic Scraping Code

```python
import requests
import pandas as pd
import time

def get_jd_comments(product_id, max_pages=10):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Referer": f"https://item.jd.com/{product_id}.html"
    }
    url = "https://club.jd.com/comment/productPageComments.action"

    all_comments = []
    for page in range(max_pages):
        params = {
            "productId": product_id,
            "score": 0,
            "page": page,
            "pageSize": 100
        }

        try:
            resp = requests.get(url, headers=headers, params=params, timeout=10)
            resp.raise_for_status()
            data = resp.json()

            # Stop once a page comes back without reviews
            comments = data.get("comments") or []
            if not comments:
                print(f"No reviews on page {page + 1}; stopping")
                break

            for comment in comments:
                all_comments.append({
                    "username": comment.get("nickname"),
                    "content": comment.get("content"),
                    "score": comment.get("score"),
                    "time": comment.get("creationTime"),
                    "replyCount": comment.get("replyCount")
                })

            print(f"Fetched page {page + 1}; {len(all_comments)} reviews so far")
            time.sleep(1)  # avoid sending requests too quickly

        except Exception as e:
            print(f"Failed to fetch page {page + 1}: {e}")

    return pd.DataFrame(all_comments)

# Example: fetch iPhone 14 reviews (product ID = 100038004793)
df = get_jd_comments("100038004793", max_pages=5)
df.to_excel("jd_comments.xlsx", index=False)  # .xlsx output needs an Excel engine such as openpyxl
```
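The parsing step can be tried offline before sending any requests. The sketch below runs the same field extraction against a made-up payload: the sample values are invented, and only the field names (`comments`, `nickname`, `content`, `score`, `creationTime`, `replyCount`) come from the code above. The real response may well contain more fields or a different shape.

```python
import json

# Made-up payload for illustration; only the field names come from the article.
SAMPLE_RESPONSE = """
{
  "comments": [
    {"nickname": "user_a", "content": "Fast delivery", "score": 5,
     "creationTime": "2021-11-30 10:00:00", "replyCount": 1},
    {"nickname": "user_b", "content": "Battery drains quickly", "score": 2,
     "creationTime": "2021-11-29 08:30:00"}
  ]
}
"""

FIELDS = ("nickname", "content", "score", "creationTime", "replyCount")


def extract_comments(payload):
    """Return one dict per review; keys absent from a review become None."""
    return [
        {field: comment.get(field) for field in FIELDS}
        for comment in payload.get("comments") or []
    ]


rows = extract_comments(json.loads(SAMPLE_RESPONSE))
print(len(rows), rows[1]["replyCount"])
```

Using `.get()` throughout means a missing `comments` key or a missing field yields an empty list or `None` rather than a `KeyError`.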

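### 2.2 Anti-Scraping Countermeasures

User-Agent rotation: generate User-Agent strings dynamically with the `fake_useragent` library.

```python
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}  # a randomly chosen browser UA string
```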
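If installing an extra dependency is not an option, the same idea can be sketched with the standard library alone. The pool below is a hand-written assumption (any realistic, current UA strings would do), and `random_headers` is a hypothetical helper name.

```python
import random

# Hand-picked example strings (an assumption; refresh them periodically).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0",
]


def random_headers(product_id, rng=random):
    """Build request headers with a User-Agent drawn at random from the pool."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": f"https://item.jd.com/{product_id}.html",
    }


print(random_headers("100038004793")["User-Agent"])
```

Calling `random_headers` once per request, rather than once per run, is what actually rotates the header.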

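IP proxy pool: for large-scale scraping, routing requests through proxy IPs is recommended.

```python
import requests

# Placeholder addresses: substitute a real proxy host and port
proxies = {
    "http": "http://your_proxy:port",
    "https": "http://your_proxy:port",
}
# `url` is the review endpoint used in section 2.1
requests.get(url, proxies=proxies, timeout=10)
```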
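A pool implies rotating through several proxies rather than reusing one. A minimal round-robin sketch follows; the addresses are hypothetical placeholders from a documentation-only IP range, and `proxy_rotation` is an illustrative helper rather than a `requests` feature.

```python
from itertools import cycle

# Hypothetical proxy addresses (documentation-only IP range); replace with real ones.
PROXY_POOL = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def proxy_rotation(pool):
    """Yield requests-style proxies dicts, cycling through the pool in order."""
    for address in cycle(pool):
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXY_POOL)
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)
```

Each `next(rotation)` result can be passed as the `proxies=` argument of a request. A production pool would also need to drop proxies that stop responding, which this sketch does not attempt.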

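Random delays: avoid fixed intervals, which are more likely to trigger anti-scraping measures.

```python
import random
import time

time.sleep(random.uniform(0.5, 2))  # pause for 0.5 to 2 seconds
```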
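Random delays pair naturally with retries. As one possible extension (not something the original code does), the sketch below computes an exponential backoff delay with random jitter for failed requests. The function name and the `base` and `cap` defaults are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with each attempt, up to `cap`, and the actual
    delay is drawn uniformly between 0 and that bound.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

A retry loop would call `time.sleep(backoff_delay(attempt))` after each failure, so repeated errors slow the scraper down instead of hammering the server.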

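## 3. Data Parsing and Storage

### 3.1 Key Fields

| Field | Description |
| --- | --- |
| nickname | User nickname |
| content | Review text |
| score | Rating (1-5 stars) |
| creationTime | Review time |
| productColor | Product color |
| productSize | Product size or specification |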
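For storage without pandas, the fields in the table map directly onto CSV columns. The sketch below writes review dicts to CSV text with the standard `csv` module; the sample record is invented, and `comments_to_csv` is a hypothetical helper.

```python
import csv
import io

# Column order follows the field table above.
FIELDNAMES = ["nickname", "content", "score", "creationTime",
              "productColor", "productSize"]


def comments_to_csv(comments):
    """Serialize review dicts to CSV text, keeping only the listed fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDNAMES,
        extrasaction="ignore",  # silently drop keys not in FIELDNAMES
        restval="",             # missing keys become empty cells
    )
    writer.writeheader()
    writer.writerows(comments)
    return buffer.getvalue()


# Invented sample record: productSize is missing, replyCount is an extra key.
sample = [{
    "nickname": "user_a",
    "content": "Looks great, as described",
    "score": 5,
    "creationTime": "2021-11-30 10:00:00",
    "productColor": "Blue",
    "replyCount": 0,
}]
print(comments_to_csv(sample))
```

When writing to a real file instead of a string buffer, open it with `newline=""` and an explicit encoding such as `utf-8-sig` so that Chinese text displays correctly in spreadsheet applications.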

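### 3.2 Data Enrichment

```python
import pandas as pd
from textblob import TextBlob

# Add a sentiment field. Note: TextBlob's default analyzer is built for
# English text, so scores for Chinese reviews will not be meaningful;
# a Chinese-capable sentiment tool is needed for real JD data.
df["sentiment"] = df["content"].apply(
    lambda x: TextBlob(x).sentiment.polarity if isinstance(x, str) else 0.0
)

# Convert the time column to datetime; unparseable values become NaT
df["time"] = pd.to_datetime(df["time"], errors="coerce")
```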
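Once timestamps are parsed, simple aggregations become possible. The sketch below counts reviews per month using only the standard library. The timestamp format `%Y-%m-%d %H:%M:%S` is an assumption about what `creationTime` looks like, and the sample values are invented.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of creationTime


def reviews_per_month(time_strings):
    """Count reviews per 'YYYY-MM', skipping missing or malformed timestamps."""
    counts = Counter()
    for raw in time_strings:
        try:
            parsed = datetime.strptime(raw, TIME_FORMAT)
        except (TypeError, ValueError):
            continue  # None raises TypeError, bad text raises ValueError
        counts[parsed.strftime("%Y-%m")] += 1
    return dict(counts)


sample_times = [
    "2021-11-30 10:00:00",
    "2021-11-02 21:15:09",
    "2021-10-18 08:30:00",
    None,
    "n/a",
]
print(reviews_per_month(sample_times))
```

With pandas, the equivalent would be a group-by on the converted `time` column; the plain-Python version is shown so the logic is visible step by step.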

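## 4. Advanced Techniques

### 4.1 Asynchronous Scraping (Higher Throughput)

```python
import aiohttp
import asyncio

async def fetch_page(session, url):
    async with session.get(url) as resp:
        # content_type=None skips aiohttp's Content-Type check, in case the
        # server does not label the response as application/json
        return await resp.json(content_type=None)

async def main(product_id, pages):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    async with aiohttp.ClientSession(headers=headers) as session:
        tasks = []
        for page in range(pages):
            url = f"https://club.jd.com/comment/productPageComments.action?productId={product_id}&page={page}"
            tasks.append(fetch_page(session, url))
        # Run all page requests concurrently; results keep page order
        return await asyncio.gather(*tasks)

# Usage: results = asyncio.run(main("100038004793", 5))
```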
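Firing every page request at once works against the rate limits discussed earlier. One common way to cap concurrency is an `asyncio.Semaphore`. The sketch below demonstrates the pattern with a stand-in coroutine (`fake_fetch`, purely hypothetical) instead of real HTTP calls, so it runs offline; the limit of 3 is an arbitrary choice.

```python
import asyncio


async def fake_fetch(page):
    """Stand-in for a real HTTP call (hypothetical); returns a dummy payload."""
    await asyncio.sleep(0.01)
    return {"page": page, "comments": []}


async def fetch_all(pages, limit=3, fetch=fake_fetch):
    """Fetch pages 0..pages-1 with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous fetches observed

    async def bounded(page):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(page)
            finally:
                in_flight -= 1

    # gather returns results in the order the coroutines were passed in
    results = await asyncio.gather(*(bounded(p) for p in range(pages)))
    return results, peak


results, peak = asyncio.run(fetch_all(pages=10, limit=3))
print(len(results), peak)
```

In a real scraper, `fetch` would wrap the `session.get(...)` call, and the semaphore would keep the number of open requests at or below `limit`.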

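### 4.2 Downloading Review Images

```python
import os
import requests
from urllib.parse import urljoin

def download_images(comment_id, img_urls):
    os.makedirs(f"images/{comment_id}", exist_ok=True)
    for i, url in enumerate(img_urls):
        try:
            # urljoin adds the https scheme to protocol-relative URLs ("//host/path")
            resp = requests.get(urljoin("https:", url), timeout=10)
            resp.raise_for_status()
            with open(f"images/{comment_id}/{i}.jpg", "wb") as f:
                f.write(resp.content)
        except (requests.RequestException, OSError):
            continue  # skip images that fail to download or save
```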
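The download function saves every file as `.jpg` regardless of its real type. As a small refinement, the sketch below normalizes protocol-relative URLs and derives the file name from the URL path. Both helper names and the example URL are hypothetical.

```python
import os
from urllib.parse import urlparse


def normalize_image_url(url):
    """Prefix protocol-relative URLs ("//host/path") with https."""
    return "https:" + url if url.startswith("//") else url


def image_filename(index, url, default_ext=".jpg"):
    """Build "<index><ext>", keeping the extension found in the URL path."""
    # urlparse(...).path excludes the query string, so "?size=large" is ignored
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return f"{index}{ext or default_ext}"


url = normalize_image_url("//img.example.com/reviews/abc123.PNG?size=large")
print(url, image_filename(0, url))
```

The extension in a URL is only a hint; checking the response's `Content-Type` header would be more reliable where accuracy matters.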

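## 5. Caveats

- Legal risk: comply with JD's robots.txt and keep the request rate under control.
- Data use: for personal study only; commercial use is prohibited.
- Performance recommendations:
  - Keep to no more than 100 pages per day for a single product
  - Distributed scraping requires a more robust proxy setup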
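Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inline rule set so that it runs offline. These rules are illustrative only and are not JD's actual robots.txt; `is_allowed` is a hypothetical wrapper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; these are NOT JD's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/public/page.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data.json"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the real file instead of an inline string.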

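## Conclusion

This article covered the full workflow for collecting product reviews from JD.com, including basic scraping, anti-scraping countermeasures, and data storage. In practice, the scraper can be extended as needed, for example:

- Automatically paging through to the last page
- Extracting keywords from reviews
- User profile analysis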
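As an illustration of the first extension, automatic paging can be written as a loop that stops at the first empty page. The sketch below takes the page-fetching function as a parameter so it can be exercised with fake data. `fetch_until_empty` and `fake_fetch` are hypothetical names, and treating an empty `comments` list as the end of the data is an assumption about the API's behavior.

```python
def fetch_until_empty(fetch_page, max_pages=100):
    """Call fetch_page(0), fetch_page(1), ... until a page has no comments."""
    collected = []
    for page in range(max_pages):  # max_pages is a safety limit
        comments = fetch_page(page).get("comments") or []
        if not comments:
            break
        collected.extend(comments)
    return collected


# Stand-in for a real request: three pages of fake data, then empty pages.
FAKE_PAGES = [
    [{"content": f"review {p}-{i}"} for i in range(2)]
    for p in range(3)
]


def fake_fetch(page):
    return {"comments": FAKE_PAGES[page] if page < len(FAKE_PAGES) else []}


all_reviews = fetch_until_empty(fake_fetch)
print(len(all_reviews))
```

Passing the fetcher in as an argument keeps the stopping logic testable without network access; a real fetcher would wrap the HTTP request and JSON decoding.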

The complete code is hosted on GitHub (example repository address). Use scraping techniques sensibly and within the platform's rules.
