How to Use Python to Get Changsha Kouwei Xia Restaurant Information from Dianping

Published: 2021-11-25 14:33:52 Author: iii
Source: Yisu Cloud (亿速云)


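## Preface

Dianping (大众点评) is one of China's leading local-lifestyle information platforms and hosts a large volume of restaurant listings and user reviews. This data is valuable to food enthusiasts, market researchers, and data-mining developers. This article explains how to use a Python stack to scrape details of restaurants in Changsha that serve kouwei xia (口味虾, spicy crayfish), including shop name, rating, review count, average spend per person, and address.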

---

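## 1. Preparation

### 1.1 Technology Choices

This project uses the following Python stack:
- **Requests**: sends HTTP requests
- **BeautifulSoup**/lxml: parses HTML
- **Selenium**: handles dynamically loaded content
- **Pandas**: stores and analyzes data
- **Proxy IP service**: works around anti-scraping measures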

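### 1.2 Environment Setup

```shell
# Install the required libraries (lxml is the parser BeautifulSoup uses in the examples below)
pip install requests beautifulsoup4 lxml selenium pandas
```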

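### 1.3 Analyzing the Dianping Page

Opening Dianping's Changsha site and searching for "口味虾" shows the following:
- URL structure: https://www.dianping.com/search/keyword/344/0_口味虾
- Data loading: the first screen is rendered statically, and further results load dynamically on scroll
- Anti-scraping measures: cookie validation, request rate limiting, and IP bans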
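The observed URL structure can be assembled programmatically. The following is a minimal sketch, assuming that the path segment `344` identifies Changsha and that `0_` prefixes the search keyword, as the URL above suggests. `build_search_url` is a hypothetical helper for illustration, not part of any Dianping API.

```python
from urllib.parse import quote

BASE = "https://www.dianping.com/search/keyword"


def build_search_url(city_id: int, keyword: str) -> str:
    """Build a search URL shaped like the one observed in the browser."""
    # Percent-encode the keyword so non-ASCII text is safe inside the URL path.
    return f"{BASE}/{city_id}/0_{quote(keyword, safe='')}"


print(build_search_url(344, "口味虾"))
```

Browsers display the decoded keyword in the address bar, but the request actually sent uses the percent-encoded form, which is what this helper produces.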

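## 2. Basic Scraper Implementation

### 2.1 Fetching the Page HTML

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
    'Cookie': 'your_login_cookie'  # paste the Cookie of your logged-in session here
}

url = "https://www.dianping.com/search/keyword/344/0_口味虾"

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)  # check whether the request succeeded
```

**Note**: Dianping requires a login to show complete information. Copy the Cookie of a logged-in session from the browser's developer tools.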
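The Cookie copied from the developer tools is a single `name=value; name2=value2` string. Besides placing it in the `Cookie` header, Requests also accepts a dict through its `cookies=` parameter. A minimal sketch of that conversion follows; `parse_cookie_string` is a hypothetical helper, and the cookie names in the example are made up.

```python
def parse_cookie_string(raw: str) -> dict:
    """Split a 'name=value; name2=value2' Cookie string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # ignore empty fragments and malformed pairs
        # Split on the first '=' only, because values may contain '=' themselves.
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


cookies = parse_cookie_string("session_id=abc123; token=x=y; theme=dark")
print(cookies)
# Usage with Requests (not executed here):
# requests.get(url, headers=headers, cookies=cookies)
```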

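### 2.2 Parsing the Shop List

Parse the fetched HTML with BeautifulSoup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')

shops = soup.select('.shop-list ul li')  # CSS selector for the shop list items

for shop in shops:
    name_tag = shop.select_one('.tit a')
    score_tag = shop.select_one('.comment span')
    if name_tag is None:
        continue  # select_one returns None when nothing matches; skip such items
    name = name_tag.text.strip()
    score = score_tag.text.strip() if score_tag else ''
    print(f"Shop: {name}, rating: {score}")
```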

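## 3. Handling Dynamically Loaded Content

### 3.1 Automating with Selenium

When more results load on scroll, the browser's behavior has to be simulated:

```python
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get(url)  # url as defined in section 2.1

# Scroll the page to load more data
for _ in range(5):  # scroll 5 times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # wait for the content to load

# Grab the full page source
html = driver.page_source
driver.quit()
```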
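A fixed count of five scrolls may stop too early or wait longer than needed. One possible variant, sketched here under the assumption that the page grows taller as more results load, keeps scrolling until `document.body.scrollHeight` stops changing. `scroll_until_stable` is a hypothetical helper that relies only on the WebDriver method `execute_script`.

```python
import time


def scroll_until_stable(driver, pause: float = 2.0, max_rounds: int = 20) -> int:
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to append new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

It would be called as `scroll_until_stable(driver)` before reading `driver.page_source`. The `max_rounds` cap guards against pages that never stop growing.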

### 3.2 Extracting the Full Data

With the complete HTML from Selenium, more fields can be extracted:

```python
import pandas as pd
from bs4 import BeautifulSoup


def _text(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = node.select_one(selector)
    return tag.text.strip() if tag else None


def parse_shop(html):
    soup = BeautifulSoup(html, 'lxml')
    items = []

    for shop in soup.select('.shop-list ul li'):
        data = {
            'name': _text(shop, '.tit a'),
            'score': _text(shop, '.comment span'),
            'review_count': _text(shop, '.review-num b'),
            'price': _text(shop, '.mean-price b'),
            'address': _text(shop, '.addr'),
            'recommend': [tag.text for tag in shop.select('.recommend a')]
        }
        items.append(data)

    return pd.DataFrame(items)
```
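The fields collected by `parse_shop` are raw strings, so numeric analysis needs a conversion step. The sketch below assumes the rating looks like `4.5`, the review count is plain digits, and the price carries a currency prefix such as `￥85`. The real page text may differ, and `to_number` and `normalize` are hypothetical helpers.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_number(text: Optional[str]) -> Optional[float]:
    """Return the first number found in text, or None when there is none."""
    if not text:
        return None
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


def normalize(record: dict) -> dict:
    """Return a copy of record with the numeric fields converted from strings."""
    cleaned = dict(record)
    for field in ("score", "review_count", "price"):
        cleaned[field] = to_number(record.get(field))
    return cleaned


# Made-up sample record for illustration
sample = {"name": "Example Shop", "score": "4.5", "review_count": "1234", "price": "￥85"}
print(normalize(sample))
```

Missing or non-numeric values become `None`, which pandas treats as missing data rather than raising an error.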

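## 4. Dealing with Anti-Scraping Measures

### 4.1 Refining the Request Headers

```python
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://www.dianping.com/changsha/ch10',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    # Add 'br' only if a Brotli package is installed; otherwise Requests cannot decode the body
    'Accept-Encoding': 'gzip, deflate'
}
```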
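A common variation on fixed headers is to pick the User-Agent at random for each request. The following is a minimal sketch; the User-Agent strings are illustrative examples of typical browser values, and `build_headers` is a hypothetical helper. Whether rotation helps against Dianping's checks is not something the original verifies.

```python
import random

# Illustrative browser User-Agent strings; replace them with current ones as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0",
]


def build_headers(referer: str = "https://www.dianping.com/changsha/ch10") -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": referer,
        "Accept-Language": "zh-CN,zh;q=0.9",
    }


print(build_headers()["User-Agent"])
```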

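### 4.2 Configuring a Proxy IP

```python
proxies = {
    'http': 'http://127.0.0.1:1080',
    # For an ordinary HTTP proxy, the proxy URL itself uses http:// even for HTTPS traffic
    'https': 'http://127.0.0.1:1080'
}

response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
```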
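With more than one proxy available, requests can be spread across them in turn. This is a minimal round-robin sketch; the addresses come from the reserved documentation range `192.0.2.0/24` and are placeholders, and `make_proxy_rotator` is a hypothetical helper.

```python
from itertools import cycle


def make_proxy_rotator(pool):
    """Return a function that yields a Requests-style proxies dict, round-robin."""
    rotation = cycle(pool)

    def next_proxies() -> dict:
        proxy = next(rotation)
        return {"http": proxy, "https": proxy}

    return next_proxies


# Placeholder proxy addresses; substitute the ones from your proxy service.
next_proxies = make_proxy_rotator([
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
])
print(next_proxies())
# Usage with Requests (not executed here):
# requests.get(url, headers=headers, proxies=next_proxies(), timeout=10)
```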

### 4.3 Throttling Requests

```python
import random
import time

import requests


def safe_request(url):
    time.sleep(random.uniform(1, 3))  # random delay of 1 to 3 seconds
    return requests.get(url, headers=headers, timeout=10)
```
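Random delays lower the request rate but do not handle requests that fail outright. One way to add that, sketched here as an assumption rather than something the original covers, is to retry with exponential backoff. `fetch_with_retry` is a hypothetical helper that accepts any callable, so it can wrap `safe_request`.

```python
import random
import time


def fetch_with_retry(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying failed attempts with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait 1s, 2s, 4s, ... plus up to 0.5s of random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            sleep(delay)
```

A call would look like `fetch_with_retry(safe_request, url)`. Requests raises on connection errors by itself, but an HTTP error status only raises if the wrapped function calls `response.raise_for_status()`.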

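## 5. Storing and Analyzing the Data

### 5.1 Saving to CSV

```python
import pandas as pd

df = parse_shop(html)
# utf_8_sig writes a BOM so that Excel recognizes the file as UTF-8
df.to_csv('changsha_shops.csv', index=False, encoding='utf_8_sig')
```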
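The `recommend` column holds Python lists, which pandas writes to CSV as their string representation (for example `['a', 'b']`). One option is to join each list into a delimited string before saving. The sketch below uses the standard-library `csv` module instead of pandas to stay dependency-free; the `|` delimiter, the `save_shops` helper, and the sample row are all assumptions for illustration.

```python
import csv
import os
import tempfile


def save_shops(rows, path):
    """Write shop dicts to a UTF-8 (with BOM) CSV, joining list values with '|'."""
    if not rows:
        return
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            flat = {k: "|".join(v) if isinstance(v, list) else v for k, v in row.items()}
            writer.writerow(flat)


# Made-up sample row for illustration
rows = [{"name": "Example Shop", "score": "4.5", "recommend": ["口味虾", "dish B"]}]
path = os.path.join(tempfile.gettempdir(), "changsha_shops_demo.csv")
save_shops(rows, path)
```

With pandas, the same effect can be had by converting the column to joined strings before calling `to_csv`.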

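### 5.2 Data Analysis Example

```python
import pandas as pd

# Load the data
df = pd.read_csv('changsha_shops.csv')

# Rating distribution
print(df['score'].value_counts())

# Price analysis: cast to str first, because read_csv may already have parsed the column as numbers
df['price'] = df['price'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)
print(df['price'].describe())
```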

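## 6. Complete Code Example

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random


class DianPingSpider:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0...',
            'Cookie': 'your_cookie_here'
        }
        self.base_url = "https://www.dianping.com/search/keyword/344/0_口味虾"

    def get_html(self, url):
        time.sleep(random.uniform(1, 3))  # random delay between requests
        response = requests.get(url, headers=self.headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None

    @staticmethod
    def _text(node, selector):
        # select_one returns None when nothing matches
        tag = node.select_one(selector)
        return tag.text.strip() if tag else None

    def parse_page(self, html):
        # Same selectors as in section 3.2; adjust them to the live page structure
        soup = BeautifulSoup(html, 'lxml')
        rows = [{
            'name': self._text(shop, '.tit a'),
            'score': self._text(shop, '.comment span'),
            'review_count': self._text(shop, '.review-num b'),
            'price': self._text(shop, '.mean-price b'),
            'address': self._text(shop, '.addr'),
        } for shop in soup.select('.shop-list ul li')]
        return pd.DataFrame(rows)

    def run(self):
        html = self.get_html(self.base_url)
        if html:
            df = self.parse_page(html)
            df.to_csv('result.csv', index=False, encoding='utf_8_sig')


if __name__ == '__main__':
    spider = DianPingSpider()
    spider.run()
```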

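## 7. Legal and Ethical Considerations

- **Respect robots.txt**: check Dianping's crawler policy
- **Limit crawl frequency**: avoid putting load on the server
- **Scope of data use**: personal study and research only
- **Protect user privacy**: do not scrape users' personal information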
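The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses rules supplied as lines so that it runs without network access. The sample rules are made up and are not Dianping's actual robots.txt, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration; NOT Dianping's real robots.txt.
sample_rules = ["User-agent: *", "Disallow: /search/"]
print(is_allowed(sample_rules, "my-crawler", "https://example.com/search/keyword"))
print(is_allowed(sample_rules, "my-crawler", "https://example.com/shop/1"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of parsing supplied lines.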

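## Conclusion

With the approach described in this article, you can collect detailed data on kouwei xia restaurants in Changsha. Keep the following in mind:
- Dianping's anti-scraping strategy changes over time
- For distributed scraping, a proxy IP pool is recommended
- Commercial use requires authorization from the platform

Possible extensions:
- Add a module for scraping shop reviews
- Automate periodic data updates
- Combine the data with a map API for geographic analysis

The complete project code has been uploaded to GitHub (example address); feedback and corrections are welcome.

(Note: before running the code, replace the sample Cookie, proxy IP, and other sensitive values. The code may also need adjusting to match Dianping's actual page structure.)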