How to Scrape Sohu Securities Stock Data with Python

Published: 2021-11-25 09:10:28 · Author: iii
Source: 亿速云

# How to Scrape Sohu Securities Stock Data with Python

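## Table of Contents
1. [Preface](#preface)
2. [Preparation](#preparation)
   - [Environment Setup](#environment-setup)
   - [Tool Selection](#tool-selection)
3. [Page Structure Analysis](#page-structure-analysis)
   - [Identifying the Target Page](#identifying-the-target-page)
   - [Locating the Data](#locating-the-data)
4. [Basic Scraper Implementation](#basic-scraper-implementation)
   - [Using the requests Library](#using-the-requests-library)
   - [Parsing with BeautifulSoup](#parsing-with-beautifulsoup)
5. [Handling Dynamic Content](#handling-dynamic-content)
   - [Selenium Automation](#selenium-automation)
   - [API Endpoint Analysis](#api-endpoint-analysis)
6. [Data Storage Options](#data-storage-options)
   - [CSV File Storage](#csv-file-storage)
   - [Database Storage](#database-storage)
7. [Dealing with Anti-Scraping Measures](#dealing-with-anti-scraping-measures)
   - [Request Headers](#request-headers)
   - [IP Proxy Pool](#ip-proxy-pool)
8. [Complete Code Example](#complete-code-example)
9. [Legal and Ethical Considerations](#legal-and-ethical-considerations)
10. [Summary](#summary)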

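## Preface
In financial data analysis, access to up-to-date stock data underpins quantitative trading and investment research. Sohu Securities (q.stock.sohu.com), a major Chinese financial portal, publishes a wide range of stock market data. This article walks through a complete technical approach to scraping that data with Python.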

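## Preparation

### Environment Setup
```bash
# Python 3.8+ is recommended
# Install the required libraries (pymysql is only needed for the MySQL example)
pip install requests beautifulsoup4 selenium pandas pymysql
```

### Tool Selection

- Static pages: requests + BeautifulSoup
- Dynamic rendering: Selenium/Playwright
- Data storage: CSV/MySQL/MongoDB
- Task scheduling: APScheduler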

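## Page Structure Analysis

### Identifying the Target Page

Taking Kweichow Moutai (600519) as an example, the target page is:

`http://q.stock.sohu.com/cn/600519/lshq.shtml`

### Locating the Data

1. Open Chrome DevTools (F12).
2. Inspect the XHR requests in the Network panel.
3. The core data usually sits in `<table class="table_bg001 border_box limit_sale">`.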

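## Basic Scraper Implementation

### Using the requests Library

```python
import requests
from bs4 import BeautifulSoup

url = "http://q.stock.sohu.com/cn/600519/lshq.shtml"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

# A timeout keeps the script from hanging on a stalled connection
response = requests.get(url, headers=headers, timeout=10)
# Let requests guess the charset from the body in case the page is not UTF-8
response.encoding = response.apparent_encoding
print(response.status_code)  # 200 means success
```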

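### Parsing with BeautifulSoup

```python
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'class': 'table_bg001'})

if table is None:
    # The table may be rendered by JavaScript, or the page layout may have changed
    print("Data table not found")
else:
    for row in table.find_all('tr')[1:]:  # skip the header row
        columns = row.find_all('td')
        if len(columns) < 2:
            continue  # skip rows without data cells
        date = columns[0].text.strip()
        open_price = columns[1].text.strip()
        print(f"Date: {date}, open: {open_price}")
```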
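The cell text extracted this way is still a string, often with thousands separators, percent signs, or placeholder dashes. The following is a minimal sketch of converting such cells to numbers before analysis. The helper name `to_float` and the sample row are hypothetical, and the real page may format values differently.

```python
from typing import Optional


def to_float(cell: str) -> Optional[float]:
    """Convert a scraped cell such as '1,760.00' or '-1.25%' to a float.

    Returns None for empty cells or placeholders like '--'.
    """
    text = cell.strip().replace(",", "").rstrip("%")
    if not text or text in {"-", "--"}:
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Hypothetical row, shaped like the text of the <td> cells in one table row
raw_row = ["2023-01-04", " 1,760.00 ", "1789.00", "--", "1.65%"]
cleaned = [raw_row[0]] + [to_float(c) for c in raw_row[1:]]
print(cleaned)
```

Returning `None` instead of raising keeps one malformed cell from aborting a whole scraping run; whether that is the right trade-off depends on how strict the downstream analysis needs to be.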

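## Handling Dynamic Content

### Selenium Automation

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "http://q.stock.sohu.com/cn/600519/lshq.shtml"

chrome_options = Options()
chrome_options.add_argument("--headless")  # run without a visible browser window

driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    html = driver.page_source  # HTML of the page as the browser currently holds it
finally:
    driver.quit()  # always release the browser process
# Parse `html` with BeautifulSoup using the same logic as for static pages
```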

### API Endpoint Analysis

Packet capture reveals the endpoint that actually serves the data:

`http://q.stock.sohu.com/hisHq?code=cn_600519&start=20230101&end=20231231`
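Rather than assembling that query string by hand, it can be generated from a stock code and two dates. The following is a small sketch that mirrors the parameter names visible in the URL above (`code`, `start`, `end`); the function name `build_history_url` is hypothetical, and no other parameters of the endpoint are assumed.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://q.stock.sohu.com/hisHq"  # endpoint quoted in the article


def build_history_url(stock_code: str, start: date, end: date) -> str:
    """Build a query URL in the same shape as the captured example."""
    if start > end:
        raise ValueError("start must not be later than end")
    params = {
        "code": f"cn_{stock_code}",          # "cn_" prefix as in the example URL
        "start": start.strftime("%Y%m%d"),   # dates are formatted as YYYYMMDD
        "end": end.strftime("%Y%m%d"),
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_history_url("600519", date(2023, 1, 1), date(2023, 12, 31)))
# prints http://q.stock.sohu.com/hisHq?code=cn_600519&start=20230101&end=20231231
```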

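Example of the returned JSON. The complete example in this article indexes the response as `[0]['hq']`, so the payload is shown here as a list containing one object:

```json
[
    {
        "status": 0,
        "hq": [
            ["2023-01-04", "1760.00", "1789.00", ...],
            ...
        ]
    }
]
```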
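To make rows like these easier to work with, each one can be mapped to named fields. The sketch below does that with the standard library only. The English field names are an assumption that follows the column order used in this article's complete example, the check that `status == 0` means success is likewise assumed from the sample, and the payload is made up, so verify all three against a live response.

```python
import json

# Assumed field order (date, open, close, change, percent change, low, high,
# volume, amount, turnover rate); confirm against a real response before use.
FIELDS = ["date", "open", "close", "change", "pct_change",
          "low", "high", "volume", "amount", "turnover_rate"]


def parse_history(payload: str) -> list:
    """Turn a hisHq-style JSON string into a list of dicts keyed by FIELDS."""
    doc = json.loads(payload)
    # Accept either a bare object or a one-element list wrapping it
    if isinstance(doc, list):
        if not doc:
            return []
        doc = doc[0]
    if doc.get("status") != 0:
        raise ValueError(f"unexpected status: {doc.get('status')!r}")
    return [dict(zip(FIELDS, row)) for row in doc.get("hq", [])]


# Made-up payload for illustration; the numbers are not real quotes
sample = json.dumps([{
    "status": 0,
    "hq": [["2023-01-04", "10.00", "10.50", "0.50", "5.00%",
            "9.90", "10.60", "12345", "1296.23", "0.10%"]],
}])
rows = parse_history(sample)
print(rows[0]["date"], rows[0]["close"])
```

Note that the values stay as strings here; converting them to numbers is a separate step.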

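## Data Storage Options

### CSV File Storage

```python
import csv

# `data` is the list of rows collected earlier, each one [date, open, high, low, close]
with open('stock_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'open', 'high', 'low', 'close'])
    for item in data:
        writer.writerow(item)
```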

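### Database Storage

MySQL example:

```python
import pymysql

# Placeholder credentials; use your own and keep real passwords out of source code
conn = pymysql.connect(host='localhost', user='root', password='123456', database='stock')
try:
    with conn.cursor() as cursor:
        sql = """CREATE TABLE IF NOT EXISTS sohu_stock (
            id INT AUTO_INCREMENT PRIMARY KEY,
            stock_code VARCHAR(10),
            date DATE,
            open_price DECIMAL(10,2),
            high_price DECIMAL(10,2)
        )"""
        cursor.execute(sql)
    conn.commit()
finally:
    conn.close()  # release the connection even if the statement fails
```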
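The example above only creates the table. As a sketch of the insert step, the block below uses Python's built-in `sqlite3` with an in-memory database in place of MySQL, purely so that it runs without a server. The composite primary key and the `save_rows` helper are assumptions for illustration, not part of the original schema. With pymysql the placeholders would be `%s` instead of `?`, and MySQL offers `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` for the same effect.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here so the sketch runs without a server
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS sohu_stock (
    stock_code TEXT NOT NULL,
    date       TEXT NOT NULL,
    open_price REAL,
    high_price REAL,
    PRIMARY KEY (stock_code, date)   -- one row per stock per day
)""")


def save_rows(conn, stock_code, rows):
    """Insert (date, open, high) tuples; re-running replaces existing days."""
    conn.executemany(
        "INSERT OR REPLACE INTO sohu_stock VALUES (?, ?, ?, ?)",
        [(stock_code, d, o, h) for d, o, h in rows],
    )
    conn.commit()


# Made-up values for illustration only
save_rows(conn, "600519", [("2023-01-04", 10.0, 10.6), ("2023-01-05", 10.5, 10.9)])
save_rows(conn, "600519", [("2023-01-05", 10.5, 11.0)])  # same day scraped again
count = conn.execute("SELECT COUNT(*) FROM sohu_stock").fetchone()[0]
print(count)  # prints 2: the repeated day was replaced, not duplicated
```

Keying on stock code and date makes repeated scraping runs idempotent, which matters once the job is scheduled to run regularly.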

## Dealing with Anti-Scraping Measures

### Request Headers

Key header fields:

```python
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://q.stock.sohu.com/",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "X-Requested-With": "XMLHttpRequest"
}
```

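### IP Proxy Pool

```python
# Placeholder address; substitute a proxy you are authorized to use
proxies = {
    "http": "http://12.34.56.78:8888",
    "https": "http://12.34.56.78:8888"
}

response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
```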
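A single hard-coded proxy is not yet a pool. The following is a minimal sketch of round-robin rotation over several proxies, with a way to drop one that stops working. The `ProxyPool` class and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical; a production pool would also need health checks and a source of fresh proxies.

```python
import itertools


class ProxyPool:
    """Hand out proxies in round-robin order and drop ones reported as bad."""

    def __init__(self, addresses):
        self._addresses = list(addresses)
        self._cycle = itertools.cycle(self._addresses)

    def next_proxies(self):
        """Return a dict in the shape expected by requests' `proxies` argument."""
        if not self._addresses:
            raise RuntimeError("proxy pool is empty")
        address = next(self._cycle)
        return {"http": address, "https": address}

    def discard(self, address):
        """Remove a failing proxy and restart the rotation without it."""
        if address in self._addresses:
            self._addresses.remove(address)
            self._cycle = itertools.cycle(self._addresses)


# Hypothetical addresses from the documentation range 192.0.2.0/24
pool = ProxyPool(["http://192.0.2.10:8888", "http://192.0.2.11:8888"])
first = pool.next_proxies()
second = pool.next_proxies()
third = pool.next_proxies()  # wraps around to the first address
print(first["http"], second["http"], third["http"])
```

Each returned dict can be passed directly as `proxies=pool.next_proxies()` in a `requests.get` call.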

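## Complete Code Example

```python
import requests
import pandas as pd
from datetime import datetime

def get_sohu_stock(stock_code, start_date, end_date):
    """Fetch daily history for one stock; return a DataFrame, or None on failure."""
    base_url = "http://q.stock.sohu.com/hisHq"
    params = {
        "code": f"cn_{stock_code}",
        "start": start_date.strftime("%Y%m%d"),
        "end": end_date.strftime("%Y%m%d")
    }
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    try:
        response = requests.get(base_url, params=params, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP error codes as failures
        # The response is a one-element list; the rows live under its "hq" key
        data = response.json()[0]['hq']

        df = pd.DataFrame(data, columns=[
            'date', 'open', 'close', 'change',
            'pct_change', 'low', 'high', 'volume',
            'amount', 'turnover_rate'
        ])
        return df
    except Exception as e:
        print(f"Failed to fetch data: {e}")
        return None

# Usage example
df = get_sohu_stock("600519", datetime(2023, 1, 1), datetime(2023, 12, 31))
if df is not None:  # the function returns None when the request or parsing failed
    df.to_csv("maotai_2023.csv", index=False)
```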

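## Legal and Ethical Considerations

- Follow the site's robots.txt (Sohu Securities does not explicitly prohibit crawling).
- Throttle your requests (an interval of at least 5 seconds between requests is recommended).
- Use the data for personal study only; commercial use is prohibited.
- Copyright in the data belongs to Sohu.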
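The request-interval recommendation above can be enforced in code rather than left to discipline. The sketch below spaces calls at least `min_interval` seconds apart and retries failures a limited number of times. The `fetch_politely` helper and its parameters are hypothetical, and the fetcher is passed in as a function so that the sketch runs offline; in real use it would wrap a `requests.get` call.

```python
import time


def fetch_politely(fetch, urls, min_interval=5.0, retries=2,
                   sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, spacing calls at least min_interval apart.

    A failed call is retried up to `retries` times, with the same spacing.
    Returns {url: result}, with None for URLs that never succeeded.
    """
    results = {}
    last_call = None
    for url in urls:
        results[url] = None
        for _ in range(retries + 1):
            if last_call is not None:
                wait = min_interval - (clock() - last_call)
                if wait > 0:
                    sleep(wait)
            last_call = clock()
            try:
                results[url] = fetch(url)
                break
            except Exception as exc:  # narrow this to network errors in real use
                print(f"request failed for {url}: {exc}")
    return results


# Stand-in fetcher so the sketch runs offline; swap in a requests-based one
def fake_fetch(url):
    return f"payload for {url}"


# A tiny interval keeps the demo fast; the article recommends 5 seconds or more
out = fetch_politely(fake_fetch, ["a", "b"], min_interval=0.01)
print(out)
```

Accepting `sleep` and `clock` as parameters is a design choice that makes the throttling logic testable without actually waiting.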

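## Summary

This article covered several ways to obtain stock data from Sohu Securities. The key points are:

1. Prefer the public API endpoint.
2. Fall back to Selenium for dynamically rendered pages.
3. Anti-scraping measures must be handled.
4. Choose a storage format with later analysis in mind.

Note: securities market data is time-sensitive, so set up a regular scraping schedule, and make sure to handle network errors and validate the data.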
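As one way to approach the data-validation point, the sketch below runs a few sanity checks on a single daily record before it is stored. The `validate_row` helper, the English field names, and the specific rules are assumptions chosen for illustration; real data may call for looser or stricter checks.

```python
from datetime import datetime


def validate_row(row: dict) -> list:
    """Return a list of problems found in one daily record (empty if it looks sane)."""
    problems = []
    try:
        datetime.strptime(row["date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        problems.append("bad or missing date")
    try:
        o, c, lo, hi = (float(row[k]) for k in ("open", "close", "low", "high"))
    except (KeyError, TypeError, ValueError):
        problems.append("non-numeric or missing price")
        return problems
    if lo > hi:
        problems.append("low is above high")
    if not (lo <= o <= hi and lo <= c <= hi):
        problems.append("open/close outside the low-high range")
    if min(o, c, lo, hi) <= 0:
        problems.append("non-positive price")
    return problems


# Made-up records for illustration only
good = {"date": "2023-01-04", "open": "10.00", "close": "10.50", "low": "9.90", "high": "10.60"}
bad = {"date": "2023-13-40", "open": "10.00", "close": "11.00", "low": "9.90", "high": "10.60"}
print(validate_row(good))
print(validate_row(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to log everything that is wrong with a record in a single pass.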
