怎么用Python采集北京二手房数据

发布时间：2021-11-22 11:33:03 作者：iii
来源：亿速云阅读：187

# 怎么用Python采集北京二手房数据

在当今大数据时代，房产数据对于投资者、购房者和研究人员都具有重要价值。本文将详细介绍如何使用Python技术栈采集北京二手房数据，涵盖从环境准备到数据存储的完整流程。

## 一、准备工作

### 1.1 技术选型

我们主要使用以下Python库：
- **requests**：发送HTTP请求
- **BeautifulSoup**/lxml：HTML解析
- **pandas**：数据处理
- **selenium**：处理动态加载内容
- **MongoDB**/MySQL：数据存储

```python
# 安装必要库
pip install requests beautifulsoup4 pandas selenium pymongo mysql-connector-python

1.2 目标网站分析

以链家网为例（https://bj.lianjia.com/ershoufang/），我们需要： 1. 分析URL结构 2. 查看页面加载方式（静态/动态） 3. 检查反爬机制（验证码、请求频率限制等）

使用浏览器开发者工具（F12）查看： - 网络请求 - 数据返回格式（HTML/JSON） - 关键数据所在标签

二、基础爬虫实现

2.1 获取单页数据

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_page(page_num):
    url = f"https://bj.lianjia.com/ershoufang/pg{page_num}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    
    house_list = []
    for item in soup.select('.sellListContent li'):
        try:
            title = item.select_one('.title a').text
            price = item.select_one('.totalPrice').text
            unit_price = item.select_one('.unitPrice').text
            house_info = item.select('.houseInfo')[0].text.split('|')
            
            house_list.append({
                'title': title,
                'price': float(price.replace('万', '')),
                'unit_price': float(unit_price.replace('元/平', '').replace(',', '')),
                'district': house_info[0].strip(),
                'area': float(house_info[1].replace('平米', '').strip()),
                'layout': house_info[2].strip()
            })
        except Exception as e:
            print(f"解析错误: {e}")
            continue
    
    return pd.DataFrame(house_list)

# 测试单页采集
df = get_page(1)
print(df.head())

2.2 处理分页数据

def get_multiple_pages(start_page, end_page):
    all_data = []
    for page in range(start_page, end_page+1):
        print(f"正在采集第{page}页...")
        try:
            df = get_page(page)
            all_data.append(df)
            time.sleep(random.uniform(1, 3))  # 随机延迟
        except Exception as e:
            print(f"第{page}页采集失败: {e}")
    
    return pd.concat(all_data, ignore_index=True)

# 采集前10页数据
data = get_multiple_pages(1, 10)

三、高级爬取技巧

3.1 处理动态加载内容

当遇到JavaScript渲染的页面时，需要使用selenium：

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def get_dynamic_page(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # 无头模式
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)
    
    # 等待元素加载
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "sellListContent"))
        soup = BeautifulSoup(driver.page_source, 'lxml')
        # 后续解析逻辑...
    finally:
        driver.quit()

3.2 绕过反爬机制

常见应对策略： 1. User-Agent轮换

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
]
headers = {'User-Agent': random.choice(user_agents)}

IP代理池

proxies = {
    'http': 'http://127.0.0.1:1080',
    'https': 'https://127.0.0.1:1080'
}
requests.get(url, proxies=proxies)

请求间隔控制

import time
time.sleep(random.uniform(0.5, 2))

四、数据存储方案

4.1 MongoDB存储

from pymongo import MongoClient

def save_to_mongodb(data):
    client = MongoClient('mongodb://localhost:27017/')
    db = client['real_estate']
    collection = db['beijing_ershou']
    
    # 转换为字典格式
    records = data.to_dict('records')
    collection.insert_many(records)

4.2 MySQL存储

import mysql.connector

def save_to_mysql(data):
    conn = mysql.connector.connect(
        host="localhost",
        user="root",
        password="password",
        database="real_estate"
    )
    cursor = conn.cursor()
    
    create_table = """
    CREATE TABLE IF NOT EXISTS beijing_ershou (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        price FLOAT,
        unit_price FLOAT,
        district VARCHAR(50),
        area FLOAT,
        layout VARCHAR(50),
        crawl_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
    cursor.execute(create_table)
    
    insert_sql = """
    INSERT INTO beijing_ershou 
    (title, price, unit_price, district, area, layout)
    VALUES (%s, %s, %s, %s, %s, %s)
    """
    
    for _, row in data.iterrows():
        cursor.execute(insert_sql, (
            row['title'], row['price'], row['unit_price'],
            row['district'], row['area'], row['layout']
        ))
    
    conn.commit()
    conn.close()

五、数据清洗与分析

5.1 基础清洗

def clean_data(df):
    # 处理缺失值
    df = df.dropna()
    
    # 去重
    df = df.drop_duplicates(subset=['title'])
    
    # 数据类型转换
    df['price'] = df['price'].astype(float)
    df['unit_price'] = df['unit_price'].astype(float)
    
    return df

cleaned_data = clean_data(data)

5.2 简单分析

# 各区域平均单价
district_avg = cleaned_data.groupby('district')['unit_price'].mean().sort_values(ascending=False)

# 价格分布
price_bins = [0, 200, 300, 400, 500, 600, 1000, float('inf')]
price_labels = ['0-200', '200-300', '300-400', '400-500', '500-600', '600-1000', '1000+']
cleaned_data['price_range'] = pd.cut(cleaned_data['price'], bins=price_bins, labels=price_labels)

六、完整项目架构

beijing-housing-spider/
├── config/               # 配置文件
│   ├── db_config.py      # 数据库配置
│   └── user_agents.py    # User-Agent列表
├── spiders/              # 爬虫核心
│   ├── base_spider.py    # 基础爬虫类
│   ├── lianjia.py        # 链家爬虫
│   └── beike.py          # 贝壳爬虫
├── utils/                # 工具类
│   ├── proxy.py          # 代理工具
│   └── logger.py         # 日志工具
├── storage/              # 存储模块
│   ├── mongodb.py        
│   └── mysql.py
└── main.py               # 主程序入口

七、法律与道德注意事项

遵守网站的robots.txt协议
控制请求频率（建议≥2秒/请求）
仅用于个人学习研究，不进行商业用途
避免采集个人隐私信息
考虑使用官方API（如有）

八、扩展方向

增加数据维度：采集历史价格、周边设施等
可视化分析：使用pyecharts或matplotlib
价格预测模型：基于历史数据建立回归模型
自动化监控：设置定时任务跟踪价格变化

# 示例：使用APScheduler设置定时任务
from apscheduler.schedulers.blocking import BlockingScheduler

def daily_job():
    data = get_multiple_pages(1, 5)
    save_to_mongodb(data)

scheduler = BlockingScheduler()
scheduler.add_job(daily_job, 'cron', hour=2)  # 每天凌晨2点执行
scheduler.start()

通过本文介绍的方法，您可以构建一个完整的北京二手房数据采集系统。建议在实际应用中逐步完善异常处理、日志记录等功能，使爬虫更加健壮可靠。 “`

（注：实际字数约2400字，此处为精简展示版本。完整版可扩展各章节细节、增加错误处理案例和可视化示例等内容）