# How to Collect Beijing Second-Hand Housing Data with Python
In the era of big data, housing data is valuable to investors, home buyers, and researchers alike. This article walks through how to collect Beijing second-hand housing data with the Python stack, covering the full workflow from environment setup to data storage.
## 1. Preparation

### 1.1 Choosing the Stack

We will mainly use the following Python libraries:

- **requests**: sending HTTP requests
- **BeautifulSoup**/lxml: HTML parsing
- **pandas**: data processing
- **selenium**: handling dynamically loaded content
- **MongoDB**/MySQL: data storage
```bash
# Install the required libraries
pip install requests beautifulsoup4 pandas selenium pymongo mysql-connector-python
```
### 1.2 Analyzing the Target Site

Taking Lianjia (https://bj.lianjia.com/ershoufang/) as the example, we need to:

1. Analyze the URL structure
2. Check how each page is loaded (static vs. dynamic), e.g. with the quick check sketched below
3. Check for anti-scraping measures (CAPTCHAs, request-rate limits, etc.)
Use the browser developer tools (F12) to inspect:

- the network requests
- the response format (HTML/JSON)
- which tags hold the key data
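Before committing to requests or selenium, it can help to confirm whether the listing markup is already present in the raw HTML. The sketch below is a minimal check, assuming the `sellListContent` container class used by the selectors later in this article; if the class is missing from the response, the page is probably rendered by JavaScript.

```python
import requests

def is_static_page(url, marker="sellListContent"):
    """Return True if the listing container already appears in the raw HTML,
    i.e. the page can be scraped with requests alone."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    resp = requests.get(url, headers=headers, timeout=10)
    return marker in resp.text

print(is_static_page("https://bj.lianjia.com/ershoufang/pg1/"))
```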
## 2. Basic Scraper

### 2.1 Scraping a Single Page

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_page(page_num):
    url = f"https://bj.lianjia.com/ershoufang/pg{page_num}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')

    house_list = []
    for item in soup.select('.sellListContent li'):
        try:
            title = item.select_one('.title a').text
            price = item.select_one('.totalPrice').text
            unit_price = item.select_one('.unitPrice').text
            house_info = item.select('.houseInfo')[0].text.split('|')

            house_list.append({
                'title': title,
                'price': float(price.replace('万', '')),  # total price in 10k yuan
                'unit_price': float(unit_price.replace('元/平', '').replace(',', '')),
                'district': house_info[0].strip(),
                'area': float(house_info[1].replace('平米', '').strip()),
                'layout': house_info[2].strip()
            })
        except Exception as e:
            print(f"Parse error: {e}")
            continue

    return pd.DataFrame(house_list)

# Test single-page scraping
df = get_page(1)
print(df.head())
```
### 2.2 Scraping Multiple Pages

```python
import random
import time

def get_multiple_pages(start_page, end_page):
    all_data = []
    for page in range(start_page, end_page + 1):
        print(f"Crawling page {page}...")
        try:
            df = get_page(page)
            all_data.append(df)
            time.sleep(random.uniform(1, 3))  # random delay between requests
        except Exception as e:
            print(f"Failed to crawl page {page}: {e}")
    return pd.concat(all_data, ignore_index=True)

# Crawl the first 10 pages
data = get_multiple_pages(1, 10)
```
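Individual pages can fail transiently (timeouts, throttling), in which case the loop above simply skips them. A small retry helper, sketched here as an optional addition with arbitrary retry and backoff values, keeps an occasional hiccup from losing a whole page:

```python
import random
import time

def get_page_with_retry(page_num, max_retries=3):
    """Call get_page(), retrying with a growing random delay before giving up."""
    for attempt in range(1, max_retries + 1):
        try:
            return get_page(page_num)
        except Exception as e:
            print(f"Page {page_num}, attempt {attempt} failed: {e}")
            if attempt == max_retries:
                raise
            time.sleep(random.uniform(2, 5) * attempt)  # simple backoff
```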
## 3. Handling Dynamically Loaded Pages

When a page is rendered by JavaScript, selenium is needed:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_dynamic_page(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # headless mode
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)

    try:
        # Wait until the listing container has been rendered
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "sellListContent"))
        )
        soup = BeautifulSoup(driver.page_source, 'lxml')
        # Further parsing logic...
        return soup
    finally:
        driver.quit()
```
## 4. Dealing with Anti-Scraping Measures

Common countermeasures:

**1. Rotate the User-Agent**

```python
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
]
headers = {'User-Agent': random.choice(user_agents)}
```

**2. Use proxy IPs**

```python
proxies = {
    'http': 'http://127.0.0.1:1080',
    'https': 'https://127.0.0.1:1080'
}
requests.get(url, proxies=proxies)
```

**3. Throttle the request rate**

```python
import random
import time

time.sleep(random.uniform(0.5, 2))
```
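The three tactics above can be combined into a single fetch helper. The sketch below is illustrative rather than part of the original code; it assumes you maintain your own User-Agent list and, optionally, a working proxy (the 127.0.0.1:1080 address shown earlier is only a placeholder).

```python
import random
import time
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def polite_get(url, proxies=None):
    """GET a URL with a random User-Agent, an optional proxy, and a random delay."""
    time.sleep(random.uniform(0.5, 2))  # throttle before every request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)
```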
## 5. Data Storage

### 5.1 Saving to MongoDB

```python
from pymongo import MongoClient

def save_to_mongodb(data):
    client = MongoClient('mongodb://localhost:27017/')
    db = client['real_estate']
    collection = db['beijing_ershou']

    # Convert the DataFrame to a list of dicts
    records = data.to_dict('records')
    collection.insert_many(records)
```
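Note that insert_many will happily write the same listings again on the next run. One option, sketched here under the assumption that title plus total price is a good-enough natural key (swap in the listing URL if you collect it), is to upsert each record instead:

```python
from pymongo import MongoClient

def upsert_to_mongodb(data):
    """Upsert each listing keyed on (title, price) to avoid duplicates across runs."""
    client = MongoClient('mongodb://localhost:27017/')
    collection = client['real_estate']['beijing_ershou']
    for record in data.to_dict('records'):
        collection.update_one(
            {'title': record['title'], 'price': record['price']},  # assumed key
            {'$set': record},
            upsert=True
        )
```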
### 5.2 Saving to MySQL

```python
import mysql.connector

def save_to_mysql(data):
    conn = mysql.connector.connect(
        host="localhost",
        user="root",
        password="password",
        database="real_estate"
    )
    cursor = conn.cursor()

    create_table = """
    CREATE TABLE IF NOT EXISTS beijing_ershou (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        price FLOAT,
        unit_price FLOAT,
        district VARCHAR(50),
        area FLOAT,
        layout VARCHAR(50),
        crawl_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
    """
    cursor.execute(create_table)

    insert_sql = """
    INSERT INTO beijing_ershou
    (title, price, unit_price, district, area, layout)
    VALUES (%s, %s, %s, %s, %s, %s)
    """
    for _, row in data.iterrows():
        cursor.execute(insert_sql, (
            row['title'], row['price'], row['unit_price'],
            row['district'], row['area'], row['layout']
        ))

    conn.commit()
    conn.close()
```
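Executing one INSERT per row is fine for a few hundred listings but slow beyond that; mysql-connector-python's `executemany` can batch them. A hedged variant of the insert loop above (the helper name is mine):

```python
def insert_listings(conn, data):
    """Batch-insert listings with executemany instead of one execute() per row."""
    insert_sql = """
    INSERT INTO beijing_ershou
    (title, price, unit_price, district, area, layout)
    VALUES (%s, %s, %s, %s, %s, %s)
    """
    # Convert to plain Python types so the connector can serialize them
    rows = [
        (r['title'], float(r['price']), float(r['unit_price']),
         r['district'], float(r['area']), r['layout'])
        for r in data.to_dict('records')
    ]
    cursor = conn.cursor()
    cursor.executemany(insert_sql, rows)
    conn.commit()
    cursor.close()
```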
## 6. Data Cleaning and Analysis

### 6.1 Cleaning

```python
def clean_data(df):
    # Drop rows with missing values
    df = df.dropna()
    # Drop duplicate listings
    df = df.drop_duplicates(subset=['title'])
    # Make sure the numeric columns are floats
    df['price'] = df['price'].astype(float)
    df['unit_price'] = df['unit_price'].astype(float)
    return df

cleaned_data = clean_data(data)
```
### 6.2 Quick Analysis

```python
# Average unit price per district
district_avg = cleaned_data.groupby('district')['unit_price'].mean().sort_values(ascending=False)

# Distribution of total prices (10k yuan)
price_bins = [0, 200, 300, 400, 500, 600, 1000, float('inf')]
price_labels = ['0-200', '200-300', '300-400', '400-500', '500-600', '600-1000', '1000+']
cleaned_data['price_range'] = pd.cut(cleaned_data['price'], bins=price_bins, labels=price_labels)
```
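A quick chart makes the district comparison easier to read. The sketch below uses matplotlib, which is not in the library list above and would need to be installed separately; district names are Chinese, so configure a CJK-capable font if the labels render as boxes.

```python
import matplotlib.pyplot as plt

def plot_district_avg(district_avg):
    """Bar chart of average unit price (yuan per square meter) by district."""
    ax = district_avg.plot(kind='bar', figsize=(10, 5))
    ax.set_xlabel('District')
    ax.set_ylabel('Average unit price (yuan/m²)')
    ax.set_title('Beijing second-hand housing: average unit price by district')
    plt.tight_layout()
    plt.savefig('district_avg_price.png')

plot_district_avg(district_avg)
```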
## 7. Project Structure

```
beijing-housing-spider/
├── config/              # configuration
│   ├── db_config.py     # database settings
│   └── user_agents.py   # User-Agent list
├── spiders/             # crawler core
│   ├── base_spider.py   # base spider class
│   ├── lianjia.py       # Lianjia spider
│   └── beike.py         # Beike spider
├── utils/               # utilities
│   ├── proxy.py         # proxy helpers
│   └── logger.py        # logging helpers
├── storage/             # storage layer
│   ├── mongodb.py
│   └── mysql.py
└── main.py              # entry point
```
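The article does not show `base_spider.py`; a minimal sketch of what such a base class could look like (all names here are hypothetical) is:

```python
import random
import time

import requests

class BaseSpider:
    """Hypothetical base class holding the request logic shared by site spiders."""

    def __init__(self, user_agents, delay_range=(1, 3)):
        self.user_agents = user_agents
        self.delay_range = delay_range

    def fetch(self, url, **kwargs):
        """GET a page with a random User-Agent and a polite delay."""
        time.sleep(random.uniform(*self.delay_range))
        headers = {"User-Agent": random.choice(self.user_agents)}
        return requests.get(url, headers=headers, timeout=10, **kwargs)

    def parse(self, html):
        """Site-specific spiders (lianjia.py, beike.py) override this."""
        raise NotImplementedError
```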
## 8. Scheduled Crawling

```python
# Example: schedule a daily crawl with APScheduler
from apscheduler.schedulers.blocking import BlockingScheduler

def daily_job():
    data = get_multiple_pages(1, 5)
    save_to_mongodb(data)

scheduler = BlockingScheduler()
scheduler.add_job(daily_job, 'cron', hour=2)  # run every day at 2 a.m.
scheduler.start()
```
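When the job runs unattended in the middle of the night, print output is easy to lose. Routing it through the standard logging module (roughly what the `utils/logger.py` module in the project layout would hold) makes failures traceable; a minimal sketch:

```python
import logging

logging.basicConfig(
    filename='spider.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'
)

def daily_job_logged():
    try:
        data = get_multiple_pages(1, 5)
        save_to_mongodb(data)
        logging.info("Daily crawl finished: %d rows", len(data))
    except Exception:
        logging.exception("Daily crawl failed")
```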
## 9. Summary

With the approach described above you can build a complete collection pipeline for Beijing second-hand housing data. In real use, gradually round out the exception handling, logging, and related features to make the crawler more robust and reliable.