
Published: 2021-10-09 16:40:31 | Author: 柒染
Source: 亿速云

# How to Scrape and Analyze Weather Data with Scrapy in Python

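## 1. Introduction to the Scrapy framework

Scrapy is one of the most capable web-crawling frameworks in the Python ecosystem. It is built on an asynchronous architecture and offers the following core advantages:
- Built-in high-performance downloader (based on Twisted)
- Extensible middleware mechanism
- Support for XPath and CSS selectors
- Automatic handling of response encodings and failed requests (retries)
- Built-in data export (feed exports to formats such as JSON and CSV)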

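## 2. Target analysis: choosing a weather data source

### 2.1 Comparison of common data sources
| Data source                      | Advantages                   | Disadvantages                      |
|----------------------------------|------------------------------|------------------------------------|
| 中国天气网 (weather.com.cn)       | Authoritative official data  | Strict anti-scraping measures      |
| Weather.com                      | Broad international coverage | Data is in English                 |
| 聚合数据 API (Juhe Data)          | Well-defined interface       | Limited number of calls            |

### 2.2 Example used in this article
This article uses 中国天气网 (www.weather.com.cn) as the example and scrapes the 7-day weather forecast for Beijing.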

## 3. Setting up the project

### 3.1 Creating the Scrapy project
```bash
pip install scrapy
scrapy startproject weather_spider
cd weather_spider
scrapy genspider weather www.weather.com.cn
```

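### 3.2 Core component configuration

Key settings in `settings.py`:

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
ROBOTSTXT_OBEY = False  # skips robots.txt checks; set to True to follow the guidance in section 7
DOWNLOAD_DELAY = 2  # wait 2 seconds between requests to lower the request rate
ITEM_PIPELINES = {
   'weather_spider.pipelines.WeatherSpiderPipeline': 300,
}
```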
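The fixed `DOWNLOAD_DELAY` can be complemented by Scrapy's AutoThrottle extension, which adjusts the delay to the latency it observes. The fragment below is a sketch of optional additions to `settings.py`; the setting names are standard Scrapy settings, but the values are illustrative assumptions rather than values from the original article.

```python
# settings.py (optional additions; the values are illustrative assumptions)
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 2         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30          # upper bound on the delay when the server is slow
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # send one request at a time to each domain
RETRY_TIMES = 2                      # retry a failed request at most twice
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` still acts as the minimum delay between requests.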

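## 4. Implementing the spider

### 4.1 Defining the data model

`items.py`:

```python
import scrapy

class WeatherItem(scrapy.Item):
    date = scrapy.Field()       # date
    temp = scrapy.Field()       # temperature range
    weather = scrapy.Field()    # weather conditions
    wind = scrapy.Field()       # wind force and direction
    humidity = scrapy.Field()   # humidity
```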

### 4.2 Writing the core spider logic

`weather.py`:

```python
import scrapy
from weather_spider.items import WeatherItem

class WeatherSpider(scrapy.Spider):
    name = 'weather'
    allowed_domains = ['weather.com.cn']
    # 101010100 is the city code for Beijing
    start_urls = ['http://www.weather.com.cn/weather/101010100.shtml']

    def parse(self, response):
        # The id "7d" starts with a digit, so an attribute selector is used
        # instead of the "#7d" shorthand
        for day in response.css('div[id="7d"] ul li'):
            item = WeatherItem()
            item['date'] = day.css('h1::text').get()
            item['weather'] = day.css('p.wea::text').get()
            # Join the high (span) and low (i) values into one "high/low" string
            item['temp'] = '/'.join(day.css('p.tem span::text, p.tem i::text').getall())
            item['wind'] = day.xpath('.//p[@class="win"]/i/text()').get()
            yield item
```
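The spider stores the temperature as one joined string, which is awkward to analyze later. The helper below is a minimal sketch of splitting it into numeric high and low values. The function name `split_temp` is hypothetical, and the handling of a single value (treated as the low) is an assumption about pages that list only one temperature.

```python
import re

# An optional minus sign followed by digits, e.g. "25" or "-3"
_NUMBER = re.compile(r'-?\d+')

def split_temp(temp):
    """Return (high, low) as ints from a string such as '25/18℃'.

    Assumption: when only one number is present it is the low value,
    and the high is reported as None.
    """
    values = [int(n) for n in _NUMBER.findall(temp or '')]
    if len(values) >= 2:
        return values[0], values[1]
    if len(values) == 1:
        return None, values[0]
    return None, None
```

For example, `split_temp('25/18℃')` returns `(25, 18)`. The function could be called in `parse` before yielding the item, or in a pipeline, so that numeric values are stored alongside the raw string.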

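### 4.3 Strategies against anti-scraping measures

- User-Agent rotation: use the `scrapy-fake-useragent` library
- IP proxy pool: configure a downloader middleware in `middlewares.py`
- CAPTCHA recognition: integrate a third-party CAPTCHA-solving service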
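To show what User-Agent rotation involves, here is a hand-written downloader middleware as a minimal sketch. It is not the `scrapy-fake-useragent` implementation. The class name `RandomUserAgentMiddleware` and the `USER_AGENTS` pool are hypothetical, and the strings in the pool are placeholders to be replaced with current browser values.

```python
import random

# Hypothetical pool; replace with up-to-date, real browser strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # None tells Scrapy to continue processing the request
```

Assuming the class lives in `weather_spider/middlewares.py`, it would be enabled through `DOWNLOADER_MIDDLEWARES` in `settings.py`, with a priority below 500 so that it runs before Scrapy's built-in `UserAgentMiddleware`.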

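## 5. Data storage and analysis

### 5.1 Storage

Example `pipelines.py` (storing to MongoDB):

```python
import pymongo

class WeatherSpiderPipeline:
    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.db = self.client['weather_db']

    def close_spider(self, spider):
        # Release the connection when the crawl ends
        self.client.close()

    def process_item(self, item, spider):
        # Store each forecast entry in the 'beijing' collection
        self.db['beijing'].insert_one(dict(item))
        return item
```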

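### 5.2 Analysis example

Basic analysis with Pandas:

```python
import pandas as pd
import pymongo
from matplotlib import pyplot as plt

# Load the forecasts written by the pipeline
db = pymongo.MongoClient('mongodb://localhost:27017')['weather_db']
df = pd.DataFrame(list(db['beijing'].find()))

# temp is the "high/low" string joined by the spider (e.g. "25/18℃");
# the leading number is the daily high. float tolerates missing values.
df['high_temp'] = df['temp'].str.extract(r'^(-?\d+)', expand=False).astype(float)

# Plot the temperature trend
ax = df.plot(x='date', y='high_temp', kind='line', figsize=(10, 5))
ax.set_title('Beijing: daily high temperature over the next 7 days')
plt.savefig('weather_trend.png')
```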

## 6. Advanced techniques

### 6.1 Distributed crawling

Implemented with scrapy-redis:

```python
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```
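The scheduler and duplicate filter also need to know which Redis instance to share. A sketch of two further scrapy-redis settings follows; the local Redis URL is an assumption and should be adjusted to the actual deployment.

```python
# settings.py (assumes a Redis instance on localhost; adjust as needed)
REDIS_URL = 'redis://localhost:6379'   # Redis instance holding the shared request queue
SCHEDULER_PERSIST = True               # keep the queue and dedup set between runs
```

Every worker that points at the same Redis instance then pulls requests from the same queue.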

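### 6.2 Handling dynamic pages

Integrating Selenium:

```python
from scrapy_selenium import SeleniumRequest

# Inside a spider method such as start_requests() or a callback;
# `url` is the page to render in the browser
yield SeleniumRequest(
    url=url,
    callback=self.parse_detail,
    # Scroll to the bottom so lazily loaded content is rendered
    script='window.scrollTo(0, document.body.scrollHeight)'
)
```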

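## 7. Legal and ethical guidelines

- Comply with the site's robots.txt
- Limit the request rate (at least 2 seconds between requests is recommended)
- Avoid unauthorized scraping for commercial purposes
- Credit the source when using the data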
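The first guideline can be checked in code before a URL is requested. The sketch below uses Python's standard `urllib.robotparser`. The helper name `is_allowed` and the sample robots.txt content are hypothetical and only illustrate the check; they are not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent='*'):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""
```

Within a Scrapy project, setting `ROBOTSTXT_OBEY = True` in `settings.py` makes the framework perform this check automatically.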

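## Conclusion

With the Scrapy framework, we have:
1. Scraped structured weather data efficiently
2. Stored the data persistently
3. Produced a basic visualization
4. Handled common anti-scraping measures

The complete project code has been uploaded to GitHub (example repository URL). For real-world use, add production-grade pieces such as exception handling and logging, and consider using APScheduler for scheduled crawls.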