# How to Scrape Weather Data with Python

Published: 2022-01-13 15:57:42 | Author: 小新
Source: 亿速云

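## Preface

Weather data matters to industries such as agriculture, transportation, and tourism. Python's rich library ecosystem makes it a go-to language for writing web scrapers. This article walks through the full workflow for scraping weather data with Python, from environment setup to data storage.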

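## 1. Preparation

### 1.1 Setting Up the Development Environment

Make sure Python is installed (3.7 or later is recommended), then install the required libraries: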

```bash
pip install requests beautifulsoup4 pandas selenium pymysql sqlalchemy apscheduler matplotlib
```

### 1.2 Choosing a Target Site

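Common sources of weather data:

- China Weather Network (中国天气网, www.weather.com.cn)
- National Meteorological Center (中央气象台, www.nmc.cn)
- World Weather Online (www.worldweatheronline.com)

Note: before scraping, always check the site's robots.txt file and terms of use.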
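
The robots.txt check can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` and the sample rules are made up for illustration and are not the real rules of any site listed above.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules for illustration only; not the real robots.txt of any site.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "http://example.com/weather/101010100.shtml"))  # True
print(is_allowed(SAMPLE_RULES, "http://example.com/private/data.json"))        # False
```

For a live site, `parser.set_url("<site>/robots.txt")` followed by `parser.read()` downloads the real file instead of parsing a string.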

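## 2. Scraping Static Pages (China Weather Network Example)

### 2.1 Analyzing the Page Structure

1. Open the page for the target city (for example, Beijing).
2. Inspect the elements with the browser's developer tools (F12).
3. Locate the HTML tags that hold key data such as temperature and humidity.

### 2.2 Implementation with Requests and BeautifulSoup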

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_weather(city_code):
    """Fetch the 7-day forecast for a city code; return a DataFrame, or None on failure."""
    url = f"http://www.weather.com.cn/weather/{city_code}.shtml"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the 7-day forecast; these selectors depend on the page
        # layout and need updating if the site changes its HTML
        weather_list = []
        for item in soup.select(".t.clearfix li"):
            date = item.select_one("h1").get_text()
            weather = item.select_one(".wea").get_text()
            temp = item.select_one(".tem").get_text().replace("\n", "")
            wind = item.select_one(".win em span")["title"]

            weather_list.append({
                "日期": date,
                "天气": weather,
                "温度": temp,
                "风向": wind
            })

        return pd.DataFrame(weather_list)

    except Exception as e:
        # Covers network errors as well as missing elements during parsing
        print(f"Scrape failed: {e}")
        return None

# Usage example
df = get_weather('101010100')  # city code for Beijing
if df is not None:
    print(df.head())
```

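### 2.3 Handling Anti-Scraping Measures

- User-Agent rotation: keep a pool of common browser UA strings.
- Request intervals: pause between requests, e.g. `time.sleep(random.uniform(1, 3))`.
- Proxy IP pool: a fallback when an IP address gets blocked.
- Cookie handling: maintain session state.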
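
The first two items can be sketched with the standard library alone. The helper names `random_headers` and `polite_sleep` and the UA strings below are assumptions for illustration; swap in UA strings from current browsers.

```python
import random
import time

# Example UA strings; an assumption for illustration, replace with current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
]

def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(min_s=1.0, max_s=3.0):
    """Sleep for a random interval and return the number of seconds slept."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

headers = random_headers()
print(headers["User-Agent"] in USER_AGENTS)  # True
```

In a scraping loop, each request would then look like `requests.get(url, headers=random_headers(), timeout=10)` followed by `polite_sleep()`.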

## 3. Scraping Dynamic Pages (World Weather Online Example)

### 3.1 Browser Automation with Selenium

When data is loaded dynamically by JavaScript, a browser automation tool is needed:

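```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

def get_dynamic_weather():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # run without a visible window
    driver = webdriver.Chrome(options=chrome_options)

    try:
        driver.get("https://www.worldweatheronline.com/beijing-weather/beijing/cn.aspx")
        time.sleep(5)  # wait for the page to load

        # Locate elements by XPath. The find_element_by_xpath helper was removed
        # in Selenium 4.3, so use find_element(By.XPATH, ...) instead.
        # Check these XPaths against the live page before relying on them.
        temp = driver.find_element(By.XPATH, '//div[@class="temp"]').text
        condition = driver.find_element(By.XPATH, '//div[@class="condition"]').text

        print(f"Current temperature: {temp}, conditions: {condition}")

    finally:
        driver.quit()  # always close the browser, even on errors

get_dynamic_weather()
```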

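### 3.2 Advanced Tips

- Explicit waits: use `WebDriverWait` instead of a fixed sleep.
- Screenshot debugging: `driver.save_screenshot('debug.png')`.
- Avoiding headless-browser detection: add the `--disable-blink-features=AutomationControlled` argument.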

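## 4. Calling an API (Recommended)

### 4.1 Finding an Open API

- QWeather (和风天气; a commercial API with a free tier)
- OpenWeatherMap (the free plan has limits)
- The national meteorological administration's open platform (国家气象局开放平台)

### 4.2 Example: QWeather API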

```python
import requests
import json

def get_weather_by_api(location="101010100", key="YOUR_API_KEY"):
    """Query the QWeather real-time endpoint; return a dict, or None on failure."""
    url = "https://devapi.qweather.com/v7/weather/now"

    # Let requests build and URL-encode the query string
    response = requests.get(url, params={"location": location, "key": key}, timeout=10)
    data = response.json()

    # The API reports its status as a string code in the JSON body
    if data['code'] == '200':
        weather_info = {
            '观测时间': data['updateTime'],
            '温度': f"{data['now']['temp']}°C",
            '体感温度': f"{data['now']['feelsLike']}°C",
            '天气': data['now']['text'],
            '风向': data['now']['windDir'],
            '风速': f"{data['now']['windSpeed']}km/h"
        }
        return weather_info
    else:
        return None

# Usage example
result = get_weather_by_api()
print(json.dumps(result, indent=2, ensure_ascii=False))
```

## 5. Storing and Managing Data

### 5.1 Saving to CSV

```python
df.to_csv('weather_data.csv', index=False, encoding='utf_8_sig')
```
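
The `utf_8_sig` encoding writes a UTF-8 byte order mark (BOM) at the start of the file, which helps spreadsheet programs such as Excel detect the encoding and display Chinese column names correctly. The sketch below shows the effect with the standard library's `csv` module. The sample row and file name are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up sample row for illustration
rows = [
    {"日期": "2022-01-13", "天气": "晴", "温度": "5/-3℃"},
]

path = os.path.join(tempfile.mkdtemp(), "weather_sample.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["日期", "天气", "温度"])
    writer.writeheader()
    writer.writerows(rows)

# Read the raw bytes back to confirm the file starts with the BOM
with open(path, "rb") as f:
    first_bytes = f.read(3)

print(first_bytes == b"\xef\xbb\xbf")  # True
```

Reading the file back with `encoding="utf-8-sig"` strips the BOM again, so the first column name comes back unchanged.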

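### 5.2 Saving to a Database (MySQL Example)

```python
import pymysql  # driver behind the mysql+pymysql:// URL below
from sqlalchemy import create_engine

def save_to_mysql(df, table_name='weather_data'):
    # Replace user, password, host, and database name with your own
    engine = create_engine('mysql+pymysql://user:password@localhost:3306/weather_db')
    # Append rows; pandas creates the table if it does not exist yet
    df.to_sql(table_name, engine, if_exists='append', index=False)

save_to_mysql(df)
```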

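### 5.3 Scheduled Scraping (APScheduler)

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def job():
    print("Starting scheduled scrape...")
    df = get_weather('101010100')
    if df is not None:  # get_weather returns None on failure
        save_to_mysql(df)

scheduler = BlockingScheduler()
scheduler.add_job(job, 'interval', hours=3)  # run every 3 hours
scheduler.start()  # blocks the current thread until interrupted
```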

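## 6. Analysis and Visualization

### 6.1 Analysis with Pandas

```python
import pandas as pd

# Load the data
df = pd.read_csv('weather_data.csv')

# Temperature analysis: the 温度 column holds text, so pull out the first
# number in each value (possibly negative) before aggregating
temps = df['温度'].str.extract(r'(-?\d+)', expand=False).astype(float)
print(f"Average temperature: {temps.mean():.1f}°C")
print(f"Highest temperature: {temps.max():.0f}°C")
```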
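### 6.2 Visualization with Matplotlib

```python
import matplotlib.pyplot as plt

# Assumes 日期 holds values pd.to_datetime can parse; clean raw scraped
# date labels into full dates first if it does not
df['日期'] = pd.to_datetime(df['日期'])
# Take the first number in each value as the daily high; -? keeps negatives
df['最高温'] = df['温度'].str.extract(r'(-?\d+)', expand=False).astype(float)

plt.figure(figsize=(10, 5))
plt.plot(df['日期'], df['最高温'], marker='o')
plt.title('Recent Temperature Changes in Beijing')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid()
plt.show()
```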

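## 7. Optimization Suggestions

- Error handling: add a retry mechanism for network requests.
- Logging: record run status with the `logging` module.
- Distributed scraping: the Scrapy-Redis framework.
- Data cleaning: handle missing values and outliers.
- Legal compliance: throttle the request rate so the target server is not put under load.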
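
The first two suggestions combine naturally. Below is a minimal sketch of a retry helper with exponential backoff that logs each failed attempt. The helper name `with_retries`, the logger name, and the default values are assumptions for illustration.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("weather_crawler")  # hypothetical logger name

def with_retries(func, max_attempts=3, base_delay=1.0):
    """Call func(); on an exception, wait and retry with exponential backoff.

    Re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A network call can then be wrapped as `with_retries(lambda: requests.get(url, timeout=10))`, so a transient failure costs a short wait instead of the whole run.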

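## 8. Example Project Structure

```
weather_crawler/
│── config.py         # configuration such as API keys
│── crawler.py        # main crawler program
│── requirements.txt  # dependencies
│── utils/            # helper functions
│   ├── logger.py     # logging setup
│   └── proxy.py      # proxy management
└── data/             # data storage
    ├── raw/          # raw data
    └── processed/    # processed data
```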
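
As one possible shape for `config.py`, the sketch below reads the API key from an environment variable so that it stays out of version control. The variable name `QWEATHER_API_KEY` and the helper `load_api_key` are hypothetical.

```python
import os

# Hypothetical variable name; export QWEATHER_API_KEY before running the crawler.
API_KEY_ENV = "QWEATHER_API_KEY"

def load_api_key(env=os.environ):
    """Return the API key from the environment, or raise if it is missing."""
    key = env.get(API_KEY_ENV, "").strip()
    if not key:
        raise RuntimeError(f"Set the {API_KEY_ENV} environment variable first")
    return key
```

The crawler would then call `load_api_key()` once at startup and pass the result to its API functions.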

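## Conclusion

This article covered several ways to obtain weather data with Python: scraping static pages, handling dynamic pages, and calling APIs. In practice, prefer an official API where one exists, and always follow ethical scraping norms. With sensible storage and analysis, weather data can inform both business decisions and everyday life.

Note: the sample code in this article is for learning purposes only. In real use, follow each site's terms of use and keep the request rate reasonable.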
