# How to Scrape Daily Weather with Python Selenium
## Introduction
In today's data-driven world, accurate weather information matters for daily life, agricultural planning, transportation, and many other fields. Checking it by hand is slow and tedious; by combining Python with the Selenium automation framework, we can scrape daily weather data efficiently. This article walks through building a stable weather crawler with Selenium.
## 1. Environment Setup
### 1.1 Installing the Required Libraries
```bash
pip install selenium beautifulsoup4 pandas
```
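As a quick sanity check (a minimal sketch, not part of the original tutorial), you can confirm the packages are importable:

```python
# Verify the installation by importing each package and printing its version.
import selenium
import bs4
import pandas

print(selenium.__version__, bs4.__version__, pandas.__version__)
```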
### 1.2 Downloading the Browser Driver
Download the driver that matches your browser version:
- Chrome: ChromeDriver
- Firefox: GeckoDriver

Place the driver executable in a directory on the system PATH, or in the project directory.
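If you would rather not rely on PATH, Selenium 4 also lets you point at the driver binary explicitly (a minimal sketch; `./chromedriver` is an assumed location, not one given in the article):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at a driver binary that is not on PATH.
# "./chromedriver" is a placeholder; substitute your actual driver path.
service = Service("./chromedriver")
driver = webdriver.Chrome(service=service)
```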
## 2. Analyzing the Target Website
Taking the China Weather Network (www.weather.com.cn) as an example:

```python
# Example element locator. Note: Selenium cannot return text() nodes
# directly, so locate the element itself and read its .text attribute.
temperature_xpath = '//div[@class="tem"]/span'
```
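Used with Selenium, the locator above would look something like this (a sketch, assuming `driver` is the WebDriver instance created in the next section):

```python
from selenium.webdriver.common.by import By

# Find the temperature element and read its visible text.
temperature = driver.find_element(By.XPATH, '//div[@class="tem"]/span').text
print(temperature)
```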
## 3. Basic Crawler Implementation
### 3.1 Initializing the Browser Driver

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def init_driver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # headless mode
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)
    return driver
```
### 3.2 Waiting for the Page to Load

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def get_weather_data(driver, city):
    driver.get(f"http://www.weather.com.cn/weather/{city}.shtml")
    try:
        # Wait up to 10 seconds for the 7-day forecast container to appear
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "7d"))
        )
        # Return the full page source so it can be parsed with BeautifulSoup
        return driver.page_source
    except TimeoutException:
        print("Page load timed out")
        return None
```
## 4. Parsing the Data

```python
from bs4 import BeautifulSoup

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    weather_list = []
    # Each <li> under the ".t" container holds one day's forecast
    for item in soup.select('.t li'):
        date = item.select_one('.date').get_text()
        weather = item.select_one('.wea').get_text()
        temp = item.select_one('.tem').get_text().replace('\n', '')
        weather_list.append([date, weather, temp])
    return weather_list
```
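Since pandas was installed in section 1.1, the parsed rows can also be loaded into a DataFrame for further analysis (a minimal sketch, assuming `html` holds the page source returned by `get_weather_data`):

```python
import pandas as pd

# Wrap the parsed rows in a DataFrame; column names mirror the CSV header below.
df = pd.DataFrame(parse_html(html), columns=['date', 'weather', 'temperature'])
print(df.head())
```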
## 5. Storing the Data
### 5.1 Saving to CSV

```python
import csv

def save_to_csv(data, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Date', 'Weather', 'Temperature'])
        writer.writerows(data)
```
### 5.2 Saving to MySQL

```python
import pymysql

def save_to_mysql(data):
    conn = pymysql.connect(host='localhost',
                           user='root',
                           password='password',
                           database='weather')
    cursor = conn.cursor()
    sql = """INSERT INTO daily_weather
             (record_date, weather, temperature)
             VALUES (%s, %s, %s)"""
    cursor.executemany(sql, data)
    conn.commit()
    conn.close()
```
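The insert above assumes a `daily_weather` table already exists. Here is a one-time setup sketch; the column types are assumptions, not taken from the original article:

```python
import pymysql

# Hypothetical schema matching the INSERT statement above.
DDL = """CREATE TABLE IF NOT EXISTS daily_weather (
    id INT AUTO_INCREMENT PRIMARY KEY,
    record_date VARCHAR(32),
    weather VARCHAR(64),
    temperature VARCHAR(32)
)"""

conn = pymysql.connect(host='localhost', user='root',
                       password='password', database='weather')
with conn.cursor() as cursor:
    cursor.execute(DDL)
conn.commit()
conn.close()
```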
## 6. Anti-Scraping Countermeasures
Add a random delay between requests:

```python
import random
import time

time.sleep(random.uniform(1, 3))
```

Set a realistic User-Agent when configuring the driver:

```python
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
```
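Putting the two countermeasures together, a sketch of `init_driver()` from section 3 extended with a custom User-Agent and a politeness delay might look like this (the function names here are illustrative, not from the original article):

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def init_stealth_driver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")
    # Present a realistic browser User-Agent instead of the headless default
    chrome_options.add_argument(
        "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    )
    return webdriver.Chrome(options=chrome_options)

def polite_get(driver, url):
    # Pause a random 1-3 seconds before each request to avoid
    # hitting the server with a regular, bot-like rhythm.
    time.sleep(random.uniform(1, 3))
    driver.get(url)
```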
## 7. Scheduled Crawling
Use APScheduler to run the crawl on a schedule:
```python
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()

@scheduler.scheduled_job('cron', hour=7)  # run every day at 07:00
def daily_job():
    driver = init_driver()
    html = get_weather_data(driver, '101010100')  # city code for Beijing
    processed = parse_html(html)
    save_to_csv(processed, 'weather.csv')
    driver.quit()

scheduler.start()
```
## 8. Complete Code Example

```python
import csv
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup


class WeatherSpider:
    def __init__(self):
        self.driver = self.init_driver()

    def init_driver(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        options.add_argument("user-agent=Mozilla/5.0")
        driver = webdriver.Chrome(options=options)
        return driver

    def fetch_data(self, city_code):
        url = f"http://www.weather.com.cn/weather/{city_code}.shtml"
        self.driver.get(url)
        try:
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.ID, "7d"))
            )
            time.sleep(random.uniform(1, 2))
            return self.driver.page_source
        except Exception as e:
            print(f"Error occurred: {str(e)}")
            return None

    def parse_data(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        results = []
        for day in soup.select('.t li'):
            try:
                date = day.select_one('.date').get_text().strip()
                weather = day.select_one('.wea').get_text().strip()
                temp = day.select_one('.tem').get_text().replace('\n', '').strip()
                results.append([date, weather, temp])
            except AttributeError:
                # Skip list items that do not contain a full forecast entry
                continue
        return results

    def save_data(self, data, filename='weather.csv'):
        with open(filename, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            if f.tell() == 0:  # write the header only for a new, empty file
                writer.writerow(['Date', 'Weather', 'Temperature'])
            writer.writerows(data)

    def run(self, city_code):
        html = self.fetch_data(city_code)
        if html:
            data = self.parse_data(html)
            self.save_data(data)
            print(f"Fetched {len(data)} weather records")
        else:
            print("Failed to fetch data")

    def close(self):
        self.driver.quit()


if __name__ == "__main__":
    spider = WeatherSpider()
    try:
        spider.run('101010100')  # city code for Beijing
    finally:
        spider.close()
```
## FAQ
**Q: How do I handle pages whose data is rendered dynamically by JavaScript?**
A: Use Selenium's explicit wait mechanism, or analyze the underlying AJAX endpoints and request them directly.

**Q: What if the website starts blocking the crawler with CAPTCHAs?**
A: Consider: 1. lowering the request frequency; 2. using a third-party CAPTCHA-solving service; 3. switching to another data source.

**Q: How often should the crawler run?**
A: Recommendations: 1. check how often the website updates its data; 2. set a reasonable crawl interval; 3. use the website's official API, if one is available.
## Conclusion
This article has walked through the complete workflow of scraping daily weather data with Python and Selenium. For real-world use, keep the following in mind:

1. Respect the website's robots.txt policy.
2. Throttle the crawl rate so you do not put undue load on the server.
3. Maintain the code regularly, since the site layout may change.

Hopefully this article helps you build a stable, reliable weather data collection system and lays a solid foundation for downstream analysis and applications.
## Appendix: Weather Codes for Major Chinese Cities

| City | Code |
|---|---|
| Beijing | 101010100 |
| Shanghai | 101020100 |
| Guangzhou | 101280101 |
| Shenzhen | 101280601 |
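As a usage sketch, the appendix codes can drive the `WeatherSpider` class from section 8 across several cities (the `CITY_CODES` dictionary is an illustrative helper, not part of the original article):

```python
# City codes taken from the appendix table above.
CITY_CODES = {
    "Beijing": "101010100",
    "Shanghai": "101020100",
    "Guangzhou": "101280101",
    "Shenzhen": "101280601",
}

spider = WeatherSpider()
try:
    for city, code in CITY_CODES.items():
        print(f"Fetching weather for {city}...")
        spider.run(code)
finally:
    spider.close()
```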
Note: the code samples in this article may need to be adjusted to the actual structure of the target website. Update the element locators for the specific site before running, and comply with the website's terms of use.