Published: 2021-10-26 09:38:37 | Author: 柒染
Source: 亿速云

# How to Scrape Chinese University Rankings with Python and Save Them to Excel

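## Preface

The ability to collect and process data has become an important skill. Education data, and university rankings in particular, is a useful reference for students choosing a school and for academic research. This article walks through how to scrape Chinese university ranking data with Python and save the results to an Excel file.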

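## 1. Preparation

### 1.1 Required tools and libraries

Before starting, install the following Python libraries:

- `requests`: sends HTTP requests
- `BeautifulSoup` (installed as `beautifulsoup4`): parses HTML documents
- `pandas`: handles data processing and Excel export
- `openpyxl`: the engine pandas uses to write `.xlsx` files

They can all be installed with one command: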

```bash
pip install requests beautifulsoup4 pandas openpyxl
```

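### 1.2 Analyzing the target site

We use the "软科中国大学排名" (ShanghaiRanking's Chinese university ranking) as the example, assuming the target URL is https://www.shanghairanking.cn/rankings/bcur/2023. Before writing the scraper, we need to:

- Check whether the site has anti-scraping measures
- Analyze the page structure
- Identify where the data sits in the page

The browser's developer tools (F12) show the page's HTML structure.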

## 2. Implementing the scraper

### 2.1 Sending the HTTP request

First, fetch the page content:

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.shanghairanking.cn/rankings/bcur/2023"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # make sure Chinese text is decoded correctly
```

### 2.2 Parsing the HTML

Parse the returned HTML with BeautifulSoup:

```python
soup = BeautifulSoup(response.text, 'html.parser')
```

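### 2.3 Locating the data

After inspecting the page structure, find the table that holds the ranking data. Note that `soup.find` returns `None` when nothing matches, so check the result before using it:

```python
rank_table = soup.find('table', {'class': 'rk-table'})  # assumes the table's class is rk-table
```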

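### 2.4 Extracting the data

Iterate over the table rows and pull out the fields we need:

```python
universities = []
for row in rank_table.find_all('tr')[1:]:  # skip the header row
    cols = row.find_all('td')
    if len(cols) >= 3:  # cols[2] is read below, so at least three cells are required
        rank = cols[0].text.strip()
        name = cols[1].text.strip()
        # Assumes the third column holds the total score; verify the
        # column order against the live page and adjust the index if needed
        score = cols[2].text.strip()
        universities.append({
            '排名': rank,
            '学校名称': name,
            '总分': score
        })
```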
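The extraction logic is easier to check when it can run against a small HTML string instead of the live site. The following is a minimal sketch under stated assumptions: the function name `parse_ranking_table`, the sample HTML, and the three-column layout (rank, name, score) are all made up for illustration, and the real page may order its columns differently.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the assumed layout (rank, name, score).
SAMPLE_HTML = """
<table class="rk-table">
  <tr><th>排名</th><th>学校名称</th><th>总分</th></tr>
  <tr><td> 1 </td><td>University A</td><td>100.0</td></tr>
  <tr><td> 2 </td><td>University B</td><td>95.5</td></tr>
  <tr><td colspan="3">footer note</td></tr>
</table>
"""

def parse_ranking_table(html, table_class="rk-table"):
    """Return one dict per data row; an empty list if the table is missing."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:  # soup.find returns None when nothing matches
        return []
    records = []
    for row in table.find_all("tr"):
        cols = row.find_all("td")
        if len(cols) < 3:  # skips the header row (th cells) and short rows
            continue
        records.append({
            "排名": cols[0].get_text(strip=True),
            "学校名称": cols[1].get_text(strip=True),
            "总分": cols[2].get_text(strip=True),
        })
    return records

if __name__ == "__main__":
    for record in parse_ranking_table(SAMPLE_HTML):
        print(record)
```

Returning an empty list when the table is missing lets the caller decide how to react, rather than failing with an `AttributeError` on `None`.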

## 3. Processing and saving the data

### 3.1 Processing the data with pandas

Convert the extracted records into a DataFrame:

```python
import pandas as pd

df = pd.DataFrame(universities)
```

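### 3.2 Cleaning the data

Apply whatever cleaning the data needs:

```python
# Drop rows that contain missing values
df = df.dropna()
# Reset the index
df = df.reset_index(drop=True)
```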
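One caveat: values scraped with `.text.strip()` are strings, so a blank cell arrives as `''` rather than `NaN`, and `dropna()` alone will not remove it. Scores also stay as text until they are converted. The sketch below shows one possible cleaning step; the function name `clean_rankings` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

def clean_rankings(df):
    """Return a cleaned copy: blanks removed, 总分 converted to float."""
    cleaned = df.copy()
    # Scraped cells are strings, so a blank cell is '' rather than NaN.
    cleaned = cleaned.replace("", float("nan"))
    # Non-numeric scores become NaN instead of raising an error.
    cleaned["总分"] = pd.to_numeric(cleaned["总分"], errors="coerce")
    cleaned = cleaned.dropna().drop_duplicates(subset="学校名称")
    return cleaned.reset_index(drop=True)

if __name__ == "__main__":
    # Made-up rows: one blank name, one unparsable score, one duplicate.
    raw = pd.DataFrame([
        {"排名": "1", "学校名称": "University A", "总分": "100.0"},
        {"排名": "2", "学校名称": "", "总分": "95.5"},
        {"排名": "3", "学校名称": "University C", "总分": "N/A"},
        {"排名": "1", "学校名称": "University A", "总分": "100.0"},
    ])
    print(clean_rankings(raw))
```

Converting `总分` to a numeric type here also means later steps such as sorting or plotting do not need their own conversion.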

### 3.3 Saving to Excel

Write the data to an Excel file:

```python
df.to_excel('中国大学排名.xlsx', index=False, engine='openpyxl')
```

## 4. Complete code example

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_university_ranking():
    url = "https://www.shanghairanking.cn/rankings/bcur/2023"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    try:
        # Send the request; the timeout keeps a stalled connection from hanging forever
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx responses
        response.encoding = 'utf-8'

        # Parse the HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        rank_table = soup.find('table', {'class': 'rk-table'})
        if rank_table is None:
            print("Ranking table not found; the page structure may have changed")
            return

        # Extract the data
        universities = []
        for row in rank_table.find_all('tr')[1:]:  # skip the header row
            cols = row.find_all('td')
            if len(cols) >= 3:  # cols[2] is read below
                rank = cols[0].text.strip()
                name = cols[1].text.strip()
                score = cols[2].text.strip()  # assumes the third column is the total score
                universities.append({
                    '排名': rank,
                    '学校名称': name,
                    '总分': score
                })

        # Convert to a DataFrame
        df = pd.DataFrame(universities)
        df = df.dropna()
        df = df.reset_index(drop=True)

        # Save to Excel
        df.to_excel('中国大学排名.xlsx', index=False, engine='openpyxl')
        print("Data saved to '中国大学排名.xlsx'")

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    get_university_ranking()
```

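## 5. Optimization and caveats

### 5.1 Dealing with anti-scraping measures

- Set request headers: mimic a browser visit
- Use proxy IPs: reduce the risk of the IP being banned
- Add delays: avoid sending requests too frequently

```python
import time
import random

# Sleep for a random interval between requests
time.sleep(random.uniform(1, 3))
```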
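The delay is most useful when it is applied automatically between consecutive requests. As a rough sketch (the helper name `fetch_all` and its parameters are hypothetical, not part of any library), a loop can pause before every request except the first:

```python
import random
import time

def fetch_all(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval in between."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the first request
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

if __name__ == "__main__":
    # str.upper stands in for a real fetch function so that the
    # example runs without network access.
    pages = fetch_all(["page-1", "page-2", "page-3"], fetch=str.upper,
                      min_delay=0.1, max_delay=0.2)
    print(pages)
```

Passing `sleep` as a parameter is a design choice that keeps the helper testable: a test can substitute a function that records the waits instead of actually sleeping.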

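### 5.2 Exception handling

Catch request-level errors explicitly and set a timeout. The snippet is written as a function so that it can return `None` on failure:

```python
import requests

def fetch_page(url, headers):  # helper name chosen for illustration
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None
    return response.text
```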
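Transient failures such as a 503 response or a dropped connection can often be handled by retrying rather than giving up. One way to do that, sketched below, is a `requests.Session` with urllib3's `Retry` mounted through an `HTTPAdapter`. The helper name `build_session`, the retry count, and the list of status codes are illustrative assumptions to tune for the target site.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(user_agent, retries=3, backoff=1.0):
    """Create a Session that retries transient failures with backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,                      # wait grows between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
    )
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

if __name__ == "__main__":
    session = build_session("Mozilla/5.0 (example)")
    print(session.headers["User-Agent"])
    # Usage (a network call, so shown only as a comment):
    # response = session.get(url, timeout=10)
```

The timeout still has to be passed on each call, since the session does not set one by default.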

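### 5.3 Validating the data

Check that the data is complete and accurate:

```python
# Check whether any values are missing
if df.isnull().values.any():
    print("Warning: the data contains missing values")
```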
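A null check catches only one kind of problem. The sketch below collects a few more checks into a single function; the name `validate_rankings` and the expected 0 to 100 score range are assumptions, so adjust them to the ranking actually being scraped.

```python
import pandas as pd

def validate_rankings(df, min_score=0.0, max_score=100.0):
    """Return a list of human-readable problems found in the DataFrame."""
    if df.empty:
        return ["no rows were scraped"]
    problems = []
    if df.isnull().values.any():
        problems.append("missing values present")
    if df["学校名称"].duplicated().any():
        problems.append("duplicate school names")
    scores = pd.to_numeric(df["总分"], errors="coerce")
    if scores.isna().any():
        problems.append("non-numeric scores")
    elif not scores.between(min_score, max_score).all():
        problems.append("scores outside the expected range")
    return problems

if __name__ == "__main__":
    # Made-up rows with a duplicate name and an unparsable score.
    sample = pd.DataFrame({
        "排名": ["1", "2"],
        "学校名称": ["University A", "University A"],
        "总分": ["100.0", "abc"],
    })
    print(validate_rankings(sample))
```

An empty result list means none of these particular checks failed; it does not prove the data is correct.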

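## 6. Extensions

### 6.1 Scraping multiple pages

If the data spans several pages:

```python
# Assumes the site paginates through a ?page=N query parameter
base_url = "https://www.shanghairanking.cn/rankings/bcur/2023?page={}"
for page in range(1, 6):  # assumes the first 5 pages are wanted
    url = base_url.format(page)
    # Request and parsing code goes here...
```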
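When the number of pages is not known in advance, a common pattern is to keep requesting pages until one comes back empty. The sketch below separates that loop from the fetching itself; `crawl_pages` and the `fetch_rows` callback are hypothetical names, and the fake data source merely stands in for "request page N and parse its table".

```python
def crawl_pages(fetch_rows, max_pages=5):
    """Collect rows page by page, stopping early at the first empty page."""
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page)
        if not rows:  # an empty page suggests there is no more data
            break
        all_rows.extend(rows)
    return all_rows

if __name__ == "__main__":
    # Fake two-page data source so the example runs without network access.
    fake_site = {1: [{"排名": "1"}, {"排名": "2"}], 2: [{"排名": "3"}]}
    rows = crawl_pages(lambda page: fake_site.get(page, []))
    print(len(rows))
```

The `max_pages` cap guards against looping indefinitely if the site keeps returning data for every page number.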

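### 6.2 Scheduled runs

Use the third-party `schedule` library (`pip install schedule`) to run the scraper on a timer. The snippet reuses the `get_university_ranking` function from the complete example:

```python
import schedule
import time

def job():
    print("Starting the scraping job...")
    get_university_ranking()

# Run every day at 10:00
schedule.every().day.at("10:00").do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
```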

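### 6.3 Visualizing the data

Use matplotlib for a simple chart:

```python
import matplotlib.pyplot as plt

# School names are Chinese, so a CJK-capable font is needed. SimHei is an
# assumption; substitute any such font installed on your system.
plt.rcParams['font.sans-serif'] = ['SimHei']

# Show the scores of the top 20 universities
top20 = df.head(20)
plt.figure(figsize=(12, 8))
plt.barh(top20['学校名称'], top20['总分'].astype(float), color='skyblue')
plt.xlabel('Total score')
plt.title('Top 20 Chinese universities')
plt.gca().invert_yaxis()  # flip the y-axis so that rank 1 is at the top
plt.tight_layout()
plt.savefig('大学排名可视化.png')
plt.show()
```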

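## 7. Legal and ethical considerations

- Follow robots.txt: check the target site's crawler policy
- Limit the request rate: avoid putting excessive load on the server
- Use the data for learning only: do not use it commercially
- Respect copyright: credit the data source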
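The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below evaluates rules supplied as text so that it runs offline; the helper name `is_allowed`, the user agent string, and the sample rules are made up for illustration and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # Made-up rules for illustration. A real check would load the site's
    # own /robots.txt, for example with parser.set_url(...) and parser.read().
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "my-crawler", "https://example.com/rankings"))
    print(is_allowed(rules, "my-crawler", "https://example.com/private/data"))
```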

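## 8. FAQ

### Q1: What should I do about a 403 error while scraping?

A: The site has probably detected the scraper. Try:

- Changing the User-Agent
- Using a proxy IP
- Increasing the interval between requests

### Q2: Why does the data look wrong after saving to Excel?

A: Try:

- Specifying the Excel engine: `engine='openpyxl'`
- Adjusting column widths through pandas' `ExcelWriter`

```python
with pd.ExcelWriter('中国大学排名.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, index=False)
    worksheet = writer.sheets['Sheet1']
    worksheet.column_dimensions['A'].width = 15  # set the width of column A
```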
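Rather than hard-coding a width, it can be estimated from the cell contents. Chinese characters take up roughly twice the width of Latin ones, so the sketch below counts them as 2. The function names `display_width` and `column_widths`, the padding value, and the sample rows are assumptions for illustration, and the result is only an approximation of how Excel renders text.

```python
import unicodedata

def display_width(text):
    """Approximate on-screen width: wide (CJK) characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in str(text))

def column_widths(headers, rows, padding=2):
    """Return one width per column, based on the widest cell in it."""
    widths = [display_width(h) for h in headers]
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], display_width(cell))
    return [w + padding for w in widths]

if __name__ == "__main__":
    headers = ["排名", "学校名称", "总分"]
    rows = [["1", "清华大学", "100.0"], ["2", "University B", "95.5"]]
    print(column_widths(headers, rows))
```

Each resulting number can then be assigned to `worksheet.column_dimensions[...]` for the matching column letter (`'A'`, `'B'`, `'C'`), in the same way the snippet above sets column A.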

### Q3: How do I scrape data that is loaded dynamically?

A: If the data is loaded through JavaScript, consider:

- Using Selenium
- Analyzing the site's API endpoints
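When the page fills its table from a background request, the response is often JSON, which is simpler to handle than HTML. The sketch below converts such a payload into the same record format used earlier. Everything about the payload is hypothetical: the field names `items`, `rank`, `name`, and `score` are placeholders, and the real ones have to be read from the response shown in the browser's Network tab.

```python
import json

# Entirely made-up payload; real field names will differ.
SAMPLE_PAYLOAD = """
{"items": [
  {"rank": 1, "name": "University A", "score": 100.0},
  {"rank": 2, "name": "University B", "score": 95.5}
]}
"""

def records_from_json(text):
    """Map a JSON payload onto the 排名 / 学校名称 / 总分 record format."""
    payload = json.loads(text)
    return [
        {
            "排名": item.get("rank"),
            "学校名称": item.get("name"),
            "总分": item.get("score"),
        }
        for item in payload.get("items", [])  # tolerate a missing list
    ]

if __name__ == "__main__":
    for record in records_from_json(SAMPLE_PAYLOAD):
        print(record)
```

The resulting list can be passed to `pd.DataFrame` just like the records extracted from HTML.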

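## 9. Summary

This article covered the full workflow for scraping Chinese university rankings with Python and saving them to Excel:

- Sending an HTTP request to fetch the page
- Parsing the HTML with BeautifulSoup
- Locating and extracting the required data
- Processing and saving the data with pandas
- Optimizing the scraper and handling anti-scraping measures
- Legal and ethical considerations

Working through this example covers the basics of web scraping as well as how to process and store the collected data. The same skills apply to many other data collection tasks in study and research.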

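Hopefully this article helps you scrape Chinese university ranking data successfully. If you have any questions, feel free to discuss them in the comments.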