How to use requests + re in Python

Published: 2021-10-11 18:42:40  Author: 柒染
Source: 亿速云

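# How to scrape web data in Python with requests + re

## I. Introduction

Python is a popular choice for web scraping thanks to its rich library ecosystem. The `requests` library handles HTTP requests, and the standard-library `re` module provides regular expressions; combined, they give a quick workflow for fetching a page and extracting key information from it. This article walks through how to use the two together for typical scraping tasks.

## II. Environment setup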

### 1. Install the requests library
```bash
pip install requests
```

### 2. Import the required modules

```python
import requests
import re
from pprint import pprint  # pretty-prints nested results
```

## III. Basic request operations

### 1. Send a GET request

```python
url = "https://example.com"
response = requests.get(url)
print(response.status_code)  # 200 means success
```

### 2. Work with the response body

```python
html = response.text  # body decoded to str
binary = response.content  # raw body as bytes
```
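
`response.text` is `response.content` decoded with the encoding requests takes from the HTTP headers (or guesses when the headers give none), and that choice can be wrong for pages that declare their charset only in a `<meta>` tag. The following is a minimal sketch, assuming a hypothetical helper `decode_body` and a made-up sample page, of using `re` on the raw bytes to find the declared charset before decoding:

```python
import re

# Matches charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
CHARSET_PATTERN = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_body(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw response bytes using the charset declared in the page, if any."""
    match = CHARSET_PATTERN.search(raw[:2048])  # declarations sit near the top
    encoding = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back and replace undecodable bytes.
        return raw.decode(fallback, errors="replace")


# Made-up sample standing in for response.content.
sample = '<html><head><meta charset="gbk"><title>示例</title></head></html>'.encode("gbk")
text = decode_body(sample)
print(text)
```

With a real response, `decode_body(response.content)` would stand in for `response.text`. Setting `response.encoding` before reading `response.text` is the built-in alternative.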

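## IV. Regular expression basics

### 1. Common metacharacters

- `.` matches any character except a newline
- `\d` matches a digit
- `\w` matches a letter, digit, or underscore
- `*` repeats the preceding item zero or more times
- `+` repeats the preceding item one or more times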
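
To see these metacharacters in action, here is a small sketch that runs each one against a made-up string with `re.findall`. The sample text and variable names are illustrative only:

```python
import re

sample = "tel: 555-0147, id: user_42, code: aab ab b"  # made-up text

digits = re.findall(r"\d+", sample)   # one or more digits
words = re.findall(r"\w+", sample)    # runs of letters, digits, underscores
star = re.findall(r"a*b", sample)     # zero or more "a", then "b"
plus = re.findall(r"a+b", sample)     # one or more "a", then "b"
dot = re.findall(r"u.e", sample)      # "u", any one character, "e"

print(digits)  # ['555', '0147', '42']
print(star)    # ['aab', 'ab', 'b']
print(plus)    # ['aab', 'ab']
print(dot)     # ['use']
```

The difference between `*` and `+` shows in the last token: a bare `b` satisfies `a*b` but not `a+b`.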

### 2. Capturing groups

```python
pattern = r'<h1>(.*?)</h1>'  # (.*?) captures the heading text non-greedily
```
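
The non-greedy `?` matters as soon as a page contains more than one matching tag. As a rough illustration on a made-up snippet, the sketch below compares the greedy and non-greedy forms and also shows named groups, which are an optional extra not used elsewhere in this article:

```python
import re

html = "<h1>First</h1><p>text</p><h1>Second</h1>"  # made-up snippet

greedy = re.findall(r"<h1>(.*)</h1>", html)   # .* runs to the LAST </h1>
lazy = re.findall(r"<h1>(.*?)</h1>", html)    # .*? stops at the FIRST </h1>

print(greedy)  # ['First</h1><p>text</p><h1>Second']
print(lazy)    # ['First', 'Second']

# Named groups make multi-field patterns easier to read.
link = re.search(r'<a href="(?P<url>.*?)">(?P<label>.*?)</a>',
                 '<a href="/docs">Docs</a>')
print(link.group("url"), link.group("label"))  # /docs Docs
```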

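## V. Worked example: extracting a page title

### 1. Fetch the page

```python
resp = requests.get("https://www.python.org")
html = resp.text
```

### 2. Write the regular expression

```python
title_pattern = re.compile(r'<title>(.*?)</title>', re.IGNORECASE)
```

### 3. Extract the data

```python
match = title_pattern.search(html)
if match:
    print(f"Page title: {match.group(1)}")
```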
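
Because `.` does not match a newline by default, a `<title>(.*?)</title>` pattern misses titles that span several lines, and it returns HTML entities such as `&amp;` undecoded. One possible refinement, sketched here with a hypothetical `extract_title` helper and a made-up page, adds `re.DOTALL`, tolerates attributes on the tag, and cleans up the result:

```python
import html
import re

TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page: str):
    """Return the page title with entities decoded and whitespace collapsed."""
    match = TITLE_PATTERN.search(page)
    if match is None:
        return None
    raw = html.unescape(match.group(1))  # "&amp;" -> "&"
    return " ".join(raw.split())         # collapse newlines and repeated spaces


sample_page = """<html><head>
<TITLE>
  Welcome to Python.org &amp; friends
</TITLE></head></html>"""
print(extract_title(sample_page))  # Welcome to Python.org & friends
```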

## VI. Advanced techniques

### 1. Pass dynamic query parameters

```python
params = {'q': 'python', 'page': 1}
response = requests.get("https://search.example.com", params=params)
```
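
requests URL-encodes the `params` dict and appends it to the URL as a query string. To inspect the result without sending anything, a request can be built and prepared by hand. This is a small sketch using the same placeholder host as above:

```python
import requests

params = {"q": "python", "page": 1}

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request(
    "GET", "https://search.example.com", params=params
).prepare()
print(prepared.url)
```

After a real call, the same value is available as `response.url`, which is handy when checking that a parameter was encoded the way the site expects.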

### 2. Set request headers

```python
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept-Language': 'zh-CN'
}
response = requests.get(url, headers=headers)  # keep the response for later use
```

### 3. Finding all matches

```python
# Extract every absolute http(s) link
link_pattern = re.compile(r'href="(https?://.*?)"')
links = link_pattern.findall(html)
pprint(links[:5])  # print the first 5 links
```
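
A pattern that only accepts `href="http...` skips relative links and single-quoted attributes, both common in real pages. A possible extension, sketched with a hypothetical `extract_links` helper and a made-up page, resolves each link against the page URL with `urllib.parse.urljoin` and removes duplicates:

```python
import re
from urllib.parse import urljoin

# Accept single or double quotes around the attribute value.
HREF_PATTERN = re.compile(r"""href=["']([^"']+)["']""", re.IGNORECASE)


def extract_links(page: str, base_url: str) -> list:
    """Return absolute, de-duplicated http(s) links in document order."""
    seen = {}
    for href in HREF_PATTERN.findall(page):
        absolute = urljoin(base_url, href)
        if absolute.startswith(("http://", "https://")):
            seen.setdefault(absolute, None)  # dict keeps insertion order
    return list(seen)


sample_page = """
<a href="/about">About</a>
<a href='https://example.com/docs'>Docs</a>
<a href="/about">About again</a>
<a href="mailto:team@example.com">Mail</a>
"""
print(extract_links(sample_page, "https://example.com/index.html"))
# ['https://example.com/about', 'https://example.com/docs']
```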

## VII. Error handling

### 1. Handle network request errors

```python
try:
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
```
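
Transient failures such as timeouts often succeed on a second attempt, so one common extension of the `try`/`except` above is a small retry loop. The sketch below is only one way to do it; `fetch_with_retry`, its parameters, and the linear backoff are illustrative choices, not part of requests:

```python
import time

import requests


def fetch_with_retry(url, retries=3, backoff=1.0, get=requests.get):
    """Return the response, retrying failed requests with a growing delay.

    `get` is injectable so the function can be exercised without a network.
    """
    for attempt in range(1, retries + 1):
        try:
            response = get(url, timeout=5)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as exc:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            print(f"Attempt {attempt} failed ({exc}); retrying")
            time.sleep(backoff * attempt)  # wait longer after each failure
```

A call such as `fetch_with_retry("https://example.com")` then either returns a successful response or raises the last `RequestException` once the attempts are used up.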

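### 2. Guard against failed matches

```python
def safe_extract(pattern, text):
    """Return the first capture group, or None when the pattern does not match."""
    match = re.search(pattern, text)  # re.search returns None on no match
    return match.group(1) if match else None
```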
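
A variation on the same idea lets the caller choose the group and a default value, and also covers a pattern that has no such group. The helper name `extract_or_default` and the sample snippet are made up for illustration:

```python
import re


def extract_or_default(pattern, text, group=1, default=None):
    """Return one capture group, or `default` if there is no match or no such group."""
    match = re.search(pattern, text)
    if match is None:
        return default
    try:
        return match.group(group)
    except IndexError:  # the pattern has no group with that number or name
        return default


page = "<span class='price'>¥128</span>"  # made-up snippet
print(extract_or_default(r"¥(\d+)", page))                  # 128
print(extract_or_default(r"\$(\d+)", page, default="n/a"))  # n/a
print(extract_or_default(r"¥\d+", page, default="n/a"))     # n/a (no group 1)
```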

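## VIII. Performance tips

- Precompile regular expressions: build patterns once with `re.compile()` and reuse them
- Keep a session: use `requests.Session()` to reuse connections
- Narrow the search area: locate the relevant region of the page first, then apply the regex to it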
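
The first and third tips can be combined in a few lines. In this sketch the `id="news"` list, the pattern names, and `extract_items` are all assumptions made up for the example. The patterns are compiled once at module level, and the item pattern only ever runs on the region located first:

```python
import re

# Compiled once at import time and reused on every call.
LIST_REGION = re.compile(r'<ul id="news">(.*?)</ul>', re.DOTALL)
ITEM = re.compile(r"<li>(.*?)</li>", re.DOTALL)


def extract_items(page: str) -> list:
    """Locate the list region first, then match items only inside it."""
    region = LIST_REGION.search(page)
    if region is None:
        return []
    return [item.strip() for item in ITEM.findall(region.group(1))]


sample_page = """
<ul id="nav"><li>Home</li><li>About</li></ul>
<ul id="news">
  <li>Release 1.0</li>
  <li>Conference dates</li>
</ul>
"""
print(extract_items(sample_page))  # ['Release 1.0', 'Conference dates']
```

Besides doing less work, narrowing the region keeps the navigation items out of the result, which a page-wide `<li>` pattern would have picked up.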

## IX. Complete example: scraping news headlines

```python
import requests
import re

def scrape_news():
    url = "https://news.example.com"
    # A session reuses the underlying connection across requests
    with requests.Session() as session:
        response = session.get(url, timeout=5)
        response.raise_for_status()  # stop early on 4xx/5xx responses

    # Matches each news block
    news_pattern = re.compile(r'<div class="news-item">(.*?)</div>', re.DOTALL)
    # Captures the link (group 1) and the title (group 2)
    item_pattern = re.compile(
        r'<h2><a href="(.*?)">(.*?)</a></h2>'
    )

    for news_block in news_pattern.finditer(response.text):
        match = item_pattern.search(news_block.group(1))
        if match:
            print(f"Title: {match.group(2)}\nLink: {match.group(1)}\n")

if __name__ == '__main__':
    scrape_news()
```
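
Since `news.example.com` is a placeholder, the parsing step is easier to try out when it is separated from the network call. The sketch below applies the same two patterns to a made-up page. `parse_news` and the sample markup are assumptions for illustration:

```python
import re

NEWS_BLOCK = re.compile(r'<div class="news-item">(.*?)</div>', re.DOTALL)
NEWS_ITEM = re.compile(r'<h2><a href="(.*?)">(.*?)</a></h2>')


def parse_news(page: str) -> list:
    """Return (title, link) pairs for every news block found in the page."""
    results = []
    for block in NEWS_BLOCK.finditer(page):
        item = NEWS_ITEM.search(block.group(1))
        if item:  # blocks without a heading link are skipped
            results.append((item.group(2), item.group(1)))
    return results


sample_page = """
<div class="news-item">
  <h2><a href="/news/1">First headline</a></h2>
</div>
<div class="news-item">
  <p>No heading in this block</p>
</div>
<div class="news-item">
  <h2><a href="/news/3">Third headline</a></h2>
</div>
"""
for title, link in parse_news(sample_page):
    print(f"Title: {title}\nLink: {link}\n")
```

Keeping the parser a pure function of the HTML string also makes it straightforward to test against saved pages. Note that the non-greedy block pattern stops at the first `</div>`, so a news block containing nested `<div>` elements would be cut short.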

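## X. Notes

- Respect the site's robots.txt
- Leave a reasonable interval between requests (at least 2 seconds is suggested)
- Check whether the site has anti-scraping measures
- For important data, consider backing the regex up with an HTML parser such as BeautifulSoup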
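
The first two points can be automated with the standard library. This is a minimal sketch, assuming made-up robots.txt rules, a hypothetical `demo-bot` user agent, and a hypothetical `polite_urls` helper. It filters URLs with `urllib.robotparser` and pauses between the ones that remain:

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would call
# parser.set_url(".../robots.txt") and parser.read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_urls(urls, user_agent="demo-bot", delay=2.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt
        if not first:
            time.sleep(delay)  # spacing between consecutive requests
        first = False
        yield url


candidates = [
    "https://news.example.com/articles/1",
    "https://news.example.com/private/drafts",
]
print(list(polite_urls(candidates, delay=0)))
# ['https://news.example.com/articles/1']
```

The demo passes `delay=0` so it finishes instantly. In a real crawl the default of 2 seconds matches the interval suggested above.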

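## Conclusion

The requests + re combination is a lightweight approach to data scraping, well suited to quick prototypes and small-scale collection. For more complex pages, consider more specialized parsing tools such as XPath or CSS selectors. With these basics in place, you can go on to explore advanced topics such as asynchronous requests and proxy configuration.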
