When scraping web pages with Python, you can run into a variety of errors. Below are some of the most common ones and how to handle them:
Request timeout:
requests.exceptions.Timeout

import requests

try:
    response = requests.get('http://example.com', timeout=10)
except requests.exceptions.Timeout:
    print("Request timed out")
Connection error:
requests.exceptions.ConnectionError

import requests

try:
    response = requests.get('http://example.com')
except requests.exceptions.ConnectionError:
    print("Connection error")
HTTP error:
requests.exceptions.HTTPError

import requests

try:
    response = requests.get('http://example.com')
    # A bad status code alone does not raise; call raise_for_status()
    # to turn 4xx/5xx responses into requests.exceptions.HTTPError.
    response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(f"HTTP error, status code: {e.response.status_code}")
Parse errors:
Parsing errors related to BeautifulSoup

from bs4 import BeautifulSoup

try:
    soup = BeautifulSoup(response.text, 'html.parser')
except Exception as e:
    print(f"Parse error: {e}")
Anti-scraping mechanisms:
requests.exceptions.RequestException or urllib.error.URLError

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
try:
    response = requests.get('http://example.com', headers=headers)
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
Encoding issues:
UnicodeDecodeError or UnicodeEncodeError

import requests

try:
    response = requests.get('http://example.com')
    # requests guesses the encoding from the HTTP headers; when that guess
    # is wrong, override it with the encoding detected from the body itself.
    response.encoding = response.apparent_encoding
    text = response.text
except UnicodeDecodeError as e:
    print(f"Encoding error: {e}")
Resource limits:
MemoryError or RecursionError

import requests
from bs4 import BeautifulSoup

# Avoid excessive recursion depth when following links page by page
def process_page(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Processing logic...
    except Exception as e:
        print(f"Processing error: {e}")
Third-party dependency problems:
ModuleNotFoundError or ImportError

pip install requests beautifulsoup4
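A script can also fail fast with a readable hint instead of a raw traceback when a dependency is missing. A small sketch:

import sys

try:
    import requests
    from bs4 import BeautifulSoup
except ModuleNotFoundError as e:
    # Point the user at the fix instead of dumping a traceback.
    sys.exit(f"Missing dependency: {e.name}. Run: pip install requests beautifulsoup4")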
Understanding and handling these common errors will make a Python scraper significantly more stable and reliable.