When writing XPath-based crawlers in Python, you can use the following strategies to reduce the risk of getting your IP banned:
1. Use proxy IPs. Routing requests through a proxy means the target site sees the proxy's address instead of your own:

import requests
from lxml import etree

proxies = {
    'http': 'http://proxy-ip:port',    # replace with a real proxy address
    'https': 'https://proxy-ip:port',
}
url = 'https://example.com'  # replace with the target URL
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, proxies=proxies)
html = response.text
tree = etree.HTML(html)
# Extract data with XPath, e.g. tree.xpath('//title/text()')
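If a single proxy gets blocked, rotating through a pool spreads requests across several addresses. Below is a minimal sketch assuming you already have a list of proxy addresses from your own provider; the addresses shown are placeholders, not working proxies:

import random
import requests

# Hypothetical proxy pool; replace with addresses from your own provider
PROXY_POOL = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:3128',
]

def fetch_with_random_proxy(url):
    """Pick a proxy at random for each request so traffic is spread across the pool."""
    proxy = random.choice(PROXY_POOL)
    proxies = {'http': proxy, 'https': proxy}
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0'},
                        proxies=proxies, timeout=10)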
2. Set a realistic User-Agent. Sending a browser-like User-Agent header makes requests look less like they come from a script:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}
response = requests.get(url, headers=headers)
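Reusing one User-Agent for every request is still a recognizable pattern, so rotating among a few common browser strings can help. A minimal sketch, assuming a small hand-picked list (the strings below are examples, not an exhaustive or authoritative set):

import random
import requests

# A few example desktop browser User-Agent strings (assumed, not exhaustive)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

def fetch_with_random_ua(url):
    """Send each request with a randomly chosen User-Agent string."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)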
3. Add delays between requests with the time.sleep() function, so you do not hammer the server:

import time
time.sleep(5)  # wait 5 seconds before the next request
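Fixed intervals are easy for rate-limiting systems to spot, so a randomized delay is often preferable. A small sketch, assuming a 2-5 second window is acceptable for the target site (the URLs are placeholders):

import random
import time

def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random duration so request timing looks less mechanical."""
    time.sleep(random.uniform(min_s, max_s))

for url in ['https://example.com/page1', 'https://example.com/page2']:  # placeholder URLs
    # ... fetch and parse the page here ...
    polite_pause()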
4. Send cookies. Some sites expect the cookies a real browser would carry, such as a session identifier obtained after logging in:

cookies = {
    'cookie_name': 'cookie_value',
    'another_cookie_name': 'another_cookie_value',
}
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, cookies=cookies)
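Rather than passing a cookie dict by hand, a requests.Session object keeps cookies and default headers across requests automatically, and also reuses the underlying connection. A minimal sketch; the login URL and form field names below are placeholders and depend on the target site:

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Hypothetical login step; the URL and field names depend on the target site
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# Subsequent requests automatically carry any cookies set by the server
response = session.get('https://example.com/protected-page')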
Note that crawling should comply with the site's robots.txt rules and with applicable laws and regulations. Make sure your scraping stays legal and compliant.
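As a quick programmatic check, Python's standard urllib.robotparser module can tell you whether a given path is allowed for your user agent; a sketch with a placeholder site and user agent string:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder site
rp.read()

if rp.can_fetch('Mozilla/5.0', 'https://example.com/some/page'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt, skip this URL')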