To build a proxy IP crawler in Python, follow these steps:
1. Make sure the `requests` and `fake_useragent` libraries are installed. If not, install them with:

```bash
pip install requests
pip install fake_useragent
```
2. Import the required libraries (`random` is also needed, since the fetch function below picks proxies with `random.choice`):

```python
import random

import requests
from fake_useragent import UserAgent
```
3. Prepare a pool of proxy IPs (replace the placeholders with real proxy addresses and ports):

```python
proxies = [
    {'http': 'http://proxy1:port'},
    {'http': 'http://proxy2:port'},
    {'http': 'http://proxy3:port'},
    # more proxy IPs...
]
```
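Since stale proxies will make requests fail, you may want to filter the pool before crawling. Here is a minimal sketch, where the `check_proxy` helper and the `https://httpbin.org/ip` probe URL are illustrative assumptions, not part of the original steps:

```python
def check_proxy(proxy, test_url='https://httpbin.org/ip'):
    # Keep a proxy only if it answers the probe within the timeout.
    try:
        requests.get(test_url, proxies=proxy, timeout=5).raise_for_status()
        return True
    except requests.exceptions.RequestException:
        return False

# Drop dead proxies before crawling (rerun periodically to refresh the pool).
proxies = [p for p in proxies if check_proxy(p)]
```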
4. Use the `fake_useragent` library to generate random User-Agent headers, which helps avoid being blocked by the target site:

```python
ua = UserAgent()
```
5. Define a `fetch` function that sends each request through a randomly chosen proxy with a random User-Agent:

```python
def fetch(url):
    proxy = random.choice(proxies)
    headers = {'User-Agent': ua.random}
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
```
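Because `fetch` returns `None` on any failure, one optional extension (an illustrative addition, not part of the original steps) is to retry a failed URL, which draws a fresh random proxy on each attempt:

```python
def fetch_with_retry(url, attempts=3):
    # Illustrative helper: each call to fetch() picks a new random proxy,
    # so retrying naturally routes around a single dead proxy.
    for _ in range(attempts):
        content = fetch(url)
        if content is not None:
            return content
    return None
```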
6. Finally, loop over your URLs and use the `fetch` function to retrieve each page:

```python
url_list = [
    'https://example.com/page1',
    'https://example.com/page2',
    # more URLs...
]

for url in url_list:
    content = fetch(url)
    if content:
        # Process the page content, e.g. save it to a file or parse the HTML.
        # URLs contain characters like '/' and ':' that are invalid in
        # filenames, so derive a safe name first.
        filename = url.replace('://', '_').replace('/', '_') + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(content)
```
With that, your proxy IP crawler is ready to run. Note that, depending on the target site's restrictions, you may need to refresh the proxy IP pool and the User-Agent strings periodically. Also make sure you respect the target site's `robots.txt` rules and comply with applicable laws and regulations.
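To check `robots.txt` programmatically, the standard library's `urllib.robotparser` can be used. A minimal sketch (note it fetches `robots.txt` directly rather than through the proxy pool):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # downloads and parses robots.txt

# Only crawl URLs the site allows for your user agent ('*' = any agent).
if rp.can_fetch('*', 'https://example.com/page1'):
    content = fetch('https://example.com/page1')
```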