How to write multithreaded crawler code in Python - Q&A

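In Python, the threading library can be used to build a multithreaded crawler. Below is a simple example that uses the requests and BeautifulSoup libraries to fetch and parse web pages.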

First, make sure the required libraries are installed:

pip install requests
pip install beautifulsoup4

Then write the multithreaded crawler code:

import requests
from bs4 import BeautifulSoup
import threading

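# Fetch the target URL
def fetch(url):
    try:
        # A timeout keeps a stalled server from blocking the thread forever
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
        return None
    if response.status_code == 200:
        return response.text
    print(f"Error fetching {url}: Status code {response.status_code}")
    return None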

# Parse the page content and extract data
def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    # Extract the data you need here, for example:
    titles = soup.find_all("h2", class_="title")
    for title in titles:
        print(title.get_text())

# Crawler thread function
def crawl_thread(url):
    html = fetch(url)
    if html:
        parse(html)

# Main program
if __name__ == "__main__":
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        # Add more target URLs
    ]

    threads = []
    for url in urls:
        thread = threading.Thread(target=crawl_thread, args=(url,))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

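This example defines three functions: fetch sends the HTTP request and returns the page content, parse parses the HTML and extracts data, and crawl_thread is the thread function that runs fetch and then parse.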

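In the main program, we build a list of URLs and create one thread per URL. Each thread is started as soon as it is created, and the second loop calls join on every thread to wait for all of them to finish.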
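
One thread per URL works for a handful of pages, but the thread count grows with the URL list. A minimal sketch of a bounded alternative, using the standard-library concurrent.futures module. The names crawl_all and worker are hypothetical and not part of the answer above; worker stands for any function that takes a URL, such as crawl_thread.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, worker, max_workers=4):
    """Run worker(url) for every URL using at most max_workers threads.

    crawl_all and worker are illustrative names (an assumption, not from
    the original example). Returns a dict mapping each URL to its result,
    or to None if the worker raised an exception.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it belongs to.
        future_to_url = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # A failure in one worker should not abort the whole crawl.
                print(f"Error crawling {url}: {exc}")
                results[url] = None
    return results
```

With the functions from the example, this could be called as crawl_all(urls, crawl_thread, max_workers=4). The pool reuses a fixed set of threads, so max_workers caps how many requests are in flight at once.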

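Note that this example is for demonstration only. In practice, you may need to adapt it to the target site's structure and anti-crawling measures. Starting one thread per URL does nothing to limit load, so to avoid putting too much pressure on the target site, cap the number of concurrent requests and add a delay between them. For larger crawls, multiprocessing (the multiprocessing library) or asynchronous programming (the asyncio library) are alternatives worth considering, though neither reduces the load on the server by itself.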
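
As a rough illustration of the asyncio route, the sketch below caps concurrency with a semaphore and optionally pauses after each request. The names crawl_async, fetch_async, max_concurrency and delay are hypothetical; fetch_async is assumed to be a coroutine function you supply that downloads one URL (for example one built on an async HTTP client), since requests itself is blocking.

```python
import asyncio


async def crawl_async(urls, fetch_async, max_concurrency=3, delay=0.0):
    """Await fetch_async(url) for every URL, at most max_concurrency at a time.

    All names here are illustrative assumptions. fetch_async must be a
    coroutine function; results are returned in the same order as urls.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # Only max_concurrency coroutines can be inside this block at once.
        async with semaphore:
            result = await fetch_async(url)
            if delay:
                # Hold the slot a little longer to pace requests.
                await asyncio.sleep(delay)
            return result

    return await asyncio.gather(*(bounded_fetch(url) for url in urls))
```

It would be run with asyncio.run(crawl_async(urls, fetch_async)). Because the semaphore is the only thing limiting concurrency, lowering max_concurrency or raising delay is what reduces pressure on the target site.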
