How to set the number of threads in a multithreaded Python crawler - Q&A

In Python, you can use the threading module to control how many threads a multithreaded crawler runs. Here is a simple example:

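import queue
import threading

import requests
from bs4 import BeautifulSoup

# Crawler function: fetch one page and parse it
def crawl(url):
    # A timeout keeps a stalled server from hanging the thread forever
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Parse the page content here and extract the data you need
    print(f"Visited: {url}")

# Worker loop: each thread keeps taking URLs until the queue is empty
def worker(url_queue):
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        try:
            crawl(url)
        except requests.RequestException as exc:
            print(f"Failed: {url} ({exc})")

# Start num_threads worker threads that share one URL queue
def start_threads(num_threads, urls):
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)

    threads = []
    for _ in range(num_threads):
        thread = threading.Thread(target=worker, args=(url_queue,))
        threads.append(thread)
        thread.start()

    # Wait for every worker to finish
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    urls = [
        "https://www.example.com/page1",
        "https://www.example.com/page2",
        "https://www.example.com/page3",
        # More URLs...
    ]

    num_threads = 5  # Set the number of threads
    start_threads(num_threads, urls)

In this example, crawl takes a URL as its argument and fetches the page with the requests library, using a 10-second timeout so that a stalled server cannot block a thread indefinitely. It then parses the page with BeautifulSoup; the data extraction goes where the comment indicates.

Next, start_threads takes the thread count and the URL list as arguments. It puts every URL on a shared queue.Queue, then creates num_threads threads with worker as the target function and starts each one with start(). Each worker keeps pulling URLs from the queue and passing them to crawl until the queue is empty, so every URL is crawled exactly once whether there are more URLs than threads or fewer. A failed request is reported and the worker moves on to the next URL.

Finally, start_threads loops over the thread list and calls join() to wait for all threads to finish.

In the __main__ section, we define the URL list, set num_threads, and call start_threads to launch the threads and run the crawl.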
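
As an alternative, the thread count can also be set through the standard library's concurrent.futures.ThreadPoolExecutor, whose max_workers argument caps the number of threads. The following is only a minimal sketch: crawl_all and its fetch parameter are hypothetical names introduced for illustration, and fetch stands for any function that takes a URL, such as the crawl function in the answer.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, fetch, num_threads=5):
    """Call fetch(url) once per URL, using at most num_threads threads.

    crawl_all and fetch are hypothetical names for this sketch.
    Returns a dict mapping each URL to fetch's return value, or to the
    exception it raised.
    """
    results = {}
    # max_workers is where the thread count is set
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Submit one task per URL; the pool hands them to idle threads
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure instead of stopping the whole crawl
                results[url] = exc
    return results
```

With the crawl function above, a call would look like crawl_all(urls, crawl, num_threads=5). The pool takes care of handing URLs to threads and waiting for them on exit, so no manual start() or join() calls are needed.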
