How to handle exceptions in a multithreaded Python crawler - Q&A

In Python, you can handle exceptions in a multithreaded crawler by combining concurrent.futures.ThreadPoolExecutor with try-except statements. Here is a simple example:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    try:
        # Set a timeout so a stalled server cannot block the worker thread forever
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, 'html.parser')
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

def parse(url):
    soup = fetch(url)
    if soup:
        # Parse the page content here, e.g. extract data
        pass

def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        # more URLs...
    ]

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(parse, url): url for url in urls}

        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
            except Exception as e:
                print(f"Error processing {url}: {e}")

if __name__ == "__main__":
    main()

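In this example, we first define a fetch function that downloads a page. Its try-except block catches requests.exceptions.RequestException, which covers network errors as well as the HTTP errors raised by raise_for_status(). On failure it prints the error and returns None.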
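A common extension of the try-except in fetch is to retry transient failures before giving up. The following is only a minimal sketch: with_retries is a hypothetical helper (not part of requests or the original answer), and the retry count and linear backoff delay are illustrative assumptions.

```python
import time

def with_retries(func, *args, retries=3, delay=0.0, exceptions=(Exception,)):
    """Hypothetical helper: call func(*args), retrying on the given exceptions."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except exceptions as e:
            last_error = e
            if attempt < retries:
                # Linear backoff: wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    # All attempts failed: re-raise the last error so the caller can handle it
    raise last_error
```

Under these assumptions, a call such as with_retries(requests.get, url, retries=3, delay=1.0, exceptions=(requests.exceptions.RequestException,)) would retry only request-related errors and let anything else propagate immediately.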

Next, we define a parse function that parses the page. It calls fetch and skips parsing when fetch returns None, that is, when the request failed.

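In main, we create a thread pool with ThreadPoolExecutor and submit parse once per URL. as_completed yields each future as it finishes, and calling future.result() re-raises any exception that escaped the worker function, so the surrounding try-except can catch it.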
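The as_completed pattern can also collect results and errors separately instead of only printing them. The sketch below uses no network access so it can be run as-is; run_all and fake_parse are hypothetical names introduced for illustration, with fake_parse standing in for a real parse function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(task, items, max_workers=5):
    """Run task(item) in a thread pool; return (results, errors) keyed by item."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(task, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                # result() re-raises any exception raised inside the worker
                results[item] = future.result()
            except Exception as e:
                errors[item] = e
    return results, errors

def fake_parse(url):
    """Stand-in for a real parser: fails for URLs ending in 'bad'."""
    if url.endswith("bad"):
        raise ValueError(f"cannot parse {url}")
    return len(url)
```

Calling run_all(fake_parse, urls) returns two dictionaries, so the caller can decide afterwards whether to retry, log, or skip the failed URLs.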

This way, a failure on one URL is caught and reported without stopping the other threads.
