In Python, you can control how many crawler requests run at once with the following approaches:

**1. The `threading` module.** Python's `threading` module provides basic thread support: create one thread per crawl task and use a `threading.Semaphore` to cap how many tasks run at the same time. Example code:
```python
import threading

import requests
from bs4 import BeautifulSoup


class Crawler(threading.Thread):
    def __init__(self, url, semaphore):
        super().__init__()
        self.url = url
        self.semaphore = semaphore

    def run(self):
        # The semaphore caps how many threads fetch at the same time.
        with self.semaphore:
            self.fetch_url(self.url)

    def fetch_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Process the scraped data here
        print(f"Visited {url}")


def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]  # ... more URLs
    concurrency_limit = 5
    semaphore = threading.Semaphore(concurrency_limit)

    threads = []
    for url in urls:
        crawler = Crawler(url, semaphore)
        crawler.start()
        threads.append(crawler)

    # Wait for every crawler thread to finish
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    main()
```
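If you would rather not manage threads and a semaphore by hand, the standard library's `concurrent.futures.ThreadPoolExecutor` gives the same bounded concurrency: the pool size plays the role of the semaphore. A minimal sketch under the same assumptions (the `fetch_url` helper and the URLs are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_url(url):
    # One crawl task; the pool size limits how many run at once.
    response = requests.get(url, timeout=10)
    print(f"Visited {url} ({response.status_code})")
    return url


urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs

# max_workers plays the same role as concurrency_limit in the semaphore version.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_url, url) for url in urls]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from the worker thread
```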
**2. The `asyncio` module.** Python's `asyncio` module provides asynchronous programming support and can handle concurrent tasks more efficiently. Use an `asyncio.Semaphore` to limit the number of requests in flight. Example code:
```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup


class Crawler:
    def __init__(self, url, semaphore):
        self.url = url
        self.semaphore = semaphore

    async def fetch_url(self, session, url):
        # The semaphore caps how many requests are in flight at once.
        async with self.semaphore:
            await self.fetch(session, url)

    async def fetch(self, session, url):
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            # Process the scraped data here
            print(f"Visited {url}")


async def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]  # ... more URLs
    concurrency_limit = 5
    semaphore = asyncio.Semaphore(concurrency_limit)

    async with aiohttp.ClientSession() as session:
        tasks = [Crawler(url, semaphore).fetch_url(session, url) for url in urls]
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```
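In practice a crawler also needs timeouts and per-URL error handling so one slow or failing page doesn't take down the whole batch. Here is a hedged sketch of the same semaphore pattern without the class, adding a request timeout via `aiohttp.ClientTimeout` and `return_exceptions=True` in `asyncio.gather` (the URL list is again a placeholder):

```python
import asyncio

import aiohttp


async def fetch(session, semaphore, url):
    async with semaphore:                      # limit concurrent requests
        async with session.get(url) as resp:
            resp.raise_for_status()            # surface HTTP errors
            return url, len(await resp.text())


async def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholders
    semaphore = asyncio.Semaphore(5)
    timeout = aiohttp.ClientTimeout(total=10)  # per-request time budget

    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(
            *(fetch(session, semaphore, url) for url in urls),
            return_exceptions=True,            # failures don't cancel the other tasks
        )
        for item in results:
            print(item)  # either (url, length) or the exception that was raised


if __name__ == "__main__":
    asyncio.run(main())
```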
Both approaches let you limit a crawler's concurrency. The `threading` module is a good fit for I/O-bound tasks (the GIL is released while threads wait on the network), while `asyncio` is better suited to high-concurrency scenarios, especially I/O-bound ones, because coroutines avoid the per-thread overhead.