
How to control concurrency in a multi-threaded Python crawler

小樊
2024-12-07 16:04:47

In Python, when crawling with multiple threads, you can control concurrency in the following ways:

  1. Use the threading module: Python's threading module provides basic thread support. You can create multiple threads, each running one crawl task. To cap how many requests run at once, use a threading.Semaphore.

Example code:

import threading
import requests
from bs4 import BeautifulSoup

class Crawler(threading.Thread):
    def __init__(self, url, semaphore):
        threading.Thread.__init__(self)
        self.url = url
        self.semaphore = semaphore

    def run(self):
        # Acquire the semaphore before fetching so that at most
        # `concurrency_limit` threads issue requests at the same time
        with self.semaphore:
            self.fetch_url(self.url)

    def fetch_url(self, url):
        response = requests.get(url, timeout=10)  # timeout guards against hanging connections
        soup = BeautifulSoup(response.text, 'html.parser')
        # Process the scraped data
        print(f"Visited {url}")

def main():
    urls = ["http://example.com/page1", "http://example.com/page2", ...]
    concurrency_limit = 5
    semaphore = threading.Semaphore(concurrency_limit)

    threads = []
    for url in urls:
        crawler = Crawler(url, semaphore)
        crawler.start()
        threads.append(crawler)

    for thread in threads:
        thread.join()

if __name__ == "__main__":
    main()
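
An equivalent and often simpler way to bound thread concurrency is concurrent.futures.ThreadPoolExecutor, where the pool size itself acts as the limit; a minimal sketch (the 5-worker limit and the fetch helper are illustrative placeholders):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    # Each worker handles one URL; the pool size caps concurrency
    response = requests.get(url, timeout=10)
    return url, response.status_code

def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]
    # max_workers plays the same role as the semaphore above
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            url, status = future.result()
            print(f"Visited {url} (status {status})")

if __name__ == "__main__":
    main()
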
  2. Use the asyncio module: Python's asyncio module provides asynchronous programming support and handles large numbers of concurrent tasks more efficiently. You can use asyncio.Semaphore to control the number of concurrent requests.

Example code:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

class Crawler:
    def __init__(self, url, semaphore):
        self.url = url
        self.semaphore = semaphore

    async def fetch_url(self, session, url):
        # The semaphore allows at most `concurrency_limit` requests in flight
        async with self.semaphore:
            await self.fetch(session, url)

    async def fetch(self, session, url):
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            # Process the scraped data
            print(f"Visited {url}")

async def main():
    urls = ["http://example.com/page1", "http://example.com/page2", ...]
    concurrency_limit = 5
    semaphore = asyncio.Semaphore(concurrency_limit)

    async with aiohttp.ClientSession() as session:
        tasks = [Crawler(url, semaphore).fetch_url(session, url) for url in urls]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())
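
aiohttp can also enforce the cap on its own through the session's connector, which is sometimes enough without an explicit semaphore; a minimal sketch (the limit of 5 and the URLs are placeholders):

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        print(f"Visited {url} (status {response.status})")

async def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]
    # TCPConnector(limit=5) caps the number of simultaneous connections
    connector = aiohttp.TCPConnector(limit=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        await asyncio.gather(*(fetch(session, url) for url in urls))

if __name__ == "__main__":
    asyncio.run(main())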

Both approaches can limit how many pages a crawler fetches at once. The threading module suits I/O-bound tasks at moderate concurrency, while asyncio, which multiplexes coroutines on a single-threaded event loop rather than spawning OS threads, scales better in high-concurrency, I/O-bound scenarios.
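
In practice, a concurrency cap is often paired with a per-request delay so the crawler does not overload the target site; a minimal sketch of adding such a delay inside an asyncio semaphore (the 1-second delay and the simulated request are placeholders):

import asyncio

async def polite_fetch(semaphore, url):
    # Hold the semaphore for the request plus a short pause, which
    # throttles both the concurrency level and the request rate
    async with semaphore:
        print(f"Fetching {url}")  # a real crawler would issue the request here
        await asyncio.sleep(1)    # placeholder politeness delay

async def main():
    semaphore = asyncio.Semaphore(5)
    urls = [f"http://example.com/page{i}" for i in range(10)]
    await asyncio.gather(*(polite_fetch(semaphore, url) for url in urls))

if __name__ == "__main__":
    asyncio.run(main())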
