In Python, you can control how many crawler requests run at once with the following approaches:

**1. The `threading` module.** Python's `threading` module provides basic thread support: create one thread per crawl task and use a `threading.Semaphore` to cap how many tasks run at the same time. Example code:
```python
import threading

import requests
from bs4 import BeautifulSoup


class Crawler(threading.Thread):
    def __init__(self, url, semaphore):
        super().__init__()
        self.url = url
        self.semaphore = semaphore

    def run(self):
        # The semaphore caps how many threads fetch at the same time.
        with self.semaphore:
            self.fetch_url(self.url)

    def fetch_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Process the scraped data here
        print(f"Visited {url}")


def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]  # ... more URLs
    concurrency_limit = 5
    semaphore = threading.Semaphore(concurrency_limit)

    threads = []
    for url in urls:
        crawler = Crawler(url, semaphore)
        crawler.start()
        threads.append(crawler)

    # Wait for every crawler thread to finish
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    main()
```
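If you would rather not manage threads and a semaphore by hand, the standard library's `concurrent.futures.ThreadPoolExecutor` gives the same bounded concurrency: the pool size plays the role of the semaphore. A minimal sketch under the same assumptions (the `fetch_url` helper and the URLs are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch_url(url):
    # One crawl task; the pool size limits how many run at once.
    response = requests.get(url, timeout=10)
    print(f"Visited {url} ({response.status_code})")
    return url


urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs

# max_workers plays the same role as concurrency_limit in the semaphore version.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_url, url) for url in urls]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from the worker thread
```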
**2. The `asyncio` module.** Python's `asyncio` module provides asynchronous programming support and can handle concurrent tasks more efficiently. Use an `asyncio.Semaphore` to limit the number of requests in flight. Example code:
```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup


class Crawler:
    def __init__(self, url, semaphore):
        self.url = url
        self.semaphore = semaphore

    async def fetch_url(self, session, url):
        # The semaphore caps how many requests are in flight at once.
        async with self.semaphore:
            await self.fetch(session, url)

    async def fetch(self, session, url):
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            # Process the scraped data here
            print(f"Visited {url}")


async def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]  # ... more URLs
    concurrency_limit = 5
    semaphore = asyncio.Semaphore(concurrency_limit)

    async with aiohttp.ClientSession() as session:
        tasks = [Crawler(url, semaphore).fetch_url(session, url) for url in urls]
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```
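In practice a crawler also needs timeouts and per-URL error handling so one slow or failing page doesn't take down the whole batch. Here is a hedged sketch of the same semaphore pattern without the class, adding a request timeout via `aiohttp.ClientTimeout` and `return_exceptions=True` in `asyncio.gather` (the URL list is again a placeholder):

```python
import asyncio

import aiohttp


async def fetch(session, semaphore, url):
    async with semaphore:                      # limit concurrent requests
        async with session.get(url) as resp:
            resp.raise_for_status()            # surface HTTP errors
            return url, len(await resp.text())


async def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholders
    semaphore = asyncio.Semaphore(5)
    timeout = aiohttp.ClientTimeout(total=10)  # per-request time budget

    async with aiohttp.ClientSession(timeout=timeout) as session:
        results = await asyncio.gather(
            *(fetch(session, semaphore, url) for url in urls),
            return_exceptions=True,            # failures don't cancel the other tasks
        )
        for item in results:
            print(item)  # either (url, length) or the exception that was raised


if __name__ == "__main__":
    asyncio.run(main())
```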
Both approaches let you limit a crawler's concurrency. The `threading` module is a good fit for I/O-bound tasks (the GIL is released while threads wait on the network), while `asyncio` is better suited to high-concurrency scenarios, especially I/O-bound ones, because coroutines avoid the per-thread overhead.