Python and Go crawlers can cooperate in several ways. Here are some common approaches:
A message queue offers asynchronous communication that decouples crawler components: a system such as RabbitMQ or Kafka can distribute crawl tasks between processes regardless of the language each one is written in. The Python producer and consumer follow; a Go-side consumer sketch comes after them.
Install RabbitMQ:
sudo apt-get install rabbitmq-server
Install the Python client library:
pip install pika
Producer:
import pika

# Connect to a local RabbitMQ broker and declare the task queue.
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='crawl_queue')

def send_task(url):
    # Publish the URL through the default exchange, routed to crawl_queue.
    channel.basic_publish(exchange='', routing_key='crawl_queue', body=url)
    print(f"Sent {url}")

send_task('http://example.com')
connection.close()
Consumer:
import pika
import requests

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='crawl_queue')

def callback(ch, method, properties, body):
    # Each message body is a URL to fetch.
    url = body.decode('utf-8')
    print(f"Received {url}")
    response = requests.get(url)
    print(response.text)

channel.basic_consume(queue='crawl_queue', on_message_callback=callback, auto_ack=True)
print('Waiting for messages. To exit press CTRL+C')
channel.start_consuming()
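Because the queue decouples producers from consumers, the worker on the other side does not have to be Python. Here is a minimal Go consumer sketch for the same crawl_queue, assuming the rabbitmq/amqp091-go client; the library choice, credentials, and error handling are illustrative assumptions, not part of the setup above:
package main

import (
    "io"
    "log"
    "net/http"

    amqp "github.com/rabbitmq/amqp091-go" // assumed AMQP client library
)

func main() {
    // Default guest credentials on a local broker (assumption).
    conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    ch, err := conn.Channel()
    if err != nil {
        log.Fatal(err)
    }
    defer ch.Close()

    // Declare the same queue the Python producer publishes to.
    q, err := ch.QueueDeclare("crawl_queue", false, false, false, false, nil)
    if err != nil {
        log.Fatal(err)
    }

    msgs, err := ch.Consume(q.Name, "", true, false, false, false, nil)
    if err != nil {
        log.Fatal(err)
    }

    // Each message body is a URL, mirroring the Python consumer.
    for msg := range msgs {
        url := string(msg.Body)
        log.Printf("Received %s", url)
        resp, err := http.Get(url)
        if err != nil {
            log.Printf("fetch failed: %v", err)
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        log.Printf("fetched %d bytes from %s", len(body), url)
    }
}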
Multithreading or multiprocessing lets one process fetch several URLs in parallel, which raises throughput on I/O-bound work. Two Python variants follow, then a goroutine-based Go sketch.
Multithreading:
import threading
import requests

def crawl(url):
    response = requests.get(url)
    print(response.text)

urls = ['http://example.com', 'http://example.org', 'http://example.net']
threads = []
for url in urls:
    thread = threading.Thread(target=crawl, args=(url,))
    thread.start()
    threads.append(thread)

# Wait for all fetches to finish.
for thread in threads:
    thread.join()
Multiprocessing:
import multiprocessing
import requests

def crawl(url):
    response = requests.get(url)
    print(response.text)

# The __main__ guard is required under the spawn start method (Windows/macOS).
if __name__ == '__main__':
    urls = ['http://example.com', 'http://example.org', 'http://example.net']
    processes = []
    for url in urls:
        process = multiprocessing.Process(target=crawl, args=(url,))
        process.start()
        processes.append(process)
    for process in processes:
        process.join()
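On the Go side, the same fan-out is idiomatic with goroutines and needs only the standard library. A sketch mirroring the Python examples above:
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
)

// crawl fetches one URL and reports how many bytes came back.
func crawl(url string, wg *sync.WaitGroup) {
    defer wg.Done()
    resp, err := http.Get(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Printf("%s: %d bytes\n", url, len(body))
}

func main() {
    urls := []string{"http://example.com", "http://example.org", "http://example.net"}
    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go crawl(url, &wg) // one goroutine per URL
    }
    wg.Wait() // wait for all fetches to finish
}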
A web framework such as Flask or Django can wrap the crawler in an HTTP API for remote control and monitoring. Since the protocol is language-neutral, a Go client (sketched after the Python one below) can drive a Python crawler, or vice versa.
Install Flask:
pip install Flask
Create the Flask app:
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route('/crawl', methods=['POST'])
def crawl():
    # Fetch the URL supplied in the JSON request body.
    url = request.json['url']
    response = requests.get(url)
    return jsonify({'status': 'success', 'content': response.text})

if __name__ == '__main__':
    app.run(debug=True)
Send a request:
import requests
url = 'http://localhost:5000/crawl'
data = {'url': 'http://example.com'}
response = requests.post(url, json=data)
print(response.json())
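Because the interface is plain HTTP and JSON, the caller can just as well be a Go program, which is one concrete way the two languages cooperate here. A sketch using only the Go standard library, targeting the same endpoint and payload as the Flask app above:
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // Same JSON body the Python client sends.
    payload, _ := json.Marshal(map[string]string{"url": "http://example.com"})

    resp, err := http.Post("http://localhost:5000/crawl", "application/json", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // the Flask app's JSON response
}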
Scrapy is a powerful crawler framework; paired with the scrapy-redis extension, it supports distributed crawling with a task queue shared in Redis. Because that queue lives in Redis, other programs, including Go ones, can feed it (see the sketch at the end of this section).
Install Scrapy and scrapy-redis:
pip install scrapy scrapy-redis
Create a Scrapy project:
scrapy startproject myproject
cd myproject
Create a spider:
# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.log('Visited %s' % response.url)
        # Assumes pages whose markup contains div.quote blocks.
        for quote in response.css('div.quote'):
            item = {
                'text': quote.css('span.text::text').get(),
                'author_url': quote.xpath('span/small/a/@href').get(),
            }
            yield item
Configure settings:
# myproject/settings.py
# Enable distributed scheduling via scrapy-redis: requests are queued in Redis,
# so several crawler processes can share one frontier.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"
Run the crawler:
scrapy crawl example -o output.json
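Once scheduling goes through Redis, other processes, including Go ones, can feed the crawl. The sketch below assumes the spider is converted to a scrapy_redis RedisSpider whose redis_key is 'example:start_urls' (both the conversion and the key name are assumptions, not shown above), and uses the go-redis/v9 client:
package main

import (
    "context"
    "log"

    "github.com/redis/go-redis/v9" // assumed Redis client library
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    // Push seed URLs onto the list a RedisSpider polls for start URLs.
    // "example:start_urls" must match the spider's redis_key (assumption).
    urls := []string{"http://example.com", "http://example.org"}
    for _, u := range urls {
        if err := rdb.LPush(ctx, "example:start_urls", u).Err(); err != nil {
            log.Fatal(err)
        }
    }
    log.Println("queued", len(urls), "start URLs")
}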
With these approaches, Python and Go crawlers can work together, improving both crawl throughput and reliability.