# How to Use Scrapy, Python's Web Scraping Powerhouse
## 1. Overview of the Scrapy Framework

### 1.1 What Is Scrapy

Scrapy is an open-source web crawling framework written in Python, designed for extracting structured data from websites quickly and efficiently. It is built around an asynchronous processing model and offers the following core advantages:

- **High performance**: built on the Twisted asynchronous networking library, with support for concurrent requests
- **Modular design**: loosely coupled components that are easy to extend
- **Built-in tooling**: a complete crawling toolchain including selectors, middlewares, and item pipelines
- **Community support**: a rich ecosystem of extensions, plugins, and documentation

### 1.2 Typical Use Cases

- E-commerce price monitoring
- News aggregation
- Data collection for search engines
- API data scraping
- Automated testing
## 2. Installation and Project Setup

### 2.1 Installation
```bash
# Create a virtual environment (recommended)
python -m venv scrapy_env
source scrapy_env/bin/activate   # Linux/Mac
scrapy_env\Scripts\activate      # Windows

# Install Scrapy
pip install scrapy

# Create a project and generate a spider skeleton
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
```
The generated project structure:

```
myproject/
├── scrapy.cfg            # deployment configuration
└── myproject/            # project module
    ├── __init__.py
    ├── items.py          # item (data model) definitions
    ├── middlewares.py    # middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spider directory
        └── example.py    # generated example spider
```
A minimal spider that extracts blog post titles and follows pagination links:

```python
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog_spider'                      # unique spider identifier
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/blog']

    def parse(self, response):
        # Extract article titles
        titles = response.css('h2.post-title::text').getall()

        # Follow the pagination link
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

        # Yield structured data
        for title in titles:
            yield {'title': title.strip()}
```
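Run the spider from the project root with `scrapy crawl blog_spider -o titles.json`. A spider can also be launched programmatically; a minimal sketch using Scrapy's `CrawlerProcess` (the file name `run.py` is just a convention):

```python
# run.py - launch the spider from a script instead of the CLI
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('blog_spider')   # spider name as defined in BlogSpider.name
process.start()                # blocks until crawling finishes
```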
Items define the structure of the scraped data (`items.py`):

```python
# items.py
import scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publish_date = scrapy.Field()
    content = scrapy.Field()
    comments = scrapy.Field()   # populated by the AJAX-comments example below
```
A simple pipeline that writes each item as a line of JSON (`pipelines.py`):

```python
# pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('articles.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
```
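Pipelines only run after they are enabled in `settings.py`. A minimal sketch; the module path assumes the default project layout, and the priority 300 is arbitrary (lower values run first):

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}
```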
When part of a page (here, the comments) is loaded via AJAX, the spider can chain a second request and carry the partially filled item along in `meta`:

```python
# Spider callbacks for an article page whose comments are loaded via AJAX
def parse_article(self, response):
    item = ArticleItem()
    item['title'] = response.css('h1::text').get()

    # Request the AJAX endpoint that serves the comments
    comment_url = response.css('.load-comments::attr(data-url)').get()
    yield scrapy.Request(
        response.urljoin(comment_url),
        callback=self.parse_comments,
        meta={'item': item}
    )

def parse_comments(self, response):
    item = response.meta['item']
    item['comments'] = response.json()['comments']
    yield item
```
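Since Scrapy 1.7, `cb_kwargs` is the preferred way to pass data between callbacks; a minimal sketch of the same flow:

```python
def parse_article(self, response):
    item = ArticleItem()
    item['title'] = response.css('h1::text').get()
    comment_url = response.css('.load-comments::attr(data-url)').get()
    # Pass the item to the next callback as a keyword argument
    yield scrapy.Request(
        response.urljoin(comment_url),
        callback=self.parse_comments,
        cb_kwargs={'item': item}
    )

def parse_comments(self, response, item):
    item['comments'] = response.json()['comments']
    yield item
```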
Scrapy's built-in `ImagesPipeline` downloads every URL listed in an item's `image_urls` field (it requires the Pillow library):

```python
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './downloads'
```
The corresponding spider only has to yield items containing the image URLs:

```python
# spider.py
import scrapy

class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com/gallery']   # placeholder start URL

    def parse(self, response):
        yield {
            # image_urls must be absolute, so resolve relative src values
            'image_urls': [response.urljoin(src)
                           for src in response.css('img::attr(src)').getall()],
            'title': response.css('title::text').get()
        }
```
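The images pipeline also supports optional filtering and thumbnail generation; a sketch of a few commonly used settings (the values shown are only examples):

```python
# settings.py
IMAGES_EXPIRES = 90       # skip images downloaded within the last 90 days
IMAGES_MIN_HEIGHT = 110   # drop images smaller than this
IMAGES_MIN_WIDTH = 110
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
```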
Basic anti-blocking measures can be configured in `settings.py`:

```python
# settings.py
DOWNLOAD_DELAY = 2   # delay between requests, in seconds
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36'

# Proxy rotation (requires the third-party scrapy-rotating-proxies package)
ROTATING_PROXY_LIST = ['ip1:port', 'ip2:port']
```
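Scrapy's built-in AutoThrottle extension can adjust the crawl rate automatically based on server response times; a minimal sketch (the values are examples):

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10          # maximum delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2  # average concurrent requests per remote server
```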
A downloader middleware can modify every outgoing request, for example to randomize headers:

```python
# middlewares.py
import random

class RandomHeaderMiddleware:
    def process_request(self, request, spider):
        # Randomize the Accept-Language header on each request
        request.headers['Accept-Language'] = random.choice(['en-US', 'zh-CN'])
```
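The middleware must be registered in `settings.py`; the module path assumes the default project layout and the priority 543 is just an example:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomHeaderMiddleware': 543,
}
```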
Captchas on login forms can be handed off to a third-party solving service. In the sketch below, `submit_to_2captcha` is a placeholder for whichever service API you use, and `FormRequest.from_response` pre-fills the remaining form fields from the login page:

```python
# Using a third-party captcha-solving service
def parse_login(self, response):
    captcha_img = response.css('#captcha::attr(src)').get()
    yield scrapy.Request(
        response.urljoin(captcha_img),
        callback=self.solve_captcha,
        # Keep the login page response so the form can be submitted later
        meta={'login_response': response}
    )

def solve_captcha(self, response):
    # submit_to_2captcha is a placeholder for a call to a service such as 2Captcha
    captcha_text = submit_to_2captcha(response.body)
    login_response = response.meta['login_response']
    yield scrapy.FormRequest.from_response(
        login_response,
        formdata={'captcha': captcha_text}
    )
```
Spiders can be deployed to and scheduled on a Scrapyd server:

```bash
# Install and start the Scrapyd service
pip install scrapyd
scrapyd &

# Deploy the project (scrapyd-deploy is provided by the scrapyd-client package)
scrapyd-deploy default -p myproject

# Schedule a crawl
curl http://localhost:6800/schedule.json -d project=myproject -d spider=blog_spider
```
Logging and error monitoring are configured in `settings.py`:

```python
# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'
LOG_STDOUT = True

# Sentry integration requires a third-party extension (e.g. the scrapy-sentry
# package); the extension path below depends on the package you install
EXTENSIONS = {
    'scrapy.extensions.sentry.SentryLogging': 500,
}
SENTRY_DSN = 'your_dsn'
```
A few settings that commonly improve throughput:

```python
# settings.py
# Increase concurrency
CONCURRENT_REQUESTS = 16

# Enable the HTTP cache (useful during development and re-crawls)
HTTPCACHE_ENABLED = True

# Redis-backed request deduplication (requires the scrapy-redis package)
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```
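The Redis dupefilter is usually paired with the scrapy-redis scheduler for distributed crawling; a minimal sketch, assuming scrapy-redis is installed and Redis is running locally:

```python
# settings.py
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = True              # keep the request queue between runs
REDIS_URL = 'redis://localhost:6379'
```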
A more complete example: a `CrawlSpider` that follows news and pagination links by rule and populates items with an `ItemLoader`:

```python
# spiders/news.py
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

from myproject.items import NewsItem   # NewsItem must be defined in items.py

class NewsSpider(CrawlSpider):
    name = 'news'
    start_urls = ['https://example.com']   # placeholder start URL

    rules = (
        # Article pages are parsed; pagination pages are only followed
        Rule(LinkExtractor(allow=r'/news/\d+'), callback='parse_news'),
        Rule(LinkExtractor(allow=r'/page/\d+')),
    )

    def parse_news(self, response):
        loader = ItemLoader(item=NewsItem(), response=response)
        loader.add_css('title', 'h1.headline::text')
        loader.add_xpath('content', '//div[@class="article-body"]//text()')
        loader.add_value('url', response.url)
        yield loader.load_item()
```
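The loader's behavior is usually controlled by input/output processors declared on the item fields; a minimal sketch of a matching `NewsItem` (the processors shown are illustrative choices, not part of the original article):

```python
# items.py
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join

class NewsItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()   # keep only the first matched title
    )
    content = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=Join('\n')    # join all text nodes into one string
    )
    url = scrapy.Field(output_processor=TakeFirst())
```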
Scraped items can be persisted to MongoDB with a pipeline that reads its connection settings from the crawler:

```python
# pipelines.py
import pymongo

class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        # Connect when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider, named after the spider
        self.db[spider.name].insert_one(dict(item))
        return item
```
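And the corresponding settings; the pipeline path, priority, and connection values are placeholders to adapt to your project:

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoDBPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
```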
**Q: How do I scrape pages whose content is rendered by JavaScript?**

A: Recommended options:

- Use the scrapy-splash middleware to render pages in a headless Splash instance
- Drive a real browser with Selenium WebDriver
- Call the site's underlying API endpoints directly and skip rendering altogether
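A minimal sketch of the scrapy-splash option, assuming a Splash instance is running and the scrapy-splash middlewares are enabled as described in that package's documentation (the spider name and URL are placeholders):

```python
import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = 'js_spider'
    start_urls = ['https://example.com/dynamic']

    def start_requests(self):
        for url in self.start_urls:
            # Render the page in Splash and wait for JavaScript to finish
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # The response now contains the rendered HTML
        yield {'title': response.css('title::text').get()}
```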
Failed requests can be retried automatically:

```python
# settings.py
RETRY_TIMES = 3                      # retry each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503]   # response codes that trigger a retry
```
This article has covered Scrapy's core usage along with a number of practical techniques. In real projects, choose the combination of components that fits your requirements, and always respect the target site's terms of service.

Note: the code examples are sketches; adjust indentation and details to your own project. Working through a real project is the best way to consolidate these topics.