Published: 2021-11-25 09:36:35 Author: iii
Source: Yisu Cloud (亿速云) Views: 297

# What Are the Basic Python Libraries for Web Crawling?

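## Introduction

In today's data-driven world, web crawlers have become an important tool for collecting information from the internet. With its concise syntax and rich ecosystem of third-party libraries, Python is one of the most popular languages for crawler development. This article walks through the basic libraries commonly used when writing crawlers in Python, to help beginners build their own crawler quickly.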

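## 1. HTTP request libraries

### 1.1 urllib

Python's built-in HTTP request library; no extra installation is needed:

```python
from urllib.request import urlopen

# Send a GET request and decode the response body as UTF-8
response = urlopen('http://example.com')
print(response.read().decode('utf-8'))
```

Characteristics:

- Part of the standard library, so there is nothing to install
- Relatively basic feature set
- Code becomes verbose when requests get complex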
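The verbosity mentioned above shows up as soon as a request needs custom headers, a timeout, or error handling. The following is a minimal sketch of that pattern using only the standard library; the helper names `build_request` and `fetch` and the User-Agent string `my-crawler/0.1` are illustrative assumptions, not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent='my-crawler/0.1'):
    """Create a GET request that carries a custom User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Return the decoded response body, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except HTTPError as err:   # the server answered with a 4xx/5xx status
        print(f'HTTP error {err.code} for {url}')
    except URLError as err:    # e.g. DNS failure or refused connection
        print(f'Failed to reach {url}: {err.reason}')
    return None
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first. Calling `fetch('http://example.com')` would return the page text on success and `None` on failure.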

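### 1.2 requests

A third-party library, installed with `pip install requests`:

```python
import requests

# A timeout keeps the call from hanging if the server never responds
response = requests.get('http://example.com', timeout=10)
print(response.text)
```

Advantages:

- Clean, intuitive API
- Decodes the response body automatically based on the response headers (set `response.encoding` when the guess is wrong)
- Supports advanced features such as persistent sessions and file uploads
- Well supported by the community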

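### 1.3 aiohttp

An asynchronous HTTP client/server framework:

```python
import aiohttp
import asyncio

async def fetch():
    # One session can be reused for many requests
    async with aiohttp.ClientSession() as session:
        async with session.get('http://example.com') as response:
            return await response.text()

# A coroutine only runs once it is handed to an event loop
print(asyncio.run(fetch()))
```

When to use it:

- High-performance asynchronous crawlers
- Workloads that issue a large number of requests at the same time
- Projects that already use the `async`/`await` syntax (Python 3.5+)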
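The benefit of an asynchronous client comes from running many requests at once while capping how many are in flight. The sketch below shows that scheduling pattern with `asyncio.gather` and `asyncio.Semaphore`. To keep it runnable without network access, `fake_fetch` is a stand-in that sleeps instead of calling `session.get`; the function names and the concurrency limit are assumptions chosen for illustration.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for an aiohttp request: waits briefly, then returns a fake body."""
    await asyncio.sleep(0.01)
    return f'<html>{url}</html>'


async def bounded_fetch(url, semaphore):
    """Run one fetch while holding a slot in the semaphore."""
    async with semaphore:
        return await fake_fetch(url)


async def crawl(urls, max_concurrency=5):
    """Fetch all URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [bounded_fetch(url, semaphore) for url in urls]
    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*tasks)


if __name__ == '__main__':
    pages = asyncio.run(crawl([f'http://example.com/page/{i}' for i in range(10)]))
    print(len(pages))
```

In a real crawler, the body of `fake_fetch` would be replaced by a request made through a shared `aiohttp.ClientSession`.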

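## 2. Parsing libraries

### 2.1 BeautifulSoup

An HTML/XML parsing library:

```python
from bs4 import BeautifulSoup

# Sample markup; in a crawler this would be the downloaded page
html_doc = '<p><a href="/a">A</a> <a href="/b">B</a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a'))
```

Characteristics:

- Supports several parsers (`html.parser`, `lxml`, `html5lib`)
- Tolerant of malformed markup
- Provides methods for traversing the DOM tree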

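### 2.2 lxml

A high-performance parsing library:

```python
from lxml import etree

# Sample markup; in a crawler this would be the downloaded page
html_content = '<p><a href="/a">A</a> <a href="/b">B</a></p>'
tree = etree.HTML(html_content)
print(tree.xpath('//a/@href'))
```

Advantages:

- Very fast parsing
- Supports both XPath and CSS selectors (CSS selectors require the separate `cssselect` package)
- Memory efficient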
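An XPath query such as `//a/@href` returns attribute values exactly as they appear in the page, and these are often relative paths. Before queueing them for crawling, they usually need to be resolved against the URL of the page they came from. The sketch below does this with the standard library's `urljoin`; the helper name `normalize_links` and the sample values are illustrative assumptions.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments, and remove duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Keep only HTTP(S) links and skip ones already collected
        if absolute.startswith(('http://', 'https://')) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    'http://example.com/catalogue/index.html',
    ['page-2.html', '/about', 'page-2.html#top', 'mailto:someone@example.com'],
)
print(links)
# ['http://example.com/catalogue/page-2.html', 'http://example.com/about']
```

The `#top` variant collapses into the first link once its fragment is removed, and the `mailto:` link is filtered out because it is not an HTTP URL.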

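### 2.3 pyquery

A parsing library with a jQuery-style API:

```python
from pyquery import PyQuery as pq

# Sample markup; in a crawler this would be the downloaded page
html_content = '<p><a href="/a">A</a> <a href="/b">B</a></p>'
doc = pq(html_content)
print(doc('a').attr('href'))  # href of the first matched element
```

Characteristics:

- A natural choice for developers who already know jQuery
- Concise, intuitive API
- Built on top of lxml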

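## 3. Data storage libraries

### 3.1 csv

Built-in CSV file handling:

```python
import csv

# newline='' prevents blank lines between rows on Windows
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age'])
```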
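Scraped records are often held as dictionaries, in which case `csv.DictWriter` saves the work of ordering the columns by hand. The following sketch writes a few sample rows and reads them back; the row contents and the temporary file location are assumptions made for the example.

```python
import csv
import os
import tempfile

rows = [
    {'name': 'Alice', 'age': 30},
    {'name': '李雷', 'age': 25},
]

path = os.path.join(tempfile.mkdtemp(), 'data.csv')

# newline='' stops the csv module from writing blank lines on Windows;
# an explicit encoding keeps non-ASCII text intact.
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; DictReader uses the header row as the keys
with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

Note that CSV has no type information, so every value comes back as a string (`'30'` rather than `30`).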

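### 3.2 json

JSON data handling:

```python
import json

# In a crawler, this string would typically be response.text
raw = '{"name": "Alice", "age": 30}'
data = json.loads(raw)
```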
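`json.loads` covers reading; for storing scraped items the reverse direction, `json.dumps`, is needed as well. A minimal sketch, assuming a one-object-per-line layout (often called JSON Lines) and made-up sample items:

```python
import json

items = [
    {'title': 'Python 爬虫', 'price': 51.77},
    {'title': 'Web Scraping', 'price': 20.0},
]

# Serialize: one JSON object per line.
# ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes.
lines = [json.dumps(item, ensure_ascii=False) for item in items]
text = '\n'.join(lines)

# Deserialize: parse each line back into a dict.
restored = [json.loads(line) for line in text.splitlines()]
print(restored == items)
```

Writing one object per line makes it possible to append new items to a file without rewriting what is already there.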

### 3.3 pymongo

The Python driver for MongoDB:

```python
from pymongo import MongoClient

# With no arguments, MongoClient connects to localhost:27017
client = MongoClient()
db = client['test_db']
collection = db['test_collection']
collection.insert_one({'key': 'value'})
```

### 3.4 SQLAlchemy

An SQL toolkit and ORM:

```python
from sqlalchemy import create_engine

# The engine manages connections to the SQLite file data.db
engine = create_engine('sqlite:///data.db')
```

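## 4. Other useful libraries

### 4.1 Scrapy

A full-featured crawling framework:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Called with the response for each URL in start_urls
        yield {'title': response.css('title::text').get()}
```

Framework features:

- Built-in request scheduling
- Item pipelines for data processing
- Middleware extension mechanism
- Command-line tools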

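### 4.2 selenium

A browser automation tool:

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://example.com')
print(driver.page_source)
driver.quit()  # close the browser when finished
```

When to use it:

- Pages rendered by JavaScript
- Simulating user interaction
- Automated testing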

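### 4.3 PyExecJS

Runs JavaScript code from Python. It needs a JavaScript runtime such as Node.js installed on the machine:

```python
import execjs

# Compile a snippet of JavaScript, then call one of its functions
ctx = execjs.compile("""
    function add(a, b) {
        return a + b;
    }
""")
print(ctx.call("add", 1, 2))
```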

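### 4.4 fake-useragent

Generates random User-Agent strings:

```python
from fake_useragent import UserAgent

ua = UserAgent()
# Each access to ua.random returns a randomly chosen User-Agent string
headers = {'User-Agent': ua.random}
```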

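## 5. Libraries for dealing with anti-scraping measures

### 5.1 requests-html

An enhanced version of requests that can also render JavaScript:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://example.com')
r.html.render()  # execute JavaScript (downloads Chromium on first use)
```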

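### 5.2 scrapy-splash

Handles JavaScript-rendered pages from within Scrapy. It requires a running Splash service and the scrapy-splash middlewares enabled in the project settings:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'splash_example'

    def start_requests(self):
        url = 'http://example.com'
        # The 'splash' meta key asks Splash to render the page, waiting 0.5 s
        yield scrapy.Request(url, self.parse,
            meta={'splash': {'args': {'wait': 0.5}}})
```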

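### 5.3 proxy-pool

Proxy IP management. The interface below is illustrative: class and method names differ between proxy-pool implementations, so check the documentation of the package you actually install.

```python
from proxy_pool import ProxyPool

pool = ProxyPool()
proxy = pool.get_proxy()
```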

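## 6. Practical example: a simple crawler

```python
import csv

import requests
from bs4 import BeautifulSoup


def simple_crawler():
    url = 'http://books.toscrape.com'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    # Pass raw bytes so BeautifulSoup detects the page encoding itself
    soup = BeautifulSoup(response.content, 'html.parser')

    with open('books.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Title', 'Price'])

        # Each book on the page is an <article class="product_pod"> element
        for book in soup.select('article.product_pod'):
            title = book.h3.a['title']
            price = book.select_one('p.price_color').text
            writer.writerow([title, price])


if __name__ == '__main__':
    simple_crawler()
```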

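## 7. Notes on crawler development

- Respect the site's robots.txt rules
- Leave a reasonable interval between requests
- Handle error cases (timeouts, 404 responses, and so on)
- Be mindful of privacy and data copyright
- Rotate proxies and User-Agent strings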
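The first two points can be sketched with the standard library's `urllib.robotparser`. In the example below, the robots.txt content, the `my-crawler` User-Agent name, and the helper `allowed_urls` are all hypothetical; a real crawler would download the site's own file with `set_url()` and `read()` instead of parsing an inline string.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so the
# example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = 'my-crawler'

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed_urls(urls, delay=None):
    """Yield the URLs that robots.txt permits, pausing between them."""
    if delay is None:
        # Use the site's Crawl-delay if it declares one, otherwise 1 second
        delay = parser.crawl_delay(USER_AGENT) or 1
    first = True
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt
        if not first:
            time.sleep(delay)
        first = False
        yield url
```

Iterating over `allowed_urls([...])` and fetching each yielded URL keeps the crawler away from disallowed paths and spaces its requests out.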

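## Conclusion

The Python ecosystem offers a rich set of tools for crawler development, from the simple urllib to the full-featured Scrapy framework, and developers can combine them according to what a project needs. Learning how to use these basic libraries is the first step toward building efficient, stable crawlers. From there, more advanced topics such as distributed crawling and intelligent parsing are worth exploring.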

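Tip: the third-party libraries covered in this article can be installed with pip. Using a virtual environment to manage project dependencies is recommended.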
