How a Python Crawler Extracts Key Information and Features

Published: 2024-12-07 03:39:56 Author: 小樊
Source: 亿速云 Views: 189

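In Python, a handful of libraries cover most of the work of extracting key information and features from a web page. The commonly used libraries and steps are as follows: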

Import the required libraries:

import requests
from bs4 import BeautifulSoup
import re

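Send the request and fetch the page content:

url = 'https://example.com'
response = requests.get(url, timeout=10)  # a timeout keeps the call from hanging indefinitely
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html_content = response.text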

Parse the HTML content:

soup = BeautifulSoup(html_content, 'html.parser')

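Extract key information:

Extract the title:

# Fall back to an empty string when the page has no <title> tag
title = soup.title.get_text(strip=True) if soup.title else ''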

Extract all links:

links = [a['href'] for a in soup.find_all('a', href=True)]

Extract all image URLs:

images = [img['src'] for img in soup.find_all('img', src=True)]
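
The href and src values gathered this way are often relative paths such as /about or img/logo.png. The following is a minimal sketch, assuming the page URL is available as the base, that resolves them to absolute URLs with the standard library and drops duplicates. The helper name absolutize is hypothetical and not part of the original example.

```python
from urllib.parse import urljoin, urldefrag


def absolutize(base_url, raw_urls):
    """Resolve raw href/src values against base_url, keeping first-seen order."""
    seen = set()
    result = []
    for raw in raw_urls:
        # urljoin leaves absolute URLs untouched and resolves relative ones;
        # urldefrag then strips any #fragment part
        absolute, _fragment = urldefrag(urljoin(base_url, raw.strip()))
        # Skip non-web schemes such as mailto: or javascript:
        if not absolute.startswith(('http://', 'https://')):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

With the variables from the steps above, this could be called as absolutize(url, links) or absolutize(url, images).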

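Extract the text content:

# A separator keeps text from adjacent tags from running together
text = soup.get_text(separator=' ', strip=True)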
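
Note that get_text() also returns the contents of script and style tags, which can pollute any pattern matching done on the text later. One possible way to keep only the visible text is to remove those tags before extracting it. The function name visible_text is hypothetical, and the list of tags to drop is an assumption you may want to extend.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return page text with script/style content removed (a rough sketch)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Calling soup(...) is shorthand for soup.find_all(...)
    for tag in soup(['script', 'style']):
        tag.decompose()  # remove the tag and everything inside it
    return soup.get_text(separator=' ', strip=True)
```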

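Feature extraction:

Use regular expressions to extract numbers, email addresses, phone numbers, and so on:

numbers = re.findall(r'\d+', text)
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
# Matches only the NNN-NNN-NNNN format; adjust the pattern for other formats
phone_numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', text)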
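
The same patterns can be wrapped in a reusable function. The sketch below precompiles them and removes duplicate matches while keeping their order. The name extract_features is hypothetical, the number pattern is extended to accept decimals as an assumption, and the phone pattern still only covers the NNN-NNN-NNNN format.

```python
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
PHONE_RE = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')  # assumes NNN-NNN-NNNN
NUMBER_RE = re.compile(r'\d+(?:\.\d+)?')         # integers and simple decimals


def extract_features(text):
    """Collect numbers, emails and phone numbers from plain text."""
    def unique(matches):
        # dict.fromkeys drops duplicates and preserves first-seen order
        return list(dict.fromkeys(matches))

    return {
        'numbers': unique(NUMBER_RE.findall(text)),
        'emails': unique(EMAIL_RE.findall(text)),
        'phone_numbers': unique(PHONE_RE.findall(text)),
    }
```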

Extract keywords from the title:

keywords = re.findall(r'\b\w+\b', title)
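
Splitting on \w+ returns every word, including filler such as "the" or "to". A rough sketch of one way to narrow this down is to lowercase the words, drop a small stopword list, and rank what remains by frequency. The stopword set and the function name top_keywords are assumptions made for illustration. Also note that \w+ treats an unbroken run of Chinese characters as a single token, so titles in Chinese would need a word-segmentation step instead.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real projects usually need a larger one
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'to', 'in', 'for', 'with', 'how'}


def top_keywords(title, limit=5):
    """Return up to `limit` of the most frequent non-stopword tokens in title."""
    words = [w.lower() for w in re.findall(r'\b\w+\b', title)]
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return [word for word, _count in counts.most_common(limit)]
```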

Store the extracted information:

data = {
    'title': title,
    'links': links,
    'images': images,
    'text': text,
    'numbers': numbers,
    'emails': emails,
    'phone_numbers': phone_numbers,
    'keywords': keywords
}
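
The dictionary above lives only in memory. If the results need to persist, one simple option is to write them to a JSON file. The sketch below assumes every value in the dictionary is JSON-serializable (strings and lists of strings, as in this example); the function names and the file path are hypothetical.

```python
import json
from pathlib import Path


def save_record(record, path):
    """Write one scraped record to a UTF-8 JSON file and return its path."""
    path = Path(path)
    with path.open('w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable
        json.dump(record, f, ensure_ascii=False, indent=2)
    return path


def load_record(path):
    """Read a record previously written by save_record."""
    with Path(path).open(encoding='utf-8') as f:
        return json.load(f)
```

For example, save_record(data, 'page.json') would store the dictionary built above.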

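The example above shows how to extract key information and features from a web page with a Python crawler. In practice, you will need to adjust the tags and patterns to match each site's structure and content. When a site applies anti-crawling measures, you may also need to set request headers or route requests through proxy IPs to get around the restrictions.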
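
As a rough illustration of the request headers and proxy settings mentioned above, a requests.Session can carry both across every request it sends. The User-Agent string and the proxy address used here are placeholders, not values from the original article, and make_session is a hypothetical helper name.

```python
import requests


def make_session(user_agent, proxy_url=None):
    """Build a session that sends a custom User-Agent and optionally uses a proxy."""
    session = requests.Session()
    # Headers set here are sent with every request made through the session
    session.headers.update({'User-Agent': user_agent})
    if proxy_url:
        # Route both HTTP and HTTPS traffic through the same proxy
        session.proxies.update({'http': proxy_url, 'https': proxy_url})
    return session
```

A session built this way could then replace the plain call, for example session.get(url, timeout=10).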
