How to Build a Web Crawler with Python

Published: 2021-09-07 14:23:08 Author: chen
Source: 亿速云 (Yisu Cloud)

This article explains how to build a web crawler with Python. The approach described here is simple, quick to set up, and practical. Let's walk through it.

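A simple way to build a crawler: fetch the HTML of each site in the "website_links" list and take the text of the first <h2> tag as its title. This gives us the headline of each site's main article.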

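A web crawler is a small but powerful piece of software that you can build in a short time. With the approach above, watching the web for breaking news becomes straightforward. Here is a code example.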

import requests
from bs4 import BeautifulSoup

website_links = ["https://www.aljazeera.com/",
                 "https://www.thehindu.com/", "https://www.ndtv.com/"]

consolidatedTitleString = ""

for i, website in enumerate(website_links):
    page = requests.get(website, timeout=10)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Take the text of the first <h2> tag as the headline
    heading = soup.find('h2')
    if heading is None:
        continue  # this page has no <h2>, so skip it
    title = heading.get_text()
    consolidatedTitleString += "\n\n" + str(i) + ")   " + title.strip("\n")

# Display the collected headlines
print(consolidatedTitleString)
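
To try the "first <h2>" logic without network access or third-party packages, the following is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser module. The class name FirstH2Parser, the helper first_h2_text, and the sample HTML are assumptions made for illustration; they are not part of the original example.

```python
from html.parser import HTMLParser


class FirstH2Parser(HTMLParser):
    """Collects the text inside the first <h2> element only."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # currently inside the first <h2>
        self.done = False    # first <h2> has already been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and not self.done:
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_h2:
            self.in_h2 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h2:
            self.parts.append(data)


def first_h2_text(html):
    """Return the stripped text of the first <h2>, or None if there is none."""
    parser = FirstH2Parser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None


# Hypothetical page content used only for this sketch
sample_html = """
<html><body>
  <h1>Site name</h1>
  <h2 class="top-sec-title">Indonesia quake, tsunami toll tops 800</h2>
  <h2>Second story</h2>
</body></html>
"""
print(first_h2_text(sample_html))
```

For real pages BeautifulSoup is usually the more forgiving choice, since it copes better with malformed markup.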

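This is how we get each site's headline. Crawlers like this typically draw on five main Python packages: scrapy, BeautifulSoup, requests, urllib, and re (the example above needs only requests and BeautifulSoup). The last one, 're', the regular expression library, is especially useful. In the HTML source, a headline looks like this: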

<h2 class="top-sec-title">Indonesia quake, tsunami toll tops 800</h2>

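To do anything useful with this, we need to strip the HTML tags and keep only the text. With BeautifulSoup this is usually done with the ".get_text()" method. Still, it is useful to know how to do the same thing with regular expressions.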
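
As a rough sketch of the regular expression route, the snippet below removes anything that looks like a tag from the headline shown above. The helper name strip_tags is an assumption for illustration, and the pattern is deliberately naive: it works for simple fragments like this one but is not a general HTML parser.

```python
import re

html = '<h2 class="top-sec-title">Indonesia quake, tsunami toll tops 800</h2>'


def strip_tags(fragment):
    """Remove every <...> sequence and trim surrounding whitespace."""
    return re.sub(r"<[^>]+>", "", fragment).strip()


print(strip_tags(html))  # Indonesia quake, tsunami toll tops 800
```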

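The following code extracts a list of links from tweets, instead of using the news site links given above. We use the re library with a pattern that matches "http" or "https" to detect links.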

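import re
import tweepy

# `api` is assumed to be an already authenticated tweepy.API instance
listOfLinks = []

for i, status in enumerate(tweepy.Cursor(api.home_timeline).items(7)):
    try:
        # re.search returns None when the tweet contains no link,
        # so calling .group() on it raises AttributeError
        match = re.search(r"(?P<url>https?://[^\s]+)", status.text)
        listOfLinks.append(match.group("url"))
    except AttributeError:
        print("A link does not exist")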
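
The same pattern can be tried without a Twitter account. The sketch below applies it to two made-up strings; the helper first_link and the sample texts are assumptions for illustration only.

```python
import re

# Same pattern as above: "http" or "https", then everything up to whitespace
URL_PATTERN = re.compile(r"(?P<url>https?://[^\s]+)")


def first_link(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group("url") if match else None


# Hypothetical tweet texts
sample_texts = [
    "Breaking news at https://example.com/story today",
    "No link in this one",
]
links = [first_link(t) for t in sample_texts]
print(links)  # ['https://example.com/story', None]
```

Checking the match for None explicitly avoids relying on AttributeError for control flow, which some readers find easier to follow.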

You can also build an "ImageCrawler" that downloads all the images on a web page.

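import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "http://pythonforengineers.com/pythonforengineersbook/"
r = requests.get(page_url)
data = r.text

soup = BeautifulSoup(data, "lxml")

for link in soup.find_all('img'):
    image = link.get("src")
    if not image:
        continue  # <img> tag without a src attribute
    # Resolve relative and protocol-relative ("//host/...") URLs
    image = urljoin(page_url, image)
    # Drop any query string; split() is safe when there is no "?"
    image = image.split("?")[0]
    image_name = os.path.split(image)[1]
    if not image_name:
        continue  # URL ends in "/", so there is no file name to save under
    print(image_name)
    r2 = requests.get(image)
    with open(image_name, "wb") as f:
        f.write(r2.content)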
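
The crawler above has to turn each src value into a full URL and a local file name. Below is a minimal sketch of that step in isolation, using only the standard library's urllib.parse. The helper name image_filename and the example.com URLs are hypothetical and chosen just for this illustration.

```python
import os
from urllib.parse import urljoin, urlsplit


def image_filename(page_url, src):
    """Return (absolute_url, file_name) for an <img> src found on page_url."""
    # urljoin handles absolute, relative and protocol-relative src values
    absolute = urljoin(page_url, src)
    # Use only the path component, so the query string never leaks into the name
    file_name = os.path.basename(urlsplit(absolute).path)
    return absolute, file_name


# Hypothetical inputs
page = "http://example.com/book/"
print(image_filename(page, "//cdn.example.com/img/cover.png?w=300"))
print(image_filename(page, "images/a.jpg"))
```

Working from the parsed path, rather than cutting the string at the first "?", also behaves correctly when the URL has no query string at all.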

By now you should have a better understanding of how to build a web crawler with Python. The best next step is to try it out yourself.
