XPath is useful in Python crawlers for scraping data and parsing web pages. Here are some practical examples:
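# Example 1: service listings from a zbj.com search results page
import requests
from lxml import etree

url = "https://www.zbj.com/fw/?k=saas"
resp = requests.get(url)
html = etree.HTML(resp.text)

# Each child div of this container is one listing card. The absolute path is
# tied to the page's layout and stops matching if the layout changes.
divs = html.xpath('//*[@id="__layout"]/div/div[3]/div/div[4]/div/div[2]/div[1]/div')
for div in divs:
    # xpath() always returns a list; join the text nodes rather than indexing [0],
    # which raises IndexError when a card lacks the expected node.
    title = "".join(div.xpath('./div/div[3]/div[2]/a/text()'))
    price = "".join(div.xpath("./div/div[3]/div[1]/span/text()"))
    com_name = "".join(div.xpath('./div/a/text()'))
    print(title, price, com_name)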
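
Because `xpath()` returns a list, indexing it with `[0]` raises `IndexError` whenever a node is missing. The following is a minimal sketch of one way to guard against that, run on an inline HTML string so it needs no network access. The markup, the class names, and the `first_text` helper are all hypothetical and are not taken from zbj.com.

```python
from lxml import etree

# Hypothetical markup for illustration only; the second card has no price node.
SAMPLE_HTML = """
<div id="results">
  <div class="card"><a class="title">SaaS setup</a><span class="price">100</span></div>
  <div class="card"><a class="title">SaaS audit</a></div>
</div>
"""


def first_text(node, path, default=""):
    """Return the first result of a relative XPath query, stripped, or a default."""
    found = node.xpath(path)
    return found[0].strip() if found else default


tree = etree.HTML(SAMPLE_HTML)
rows = []
for card in tree.xpath('//div[@id="results"]/div[@class="card"]'):
    # The leading "./" keeps each query inside the current card.
    rows.append((
        first_text(card, './a[@class="title"]/text()'),
        first_text(card, './span[@class="price"]/text()', default="N/A"),
    ))

print(rows)
```

Matching on attributes such as `@class` rather than on positions like `div[3]` tends to survive small layout changes better, although either form can break when a site is redesigned.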
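
# Example 2: second-hand housing titles from 58.com, saved to a text file
import requests
from lxml import etree

url = "https://xa.58.com/ershoufang/"
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser-like User-Agent
resp = requests.get(url, headers=headers)
tree = etree.HTML(resp.text)

# One div per listing inside the <section class="list"> container
div_list = tree.xpath('//section[@class="list"]/div')
with open('./58同城二手房.txt', 'w', encoding='utf-8') as fp:
    for div in div_list:
        # The leading "." restricts the search to the current listing
        titles = div.xpath('.//div[@class="property-content-title"]/h3/text()')
        if titles:  # skip divs that have no title node
            fp.write(titles[0] + '\n')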
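
A `text()` step returns only the direct text nodes of an element, so a title that contains nested tags or line breaks comes back in pieces. One possible alternative is the XPath `normalize-space()` function, which returns a single string with whitespace collapsed, and an empty string when nothing matches. The sketch below uses hypothetical inline markup, not the real 58.com page.

```python
from lxml import etree

# Hypothetical markup: the first title is split by a nested <em> and padded
# with whitespace; the second block has no <h3> at all.
SAMPLE_HTML = """
<div class="list">
  <div><h3>
      Two-bedroom <em>near</em> metro
  </h3></div>
  <div><p>no title here</p></div>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)
titles = []
for div in tree.xpath('//div[@class="list"]/div'):
    # normalize-space() yields a string rather than a list of nodes
    title = div.xpath('normalize-space(.//h3)')
    if title:  # an empty string means no <h3> was found
        titles.append(title)

print(titles)
```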
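
# Example 3: download the thumbnail images from a pic.netbian.com list page
import os
import requests
from lxml import etree
from urllib.parse import urljoin

url = "https://pic.netbian.com/4kmeinv/"
headers = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(url, headers=headers)
# requests falls back to ISO-8859-1 when the response header names no charset;
# use the encoding detected from the body so non-ASCII alt text decodes correctly.
resp.encoding = resp.apparent_encoding
tree = etree.HTML(resp.text)

# Despite the name, each item in li_list is the <a> element inside an <li>
li_list = tree.xpath('//div[@class="slist"]/ul/li/a')
os.makedirs('./piclibs', exist_ok=True)
for li in li_list:
    # urljoin resolves a relative src against the page URL and leaves an absolute src unchanged
    detail_url = urljoin(url, li.xpath('./img/@src')[0])
    detail_name = li.xpath('./img/@alt')[0] + '.jpg'
    detail_path = './piclibs/' + detail_name
    detail_data = requests.get(detail_url, headers=headers).content
    with open(detail_path, 'wb') as fp:
        fp.write(detail_data)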
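
Two details matter when saving images this way: a `src` attribute may be a relative path, and `alt` text may contain characters that are not valid in file names. The following is a minimal sketch using only the standard library. The `safe_filename` helper and the sample `src` values are hypothetical and chosen for illustration.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "https://pic.netbian.com/4kmeinv/"


def safe_filename(name, ext=".jpg"):
    """Replace characters that common file systems reject with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    return (cleaned or "untitled") + ext


# Hypothetical attribute values, not taken from the real page
relative_src = "/uploads/example/thumb.jpg"
absolute_src = "https://example.com/img/a.jpg"

# A relative src is resolved against the page URL; an absolute one is kept as-is
resolved_relative = urljoin(PAGE_URL, relative_src)
resolved_absolute = urljoin(PAGE_URL, absolute_src)

print(resolved_relative)
print(resolved_absolute)
print(safe_filename("a/b:c"))
```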
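
These examples show three common uses of XPath in Python crawlers: extracting fields from listing cards, saving extracted text to a file, and downloading images. Working through them is good practice for building your own crawlers.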