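The four parsing helpers most commonly used in Python web scraping are re regular expressions, etree xpath, scrapy xpath, and BeautifulSoup. (etree xpath and scrapy xpath differ considerably in usage, so they are not counted as one.) This article covers a little-known BeautifulSoup pitfall; see the examples:
Example 1 (the sample looks unusual because the text is Khmer; please bear with it):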
content = """
<html>
<body>
<div class="td-post-content td-pb-padding-side">
<p>
<img alt="" class="alignnone size-full wp-image-122426"
data-recalc-dims="1" height="352"
src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1"
width="630"/>
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122427"
data-recalc-dims="1" height="473"
src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
</p>
<p>
ចំណែកឯប្រេងដូងវិញ មានផ្ទុកអាស៊ីតខ្លាញ់អូមេហ្គា៣
ដែលល្អបំផុតសម្រាប់បំផ្លាញ់មីក្រុបដែលមានវត្តមាននៅក្នុងតំបន់រន្ធគូថ
ហេតុនេះហើយទើបការឆ្លងមេរោគ និងរមាស់ត្រូវបានទប់ស្កាត់។
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122427"
data-recalc-dims="1" height="473"
src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
</p>
<p>
<img alt="" class="alignnone size-full wp-image-122428"
data-recalc-dims="1" height="473"
src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1"
width="630"/>
<br/>
<em>
<br/>
ចំណាំ៖
</em>
ប្រសិនបើអ្នករមាស់ខ្លាំង មានការឈឺចាប់ ហើយមានឈាមហូរទៀតនោះ
ត្រូវប្រញាប់ទៅជួបជាមួយគ្រូពេទ្យភ្លាម៕
</p>
</div>
</body>
</html>
"""
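from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly
img_lst = []
inner_src_list = soup.find_all('img', src=True)
for i, src in enumerate(inner_src_list):
    # The parser has already decoded &amp; to &, so this replace changes nothing.
    url = src["src"].replace("&amp;ssl", "&ssl")
    print(url)
print(soup.prettify())
# content = soup.prettify()  # the printed src values are the same
img_tags = soup.find_all('img')
for img in img_tags:
    print(img['src'])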
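The console output (a screenshot in the original post, not reproduced here) shows every src printed by the loops with '&ssl=1' rather than '&amp;ssl=1'.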
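Why does this happen? Where did the 'amp;' characters in the text go?
Explanation: the parser decodes HTML character references while it builds the tree, so by the time BeautifulSoup hands back src, the entity '&amp;' has already become '&'. (In web parsing, what you see in the raw source is not always what you get back.) (This is not specific to bs; etree xpath and scrapy xpath behave the same way.)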
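The decoding described above can be reproduced with the standard library alone. This is a minimal sketch, not BeautifulSoup's own code: `ImgSrcCollector` is a hypothetical helper and the URL is made up for illustration. It relies on the documented behavior of `html.parser.HTMLParser`, which replaces character and entity references in attribute values before passing them to `handle_starttag`.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src value of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with entities already decoded.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.srcs.append(value)


# Made-up URL for illustration; note the &amp; in the raw markup.
raw_html = '<p><img src="https://example.com/a.jpg?resize=630%2C352&amp;ssl=1"/></p>'

collector = ImgSrcCollector()
collector.feed(raw_html)
collector.close()

decoded = collector.srcs[0]
print(decoded)          # the value as the parser hands it back
print(escape(decoded))  # re-encoded form, as it would appear in markup
```

The decoded value is the one to pass to an HTTP client. `html.escape` is only needed when the URL has to be written back into HTML markup.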
Example 2:
The text is the same as above.
soup = BeautifulSoup(content, "html.parser")
img_lst = []
inner_src_list = soup.find_all('img', src=True)  # compare with the call below
for i, src in enumerate(inner_src_list):
    url = src["src"].replace("&amp;ssl", "&ssl")
    print(url)
inner_src_list = soup.find_all('img', attr={'src': True})  # compare with the call above
for i, src in enumerate(inner_src_list):
    url = src["src"].replace("&amp;ssl", "&ssl")
    print(url)
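The output is omitted here. In short, the first loop prints the URLs as expected, while the second loop prints nothing. Why?
Explanation: the first find_all treats src=True as "img tags that have a src attribute". The second call is not the dict form of the same filter. The dict parameter of find_all is named attrs, so attr={'src': True} is taken as a keyword filter on an HTML attribute literally named 'attr'. No img tag here has such an attribute, so the result is empty. Writing attrs={'src': True} returns the same tags as src=True.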
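Why a misspelled keyword fails silently can be illustrated with a toy model. BeautifulSoup's `find_all` accepts a dict parameter named `attrs` plus arbitrary `**kwargs`, and each extra keyword is treated as a filter on the HTML attribute of that name. The sketch below is a hypothetical stand-in built on the standard library, assuming only that merging rule. `toy_find_all` and `_TagCollector` are invented names, and the matching logic is far simpler than the real library's.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record (tag, attribute dict) for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def toy_find_all(html, name, attrs=None, **kwargs):
    """Hypothetical stand-in for a find_all-style API, not bs4's code.

    Any keyword that is not a named parameter lands in **kwargs and is
    treated as a filter on an HTML attribute with that name.
    """
    filters = dict(attrs or {})
    filters.update(kwargs)
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    matches = []
    for tag, tag_attrs in collector.tags:
        if tag != name:
            continue
        ok = True
        for key, wanted in filters.items():
            if wanted is True:
                # True means "the attribute must be present".
                ok = ok and key in tag_attrs
            else:
                ok = ok and tag_attrs.get(key) == wanted
        if ok:
            matches.append(tag_attrs)
    return matches


html_doc = '<p><img src="a.jpg"/><img alt="no source"/></p>'

by_keyword = toy_find_all(html_doc, "img", src=True)
by_attrs = toy_find_all(html_doc, "img", attrs={"src": True})
by_typo = toy_find_all(html_doc, "img", attr={"src": True})  # "attr" becomes an attribute filter

print(len(by_keyword), len(by_attrs), len(by_typo))
```

Because the signature accepts any keyword, the typo raises no error; it simply matches nothing. An empty result is a cue to check the keyword spelling first.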
If anything above is incorrect, corrections are welcome. Thank you!