Word Frequency Counting

Published: 2020-03-31 22:04:35 Author: cooperfang
Source: web Views: 707
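# pip install bs4
from bs4 import BeautifulSoup   # a handy tool for web scraping in Python
"""
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your parser of choice to provide idiomatic ways of navigating,
searching, and modifying the document. It can save you hours or even days of work.
"""
import requests  # pip install requests
blog_url = 'https://blog.51cto.com/13118411/2154806'
data = requests.get(blog_url)  # fetch the blog page
print(data)       # the Response object (shows the status code)
print(data.text)  # the raw HTML of the page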
<Response [200]>
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link type="favicon" rel="shortcut icon" href="/favicon.ico" />
        <title>天气预报定制-cooperfang的博客-51CTO博客</title>
    <meta name="keywords" content="requests,json">
<meta name="description" content="#apiaplicationprogramminginterface#不通软件不同系统之间的功能相互调用#json是其中重要的一种数据交换形式#定制天气预报https://www.sojson.com/open/api/weather/json.shtml?city=#http://jsonviewer.stack.hu/#https://www.sojson.com/open/api/weath">    <link href="https://static1.51cto.com/edu/blog/css/header.css?v=1.0.5.1" rel="stylesheet"><link href="https://static1.51cto.com/edu/blog/css/other.css?v=1.0.3.2" rel="stylesheet"><link href="https://static1.51cto.com/edu/blog/css/right.css?v=1.0.4.7" rel="stylesheet"><link href="https://static1.51cto.com/edu/blog/css/blog_details.css?v=1.0.7.1" rel="stylesheet"><link href="https://static1.51cto.com/edu/blog/css/highlight.css" rel="stylesheet">
    <script src="https://static1.51cto.com/edu/center/js/jquery.min.js"></script><script src="https://static1.51cto.com/edu/blog/js/cookie.js"></script><script src="https://static1.51cto.com/edu/blog/js/login.js?v=1.0.0.6"></script><script src="https://static1.51cto.com/edu/blog/js/common.js?v=1.0.0.8"></script><script src="https://static1.51cto.com/edu/blog/js/mbox.js"></script><script src="https://static1.51cto.com/edu/blog/js/follow.js?v=1.0.0.8"></script><script src="https://static1.51cto.com/edu/blog/js/vip.js?v=1.0.0.1"></script></head>
<body>
<img src="https://cache.yisu.com/upload/information/20200310/57/121489.jpg" border="0" >
<!--[if lt IE 9]>
  <script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
  <script src="https://oss.maxcdn.com/libs/respond.js/1.3.0/respond.min.js"></script>
<![endif]-->

<div class="Header">
  <div class="Page ">
    <h2 class="Logo"><a href="https://blog.51cto.com/">Logo</a></h2>
    <ul class="Navigates fl">
      <li ><a href="https://blog.51cto.com/">首页</a></li>
      <li ><a href="https://blog.51cto.com/original">文章</a></li>
      <li ><a href="https://blog.51cto.com/blog/follow">关注</a></li>
      <li class="">
        <a class="column-stress" href="https://blog.51cto.com/cloumn/index">订阅专栏<b></b></a>
      </li>
            <li class="">
        <a href="https://blog.51cto.com/expert">专家</a>
      </li>
          </ul>
    <ul class="Navigates Navigates-right fr">
      <li class="more maps">
        <a href="javascript:void(0);">网站导航</a>
        <div>
            <a href="http://edu.51cto.com" target="_blank">学院</a>
            <a href="https://blog.51cto.com" target="_blank">博客</a>
            <a href="http://down.51cto.com" target="_blank">下载</a>
            <a href="http://home.51cto.com" target="_blank">家园</a>
            <a href="http://bbs.51cto.com" target="_blank">论坛</a>
            <a href="http://x.51cto.com" target="_blank">CTO训练营</a>
            <a href=" http://club.51cto.com?blog" target="_blank">CTO俱乐部</a>
            <a href="http://wot.51cto.com" target="_blank">WOT</a>
            <a href="http://www.51cto.com" target="_blank">51CTO</a>
            <i class="arrow"></i>
        </div>
      </li>
                <li><a href="http://home.51cto.com/user/register?reback=http%253A%252F%252Fblog.51cto.com%252F13118411%252F2154806" target="_self">注册</a></li>
          <li class="login"><a href="/user/login?reback=http%3A%2F%2Fblog.51cto.com%2F13118411%2F2154806" target="_self">登录</a></li>
                        <li class="mRead">
          <a href="javascript:;">手机阅读</a>
          <div>
            <img src="https://cache.yisu.com/upload/information/20200310/57/121490.jpg">
            <p>扫一扫体验手机阅读</p>
            <i class="arrow"></i>
          </div>
        </li>
            <li class="search"><a href="https://blog.51cto.com/search/index"  target="_self">搜索</a></li>
                  <li class="write"><a href="javascript:;" onClick="Login()">写文章</a></li>
                  </ul>

          <div class="clear"></div>
  </div>
</div>
<script>
    var isLogin = '0';
    var userId = '';
    var imgpath = 'https://s1.51cto.com/';
    var BLOG_URL = 'https://blog.51cto.com/';
    var msg_num_url = '/index/ajax-msg-num';
    $('.msg-follow, .msg-follow-max').eq(1).css({top: '91px'});
    $('.msg-follow, .msg-follow-max').eq(2).css({top: '121px'});
    setTimeout(function(){
            $.ajax({
                url:msg_num_url,
                type:"get",
                dataType:'json',
                success:function(res){
                    if(res.status == '0'){
                       //
                       var hasNewMsg = false;
                       if(res.data.msgNum > 0 && !$('#myMsg i').hasClass('dot')){
                            $('#myMsg i').addClass('dot');
                            hasNewMsg = true;
                       }
                       if(res.data.notifyNum > 0 && !$('#myNotify i').hasClass('dot')){
                           $('#myNotify i').addClass('dot');
                           hasNewMsg = true;
                       }
                       if(res.data.recommend_new > 0 && !$('#myRecommend i').hasClass('dot')){
                           $('#myRecommend i').addClass('dot');
                           hasNewMsg = true;
                       }
                       if(hasNewMsg && !$('#myAllMsg i').hasClass('dot')){
                           $('#myAllMsg i').addClass('dot');
                       }
                    }

                }
            });
    },70);
</script>
<div class="Content-box">
        <link rel="stylesheet" href="https://static1.51cto.com/edu/blog/css/mdeShow.css?v=1.0.0.9">
<link rel="stylesheet" href="https://static1.51cto.com/edu/blog/css/tinyscrollbar.css"/>
<script type="text/javascript" src="https://static1.51cto.com/edu/blog/js/jquery.tinyscrollbar.js"></script>
<div class="Content Index" >
    <div class="Page M764">
        <!-- left start -->
        <div class="artical-Left-blog">
            <div class="status">
                                <a class="tab_name original">原创</a>
                            </div>
            <h2 class="artical-title">天气预报定制</h2>
            <div class="artical-title-list">
                <div class="is-vip-bg-6 fl">
                    <a href="https://blog.51cto.com/13118411" class="a-img" target="_blank"><img class="is-vip-img is-vip-img-4" data-uid="13108411" src="https://cache.yisu.com/upload/information/20200310/57/121491.jpg"></a>
                </div>
                <a href="https://blog.51cto.com/13118411" class="name fl" target="_blank">cooperfang</a>
                                <a class="comment comment-num fr"><font class="comment_number">0</font>人评论</a>
                <span class="fr"></span>
                <a href="javascript:;" class="read fr">124人阅读</a>
                <a href="javascript:;" class="time fr">2018-08-04 22:59:05</a>
                <div class="clear"></div>
            </div>
                            <div class="artical-content-bak main-content">
                    <div class="con artical-content editor-preview-side"><pre><code class="language-python"># api aplication programming interface
# 不通软件不同系统之间的功能相互调用
# json是其中重要的一种数据交换形式
# 定制天气预报 https://www.sojson.com/open/api/weather/json.shtml?city=
# http://jsonviewer.stack.hu/
# https://www.sojson.com/open/api/weather/json.shtml

?city=%E5%8C%97%E4%BA%AC</code></pre>
<pre><code class="language-python">import requests # pip install requests 请求  网上api的调用形式
url = 'https://www.sojson.com/open/api/weather/json.shtml?city='
city = '北京'
ret = requests.get(url + city) # 请求的对象
print(ret.json())</code></pre>
<pre><code>{'date': '20180804', 'message': 'Success !', 'status': 200, 'city': '北京', 'count': 9, 'data': {'shidu': '70%', 'pm25': 44.0, 'pm10': 78.0, 'quality': '良', 'wendu': '30', 'ganmao': '极少数敏感人群应减少户外活动', 'yesterday': {'date': '03日星期五', 'sunrise': '05:13', 'high': '高温 36.0℃', 'low': '低温 26.0℃', 'sunset': '19:27', 'aqi': 107.0, 'fx': '南风', 'fl': '&lt;3级', 'type': '晴', 'notice': '愿你拥有比阳光明媚的心情'}, 'forecast': [{'date': '04日星期六', 'sunrise': '05:14', 'high': '高温 36.0℃', 'low': '低温 27.0℃', 'sunset': '19:26', 'aqi': 97.0, 'fx': '南风', 'fl': '&lt;3级', 'type': '晴', 'notice': '愿你拥有比阳光明媚的心情'}, {'date': '05日星期日', 'sunrise': '05:15', 'high': '高温 35.0℃', 'low': '低温 25.0℃', 'sunset': '19:25', 'aqi': 103.0, 'fx': '东南风', 'fl': '&lt;3级', 'type': '雷阵雨', 'notice': '带好雨具,别在树下躲雨'}, {'date': '06日星期一', 'sunrise': '05:16', 'high': '高温 31.0℃', 'low': '低温 25.0℃', 'sunset': '19:24', 'aqi': 97.0, 'fx': '南风', 'fl': '&lt;3级', 'type': '雷阵雨', 'notice': '带好雨具,别在树下躲雨'}, {'date': '07日星期二', 'sunrise': '05:17', 'high': '高温 31.0℃', 'low': '低温 25.0℃', 'sunset': '19:22', 'aqi': 113.0, 'fx': '西南风', 'fl': '&lt;3级', 'type': '雷阵雨', 'notice': '带好雨具,别在树下躲雨'}, {'date': '08日星期三', 'sunrise': '05:18', 'high': '高温 30.0℃', 'low': '低温 24.0℃', 'sunset': '19:21', 'aqi': 68.0, 'fx': '东南风', 'fl': '&lt;3级', 'type': '雷阵雨', 'notice': '带好雨具,别在树下躲雨'}]}}</code></pre>
<pre><code class="language-python"># 象字典一样取值
d = ret.json()
# print(d['status'])
# print(d['city'])
# print(d['data'])
# print(d['data']['yesterday'])

def hot_weather(data):
    """定制化天气预报"""
    try:
        weather_list = data['data']['forecast']
    #     print(weather_list)
        for day in weather_list:
            print(day['date'], day['high'], day['low'], day['sunset'], day['notice'])
    except Exception as e:
        print(e)
hot_weather(d)</code></pre>
<pre><code>04日星期六 高温 36.0℃ 低温 27.0℃ 19:26 愿你拥有比阳光明媚的心情
05日星期日 高温 35.0℃ 低温 25.0℃ 19:25 带好雨具,别在树下躲雨
06日星期一 高温 31.0℃ 低温 25.0℃ 19:24 带好雨具,别在树下躲雨
07日星期二 高温 31.0℃ 低温 25.0℃ 19:22 带好雨具,别在树下躲雨
08日星期三 高温 30.0℃ 低温 24.0℃ 19:21 带好雨具,别在树下躲雨</code></pre>
<pre><code class="language-python">%cd D:\全栈\json api
d = ret.json()
import json
with open('weather.json', 'w') as f:
    json.dump(d, f)</code></pre>
<pre><code>D:\全栈\json api</code></pre></div>
                </div>
                                                    <div class="artical-copyright mt26">©著作权归作者所有:来自51CTO博客作者cooperfang的原创作品,如需转载,请注明出处,否则将追究法律责任</div>
                                    <div class="for-tag mt26">
                                                                                        <a href="https://blog.51cto.com/search/result?q=requests" target="_blank">requests</a>
                                                                                                <a href="https://blog.51cto.com/search/result?q=json" target="_blank">json</a>
                                                                                            <div class="clear"></div>
            </div>
            <div class="more-list">
                <p class="is-praise fl "><span type="1" blog_id="2154806" userid='13108411'>0</span></p>
                <div class="share-box fr">
                    <p class="share"><i></i>分享</p>
                    <div class="bdsharebuttonbox">
                      <span></span>
                      <a class="bds_tsina" data-cmd="tsina" >微博</a>
                      <a class="bds_sqq" data-cmd="sqq" >QQ</a>
                      <a class="bds_weixin" data-cmd="weixin" >微信</a>
                      <img src="/qr/qr-url?url=http%3A%2F%2Fblog.51cto.com%2F13118411%2F2154806">
                    </div>
                </div>
                <p class="favorites favorites-opt fr"><i></i>收藏</p>
                                <div class="clear"></div>
            </div>
                            <div class="artical-list">
                                    <a class="fl" href="https://blog.51cto.com/13118411/2154797" title="json">
                        上一篇:json</a>
                                                    <div class="clear"></div>
                </div>
                        <div class="author-module">
                <div class="is-vip-bg-6 fl">
                    <a href="https://blog.51cto.com/13118411" class="a-img" target="_blank">
                        <img class="is-vip-img is-vip-img-4" data-uid="13108411" src="https://cache.yisu.com/upload/information/20200310/57/121491.jpg">
                    </a>
                </div>
                <div class="author-module-center fl">
                    <a class="h3" href="https://blog.51cto.com/13118411" target="_blank">cooperfang</a>
                    <h4>42篇文章,1W+人气,0粉丝</h4>
                                    </div>
                                <div class="clear"></div>
            </div>
        </div>
        <div class="artical-Left" id="comment">
            <!-- 发布评论 -->
            <div class="comment-creat">
                <div class="is-vip-bg-6 fl">
                    <a href="https://blog.51cto.com/13118411" class="header-img" target="_blank">
                        <img  src="https://cache.yisu.com/upload/information/20200310/57/121491.jpg"/>
                    </a>
                </div>
                <div class="first-publish fr publish_user_id">
                    <textarea class="textareadiv textareadiv-publish" name="" id="" placeholder="提问和评论都可以,用心的回复会被更多人看到和认可"  maxlength="500"></textarea>
                    <div class="comment-push">
                        <p class="msg fl">Ctrl+Enter&nbsp;发布</p>
                                                    <p class="publish-btn blue-btn fr" flag="1">发布</p>
                                                <p class="cancel-btn cancel-btn-1 fr">取消</p>
                        <div class="clear"></div>
                    </div>
                    <input type="hidden" class="user_id" value="13108411">
                    <input type="hidden" class="reply_id" value="2154806">
                    <input type="hidden" class="first_pid" value="">
                </div>
                <div class="clear"></div>
            </div>
                        <div class="commentList">
                        </div>
            <!-- page -->
            <div class="act_pageList_box"></div>
        </div>
        <!-- end left -->
        <!-- right start -->
        <div class="Blog-Right artical-Right">
            <a class="catalog"></a>
            <a class="scrollTop" href="javascript:void(0);" onclick="$(window).scrollTop(0);"></a>
        </div>
        <!-- end right  -->
    </div>
            <div class="special-column">
            <div class="Page M764">
                                    <div class="column-1">
                        <h3 class="column-tit">推荐专栏</h3>
                                                    <div class="column-box">
                                <a href="https://blog.51cto.com/cloumn/detail/13" class="a-img fl cloumn-tab-par" target="_blank">
                                    <img src="https://cache.yisu.com/upload/information/20200310/57/121492.jpg">
                                                                            <span class="cloumn-tab-new cloumn-tab-new-1 cloumn-tab2 f12">上新</span>
                                                                    </a>
                                <div class="center fl">
                                    <a class="h3 white-space" href="https://blog.51cto.com/cloumn/detail/13" target="_blank">基于Python的DevOps实战</a>
                                    <h4 class="white-space">运维开发全攻略</h4>
                                    <h5 class="white-space">共18章&nbsp;|&nbsp;<a href="https://blog.51cto.com/yuhongchun" target="_blank">抚琴煮酒</a></h5>
                                    <h6><span class="price">¥51.00</span><span>6人订阅</span></h6>
                                </div>
                                <div class="right fr">
                                                                              <a class="cloumn-subscribe" cid="13" href="/cloumn/detail/13" ask='1' target="_blank">订阅</a>
                                                                    </div>
                                <div class="clear"></div>
                            </div>
                                                    <div class="column-box">
                                <a href="https://blog.51cto.com/cloumn/detail/4" class="a-img fl cloumn-tab-par" target="_blank">
                                    <img src="https://cache.yisu.com/upload/information/20200310/57/121493.jpg">
                                                                    </a>
                                <div class="center fl">
                                    <a class="h3 white-space" href="https://blog.51cto.com/cloumn/detail/4" target="_blank">微服务技术架构和大数据治理实战</a>
                                    <h4 class="white-space">大数据时代的微服务之路</h4>
                                    <h5 class="white-space">共18章&nbsp;|&nbsp;<a href="https://blog.51cto.com/ityouknow" target="_blank">纯洁微笑</a></h5>
                                    <h6><span class="price">¥51.00</span><span>496人订阅</span></h6>
                                </div>
                                <div class="right fr">
                                                                              <a class="cloumn-subscribe" cid="4" href="/cloumn/detail/4" ask='1' target="_blank">订阅</a>
                                                                    </div>
                                <div class="clear"></div>
                            </div>
                                            </div>
                                                    <div class="column-2" >
                        <h3 class="column-tit">猜你喜欢</h3>
                        <div class="column-box">
                                                            <a class="white-space" href="https://blog.51cto.com/13118411/2154797?source=dra" target="_blank">json</a>
                                                            <a class="white-space" href="https://blog.51cto.com/13118411/2154710?source=dra" target="_blank">v0.35</a>
                                                            <a class="white-space" href="https://blog.51cto.com/laputaliya/536858?source=drt" target="_blank">JQuery ajax返回JSON时的处理方式</a>
                                                            <a class="white-space" href="https://blog.51cto.com/zhaojianping/629526?source=drt" target="_blank">android 读取json数据(遍历JSONObject和JSONArray)</a>
                                                            <a class="white-space" href="https://blog.51cto.com/huqilong/136802?source=drt" target="_blank">struts2 json jquery 集成详解</a>
                                                            <a class="white-space" href="https://blog.51cto.com/12731497/2154195?source=drh" target="_blank">谈谈Python实战数据可视化之pyplot模块</a>
                                                            <a class="white-space" href="https://blog.51cto.com/13719825/2151358?source=drh" target="_blank">用爬虫和Flask打造属于自己的电影网站,完整教程送上!</a>
                                                            <a class="white-space" href="https://blog.51cto.com/lavenliu/2150518?source=drh" target="_blank">掌握面向对象编程本质,彻底掌握OOP</a>
                                                        <div class="clear"></div>
                        </div>
                    </div>
                            </div>
        </div>
        <div class="the-lowest-bg">
        <div class="the-lowest Page M764">
            <p class="is-praise fl "><span type="1" blog_id="2154806" userid='13108411'></span></p>
            <p class="b-favorites favorites-opt fl"><i></i><b>0</b></p>
            <a class="b-reply fl"><i></i><font class="comment_number"></font></a>
            <div class="b-share fl">
                <i></i>分享
                <div class="bdsharebuttonbox">
                  <a class="bds_tsina p2" data-cmd="tsina"></a>
                  <a class="bds_sqq p3" data-cmd="sqq"></a>
                  <a class="bds_weixin p1" data-cmd="weixin"><em class="code-icon"></em><img class="code-img" src="/qr/qr-url?url=http%3A%2F%2Fblog.51cto.com%2F13118411%2F2154806"></a>
                </div>
            </div>
                        <a href="https://blog.51cto.com/13118411" class="b-name fr">cooperfang</a>
            <div class="is-vip-bg-6 fr">
                <a href="https://blog.51cto.com/13118411" class="b-img"><img class="is-vip-img is-vip-img-4" data-uid="13108411" src="https://cache.yisu.com/upload/information/20200310/57/121491.jpg"></a>
            </div>
            <div class="clear"></div>
        </div>
    </div>
</div>
<!-- 老博文美观处理 -->
<script>
    var praise_url = 'https://blog.51cto.com/praise/praise'
        addReply_url = 'https://blog.51cto.com/comments/add'
        removeUrl = 'https://blog.51cto.com/comments/del'
        blog_id = '2154806'
        rid = '0'
        is_comment = '0'
        comment_list = '/blog/ajax-comment-list'
        comment_sort = "asc"
        index_url = 'https://blog.51cto.com/13118411';
        uc_url = 'http://ucenter.51cto.com/'
        blog_url = 'https://blog.51cto.com/'
        img_url = 'https://static1.51cto.com/edu/blog/'
        i_user_id = ''
        c_user_id ='13108411'
        collect_url = 'https://blog.51cto.com/collect/add'
        is_old = '0'
        nicknameurl = 'https://blog.51cto.com/13118411'
        nickname = 'cooperfang'
        myself = window.location.href
    $('.you-like-list li:odd').css({'margin-left': '10%'});
    $('.column-box a:odd').addClass('left-list')
    $('.myUrl').text(myself).click(function(){window.open(myself)})
    setTimeout(function(){$('.Footer').css({'margin-top':'-50px'})},50)
            if(is_old==1){SyntaxHighlighter.all();}
    window._bd_share_config={
    "common":{
      "bdText":"天气预报定制",
      "bdDesc":$("#abstract_bdshare").text(),
      "bdMini":"2",
      "bdMiniList":false,
      "bdPic":"https://cache.yisu.com/upload/information/20200310/57/121494.jpg",
      "bdStyle":"0",
      "bdSize":"16"
    },
    "share":{}
  };
  with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='http://bdimg.share.baidu.com/static/api/js/share.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];
  setTimeout(function(){
    $('.bdsharebuttonbox a').removeAttr('title')
  },1000)
</script>
</div>
<script src="https://static1.51cto.com/edu/blog/js/marked.min.js?v=1.0.0.5"></script><script src="https://static1.51cto.com/edu/blog/js/highlight.js"></script><script src="https://static1.51cto.com/edu/blog/js/detail_mp.js?v=2.0.1.7"></script><script src="https://static1.51cto.com/edu/blog/js/detail.js?v=1.0.6.9"></script><script src="https://static1.51cto.com/edu/blog/js/details_new.js?v=1.1.1"></script><script src="https://static1.51cto.com/edu/blog/js/copy.js?v=1.0.0.0"></script>    <script src="https://static1.51cto.com/edu/blog/js/pvlog.js"></script>
<script src="https://logs.51cto.com/rizhi/count/count.js"></script>
<script>
  $(".gotop").click(function(){$(window).scrollTop(0)})
</script>

    <script type="text/javascript">
    //百度统计代码
    var _hmt = _hmt || [];
    (function() {
      var hm = document.createElement("script");
      hm.src = "https://hm.baidu.com/hm.js?2283d46608159c3b39fc9f1178809c21";
      var s = document.getElementsByTagName("script")[0];
      s.parentNode.insertBefore(hm, s);
    })();

    //自动推送链接
    (function(){
        var bp = document.createElement('script');
        var curProtocol = window.location.protocol.split(':')[0];
        if (curProtocol === 'https') {
            bp.src = 'https://zz.bdstatic.com/linksubmit/push.js';
        }
        else {
            bp.src = 'http://push.zhanzhang.baidu.com/push.js';
        }
        var s = document.getElementsByTagName("script")[0];
        s.parentNode.insertBefore(bp, s);
    })();
      var _vds = _vds || [];
      window._vds = _vds;
      (function(){
        _vds.push(['setAccountId', '8c51975c40edfb67']);
        (function() {
          var vds = document.createElement('script');
          vds.type='text/javascript';
          vds.async = true;
          vds.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + 'assets.growingio.com/vds.js';
          var s = document.getElementsByTagName('script')[0];
          s.parentNode.insertBefore(vds, s);
        })();
      })();
      document.write(decodeURI("%3Cscript src='https://cache.yisu.com/upload/information/20200310/57/121495.jpg' type='text/javascript'%3E%3C/script%3E"));
    </script>

<script>
  var uid = '';
  var BLOG_URL = 'https://blog.51cto.com/';
</script>
<script src="https://static1.51cto.com/edu//blog/js/jquery.cookie.js"></script>
<script src="https://static1.51cto.com/edu/blog/js/time-on-page.js?v=1.0.2" charset="utf-8"></script>
<script>
    (function(){
        var wh=$(window).height(),fh=$('.Footer').outerHeight(true),hh=$('.Header').outerHeight(true)
        $('.Content-box').css({'min-height': wh-fh-hh})
    })()
</script>
</body>
</html>
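contents = BeautifulSoup(data.text, 'html.parser') # data.text is the blog page's HTML; 'html.parser' is the parser built into Python's standard library
# print(contents)  # prints a more normalized version of the markup
all_p = contents.find_all('p')  # find every <p> tag
all_text = ''
for p in all_p:
#     print(p.text)
    all_text += str(p.text)  # concatenate everything into one string
print(all_text)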
扫一扫体验手机阅读0分享收藏Ctrl+Enter 发布发布取消0
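The text printed above is only page chrome (share buttons, comment-box hints). In the HTML dump, the article body sits inside `<pre><code>` blocks rather than `<p>` tags, so `find_all('p')` never sees it. The following is a minimal sketch of that effect, not the author's method: it uses the standard library's `html.parser.HTMLParser` in place of Beautiful Soup so that it runs without extra packages, and `SAMPLE_HTML`, `TagTextCollector`, and `collect_text` are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical miniature page: the body text sits in <pre>, the chrome in <p>.
SAMPLE_HTML = """
<div><p>分享</p><p>收藏</p>
<pre><code>import requests
print('hello')</code></pre></div>
"""


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while we are inside the target tag
        self.chunks = []  # pieces of text seen inside the target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def collect_text(html, tag):
    """Return the concatenated text inside all <tag> elements of html."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


p_text = collect_text(SAMPLE_HTML, "p")      # chrome only
pre_text = collect_text(SAMPLE_HTML, "pre")  # the code listing
print(p_text)
print(pre_text)
```

With Beautiful Soup, the corresponding change would be to call `contents.find_all('pre')` instead of `contents.find_all('p')`, assuming the page keeps its body in `<pre>` blocks as the dump above suggests.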
# pip install jieba    # splits Chinese text into individual words
import jieba
text = jieba.cut(all_text)  # jieba.cut() returns a generator of words
"""
Signature: jieba.cut(sentence, cut_all=False, HMM=True)
Docstring:
The main function that segments an entire sentence that contains
Chinese characters into seperated words.

"""
text_list= []
for t in text:
    print(t)
    text_list.append(t)
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\coop\AppData\Local\Temp\jieba.cache
Loading model cost 1.107 seconds.
Prefix dict has been built succesfully.

扫一扫
体验
手机
阅读
0
分享
收藏
Ctrl
+
Enter
 
发布
发布
取消
0
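import collections  # standard-library module (jieba above can also be called an API); the name means "collecting"
count = collections.Counter(text_list)   # build a Counter object named count
for key, val in count.most_common(30):
    # ordered: returns the n most frequent items
    print(key, val)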
0 2
发布 2
扫一扫 1
体验 1
手机 1
阅读 1
分享 1
收藏 1
Ctrl 1
+ 1
Enter 1
  1
取消 1
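The counts above include tokens that are not really words: the digit `0`, the `+` sign, and a non-breaking space. One possible cleanup, sketched here as an assumption rather than as part of the original post, is to keep only tokens that contain at least one letter before counting. The `is_word` helper is a hypothetical name, and the token list is copied from the jieba output above.

```python
import collections

# Tokens copied from the jieba output above ('\xa0' is a non-breaking space).
tokens = ['扫一扫', '体验', '手机', '阅读', '0', '分享', '收藏',
          'Ctrl', '+', 'Enter', '\xa0', '发布', '发布', '取消', '0']


def is_word(token):
    """Keep a token only if it contains at least one letter (CJK characters count)."""
    return any(ch.isalpha() for ch in token)


words = [t for t in tokens if is_word(t)]
count = collections.Counter(words)
top = count.most_common(3)
print(top)
```

This rule is deliberately crude: it also drops purely numeric tokens that might matter in some texts, so a stop-word list may be a better fit for real articles.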
# Build an interface: you can hand someone else this .py file, or expose it as a link
import collections

def get_most_common(text_list, max_num=30):
    """Return the top max_num words together with their counts."""
    ret = {'status': 0, 'statusText': 'ok', 'data': {}}  # common API response format
    try:
        new_list = list(text_list)
        count = collections.Counter(new_list)
        ret['data'] = count.most_common(max_num)
    except Exception as e:
        ret['status'] = 1
        ret['statusText'] = str(e)  # store the message as text so the result stays serializable
    return ret

get_most_common(text_list)
{'status': 0,
 'statusText': 'ok',
 'data': [('0', 2),
  ('发布', 2),
  ('扫一扫', 1),
  ('体验', 1),
  ('手机', 1),
  ('阅读', 1),
  ('分享', 1),
  ('收藏', 1),
  ('Ctrl', 1),
  ('+', 1),
  ('Enter', 1),
  ('\xa0', 1),
  ('取消', 1)]}
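If the result is to be offered as a link rather than as a .py file, the dictionary has to be turned into text first. The sketch below assumes JSON is the exchange format (the post itself does not say so) and restates the function so the block runs on its own. `json.dumps` with `ensure_ascii=False` keeps the Chinese words readable instead of escaping them.

```python
import collections
import json


def get_most_common(text_list, max_num=30):
    """Return the top max_num words and counts in a status/data envelope."""
    ret = {'status': 0, 'statusText': 'ok', 'data': []}
    try:
        count = collections.Counter(list(text_list))
        ret['data'] = count.most_common(max_num)
    except Exception as e:
        ret['status'] = 1
        ret['statusText'] = str(e)  # an exception object itself is not JSON-serializable
    return ret


# Normal case: (word, count) tuples become two-element JSON arrays.
payload = json.dumps(get_most_common(['发布', '发布', '取消']), ensure_ascii=False)
print(payload)

# Error case: None is not iterable, so status is set to 1 and the message is kept as text.
error_payload = json.dumps(get_most_common(None), ensure_ascii=False)
print(error_payload)
```

Note that JSON has no tuple type, so a client reading the payload gets `["发布", 2]` lists back rather than `('发布', 2)` tuples.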