# How to Build a Daily News Hotspot System with Python
## Introduction
In an age of information overload, a huge volume of news is produced every day, and efficiently collecting and organizing the daily hotspots has become a common need. Python's ecosystem makes it easy to automate this task. This article walks through building a system that automatically fetches, processes, and pushes daily news hotspots.
## System Architecture Overview
The system consists of the following modules:
1. **News fetching module**: fetches news from news site APIs or web pages
2. **Data processing module**: cleans the fetched news, analyzes it, and extracts hot topics
3. **Storage module**: saves the processed news to a database
4. **Push module**: delivers the daily hotspots to users
5. **Scheduler module**: runs the whole pipeline automatically on a schedule
The following sections implement each module step by step.
## 1. News Fetching Module
### 1.1 Choosing News Sources
Possible news sources include:
- News site APIs (e.g. NewsAPI or the Sina News API)
- RSS feeds
- Scraping news websites directly
We will use NewsAPI as the main example here; it is a service that provides news data and offers a free tier.
### 1.2 Fetching News with NewsAPI
First, register a NewsAPI account to obtain an API key.
```python
import requests

def fetch_news_from_api():
    api_key = "your_api_key_here"
    url = f"https://newsapi.org/v2/top-headlines?country=us&apiKey={api_key}"

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
        return data['articles']
    except requests.exceptions.RequestException as e:
        print(f"Error fetching news: {e}")
        return []
```
### 1.3 Falling Back to Web Scraping
If the API is unavailable, you can fall back to scraping a news site with BeautifulSoup:

```python
from bs4 import BeautifulSoup
import requests

def scrape_news_from_web():
    url = "https://example-news-website.com"
    headers = {'User-Agent': 'Mozilla/5.0'}

    try:
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')

        news_items = []
        for item in soup.select('.news-item'):
            title = item.select_one('.title')
            link = item.find('a')
            summary = item.select_one('.summary')
            if not (title and link):
                continue  # skip items missing required fields
            news_items.append({
                'title': title.text.strip(),
                'url': link['href'],
                'description': summary.text.strip() if summary else '',
            })

        return news_items
    except Exception as e:
        print(f"Error scraping news: {e}")
        return []
```

The CSS selectors (`.news-item`, `.title`, `.summary`) must of course be adapted to the actual HTML of the target site.
## 2. Data Processing Module
### 2.1 Cleaning the Data
News fetched from different sources usually needs cleaning:

```python
import re
from datetime import datetime

def clean_news_data(news_items):
    cleaned_news = []

    for item in news_items:
        # Strip HTML tags from the description
        description = re.sub(r'<[^>]+>', '', item.get('description', ''))

        # Parse the ISO-8601 publish date used by NewsAPI
        published_at = item.get('publishedAt', '')
        if published_at:
            try:
                published_at = datetime.strptime(published_at, '%Y-%m-%dT%H:%M:%SZ')
            except ValueError:
                published_at = None
        else:
            published_at = None

        cleaned_news.append({
            'title': item.get('title', '').strip(),
            'url': item.get('url', '').strip(),
            'description': description,
            'source': item.get('source', {}).get('name', 'Unknown'),
            'published_at': published_at,
            # NewsAPI sometimes returns null content, hence the `or ''`
            'content': (item.get('content') or '')[:500]  # cap content length
        })

    return cleaned_news
```
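Different sources often carry the same story under slightly different titles, which the cleaning step above does not address. A minimal near-duplicate filter using only the standard library's `difflib` (`dedupe_news` and its `threshold` parameter are illustrative additions, not part of the pipeline above):

```python
from difflib import SequenceMatcher

def dedupe_news(news_items, threshold=0.85):
    """Keep the first occurrence of each story, dropping later items
    whose title is very similar to one already kept."""
    kept = []
    for item in news_items:
        title = item.get('title', '').lower()
        is_dup = any(
            SequenceMatcher(None, title, k.get('title', '').lower()).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(item)
    return kept
```

This is quadratic in the number of items, which is fine for a daily batch of a few hundred headlines.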
### 2.2 Extracting Hot Topics
A simple TF-IDF approach (or the TextRank algorithm) can extract hot keywords:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_hot_topics(news_items, top_n=5):
    # Combine all titles and descriptions into a corpus
    corpus = [f"{item['title']} {item['description']}" for item in news_items]

    # Extract keywords with TF-IDF
    vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # Vocabulary terms
    feature_names = vectorizer.get_feature_names_out()

    # Sum each term's TF-IDF score across all documents
    word_scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
    sorted_indices = np.argsort(word_scores)[::-1]

    # Take the top_n terms
    top_words = [feature_names[i] for i in sorted_indices[:top_n]]

    return top_words
```

Note that `stop_words='english'` only applies to English text; Chinese news would need a tokenizer such as jieba first.
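The TextRank alternative mentioned above can be sketched without extra dependencies as a word co-occurrence graph plus a PageRank-style iteration. This is a simplified illustration (English-only tokenization, fixed parameters), not a full TextRank implementation:

```python
from collections import defaultdict
import re

def textrank_keywords(text, top_n=5, window=2, damping=0.85, iterations=30):
    """Score words by a PageRank-style walk over a co-occurrence graph."""
    words = re.findall(r'[a-z]+', text.lower())
    words = [w for w in words if len(w) > 3]  # drop very short tokens

    # Build an undirected co-occurrence graph within a sliding window
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)

    # Iterate the TextRank score update until (roughly) converged
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words that co-occur with many distinct words accumulate high scores, so frequent central terms rise to the top.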
## 3. Storage Module
You can store the news in SQLite or MongoDB. Here is a SQLite schema:

```python
import sqlite3

def setup_database():
    conn = sqlite3.connect('news_database.db')
    cursor = conn.cursor()

    cursor.execute('''
    CREATE TABLE IF NOT EXISTS news (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        url TEXT UNIQUE NOT NULL,
        description TEXT,
        source TEXT,
        published_at DATETIME,
        content TEXT,
        is_hot BOOLEAN DEFAULT 0,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    )
    ''')

    cursor.execute('''
    CREATE TABLE IF NOT EXISTS daily_hot_topics (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        date DATE UNIQUE NOT NULL,
        topics TEXT,
        created_at DATETIME DEFAULT CURRENT_TIMESTAMP
    )
    ''')

    conn.commit()
    conn.close()

def save_news_to_db(news_items):
    conn = sqlite3.connect('news_database.db')
    cursor = conn.cursor()

    for item in news_items:
        try:
            # The UNIQUE constraint on url plus INSERT OR IGNORE
            # prevents duplicate rows across runs
            cursor.execute('''
            INSERT OR IGNORE INTO news
            (title, url, description, source, published_at, content)
            VALUES (?, ?, ?, ?, ?, ?)
            ''', (
                item['title'],
                item['url'],
                item['description'],
                item['source'],
                # store datetimes as ISO strings rather than relying on
                # sqlite3's default datetime adapter (deprecated in 3.12)
                item['published_at'].isoformat() if item['published_at'] else None,
                item['content']
            ))
        except sqlite3.Error as e:
            print(f"Error saving news: {e}")

    conn.commit()
    conn.close()
```
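For the push step it can be handy to read stored news back out of the database. A small sketch (`get_recent_news` and its `db_path` parameter are illustrative additions, not part of the code above):

```python
import sqlite3

def get_recent_news(db_path='news_database.db', limit=5):
    """Return the most recently saved news rows as dicts."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets us address columns by name
    cursor = conn.cursor()
    cursor.execute('''
        SELECT title, url, description, source, published_at
        FROM news
        ORDER BY created_at DESC, id DESC
        LIMIT ?
    ''', (limit,))
    rows = [dict(row) for row in cursor.fetchall()]
    conn.close()
    return rows
```

The `id DESC` tiebreaker matters because `created_at` has one-second resolution, so rows saved in the same batch share a timestamp.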
## 4. Push Module
### 4.1 Email Push
You can send the daily digest by email with smtplib:

```python
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime

def send_daily_email(topics, news_items):
    # Mail configuration
    sender_email = "your_email@example.com"
    receiver_email = "recipient@example.com"
    password = "your_email_password"

    # Build the message
    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = receiver_email
    msg['Subject'] = f"Daily News Hotspots - {datetime.now().strftime('%Y-%m-%d')}"

    # Build the HTML body
    html = f"""
    <h1>Today's Hot Topics</h1>
    <p>{', '.join(topics)}</p>

    <h2>Top Stories</h2>
    <ul>
    """

    for item in news_items[:5]:  # only include the top 5 stories
        html += f"""
        <li>
            <h3><a href="{item['url']}">{item['title']}</a></h3>
            <p>{item['description']}</p>
            <small>Source: {item['source']} | Published: {item['published_at']}</small>
        </li>
        """

    html += "</ul>"

    msg.attach(MIMEText(html, 'html'))

    # Send via SMTP over SSL
    try:
        with smtplib.SMTP_SSL('smtp.example.com', 465) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, msg.as_string())
        print("Email sent successfully!")
    except Exception as e:
        print(f"Error sending email: {e}")
```
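Hardcoding credentials as above is fine for a demo but risky in practice. A small sketch that reads them from environment variables instead (the variable names and the `load_mail_config` helper are arbitrary choices for illustration):

```python
import os

def load_mail_config():
    """Read mail credentials from the environment, failing fast if any are absent."""
    config = {
        'sender': os.environ.get('NEWS_MAIL_SENDER'),
        'receiver': os.environ.get('NEWS_MAIL_RECEIVER'),
        'password': os.environ.get('NEWS_MAIL_PASSWORD'),
    }
    missing = [k for k, v in config.items() if not v]
    if missing:
        raise RuntimeError(f"Missing mail settings: {', '.join(missing)}")
    return config
```

On a server you would set these once in the systemd unit or a `.env` file rather than in the source code.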
### 4.2 WeChat Push
You can push to WeChat via Server酱 (ServerChan) or the WeChat Work API:

```python
import requests

def send_wechat_notification(content):
    api_url = "https://sc.ftqq.com/YOUR_SERVER_CHAN_KEY.send"
    data = {
        "text": "Daily News Hotspots",
        "desp": content
    }

    try:
        response = requests.post(api_url, data=data, timeout=10)
        response.raise_for_status()
        print("WeChat notification sent!")
    except Exception as e:
        print(f"Error sending WeChat notification: {e}")
```
## 5. Scheduler Module
APScheduler ties the modules together and runs the pipeline automatically every day:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def daily_news_job():
    print("Starting the daily news collection job...")

    # 1. Fetch news, falling back to scraping if the API fails
    raw_news = fetch_news_from_api()
    if not raw_news:
        raw_news = scrape_news_from_web()

    # 2. Clean the data
    cleaned_news = clean_news_data(raw_news)

    # 3. Extract hot topics
    hot_topics = extract_hot_topics(cleaned_news)

    # 4. Store the results
    save_news_to_db(cleaned_news)

    # 5. Send notifications
    send_daily_email(hot_topics, cleaned_news)

    print("Daily news collection job finished!")

def setup_scheduler():
    scheduler = BlockingScheduler()
    # Run every day at 9:00 AM
    scheduler.add_job(daily_news_job, 'cron', hour=9, minute=0)

    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()

if __name__ == "__main__":
    setup_database()
    setup_scheduler()
```
## 6. Deployment
### 6.1 Running Locally
For personal use, you can simply run the script locally:

```bash
python daily_news.py
```

### 6.2 Running on a Server
For long-term operation, deploy it to a cloud server:

```bash
nohup python daily_news.py &
```
### 6.3 Running as a systemd Service
A more robust option is a systemd service. Create `/etc/systemd/system/daily-news.service`:

```ini
[Unit]
Description=Daily News Collector
After=network.target

[Service]
User=your_username
WorkingDirectory=/path/to/your/project
ExecStart=/usr/bin/python /path/to/your/project/daily_news.py
Restart=always

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl enable daily-news
sudo systemctl start daily-news
```
## 7. Extensions
### 7.1 News Classification
You can classify news into categories with NLP techniques:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# A simple classifier example
class NewsClassifier:
    def __init__(self):
        self.categories = ['Politics', 'Economy', 'Technology', 'Sports', 'Entertainment']
        self.pipeline = Pipeline([
            ('vectorizer', CountVectorizer()),
            ('classifier', MultinomialNB())
        ])

    def train(self, X, y):
        # y should contain the category labels themselves
        self.pipeline.fit(X, y)

    def predict(self, text):
        # predict() returns the label directly when trained on label strings
        return self.pipeline.predict([text])[0]
```
### 7.2 Sentiment Analysis
Add sentiment analysis to gauge public opinion:

```python
from textblob import TextBlob

def analyze_sentiment(text):
    analysis = TextBlob(text)
    return {
        'polarity': analysis.sentiment.polarity,         # -1 (negative) to 1 (positive)
        'subjectivity': analysis.sentiment.subjectivity  # 0 (objective) to 1 (subjective)
    }
```

Note that TextBlob's default analyzer only works on English text.
### 7.3 Visualization
Generate visual reports with Matplotlib or Pyecharts:

```python
import matplotlib.pyplot as plt

def plot_source_distribution(news_items):
    sources = [item['source'] for item in news_items]
    source_counts = {source: sources.count(source) for source in set(sources)}

    plt.figure(figsize=(10, 6))
    plt.bar(source_counts.keys(), source_counts.values())
    plt.title('News Source Distribution')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('source_distribution.png')
```
## Conclusion
This article walked through a complete automated daily news hotspot system. The system can:
- fetch news from APIs or web pages every day
- clean the data and extract hot topics
- store the results in a database
- push a daily digest by email or WeChat

You can extend the system further to fit your own needs, for example:
- add more news sources
- improve the hot-topic extraction algorithm
- add personalized recommendations
- build a web interface to display the news

Python's rich ecosystem makes automation systems like this straightforward to build. I hope this article helps!