# How to Build a Daily News Hotspot System with Python
## Introduction
In this age of information overload, a massive volume of news is produced every day, and efficiently collecting and organizing each day's highlights has become a common need. Python, as a powerful programming language, can automate this task. This article walks through building a system in Python that automatically fetches, organizes, and pushes daily news hotspots.
## System Architecture Overview
The system can be divided into the following modules:
1. **News fetching module**: fetches news data from news-site APIs or web pages
2. **Data processing module**: cleans the fetched news, analyzes it, and extracts hot topics
3. **Storage module**: saves the processed news to a database
4. **Push module**: delivers the daily highlights to users
5. **Scheduling module**: runs the whole pipeline automatically on a schedule
The sections below explain how to implement each module step by step.
## 1. News Fetching Module
### 1.1 Choosing a News Source
There are several kinds of news sources to choose from:
- News website APIs (such as NewsAPI or the Sina News API)
- RSS feeds (see the feedparser sketch after this list)
- Scraping news sites directly
We will use NewsAPI as the running example; it is a free API service that provides news data.
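For the RSS route, a minimal sketch using the feedparser package might look like this (the feed URL is a placeholder, and `fetch_news_from_rss` is a name introduced here for illustration):

```python
import feedparser  # pip install feedparser

def fetch_news_from_rss(feed_url="https://example-news-website.com/rss"):
    # feedparser normalizes RSS/Atom entries into dict-like objects
    feed = feedparser.parse(feed_url)
    return [
        {
            'title': entry.get('title', ''),
            'url': entry.get('link', ''),
            'description': entry.get('summary', ''),
        }
        for entry in feed.entries
    ]
```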
### 1.2 Fetching News with NewsAPI
First, register for a NewsAPI account to obtain an API key.
```python
import requests

def fetch_news_from_api():
    api_key = "your_api_key_here"
    url = f"https://newsapi.org/v2/top-headlines?country=us&apiKey={api_key}"
    try:
        response = requests.get(url)
        response.raise_for_status()
        data = response.json()
        return data['articles']
    except requests.exceptions.RequestException as e:
        print(f"Error fetching news: {e}")
        return []
```
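As a quick sanity check, you might print a few of the returned articles. NewsAPI articles carry fields such as `title`, `url`, `description`, `publishedAt`, and a nested `source` object:

```python
# Quick check (assumes a valid API key has been configured above).
if __name__ == "__main__":
    for article in fetch_news_from_api()[:3]:
        print(article['title'], '-', article['source']['name'])
```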
### 1.3 Scraping a News Site with BeautifulSoup
If the API is unavailable, BeautifulSoup can be used to scrape a news site instead:

```python
from bs4 import BeautifulSoup
import requests

def scrape_news_from_web():
    url = "https://example-news-website.com"
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        news_items = []
        # The CSS selectors below depend entirely on the target site's markup
        for item in soup.select('.news-item'):
            title = item.select_one('.title').text.strip()
            link = item.find('a')['href']
            summary = item.select_one('.summary').text.strip()
            news_items.append({'title': title, 'url': link, 'description': summary})
        return news_items
    except Exception as e:
        print(f"Error scraping news: {e}")
        return []
```
## 2. Data Processing Module
### 2.1 Data Cleaning
News gathered from different sources usually needs cleaning:

```python
import re
from datetime import datetime

def clean_news_data(news_items):
    cleaned_news = []
    for item in news_items:
        # Strip HTML tags from the description
        # (NewsAPI may return None for some fields, hence the `or ''` guards)
        description = re.sub(r'<[^>]+>', '', item.get('description') or '')
        # Parse the ISO-8601 timestamp used by NewsAPI
        published_at = item.get('publishedAt', '')
        if published_at:
            try:
                published_at = datetime.strptime(published_at, '%Y-%m-%dT%H:%M:%SZ')
            except ValueError:
                published_at = None
        else:
            published_at = None
        cleaned_news.append({
            'title': (item.get('title') or '').strip(),
            'url': (item.get('url') or '').strip(),
            'description': description,
            'source': (item.get('source') or {}).get('name', 'Unknown'),
            'published_at': published_at,
            'content': (item.get('content') or '')[:500]  # cap content length
        })
    return cleaned_news
```
### 2.2 Hot Topic Extraction
A simple TF-IDF pass (or a TextRank variant, sketched after the code below) can surface hot keywords:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_hot_topics(news_items, top_n=5):
    # Combine every title and description into one corpus
    corpus = [f"{item['title']} {item['description']}" for item in news_items]
    # Extract keywords with TF-IDF
    vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(corpus)
    # Get the vocabulary
    feature_names = vectorizer.get_feature_names_out()
    # Sum each term's TF-IDF score across all documents
    word_scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
    sorted_indices = np.argsort(word_scores)[::-1]
    # Take the top_n terms as the day's hot topics
    top_words = [feature_names[i] for i in sorted_indices[:top_n]]
    return top_words
```
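One possible TextRank-style variant, sketched here under the assumption that networkx is installed, ranks whole headlines rather than keywords by running PageRank over a headline-similarity graph:

```python
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_headlines_textrank(news_items, top_n=5):
    corpus = [item['title'] for item in news_items]
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(corpus)
    sim = cosine_similarity(tfidf)    # pairwise headline similarity
    np.fill_diagonal(sim, 0)          # ignore self-similarity
    graph = nx.from_numpy_array(sim)  # weighted, undirected similarity graph
    scores = nx.pagerank(graph)       # TextRank is PageRank on this graph
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [corpus[i] for i in ranked[:top_n]]
```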
## 3. Storage Module
SQLite or MongoDB can be used to store the news data. Here is a SQLite schema and save routine:

```python
import sqlite3

def setup_database():
    conn = sqlite3.connect('news_database.db')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS news (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            url TEXT UNIQUE NOT NULL,
            description TEXT,
            source TEXT,
            published_at DATETIME,
            content TEXT,
            is_hot BOOLEAN DEFAULT 0,
            created_at DATETIME DEFAULT CURRENT_TIMESTAMP
        )
    ''')
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS daily_hot_topics (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            date DATE UNIQUE NOT NULL,
            topics TEXT,
            created_at DATETIME DEFAULT CURRENT_TIMESTAMP
        )
    ''')
    conn.commit()
    conn.close()

def save_news_to_db(news_items):
    conn = sqlite3.connect('news_database.db')
    cursor = conn.cursor()
    for item in news_items:
        try:
            # Store the timestamp as an ISO string; the UNIQUE url column
            # plus INSERT OR IGNORE deduplicates articles across runs.
            published = item['published_at'].isoformat() if item['published_at'] else None
            cursor.execute('''
                INSERT OR IGNORE INTO news
                (title, url, description, source, published_at, content)
                VALUES (?, ?, ?, ?, ?, ?)
            ''', (
                item['title'],
                item['url'],
                item['description'],
                item['source'],
                published,
                item['content']
            ))
        except sqlite3.Error as e:
            print(f"Error saving news: {e}")
    conn.commit()
    conn.close()
```
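The `daily_hot_topics` table above is created but never written to. A small companion helper (the name `save_daily_topics` is introduced here for illustration) could record each day's extracted topics there; `daily_news_job` below could call it right after `save_news_to_db`:

```python
import json
import sqlite3
from datetime import date

def save_daily_topics(topics):
    conn = sqlite3.connect('news_database.db')
    cursor = conn.cursor()
    # INSERT OR REPLACE keeps one row per day, thanks to the UNIQUE date column
    cursor.execute('''
        INSERT OR REPLACE INTO daily_hot_topics (date, topics)
        VALUES (?, ?)
    ''', (date.today().isoformat(), json.dumps(topics, ensure_ascii=False)))
    conn.commit()
    conn.close()
```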
## 4. Push Module
### 4.1 Email Push
smtplib can send the daily highlights by email:

```python
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from datetime import datetime

def send_daily_email(topics, news_items):
    # Email configuration
    sender_email = "your_email@example.com"
    receiver_email = "recipient@example.com"
    password = "your_email_password"
    # Build the message
    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = receiver_email
    msg['Subject'] = f"Daily News Highlights - {datetime.now().strftime('%Y-%m-%d')}"
    # Build the HTML body
    html = f"""
    <h1>Today's Hot Topics</h1>
    <p>{', '.join(topics)}</p>
    <h2>Top Stories</h2>
    <ul>
    """
    for item in news_items[:5]:  # include only the top 5 stories
        html += f"""
        <li>
            <h3><a href="{item['url']}">{item['title']}</a></h3>
            <p>{item['description']}</p>
            <small>Source: {item['source']} | Published: {item['published_at']}</small>
        </li>
        """
    html += "</ul>"
    msg.attach(MIMEText(html, 'html'))
    # Send the message
    try:
        with smtplib.SMTP_SSL('smtp.example.com', 465) as server:
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, msg.as_string())
        print("Email sent successfully!")
    except Exception as e:
        print(f"Error sending email: {e}")
```
### 4.2 WeChat Push
ServerChan (Server酱) or the WeCom (enterprise WeChat) API can push messages to WeChat. The example below uses the classic ServerChan endpoint; newer ServerChan Turbo keys use `https://sctapi.ftqq.com/<SENDKEY>.send` instead, so adjust the URL to match your key:

```python
import requests

def send_wechat_notification(content):
    api_url = "https://sc.ftqq.com/YOUR_SERVER_CHAN_KEY.send"
    data = {
        "text": "Daily News Highlights",  # message title
        "desp": content                   # message body
    }
    try:
        response = requests.post(api_url, data=data)
        response.raise_for_status()
        print("WeChat notification sent!")
    except Exception as e:
        print(f"Error sending WeChat notification: {e}")
```
## 5. Scheduling Module
APScheduler ties the modules together; the job below runs the whole pipeline once a day:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def daily_news_job():
    print("Starting the daily news collection job...")
    # 1. Fetch news
    raw_news = fetch_news_from_api()
    if not raw_news:
        raw_news = scrape_news_from_web()
    # 2. Clean the data
    cleaned_news = clean_news_data(raw_news)
    # 3. Extract hot topics
    hot_topics = extract_hot_topics(cleaned_news)
    # 4. Store the data
    save_news_to_db(cleaned_news)
    # 5. Send the push notification
    send_daily_email(hot_topics, cleaned_news)
    print("Daily news collection job finished!")

def setup_scheduler():
    scheduler = BlockingScheduler()
    # Run every day at 9:00 AM
    scheduler.add_job(daily_news_job, 'cron', hour=9, minute=0)
    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()

if __name__ == "__main__":
    setup_database()
    setup_scheduler()
```
## 6. Deployment
### 6.1 Running Locally
For personal use, you can simply run the script directly:

```bash
python daily_news.py
```

### 6.2 Deploying to a Server
For long-term operation, deploy the script to a cloud server:

```bash
nohup python daily_news.py &
```
A more robust option is to run it as a systemd service. Create `/etc/systemd/system/daily-news.service`:

```ini
[Unit]
Description=Daily News Collector
After=network.target

[Service]
User=your_username
WorkingDirectory=/path/to/your/project
ExecStart=/usr/bin/python /path/to/your/project/daily_news.py
Restart=always

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl enable daily-news
sudo systemctl start daily-news
```
## 7. Extensions
### 7.1 News Classification
NLP techniques can classify news into categories:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# A simple classifier example
class NewsClassifier:
    def __init__(self):
        self.categories = ['politics', 'economy', 'technology', 'sports', 'entertainment']
        self.pipeline = Pipeline([
            ('vectorizer', CountVectorizer()),
            ('classifier', MultinomialNB())
        ])

    def train(self, X, y):
        # y should be integer indices into self.categories
        self.pipeline.fit(X, y)

    def predict(self, text):
        return self.categories[self.pipeline.predict([text])[0]]
```
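A quick usage sketch; the training samples and labels below are made up for illustration, a real classifier needs a labeled corpus, and Chinese text would first need word segmentation with something like jieba:

```python
# Hypothetical toy training data; labels are indices into clf.categories.
clf = NewsClassifier()
train_texts = [
    "parliament passes new election law",
    "stock markets rally on rate cut hopes",
    "startup unveils new smartphone chip",
]
train_labels = [0, 1, 2]  # politics, economy, technology
clf.train(train_texts, train_labels)
print(clf.predict("central bank signals another rate cut"))
```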
### 7.2 Sentiment Analysis
Sentiment analysis reveals the tone of coverage. TextBlob works for English text (for Chinese, see the SnowNLP sketch below):

```python
from textblob import TextBlob

def analyze_sentiment(text):
    analysis = TextBlob(text)
    return {
        'polarity': analysis.sentiment.polarity,         # -1 (negative) to 1 (positive)
        'subjectivity': analysis.sentiment.subjectivity  # 0 (objective) to 1 (subjective)
    }
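```

A minimal equivalent for Chinese text, assuming the snownlp package is installed:

```python
from snownlp import SnowNLP

def analyze_sentiment_zh(text):
    # SnowNLP's sentiments score ranges from 0 (negative) to 1 (positive)
    return {'positive_probability': SnowNLP(text).sentiments}
```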
### 7.3 Visualization Reports
Matplotlib or Pyecharts can generate visual reports:

```python
import matplotlib.pyplot as plt

def plot_topic_distribution(news_items):
    # Count how many articles came from each source
    sources = [item['source'] for item in news_items]
    source_counts = {source: sources.count(source) for source in set(sources)}
    plt.figure(figsize=(10, 6))
    plt.bar(source_counts.keys(), source_counts.values())
    plt.title('News Source Distribution')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig('source_distribution.png')
```
## Conclusion
With everything above in place, we have a complete automated daily news hotspot system that can:
- fetch news from APIs or web pages
- clean the data and extract hot topics
- store everything in a SQLite database
- push a daily digest by email or WeChat, all on a schedule

You can extend the system further to fit your own needs, for example:
- Add more news sources
- Improve the hot-topic extraction algorithm
- Add personalized recommendations for users
- Build a web interface to display the news

Python's rich ecosystem makes this kind of automation easy to build. I hope this article helps!