您好,登录后才能下订单哦!
密码登录
登录注册
点击 登录注册 即表示同意《亿速云用户服务条款》
# Python垃圾邮件的逻辑回归分类示例分析
## 引言
在数字化时代,电子邮件已成为日常通信的重要工具,但随之而来的垃圾邮件问题也日益严重。据统计,全球约50%的电子邮件属于垃圾邮件范畴。本文将使用Python和逻辑回归算法构建一个垃圾邮件分类器,通过实际代码示例演示从数据预处理到模型评估的全过程。
## 一、理解逻辑回归
### 1.1 算法原理
逻辑回归(Logistic Regression)是一种广义线性模型,通过Sigmoid函数将线性回归结果映射到(0,1)区间,适合解决二分类问题:
```python
import numpy as np
def sigmoid(z):
return 1 / (1 + np.exp(-z))
使用经典的SpamAssassin公开数据集:
from sklearn.datasets import fetch_openml
spam = fetch_openml('spambase', version=1)
X, y = spam.data, spam.target
print(f"特征数量: {X.shape[1]}")
print(f"样本分布:\n{y.value_counts()}")
输出示例:
特征数量: 57
样本分布:
0 2788
1 1813
原始数据集已包含处理后的特征: - 词频统计(如”free”出现次数) - 特殊字符统计(如”!“出现次数) - 大写字母连续序列统计
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(
penalty='l2',
C=1.0,
solver='liblinear',
max_iter=1000
)
model.fit(X_train, y_train)
penalty
: 正则化类型(L1/L2)C
: 正则化强度(越小正则化越强)solver
: 优化算法选择from sklearn.metrics import classification_report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
输出示例:
precision recall f1-score support
0 0.93 0.97 0.95 840
1 0.95 0.89 0.92 541
accuracy 0.94 1381
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.show()
importance = pd.DataFrame({
'feature': spam.feature_names,
'coef': model.coef_[0]
}).sort_values('coef', ascending=False)
from sklearn.model_selection import GridSearchCV
param_grid = {
'C': [0.01, 0.1, 1, 10],
'penalty': ['l1', 'l2']
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1平均分: {scores.mean():.3f}")
# 垃圾邮件分类完整流程
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
# 数据加载
spam = fetch_openml('spambase', version=1)
X, y = spam.data, spam.target
# 特征标准化
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 数据分割
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42)
# 模型训练
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# 模型评估
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
使用Flask构建API接口:
from flask import Flask, request, jsonify
import pickle
app = Flask(__name__)
model = pickle.load(open('spam_model.pkl', 'rb'))
@app.route('/predict', methods=['POST'])
def predict():
data = request.get_json()
prediction = model.predict([data['features']])
return jsonify({'prediction': int(prediction[0])})
对于未处理的原始邮件,需要先进行特征提取:
from sklearn.feature_extraction.text import CountVectorizer
emails = ["Free money now!!!", "Meeting schedule"]
vectorizer = CountVectorizer()
X_raw = vectorizer.fit_transform(emails)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
from sklearn.svm import SVC
svm = SVC(kernel='linear', probability=True)
svm.fit(X_train, y_train)
本文通过逻辑回归算法实现了垃圾邮件分类,获得了94%的准确率。逻辑回归在文本分类任务中表现优异,但仍有改进空间:
完整的项目代码已托管在GitHub:[示例仓库链接]
参考文献 1. Scikit-learn官方文档 2. 《机器学习实战》Peter Harrington 3. SpamAssassin公开数据集说明 “`
注:本文实际约2150字,包含: - 10个主要章节 - 12个代码示例 - 3个可视化图表建议 - 完整的实现流程 - 实际应用扩展建议 可根据需要调整代码细节或补充理论说明部分。
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。