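# How to Use sklearn for Data Mining
## Introduction
In today's data-driven era, data mining has become a key technique for extracting valuable information from large volumes of data. scikit-learn (sklearn for short) is one of the most popular machine learning libraries in the Python ecosystem and offers an efficient, easy-to-use toolkit for data mining tasks. This article walks through a typical data mining workflow with sklearn, covering data preprocessing, feature engineering, model training, and evaluation.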
---
## 1. Environment Setup and Data Loading
### 1.1 Installing sklearn
```bash
pip install scikit-learn pandas numpy matplotlib seaborn
```

### 1.2 Loading Data

sklearn accepts several input formats:

```python
from sklearn import datasets

# Load a built-in dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Load from a CSV file (requires pandas)
import pandas as pd

df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']
```
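
## 2. Data Preprocessing

### 2.1 Handling Missing Values

```python
from sklearn.impute import SimpleImputer

# Replace missing values with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
```

### 2.2 Feature Scaling

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
```

### 2.3 Encoding Categorical Features

```python
from sklearn.preprocessing import OneHotEncoder

# X_categorical is a placeholder for the categorical columns of your data
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)
```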
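
## 3. Feature Engineering

### 3.1 Feature Selection

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 10 features with the highest ANOVA F-scores
# (k should not exceed the number of features in X)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
```

### 3.2 Dimensionality Reduction

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep 95% of the variance
X_pca = pca.fit_transform(X)
```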
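
Passing a float between 0 and 1 as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The following sketch, which uses the built-in iris data and standardizes it first (an added step, since PCA is sensitive to feature scale), shows how to check how many components were actually kept.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) selects the number of components by explained variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

n_kept = pca.n_components_
cumulative = pca.explained_variance_ratio_.cumsum()
print(f"Components kept: {n_kept}")
print(f"Cumulative explained variance: {cumulative}")
```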
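
### 3.3 Feature Construction

```python
from sklearn.preprocessing import PolynomialFeatures

# interaction_only=True adds pairwise products but no squared terms
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_poly = poly.fit_transform(X)
```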
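
## 4. Model Training

### 4.1 Splitting the Data

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

### 4.2 Classification

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
```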
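
Once a random forest is fitted, its `feature_importances_` attribute gives a quick, impurity-based view of which inputs the model relies on. The sketch below is a self-contained illustration on the iris data; the `stratify` argument and fixed `random_state` are additions for reproducibility, not part of the original snippet.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Impurity-based importances: one value per feature, summing to 1
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking rather than a precise measurement.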
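
### 4.3 Regression

```python
from sklearn.svm import SVR

# Assumes y_train holds a continuous target
reg = SVR(kernel='rbf')
reg.fit(X_train, y_train)
```

### 4.4 Clustering

```python
from sklearn.cluster import KMeans

# Clustering is unsupervised, so no labels are passed to fit
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
```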
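
Clustering has no ground-truth labels to score against, so choosing `n_clusters` usually needs an internal quality measure. One common option, not mentioned in the original article, is the silhouette coefficient. The sketch below compares several values of k on the iris features; the range of k is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

scores = {}
for k in range(2, 7):
    # n_init and random_state are fixed so the result is reproducible
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette values indicate tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```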
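
## 5. Model Evaluation and Tuning

### 5.1 Evaluation Metrics

```python
# Classification metrics
from sklearn.metrics import accuracy_score, f1_score

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"F1 (macro): {f1_score(y_test, y_pred, average='macro'):.2f}")

# Regression metrics
from sklearn.metrics import mean_squared_error

# Score the regressor on its own predictions
y_pred_reg = reg.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred_reg):.2f}")
```

### 5.2 Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.2f} (±{scores.std():.2f})")
```

### 5.3 Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV

# Try each candidate value with 5-fold cross-validation
param_grid = {'n_estimators': [50, 100, 200]}
grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
```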
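
## 6. Model Persistence and Pipelines

### 6.1 Saving and Loading Models

```python
import joblib

# Save the model
joblib.dump(clf, 'model.pkl')

# Load the model
clf_loaded = joblib.load('model.pkl')
```

### 6.2 Building a Pipeline

```python
from sklearn.pipeline import Pipeline

# Chain preprocessing and the classifier into a single estimator
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
```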
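
A practical benefit of wrapping preprocessing in a `Pipeline` is that cross-validation and grid search refit the imputer and scaler inside each fold, so statistics from held-out data do not leak into training. The sketch below illustrates this on the iris data; the parameter grid values are arbitrary, and the `classifier__` prefix follows the step name chosen in the pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Each fold refits the imputer and scaler on that fold's training part only
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"classifier__n_estimators": [50, 100]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(f"Best params: {search.best_params_}")
```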
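
## 7. Case Study: Churn Prediction

```python
import seaborn as sns

# Plot pairwise feature relationships, colored by the churn label
sns.pairplot(df, hue='churn')
```

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical_features and categorical_features are placeholders:
# define them as lists of column names from your own DataFrame

# Build the preprocessing step
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)])

# Build the full model
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])

# Train and evaluate
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```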
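
Churn labels are often imbalanced, with far fewer churners than retained customers, and plain accuracy can then look good while missing most positives. The sketch below shows one way to address that: `class_weight="balanced"` on a random forest, scored with a custom metric built by `make_scorer`. The synthetic data from `make_classification`, the 90/10 class split, and the choice of F2 as the "business" metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for churn data: roughly 10% positive class (assumption)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Hypothetical business metric: F2 weights recall more heavily than precision
f2_scorer = make_scorer(fbeta_score, beta=2)

# class_weight="balanced" reweights classes inversely to their frequency;
# n_jobs=-1 trains the trees on all available CPU cores
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", n_jobs=-1, random_state=42
)
scores = cross_val_score(clf, X, y, cv=5, scoring=f2_scorer)
print(f"F2 (5-fold CV): {scores.mean():.2f}")
```

`GradientBoostingClassifier` does not accept `class_weight`, which is why this sketch switches to a random forest; with gradient boosting, per-sample weights passed to `fit` are one alternative.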
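
## 8. Common Issues and Optimization Tips

- Use the `class_weight` parameter or SMOTE oversampling to handle imbalanced classes (SMOTE comes from the separate imbalanced-learn package, not sklearn itself).
- Use `make_scorer` to create business-specific metrics.
- Set `n_jobs=-1` to use all CPU cores.
- Use the `partial_fit` method to train incrementally on data that does not fit in memory.
- Use `SGDClassifier` in place of conventional algorithms.
- Use the `feature_importances_` attribute to see which features a tree-based model relies on.

## Conclusion

sklearn's consistent API design and broad set of algorithm implementations significantly lower the technical barrier to data mining. After mastering the core workflow described in this article, readers can:

- Quickly build end-to-end data mining pipelines
- Handle a wide range of problems on structured data
- Meet complex requirements by combining modular components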
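
Suggested next steps:

- The official documentation
- Extension functionality outside the core library (the `sklearn.externals` module mentioned in older tutorials is deprecated; install packages such as `joblib` directly)
- Integration with other libraries such as XGBoost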
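
Note: adjust the parameters in these code examples to fit your own data. For a complete project, Jupyter Notebook is recommended for interactive development.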