# An Overview of Feature Processing in sklearn
## 1. Introduction
Scikit-learn (sklearn) is one of the most popular machine learning libraries for Python and ships a rich set of feature-processing tools. Feature engineering is a key step in any machine learning workflow and directly affects model performance. This article surveys the feature types and processing methods available in sklearn, covering numerical, categorical, text, and time-series features.
## 2. Numerical Feature Processing
### 2.1 Standardization and Normalization
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization (Z-score): zero mean, unit variance per column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalization: rescale each column to the [0, 1] range
minmax_scaler = MinMaxScaler()
X_normalized = minmax_scaler.fit_transform(X)
```
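To make the effect concrete, here is a quick self-contained check on hypothetical toy data (the column values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with very different scales per column
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has zero mean and unit variance
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]

# The transformation is invertible, which is handy for interpreting predictions
X_back = scaler.inverse_transform(X_scaled)
```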
### 2.2 Nonlinear Transformations

```python
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Yeo-Johnson power transform (also handles zero and negative values)
pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)

# Quantile transform mapped onto a Gaussian output distribution
qt = QuantileTransformer(output_distribution='normal')
X_quantile = qt.fit_transform(X)
```
### 2.3 Discretization (Binning)

```python
from sklearn.preprocessing import KBinsDiscretizer

# Equal-width binning into 5 bins, encoded as ordinal integers
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = discretizer.fit_transform(X)
```
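A small worked example (toy values chosen for illustration) shows what `strategy='uniform'` actually computes: the fitted `bin_edges_` split the observed range into equal-width intervals, and each value is mapped to its bin index.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [5.0], [10.0]])  # toy values in [0, 10]
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = disc.fit_transform(X)

# 'uniform' splits [0, 10] into 5 equal-width bins of width 2
print(disc.bin_edges_[0])  # [ 0.  2.  4.  6.  8. 10.]
print(X_binned.ravel())    # [0. 0. 1. 2. 4.]
```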
## 3. Categorical Feature Processing
### 3.1 One-Hot and Ordinal Encoding

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-hot encoding (note: the `sparse` argument was renamed to
# `sparse_output` in sklearn 1.2)
ohe = OneHotEncoder(sparse_output=False)
X_ohe = ohe.fit_transform(X_cat)

# Ordinal encoding
ord_enc = OrdinalEncoder()
X_ord = ord_enc.fit_transform(X_cat)
```
### 3.2 Target and CatBoost Encoding

```python
from sklearn.preprocessing import TargetEncoder  # sklearn >= 1.3
from category_encoders import CatBoostEncoder   # requires the category_encoders package

# Target encoding (beware of target leakage; sklearn's TargetEncoder
# cross-fits internally during fit_transform to mitigate it)
te = TargetEncoder()
X_te = te.fit_transform(X_cat, y)

# CatBoost encoding
cbe = CatBoostEncoder()
X_cbe = cbe.fit_transform(X_cat, y)
```
## 4. Text Feature Extraction
### 4.1 Count and TF-IDF Vectorization

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Term counts, capped at the 5000 most frequent terms
cv = CountVectorizer(max_features=5000)
X_count = cv.fit_transform(text_data)

# TF-IDF over unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(text_data)
```
### 4.2 Hashing Vectorization

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hashing vectorizer: stateless and memory-efficient, suited to large corpora
hv = HashingVectorizer(n_features=2**18)
X_hash = hv.fit_transform(text_data)
```
## 5. Time-Series Features
### 5.1 Timestamp Decomposition

```python
import pandas as pd

# Assumes df['timestamp'] is a datetime64 column
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
```
### 5.2 Cyclic Encoding

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def cyclic_encoding(X, period):
    sin = np.sin(2 * np.pi * X / period)
    cos = np.cos(2 * np.pi * X / period)
    return np.column_stack([sin, cos])

# Apply to the hour feature (period of 24 hours)
X_hour_cyclic = cyclic_encoding(df['hour'].values, 24)

# Or wrap it as a transformer so it can be used inside a pipeline
hour_transformer = FunctionTransformer(cyclic_encoding, kw_args={'period': 24})
```
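The point of the sin/cos pair is that it preserves circular adjacency, which a raw hour value does not. A quick self-contained check (repeating the helper so the snippet runs on its own):

```python
import numpy as np

def cyclic_encoding(X, period):
    sin = np.sin(2 * np.pi * X / period)
    cos = np.cos(2 * np.pi * X / period)
    return np.column_stack([sin, cos])

hours = np.array([0, 23, 12])
enc = cyclic_encoding(hours, 24)

# In raw units 23:00 is 23 away from 00:00; in sin/cos space they are neighbours,
# while 12:00 sits on the opposite side of the circle
d_23_to_0 = np.linalg.norm(enc[1] - enc[0])
d_12_to_0 = np.linalg.norm(enc[2] - enc[0])
print(d_23_to_0 < d_12_to_0)  # True
```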
## 6. Feature Selection
### 6.1 Filter Methods

```python
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

# Drop near-constant (low-variance) features
selector = VarianceThreshold(threshold=0.1)
X_high_var = selector.fit_transform(X)

# Keep the 20 features with the highest ANOVA F-scores
skb = SelectKBest(score_func=f_classif, k=20)
X_selected = skb.fit_transform(X, y)
```
### 6.2 Model-Based (Embedded) Selection

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Keep features whose importance exceeds the median importance
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold="median"
)
X_embedded = sfm.fit_transform(X, y)
```
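Alongside the filter and embedded methods above, sklearn also ships a wrapper-style selector, recursive feature elimination (`RFE`). A minimal runnable sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursively drop the weakest features until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)

print(X_rfe.shape)   # (200, 3)
print(rfe.support_)  # boolean mask over the original features
```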
## 7. Feature Construction
### 7.1 Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 polynomial features, including squares as well as interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=False)
X_poly = poly.fit_transform(X)
```
### 7.2 Ratio Features

```python
from sklearn.preprocessing import FunctionTransformer

def create_ratio_features(X):
    return X[:, [0]] / (X[:, [1]] + 1e-6)  # small epsilon avoids division by zero

ratio_transformer = FunctionTransformer(create_ratio_features)
X_ratio = ratio_transformer.fit_transform(X)
```
## 8. Feature Pipelines
### 8.1 ColumnTransformer

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# numeric_features / categorical_features are lists of column names
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
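To show the pieces working together, here is the same structure run end to end on a hypothetical three-row DataFrame (column names `age` and `city` are invented for the example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric column with a missing value, one categorical column
df = pd.DataFrame({'age': [25.0, None, 40.0], 'city': ['a', 'b', 'a']})

numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
                                ('scaler', StandardScaler())])
categorical_transformer = Pipeline([('imputer', SimpleImputer(strategy='constant',
                                                              fill_value='missing')),
                                    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer([('num', numeric_transformer, ['age']),
                                  ('cat', categorical_transformer, ['city'])])

X_out = preprocessor.fit_transform(df)
print(X_out.shape)  # (3, 3): 1 scaled numeric column + 2 one-hot columns
```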
### 8.2 Automatic Column Selection

```python
from sklearn.compose import make_column_selector as selector

# Route columns by dtype instead of listing names explicitly
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, selector(dtype_exclude="category")),
    ('cat', categorical_transformer, selector(dtype_include="category"))
])
```
### 8.3 End-to-End Pipeline

```python
from sklearn.linear_model import LogisticRegression

full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(LogisticRegression())),
    ('classifier', RandomForestClassifier())
])
```
```python
# Complete example
from sklearn import set_config
from sklearn.pipeline import FeatureUnion
from xgboost import XGBClassifier  # requires the xgboost package

set_config(display="diagram")  # render the pipeline structure as a diagram

final_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_engineering', FeatureUnion([
        ('polynomial', PolynomialFeatures(degree=2)),
        ('interactions', FunctionTransformer(create_interaction_features))  # user-defined helper
    ])),
    ('model', XGBClassifier())
])
```
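Since `XGBClassifier` and the `create_interaction_features` helper are external to sklearn, here is a self-contained variant of the same pipeline shape using only sklearn components and synthetic data, so the whole flow can actually be fitted and scored (all column names and values are invented for the demo):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=100),
                   'x2': rng.normal(size=100),
                   'cat': rng.choice(['a', 'b'], size=100)})
y = (df['x1'] + df['x2'] > 0).astype(int)  # toy target

preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['x1', 'x2']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['cat']),
])

pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_engineering', PolynomialFeatures(degree=2)),
    ('model', RandomForestClassifier(n_estimators=50, random_state=0)),
])

pipe.fit(df, y)
print(pipe.score(df, y))  # training accuracy on the toy data
```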
## 9. Conclusion
A solid command of sklearn's feature-processing tools can markedly improve both the quality and the efficiency of a machine learning project. In practice, choose among the methods above according to the characteristics of your data.