Feature engineering with PyTorch on Ubuntu mainly involves data preprocessing, feature extraction, and feature transformation. The sections below walk through each step with code examples.
Installing dependencies
Make sure Python and pip are installed, then install PyTorch, torchvision, and the other libraries used below via pip:
pip install torch torchvision numpy pandas scikit-learn matplotlib
Loading and transforming data
For image data, use torchvision.transforms for resizing, normalization, and similar operations. For example:
from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # resize the image
    transforms.ToTensor(),          # convert to a tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # normalize with ImageNet statistics
])
For text data, use nltk or spaCy for tokenization and torchtext to build a vocabulary.
Loading data
Load data in batches with torch.utils.data.DataLoader, which supports multi-process loading (via num_workers) for speed:
from torch.utils.data import DataLoader
train_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
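A minimal, self-contained sketch of how DataLoader batches a dataset, using TensorDataset in place of a real image dataset (num_workers is left at its default of 0 here so the sketch runs anywhere):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 fake samples with 8 features each, plus integer labels
data = torch.randn(100, 8)
targets = torch.randint(0, 2, (100,))
dataset = TensorDataset(data, targets)

train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
batches = [x.shape[0] for x, y in train_loader]
print(batches)  # [32, 32, 32, 4]
```

With the default drop_last=False, the final batch simply holds the remainder of the dataset.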
Extracting features with a pretrained model
Use pretrained models such as ResNet or VGG to extract high-level semantic features:
from torchvision import models
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # load ImageNet-pretrained weights
model.eval()  # switch to evaluation mode
# Extract features
with torch.no_grad():
    features = model(input_tensor)  # input: an image tensor; with the default head, the output is the 1000-dim fc layer's result
Extracting features from custom layers
Register a forward hook to capture intermediate feature maps:
feature_maps = []
def hook_fn(module, input, output):
    feature_maps.append(output)
# Register the hook on the target layer (e.g. ResNet's layer3)
model.layer3.register_forward_hook(hook_fn)
Feature standardization
Standardize the extracted features, for example with sklearn.preprocessing.StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features.numpy())
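A small numeric check of what StandardScaler does: after fit_transform, each feature column has (approximately) zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]
```

This keeps features on very different scales (like the two columns above) from dominating downstream models.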
Dimensionality reduction and visualization
Reduce the feature dimensionality, for example with sklearn.decomposition.PCA:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_features = pca.fit_transform(features.numpy())
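A usage sketch of PCA on synthetic features: the output always has n_components columns, and explained_variance_ratio_ reports how much of the variance each kept component preserves:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 10))  # 50 samples, 10-dim features

pca = PCA(n_components=2)
reduced = pca.fit_transform(features)
print(reduced.shape)                        # (50, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of total variance kept
```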
Then plot scatter plots or heatmaps of the reduced features with matplotlib.
Complete example
import torch
import numpy as np
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
# 1. Data preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# 2. Load data
train_dataset = datasets.ImageFolder('path/to/train', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# 3. Feature extraction (ResNet as an example)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()
features = []
labels = []
for images, label in train_loader:
    with torch.no_grad():
        output = model(images)  # forward pass to extract features
    features.append(output.numpy())
    labels.append(label.numpy())
features = np.concatenate(features, axis=0)
labels = np.concatenate(labels, axis=0)
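Once extracted, the feature arrays can be persisted with NumPy and reloaded later without rerunning the model. A sketch (a temporary directory stands in for a real output path):

```python
import numpy as np
import tempfile, os

features = np.random.randn(100, 2048).astype(np.float32)
labels = np.random.randint(0, 10, size=100)

out_dir = tempfile.mkdtemp()  # use a real output directory in practice
np.save(os.path.join(out_dir, 'features.npy'), features)
np.save(os.path.join(out_dir, 'labels.npy'), labels)

restored = np.load(os.path.join(out_dir, 'features.npy'))
print(np.array_equal(restored, features))  # True
```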
Notes: for large datasets, extract features in batches (tune DataLoader's batch_size parameter) to avoid running out of memory, and save the extracted features as .npy or .csv files for reuse in downstream model training.