To do natural language processing with PyTorch on Ubuntu, follow these steps:
Install dependencies
Install PyTorch and the commonly used companion libraries:
pip install torch torchtext spacy transformers nltk
python -m spacy download en_core_web_sm  # English tokenization model
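A quick sanity check after installing (a minimal sketch; prints the PyTorch version and whether a CUDA GPU is visible):
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA GPU can be used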
Data preprocessing
Load a dataset (e.g. IMDB or your own text) with torchtext or transformers, then tokenize it and build a vocabulary:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
# Example: build the vocabulary from a (label, text) iterator
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])  # map out-of-vocabulary tokens to <unk>
Optionally, load pretrained GloVe word vectors; torchtext ships a downloader, so no extra package beyond the ones installed above is needed:
from torchtext.vocab import GloVe
glove = GloVe(name="6B", dim=100)
word_vector = glove["word"]  # look up the 100-dimensional vector for a token
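Before batching, variable-length token sequences must be numericalized and padded into a single tensor. A minimal collate-function sketch to pass to DataLoader later (collate_batch is a hypothetical helper; it assumes each sample is a (label, text) pair and that padding index 0 is acceptable):
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    labels, seqs = [], []
    for label, text in batch:
        labels.append(label)
        seqs.append(torch.tensor(vocab(tokenizer(text)), dtype=torch.long))
    texts = pad_sequence(seqs, batch_first=True, padding_value=0)  # pad to the batch max length
    return texts, torch.tensor(labels, dtype=torch.long)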
Build the model
Common choices are recurrent networks such as LSTM and GRU. A simple LSTM classifier:
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))     # classify from the final hidden state
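A quick shape check (hypothetical sizes, just to confirm the forward pass runs):
import torch
clf = TextClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, output_dim=2)
dummy = torch.randint(0, 10000, (64, 50))  # batch of 64 sequences of length 50
print(clf(dummy).shape)                    # torch.Size([64, 2])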
Training and evaluation
Define a loss function (e.g. nn.CrossEntropyLoss) and an optimizer (e.g. Adam), then run the training loop:
import torch
from torch.utils.data import DataLoader

# Data loading (collate_batch from the sketch above pads each batch)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=collate_batch)

# Model initialization
model = TextClassifier(vocab_size, embed_dim=128, hidden_dim=256, output_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(5):
    for texts, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
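To evaluate, switch the model to eval mode and compute accuracy on held-out data (a minimal sketch; val_loader is an assumed DataLoader built the same way as train_loader):
model.eval()
correct = total = 0
with torch.no_grad():
    for texts, labels in val_loader:
        preds = model(texts).argmax(dim=1)        # predicted class per sample
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"val accuracy: {correct / total:.3f}")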
Advanced tooling
Load a pretrained model (e.g. BERT) with the transformers library:
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
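A minimal inference sketch with this tokenizer/model pair (the input sentence is a made-up example):
inputs = tokenizer("PyTorch on Ubuntu works well.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, num_labels)
print(logits.argmax(dim=-1).item())   # predicted class id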
For multi-GPU training, scale out with torch.nn.DataParallel or, preferably, DistributedDataParallel via torch.distributed.
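A minimal single-machine sketch (DataParallel splits each input batch across all visible GPUs; DistributedDataParallel requires a process-group launch and is the recommended route for serious workloads):
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model across visible GPUs
model = model.to("cuda")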