To get started with PyTorch-based natural language processing on Ubuntu, follow these steps:
Install dependencies
Install PyTorch and the common NLP libraries:
pip install torch torchtext spacy transformers nltk
python -m spacy download en_core_web_sm  # English tokenization model
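To confirm the installation succeeded and check whether a CUDA GPU is visible:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"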
Data preprocessing
Load a dataset (such as IMDB or your own text) with torchtext or transformers, then tokenize it and build a vocabulary:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator  
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")  
# Example: build the vocabulary (train_iter is a (label, text) iterator,
# e.g. torchtext.datasets.IMDB(split="train"))
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])  # map out-of-vocabulary tokens to <unk>
Pretrained word vectors such as GloVe can be loaded directly from torchtext (already installed above):
from torchtext.vocab import GloVe
glove = GloVe(name="6B", dim=100)
word_vector = glove["word"]  # look up a single word's vector
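A common follow-up, sketched here under the assumption that vocab is the vocabulary built above, is to initialize the model's embedding layer from these vectors:
import torch.nn as nn
# Stack the GloVe vectors in vocabulary index order (unknown words get zero vectors)
pretrained = glove.get_vecs_by_tokens(vocab.get_itos())  # shape: (len(vocab), 100)
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)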
Build the model
Common choices include LSTM and GRU; for example, a simple LSTM classifier:
import torch.nn as nn
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    def forward(self, x):
        # x: (batch, seq_len) token ids
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))     # (batch, output_dim)
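A quick shape check with random token ids (all dimensions here are illustrative):
import torch
m = TextClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, output_dim=2)
dummy = torch.randint(0, 10000, (8, 20))  # batch of 8 sequences, 20 tokens each
print(m(dummy).shape)                     # torch.Size([8, 2])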
Training and evaluation
Define a loss function (e.g. nn.CrossEntropyLoss) and an optimizer (e.g. Adam), then run the training loop:
import torch
from torch.utils.data import DataLoader
# Data loading (train_data: a dataset of (token ids, label) pairs; batching
# variable-length text needs padding, see the collate_fn sketched below)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
# Model initialization (vocab_size = len(vocab) from the preprocessing step)
model = TextClassifier(vocab_size, embed_dim=128, hidden_dim=256, output_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Training loop
for epoch in range(5):
    for texts, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
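A minimal sketch of the padding collate_fn referenced above, plus an accuracy loop; it assumes each item in train_data is a (token-id list, label) pair and that a held-out test_loader exists:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_batch(batch):
    texts, labels = zip(*batch)
    texts = [torch.tensor(t, dtype=torch.long) for t in texts]
    # pad to the longest sequence in the batch; id 0 is used as padding here,
    # in practice add a dedicated "<pad>" special to the vocabulary
    texts = pad_sequence(texts, batch_first=True, padding_value=0)
    return texts, torch.tensor(labels, dtype=torch.long)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True,
                          collate_fn=collate_batch)

# Evaluation: accuracy over a held-out loader
model.eval()
correct = total = 0
with torch.no_grad():
    for texts, labels in test_loader:
        preds = model(texts).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"accuracy: {correct / total:.3f}")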
Advanced tools
Use the transformers library to load pretrained models (such as BERT):
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')  
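A minimal inference sketch with the pair just loaded (the example sentence is illustrative, and the classification head is randomly initialized until fine-tuned):
import torch
inputs = tokenizer("This movie was great!", return_tensors="pt",
                   truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class id (meaningful only after fine-tuning)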
For multi-GPU training, scale out with torch.nn.DataParallel or torch.distributed.
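A single-machine multi-GPU sketch using DataParallel (torch.distributed's DistributedDataParallel is the recommended option for serious multi-GPU training):
import torch
import torch.nn as nn
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model across all visible GPUs
model = model.to("cuda")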