The steps for doing natural language processing with PyTorch on Linux are as follows:
Install the base environment
On Ubuntu/Debian (use the equivalent command for your distribution):
sudo apt-get install python3 python3-pip
Create and activate a virtual environment:
python3 -m venv myenv
source myenv/bin/activate
Install PyTorch and NLP libraries
CPU-only install:
pip install torch torchvision torchaudio
GPU install (replace cu118 with your actual CUDA version):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Install the NLP libraries, then download the spaCy English model:
pip install transformers torchtext spacy
python -m spacy download en_core_web_sm
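After installing, a quick sanity check confirms that PyTorch imports and reports whether a CUDA GPU is visible (a minimal sketch, assuming the environment above is active):

```python
import torch

# Print the installed PyTorch version string.
print(torch.__version__)

# True only if a CUDA build of PyTorch is installed and a GPU is visible.
print(torch.cuda.is_available())
```

If this prints False on a GPU machine, the CPU-only wheel was likely installed; reinstall with the CUDA index URL above.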
Data preprocessing
Use a tokenizer from the transformers library (e.g. BERT's) to process text:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer("Hello, world!", return_tensors='pt')  # return PyTorch tensors
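The returned `encoded` is a dict-like object whose main fields are `input_ids` and `attention_mask`. As a library-free illustration of that structure, here is a toy whitespace "tokenizer" (hypothetical, not the real BERT WordPiece vocabulary) producing the same fields as plain lists:

```python
def toy_encode(text, vocab):
    # Split on whitespace instead of WordPiece; unknown words map to id 1.
    ids = [vocab.get(tok, 1) for tok in text.lower().split()]
    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),  # 1 = real token, 0 = padding
    }

vocab = {"hello,": 7, "world!": 8}
enc = toy_encode("Hello, world!", vocab)
print(enc)  # {'input_ids': [7, 8], 'attention_mask': [1, 1]}
```

The real tokenizer additionally inserts special tokens such as [CLS] and [SEP] and returns tensors when `return_tensors='pt'` is passed.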
Use torchtext to build a vocabulary and batch the data (the IMDB dataset as an example):
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")  # requires the spaCy model downloaded above
train_iter, test_iter = IMDB(split=("train", "test"))
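What the vocabulary step does can be sketched in plain Python: count tokens over the corpus, assign ids by frequency, and numericalize each example. The `<pad>`/`<unk>` ids here are conventions assumed by this sketch, not torchtext's API:

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1):
    # Reserve 0 for padding and 1 for unknown tokens.
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def numericalize(tokens, vocab):
    # Map each token to its id; unseen tokens fall back to <unk>.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

corpus = [["a", "great", "film"], ["a", "dull", "film"]]
vocab = build_vocab(corpus)
print(numericalize(["a", "great", "movie"], vocab))  # → [2, 4, 1]
```

The resulting id lists are what get turned into the integer tensors the model's embedding layer consumes.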
Build the model
A simple LSTM text classifier:
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_class)

    def forward(self, text):
        embedded = self.embedding(text)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))     # (batch, num_class)
Alternatively, use a pretrained BERT model directly:
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
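Batches fed to the LSTM classifier must be rectangular, so variable-length id sequences are padded to the batch's longest sequence first. A minimal pure-Python sketch of that collate step (pad id 0 matches the `<pad>` convention assumed earlier):

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[5, 3, 9], [7, 2]])
print(batch)  # [[5, 3, 9], [7, 2, 0]]
```

Wrapping the padded lists in torch.tensor gives the (batch, seq_len) integer tensor the batch_first LSTM expects.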
Train and evaluate
import torch

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    for texts, labels in train_loader:  # train_loader: a DataLoader yielding (texts, labels) batches
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
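What nn.CrossEntropyLoss computes per example can be reproduced in plain Python: a softmax over the logits followed by the negative log-probability of the true class (a sketch of the math, not the library's numerically stabilized implementation):

```python
import math

def cross_entropy(logits, target):
    # Softmax: exponentiate the logits and normalize to probabilities...
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # ...then take the negative log-likelihood of the target class.
    return -math.log(probs[target])

loss = cross_entropy([2.0, 0.5], target=0)
print(round(loss, 4))  # ≈ 0.2014
```

A confident correct prediction (large logit on the target class) gives a loss near 0; a uniform prediction over two classes gives log 2.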
Save and load the model
# Save (save_pretrained applies to transformers models; for the custom LSTM use torch.save(model.state_dict(), path))
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')

# Load
from transformers import BertForSequenceClassification, BertTokenizer
model = BertForSequenceClassification.from_pretrained('./my_model')
tokenizer = BertTokenizer.from_pretrained('./my_model')