To do natural language processing with PyTorch on Ubuntu, follow the steps below. First, install Python and set up a virtual environment:
sudo apt update
sudo apt install python3 python3-pip python3-venv
python3 -m venv pytorch_env
source pytorch_env/bin/activate
pip install torch torchvision torchaudio  # CPU build
# For GPU support, install the wheel that matches your CUDA version (see pytorch.org for the exact command)
python -c "import torch; print(torch.__version__)"
For text preprocessing, use the torchtext or transformers libraries. For example, tokenizing IMDB reviews and building a vocabulary with torchtext:

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import IMDB

tokenizer = get_tokenizer('spacy', language='en_core_web_sm')  # requires spacy plus the English model (python -m spacy download en_core_web_sm)
train_iter = IMDB(split='train')  # yields (label, text) pairs
vocab = build_vocab_from_iterator((tokenizer(text) for _, text in train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
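As a quick sanity check, the vocab maps a token list to integer ids (the sentence here is arbitrary):

ids = vocab(tokenizer("This movie was surprisingly good"))
print(ids)  # a list of ints; out-of-vocabulary tokens map to the <unk> index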
Next, define a simple LSTM classifier with torch.nn:

import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.embedding(x)          # (batch, seq_len) -> (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(x)  # final hidden state: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))

model = TextClassifier(len(vocab), 100, 256, 2)  # binary classification example
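A quick shape check with a dummy batch (the sizes are arbitrary):

import torch
dummy = torch.randint(0, len(vocab), (4, 20))  # batch of 4 sequences, 20 token ids each
print(model(dummy).shape)                      # torch.Size([4, 2])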
Now set up training with a standard PyTorch loop:

import torch.optim as optim
from torch.utils.data import DataLoader

The DataLoader needs a collate_batch function that turns raw (label, text) pairs into padded tensors. A minimal sketch, reusing the vocab and tokenizer from above:
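from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    labels, texts = [], []
    for label, text in batch:
        labels.append(int(label) - 1)  # assumption: IMDB labels are 1/2; map to 0/1 (may vary by torchtext version)
        texts.append(torch.tensor(vocab(tokenizer(text)), dtype=torch.int64))
    # Pad with the <unk> index, since no dedicated <pad> special was added above;
    # adding a '<pad>' token to the vocab would be cleaner.
    padded = pad_sequence(texts, batch_first=True, padding_value=vocab['<unk>'])
    return padded, torch.tensor(labels, dtype=torch.int64)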
# Rebuild the dataset here: the train_iter above was consumed while building the vocab.
train_data = list(IMDB(split='train'))
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=collate_batch)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
for epoch in range(5):
    model.train()
    for texts, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
model.eval()
correct, total = 0, 0
test_loader = DataLoader(list(IMDB(split='test')), batch_size=64, collate_fn=collate_batch)
with torch.no_grad():
    for texts, labels in test_loader:
        outputs = model(texts)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)  # count samples; without this the division below fails
        correct += (predicted == labels).sum().item()
accuracy = correct / total
print(f"Accuracy: {accuracy:.4f}")
To use pretrained models such as BERT, load them with the transformers library:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
To use a GPU when available, move both the model and the tokenized inputs to the same device:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
inputs = inputs.to(device)  # inputs comes from the tokenizer, as in the sketch below
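Putting it together, a minimal inference sketch (the sample sentence is illustrative, and the classification head is randomly initialized until the model is fine-tuned):

inputs = tokenizer("This movie was great!", return_tensors='pt').to(device)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 2)
print(logits.argmax(dim=-1).item())  # predicted class id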
Library roles at a glance:
- torchtext: text preprocessing, vocabulary construction, and batching.
- transformers: pretrained models (such as BERT and GPT) and their tokenizers.
- torch.nn: model building blocks (Embedding layers, LSTM, and so on).

With these steps you can run natural language processing tasks with PyTorch on Ubuntu, from models trained from scratch to pretrained transformers.