A Checklist for Diagnosing and Optimizing Slow PyTorch on Debian
1. Quickly locate the bottleneck first
- Confirm the GPU is actually in use: torch.cuda.is_available(); print the device name and allocated memory with torch.cuda.get_device_name(0) and torch.cuda.memory_allocated(0) (see the snippet after this list).
- Watch GPU utilization live: watch -n 1 nvidia-smi.
- Check CPU and memory: htop; combined resource view: dstat -c -m -y -p --top-io --top-bio.
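A minimal sketch combining these calls into one script (the torch.version.cuda line is an extra: it is None on CPU-only builds, a common cause of silently falling back to the CPU):

# Quick GPU visibility check
import torch

print(torch.__version__, torch.version.cuda)  # torch.version.cuda is None on CPU-only builds
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.memory_allocated(0) / 1024**2:.1f} MiB allocated")
else:
    print("CUDA not available -- running on CPU")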
2. Key points for faster GPU training
- Keep the GPU fed: DataLoader(num_workers>0, pin_memory=True, prefetch_factor>0).
- Copy tensors to the GPU asynchronously during training: tensor.to(device, non_blocking=True); wrap validation/inference in torch.no_grad().
- Mixed precision with torch.cuda.amp.autocast() + GradScaler; use gradient accumulation to simulate a larger batch; call optimizer.zero_grad(set_to_none=True) to reduce overhead. (Section 5 collects ready-to-use snippets for each of these.)
3. Optimizations for CPU or CPU-only runs
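The usual first lever for CPU runs is thread-count tuning; a minimal sketch, assuming thread oversubscription is the problem (the interop value of 2 is a starting point to benchmark, not a recommendation):

# CPU thread tuning (sketch; tune both values per machine)
import os
import torch

torch.set_num_threads(os.cpu_count() or 1)  # intra-op threads; os.cpu_count() reports logical cores
torch.set_num_interop_threads(2)            # inter-op threads; must be set before parallel work starts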
4. Debian system-level optimizations
5. Ready-to-use optimization code snippets
# Device and cuDNN benchmark setup
import torch
import torch.backends.cudnn as cudnn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cudnn.benchmark = True  # enable only when input sizes are fixed; cuDNN then auto-tunes convolution algorithms
# DataLoader optimization (dataset assumed defined elsewhere)
from torch.utils.data import DataLoader

loader = DataLoader(dataset,
                    batch_size=256,
                    shuffle=True,
                    num_workers=4,      # tune to the number of CPU cores
                    pin_memory=True,    # page-locked memory speeds up host-to-GPU copies
                    prefetch_factor=2)  # batches prefetched per worker
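If you rebuild the DataLoader every epoch, adding persistent_workers=True (available since PyTorch 1.7, only meaningful with num_workers > 0) keeps the worker processes alive across epochs and avoids repeated startup cost.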
# AMP + gradient accumulation (model, criterion, optimizer assumed defined)
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accum_steps = 4
optimizer.zero_grad(set_to_none=True)  # start the epoch with clean gradients

for i, (x, y) in enumerate(loader):
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    with autocast():
        out = model(x)
        loss = criterion(out, y)
    scaler.scale(loss / accum_steps).backward()  # scale so the accumulated step matches one large batch
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
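One caveat with the loop above (an assumption about your epoch length): if the number of batches is not a multiple of accum_steps, the last partial accumulation never reaches scaler.step(). A flush after the loop handles it:

# Flush gradients left over from an incomplete accumulation window
# (reuses i from the loop above; assumes the loop ran at least once)
if (i + 1) % accum_steps != 0:
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)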
# Validation phase
model.eval()  # disable dropout and use running BatchNorm statistics
with torch.no_grad():  # no autograd graph: less memory, faster
    for x, y in val_loader:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        out = model(x)
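If training continues after validation, switch the model back to training mode; otherwise dropout and BatchNorm keep their eval-time behavior:

# Restore training-mode behavior (dropout, BatchNorm) before the next epoch
model.train()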