PyTorch在Ubuntu上如何优化 - 问答

Ubuntu上优化PyTorch的实用清单

一环境配置与驱动

确认GPU可被识别并支持CUDA：运行nvidia-smi，检查驱动版本、CUDA版本与GPU型号。
安装匹配的NVIDIA驱动、CUDA Toolkit与cuDNN，三者版本需与PyTorch构建版本一致；安装后将CUDA路径加入环境变量（如将**/usr/local/cuda-/bin与/usr/local/cuda-/lib64加入PATH/LD_LIBRARY_PATH**）。
使用conda或venv隔离环境，避免依赖冲突；安装与CUDA版本匹配的PyTorch GPU包（pip/conda均可）。
验证安装：
- Python中执行：
  - import torch; print(torch.cuda.is_available())
  - print(torch.cuda.current_device(), torch.cuda.get_device_name(0))
- 终端执行：nvidia-smi 观察GPU占用与显存情况。

二训练与推理加速

混合精度训练：使用torch.cuda.amp减少显存占用并提升吞吐。示例：
- scaler = torch.cuda.amp.GradScaler()
- with torch.cuda.amp.autocast(): outputs = model(inputs); loss = criterion(outputs, targets)
- scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()
数据加载优化：在DataLoader中设置num_workers > 0、开启pin_memory=True、合理增大batch_size与prefetch_factor，并使用更快的图像解码（如turbojpeg/jpeg4py）。
并行训练：单机多卡优先使用DistributedDataParallel（DDP）；如使用DataParallel仅作过渡。
算子与库优化：启用torch.backends.cudnn.benchmark = True以自动选择最优卷积算法；若需可复现性，设置torch.backends.cudnn.deterministic = True（可能牺牲部分性能）。
CPU线程与绑定：通过**torch.set_num_threads(N)**设置PyTorch使用的CPU线程数，避免与系统其他服务争用。
前沿硬件加速：在NVIDIA H100/Ada/Habana Gaudi2等硬件上，可使用FP8与Transformer Engine获得更高吞吐与能效。

三系统级与存储优化

四性能分析与监控

使用torch.profiler定位瓶颈，并结合TensorBoard插件可视化：
- with torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof: …
- prof.export_chrome_trace(“trace.json”) 或配合TensorBoard查看时间线与调用栈。
实时监控：
- GPU：nvidia-smi -l 1（或nvidia-smi dmon）观察显存、利用率、功耗与温度。
- CPU/IO：htop、iostat -x 1 检查负载、上下文切换与磁盘IO。

五模型压缩与部署

量化：
- 动态量化（推理）：model.eval(); model.qconfig = torch.quantization.get_default_qconfig(‘fbgemm’); …
- 静态量化：准备-校准-转换流程，精度-性能权衡需评估。
剪枝：使用torch.nn.utils.prune按层/结构进行稀疏化（如 prune.random_unstructured(module, name=“weight”, amount=0.2)）。
知识蒸馏：以大模型（教师）引导小模型（学生）学习，兼顾精度与延迟。
部署导出：
- ONNX：torch.onnx.export(model, dummy_input, “model.onnx”)
- TorchScript：scripted = torch.jit.script(model); scripted.save(“model.pt”)
精度回归：量化/剪枝/蒸馏后务必在验证集与关键指标上做A/B对比，确保业务指标不退化。

0 赞

0 踩