To debug a PyTorch program on Ubuntu, you can work through the following steps:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n pytorch_env python=3.8
conda activate pytorch_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
import torch
print(torch.__version__, torch.cuda.is_available())  # check the installed version and whether CUDA is available
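The version check above can be extended into a small report that is handy to paste into a bug report. This is only a sketch: the function name `describe_environment` and the dictionary keys are illustrative choices, not part of PyTorch. It reports PyTorch as missing rather than crashing if the package is not installed in the active environment.

```python
import importlib.util
import platform


def describe_environment():
    """Collect basic facts about the Python and PyTorch installation."""
    info = {
        "python": platform.python_version(),
        "torch": None,            # stays None if PyTorch is not installed
        "cuda_available": False,
        "devices": [],
    }
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not importable in this environment

    import torch

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        # List the name of every GPU that PyTorch can see.
        info["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False on a machine with an NVIDIA GPU, a CPU-only build of PyTorch or a driver problem is a likely cause.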
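Insert import pdb; pdb.set_trace() at the point where execution should pause. Once the program stops, use commands such as n (next line), s (step into a function), and c (continue).
Use the logging module to record key information.
Use print() to output variable values (suitable for simple cases).
Use TensorBoard to visualize training metrics:
from torch.utils.tensorboard import SummaryWriter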
writer = SummaryWriter('runs/experiment')
writer.add_scalar('Loss/train', loss.item(), epoch)
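For the logging-based approach, a minimal sketch using only the standard library might look like the following. The helper `log_step` and the logger name `train` are hypothetical names chosen for illustration. The loss is assumed to be a plain Python float, for example the result of `loss.item()`.

```python
import logging
import math

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("train")  # hypothetical logger name


def log_step(epoch, loss):
    """Log one training step; return False when the loss is NaN or infinite."""
    if not math.isfinite(loss):
        logger.warning("epoch=%d loss is not finite (%s)", epoch, loss)
        return False
    logger.info("epoch=%d loss=%.4f", epoch, loss)
    return True
```

Unlike print(), the log level can be raised to WARNING later to silence the per-step messages without editing the training loop.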
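torch.autograd.set_detect_anomaly(True): detects gradient anomalies, such as NaN values produced during the backward pass.
torch.autograd.profiler: profiles computational performance.
unittest or pytest: write test cases to verify that each module works as intended.
torch.cuda.amp: mixed precision reduces memory usage and speeds up computation.
torch.distributed: when using this module, check process synchronization and communication.
nvidia-smi: check the driver status.
torch.utils.checkpoint: applies gradient checkpointing to reduce memory usage.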
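To illustrate the anomaly detection entry, here is a small sketch that provokes a NaN gradient on purpose. The function name `backward_with_anomaly_check` and the toy computation are made up for this example. It relies on `torch.autograd.set_detect_anomaly(True)` being usable as a context manager, and on the gradient of `sqrt` at zero multiplied by a zero upstream gradient evaluating to 0/0.

```python
import torch


def backward_with_anomaly_check(x):
    """Run a backward pass under anomaly detection.

    Returns the error message raised by autograd, or None if the
    backward pass completed without producing NaN values.
    """
    with torch.autograd.set_detect_anomaly(True):
        # sqrt has an infinite derivative at 0; multiplying by 0.0 makes
        # the incoming gradient zero, so the backward pass computes 0 / 0.
        y = (torch.sqrt(x) * 0.0).sum()
        try:
            y.backward()
        except RuntimeError as err:
            return str(err)
    return None


if __name__ == "__main__":
    x = torch.tensor([0.0, 4.0], requires_grad=True)
    print(backward_with_anomaly_check(x))
```

Anomaly detection slows training noticeably, so it is usually enabled only while hunting for the operation that produces the NaN.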