Common PyTorch Problems on Ubuntu and How to Troubleshoot Them
1 Installation and Version Compatibility
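The advice in this section about not mixing apt and pip installs, and about preferring virtual environments, can be checked from Python itself. The following is a minimal sketch, not an official tool: the helper names in_virtualenv and looks_like_apt_install are hypothetical, and the heuristic assumes the Debian/Ubuntu convention that apt-managed Python packages live in a dist-packages directory under /usr/lib.

```python
import importlib.util
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv (prefix differs from base_prefix)."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


def looks_like_apt_install(module_path):
    """Heuristic (assumption): apt-managed packages sit under /usr/lib/.../dist-packages."""
    return module_path.startswith("/usr/lib/") and "/dist-packages/" in module_path


print("inside a virtual environment:", in_virtualenv())

# find_spec locates torch without importing it, so this also works when torch is absent.
spec = importlib.util.find_spec("torch")
if spec is None or spec.origin is None:
    print("torch is not installed for this interpreter")
else:
    print("torch found at:", spec.origin)
    print("looks apt-managed:", looks_like_apt_install(spec.origin))
```

If the last line reports an apt-managed copy while pip also lists torch, that is the mixed-install situation described in this section.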
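Upgrade pip first: python -m pip install --upgrade pip
If an old PyTorch was installed through apt, installing a newer one with pip on top of it can cause conflicts or attribute errors. Use pip consistently and remove the apt version first: sudo apt remove python3-torch (the Debian/Ubuntu package name; confirm what is installed with dpkg -l | grep -i torch).
Run nvidia-smi to see the driver and the CUDA version it supports, then choose the matching install command from the PyTorch website.
A PyPI mirror can also be used, for example: pip install torch torchvision torchaudio -i https://pypi.tuna.tsinghua.edu.cn/simple/
Avoid sudo pip, which pollutes the system environment; install and manage dependencies inside a virtual environment instead.

2 GPU and Drivers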
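For the driver and CUDA version checks in this section, the comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions: the sample header line imitates the first row of nvidia-smi output, the helper names are hypothetical, and the rule "driver-supported CUDA version must be at least the PyTorch build's CUDA version" is deliberately conservative (CUDA minor-version compatibility can relax it within one major release).

```python
import re


def parse_driver_cuda(smi_output):
    """Return (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_version(text):
    """Turn a string such as '11.8' (the format of torch.version.cuda) into (11, 8)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def driver_supports_build(driver_cuda, build_cuda):
    """Conservative check: the driver's CUDA version must be >= the build's CUDA version."""
    return driver_cuda >= build_cuda


# Illustrative sample of the nvidia-smi header row, not output from a real machine.
sample_header = "| NVIDIA-SMI 535.104.05    Driver Version: 535.104.05    CUDA Version: 12.2 |"

driver = parse_driver_cuda(sample_header)
print(driver)                                                # (12, 2)
print(driver_supports_build(driver, parse_version("11.8")))  # True
print(driver_supports_build(driver, parse_version("12.4")))  # False
```

On a real machine, the text would come from running nvidia-smi and the build version from torch.version.cuda, which is None for CPU-only builds.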
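Symptom: torch.cuda.is_available() returns False.
First install the NVIDIA driver that matches your GPU, then install a PyTorch build for a CUDA version that driver supports.
Check that nvcc --version, the driver version, and PyTorch's CUDA version are compatible. The driver must support at least the CUDA version PyTorch was built against (torch.version.cuda); the local toolkit reported by nvcc matters mainly when compiling extensions or building from source.
Set PATH and LD_LIBRARY_PATH in ~/.bashrc, for example:
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Then run source ~/.bashrc to apply the change.

3 Dependencies and Runtime Errors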
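One issue covered in this section, python and pip resolving to different environments, can be detected with the standard library alone. The sketch below is a simple heuristic rather than a complete check: same_bin_dir is a hypothetical helper that only compares the directories of the two executables, and it assumes pip is looked up on PATH under the name pip.

```python
import os
import shutil
import sys


def same_bin_dir(python_path, pip_path):
    """Return True when both executables live in the same directory (e.g. a venv's bin/)."""
    if not python_path or not pip_path:
        return False
    python_dir = os.path.dirname(os.path.abspath(python_path))
    pip_dir = os.path.dirname(os.path.abspath(pip_path))
    return python_dir == pip_dir


python_path = sys.executable
pip_path = shutil.which("pip")  # None when no pip is found on PATH

print("python:", python_path)
print("pip:   ", pip_path)
if same_bin_dir(python_path, pip_path):
    print("python and pip come from the same directory")
else:
    print("mismatch: prefer 'python -m pip install ...' so pip follows the interpreter")
```

Invoking pip as python -m pip sidesteps the mismatch, because the package is then installed for exactly the interpreter that runs the command.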
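ImportError (for example from image or multimedia packages) is often caused by missing system libraries. Install them with: sudo apt update && sudo apt install -y libgl1 libglib2.0-0 ffmpeg
If which python and which pip point to different environments, ModuleNotFoundError is likely. Install inside a virtual environment: python3 -m venv ~/pytorch_env && source ~/pytorch_env/bin/activate
If a DataLoader with num_workers>0 raises errors (such as "already started") or hangs, set num_workers to 0 first to isolate the problem; in multi-GPU or remote environments, make sure worker subprocesses are launched correctly.
For TensorBoard, run conda install tensorboard or pip install tensorboard, then view logs with tensorboard --logdir log --port 6006.
For device mismatches, move tensors and the model to the same device: tensor = tensor.to(device); model = model.to(device)
Mixing float and double can break training; use one dtype throughout (for example torch.float32).

4 Source Builds and Advanced Scenarios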
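For the cuDNN header check in this section, a short script can locate cudnn_version.h and read the version it declares. This is a sketch under assumptions: the search directories (/usr/local/cuda/include, /usr/include, and $CUDA_HOME/include) are common locations rather than guaranteed ones, and the function names are hypothetical. The CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL macros are the ones the header defines.

```python
import os
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the #define lines of cudnn_version.h, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+" + name + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def find_cudnn_header(include_dirs):
    """Return the first existing cudnn_version.h in the given directories, or None."""
    for directory in include_dirs:
        candidate = os.path.join(directory, "cudnn_version.h")
        if os.path.isfile(candidate):
            return candidate
    return None


# Assumed search locations; adjust them to the local CUDA installation.
candidates = ["/usr/local/cuda/include", "/usr/include"]
cuda_home = os.environ.get("CUDA_HOME")
if cuda_home:
    candidates.insert(0, os.path.join(cuda_home, "include"))

header = find_cudnn_header(candidates)
if header is None:
    print("cudnn_version.h not found in:", candidates)
else:
    with open(header) as handle:
        print(header, "->", parse_cudnn_version(handle.read()))
```

If the header exists elsewhere on the system, that is the case where a symlink into the CUDA include directory is needed.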
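If git submodule update --init --recursive fails, try a different network, use a mirror, or fetch the submodules step by step.
Make sure cudnn_version.h is on the CUDA include path; create a symlink if necessary.
Check the libstdc++.so.6 version; if necessary, replace it or symlink a suitable version inside the conda environment.
Before rebuilding, clear old artifacts with pip uninstall torch and python setup.py clean.

5 Quick Self-Check and Diagnosis Checklist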
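python -c "import torch; print(torch.__version__)"
python -c "import torch; print('CUDA:', torch.cuda.is_available())"
nvidia-smi (the top-right corner shows the highest CUDA version the driver supports)
nvcc --version (the installed Toolkit version)
which python and which pip: do they point to the same virtual environment?
pip list | grep torch: confirms the installed packages and their versions
ldd "$(python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cuda.so"))')" | grep cudart: checks whether the CUDA runtime library resolves (CUDA builds only; libtorch_cuda.so is absent from CPU-only builds)
Check LD_LIBRARY_PATH, PATH, and CUDA_HOME, and run source ~/.bashrc after changing them
python -c "import torch; x=torch.randn(2,2); print(x.device)" (a minimal tensor test)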
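The checklist above can also be gathered in one pass. The script below is a rough sketch, not a replacement for the individual commands: collect_report is a hypothetical helper, it only records where the tools are found on PATH rather than running them, and it reports torch details only when torch can be imported.

```python
import importlib.util
import os
import shutil
import sys


def collect_report():
    """Gather interpreter, tool paths, environment variables, and torch details in a dict."""
    report = {
        "python": sys.executable,
        "pip": shutil.which("pip"),
        "nvidia-smi": shutil.which("nvidia-smi"),
        "nvcc": shutil.which("nvcc"),
    }
    for name in ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME"):
        report[name] = os.environ.get(name)

    if importlib.util.find_spec("torch") is None:
        report["torch"] = None  # torch is not installed for this interpreter
        return report

    try:
        import torch
    except Exception as exc:  # a broken install can fail with ImportError or OSError
        report["torch"] = f"import failed: {exc}"
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in collect_report().items():
    print(f"{key}: {value}")
```

A value of None for nvidia-smi or nvcc means the tool is not on PATH, which points back to the driver installation or the ~/.bashrc settings.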