To run distributed training with PyTorch on CentOS, follow these steps:
Install Python and PyTorch (use a CUDA-enabled build that matches your GPU driver):
pip install torch torchvision torchaudio
Install the dependencies, such as NCCL and mpi4py:
sudo yum install -y epel-release
sudo yum install -y libnccl-devel
pip install mpi4py
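Before moving on, it is worth confirming on every node that PyTorch can see the GPUs and that the collective backends are available. A quick sanity-check sketch using standard torch APIs (not part of the training script itself):
import torch
import torch.distributed as dist

print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available(), '| GPU count:', torch.cuda.device_count())
print('NCCL backend available:', dist.is_nccl_available())
print('MPI backend available:', dist.is_mpi_available())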
Configure passwordless SSH login between the nodes (for example, generate a key with ssh-keygen and copy it to each node with ssh-copy-id), so that one machine can start worker processes on the others.
Start the distributed training environment:
Use mpirun or torch.distributed.launch to start the distributed training. With torch.distributed.launch:
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py
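For reference, when a script is started through torch.distributed.launch (or the newer torchrun), the launcher exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE for each worker process, so the script can initialize from the environment instead of hard-coding a TCP address. A minimal sketch (recent PyTorch versions also set LOCAL_RANK; older launchers pass it to the script as a --local_rank argument instead):
import os
import torch
import torch.distributed as dist

def init_from_launcher():
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are provided by the launcher
    dist.init_process_group(backend='nccl', init_method='env://')
    # LOCAL_RANK selects which GPU this process should use on its node
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    return local_rank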
Write the distributed training script:
Use torch.distributed.init_process_group to initialize the distributed environment.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
def main(rank, world_size):
    torch.manual_seed(1234)
    # The global rank is used as the GPU index here, which assumes each rank
    # maps onto a local GPU (e.g., a single node with multiple GPUs)
    torch.cuda.set_device(rank)
    # Initialize the distributed environment
    torch.distributed.init_process_group(
        backend='nccl',
        init_method='tcp://<master_ip>:<master_port>',
        world_size=world_size,
        rank=rank
    )
    # Create the model, move it to this process's GPU, and wrap it in DDP
    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    # Create the loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    # Training loop on random data (replace with a real dataset in practice)
    for epoch in range(10):
        optimizer.zero_grad()
        inputs = torch.randn(20, 10).to(rank)
        labels = torch.randint(0, 10, (20,)).to(rank)
        outputs = ddp_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        print(f'Rank {rank}, Epoch {epoch}, Loss {loss.item()}')
    # Clean up the process group when training finishes
    torch.distributed.destroy_process_group()
if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--world_size', type=int, default=2)
    parser.add_argument('--rank', type=int, default=0)
    args = parser.parse_args()
    main(args.rank, args.world_size)
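In a real job, the random tensors in the training loop would come from a dataset, and each rank should see a different shard of it. A minimal sketch of that change using torch.utils.data.DistributedSampler (rank, world_size, ddp_model, criterion and optimizer are the names from the script above; the TensorDataset is a stand-in for your data):
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 10, (1000,)))
# The sampler partitions the dataset so that each rank trains on its own shard
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffles differently every epoch
    for inputs, labels in loader:
        inputs, labels = inputs.to(rank), labels.to(rank)
        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs), labels)
        loss.backward()
        optimizer.step()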
Run the distributed training:
Since the example script parses its own --world_size and --rank arguments and uses a TCP init_method, start it directly with python on each node (one process per node) and make sure world_size and rank are set correctly:
# Node 1
python YOUR_TRAINING_SCRIPT.py --world_size=2 --rank=0
# Node 2
python YOUR_TRAINING_SCRIPT.py --world_size=2 --rank=1
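If all GPUs are on a single machine, you can also skip SSH and per-node commands and spawn the worker processes from one entry point with torch.multiprocessing.spawn, reusing the main(rank, world_size) function defined above (this assumes the init_method address points at the local machine):
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # one worker process per local GPU
    # mp.spawn invokes main(rank, world_size) once for every rank in [0, world_size)
    mp.spawn(main, args=(world_size,), nprocs=world_size, join=True)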
With these steps, you can run PyTorch distributed training on CentOS.