How to Set Up a PyTorch Cluster on Ubuntu


Detailed steps for setting up a PyTorch cluster on Ubuntu

1. Hardware and environment preparation
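
Each node needs an NVIDIA GPU with a recent driver and the same CUDA-enabled PyTorch build installed in the same Python environment; matching PyTorch versions across nodes avoids hard-to-debug incompatibilities. A quick per-node sanity check, sketched below, confirms that the build sees the GPUs and supports NCCL, which the 'nccl' backend used later requires:

import torch
import torch.distributed as dist

# Per-node sanity check: verify the CUDA build, visible GPUs, and NCCL support
# required by the 'nccl' backend used in distributed_train.py.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())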

2. Network and SSH configuration
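
All nodes should sit on the same network and be able to resolve each other (static IPs or entries in /etc/hosts), and passwordless SSH from the master to the workers makes it easier to start jobs and synchronize code. Since the env:// rendezvous used later runs over TCP on the master node's MASTER_PORT, it is worth confirming that workers can actually reach that port through any firewall. The snippet below is only an illustrative sketch: run it on a worker while the rank-0 process is already waiting in init_process_group (or against any service known to listen on that port), with MASTER_ADDR and MASTER_PORT exported as in section 3.

import os
import socket

# Reachability check from a worker to the master's rendezvous port.
# MASTER_ADDR / MASTER_PORT are assumed to be exported as described in section 3.
addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
port = int(os.environ.get("MASTER_PORT", "29500"))
try:
    with socket.create_connection((addr, port), timeout=5):
        print(f"OK: {addr}:{port} is reachable")
except OSError as exc:
    print(f"FAILED: cannot reach {addr}:{port} ({exc})")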

3. Cluster parameter configuration
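
With init_method='env://', every process reads its rendezvous parameters from environment variables: MASTER_ADDR and MASTER_PORT (IP and port of the rank-0 node), WORLD_SIZE (total number of processes across all nodes), and RANK (the global index of this process); torchrun additionally exports LOCAL_RANK, which the script uses to pick its local GPU. The launcher in section 5 sets all of these automatically, so the snippet below is only a sketch to make the expected contract explicit:

import os

# Variables consumed by init_process_group(init_method='env://') plus LOCAL_RANK,
# which distributed_train.py uses to select the GPU. torchrun exports all of them.
expected = ["MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"]
for name in expected:
    print(f"{name} = {os.environ.get(name, '<not set>')}")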

4. Write the distributed training script

Create distributed_train.py. The core logic covers process group initialization, DDP model wrapping, and distributed data loading:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torchvision.datasets as datasets
import torchvision.transforms as transforms

def main():
    # Initialize the process group; with init_method='env://' the parameters
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are read from environment variables.
    dist.init_process_group(backend='nccl', init_method='env://')
    rank = dist.get_rank()                      # global rank of this process (0 .. WORLD_SIZE-1)
    local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node (set by torchrun)
    torch.cuda.set_device(local_rank)           # bind this process to its local GPU

    # Define the model and move it to this process's GPU
    model = nn.Sequential(
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 10)
    ).to(local_rank)
    model = DDP(model, device_ids=[local_rank])  # wrap the model for gradient synchronization

    # Loss function and optimizer
    criterion = nn.CrossEntropyLoss().to(local_rank)
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Load the dataset; DistributedSampler splits it across processes
    transform = transforms.Compose([transforms.ToTensor()])
    dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Training loop
    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle the data partitioning each epoch
        running_loss = 0.0
        for inputs, labels in loader:
            inputs, labels = inputs.to(local_rank), labels.to(local_rank)
            optimizer.zero_grad()
            outputs = model(inputs.view(-1, 784))
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        if rank == 0:  # only the master process prints logs
            print(f'Epoch {epoch+1}, Loss: {running_loss/len(loader)}')

    # Clean up the process group
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

5. Launch distributed training
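
The script is normally started with torchrun (equivalently, python -m torch.distributed.run), which spawns one process per GPU on each node and exports the environment variables from section 3. As an illustration, with two nodes you would run something like the following on the master node, and the same command with --node_rank=1 on the worker node, adjusting the placeholder values to your cluster:

torchrun --nnodes=2 --nproc_per_node=<GPUs per node> --node_rank=0 --master_addr=<master IP> --master_port=29500 distributed_train.py

Training only starts once all WORLD_SIZE processes (nnodes × nproc_per_node) have joined the rendezvous.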

6. Optional optimizations and monitoring
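
Typical refinements include automatic mixed precision (lower memory use and faster math on recent GPUs), gradient accumulation for larger effective batch sizes, and monitoring with nvidia-smi or by setting NCCL_DEBUG=INFO to trace communication problems. As one example, the inner training step from section 4 could be switched to mixed precision roughly as sketched below; model, criterion, optimizer, loader, and local_rank are assumed to be set up exactly as in distributed_train.py, and this is not part of the original script:

import torch

# Sketch: mixed-precision variant of the training step from distributed_train.py.
scaler = torch.cuda.amp.GradScaler()

for inputs, labels in loader:
    inputs, labels = inputs.to(local_rank), labels.to(local_rank)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass in mixed precision
        outputs = model(inputs.view(-1, 784))
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()          # scale the loss to avoid gradient underflow
    scaler.step(optimizer)                 # unscale gradients and apply the update
    scaler.update()                        # adapt the loss scale for the next step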

With the steps above, you can set up a PyTorch cluster on Ubuntu and run efficient distributed training. Note that the network configuration, number of GPUs, and batch size should be adjusted to your actual environment to get the best performance.
