What are the performance testing methods for PyTorch on Linux? - Q&A

Performance testing methods for PyTorch on Linux

1. Environment and baseline checks

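Confirm the software and hardware environment: check the PyTorch version, whether CUDA/ROCm is available, and whether the driver and library versions match.
Set thread and parallelism parameters: set OMP_NUM_THREADS and MKL_NUM_THREADS according to the number of CPU cores, so that thread contention does not make results fluctuate.
Quick connectivity check: run a minimal GPU sample to confirm the device can compute and synchronize correctly.
Recommended to record: CPU/GPU model, driver/CUDA version, PyTorch version, test time, whether cuDNN benchmark is enabled, thread count, and so on, so that results can be reproduced and compared.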

Example baseline script

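import time
import torch

print("PyTorch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))

    # Minimal GPU benchmark: in-place add on a 10M-element tensor
    N, iters = 10_000_000, 100
    x = torch.ones(N, device="cuda")
    torch.cuda.synchronize()  # finish allocation/initialization before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        x += 1
    torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
    print(f"GPU add {iters} iters: {time.perf_counter() - t0:.3f} s")
else:
    print("No CUDA device visible; skipping the GPU check.")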

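The environment check, thread settings, and minimal GPU sample above can be used directly to verify the installation and device availability, and they serve as the baseline for later tests.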
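The items recommended for recording above could be captured automatically next to each result. The following is a minimal sketch, not a standard tool: the function name `collect_env` and the dictionary keys are illustrative choices, and the PyTorch import is made optional only so the snippet also runs on a machine without PyTorch installed.

```python
import json
import os
import platform
from datetime import datetime, timezone

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def collect_env():
    """Gather settings worth storing alongside every benchmark result."""
    info = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        # None means the variable is unset and the library default applies
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "MKL_NUM_THREADS": os.environ.get("MKL_NUM_THREADS"),
    }
    if torch is not None:
        info["torch"] = torch.__version__
        info["torch_num_threads"] = torch.get_num_threads()
        info["cuda_build"] = torch.version.cuda  # None on CPU-only and ROCm builds
        info["cudnn_benchmark"] = torch.backends.cudnn.benchmark
        if torch.cuda.is_available():
            info["gpu"] = torch.cuda.get_device_name(0)
    return info


env = collect_env()
print(json.dumps(env, indent=2))
```

Driver versions are not exposed here; they would have to be recorded separately, for example from the output of `nvidia-smi`.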

2. Microbenchmarks for GPU and CPU operators

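GPU microbenchmark essentials: call torch.cuda.synchronize() right before starting the timer and right before stopping it. CUDA kernels launch asynchronously, so without synchronization the measured time comes out too short. Reuse tensors where possible and disable gradients to keep training overhead out of the measurement.
CPU microbenchmark essentials: fix OMP_NUM_THREADS/MKL_NUM_THREADS, and use timeit with many repetitions, taking the median to reduce the effect of system jitter.
Metrics to watch: time per iteration, throughput (for example images/s or tokens/s), GPU memory usage, and bandwidth utilization.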

Example GPU microbenchmark

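import time
import torch

def gpu_bench(N=10_000_000, iters=100, device="cuda"):
    x = torch.ones(N, device=device)
    with torch.no_grad():
        x = x + 1  # warm-up: the first launch carries one-time setup cost
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            x = x + 1  # out-of-place add: allocates a new tensor each iteration
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters  # s/iter

print("GPU add:", gpu_bench(), "s/iter")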

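This pattern is suited to locating operator- and kernel-level bottlenecks, and it provides reference numbers for model layers or custom kernels.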
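For the CPU side described above, a small helper built on `timeit.repeat` could take the median over many repetitions. This is only a sketch: `median_time` and `workload` are hypothetical names, and the pure-Python workload is a stand-in so the snippet runs anywhere.

```python
import statistics
import timeit


def median_time(fn, repeats=20, number=5):
    """Return (median seconds per call, all per-call samples) for a callable."""
    fn()  # warm-up call so one-time setup cost is not measured
    totals = timeit.repeat(fn, repeat=repeats, number=number)
    per_call = [t / number for t in totals]
    return statistics.median(per_call), per_call


def workload():
    # Stand-in CPU workload; replace with the operator under test
    return sum(i * i for i in range(50_000))


median_s, samples = median_time(workload)
print(f"median: {median_s * 1e3:.3f} ms/call over {len(samples)} repeats")
print(f"spread: min {min(samples) * 1e3:.3f} ms, max {max(samples) * 1e3:.3f} ms")
```

To time a PyTorch CPU operator, the same helper could be called with something like `lambda: torch.mm(a, b)` after fixing the thread count. PyTorch also ships its own timing utility, `torch.utils.benchmark.Timer`, which may be a better fit when thread count and warm-up need to be controlled together.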

3. Model-level benchmarks and profiling

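Standard model tests: use the PyTorch Benchmark project (pytorch/benchmark, also known as TorchBench) to run training/inference benchmarks on common models such as ResNet and BERT, which makes comparison across hardware and versions easier; it supports generating detailed reports and visualizations.
Training loop profiling: use torch.profiler to collect CPU/GPU activity, memory, and call stacks, and view the timeline, bottlenecks, and optimization hints in TensorBoard.
Serialization performance: run throughput tests on torch.save / torch.load to assess how much data I/O affects overall training/inference.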

Example: profiling a training loop

import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Linear(1024, 1024).cuda()
x = torch.randn(256, 1024, device="cuda")
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # 1 skipped step + 1 warm-up step + 3 recorded steps = 5 loop iterations
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=lambda p: p.export_chrome_trace("trace.json"),
    record_shapes=True, profile_memory=True
) as prof:
    for _ in range(5):
        with record_function("forward"):
            y = model(x)
        with record_function("backward"):
            y.sum().backward()
        with record_function("optim"):
            opt.step()
            opt.zero_grad()
        prof.step()  # advance the schedule; without this nothing is recorded
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Example: running PyTorch Benchmark (install it first and run from the project directory)

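# Install from the repo root first; the project's README documents `python install.py`
python run.py resnet50 -d cuda -t train
# Applies only if your run wrote TensorBoard logs to ./logs
tensorboard --logdir=./logs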

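These methods cover model-level throughput and bottleneck analysis, and they are suited to comparison experiments for version upgrades, parameter changes, and hardware migration.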
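The serialization throughput test mentioned above could be sketched as follows. The helper `serialization_throughput` is a hypothetical name, and the `pickle`-based save/load functions are stand-ins chosen only so the snippet runs without PyTorch; they are not what PyTorch uses internally.

```python
import os
import pickle
import tempfile
import time


def serialization_throughput(obj, save_fn, load_fn, path):
    """Time save_fn(obj, path) and load_fn(path); report MB/s for each."""
    t0 = time.perf_counter()
    save_fn(obj, path)
    save_s = time.perf_counter() - t0

    size_mb = os.path.getsize(path) / 1e6

    t0 = time.perf_counter()
    loaded = load_fn(path)
    load_s = time.perf_counter() - t0

    result = {
        "size_mb": size_mb,
        "save_mb_s": size_mb / save_s,
        "load_mb_s": size_mb / load_s,
    }
    return result, loaded


def pickle_save(obj, path):
    # Stand-in for torch.save(obj, path)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def pickle_load(path):
    # Stand-in for torch.load(path)
    with open(path, "rb") as f:
        return pickle.load(f)


payload = {"weights": [float(i) for i in range(200_000)]}
with tempfile.TemporaryDirectory() as tmp:
    result, restored = serialization_throughput(
        payload, pickle_save, pickle_load, os.path.join(tmp, "ckpt.bin")
    )
print(result)
```

With PyTorch installed, the same helper should accept `torch.save` and `torch.load` directly, since both take the file path in the position used here. Note that a load immediately after a save is usually served from the OS page cache, so the load figure tends to overstate cold-disk performance.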

4. System-level and multi-GPU distributed tests

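System monitoring: use htop, dstat, or Monitorix to watch system-level metrics such as CPU, memory, I/O, and network. This helps determine whether a performance anomaly comes from data loading, a CPU-bound stage, or system jitter.
Multi-GPU and distributed: use DistributedDataParallel (DDP) with NCCL for multi-GPU training tests to verify communication and scaling efficiency; make sure backend, init_method, world_size, rank, and DistributedSampler are set correctly.
Launching: start the multi-process training script with torchrun, or with the older torch.distributed.launch (deprecated in favor of torchrun). Keep the per-GPU batch size and the global batch size (per-GPU batch × number of processes) consistent across runs so that throughput can be converted and compared.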

Example DDP launch

# Option 1: legacy launcher (deprecated in favor of torchrun)
python -m torch.distributed.launch --nproc_per_node=2 train_ddp.py

# Option 2: current launcher
torchrun --nproc_per_node=2 train_ddp.py

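System monitoring and DDP tests help identify communication bottlenecks, load imbalance, and data pipeline problems; they are a necessary check before training at scale.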
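Scaling efficiency from a DDP test can be computed from step times once the batch sizes are fixed. The sketch below uses hypothetical function names, and the step times are made-up illustrative numbers, not measurements.

```python
def global_batch(per_gpu_batch, world_size):
    """Global batch size when every process uses the same per-GPU batch."""
    return per_gpu_batch * world_size


def throughput(batch_size, step_time_s):
    """Samples processed per second for one optimizer step."""
    return batch_size / step_time_s


def scaling_efficiency(multi_tp, single_tp, world_size):
    """1.0 means perfect linear scaling relative to the single-GPU run."""
    return multi_tp / (single_tp * world_size)


# Illustrative numbers only
per_gpu_batch = 64
single_tp = throughput(global_batch(per_gpu_batch, 1), step_time_s=0.100)
multi_tp = throughput(global_batch(per_gpu_batch, 2), step_time_s=0.110)
eff = scaling_efficiency(multi_tp, single_tp, world_size=2)

print(f"1 GPU : {single_tp:.1f} samples/s")
print(f"2 GPUs: {multi_tp:.1f} samples/s")
print(f"scaling efficiency: {eff:.3f}")
```

An efficiency well below 1.0 with otherwise identical settings points toward communication or data loading rather than compute.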

5. Recording and comparing results

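Fix the random seed and data order to ensure reproducibility; run several times under identical conditions and report the median and variance.
State the key parameters explicitly: batch size, seq_len/imgs, precision (FP32/FP16/BF16), num_workers, pin_memory, the cuDNN benchmark switch, thread count, and so on.
Use consistent units: express training throughput in samples/s or images/s and inference throughput in tokens/s or images/s, and report both single-GPU and multi-GPU figures.
Storing results: save profiler traces, logs, curves, and tables for later review and team discussion. When comparing, focus on key metrics such as throughput gain, per-step latency, GPU memory/bandwidth utilization, and GPU utilization.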
