Performance testing methods for PyTorch on Ubuntu - Q&A


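1. Basic performance benchmarking

Tool/method: Use PyTorch's built-in torch.utils.bottleneck tool as a first step in locating performance bottlenecks.
Steps:

Run: python -m torch.utils.bottleneck /path/to/source/script.py [args], where script.py is your training or inference script and [args] are the arguments it takes. The script must finish in a finite amount of time, so cut the workload down to a few iterations first.
Reading the results: The tool runs the script under Python's cProfile and under the PyTorch autograd profiler, then prints an environment summary followed by the two profiles. Use them to see whether time goes to Python-side code (such as data preprocessing) or to specific operators in the forward and backward passes. It does not break down memory usage; for that, see the profiler in section 3.
Use case: A quick first assessment of a script's overall performance that points to the parts worth profiling in more detail.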
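
To get a feel for the Python-level half of such a report without running the full tool, the sketch below profiles a single function with the standard-library cProfile and prints the most expensive calls by cumulative time. The names slow_preprocess and profile_call are hypothetical stand-ins for parts of a real script. This only approximates one part of what torch.utils.bottleneck prints and is not its implementation.

```python
import cProfile
import io
import pstats


def slow_preprocess(n=200_000):
    """Hypothetical stand-in for a CPU-bound data preprocessing step."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)  # keep only the top entries
    return result, buffer.getvalue()


result, report = profile_call(slow_preprocess)
print(report)
```

If a Python function such as the preprocessing step dominates the cumulative time, the GPU is probably waiting on the CPU, and operator-level profiling is unlikely to be the first thing to look at.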

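2. CUDA device performance test

Tool/method: Measure the compute performance of CUDA devices (single GPU or multiple GPUs) with a custom script.
Single-GPU example: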

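import time

import torch

def cuda_benchmark(device_id, N=1000000, iters=10000):
    torch.cuda.set_device(device_id)
    data = torch.ones(N, device="cuda")  # allocate directly on the selected GPU
    data += 1  # warm-up so one-time CUDA initialization is not timed
    torch.cuda.synchronize()  # wait for pending CUDA work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        data += 1
    torch.cuda.synchronize()  # kernels launch asynchronously; wait before stopping the clock
    end = time.perf_counter()
    print(f"Execution time: {end - start:.4f} seconds")

cuda_benchmark(0)  # benchmark the first GPU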

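Multi-GPU test: Extend the script with torch.nn.DataParallel or DistributedDataParallel (DDP) and compare the run time against the single-GPU result to measure the speedup. The PyTorch documentation recommends DDP over DataParallel, even on a single machine.
Use case: Checking that the CUDA devices work correctly and gauging single-GPU and multi-GPU performance.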
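
The raw execution time is easier to interpret once converted into a rate. The single-GPU loop above (data += 1) reads and writes every element once per iteration, so it mostly exercises memory bandwidth rather than arithmetic throughput. The sketch below turns a measured time into an approximate bandwidth figure and computes multi-GPU speedup and scaling efficiency. It assumes float32 elements (the default dtype of torch.ones) and ignores caching and kernel launch overhead, which can dominate for a tensor this small. The function names and the sample numbers are illustrative, not measurements.

```python
def effective_bandwidth_gb_s(n_elements, iterations, seconds, bytes_per_element=4):
    """Approximate memory traffic of an in-place `data += 1` loop, in GB/s.

    Each iteration reads and writes every element once, so it moves
    2 * n_elements * bytes_per_element bytes (float32 is 4 bytes).
    """
    bytes_moved = 2 * n_elements * bytes_per_element * iterations
    return bytes_moved / seconds / 1e9


def scaling_efficiency(single_gpu_seconds, multi_gpu_seconds, num_gpus):
    """Return (speedup, efficiency) for the same workload on 1 GPU vs. num_gpus GPUs."""
    speedup = single_gpu_seconds / multi_gpu_seconds
    return speedup, speedup / num_gpus


# Illustrative numbers only, not measurements.
print(effective_bandwidth_gb_s(1_000_000, 10_000, seconds=0.5))
print(scaling_efficiency(100.0, 30.0, num_gpus=4))
```

An efficiency well below 1.0 usually means communication or data loading is limiting the multi-GPU run rather than raw compute.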

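3. Performance analysis with PyTorch Profiler

Tool/method: Use torch.profiler (PyTorch 1.8+) or the older torch.autograd.profiler to analyze the run time, memory usage, and execution flow of each operation in a model.
Code example: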

import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

model = nn.Sequential(nn.Linear(1000, 1000)).cuda()
inputs = torch.randn(64, 1000).cuda()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Skip 1 step, warm up for 1 step, then record 3 steps.
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    # Called once the 3 active steps have been recorded.
    on_trace_ready=lambda prof: prof.export_chrome_trace("profile.json"),
    record_shapes=True,
    profile_memory=True
) as prof:
    for _ in range(5):
        with record_function("model_inference"):
            outputs = model(inputs)
            torch.cuda.synchronize()
        prof.step()  # advance the schedule; without this the profiler never leaves the wait phase

# Print the 10 operators with the largest total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

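Visualization: The on_trace_ready callback exports a Chrome Trace file via prof.export_chrome_trace("profile.json"); open chrome://tracing in Chrome and load the file to view the timeline.
Use case: A detailed look at the time spent in individual operations inside the model (such as convolutions and matrix multiplications) to find hotspots, whether slow operators or memory bottlenecks.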

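4. Real-time system resource monitoring

Tool/method: Use Ubuntu system tools to monitor hardware usage while PyTorch is running.

GPU monitoring: watch -n 1 nvidia-smi (refreshes GPU utilization, memory usage, temperature, and so on every second).
CPU/memory monitoring: htop (interactive view of per-process CPU and memory usage) or top (command-line view of overall system state).
Process-level monitoring: Use the psutil library inside the Python script to read the current process's resource usage (such as CPU percentage and memory in use).
Use case: Watching how a PyTorch program uses system resources in real time to spot resource bottlenecks (such as an underused GPU or a memory leak).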
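
When the numbers need to be logged next to benchmark results rather than watched in a terminal, the same information can be collected from Python. The sketch below is one possible approach: it queries nvidia-smi in CSV mode and reads the current process's usage with psutil. The helper names are hypothetical, psutil is a third-party package that must be installed separately, and the parser assumes every queried field comes back as a number (some GPUs report values such as "[N/A]" for fields they do not support).

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
GPU_FIELDS = ("utilization.gpu", "memory.used", "memory.total", "temperature.gpu")


def parse_gpu_row(row):
    """Turn one CSV row such as '87, 10240, 24576, 65' into a dict of floats."""
    values = [float(v.strip()) for v in row.split(",")]
    return dict(zip(GPU_FIELDS, values))


def query_gpus():
    """Return one dict per GPU; requires nvidia-smi from the NVIDIA driver on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(GPU_FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_row(line) for line in out.splitlines() if line.strip()]


def process_stats(interval=0.1):
    """CPU percent and resident memory (MiB) of the current process via psutil."""
    import psutil  # third-party: pip install psutil

    proc = psutil.Process()
    return {
        "cpu_percent": proc.cpu_percent(interval=interval),
        "rss_mib": proc.memory_info().rss / 2**20,
    }


# Made-up sample row for illustration, not a real measurement.
print(parse_gpu_row("87, 10240, 24576, 65"))
```

Calling query_gpus() and process_stats() once per training epoch and appending the results to a log is usually enough to see a steadily growing memory footprint or a GPU that sits idle between batches.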

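5. Model inference performance test

Steps:

Switch to evaluation mode: model.eval() (disables dropout and makes batch normalization use its running statistics).
Disable gradient tracking: with torch.no_grad() (reduces memory usage and speeds up inference).
Batched inference test: Feed batches from a test dataset to the model and record the inference time (for example, samples processed per second).
Code example: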

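import time

import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained model (torchvision >= 0.13; older releases use pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).cuda()
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load the test image
image = Image.open("test.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0).cuda()

# Run inference and time it
with torch.no_grad():
    model(input_tensor)  # warm-up run, excluded from the timing
    torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
    start = time.perf_counter()
    output = model(input_tensor)
    torch.cuda.synchronize()  # wait for the forward pass to finish
    end = time.perf_counter()
print(f"Inference time: {end - start:.4f} seconds")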

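Use case: Evaluating inference performance as it would be in deployment (latency, throughput) and checking that the model meets the application's requirements.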
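
A single timed forward pass is noisy, and the batched test described in the steps calls for repeated runs. The sketch below is a minimal timing helper, assuming a hypothetical benchmark function that takes any zero-argument callable: it discards warm-up runs, times several iterations, and reports latency and throughput. The stand-in workload is CPU-only so the sketch runs without a GPU.

```python
import statistics
import time


def benchmark(step, batch_size, warmup=3, iters=20, sync=None):
    """Time step() and report latency and throughput.

    sync is an optional callable run before each clock read; for a CUDA
    model pass torch.cuda.synchronize so asynchronous kernels are counted.
    """
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        step()
    latencies = []
    for _ in range(iters):
        if sync:
            sync()
        start = time.perf_counter()
        step()
        if sync:
            sync()
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        "mean_latency_s": statistics.mean(latencies),
        "median_latency_s": statistics.median(latencies),
        "throughput_samples_per_s": batch_size * iters / total,
    }


# CPU-only stand-in workload so the sketch runs without a GPU.
stats = benchmark(lambda: sum(range(10_000)), batch_size=32)
print(stats)
```

For a real model, the call would look like benchmark(lambda: model(batch), batch_size=batch.shape[0], sync=torch.cuda.synchronize), made inside a torch.no_grad() block after model.eval(). The median is reported alongside the mean because occasional slow iterations can skew the mean.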

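6. Mixed-precision training performance test

Tool/method: Use torch.cuda.amp (automatic mixed precision) to measure how much mixed-precision training helps (lower GPU memory usage, faster computation).
Code example: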

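import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Placeholder model, optimizer, loss, and synthetic data (illustrative
# assumptions so the loop runs on its own); substitute your own.
model = nn.Linear(1000, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
dataloader = [(torch.randn(64, 1000), torch.randint(0, 10, (64,))) for _ in range(10)]

scaler = GradScaler()
for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()

    optimizer.zero_grad()
    with autocast():  # run the forward pass in mixed precision
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow in FP16
    scaler.step(optimizer)         # unscale gradients and update parameters
    scaler.update()                # adjust the scale factor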

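Use case: Verifying the effect of mixed-precision training on performance (reduced GPU memory usage, faster training); useful when a large batch size is needed.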
