- `DataLoader` with `num_workers>0`: spawn background worker processes so batch loading and augmentation overlap with GPU compute instead of stalling it.
- `pin_memory=True`: allocate fetched batches in page-locked host memory, which speeds up host-to-device copies and lets them run asynchronously with `non_blocking=True`. A minimal sketch follows this list item.
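A minimal sketch of such a loader, assuming a CUDA device is available; the dataset here is a random placeholder and the worker count of 4 is only an illustrative starting point:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your own Dataset implementation.
train_dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                              torch.randint(0, 10, (1024,)))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # background worker processes preparing batches
    pin_memory=True,          # page-locked host memory for faster H2D copies
    persistent_workers=True,  # keep workers alive between epochs
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for images, labels in train_loader:
    # non_blocking=True lets the copy overlap with compute when memory is pinned
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break  # one batch is enough for the sketch
```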
- `torch.cuda.amp`: automatic mixed precision runs eligible ops in half precision, cutting memory traffic and exploiting tensor cores, while `GradScaler` scales the loss to keep fp16 gradients from underflowing. A sketch of the usual training step follows.
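The standard autocast/GradScaler loop, sketched against the loader above with a deliberately tiny placeholder model:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for images, labels in train_loader:
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():      # forward pass in mixed precision
        outputs = model(images)
        loss = criterion(outputs, labels)

    scaler.scale(loss).backward()        # scale loss to avoid fp16 underflow
    scaler.step(optimizer)               # unscales grads, then optimizer.step()
    scaler.update()                      # adjust the scale factor for next step
```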
- `DistributedDataParallel` over `DataParallel`: DDP runs one process per GPU and overlaps gradient all-reduce with the backward pass, whereas `DataParallel` is single-process, replicates the model every step, and is bottlenecked by the GIL and by scatter/gather on one device. PyTorch recommends DDP even on a single machine; see the sketch after this item.
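A minimal single-node DDP sketch, assuming it is launched with `torchrun --nproc_per_node=<num_gpus> train.py` so that `LOCAL_RANK` and friends are set in the environment; the dataset and model are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)          # shard the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    model = nn.Linear(32, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # wrap for gradient sync
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                   # reshuffle shards each epoch
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad(set_to_none=True)
            loss = criterion(model(x), y)
            loss.backward()                        # all-reduce overlaps with backward
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```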
- `nvidia-smi` plus `top`/`htop`: watch GPU utilization and memory alongside CPU load while training runs; low GPU utilization with busy loader workers usually points to an input-pipeline bottleneck rather than the model.
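As a rough illustration, the same GPU query can be polled from Python; this is a sketch assuming `nvidia-smi` is on the PATH and uses its standard `--query-gpu` CSV output:

```python
import subprocess
import time

def gpu_stats():
    """Return a list of (utilization %, memory used MiB) tuples, one per GPU."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(", "))
            for line in out.strip().splitlines()]

if __name__ == "__main__":
    for _ in range(5):        # poll a few times while training runs elsewhere
        print(gpu_stats())
        time.sleep(1)
```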