Hi, I am currently running the Horovod synthetic benchmarks in an intra-node (single-node) setup. However, I have observed that increasing the number of GPUs does not result in a proportional increase in total throughput. Specifically, with a single GPU the throughput is approximately 842.6 ± 2.4 img/sec, whereas with two GPUs the total throughput drops to around 485.7 ± 44.8 img/sec, which translates to roughly 242.8 ± 22.4 img/sec per GPU.
The configuration for the test is:

  MPI: Open MPI 5.0.6
  Horovod: 0.28.1
  PyTorch: 1.12.1
  GPU: NVIDIA A100
  CUDA: 11.8
  Python: 3.10
  GCC: 8.5.0

Run with one GPU:

command: mpirun -n 1 --report-bindings python pytorch_synthetic_benchmark.py --batch-size=64 --model=resnet50

[gpu39:59123] Rank 0 bound package[0][core:0]
Model: resnet50
Batch size: 64
Number of GPUs: 1
Running warmup...
Running benchmark...
Iter #0: 844.3 img/sec per GPU
Iter #1: 844.0 img/sec per GPU
Iter #2: 843.6 img/sec per GPU
Iter #3: 843.5 img/sec per GPU
Iter #4: 843.5 img/sec per GPU
Iter #5: 842.0 img/sec per GPU
Iter #6: 841.3 img/sec per GPU
Iter #7: 841.8 img/sec per GPU
Iter #8: 841.1 img/sec per GPU
Iter #9: 841.1 img/sec per GPU
Img/sec per GPU: 842.6 +-2.4
Total img/sec on 1 GPU(s): 842.6 +-2.4

Run with two GPUs on the same node:

command: mpirun -n 2 --report-bindings python pytorch_synthetic_benchmark.py --batch-size=64 --model=resnet50

[gpu39:59166] Rank 0 bound package[0][core:0]
[gpu39:59166] Rank 1 bound package[0][core:1]
Model: resnet50
Batch size: 64
Number of GPUs: 2
Running warmup...
Running benchmark...
Iter #0: 235.7 img/sec per GPU
Iter #1: 251.5 img/sec per GPU
Iter #2: 217.0 img/sec per GPU
Iter #3: 239.4 img/sec per GPU
Iter #4: 257.2 img/sec per GPU
Iter #5: 258.3 img/sec per GPU
Iter #6: 248.4 img/sec per GPU
Iter #7: 242.6 img/sec per GPU
Iter #8: 238.0 img/sec per GPU
Iter #9: 240.3 img/sec per GPU
Img/sec per GPU: 242.8 +-22.4
Total img/sec on 1 GPU(s): 485.7 +-44.8
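For reference, the script is the standard Horovod PyTorch synthetic benchmark. Roughly, it is equivalent to the condensed sketch below (not the exact script; the variable names and the fixed 10-iteration timing loop are my simplifications): one MPI rank per GPU, a fixed random batch so there is no data-loading I/O, and gradient averaging across ranks via hvd.DistributedOptimizer.

# Condensed sketch of pytorch_synthetic_benchmark.py (approximation, not the exact script)
import timeit
import torch
import torch.optim as optim
import torchvision.models as models
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # each MPI rank drives its own GPU

model = models.resnet50().cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer so gradients are allreduce-averaged across ranks every step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Synthetic input: a fixed random batch, so there is no data-loading or disk I/O.
data = torch.randn(64, 3, 224, 224).cuda()
target = torch.randint(0, 1000, (64,)).cuda()

def benchmark_step():
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()

# img/sec per GPU = batch size / average time per step
t = timeit.timeit(benchmark_step, number=10)
if hvd.rank() == 0:
    print('Img/sec per GPU: %.1f' % (64 * 10 / t))

So the per-GPU numbers above come directly from timing benchmark_step, and the "Total img/sec" line is just the per-GPU figure multiplied by the number of ranks.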