Hi Shruti,

What version of NCCL is installed on the system? Horovod has environment
variables you can set to force the use of NCCL.
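
For example (just a sketch, assuming Horovod's PyTorch module is importable in
your environment), a short script run under mpirun will tell you whether
Horovod was built with NCCL at all and which NCCL version PyTorch bundles:

    # check_nccl.py (hypothetical file name); run with: mpirun -n 1 python check_nccl.py
    import horovod.torch as hvd
    import torch

    hvd.init()
    if hvd.rank() == 0:
        print("Horovod built with NCCL:", hvd.nccl_built())   # falsy -> reductions fall back to MPI
        print("Horovod built with MPI: ", hvd.mpi_built())
        print("NCCL bundled with PyTorch:", torch.cuda.nccl.version())

If nccl_built() comes back false, Horovod is doing the allreduce through MPI
(possibly staging through host memory), which by itself could account for a
large part of the slowdown.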

You may want to use nvidia-smi to double-check whether the benchmark is
actually using both GPUs when running with two MPI processes.
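
As a quick sanity check (a minimal sketch; the synthetic benchmark is expected
to do the same pinning via hvd.local_rank()), something like the following,
launched with mpirun -n 2, should report two different devices:

    import horovod.torch as hvd
    import torch

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())   # rank 0 -> GPU 0, rank 1 -> GPU 1
    print(f"rank {hvd.rank()}/{hvd.size()} -> cuda:{torch.cuda.current_device()} "
          f"({torch.cuda.get_device_name()})")

If both ranks end up on cuda:0, the two processes are sharing one A100, which
could produce exactly the kind of drop in per-GPU throughput you are seeing.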

Also, you may want to consult a technical assistant such as ChatGPT (o4-mini)
about this problem. The assistant may prove to be quite helpful.

Howard


From: 'George Bosilca' via Open MPI users <users@lists.open-mpi.org>
Reply-To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Date: Wednesday, June 4, 2025 at 8:12 AM
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Horovod Performance with OpenMPI

What's the network on your cluster? Without a very good network you cannot get
anywhere close to the single-GPU throughput, because the data exchanged between
the two GPUs becomes the bottleneck.
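
A rough way to put a number on that (just a sketch, with the tensor sized at
roughly the 25M fp32 parameters of ResNet-50) is to time a bare Horovod
allreduce between the two ranks:

    import time
    import horovod.torch as hvd
    import torch

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())
    grads = torch.randn(25_000_000, device="cuda")   # ~100 MB, about one ResNet-50 worth of gradients

    for _ in range(5):                               # warm-up
        hvd.allreduce(grads, name="warmup")
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(20):
        hvd.allreduce(grads, name="bench")
    torch.cuda.synchronize()
    per_call = (time.time() - start) / 20

    if hvd.rank() == 0:
        gbps = grads.numel() * 4 / per_call / 1e9
        print(f"allreduce of ~100 MB: {per_call * 1e3:.1f} ms (~{gbps:.1f} GB/s effective)")

If that effective bandwidth is far below what your GPU interconnect (NVLink or
PCIe) should deliver, the gradient exchange, or a CPU/MPI fallback path, is the
likely bottleneck.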

  George.


On Wed, Jun 4, 2025 at 5:56 AM Shruti Sharma <shrutic...@gmail.com> wrote:
Hi,
I am currently running the Horovod benchmarks in an intra-node setup. However,
I have observed that increasing the number of GPUs does not result in a
proportional increase in total throughput. Specifically, the throughput with a
single GPU is approximately 842.6 ± 2.4 img/sec, whereas with two GPUs the
total throughput is around 485.7 ± 44.8 img/sec, i.e. approximately
242.8 ± 22.4 img/sec per GPU.

The configuration for the test is:
MPI: Open MPI 5.0.6
Horovod: 0.28.1
PyTorch: 1.12.1
GPU: NVIDIA A100
CUDA: 11.8
Python: 3.10
GCC: 8.5.0

command: mpirun -n 1 --report-bindings python pytorch_synthetic_benchmark.py --batch-size=64 --model=resnet50
[gpu39:59123] Rank 0 bound package[0][core:0]

Model: resnet50
Batch size: 64
Number of GPUs: 1
Running warmup...
Running benchmark...
Iter #0: 844.3 img/sec per GPU
Iter #1: 844.0 img/sec per GPU
Iter #2: 843.6 img/sec per GPU
Iter #3: 843.5 img/sec per GPU
Iter #4: 843.5 img/sec per GPU
Iter #5: 842.0 img/sec per GPU
Iter #6: 841.3 img/sec per GPU
Iter #7: 841.8 img/sec per GPU
Iter #8: 841.1 img/sec per GPU
Iter #9: 841.1 img/sec per GPU
Img/sec per GPU: 842.6 +-2.4
Total img/sec on 1 GPU(s): 842.6 +-2.4



Run with two GPUs on the same node:
command: mpirun -n 2 --report-bindings python pytorch_synthetic_benchmark.py --batch-size=64 --model=resnet50
[gpu39:59166] Rank 0 bound package[0][core:0]
[gpu39:59166] Rank 1 bound package[0][core:1]

Model: resnet50
Batch size: 64
Number of GPUs: 2
Running warmup...
Running benchmark...
Iter #0: 235.7 img/sec per GPU
Iter #1: 251.5 img/sec per GPU
Iter #2: 217.0 img/sec per GPU
Iter #3: 239.4 img/sec per GPU
Iter #4: 257.2 img/sec per GPU
Iter #5: 258.3 img/sec per GPU
Iter #6: 248.4 img/sec per GPU
Iter #7: 242.6 img/sec per GPU
Iter #8: 238.0 img/sec per GPU
Iter #9: 240.3 img/sec per GPU
Img/sec per GPU: 242.8 +-22.4
Total img/sec on 2 GPU(s): 485.7 +-44.8