Hi,
There is no functional openib BTL in the Open MPI 4.x versions. Point-to-point 
communication over the InfiniBand interconnect is provided by the UCX PML.
To have GPUDirect RDMA, UCX must have been configured with --with-cuda and 
--with-gdrcopy.
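As a rough sketch (the install prefixes and the CUDA/gdrcopy paths below are just 
placeholders, adjust them for your system), a CUDA-aware UCX + Open MPI stack is 
typically built along these lines:

    # Build UCX with CUDA and gdrcopy support
    ./configure --prefix=/opt/ucx --with-cuda=/usr/local/cuda --with-gdrcopy=/opt/gdrcopy
    make -j && make install

    # Build Open MPI against that UCX, and with CUDA support
    ./configure --prefix=/opt/openmpi --with-ucx=/opt/ucx --with-cuda=/usr/local/cuda
    make -j && make install

    # At run time, make sure the UCX PML is selected
    mpirun --mca pml ucx ./your_app
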
Regards,
Florent Germain

-----Original Message-----
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Oskar Lappi via 
users
Sent: Wednesday, 1 July 2020 00:24
To: users@lists.open-mpi.org
Cc: Oskar Lappi <oskar.la...@abo.fi>
Subject: [OMPI users] openib BTL vs UCX. Which do I need to use GPUDirect RDMA?

Hi,

  I'm trying to troubleshoot a problem: we don't seem to be getting the 
bandwidth we'd expect from our distributed CUDA program, which uses Open MPI to 
pass data between GPUs in an HPC cluster.

I thought I had found a possible root cause, but now I'm unsure how to fix it, 
since the documentation provides conflicting information.

Running

     ompi_info --all | grep "MCA btl"

gives me the following output:

                  MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.2)
                  MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.2)
                  MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.0.2)
                  MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.2)
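
(Side note: I believe one can also check whether Open MPI was built against UCX 
at all with something like

     ompi_info | grep -i ucx

which should list a ucx PML component on a UCX-enabled build.)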

According to https://www.open-mpi.org/faq/?category=runcuda, the openib BTL is 
a prerequisite for GPUDirect RDMA.

However, I'm also reading that UCX is the preferred way to do RDMA and that it 
has CUDA support.

Can anyone tell me what a proper configuration for GPUDirect RDMA over 
InfiniBand looks like?

Best regards,

Oskar Lappi
