Nice catch Rainer!

I absolutely forgot to include the btl/self component.
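
For the record, the corrected command line (as Rainer spells out below) should be along these lines:

  mpirun --mca pml ob1 --mca btl ofi,self ...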

Cheers,

Gilles

On Mon, Mar 31, 2025, 23:05 Keller, Rainer <rainer.kel...@hs-esslingen.de>
wrote:

> Dear Sangam,
> as Gilles suggested, please try adding self for loopback:
>   mpirun --mca pml ob1 --mca btl ofi,self …
>
> since the error is:
>
> [g100n052:00000] *** An error occurred in MPI_Init
> [g100n052:00000] *** reported by process [901316609,0]
> [g100n052:00000] *** on a NULL communicator
> [g100n052:00000] *** Unknown error
>
> Hope this helps.
>
> Best regards,
> Rainer
>
>
> > On 31. Mar 2025, at 08:55, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> > Sangam,
> >
> > What if you run a simple MPI hello world program with
> > mpirun --mca pml ob1 --mca btl ofi ...
> >
> > on one and several nodes?
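> >
> > Any minimal hello world will do; a rough sketch, just to take GROMACS
> > out of the picture:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main(int argc, char *argv[])
> > {
> >     int rank, size, len;
> >     char name[MPI_MAX_PROCESSOR_NAME];
> >
> >     /* Initialize MPI and report rank, size and host name */
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     MPI_Get_processor_name(name, &len);
> >     printf("Hello from rank %d of %d on %s\n", rank, size, name);
> >     MPI_Finalize();
> >     return 0;
> > }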
> >
> > Cheers,
> >
> > Gilles
> >
> > On Mon, Mar 31, 2025 at 3:48 PM Sangam B <forum....@gmail.com> wrote:
> > Hello Gilles,
> >
> > The gromacs-2024.4 build at the cmake stage shows that CUDA-aware MPI is
> > detected:
> >
> > -- Performing Test HAVE_MPI_EXT
> > -- Performing Test HAVE_MPI_EXT - Success
> > -- Performing Test MPI_SUPPORTS_CUDA_AWARE_DETECTION
> > -- Performing Test MPI_SUPPORTS_CUDA_AWARE_DETECTION - Success
> > -- Checking for MPI_SUPPORTS_CUDA_AWARE_DETECTION - yes
> >
> > But during runtime it is not able to detect it.
> >
> > The OpenMPI MCA transports used are:
> >
> >  --mca btl ofi --mca coll ^hcoll -x GMX_ENABLE_DIRECT_GPU_COMM=true -x PATH -x LD_LIBRARY_PATH -hostfile s_hosts2 -np ${s_nmpi} --map-by numa --bind-to numa
> >
> > This fails with the following seg-fault error:
> >
> >
> > --------------------------------------------------------------------------
> > No components were able to be opened in the btl framework.
> >
> > This typically means that either no components of this type were
> > installed, or none of the installed components can be loaded.
> > Sometimes this means that shared libraries required by these
> > components are unable to be found/loaded.
> >
> >   Host:      g100n052
> >   Framework: btl
> >
> > --------------------------------------------------------------------------
> > [g100n052:14172:0:14172] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
> > [g100n052:14176:0:14176] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
> >
> >
> > But 'ompi_info' shows "btl ofi" is available.
> >
> > Another notable point is that single-node jobs with multiple GPUs work
> > fine: on a single node GROMACS detects GPU-aware MPI and the performance
> > is good, as expected.
> >
> > It fails with the above seg-fault error only when running on more than one node.
> >
> > On a single node, is it using XPMEM for communication?
> >
> > Is there any OpenMPI env variable to show what transport is being used
> for communication between GPUs and between MPI ranks?
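> >
> > For example, would something along these lines (assuming the usual
> > btl/pml verbosity MCA parameters apply here) print the selected
> > transports?
> >
> >   mpirun --mca btl_base_verbose 100 --mca pml_base_verbose 100 ...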
> >
> > Thanks
> >
> > On Sun, Mar 30, 2025 at 10:21 AM Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> > Sangam,
> >
> > The issue should have been fixed in Open MPI 5.0.6.
> >
> > Anyway, are you certain Open MPI is not GPU-aware, and that it is not
> > cmake/GROMACS that failed to detect it?
> >
> > What if you "configure" GROMACS with
> > cmake -DGMX_FORCE_GPU_AWARE_MPI=ON ...
> >
> > If the problem persists, please open an issue at
> https://github.com/open-mpi/ompi/issues and do provide the required
> information.
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sun, Mar 30, 2025 at 12:08 AM Sangam B <forum....@gmail.com> wrote:
> > Hi,
> >
> >        The OpenMPI 5.0.5 and 5.0.6 versions fail with the following error
> > during the "make" stage of the build procedure:
> >
> > In file included from ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:51,
> >                  from ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.c:13:
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h: In function ‘ompi_mtl_ofi_context_progress’:
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi_request.h:19:5: warning: implicit declaration of function ‘container_of’ [-Wimplicit-function-declaration]
> >    19 |     container_of((_ptr_ctx), struct ompi_mtl_ofi_request_t, ctx)
> >       |     ^~~~~~~~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:152:27: note: in expansion of macro ‘TO_OFI_REQ’
> >   152 |                 ofi_req = TO_OFI_REQ(ompi_mtl_ofi_wc[i].op_context);
> >       |                           ^~~~~~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi_request.h:19:30: error: expected expression before ‘struct’
> >    19 |     container_of((_ptr_ctx), struct ompi_mtl_ofi_request_t, ctx)
> >       |                              ^~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:152:27: note: in expansion of macro ‘TO_OFI_REQ’
> >   152 |                 ofi_req = TO_OFI_REQ(ompi_mtl_ofi_wc[i].op_context);
> >       |                           ^~~~~~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi_request.h:19:30: error: expected expression before ‘struct’
> >    19 |     container_of((_ptr_ctx), struct ompi_mtl_ofi_request_t, ctx)
> >       |                              ^~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:200:19: note: in expansion of macro ‘TO_OFI_REQ’
> >   200 |         ofi_req = TO_OFI_REQ(error.op_context);
> >       |                   ^~~~~~~~~~
> > make[2]: *** [Makefile:1603: mtl_ofi.lo] Error 1
> >
> > OpenMPI-5.0.7 gets past this error, but it is not able to build CUDA
> > [GPU Direct] & OFI support:
> >
> > The GROMACS application complains that it is not able to detect CUDA-aware
> > MPI:
> >
> > GPU-aware MPI was not detected, will not use direct GPU communication.
> Check the GROMACS install guide for recommendations for GPU-aware support.
> If you are certain about GPU-aware support in your MPI library, you can
> force its use by setting the GMX_FORCE_GPU_AWARE_MPI environment variable.
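> >
> > (Presumably that means something along these lines at run time, exporting
> > the variable to the ranks the same way as GMX_ENABLE_DIRECT_GPU_COMM above,
> > though I have not verified that it helps:
> >
> >   export GMX_FORCE_GPU_AWARE_MPI=1
> >   mpirun -x GMX_FORCE_GPU_AWARE_MPI ... )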
> >
> > OpenMPI is configured like this:
> >
> > '--disable-opencl' '--with-slurm' '--without-lsf'
> > '--without-opencl'
> > '--with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/cuda/12.6'
> > '--without-rocm'
> > '--with-knem=/opt/knem-1.1.4.90mlnx3'
> > '--with-xpmem=/sw/openmpi/5.0.7/g133cu126_ubu2404/xpmem/2.7.3/'
> > '--with-xpmem-libdir=/sw/openmpi/5.0.7/g133cu126_ubu2404/xpmem/2.7.3//lib'
> > '--with-ofi=/sw/openmpi/5.0.7/g133cu126_ubu2404/ofi/2.0.0/c126g25xu118'
> > '--with-ofi-libdir=/sw/openmpi/5.0.7/g133cu126_ubu2404/ofi/2.0.0/c126g25xu118/lib'
> > '--enable-mca-no-build=btl-usnic'
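> >
> > For what it's worth, here is the kind of check I would expect to confirm
> > CUDA support in the resulting install (assuming this build reports the
> > usual mpi_built_with_cuda_support parameter):
> >
> >   ompi_info --parsable --all | grep mpi_built_with_cuda_support:value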
> >
> > Can somebody help me build a working CUDA-aware OpenMPI here?
> >
> > Thanks
> >
