Nice catch Rainer! I absolutely forgot to include the btl/self component.
Cheers,
Gilles

On Mon, Mar 31, 2025, 23:05 Keller, Rainer <rainer.kel...@hs-esslingen.de> wrote:
> Dear Sangam,
> as Gilles suggested, please try adding self for loopback:
> mpirun --mca pml ob1 --mca btl ofi,self …
>
> since the error is:
>
> [g100n052:00000] *** An error occurred in MPI_Init
> [g100n052:00000] *** reported by process [901316609,0]
> [g100n052:00000] *** on a NULL communicator
> [g100n052:00000] *** Unknown error
>
> Hope this helps.
>
> Best regards,
> Rainer
>
>
> > On 31. Mar 2025, at 08:55, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> >
> > Sangam,
> >
> > What if you run a simple MPI hello world program with
> > mpirun --mca pml ob1 --mca btl ofi ...
> >
> > on one and several nodes?
> >
> > Cheers,
> >
> > Gilles
> >
> > On Mon, Mar 31, 2025 at 3:48 PM Sangam B <forum....@gmail.com> wrote:
> > Hello Gilles,
> >
> > The gromacs-2024.4 build shows at the cmake stage that CUDA-aware MPI is detected:
> >
> > -- Performing Test HAVE_MPI_EXT
> > -- Performing Test HAVE_MPI_EXT - Success
> > -- Performing Test MPI_SUPPORTS_CUDA_AWARE_DETECTION
> > -- Performing Test MPI_SUPPORTS_CUDA_AWARE_DETECTION - Success
> > -- Checking for MPI_SUPPORTS_CUDA_AWARE_DETECTION - yes
> >
> > But at runtime it is not able to detect it.
> >
> > The Open MPI MCA options used are:
> >
> > --mca btl ofi --mca coll ^hcoll -x GMX_ENABLE_DIRECT_GPU_COMM=true -x PATH -x LD_LIBRARY_PATH -hostfile s_hosts2 -np ${s_nmpi} --map-by numa --bind-to numa
> >
> > This fails with the following seg-fault error:
> >
> > --------------------------------------------------------------------------
> > No components were able to be opened in the btl framework.
> >
> > This typically means that either no components of this type were
> > installed, or none of the installed components can be loaded.
> > Sometimes this means that shared libraries required by these
> > components are unable to be found/loaded.
> >
> >   Host:      g100n052
> >   Framework: btl
> > --------------------------------------------------------------------------
> > [g100n052:14172:0:14172] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
> > [g100n052:14176:0:14176] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
> >
> > But 'ompi_info' shows "btl ofi" is available.
> >
> > The other notable point is that single-node jobs with multiple GPUs work fine: on a single node GROMACS detects GPU-aware MPI and the performance is good, as expected.
> >
> > It fails with the above seg-fault error only when running on more than one node.
> >
> > On a single node, is it using XPMEM for communication?
> >
> > Is there any Open MPI environment variable to show what transport is being used for communication between GPUs and between MPI ranks?
> >
> > Thanks
> >
> > On Sun, Mar 30, 2025 at 10:21 AM Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> > Sangam,
> >
> > The issue should have been fixed in Open MPI 5.0.6.
> >
> > Anyway, are you certain Open MPI is not GPU aware, and that it is not cmake/GROMACS that failed to detect it?
> >
> > What if you "configure" GROMACS with
> > cmake -DGMX_FORCE_GPU_AWARE_MPI=ON ...
> >
> > If the problem persists, please open an issue at https://github.com/open-mpi/ompi/issues and do provide the required information.
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sun, Mar 30, 2025 at 12:08 AM Sangam B <forum....@gmail.com> wrote:
> > Hi,
> >
> > OpenMPI-5.0.5 and 5.0.6 fail with the following error during the "make" stage of the build:
> >
> > In file included from ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:51,
> >                  from ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.c:13:
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h: In function ‘ompi_mtl_ofi_context_progress’:
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi_request.h:19:5: warning: implicit declaration of function ‘container_of’ [-Wimplicit-function-declaration]
> >    19 |     container_of((_ptr_ctx), struct ompi_mtl_ofi_request_t, ctx)
> >       |     ^~~~~~~~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:152:27: note: in expansion of macro ‘TO_OFI_REQ’
> >   152 |         ofi_req = TO_OFI_REQ(ompi_mtl_ofi_wc[i].op_context);
> >       |                   ^~~~~~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi_request.h:19:30: error: expected expression before ‘struct’
> >    19 |     container_of((_ptr_ctx), struct ompi_mtl_ofi_request_t, ctx)
> >       |                              ^~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:152:27: note: in expansion of macro ‘TO_OFI_REQ’
> >   152 |         ofi_req = TO_OFI_REQ(ompi_mtl_ofi_wc[i].op_context);
> >       |                   ^~~~~~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi_request.h:19:30: error: expected expression before ‘struct’
> >    19 |     container_of((_ptr_ctx), struct ompi_mtl_ofi_request_t, ctx)
> >       |                              ^~~~~~
> > ../../../../../../ompi/mca/mtl/ofi/mtl_ofi.h:200:19: note: in expansion of macro ‘TO_OFI_REQ’
> >   200 |         ofi_req = TO_OFI_REQ(error.op_context);
> >       |                   ^~~~~~~~~~
> > make[2]: *** [Makefile:1603: mtl_ofi.lo] Error 1
> >
> > OpenMPI-5.0.7 gets past this error, but it does not end up with working CUDA [GPU Direct] & OFI support.
> >
> > GROMACS complains that it is not able to detect CUDA-aware MPI:
> >
> > GPU-aware MPI was not detected, will not use direct GPU communication. Check the GROMACS install guide for recommendations for GPU-aware support. If you are certain about GPU-aware support in your MPI library, you can force its use by setting the GMX_FORCE_GPU_AWARE_MPI environment variable.
> >
> > OpenMPI is configured like this:
> >
> > '--disable-opencl' '--with-slurm' '--without-lsf'
> > '--without-opencl'
> > '--with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/cuda/12.6'
> > '--without-rocm'
> > '--with-knem=/opt/knem-1.1.4.90mlnx3'
> > '--with-xpmem=/sw/openmpi/5.0.7/g133cu126_ubu2404/xpmem/2.7.3/'
> > '--with-xpmem-libdir=/sw/openmpi/5.0.7/g133cu126_ubu2404/xpmem/2.7.3//lib'
> > '--with-ofi=/sw/openmpi/5.0.7/g133cu126_ubu2404/ofi/2.0.0/c126g25xu118'
> > '--with-ofi-libdir=/sw/openmpi/5.0.7/g133cu126_ubu2404/ofi/2.0.0/c126g25xu118/lib'
> > '--enable-mca-no-build=btl-usnic'
> >
> > Can somebody help me build a successful CUDA-aware Open MPI here?
> >
> > Thanks
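For reference, below is a minimal MPI hello world along the lines Gilles suggested. It is only a sketch: it assumes the mpicc wrapper from the Open MPI build under test is in PATH, and it prints rank, size, and host name so you can confirm that every rank starts on both nodes with the ofi,self btl pair.

    /* hello_mpi.c - minimal sanity test for the pml/ob1 + btl ofi,self combination */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("Hello from rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

A possible way to build and run it with the same MCA settings (the hostfile name is taken from the command line quoted above):

    mpicc hello_mpi.c -o hello_mpi
    mpirun --mca pml ob1 --mca btl ofi,self -hostfile s_hosts2 -np 2 ./hello_mpi

If this runs cleanly across two nodes but the GROMACS job still crashes, the btl selection itself is probably fine and the GPU path is the next thing to look at.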
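On the question of whether the library reports GPU awareness at run time: Open MPI's extensions header is what the GROMACS MPI_SUPPORTS_CUDA_AWARE_DETECTION test probes, and the same header can be queried directly. The small check below is a sketch that assumes mpi-ext.h is installed with the 5.0.7 build in question; it reports both the compile-time flag and the run-time answer.

    /* cuda_check.c - report Open MPI's compile-time and run-time CUDA awareness */
    #include <stdio.h>
    #include <mpi.h>
    #if defined(OPEN_MPI)
    #include <mpi-ext.h>   /* Open MPI extensions: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */
    #endif

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        printf("Compile time: this Open MPI was built with CUDA-aware support\n");
    #elif defined(MPIX_CUDA_AWARE_SUPPORT)
        printf("Compile time: this Open MPI was built WITHOUT CUDA-aware support\n");
    #else
        printf("Compile time: CUDA-aware support cannot be determined\n");
    #endif

    #if defined(MPIX_CUDA_AWARE_SUPPORT)
        printf("Run time:     MPIX_Query_cuda_support() returns %d\n",
               MPIX_Query_cuda_support());
    #endif

        MPI_Finalize();
        return 0;
    }

As for seeing which transport is actually selected, adding something like --mca btl_base_verbose 100 to the mpirun command (or exporting the equivalent OMPI_MCA_btl_base_verbose environment variable) makes the BTL selection decisions visible in the job output.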