Thank you for responding. The output of ompi_info regarding the build configuration is:
Configure command line: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr/mpi/gcc/openmpi-4.0.2a1' '--exec-prefix=/usr/mpi/gcc/openmpi-4.0.2a1' '--bindir=/usr/mpi/gcc/openmpi-4.0.2a1/bin' '--sbindir=/usr/mpi/gcc/openmpi-4.0.2a1/sbin' '--sysconfdir=/usr/mpi/gcc/openmpi-4.0.2a1/etc' '--datadir=/usr/mpi/gcc/openmpi-4.0.2a1/share' '--includedir=/usr/mpi/gcc/openmpi-4.0.2a1/include' '--libdir=/usr/mpi/gcc/openmpi-4.0.2a1/lib64' '--libexecdir=/usr/mpi/gcc/openmpi-4.0.2a1/libexec' '--localstatedir=/var' '--sharedstatedir=/var/lib' '--mandir=/usr/mpi/gcc/openmpi-4.0.2a1/share/man' '--infodir=/usr/mpi/gcc/openmpi-4.0.2a1/share/info' '--with-platform=contrib/platform/mellanox/optimized'

The command line does not include '--with-slurm' or '--with-pmi', BUT the following MCA components are also present and contain references to pmi and slurm:

MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.0.2)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.0.2)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.0.2)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.2)
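In case it helps, this is roughly how I pulled those lines out of ompi_info (a minimal sketch; the module names are the ones from the original post, so adjust as needed, and `which ompi_info` is just there to confirm which installation is actually being picked up):

```
# Load the same toolchain the job uses (module names as in the original post).
module load gcc/10.2 openmpi/4.1.1 cuda/11.1

# Confirm which Open MPI installation is on PATH.
which ompi_info

# Show the build flags and any PMI/Slurm-related components.
ompi_info | grep -i 'Configure command line'
ompi_info | grep -iE 'pmi|slurm'
```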
On Fri, May 19, 2023 at 2:48 PM Juergen Salk <juergen.s...@uni-ulm.de> wrote:

> Hi,
>
> I am not sure if this is related to GPUs. I rather think the issue has to
> do with how your OpenMPI has been built.
>
> What does the ompi_info command show? Look for "Configure command line" in
> the output. Does this include the '--with-slurm' and '--with-pmi' flags?
>
> To my very best knowledge, both flags need to be set for OpenMPI to
> work with srun.
>
> Also see:
>
> https://www.open-mpi.org/faq/?category=slurm#slurm-direct-srun-mpi-apps
>
> https://slurm.schedmd.com/mpi_guide.html#open_mpi
>
> Best regards
> Jürgen
>
>
> * Saksham Pande 5-Year IDD Physics <saksham.pande.ph...@itbhu.ac.in> [230519 07:42]:
> > Hi everyone,
> > I am trying to run a simulation program on slurm using openmpi-4.1.1 and
> > cuda/11.1. On executing, I get the following error:
> >
> > srun --mpi=pmi2 --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1 --time=02:00:00 --pty bash -i
> > ./<executable>
> >
> > ```
> > ._____________________________________________________________________________________
> > |
> > | Initial checks...
> > | All good.
> > |_____________________________________________________________________________________
> > [gpu008:162305] OPAL ERROR: Not initialized in file pmix3x_client.c at line 112
> > --------------------------------------------------------------------------
> > The application appears to have been direct launched using "srun",
> > but OMPI was not built with SLURM's PMI support and therefore cannot
> > execute. There are several options for building PMI support under
> > SLURM, depending upon the SLURM version you are using:
> >
> > version 16.05 or later: you can use SLURM's PMIx support. This
> > requires that you configure and build SLURM --with-pmix.
> >
> > Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> > PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> > install PMI-2. You must then build Open MPI using --with-pmi pointing
> > to the SLURM PMI library location.
> >
> > Please configure as appropriate and try again.
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [gpu008:162305] Local abort before MPI_INIT completed completed
> > successfully, but am not able to aggregate error messages, and not able to
> > guarantee that all other processes were killed!
> > ```
> >
> > I am using the following modules: gcc/10.2 openmpi/4.1.1 cuda/11.1.
> > Running which mpic++, which mpirun or which nvcc returns the module paths
> > only, which looks correct. I also changed $PATH and $LD_LIBRARY_PATH based
> > on ldd <executable>, but still get the same error.
> >
> > [sakshamp.phy20.itbhu@login2 menura]$ srun --mpi=list
> > srun: MPI types are...
> > srun: cray_shasta
> > srun: none
> > srun: pmi2
> >
> > What should I do from here? I have been stuck on this error for 6 days now.
> > If there is any build difference, I will have to tell the sysadmin.
> > Since there is an openmpi pairing error with slurm, are there other errors
> > I could expect between cuda and openmpi?
> >
> > Thanks
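Based on the FAQ links above, my understanding (which I still need to confirm with the sysadmin) is that either Open MPI has to be rebuilt against Slurm's PMI, or the job has to be launched through mpirun inside the allocation instead of direct srun. A rough, untested sketch of both follows; the install and PMI/CUDA paths are only guesses for our system:

```
# Interim workaround (sketch): launch with mpirun inside the allocation,
# so the application does not depend on srun's PMI integration.
salloc --nodes=1 --ntasks-per-node=5 --partition=gpu --gres=gpu:1 --time=02:00:00
mpirun -np 5 ./<executable>

# Possible rebuild for the sysadmin (sketch): point Open MPI at Slurm's PMI.
# The --with-pmi prefix below is a guess; it should contain Slurm's pmi2
# headers and libraries, and the CUDA path should match the installed toolkit.
./configure --prefix=/opt/openmpi-4.1.1 --with-slurm --with-pmi=/usr \
            --with-cuda=/usr/local/cuda-11.1
make -j && make install
```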