You can trust the output of --mpi=list. The problem is more likely that OMPI wasn't configured with --with-pmi2.
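Something along these lines should do it -- keep your existing configure options and add --with-pmi2 (the prefix and other flags below are copied from your ompi_info output; the path is a placeholder for wherever Slurm's PMI-2 headers and library live on your system, and can be omitted if Slurm is in a standard location):

  $ ./configure --prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3 \
        --disable-silent-rules --enable-shared --with-pmix=internal \
        --with-slurm --with-psm \
        --with-pmi2=<path-to-slurm-pmi2-installation>
  $ make -j && make install

Then launch your test program (shown here as ./hello_world, standing in for whatever your binary is called) with:

  $ srun --mpi=pmi2 ./hello_world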
> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
> <users@lists.open-mpi.org> wrote:
>
> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to
> either of them, my job still fails. Why is that? Can I not trust the output
> of --mpi=list?
>
> Prentice
>
> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>> No, but you do have to explicitly build OMPI with non-PMIx support if that
>> is what you are going to use. In this case, you need to configure OMPI
>> --with-pmi2=<path-to-the-pmi2-installation>
>>
>> You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed
>> in a standard location, as we should find it there.
>>
>>
>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> It looks like it was built with PMI2, but not PMIx:
>>>
>>> $ srun --mpi=list
>>> srun: MPI types are...
>>> srun: none
>>> srun: pmi2
>>> srun: openmpi
>>>
>>> I did launch the job with srun --mpi=pmi2 ....
>>>
>>> Does OpenMPI 4 need PMIx specifically?
>>>
>>>
>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
>>>>
>>>>
>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software with
>>>>> a very simple "hello, world" MPI program that I've used reliably for
>>>>> years. When I submit the job through Slurm and use srun to launch it,
>>>>> I get these errors:
>>>>>
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> *** and potentially your MPI job)
>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed
>>>>> completed successfully, but am not able to aggregate error messages, and
>>>>> not able to guarantee that all other processes were killed!
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> *** and potentially your MPI job)
>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed
>>>>> completed successfully, but am not able to aggregate error messages, and
>>>>> not able to guarantee that all other processes were killed!
>>>>>
>>>>> If I run the same job but use mpiexec or mpirun instead of srun, the
>>>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was compiled
>>>>> with Slurm support:
>>>>>
>>>>> $ ompi_info | grep slurm
>>>>>   Configure command line:
>>>>>     '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
>>>>>     '--disable-silent-rules' '--enable-shared' '--with-pmix=internal'
>>>>>     '--with-slurm' '--with-psm'
>>>>>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>>>>   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>>>>
>>>>> Any ideas what could be wrong? Do you need any additional information?
>>>>>
>>>>> Prentice
>>>>>
>>
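P.S. Once the rebuild is done, a quick sanity check is something like:

  $ ompi_info | grep -i pmi

which should now list PMI-related components (e.g., the "s1"/"s2" pmix components) alongside the slurm ones if the build actually picked up Slurm's PMI-2 support.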