You can trust the output of --mpi=list. The problem is likely that OMPI wasn't
configured --with-pmi2.
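
For example, a rough rebuild sketch, reusing the configure options shown in your
ompi_info output below and adding PMI2 support (the PMI2 path is a placeholder;
point it at the prefix where Slurm's PMI2 headers and library are installed on
your system):

$ ./configure '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' \
    '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' \
    '--with-slurm' '--with-psm' '--with-pmi2=/usr'
$ make -j 8 all
$ make install

After the rebuild, relaunch the test program through Slurm with the PMI2 plugin
(hello_mpi here stands in for your test binary):

$ srun --mpi=pmi2 ./hello_mpi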


> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
> <users@lists.open-mpi.org> wrote:
> 
> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
> either of them, my job still fails. Why is that? Can I not trust the output 
> of --mpi=list?
> 
> Prentice
> 
> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>> No, but you do have to explicitly build OMPI with non-PMIx support if that 
>> is what you are going to use. In this case, you need to configure OMPI 
>> --with-pmi2=<path-to-the-pmi2-installation>
>> 
>> You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed 
>> in a standard location, as we should find it there.
>> 
>> 
>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>>> <users@lists.open-mpi.org> wrote:
>>> 
>>> It looks like it was built with PMI2, but not PMIx:
>>> 
>>> $ srun --mpi=list
>>> srun: MPI types are...
>>> srun: none
>>> srun: pmi2
>>> srun: openmpi
>>> 
>>> I did launch the job with srun --mpi=pmi2 ....
>>> 
>>> Does OpenMPI 4 need PMIx specifically?
>>> 
>>> 
>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
>>>> 
>>>> 
>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>>>> <users@lists.open-mpi.org> wrote:
>>>>> 
>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software with 
>>>>> a very simple hello, world MPI program that I've used reliably for years. 
>>>>> When I submit the job through slurm and use srun to launch the job, I get 
>>>>> these errors:
>>>>> 
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>>>>> completed successfully, but am not able to aggregate error messages, and 
>>>>> not able to guarantee that all other processes were killed!
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>>>>> completed successfully, but am not able to aggregate error messages, and 
>>>>> not able to guarantee that all other processes were killed!
>>>>> 
>>>>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>>>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was compiled 
>>>>> with Slurm support:
>>>>> 
>>>>> $ ompi_info | grep slurm
>>>>>   Configure command line: 
>>>>> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
>>>>> '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
>>>>> '--with-slurm' '--with-psm'
>>>>>                  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>>>>                  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>                  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>               MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>>>> 
>>>>> Any ideas what could be wrong? Do you need any additional information?
>>>>> 
>>>>> Prentice
>>>>> 
>> 

