I also have this problem on servers I'm benchmarking in Dell's lab with
OpenMPI 4.0.3. I tried a new build of OpenMPI with "--with-pmi2"; no
change.
In the end, my workaround in the Slurm script was to launch my code with
mpirun. Since mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but don't ask me why :-)
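
For reference, a minimal sketch of the kind of batch script I mean (the
node/task counts, module name and binary name are just placeholders to
adapt):

#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=32    # adjust to the real core count per node

module load openmpi/4.0.3       # or however OpenMPI is provided locally

# Workaround: launch with mpirun instead of srun. mpirun only sees one
# slot per node here, so oversubscribe and pin each rank to its own core.
mpirun -np $SLURM_NTASKS --oversubscribe --bind-to core ./hello_mpi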

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
> Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
> out non-OMPI problems?
>
> Andy
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
> Bisbal via users
> Sent: Friday, April 24, 2020 2:19 PM
> To: Ralph Castain <r...@open-mpi.org>; Open MPI Users 
> <users@lists.open-mpi.org>
> Cc: Prentice Bisbal <pbis...@pppl.gov>
> Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
>
> Okay. I've got Slurm built with pmix support:
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmix_v3
> srun: pmi2
> srun: openmpi
> srun: pmix
>
> But now when I try to launch a job with srun, the job appears to be 
> running but doesn't actually do anything - it just hangs in the running 
> state. Any ideas what could be wrong, or how to debug this?
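>
> For the record, a sketch of the sort of checks that might narrow this
> down (hello_mpi is a placeholder for the test binary; pmix_v3 comes from
> the --mpi=list output above):
>
> $ scontrol show config | grep MpiDefault   # which plugin srun uses by default
> $ srun -N3 -n6 --mpi=pmix_v3 ./hello_mpi   # force the PMIx plugin explicitly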
>
> I'm also asking around on the Slurm mailing list.
>
> Prentice
>
> On 4/23/20 3:03 PM, Ralph Castain wrote:
>> You can trust the --mpi=list output. The problem is likely that OMPI wasn't 
>> configured --with-pmi2.
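>>
>> A quick sanity check (the exact component names can vary by version):
>>
>> $ ompi_info | grep -i pmi
>>
>> If no Slurm PMI component (e.g. s1/s2) shows up in that list, srun's
>> pmi2 plugin has nothing to talk to on the OMPI side.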
>>
>>
>>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
>>> either of them, my job still fails. Why is that? Can I not trust the output 
>>> of --mpi=list?
>>>
>>> Prentice
>>>
>>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>>>> No, but you do have to explicitly build OMPI with non-PMIx support if that 
>>>> is what you are going to use. In this case, you need to configure OMPI 
>>>> --with-pmi2=<path-to-the-pmi2-installation>
>>>>
>>>> You can leave off the path (i.e., just use "--with-pmi2") if Slurm was 
>>>> installed in a standard location, as we should find it there.
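>>>>
>>>> E.g., something along the lines of your existing configure line with
>>>> the PMI2 option added (the /usr path is just a guess for wherever
>>>> Slurm's pmi2.h/libpmi2 actually live; adjust it):
>>>>
>>>> ./configure --prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3 \
>>>>     --disable-silent-rules --enable-shared --with-pmix=internal \
>>>>     --with-slurm --with-psm --with-pmi2=/usr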
>>>>
>>>>
>>>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> It looks like it was built with PMI2, but not PMIx:
>>>>>
>>>>> $ srun --mpi=list
>>>>> srun: MPI types are...
>>>>> srun: none
>>>>> srun: pmi2
>>>>> srun: openmpi
>>>>>
>>>>> I did launch the job with srun --mpi=pmi2 ....
>>>>>
>>>>> Does OpenMPI 4 need PMIx specifically?
>>>>>
>>>>>
>>>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>>>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
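>>>>>>
>>>>>> i.e., something along these lines (hello_mpi is a placeholder for
>>>>>> your test program):
>>>>>>
>>>>>> $ srun --mpi=list               # should show a pmix entry
>>>>>> $ srun --mpi=pmix ./hello_mpi   # tell srun to use it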
>>>>>>
>>>>>>
>>>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>
>>>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software 
>>>>>>> with a very simple "hello, world" MPI program that I've used reliably 
>>>>>>> for years. When I submit the job through Slurm and use srun to launch 
>>>>>>> the job, I get these errors:
>>>>>>>
>>>>>>> *** An error occurred in MPI_Init
>>>>>>> *** on a NULL communicator
>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>> ***    and potentially your MPI job)
>>>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed 
>>>>>>> completed successfully, but am not able to aggregate error messages, 
>>>>>>> and not able to guarantee that all other processes were killed!
>>>>>>> *** An error occurred in MPI_Init
>>>>>>> *** on a NULL communicator
>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>> ***    and potentially your MPI job)
>>>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed 
>>>>>>> completed successfully, but am not able to aggregate error messages, 
>>>>>>> and not able to guarantee that all other processes were killed!
>>>>>>>
>>>>>>> If I run the same job, but use mpiexec or mpirun instead of srun, the 
>>>>>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was 
>>>>>>> compiled with  Slurm support:
>>>>>>>
>>>>>>> $ ompi_info | grep slurm
>>>>>>>   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' '--with-slurm' '--with-psm'
>>>>>>>                  MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>>>>>>                  MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>>>                  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>>>               MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>>>>>>
>>>>>>> Any ideas what could be wrong? Do you need any additional information?
>>>>>>>
>>>>>>> Prentice
>>>>>>>
