I have also hit this problem, on servers I'm benchmarking at DELL's lab with OpenMPI 4.0.3. I tried a new build of OpenMPI with "--with-pmi2", but that made no difference. In the end, my workaround in the Slurm script was to launch my code with mpirun. Since mpirun was only finding one slot per node, I used "--oversubscribe --bind-to core" and checked that every process was bound to a separate core. It worked, but don't ask me why :-)
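A rough sketch of that kind of batch script (the hello_mpi binary name, node count, and tasks-per-node are placeholders, and --report-bindings is just one way to verify the pinning):

    #!/bin/bash
    #SBATCH --nodes=3
    #SBATCH --ntasks-per-node=32

    # Workaround: launch with mpirun instead of srun. mpirun only detected one
    # slot per node, so allow oversubscription and pin each rank to its own core.
    # --report-bindings prints where each rank lands so the binding can be checked.
    mpirun -np $SLURM_NTASKS --oversubscribe --bind-to core --report-bindings ./hello_mpi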
Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
> Prentice, have you tried something trivial, like "srun -N3 hostname", to
> rule out non-OMPI problems?
>
> Andy
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice Bisbal via users
> Sent: Friday, April 24, 2020 2:19 PM
> To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
> Cc: Prentice Bisbal <pbis...@pppl.gov>
> Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
>
> Okay. I've got Slurm built with pmix support:
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: none
> srun: pmix_v3
> srun: pmi2
> srun: openmpi
> srun: pmix
>
> But now when I try to launch a job with srun, the job appears to be
> running but doesn't do anything - it just hangs in the running state.
> Any ideas what could be wrong, or how to debug this?
>
> I'm asking around on the Slurm mailing list, too.
>
> Prentice
>
> On 4/23/20 3:03 PM, Ralph Castain wrote:
>> You can trust the --mpi=list. The problem is likely that OMPI wasn't
>> configured --with-pmi2.
>>
>>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to
>>> either of them, my job still fails. Why is that? Can I not trust the
>>> output of --mpi=list?
>>>
>>> Prentice
>>>
>>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>>>> No, but you do have to explicitly build OMPI with non-PMIx support if
>>>> that is what you are going to use. In this case, you need to configure
>>>> OMPI --with-pmi2=<path-to-the-pmi2-installation>
>>>>
>>>> You can leave off the path (i.e., just "--with-pmi2") if Slurm was
>>>> installed in a standard location, as we should find it there.
>>>>
>>>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> It looks like it was built with PMI2, but not PMIx:
>>>>>
>>>>> $ srun --mpi=list
>>>>> srun: MPI types are...
>>>>> srun: none
>>>>> srun: pmi2
>>>>> srun: openmpi
>>>>>
>>>>> I did launch the job with srun --mpi=pmi2 ....
>>>>>
>>>>> Does OpenMPI 4 need PMIx specifically?
>>>>>
>>>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>>>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
>>>>>>
>>>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
>>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>>
>>>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software
>>>>>>> with a very simple "hello, world" MPI program that I've used reliably
>>>>>>> for years. When I submit the job through Slurm and use srun to launch
>>>>>>> the job, I get these errors:
>>>>>>>
>>>>>>> *** An error occurred in MPI_Init
>>>>>>> *** on a NULL communicator
>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>> *** and potentially your MPI job)
>>>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed
>>>>>>> completed successfully, but am not able to aggregate error messages,
>>>>>>> and not able to guarantee that all other processes were killed!
>>>>>>>
>>>>>>> *** An error occurred in MPI_Init
>>>>>>> *** on a NULL communicator
>>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>>> *** and potentially your MPI job)
>>>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed
>>>>>>> completed successfully, but am not able to aggregate error messages,
>>>>>>> and not able to guarantee that all other processes were killed!
>>>>>>>
>>>>>>> If I run the same job but use mpiexec or mpirun instead of srun, the
>>>>>>> jobs run just fine. I checked ompi_info to make sure OpenMPI was
>>>>>>> compiled with Slurm support:
>>>>>>>
>>>>>>> $ ompi_info | grep slurm
>>>>>>>   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
>>>>>>>     '--disable-silent-rules' '--enable-shared' '--with-pmix=internal'
>>>>>>>     '--with-slurm' '--with-psm'
>>>>>>>   MCA ess:    slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>>>>>>>   MCA plm:    slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>>>   MCA ras:    slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>>>>>>>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>>>>>>>
>>>>>>> Any ideas what could be wrong? Do you need any additional information?
>>>>>>>
>>>>>>> Prentice
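For anyone else hitting this, a minimal sketch of the two launch paths discussed in the thread above, following the configure advice quoted there (the install prefix, node/task counts, and hello_mpi binary are illustrative, and the path to Slurm's PMI-2 library is omitted on the assumption that Slurm sits in a standard location):

    # Path 1: rebuild Open MPI against Slurm's PMI-2 library, then launch via pmi2
    ./configure --prefix=$HOME/sw/openmpi-4.0.3-pmi2 --with-slurm --with-pmi2
    make -j 8 install
    srun --mpi=pmi2 -N 3 -n 96 ./hello_mpi

    # Path 2: build Slurm itself with PMIx support and launch via pmix
    # ("srun --mpi=list" should then show pmix / pmix_v3 among the valid types)
    srun --mpi=pmix -N 3 -n 96 ./hello_mpi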