Ralph,
PMI2 support works just fine. It's just PMIx that seems to be the problem.
We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened
a ticket with Slurm support to see if it's a problem on Slurm's end.
Prentice
On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I
doubt it is used or tested very much as pretty much everyone has moved
to PMIx. In fact, we completely dropped PMI-1 and PMI-2 from OMPI v5
for that reason.
I would suggest building Slurm with PMIx v3.1.5
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is
what OMPI v4 is using, and launching with "srun --mpi=pmix_v3"
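Roughly, that would look like the following (the install prefixes are just placeholders):

# 1. build PMIx 3.1.5
tar xf pmix-3.1.5.tar.bz2 && cd pmix-3.1.5
./configure --prefix=/opt/pmix/3.1.5
make -j && make install

# 2. rebuild Slurm (or at least its mpi/pmix plugin) against it
cd ../slurm-19.05.5
./configure --prefix=/opt/slurm --with-pmix=/opt/pmix/3.1.5
make -j && make install

# 3. verify and launch
srun --mpi=list              # should now list pmix_v3
srun --mpi=pmix_v3 ./a.out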
On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users
<users@lists.open-mpi.org> wrote:
I also have this problem on servers I'm benchmarking at DELL's lab with
OpenMPI-4.0.3. I tried a new build of OpenMPI with "--with-pmi2". No
change.
In the end, my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but don't ask me why :-)
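For what it's worth, the relevant part of the batch script looks roughly
like this (node/task counts and the binary name are only illustrative):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# mpirun only saw one slot per node, hence --oversubscribe
mpirun -np $SLURM_NTASKS --oversubscribe --bind-to core ./my_code

# the binding can be checked by adding --report-bindings to the mpirun line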
Patrick
On 4/24/20 8:28 PM, Riebs, Andy via users wrote:
Prentice, have you tried something trivial, like "srun -N3
hostname", to rule out non-OMPI problems?
Andy
-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
Prentice Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
Okay. I've got Slurm built with pmix support:
$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix
But now when I try to launch a job with srun, the job appears to be
running but doesn't actually do anything - it just hangs in the running
state and produces no output. Any ideas what could be wrong, or how to
debug this?
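In case it matters, the launch itself is nothing special; roughly (the
binary name is a placeholder, and the verbosity knob is my guess at the
OMPI 4.x MCA parameter spelling):

srun --mpi=pmix_v3 -n 4 ./hello_mpi     # stays in R state, prints nothing

# things I can try for more detail:
export OMPI_MCA_pmix_base_verbose=10    # assumed OMPI 4.x parameter name
srun --mpi=pmix_v3 -n 4 ./hello_mpi

# and a backtrace from one stuck rank, taken on the compute node:
gdb -batch -ex bt -p <pid-of-hello_mpi>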
I'm also asking around on the Slurm mailing list.
Prentice
On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI
wasn't configured --with-pmi2
On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
<users@lists.open-mpi.org> wrote:
--mpi=list shows pmi2 and openmpi as valid values, but if I set
--mpi= to either of them, my job still fails. Why is that? Can I
not trust the output of --mpi=list?
Prentice
On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx
support if that is what you are going to use. In this case, you
need to configure OMPI --with-pmi2=<path-to-the-pmi2-installation>
You can leave off the path (i.e., just "--with-pmi2") if Slurm was
installed in a standard location, as we should find it there.
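In other words, something like this (prefix paths are placeholders):

# if Slurm's PMI-2 headers/library are in a standard location:
./configure --prefix=/opt/openmpi-4.0.3 --with-slurm --with-pmi2
make -j && make install

# or, pointing at the installation explicitly:
./configure --prefix=/opt/openmpi-4.0.3 --with-slurm --with-pmi2=/opt/slurm
make -j && make install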
On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
<users@lists.open-mpi.org> wrote:
It looks like it was built with PMI2, but not PMIx:
$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi
I did launch the job with srun --mpi=pmi2 ....
Does OpenMPI 4 need PMIx specifically?
On 4/23/20 10:23 AM, Ralph Castain via users wrote:
Is Slurm built with PMIx support? Did you tell srun to use it?
On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
<users@lists.open-mpi.org> wrote:
I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the
software with a very simple hello-world MPI program that I've
used reliably for years. When I submit the job through Slurm
and use srun to launch the job, I get these errors:
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
*** and potentially your MPI job)
[dawson029.pppl.gov:26070]
Local abort before MPI_INIT completed completed successfully,
but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will
now abort,
*** and potentially your MPI job)
[dawson029.pppl.gov:26076]
Local abort before MPI_INIT completed completed successfully,
but am not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
If I run the same job, but use mpiexec or mpirun instead of
srun, the jobs run just fine. I checked ompi_info to make sure
OpenMPI was compiled with Slurm support:
$ ompi_info | grep slurm
  Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' '--disable-silent-rules' '--enable-shared' '--with-pmix=internal' '--with-slurm' '--with-psm'
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
Any ideas what could be wrong? Do you need any additional
information?
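For completeness, the batch script boils down to the following (node and
task counts and the binary name are only illustrative):

#SBATCH --nodes=1
#SBATCH --ntasks=2

srun ./hello_mpi      # fails in MPI_Init as shown above
# whereas this works:
mpirun ./hello_mpi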
Prentice