Re: [OMPI users] RMA in openmpi

2020-04-27 Thread Claire Cashmore via users
Hi Joseph

Thank you for your reply. From what I had been reading I thought they were both 
called "synchronization calls", just that one was passive (lock) and one was 
active (fence), sorry if I've got confused! 
So I'm asking: do I need either MPI_Win_fence or MPI_Win_lock/unlock in order to 
use one-sided calls, or is it not possible to use one-sided communication 
without them? That is, a standalone MPI_Get, without the other calls before and 
after? It seems not from what you are saying, but I just wanted to confirm.

Thanks again

Claire

On 27/04/2020, 07:50, "Joseph Schuchart via users" wrote:

Claire,

 > Is it possible to use the one-sided communication without combining 
it with synchronization calls?

What exactly do you mean by "synchronization calls"? MPI_Win_fence is 
indeed synchronizing (basically flush+barrier) but MPI_Win_lock (and the 
passive target synchronization interface at large) is not. It does incur 
some overhead because the lock has to be taken somehow at some point. 
However, it does not require a matching call at the target to complete.

You can lock a window using a (shared or exclusive) lock, initiate RMA 
operations, flush them to wait for their completion, and initiate the 
next set of RMA operations to flush later. None of these calls are 
synchronizing. You will have to perform your own synchronization at some 
point though to make sure processes read consistent data.
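
A minimal sketch of that pattern (for illustration only; it assumes MPI is 
initialized and a window `win` was created earlier, e.g. with MPI_Win_create, 
exposing at least COUNT doubles on rank `target`):

    enum { COUNT = 8 };
    int    target = 0;          /* rank that owns the data */
    double local[COUNT];

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win); /* start a passive target access epoch */
    MPI_Get(local, COUNT, MPI_DOUBLE, target, 0, COUNT, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);                    /* wait for completion of the MPI_Get */
    /* ... initiate the next set of RMA operations, to be flushed later ... */
    MPI_Win_unlock(target, win);                   /* end the access epoch */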

HTH!
Joseph


On 4/24/20 5:34 PM, Claire Cashmore via users wrote:
> Hello
> 
> I was wondering if someone could help me with a question.
> 
> When using RMA is there a requirement to use some type of 
> synchronization? When using one-sided communication such as MPI_Get the 
> code will only run when I combine it with MPI_Win_fence or 
> MPI_Win_lock/unlock. I do not want to use MPI_Win_fence as I’m using the 
> one-sided communication to allow some communication when processes are 
> not synchronised, so this defeats the point. I could use 
> MPI_Win_lock/unlock; however, someone I’ve spoken to has said that I 
> should be able to use RMA without any synchronization calls. If so, I 
> would prefer that, to reduce any overheads that calling MPI_Win_lock 
> every time I use the one-sided communication may produce.
> 
> Is it possible to use the one-sided communication without combining it 
> with synchronization calls?
> 
> (It doesn’t seem to matter what version of openmpi I use).
> 
> Thank you
> 
> Claire
> 



Re: [OMPI users] RMA in openmpi

2020-04-27 Thread Joseph Schuchart via users

Hi Claire,

You cannot use MPI_Get (or any other RMA communication routine) on a 
window for which no access epoch has been started. MPI_Win_fence starts 
an active target access epoch, MPI_Win_lock[_all] start a passive target 
access epoch. Window locks are synchronizing in the sense that they 
provide a means for mutual exclusion if an exclusive lock is involved (a 
process holding a shared window lock allows for other processes to 
acquire shared locks but prevents them from taking an exclusive lock, 
and vice versa).
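
For contrast, a minimal fence-based (active target) sketch, under the same 
assumptions as the earlier example (`win`, `target`, COUNT, `local`); note 
that every process in the window's group must make the matching fence calls:

    MPI_Win_fence(0, win); /* opens an active target epoch on all processes  */
    MPI_Get(local, COUNT, MPI_DOUBLE, target, 0, COUNT, MPI_DOUBLE, win);
    MPI_Win_fence(0, win); /* closes the epoch; the MPI_Get is then complete */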


One common strategy is to call MPI_Win_lock_all on all processes to let 
all processes acquire a shared lock, which they hold until the end of 
the application run. Communication is then done using a combination of 
MPI_Get/MPI_Put/accumulate functions and flushes. As said earlier, you 
likely will need to take care of synchronization among the processes if 
they also modify data in the window.
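
A rough sketch of that strategy (same assumptions as above; only the window 
handling is shown):

    MPI_Win_lock_all(0, win);       /* every process acquires a shared lock once */

    /* repeated throughout the run: */
    MPI_Put(local, COUNT, MPI_DOUBLE, target, 0, COUNT, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);     /* the MPI_Put is complete once this returns */
    /* ... plus your own synchronization where processes modify shared data ... */

    MPI_Win_unlock_all(win);        /* at the end of the application run */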


Cheers
Joseph

On 4/27/20 12:14 PM, Claire Cashmore wrote:

Hi Joseph

Thank you for your reply. From what I had been reading I thought they were both called 
"synchronization calls", just that one was passive (lock) and one was active 
(fence), sorry if I've got confused!
So I'm asking: do I need either MPI_Win_fence or MPI_Win_lock/unlock in order to 
use one-sided calls, or is it not possible to use one-sided communication 
without them? That is, a standalone MPI_Get, without the other calls before and 
after? It seems not from what you are saying, but I just wanted to confirm.

Thanks again

Claire

On 27/04/2020, 07:50, "Joseph Schuchart via users" wrote:

 Claire,

  > Is it possible to use the one-sided communication without combining
 it with synchronization calls?

 What exactly do you mean by "synchronization calls"? MPI_Win_fence is
 indeed synchronizing (basically flush+barrier) but MPI_Win_lock (and the
 passive target synchronization interface at large) is not. It does incur
 some overhead because the lock has to be taken somehow at some point.
 However, it does not require a matching call at the target to complete.

 You can lock a window using a (shared or exclusive) lock, initiate RMA
 operations, flush them to wait for their completion, and initiate the
 next set of RMA operations to flush later. None of these calls are
 synchronizing. You will have to perform your own synchronization at some
 point though to make sure processes read consistent data.

 HTH!
 Joseph


 On 4/24/20 5:34 PM, Claire Cashmore via users wrote:
 > Hello
 >
 > I was wondering if someone could help me with a question.
 >
 > When using RMA is there a requirement to use some type of
 > synchronization? When using one-sided communication such as MPI_Get the
 > code will only run when I combine it with MPI_Win_fence or
 > MPI_Win_lock/unlock. I do not want to use MPI_Win_fence as I’m using the
 > one-sided communication to allow some communication when processes are
 > not synchronised, so this defeats the point. I could use
 > MPI_Win_lock/unlock; however, someone I’ve spoken to has said that I
 > should be able to use RMA without any synchronization calls. If so, I
 > would prefer that, to reduce any overheads that calling MPI_Win_lock
 > every time I use the one-sided communication may produce.
 >
 > Is it possible to use the one-sided communication without combining it
 > with synchronization calls?
 >
 > (It doesn’t seem to matter what version of openmpi I use).
 >
 > Thank you
 >
 > Claire
 >



[OMPI users] Handle Ctrl+C in subprocesses

2020-04-27 Thread Jérémie Wenger via users
Hi,

I recently installed Open MPI (4.0.3) using the procedure described here, as 
I'm trying to use Horovod for multi-GPU acceleration.

I am looking for a way to handle a keyboard interrupt (save a deep learning 
model before shutting everything down). I posted a question here.

I have seen this thread, which is inconclusive, and this specific message, 
which is really the exact situation I'm in. And I've seen that this earlier 
one mentions the SIGINT received (although, strangely enough, when I tried to 
print the signal I got SIGCONT instead; the result is the same as above 
anyway, my subprocesses just stop without any handling).

I'm wondering if there is a way of delaying the shutdown of my GPU processes 
so I can save the latest state of the model. It would be practical.
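
(The setup above is Python/Horovod, but the usual pattern is 
language-agnostic: trap the signal, set a flag, and let the main loop 
checkpoint and exit cleanly. How, and whether, the launcher forwards the 
signal to each rank is a separate question. A minimal C sketch, where 
save_checkpoint() is a hypothetical stand-in for whatever persists the model:)

    #include <signal.h>
    #include <string.h>

    static volatile sig_atomic_t stop_requested = 0;

    static void on_signal(int sig)
    {
        (void)sig;
        stop_requested = 1;  /* only set a flag; do the real work in the loop */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_signal;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT,  &sa, NULL);  /* Ctrl+C                             */
        sigaction(SIGTERM, &sa, NULL);  /* launchers often send this as well  */

        while (!stop_requested) {
            /* ... one training/iteration step ... */
        }

        /* save_checkpoint();  hypothetical: persist the model before exiting */
        return 0;
    }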

Many thanks in advance for your help,
Jeremie


Re: [OMPI users] [External] RE: Re: Can't start jobs with srun.

2020-04-27 Thread Prentice Bisbal via users
Yes. "srun -N3 hostname" works. The problem only seems to occur when I 
specify the --mpi option, so the problem seems related to PMI.


On 4/24/20 2:28 PM, Riebs, Andy wrote:

Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain ; Open MPI Users 
Cc: Prentice Bisbal 
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running but doesn't do anything - it just hangs in the running state.
Any ideas what could be wrong, or how to debug this?

I'm also asking around on the Slurm mailing list.

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:

You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
 wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI with non-PMIx support if that is what 
you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
 wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
 wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

If I run the same job, but use mpiexec or mpirun instead of srun, the jobs run 
just fine. I checked ompi_info to make sure OpenMPI was compiled with  Slurm 
support:

$ ompi_info | grep slurm
Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' '--with-pmix=internal' 
'--with-slurm' '--with-psm'
   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

Any ideas what could be wrong? Do you need any additional information?

Prentice



Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-27 Thread Prentice Bisbal via users

Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened 
a ticket with Slurm support to see if it's a problem on Slurm's end.


Prentice

On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I 
doubt it is used or tested very much as pretty much everyone has moved 
to PMIx. In fact, we completely dropped PMI-1 and PMI-2 from OMPI v5 
for that reason.


I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is 
what OMPI v4 is using, and launching with "srun --mpi=pmix_v3"



On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users 
<users@lists.open-mpi.org> wrote:


I also have this problem on servers I'm benchmarking at DELL's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
Prentice, have you tried something trivial, like "srun -N3 
hostname", to rule out non-OMPI problems?


Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
Prentice Bisbal via users

Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users 
<users@lists.open-mpi.org>

Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running but doesn't do anything - it just hangs in the running state.
Any ideas what could be wrong, or how to debug this?

I'm also asking around on the Slurm mailing list.

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI 
wasn't configured --with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:


--mpi=list shows pmi2 and openmpi as valid values, but if I set 
--mpi= to either of them, my job still fails. Why is that? Can I 
not trust the output of --mpi=list?


Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx 
support if that is what you are going to use. In this case, you 
need to configure OMPI --with-pmi2=


You can leave off the path (i.e., just "--with-pmi2") if Slurm 
was installed in a standard location, as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:


It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?


On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:


I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the 
software with a very simple hello, world MPI program that I've 
used reliably for years. When I submit the job through slurm 
and use srun to launch the job, I get these errors:


*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,

***    and potentially your MPI job)
[dawson029.pppl.gov:26070] 
Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,

***    and potentially your MPI job)
[dawson029.pppl.gov:26076] 
Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to 
guarantee that all other processes were killed!


If I run the same job, but use mpiexec or mpirun instead of 
srun, the jobs run just fine. I checked ompi_info to make sure 
OpenMPI was compiled with  Slurm support:


$ ompi_info | grep slurm
  Configure command line: 
'--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' 
'--disable-silent-rules' '--enable-shared' 
'--with-pmix=internal' '--with-slurm' '--with-psm'
 MCA ess: slurm (MCA v2.1.0, API v3.0.0, 
Component v

Re: [OMPI users] Can't start jobs with srun.

2020-04-27 Thread Riebs, Andy via users
Y’know, a quick check on versions and PATHs might be a good idea here. I 
suggest something like

$ srun  -N3  ompi_info  |&  grep  "MPI repo"

to confirm that all nodes are running the same version of OMPI.

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Monday, April 27, 2020 10:25 AM
To: users@lists.open-mpi.org
Cc: Prentice Bisbal 
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.


Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened a 
ticket with Slurm support to see if it's a problem on Slurm's end.

Prentice
On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I doubt it 
is used or tested very much as pretty much everyone has moved to PMIx. In fact, 
we completely dropped PMI-1 and PMI-2 from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is what OMPI 
v4 is using, and launching with "srun --mpi=pmix_v3"



On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users 
<users@lists.open-mpi.org> wrote:

I also have this problem on servers I'm benchmarking at DELL's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:

Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users 
<users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running but doesn't do anything - it just hangs in the running state.
Any ideas what could be wrong, or how to debug this?

I'm also asking around on the Slurm mailing list.

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:

You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2



On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:

No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.



On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:

Is Slurm built with PMIx support? Did you tell srun to use it?



On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before 
MPI_INIT completed completed successfully, but am not able to aggregate error 
messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26076] Local abort before 
MPI_INIT completed completed successfully, but am not able to aggregate error 
messages, and not able to guarantee that all other processes were killed!

Re: [OMPI users] Can't start jobs with srun.

2020-04-27 Thread Riebs, Andy via users
Lost a line…

Also helpful to check

$ srun -N3 which ompi_info

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Riebs, Andy 
via users
Sent: Monday, April 27, 2020 10:59 AM
To: Open MPI Users 
Cc: Riebs, Andy 
Subject: Re: [OMPI users] Can't start jobs with srun.

Y’know, a quick check on versions and PATHs might be a good idea here. I 
suggest something like

$ srun  -N3  ompi_info  |&  grep  "MPI repo"

to confirm that all nodes are running the same version of OMPI.

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Monday, April 27, 2020 10:25 AM
To: users@lists.open-mpi.org
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.


Ralph,

PMI2 support works just fine. It's just PMIx that seems to be the problem.

We rebuilt Slurm with PMIx 3.1.5, but the problem persists. I've opened a 
ticket with Slurm support to see if it's a problem on Slurm's end.

Prentice
On 4/26/20 2:12 PM, Ralph Castain via users wrote:
It is entirely possible that the PMI2 support in OMPI v4 is broken - I doubt it 
is used or tested very much as pretty much everyone has moved to PMIx. In fact, 
we completely dropped PMI-1 and PMI-2 from OMPI v5 for that reason.

I would suggest building Slurm with PMIx v3.1.5 
(https://github.com/openpmix/openpmix/releases/tag/v3.1.5) as that is what OMPI 
v4 is using, and launching with "srun --mpi=pmix_v3"


On Apr 26, 2020, at 10:07 AM, Patrick Bégou via users 
<users@lists.open-mpi.org> wrote:

I also have this problem on servers I'm benchmarking at DELL's lab with
OpenMPI-4.0.3. I've tried a new build of OpenMPI with "--with-pmi2". No
change.
Finally, my workaround in the slurm script was to launch my code with
mpirun. As mpirun was only finding one slot per node, I used
"--oversubscribe --bind-to core" and checked that every process was
bound to a separate core. It worked, but do not ask me why :-)

Patrick

On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
Prentice, have you tried something trivial, like "srun -N3 hostname", to rule 
out non-OMPI problems?

Andy

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Prentice 
Bisbal via users
Sent: Friday, April 24, 2020 2:19 PM
To: Ralph Castain <r...@open-mpi.org>; Open MPI Users 
<users@lists.open-mpi.org>
Cc: Prentice Bisbal <pbis...@pppl.gov>
Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.

Okay. I've got Slurm built with pmix support:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmix_v3
srun: pmi2
srun: openmpi
srun: pmix

But now when I try to launch a job with srun, the job appears to be
running but doesn't do anything - it just hangs in the running state.
Any ideas what could be wrong, or how to debug this?

I'm also asking around on the Slurm mailing list.

Prentice

On 4/23/20 3:03 PM, Ralph Castain wrote:
You can trust the --mpi=list. The problem is likely that OMPI wasn't configured 
--with-pmi2


On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

--mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi= to 
either of them, my job still fails. Why is that? Can I not trust the output of 
--mpi=list?

Prentice

On 4/23/20 10:43 AM, Ralph Castain via users wrote:
No, but you do have to explicitly build OMPI with non-PMIx support if that is 
what you are going to use. In this case, you need to configure OMPI 
--with-pmi2=

You can leave off the path (i.e., just "--with-pmi2") if Slurm was installed in 
a standard location, as we should find it there.


On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

It looks like it was built with PMI2, but not PMIx:

$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: openmpi

I did launch the job with srun --mpi=pmi2 

Does OpenMPI 4 need PMIx specifically?


On 4/23/20 10:23 AM, Ralph Castain via users wrote:
Is Slurm built with PMIx support? Did you tell srun to use it?


On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users 
<users@lists.open-mpi.org> wrote:

I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the software with a 
very simple hello, world MPI program that I've used reliably for years. When I 
submit the job through slurm and use srun to launch the job, I get these errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[dawson029.pppl.gov:26070] Local abort before 
MPI_INIT completed completed successfully, but am not able to aggregate error 
messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI

Re: [OMPI users] Can't start jobs with srun.

2020-04-27 Thread Daniel Letai via users

  
  
I know it's not supposed to matter, but have you tried building both ompi and 
slurm against the same pmix? That is: first build pmix, then build slurm 
--with-pmix, and then ompi with both slurm and pmix=external?

On 23/04/2020 17:00, Prentice Bisbal via users wrote:


  $ ompi_info | grep slurm
    Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
    '--disable-silent-rules' '--enable-shared' '--with-pmix=internal'
    '--with-slurm' '--with-psm'
             MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
             MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
             MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
          MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

  Any ideas what could be wrong? Do you need any additional information?

  Prentice