[OMPI users] MPI_Intercomm_create error

2022-03-16 Thread Mccall, Kurt E. (MSFC-EV41) via users
I'm using Open MPI 4.1.2 under Slurm 20.11.8. My two-process job is launched
successfully, but when the main process (rank 0) attempts to create an
intercommunicator with process rank 1 on the other node:

MPI_Comm intercom;
MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, ,   &intercom);
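(The trailing comma in the call above suggests the tag argument was stripped by
the list archiver. For reference, a minimal self-contained sketch of the intended
pattern, assuming a hypothetical tag value of 0 and a mirror-image call on rank 1,
might look like:

#include <mpi.h>

int main(int argc, char **argv)
{
    /* Sketch only: the tag value (0) and the symmetric call on rank 1 are
       assumptions, not taken from the original report. Run with exactly
       two processes. */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm intercom;
    int tag = 0;                              /* must match on both leaders */
    int remote_leader = (rank == 0) ? 1 : 0;  /* the other rank in MPI_COMM_WORLD */

    /* Each process forms its own local group (MPI_COMM_SELF, local leader 0);
       MPI_COMM_WORLD is the peer communicator used to reach the remote leader. */
    MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, remote_leader,
                         tag, &intercom);

    MPI_Comm_free(&intercom);
    MPI_Finalize();
    return 0;
}
)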

Open MPI spins deep inside the MPI_Intercomm_create code, and the following
warning is reported:

WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

The output resulting from using the mpirun arguments "--mca ras_base_verbose 5 
--display-devel-map --mca rmaps_base_verbose 5" is attached.
Any help would be appreciated.
SLURM_JOB_NODELIST =  n[001-002]
Calling mpirun for slurm
num_proc =  2
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm: available for 
selection
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with 
ppr:1:node device NONNULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr 
modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component mindist
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[mindist]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component ppr
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [ppr]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component rank_file
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[rank_file]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component resilient
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[resilient]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component 
[round_robin]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available 
component seq
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [seq]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0]: Final mapper priorities
[n001.cluster.pssclabs.com:3473322] Mapper: ppr Priority: 90
[n001.cluster.pssclabs.com:3473322] Mapper: seq Priority: 60
[n001.cluster.pssclabs.com:3473322] Mapper: resilient Priority: 40
[n001.cluster.pssclabs.com:3473322] Mapper: mindist Priority: 20
[n001.cluster.pssclabs.com:3473322] Mapper: round_robin Priority: 10
[n001.cluster.pssclabs.com:3473322] Mapper: rank_file Priority: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with 
ppr:1:node device NULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr 
modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
checking nodelist: n[001-002]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
parse range 001-002 (2)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
adding node n001 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: 
adding node n002 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate: success
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert 
inserting 2 nodes
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert updating 
HNP [n001] info to 24 slots
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert node 
n002 slots 24

==   ALLOCATED NODES   ==
n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
n002: flags=0x10 slots=24 max_slots=0 slots_inuse=0 state=UP
=

==   ALLOCATED NODES   ==
n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
n002: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
=
[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job 
[65186,1] nprocs 2
[n001.cluster.pssclabs.com:3473322] mca:rmaps[303] binding not given - using 
bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: mapping job [65186,1] with 
ppr 1:node
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,1] assigned 
policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting with 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Filtering thru apps
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Retained 2 nodes in list

Re: [OMPI users] MPI_Intercomm_create error

2022-03-16 Thread George Bosilca via users
I see similar issues on platforms with multiple IP addresses, if some of
them are not fully connected. In general, specifying which interface OMPI
can use (with --mca btl_tcp_if_include x.y.z.t/s) solves the problem.
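For example, assuming the nodes are fully connected over a hypothetical
10.0.0.0/24 private subnet (substitute the real CIDR or an interface name such
as eth0), the restriction could be added to the launch command, e.g.:

mpirun -np 2 --mca btl_tcp_if_include 10.0.0.0/24 ./my_app

Here ./my_app is a placeholder for the actual binary; if the runtime's
out-of-band channel also picks the wrong interface, oob_tcp_if_include can be
restricted in the same way.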

  George.




Re: [OMPI users] MPI_Intercomm_create error

2022-03-16 Thread Mccall, Kurt E. (MSFC-EV41) via users
George,

Thanks, that was it!

Kurt
