I'm using Open MPI 4.1.2 under Slurm 20.11.8. My 2-process job launches
successfully, but when the main process (rank 0) attempts to create an
intercommunicator with the process at rank 1 on the other node:
MPI_Comm intercom;
MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, 1, , &intercom);
Open MPI spins deep inside the MPI_Intercomm_create code, and the following
warning is reported:
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
The output produced by adding the mpirun arguments "--mca ras_base_verbose 5
--display-devel-map --mca rmaps_base_verbose 5" is attached.
Any help would be appreciated.
SLURM_JOB_NODELIST = n[001-002]
Calling mpirun for slurm
num_proc = 2
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm: available for selection
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with ppr:1:node device NONNULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component mindist
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [mindist]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component ppr
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [ppr]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component rank_file
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [rank_file]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component resilient
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [resilient]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component round_robin
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [round_robin]
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: checking available component seq
[n001.cluster.pssclabs.com:3473322] mca:rmaps:select: Querying component [seq]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0]: Final mapper priorities
[n001.cluster.pssclabs.com:3473322] Mapper: ppr Priority: 90
[n001.cluster.pssclabs.com:3473322] Mapper: seq Priority: 60
[n001.cluster.pssclabs.com:3473322] Mapper: resilient Priority: 40
[n001.cluster.pssclabs.com:3473322] Mapper: mindist Priority: 20
[n001.cluster.pssclabs.com:3473322] Mapper: round_robin Priority: 10
[n001.cluster.pssclabs.com:3473322] Mapper: rank_file Priority: 0
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base set policy with ppr:1:node device NULL
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] rmaps:base policy ppr modifiers 1:node provided
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:allocate
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: checking nodelist: n[001-002]
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: parse range 001-002 (2)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: adding node n001 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate:discover: adding node n002 (24 slots)
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:slurm:allocate: success
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert inserting 2 nodes
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert updating HNP [n001] info to 24 slots
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] ras:base:node_insert node n002 slots 24
== ALLOCATED NODES ==
n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
n002: flags=0x10 slots=24 max_slots=0 slots_inuse=0 state=UP
=
== ALLOCATED NODES ==
n001: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
n002: flags=0x11 slots=24 max_slots=0 slots_inuse=0 state=UP
=
[n001.cluster.pssclabs.com:3473322] mca:rmaps: mapping job [65186,1]
[n001.cluster.pssclabs.com:3473322] mca:rmaps: setting mapping policies for job [65186,1] nprocs 2
[n001.cluster.pssclabs.com:3473322] mca:rmaps[303] binding not given - using bycore
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: mapping job [65186,1] with ppr 1:node
[n001.cluster.pssclabs.com:3473322] mca:rmaps:ppr: job [65186,1] assigned policy BYNODE:NOOVERSUBSCRIBE
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Starting with 2 nodes in list
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Filtering thru apps
[n001.cluster.pssclabs.com:3473322] [[65186,0],0] Retained 2 nodes in list
[