Thanks Gilles but no go...

/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml ^ucx /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

  Process 1 ([[48649,2],32]) is on host: fsc08
  Process 2 ([[48649,1],0]) is on host: unknown!
  BTLs attempted: vader tcp self

Your MPI job is now going to abort; sorry.

[fsc08:465159] [[45369,2],27] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493

On Thu, 2021-08-26 at 14:30 +0900, Gilles Gouaillardet via users wrote:
Franco,

I am surprised UCX gets selected since there is no Infiniband network.
There used to be a bug that led UCX to be selected on shm/tcp systems, but
it has been fixed. You might want to try the latest versions of Open MPI
(4.0.6 or 4.1.1).

Meanwhile, try to
mpirun --mca pml ^ucx ...
and see if it helps
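
If it is more convenient to set this through the environment, the same
exclusion can be expressed as an MCA environment variable, something like
export OMPI_MCA_pml=^ucx
before invoking mpirun.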


Cheers,

Gilles

On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users
<users@lists.open-mpi.org> wrote:
Hi,

I have 2 example progs that I found on the internet (attached) that illustrate
a problem we are having launching multi-node jobs with OpenMPI 4.0.5 and
MPI_Comm_spawn.
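
The spawn example is roughly along these lines (a simplified sketch, not the
exact attached code; the child binary path "./spawned" and the argument
handling are placeholders):

/* parent: spawn N children of a separate binary and send each an integer */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i, len, nchildren = (argc > 1) ? atoi(argv[1]) : 1;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm children;

    MPI_Init(&argc, &argv);
    MPI_Get_processor_name(host, &len);
    printf("I'm the parent on %s\n", host);
    printf("Starting %d children\n", nchildren);

    /* spawn nchildren copies of the child executable; per-process
       error codes are ignored */
    MPI_Comm_spawn("./spawned", MPI_ARGV_NULL, nchildren, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* hand each child a token over the intercommunicator */
    for (i = 0; i < nchildren; i++) {
        int token = 999;
        MPI_Send(&token, 1, MPI_INT, i, 0, children);
    }

    MPI_Comm_free(&children);
    MPI_Finalize();
    return 0;
}

The children presumably call MPI_Comm_get_parent() and receive the token from
the parent over that intercommunicator, which matches the "Received 999" lines
in the output further down.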

CentOS Linux release 8.4.2105
openmpi-4.0.5-3.el8.x86_64
Slurm 20.11.8

10Gbit ethernet network, no IB or other networks

I allocate 2 nodes, each with 24 cores. They are identical systems with a 
shared NFS root.

salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24

Running the hello prog with OpenMPI 4.0.5

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun /home/franco/hello

MPI_Init(): 307.434000
hello, world (rank 0 of 48 fsc07)
...
MPI_Init(): 264.714000
hello, world (rank 47 of 48 fsc08)

All well and good.
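
For reference, hello is essentially the standard MPI hello world with a timed
MPI_Init, along the lines of (again a sketch, not the exact attached code):

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    MPI_Init(&argc, &argv);
    gettimeofday(&t1, NULL);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* report MPI_Init time in milliseconds, then the usual greeting */
    printf("MPI_Init(): %f\n",
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3);
    printf("hello, world (rank %d of %d %s)\n", rank, size, host);

    MPI_Finalize();
    return 0;
}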

Now running the MPI_spawn example prog with OpenMPI 4.0.1

/library/mpi/openmpi-4.0.1//bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

I'm the spawned.
hello, world (rank 0 of 47 fsc07)
Received 999 err 0 (rank 0 of 47 fsc07)
I'm the spawned.
hello, world (rank 1 of 47 fsc07)
Received 999 err 0 (rank 1 of 47 fsc07)
....
I'm the spawned.
hello, world (rank 45 of 47 fsc08)
Received 999 err 0 (rank 45 of 47 fsc08)
I'm the spawned.
hello, world (rank 46 of 47 fsc08)
Received 999 err 0 (rank 46 of 47 fsc08)

Works fine.

Now rebuild spawn_example with 4.0.5 and run as before

ldd /home/franco/spawn_example | grep openmpi
        libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40 (0x00007fc2c0655000)
        libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007fc2bfdb6000)
        libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007fc2bfb08000)

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47


I'm the parent on fsc07
Starting 47 children

[fsc08:463361] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
....
[fsc08:462917] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[fsc08:462926] *** An error occurred in MPI_Init
[fsc08:462926] *** reported by process [2779774978,42]
[fsc08:462926] *** on a NULL communicator
[fsc08:462926] *** Unknown error
[fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc08:462926] ***    and potentially your MPI job)
[fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
[fsc07:1158342] *** reported by process [2779774977,0]
[fsc07:1158342] *** on communicator MPI_COMM_WORLD
[fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
[fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc07:1158342] ***    and potentially your MPI job)
[1629952748.688500] [fsc07:1158342:0]           sock.c:244  UCX  ERROR connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused

The IP address is that of node fsc08; the program is being run from fsc07.

I see the orted process running on fsc08 for both hello and spawn_example with
the same arguments. I also tried turning on various debug options but I'm none
the wiser.

If I run the spawn example with 23 children it works fine - because they are 
all on fsc07.

Any idea what might be wrong?

Cheers,
Franco
