Indeed ...
I am not 100% sure the two errors are unrelated, but anyway,
That example passes with Open MPI 4.0.1 and 4.0.6 and crashes with the
versions in between.
It also passes with the 4.1 and master branches.
Bottom line: upgrade Open MPI to the latest version and you should be fine.
Cheers,
Gilles
On 8/26/2021 2:42 PM, Broi, Franco via users wrote:
Thanks Gilles but no go...
/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml ^ucx /home/franco/spawn_example 47
I'm the parent on fsc07
Starting 47 children
Process 1 ([[48649,2],32]) is on host: fsc08
Process 2 ([[48649,1],0]) is on host: unknown!
BTLs attempted: vader tcp self
Your MPI job is now going to abort; sorry.
[fsc08:465159] [[45369,2],27] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
On Thu, 2021-08-26 at 14:30 +0900, Gilles Gouaillardet via users wrote:
Franco,
I am surprised UCX gets selected since there is no InfiniBand network.
There used to be a bug that led UCX to be selected on shm/tcp systems, but
it has been fixed. You might want to try the latest versions of Open MPI
(4.0.6 or 4.1.1).
Meanwhile, try
mpirun --mca pml ^ucx ...
and see if that helps.
Cheers,
Gilles
On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users
<users@lists.open-mpi.org> wrote:
Hi,
I have two example programs that I found on the internet (attached) that
illustrate a problem we are having launching multi-node jobs with
Open MPI 4.0.5 and MPI_Comm_spawn.
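For reference, the spawn example is essentially a single binary that spawns copies of itself with MPI_Comm_spawn and exchanges a token over the resulting intercommunicator. Below is a minimal hypothetical sketch reconstructed from the output further down; the attached program may differ in detail.

    /*
     * Hypothetical sketch of spawn_example, reconstructed from its output;
     * the real attachment may differ.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, intercomm;
        int rank, size, len, value = 999;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);
        MPI_Get_processor_name(host, &len);

        if (parent == MPI_COMM_NULL) {
            /* Parent: spawn N copies of this binary as children. */
            int nchildren = (argc > 1) ? atoi(argv[1]) : 1;
            printf("I'm the parent on %s\n", host);
            printf("Starting %d children\n", nchildren);
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, nchildren, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
            /* Send a token to every child over the intercommunicator. */
            for (int i = 0; i < nchildren; i++)
                MPI_Send(&value, 1, MPI_INT, i, 0, intercomm);
        } else {
            /* Child: report in and receive the token from the parent
             * (rank 0 of the remote group of the parent intercomm). */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            printf("I'm the spawned.\n");
            printf("hello, world (rank %d of %d %s)\n", rank, size, host);
            int err = MPI_Recv(&value, 1, MPI_INT, 0, 0, parent,
                               MPI_STATUS_IGNORE);
            printf("Received %d err %d (rank %d of %d %s)\n",
                   value, err, rank, size, host);
        }

        MPI_Finalize();
        return 0;
    }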
CentOS Linux release 8.4.2105
openmpi-4.0.5-3.el8.x86_64
Slurm 20.11.8
10 Gbit Ethernet network, no IB or other networks
I allocate 2 nodes, each with 24 cores. They are identical systems
with a shared NFS root.
salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24
Running the hello program with Open MPI 4.0.5:
/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5
/usr/lib64/openmpi/bin/mpirun /home/franco/hello
MPI_Init(): 307.434000
hello, world (rank 0 of 48 fsc07)
...
MPI_Init(): 264.714000
hello, world (rank 47 of 48 fsc08)
All well and good.
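For context, the hello program is just the usual MPI hello world plus a rough timing of MPI_Init. A minimal sketch of what it presumably looks like, assuming the timing is printed in milliseconds:

    /*
     * Hypothetical sketch of the hello program, reconstructed from its
     * output; the attached version may differ (e.g. in how MPI_Init is
     * timed -- milliseconds are assumed here).
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];
        struct timespec t0, t1;

        /* Time MPI_Init with a monotonic clock. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        MPI_Init(&argc, &argv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        printf("MPI_Init(): %f\n",
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        printf("hello, world (rank %d of %d %s)\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }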
Now running the spawn example program with Open MPI 4.0.1:
/library/mpi/openmpi-4.0.1//bin/mpirun -c 1 /home/franco/spawn_example 47
I'm the parent on fsc07
Starting 47 children
I'm the spawned.
hello, world (rank 0 of 47 fsc07)
Received 999 err 0 (rank 0 of 47 fsc07)
I'm the spawned.
hello, world (rank 1 of 47 fsc07)
Received 999 err 0 (rank 1 of 47 fsc07)
....
I'm the spawned.
hello, world (rank 45 of 47 fsc08)
Received 999 err 0 (rank 45 of 47 fsc08)
I'm the spawned.
hello, world (rank 46 of 47 fsc08)
Received 999 err 0 (rank 46 of 47 fsc08)
Works fine.
Now rebuild spawn_example with 4.0.5 and run as before
ldd /home/franco/spawn_example | grep openmpi
    libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40 (0x00007fc2c0655000)
    libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007fc2bfdb6000)
    libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007fc2bfb08000)
/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5
/usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47
I'm the parent on fsc07
Starting 47 children
[fsc08:463361] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
....
[fsc08:462917] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[fsc08:462926] *** An error occurred in MPI_Init
[fsc08:462926] *** reported by process [2779774978,42]
[fsc08:462926] *** on a NULL communicator
[fsc08:462926] *** Unknown error
[fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc08:462926] ***    and potentially your MPI job)
[fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
[fsc07:1158342] *** reported by process [2779774977,0]
[fsc07:1158342] *** on communicator MPI_COMM_WORLD
[fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
[fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc07:1158342] ***    and potentially your MPI job)
[1629952748.688500] [fsc07:1158342:0] sock.c:244 UCX ERROR connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused
The IP address is that of node fsc08; the program is being run from fsc07.
I see the orted process running on fsc08 for both hello and
spawn_example, with the same arguments. I also tried turning on
various debug options, but I'm none the wiser.
If I run the spawn example with 23 children it works fine, because
they are all on fsc07.
Any idea what might be wrong?
Cheers,
Franco