Thanks Gilles but no go...

/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml ^ucx /home/franco/spawn_example 47
I'm the parent on fsc07
Starting 47 children

Process 1 ([[48649,2],32]) is on host: fsc08
Process 2 ([[48649,1],0]) is on host: unknown!
BTLs attempted: vader tcp self

Your MPI job is now going to abort; sorry.

[fsc08:465159] [[45369,2],27] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493

On Thu, 2021-08-26 at 14:30 +0900, Gilles Gouaillardet via users wrote:

Franco,

I am surprised UCX gets selected since there is no InfiniBand network.
There used to be a bug that led UCX to be selected on shm/tcp systems, but it has been fixed.

You might want to give the latest versions of Open MPI (4.0.6 or 4.1.1) a try.

Meanwhile, try

mpirun --mca pml ^ucx ...

and see if it helps.

Cheers,

Gilles

On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users <users@lists.open-mpi.org> wrote:

Hi,

I have 2 example progs that I found on the internet (attached) that illustrate a problem we are having launching multiple-node jobs with OpenMPI-4.0.5 and MPI_Comm_spawn.

CentOS Linux release 8.4.2105
openmpi-4.0.5-3.el8.x86_64
Slurm 20.11.8
10Gbit ethernet network, no IB or other networks

I allocate 2 nodes, each with 24 cores. They are identical systems with a shared NFS root.

salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24

Running the hello prog with OpenMPI 4.0.5:

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun /home/franco/hello
MPI_Init(): 307.434000
hello, world (rank 0 of 48 fsc07)
...
MPI_Init(): 264.714000
hello, world (rank 47 of 48 fsc08)

All well and good.
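The attached hello prog isn't included in the quoted text; a rough sketch that would produce output like the above, assuming it simply times MPI_Init and prints rank, size and hostname:

/* hello.c - rough sketch of the attached hello prog (not quoted above);
 * assumes it just times MPI_Init and prints rank, size and hostname.
 * Build: mpicc hello.c -o hello */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(int argc, char **argv)
{
    double t0 = now_ms();
    MPI_Init(&argc, &argv);
    double elapsed = now_ms() - t0;

    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* e.g. "MPI_Init(): 307.434000" then "hello, world (rank 0 of 48 fsc07)" */
    printf("MPI_Init(): %f\n", elapsed);
    printf("hello, world (rank %d of %d %s)\n", rank, size, host);

    MPI_Finalize();
    return 0;
}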
Now running the MPI_Comm_spawn example prog with OpenMPI 4.0.1:

/library/mpi/openmpi-4.0.1//bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children
I'm the spawned. hello, world (rank 0 of 47 fsc07)
Received 999 err 0 (rank 0 of 47 fsc07)
I'm the spawned. hello, world (rank 1 of 47 fsc07)
Received 999 err 0 (rank 1 of 47 fsc07)
...
I'm the spawned. hello, world (rank 45 of 47 fsc08)
Received 999 err 0 (rank 45 of 47 fsc08)
I'm the spawned. hello, world (rank 46 of 47 fsc08)
Received 999 err 0 (rank 46 of 47 fsc08)

Works fine.

Now rebuild spawn_example with 4.0.5 and run as before:

ldd /home/franco/spawn_example | grep openmpi
    libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40 (0x00007fc2c0655000)
    libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007fc2bfdb6000)
    libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007fc2bfb08000)

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children
[fsc08:463361] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
...
[fsc08:462917] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493

ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[fsc08:462926] *** An error occurred in MPI_Init
[fsc08:462926] *** reported by process [2779774978,42]
[fsc08:462926] *** on a NULL communicator
[fsc08:462926] *** Unknown error
[fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc08:462926] ***    and potentially your MPI job)
[fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
[fsc07:1158342] *** reported by process [2779774977,0]
[fsc07:1158342] *** on communicator MPI_COMM_WORLD
[fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
[fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc07:1158342] ***    and potentially your MPI job)
[1629952748.688500] [fsc07:1158342:0] sock.c:244 UCX ERROR connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused

The IP address is for node fsc08; the program is being run from fsc07.

I see the orted process running on fsc08 for both hello and spawn_example, with the same arguments. I also tried turning on various debug options but I'm none the wiser.

If I run the spawn example with 23 children it works fine - because they are all on fsc07.

Any idea what might be wrong?

Cheers,
Franco
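The spawn example attachment isn't quoted either. Judging from the output above, it presumably follows the common single-binary pattern: run with no parent it spawns argv[1] copies of itself with MPI_Comm_spawn_multiple and sends each child the value 999; when spawned, it reports in and receives that value over the parent intercommunicator. A rough sketch along those lines (the real attachment may differ):

/* spawn_example.c - rough sketch only; the actual attached code is not
 * visible in this thread, so the structure below is inferred from the
 * quoted output.
 * Build: mpicc spawn_example.c -o spawn_example */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm parent;
    MPI_Get_processor_name(host, &len);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn N children running this same executable
         * (assumes the path in argv[0] resolves on every node). */
        int nchildren = (argc > 1) ? atoi(argv[1]) : 1;
        printf("I'm the parent on %s\n", host);
        printf("Starting %d children\n", nchildren);

        char *cmds[1]     = { argv[0] };
        int maxprocs[1]   = { nchildren };
        MPI_Info infos[1] = { MPI_INFO_NULL };
        int *errcodes     = malloc(nchildren * sizeof(int));
        MPI_Comm intercomm;

        MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                                0, MPI_COMM_WORLD, &intercomm, errcodes);

        /* Hand each child the value 999 over the intercommunicator. */
        int value = 999;
        for (int i = 0; i < nchildren; i++)
            MPI_Send(&value, 1, MPI_INT, i, 0, intercomm);

        free(errcodes);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: say hello, then receive one integer from parent rank 0. */
        int rank, size, value = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("I'm the spawned. hello, world (rank %d of %d %s)\n",
               rank, size, host);

        int err = MPI_Recv(&value, 1, MPI_INT, 0, 0, parent,
                           MPI_STATUS_IGNORE);
        printf("Received %d err %d (rank %d of %d %s)\n",
               value, err, rank, size, host);

        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}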