Franco,

I am surprised UCX gets selected since there is no InfiniBand network.
There used to be a bug that led to UCX being selected on shm/tcp-only
systems, but it has been fixed. You might want to try the latest Open MPI
releases (4.0.6 or 4.1.1).

Meanwhile, try
mpirun --mca pml ^ucx ...
and see if that helps.
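
If excluding the UCX PML on the command line fixes it, the same selection
can be applied without editing every mpirun invocation. A sketch, assuming
the standard Open MPI MCA mechanisms (environment variable prefix OMPI_MCA_
and the per-user mca-params.conf file):

```shell
# Equivalent to passing --mca pml ^ucx on every mpirun command line:
export OMPI_MCA_pml=^ucx

# Or persist the exclusion in the per-user MCA parameter file,
# which mpirun reads at startup:
mkdir -p "$HOME/.openmpi"
echo "pml = ^ucx" >> "$HOME/.openmpi/mca-params.conf"
```

The "^" prefix means "exclude these components" rather than "use only
these", so the ob1 PML over shm/tcp BTLs remains available for selection.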


Cheers,

Gilles

On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users <
users@lists.open-mpi.org> wrote:

> Hi,
>
> I have 2 example progs that I found on the internet (attached) that
> illustrate a problem we are having launching multiple node jobs with
> OpenMPI-4.0.5 and MPI_spawn
>
> CentOS Linux release 8.4.2105
> openmpi-4.0.5-3.el8.x86_64
> Slurm 20.11.8
>
> 10Gbit ethernet network, no IB or other networks
>
> I allocate 2 nodes, each with 24 cores. They are identical systems with a
> shared NFS root.
>
> salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24
>
> Running the hello prog with OpenMPI 4.0.5
>
> /usr/lib64/openmpi/bin/mpirun --version
> mpirun (Open MPI) 4.0.5
>
> */usr/lib64/openmpi/bin/mpirun /home/franco/hello*
>
> MPI_Init(): 307.434000
> hello, world (rank 0 of 48 fsc07)
> ...
> MPI_Init(): 264.714000
> hello, world (rank 47 of 48 fsc08)
>
> All well and good.
>
> Now running the MPI_spawn example prog with OpenMPI 4.0.1
>
> */library/mpi/openmpi-4.0.1/bin/mpirun -c 1 /home/franco/spawn_example 47*
>
> I'm the parent on fsc07
> Starting 47 children
>
> I'm the spawned.
> hello, world (rank 0 of 47 fsc07)
> Received 999 err 0 (rank 0 of 47 fsc07)
> I'm the spawned.
> hello, world (rank 1 of 47 fsc07)
> Received 999 err 0 (rank 1 of 47 fsc07)
> ....
> I'm the spawned.
> hello, world (rank 45 of 47 fsc08)
> Received 999 err 0 (rank 45 of 47 fsc08)
> I'm the spawned.
> hello, world (rank 46 of 47 fsc08)
> Received 999 err 0 (rank 46 of 47 fsc08)
>
> Works fine.
>
> Now rebuild spawn_example with 4.0.5 and run as before
>
> ldd /home/franco/spawn_example | grep openmpi
>         libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40
> (0x00007fc2c0655000)
>         libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40
> (0x00007fc2bfdb6000)
>         libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40
> (0x00007fc2bfb08000)
>
> /usr/lib64/openmpi/bin/mpirun --version
> mpirun (Open MPI) 4.0.5
>
> */usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47*
>
> I'm the parent on fsc07
>
> Starting 47 children
>
>
> [fsc08:463361] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
> [fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
> ....
> [fsc08:462917] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
> [fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [fsc08:462926] *** An error occurred in MPI_Init
> [fsc08:462926] *** reported by process [2779774978,42]
> [fsc08:462926] *** on a NULL communicator
> [fsc08:462926] *** Unknown error
> [fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [fsc08:462926] ***    and potentially your MPI job)
> [fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
> [fsc07:1158342] *** reported by process [2779774977,0]
> [fsc07:1158342] *** on communicator MPI_COMM_WORLD
> [fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
> [fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [fsc07:1158342] ***    and potentially your MPI job)
> [1629952748.688500] [fsc07:1158342:0]           sock.c:244  UCX  ERROR connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused
>
>
> The IP address is that of node fsc08; the program is being run from fsc07.
>
> I see the orted process running on fsc08 for both hello and spawn_example
> with the same arguments. I also tried turning on various debug options but
> I'm none the wiser.
>
> If I run the spawn example with 23 children it works fine - because they
> are all on fsc07.
>
> Any idea what might be wrong?
>
> Cheers,
> Franco
>
>
>
