Indeed ...

I am not 100% sure the two errors are unrelated, but anyway:


That example passes with Open MPI 4.0.1 and 4.0.6, and crashes with the versions in between.

It also passes with the 4.1 and master branches.


Bottom line: upgrade Open MPI to the latest version and you should be fine.
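If upgrading the system packages is not practical, building a recent Open MPI into your home directory is enough to test with; roughly something like this (the version, paths and file names below are only placeholders):

tar xf openmpi-4.1.1.tar.bz2
cd openmpi-4.1.1
./configure --prefix=$HOME/ompi-4.1.1
make -j 8 install
$HOME/ompi-4.1.1/bin/mpicc spawn_example.c -o spawn_example
$HOME/ompi-4.1.1/bin/mpirun -c 1 ./spawn_example 47

Since your nodes share an NFS root, the same prefix will be visible on both of them.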



Cheers,


Gilles

On 8/26/2021 2:42 PM, Broi, Franco via users wrote:

Thanks Gilles but no go...

/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml ^ucx /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

  Process 1 ([[48649,2],32]) is on host: fsc08
  Process 2 ([[48649,1],0]) is on host: unknown!
  BTLs attempted: vader tcp self

Your MPI job is now going to abort; sorry.

[fsc08:465159] [[45369,2],27] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493

On Thu, 2021-08-26 at 14:30 +0900, Gilles Gouaillardet via users wrote:
Franco,

I am surprised UCX gets selected since there is no InfiniBand network.
There used to be a bug that led UCX to be selected on shm/tcp-only systems, but it has been fixed. You might want to give the latest versions of Open MPI
(4.0.6 or 4.1.1) a try.

Meanwhile, try
mpirun --mca pml ^ucx ...
and see if that helps.
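If you want to double check which PML ends up being selected, you can also bump the PML framework verbosity (pml_base_verbose is a standard MCA parameter; 10 is just an arbitrary verbosity level), for example:

mpirun --mca pml ^ucx --mca pml_base_verbose 10 ...

and look for the component reported as selected on each node (most likely ob1 on a shm/tcp system).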


Cheers,

Gilles

On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users <users@lists.open-mpi.org> wrote:
Hi,

I have 2 example progs that I found on the internet (attached) that illustrate a problem we are having launching multiple-node jobs with Open MPI 4.0.5 and MPI_Comm_spawn.
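For context, the spawn example is the usual kind of self-spawning MPI_Comm_spawn test (a single binary that spawns copies of itself); a minimal sketch of that pattern, not the actual attached program, would be something like:

/*
 * Minimal self-spawning MPI_Comm_spawn sketch (hypothetical, for illustration).
 * Build: mpicc spawn_sketch.c -o spawn_sketch
 * Run:   mpirun -c 1 ./spawn_sketch 47
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    char host[MPI_MAX_PROCESSOR_NAME];
    int rank, size, len, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Get_processor_name(host, &len);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn n copies of this very executable. */
        int n = (argc > 1) ? atoi(argv[1]) : 1;
        printf("I'm the parent on %s\n", host);
        printf("Starting %d children\n", n);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, n, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        /* Send a token to each child over the intercommunicator. */
        value = 999;
        for (int i = 0; i < n; i++)
            MPI_Send(&value, 1, MPI_INT, i, 0, intercomm);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: say hello, then receive the token from the parent. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("I'm the spawned.\n");
        printf("hello, world (rank %d of %d %s)\n", rank, size, host);
        MPI_Recv(&value, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        printf("Received %d (rank %d of %d %s)\n", value, rank, size, host);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}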

CentOS Linux release 8.4.2105
openmpi-4.0.5-3.el8.x86_64
Slurm 20.11.8

10Gbit ethernet network, no IB or other networks

I allocate 2 nodes, each with 24 cores. They are identical systems with a shared NFS root.

salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24

Running the hello prog with OpenMPI 4.0.5

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun /home/franco/hello

MPI_Init(): 307.434000
hello, world (rank 0 of 48 fsc07)
...
MPI_Init(): 264.714000
hello, world (rank 47 of 48 fsc08)

All well and good.

Now running the MPI_spawn example prog with OpenMPI 4.0.1

/library/mpi/openmpi-4.0.1//bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

I'm the spawned.
hello, world (rank 0 of 47 fsc07)
Received 999 err 0 (rank 0 of 47 fsc07)
I'm the spawned.
hello, world (rank 1 of 47 fsc07)
Received 999 err 0 (rank 1 of 47 fsc07)
....
I'm the spawned.
hello, world (rank 45 of 47 fsc08)
Received 999 err 0 (rank 45 of 47 fsc08)
I'm the spawned.
hello, world (rank 46 of 47 fsc08)
Received 999 err 0 (rank 46 of 47 fsc08)

Works fine.

Now rebuild spawn_example with 4.0.5 and run as before

ldd /home/franco/spawn_example | grep openmpi
        libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40 (0x00007fc2c0655000)
        libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007fc2bfdb6000)
        libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007fc2bfb08000)

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

[fsc08:463361] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
....
[fsc08:462917] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493

   ompi_dpm_dyn_init() failed
   --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[fsc08:462926] *** An error occurred in MPI_Init
[fsc08:462926] *** reported by process [2779774978,42]
[fsc08:462926] *** on a NULL communicator
[fsc08:462926] *** Unknown error
[fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc08:462926] ***    and potentially your MPI job)
[fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
[fsc07:1158342] *** reported by process [2779774977,0]
[fsc07:1158342] *** on communicator MPI_COMM_WORLD
[fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
[fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc07:1158342] ***    and potentially your MPI job)
[1629952748.688500] [fsc07:1158342:0]           sock.c:244  UCX  ERROR connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused

The IP address is that of node fsc08; the program is being run from fsc07.

I see the orted process running on fsc08 for both hello and spawn_example, with the same arguments. I also tried turning on various debug options, but I'm none the wiser.
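For reference, the debug options in question are the standard MCA verbosity parameters, e.g.:

/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml_base_verbose 10 --mca btl_base_verbose 10 /home/franco/spawn_example 47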

If I run the spawn example with 23 children it works fine - because they are all on fsc07.

Any idea what might be wrong?

Cheers,
Franco

