Indeed ...

I am not 100% sure the two errors are unrelated, but anyway:


That example passes with Open MPI 4.0.1 and 4.0.6, and crashes with the versions in between.

It also passes with the 4.1 and master branches.


Bottom line: upgrade Open MPI to the latest version and you should be fine.
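If upgrading the system packages is not practical, building a recent Open MPI into your home directory is enough to test with; roughly something like this (the version, paths and file names below are only placeholders):

tar xf openmpi-4.1.1.tar.bz2
cd openmpi-4.1.1
./configure --prefix=$HOME/ompi-4.1.1
make -j 8 install
$HOME/ompi-4.1.1/bin/mpicc spawn_example.c -o spawn_example
$HOME/ompi-4.1.1/bin/mpirun -c 1 ./spawn_example 47

Since your nodes share an NFS root, the same prefix will be visible on both of them.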



Cheers,


Gilles

On 8/26/2021 2:42 PM, Broi, Franco via users wrote:

Thanks Gilles but no go...

/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml ^ucx /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

  Process 1 ([[48649,2],32]) is on host: fsc08
  Process 2 ([[48649,1],0]) is on host: unknown!
  BTLs attempted: vader tcp self

Your MPI job is now going to abort; sorry.

[fsc08:465159] [[45369,2],27] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493

On Thu, 2021-08-26 at 14:30 +0900, Gilles Gouaillardet via users wrote:
Franco,

I am surprised UCX gets selected since there is no InfiniBand network.
There used to be a bug that led UCX to be selected on shm/tcp-only systems, but it has been fixed. You might want to give the latest versions of Open MPI
(4.0.6 or 4.1.1) a try.

Meanwhile, try
mpirun --mca pml ^ucx ...
and see if that helps.
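If you want to double check which PML ends up being selected, you can also bump the PML framework verbosity (pml_base_verbose is a standard MCA parameter; 10 is just an arbitrary verbosity level), for example:

mpirun --mca pml ^ucx --mca pml_base_verbose 10 ...

and look for the component reported as selected on each node (most likely ob1 on a shm/tcp system).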


Cheers,

Gilles

On Thu, Aug 26, 2021 at 2:13 PM Broi, Franco via users <users@lists.open-mpi.org> wrote:
Hi,

I have 2 example progs that I found on the internet (attached) that illustrate a problem we are having launching multiple-node jobs with Open MPI 4.0.5 and MPI_Comm_spawn.
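For context, the spawn example is the usual kind of self-spawning MPI_Comm_spawn test (a single binary that spawns copies of itself); a minimal sketch of that pattern, not the actual attached program, would be something like:

/*
 * Minimal self-spawning MPI_Comm_spawn sketch (hypothetical, for illustration).
 * Build: mpicc spawn_sketch.c -o spawn_sketch
 * Run:   mpirun -c 1 ./spawn_sketch 47
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    char host[MPI_MAX_PROCESSOR_NAME];
    int rank, size, len, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Get_processor_name(host, &len);

    if (parent == MPI_COMM_NULL) {
        /* Parent: spawn n copies of this very executable. */
        int n = (argc > 1) ? atoi(argv[1]) : 1;
        printf("I'm the parent on %s\n", host);
        printf("Starting %d children\n", n);
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, n, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        /* Send a token to each child over the intercommunicator. */
        value = 999;
        for (int i = 0; i < n; i++)
            MPI_Send(&value, 1, MPI_INT, i, 0, intercomm);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: say hello, then receive the token from the parent. */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("I'm the spawned.\n");
        printf("hello, world (rank %d of %d %s)\n", rank, size, host);
        MPI_Recv(&value, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
        printf("Received %d (rank %d of %d %s)\n", value, rank, size, host);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}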

CentOS Linux release 8.4.2105
openmpi-4.0.5-3.el8.x86_64
Slurm 20.11.8

10Gbit ethernet network, no IB or other networks

I allocate 2 nodes, each with 24 cores. They are identical systems with a shared NFS root.

salloc -p fsc -w fsc07,fsc08 --ntasks-per-node=24

Running the hello prog with OpenMPI 4.0.5

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun /home/franco/hello

MPI_Init(): 307.434000
hello, world (rank 0 of 48 fsc07)
...
MPI_Init(): 264.714000
hello, world (rank 47 of 48 fsc08)

All well and good.

Now running the MPI_spawn example prog with OpenMPI 4.0.1

/library/mpi/openmpi-4.0.1//bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

I'm the spawned.
hello, world (rank 0 of 47 fsc07)
Received 999 err 0 (rank 0 of 47 fsc07)
I'm the spawned.
hello, world (rank 1 of 47 fsc07)
Received 999 err 0 (rank 1 of 47 fsc07)
....
I'm the spawned.
hello, world (rank 45 of 47 fsc08)
Received 999 err 0 (rank 45 of 47 fsc08)
I'm the spawned.
hello, world (rank 46 of 47 fsc08)
Received 999 err 0 (rank 46 of 47 fsc08)

Works fine.

Now rebuild spawn_example with 4.0.5 and run as before

ldd /home/franco/spawn_example | grep openmpi
        libmpi.so.40 => /usr/lib64/openmpi/lib/libmpi.so.40 (0x00007fc2c0655000)
        libopen-rte.so.40 => /usr/lib64/openmpi/lib/libopen-rte.so.40 (0x00007fc2bfdb6000)
        libopen-pal.so.40 => /usr/lib64/openmpi/lib/libopen-pal.so.40 (0x00007fc2bfb08000)

/usr/lib64/openmpi/bin/mpirun --version
mpirun (Open MPI) 4.0.5

/usr/lib64/openmpi/bin/mpirun -c 1 /home/franco/spawn_example 47

I'm the parent on fsc07
Starting 47 children

[fsc08:463361] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:463361] [[42596,2],32] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
....
[fsc08:462917] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[fsc08:462917] [[42416,2],33] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493

   ompi_dpm_dyn_init() failed
   --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[fsc08:462926] *** An error occurred in MPI_Init
[fsc08:462926] *** reported by process [2779774978,42]
[fsc08:462926] *** on a NULL communicator
[fsc08:462926] *** Unknown error
[fsc08:462926] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc08:462926] ***    and potentially your MPI job)
[fsc07:1158342] *** An error occurred in MPI_Comm_spawn_multiple
[fsc07:1158342] *** reported by process [2779774977,0]
[fsc07:1158342] *** on communicator MPI_COMM_WORLD
[fsc07:1158342] *** MPI_ERR_OTHER: known error not in list
[fsc07:1158342] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fsc07:1158342] ***    and potentially your MPI job)
[1629952748.688500] [fsc07:1158342:0]           sock.c:244  UCX  ERROR connect(fd=64, dest_addr=10.220.6.239:38471) failed: Connection refused

The IP address is that of node fsc08; the program is being run from fsc07.

I see the orted process running on fsc08 for both hello and spawn_example, with the same arguments. I also tried turning on various debug options, but I'm none the wiser.
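For reference, the debug options in question are the standard MCA verbosity parameters, e.g.:

/usr/lib64/openmpi/bin/mpirun -c 1 --mca pml_base_verbose 10 --mca btl_base_verbose 10 /home/franco/spawn_example 47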

If I run the spawn example with 23 children it works fine - because they are all on fsc07.

Any idea what might be wrong?

Cheers,
Franco

