Hi Todd,

You may want to ask UCX itself what's going wrong. See if setting this environment variable
provides more info:

export UCX_LOG_LEVEL=debug
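
Something like this (just a sketch -- substitute your own binary and rank count) should capture the UCX debug output from a failing two-rank, single-node run:

export UCX_LOG_LEVEL=debug
# UCX debug logging is very verbose, so capture stderr as well
mpirun -np 2 ./your_mpi_program 2>&1 | tee ucx-debug.log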

Have you tried to run the UCX smoke tests?

https://github.com/openucx/ucx?tab=readme-ov-file#running-internal-unit-tests
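
As a quick sanity check alongside those tests (just a sketch using the standard ucx_info utility), it may also be telling to compare what UCX detects inside and outside of a Slurm allocation:

ucx_info -d    # transports and devices UCX sees on the node
ucx_info -c    # the effective UCX configuration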

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of "Merritt, Todd R - (tmerritt) via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Thursday, August 29, 2024 at 7:57 AM
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Cc: "Merritt, Todd R - (tmerritt)" <tmerr...@arizona.edu>
Subject: [EXTERNAL] [OMPI users] PML issue with openmpi5 and ucx

I am having a devil of a time tracking down the cause of this error, and the
debugging output from mpirun is not helpful to my mortal eyes, so I'm reaching
out to the community here for some help. I've built openmpi5 with PMIx and UCX
support. I'm running on a Slurm cluster with RoCE. Under Slurm, I can launch
multinode jobs running one core per node and they run fine. I cannot, however,
run more than two processes on a single node. When I try, I get:

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[r6u25n2:00000] *** An error occurred in MPI_Init
[r6u25n2:00000] *** reported by process [2702508033,0]
[r6u25n2:00000] *** on a NULL communicator
[r6u25n2:00000] *** Unknown error
[r6u25n2:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r6u25n2:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 1 with PID 0 on node r6u25n2 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
If I run the same single-node, multi-core job outside of Slurm, it runs fine.
Any pointers on where to look into the PML add procs issue?

Thanks,
Todd
