I am having a devil of a time tracking down the cause of this error, and the
debugging output from mpirun is not helpful to my mortal eyes, so I'm reaching
out to the community here for some help. I've built Open MPI 5 with PMIx and UCX
support. I'm running on a Slurm cluster with RoCE. Under Slurm,
Hi Todd,
You may want to ask UCX what’s going wrong. See if setting this env variable
provides more info:
export UCX_LOG_LEVEL=debug
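If the job is launched through mpirun, the variable also has to reach the ranks on the other nodes. A minimal sketch, assuming a placeholder binary ./my_app and a rank count of 2 (neither comes from the original post); -x is mpirun's option for exporting an environment variable to the launched processes:

```shell
# Turn on verbose UCX logging for this shell and its children.
export UCX_LOG_LEVEL=debug

# ./my_app and the rank count are placeholders, not from the original post.
APP=./my_app
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    # -x forwards the variable to every rank; keep a copy of the output.
    mpirun -np 2 -x UCX_LOG_LEVEL "$APP" 2>&1 | tee ucx_debug.log
else
    echo "mpirun or $APP not found; skipping launch"
fi
```

The debug output is long, so saving it to a file makes it easier to search for the first transport or device error.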
Have you tried to run the UCX smoke tests?
https://github.com/openucx/ucx?tab=readme-ov-file#running-internal-unit-tests
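A quicker check, before building the unit tests, is to ask the installed UCX what it can see on a compute node. A sketch, assuming the ucx_info tool that ships with UCX is on the PATH:

```shell
if command -v ucx_info >/dev/null 2>&1; then
    UCX_INFO_FOUND=yes
    # Print the UCX version and the options it was configured with.
    ucx_info -v
    # List the transports and devices UCX detects on this node.
    ucx_info -d | grep -E 'Transport|Device' || true
else
    UCX_INFO_FOUND=no
    echo "ucx_info not found in PATH"
fi
```

If the RoCE device is missing from that list, the problem is probably in the UCX build or the node setup rather than in Open MPI.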
Howard
From: users on behalf of "Mer
Hi,
PBS version : 2021.1.3
OpenMPI : 5.0.3
I’d like to do a partial oversubscription with MPMD, like this:
1. Request 40 trunks in total
2. First group of 32 trunks with 128 cores per trunk, 1 MPI rank per core, 1
OMP thread per MPI rank
3. Second group of 8 trunks with 128 cores per tru