Cross posting to Slurm, PMIx and UCX lists.
Trying to execute a simple Open MPI (4.0.1) mpi-hello-world via Slurm (19.05.0), compiled with both PMIx (3.1.2) and UCX (1.5.0), results in:
[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
However, the following works:
[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=false UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false SLURM_PMIX_DIRECT_CONN=true OMPI_MCA_pml=true OMPI_MCA_btl='^vader,tcp,openib' UCX_NET_DEVICES='mlx4_0:1' SLURM_PMIX_DIRECT_CONN_EARLY=true UCX_TLS=rc,shm srun --export SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
n2: Process 1 out of 2
Executing mpirun directly (with the same env vars, minus the Slurm ones) works, so UCX itself appears to function correctly.
If both SLURM_PMIX_DIRECT_CONN_EARLY=true and SLURM_PMIX_DIRECT_CONN_UCX=true are set, I get collective timeout errors from mellanox/hcoll, and glibc detects memory corruption:
/data/mpihello/mpihello: malloc(): memory corruption (fast)
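To make the four combinations above easier to reproduce in one pass, they can be scripted. The following is only a minimal sketch, assuming bash; the names DRY_RUN, VARS and run_case are hypothetical and introduced for this example, and all variable values are copied verbatim from the commands above. By default it only prints each srun command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the srun command for each combination of
# SLURM_PMIX_DIRECT_CONN_UCX and SLURM_PMIX_DIRECT_CONN_EARLY.
# DRY_RUN, VARS and run_case are hypothetical names for this example.

# Settings shared by every run, copied verbatim from the commands above.
export SLURM_PMIX_DIRECT_CONN=true
export OMPI_MCA_pml=true
export OMPI_MCA_btl='^vader,tcp,openib'
export UCX_NET_DEVICES='mlx4_0:1'
export UCX_TLS=rc,shm

# Variables passed through to the job step (no spaces inside the list).
VARS=SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS

run_case() {
    # $1 = SLURM_PMIX_DIRECT_CONN_UCX, $2 = SLURM_PMIX_DIRECT_CONN_EARLY
    export SLURM_PMIX_DIRECT_CONN_UCX="$1" SLURM_PMIX_DIRECT_CONN_EARLY="$2"
    echo "== UCX=$1 EARLY=$2 =="
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run: only show the command that would be executed.
        echo srun --export="$VARS" --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
    else
        srun --export="$VARS" --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
    fi
}

for ucx in true false; do
    for early in false true; do
        run_case "$ucx" "$early"
    done
done
```

Setting DRY_RUN=0 in the environment would actually launch the four job steps on the cluster; a failing step does not stop the loop, so the later combinations still run.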
Can anyone help with using the PMIx direct connection with UCX in Slurm?
Some info about my setup:
UCX version: [root@n1 ~]# ucx_info -v
# UCT version=1.5.0 revision 02078b9
Mellanox OFED version: [root@n1 ~]# ofed_info -s
Slurm: slurm was built with:
PMIx: [root@n1 ~]# pmix_info -c --parsable
Thanks,
--Dani_L.