Cross-posting to the Slurm, PMIx, and UCX lists.
Running a simple Open MPI (4.0.1) mpi-hello-world via Slurm (19.05.0), built with both PMIx (3.1.2) and UCX (1.5.0) support, results in:
[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=true \
    SLURM_PMIX_DIRECT_CONN=true \
    OMPI_MCA_pml=true \
    OMPI_MCA_btl='^vader,tcp,openib' \
    UCX_NET_DEVICES='mlx4_0:1' \
    SLURM_PMIX_DIRECT_CONN_EARLY=false \
    UCX_TLS=rc,shm \
    srun --export=SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS \
    --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
slurmstepd: error: n1 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: n1 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n2 (1)
slurmstepd: error: n1 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: n2 [1] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: n2 [1] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to n1 (0)
slurmstepd: error: *** STEP 7202.0 ON n1 CANCELLED AT 2019-07-01T13:20:36 ***
slurmstepd: error: n2 [1] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 0
srun: error: n2: task 1: Killed
srun: error: n1: task 0: Killed
However, the following works:
[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false \
    SLURM_PMIX_DIRECT_CONN=true \
    OMPI_MCA_pml=true \
    OMPI_MCA_btl='^vader,tcp,openib' \
    UCX_NET_DEVICES='mlx4_0:1' \
    SLURM_PMIX_DIRECT_CONN_EARLY=false \
    UCX_TLS=rc,shm \
    srun --export=SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS \
    --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
n2: Process 1 out of 2
n1: Process 0 out of 2
[root@n1 ~]# SLURM_PMIX_DIRECT_CONN_UCX=false \
    SLURM_PMIX_DIRECT_CONN=true \
    OMPI_MCA_pml=true \
    OMPI_MCA_btl='^vader,tcp,openib' \
    UCX_NET_DEVICES='mlx4_0:1' \
    SLURM_PMIX_DIRECT_CONN_EARLY=true \
    UCX_TLS=rc,shm \
    srun --export=SLURM_PMIX_DIRECT_CONN_UCX,SLURM_PMIX_DIRECT_CONN,OMPI_MCA_pml,OMPI_MCA_btl,UCX_NET_DEVICES,SLURM_PMIX_DIRECT_CONN_EARLY,UCX_TLS \
    --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello
n2: Process 1 out of 2
n1: Process 0 out of 2
Executing mpirun directly (same environment variables, minus the Slurm-specific ones) works, so UCX itself appears to function correctly.
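For reference, the direct run was along these lines (a sketch, not a verbatim transcript; -np, -H, -x and --mca are stock Open MPI 4.x options):

mpirun -np 2 -H n1,n2 \
    --mca btl '^vader,tcp,openib' \
    -x UCX_NET_DEVICES -x UCX_TLS \
    /data/mpihello/mpihello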
If both SLURM_PMIX_DIRECT_CONN_EARLY=true and SLURM_PMIX_DIRECT_CONN_UCX=true, I instead get collective timeout errors from mellanox/hcoll, and glibc reports memory corruption:
/data/mpihello/mpihello: malloc(): memory corruption (fast)
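In case it matters, one way to take hcoll out of the picture while testing (untested on my end; coll is the standard Open MPI framework name) would be:

# exclude the hcoll collectives component for this run
OMPI_MCA_coll='^hcoll' srun --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello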
Can anyone help with getting PMIx direct connections over UCX working in Slurm?
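If more detail would help, I can rerun the failing case with UCX logging turned up. A sketch (UCX_LOG_LEVEL is a standard UCX variable; whether it reaches the slurmstepd side of the direct connection may depend on slurmd's environment):

UCX_LOG_LEVEL=debug SLURM_PMIX_DIRECT_CONN=true SLURM_PMIX_DIRECT_CONN_UCX=true \
    srun --mpi=pmix -N 2 -n 2 /data/mpihello/mpihello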
Some info about my setup:
UCX version:
[root@n1 ~]# ucx_info -v
# UCT version=1.5.0 revision 02078b9
# configured with: --build=x86_64-redhat-linux-gnu
--host=x86_64-redhat-linux-gnu
--target=x86_64-redhat-linux-gnu --program-prefix=
--prefix=/usr --exec-prefix=/usr --bindir=/usr/bin
--sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share
--includedir=/usr/include --libdir=/usr/lib64
--libexecdir=/usr/libexec --localstatedir=/var
--sharedstatedir=/var/lib --mandir=/usr/share/man
--infodir=/usr/share/info --disable-optimizations
--disable-logging --disable-debug --disable-assertions
--enable-mt --disable-params-check
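In case the transport list is relevant, the devices and transports UCX detects can be dumped with (standard ucx_info flag; the grep is just to trim the output):

ucx_info -d | grep -E 'Transport|Device'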
Mellanox OFED version:
[root@n1 ~]# ofed_info -s
OFED-internal-4.5-1.0.1:
Slurm:
Slurm was built with:
rpmbuild -ta slurm-19.05.0.tar.bz2 --without debug --with ucx
--define '_with_pmix --with-pmix=/usr'
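For completeness, the MPI plugin types Slurm actually loaded can be listed with (standard srun option):

srun --mpi=list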
PMIx:
[root@n1 ~]# pmix_info -c --parsable
config:user:root
config:timestamp:"Mon Mar 25 09:51:04 IST 2019"
config:host:slurm-test
config:cli: '--host=x86_64-redhat-linux-gnu'
'--build=x86_64-redhat-linux-gnu' '--program-prefix='
'--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin'
'--sbindir=/usr/sbin' '--sysconfdir=/etc'
'--datadir=/usr/share' '--includedir=/usr/include'
'--libdir=/usr/lib64' '--libexecdir=/usr/libexec'
'--localstatedir=/var' '--sharedstatedir=/var/lib'
'--mandir=/usr/share/man' '--infodir=/usr/share/info'
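To rule out a PMIx/UCX library mismatch on the Slurm side, the PMIx plugin's linkage could be inspected; the plugin path below is my assumption for a --prefix=/usr build and may differ:

# check which libpmix/libucx the Slurm PMIx plugin resolves to
ldd /usr/lib64/slurm/mpi_pmix.so | grep -Ei 'pmix|ucx'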
Thanks,
--Dani_L.