Hello, I am trying to use UCX with Slurm/PMIx and run into the error below. The following works with mpirun, but what I hoped would be the srun equivalent fails. Is there a flag or configuration I might be missing for Slurm?

Works fine:

  mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello

Does not work:

  srun -n 100 ./hello
  slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
  slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
  slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
  slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***

The configurations for pmix, openmpi, slurm, and ucx are the following (on Debian 8):

pmix 3.1.2
  ./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2

openmpi 4.0.1
  ./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external --disable-dlopen --without-verbs

slurm 19.05.0
  ./configure --enable-debug --enable-x11 --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1

ucx 1.5.1
  ./configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1

Any advice is much appreciated.

Best,
-Dean