Hello,

I am trying to use UCX with Slurm/PMIx and am running into the error below.
The following mpirun command works, but the srun equivalent I was hoping to
use fails.  Is there some flag or configuration I might be missing for Slurm?

Works fine:
mpirun -n 100 --host apcpu-004:88,apcpu-005:88 --mca pml ucx --mca osc ucx ./hello

Does not work:
srun -n 100 ./hello
slurmstepd: error: apcpu-004 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: apcpu-004 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to apcpu-005 (1)
slurmstepd: error: apcpu-004 [0] pmixp_server.c:731 [_process_extended_hdr] mpi/pmix: ERROR: Unable to connect to 1
slurmstepd: error: *** STEP 50.0 ON apcpu-004 CANCELLED AT 2019-06-17T13:30:11 ***

The configure options used for PMIx, Open MPI, Slurm, and UCX are the
following (on Debian 8):
pmix 3.1.2
./configure --prefix=/opt/apps/gcc-7_4/pmix/3.1.2

openmpi 4.0.1
./configure --prefix=/opt/apps/gcc-7_4/openmpi/4.0.1 \
    --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 \
    --with-libfabric=/opt/apps/gcc-7_4/libfabric/1.7.2 \
    --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1 --with-libevent=external \
    --disable-dlopen --without-verbs

slurm 19.05.0
./configure --enable-debug --enable-x11 \
    --with-pmix=/opt/apps/gcc-7_4/pmix/3.1.2 --sysconfdir=/etc/slurm \
    --prefix=/opt/apps/slurm/19.05.0 --with-ucx=/opt/apps/gcc-7_4/ucx/1.5.1

ucx 1.5.1
./configure --enable-optimizations --disable-logging --disable-debug \
    --disable-assertions --disable-params-check --prefix=/opt/apps/gcc-7_4/ucx/1.5.1

Any advice is much appreciated.

Best,

-Dean
