Greetings all, I have slurm-17.11.5, pmix-1.2.4, and openmpi-3.0.1 working on several clusters. I find srun handy for things like:
bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1
c7-18 c7-19
size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec

Building was straightforward. I built the dependencies myself since I was using a newer compiler: libevent, pmix, hwloc and related were built against the current compiler, and then slurm + openmpi were built against those shared dependencies.

However, I just tried a newer cluster with ubuntu-18.04, slurm-17.11.5, openmpi-3.1 and pmix-2.1.1. On the slurm side things looked promising. I removed any openmpi and pmix related packages to ensure the packages I built were used. I can post the complete log, but hopefully the lines returned with grep -i pmix are the most helpful:

$ cat config.log | grep -i pmix
  $ ./configure --prefix=/share/apps/slurm-17.11.5/gcc7 --with-pmix=/share/apps/pmix-2.1.1/gcc7
configure:21530: checking for pmix installation
configure:21565: gcc -o conftest -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -I/share/apps/pmix-2.1.1/gcc7/include conftest.c -L/share/apps/pmix-2.1.1/gcc7/lib -lpmix >&5
configure:21596: gcc -E -I/share/apps/pmix-2.1.1/gcc7/include conftest.c
configure:21648: result: /share/apps/pmix-2.1.1/gcc7
| #define HAVE_PMIX 1
config.status:1697: creating src/plugins/mpi/pmix/Makefile
x_ac_cv_pmix_dir=/share/apps/pmix-2.1.1/gcc7
x_ac_cv_pmix_libdir=/share/apps/pmix-2.1.1/gcc7/lib
HAVE_PMIX_FALSE='#'
HAVE_PMIX_TRUE=''
HAVE_PMIX_V1_FALSE=''
HAVE_PMIX_V1_TRUE='#'
HAVE_PMIX_V2_FALSE='#'
HAVE_PMIX_V2_TRUE=''
PMIX_V1_CPPFLAGS=''
PMIX_V1_LDFLAGS=''
PMIX_V2_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
PMIX_V2_LDFLAGS='-Wl,-rpath -Wl,/share/apps/pmix-2.1.1/gcc7/lib -L/share/apps/pmix-2.1.1/gcc7/lib'
#define HAVE_PMIX 1

Looks pretty promising so far.
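For anyone reading the HAVE_PMIX_* variables in the slurm output above: these are automake conditionals, where FOO_TRUE='' means the conditional is enabled and FOO_TRUE='#' means it is disabled (the '#' comments out the matching Makefile lines). So the slurm build has the PMIx v2 plugin enabled and the v1 plugin disabled. A minimal sketch for listing the enabled ones, assuming one VAR='value' assignment per line as config.log writes them; the function name enabled_conditionals is just something I made up for illustration:

```shell
# Print the names of automake conditionals that are enabled,
# i.e. those whose FOO_TRUE substitution is the empty string.
# Reads config.log-style lines (one VAR='value' per line) on stdin.
enabled_conditionals() {
    sed -n "s/^\([A-Za-z0-9_]*\)_TRUE=''\$/\1/p"
}
```

Usage would be something like: grep -i pmix config.log | enabled_conditionals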
Some of the most relevant lines for openmpi-3.1 are:

OPAL_CONFIGURE_CLI=' \'\''--prefix=/share/apps/openmpi-3.1.0/gcc7\'\'' \'\''--with-pmix=/share/apps/pmix-2.1.1/gcc7\'\'' \'\''--with-libevent=external\'\'' \'\''--disable-io-romio\'\'' \'\''--disable-io-ompio\'\'''
opal_pmix_ext1x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
opal_pmix_ext1x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'
opal_pmix_ext1x_LIBS='-lpmix'
opal_pmix_ext2x_CPPFLAGS='-I/share/apps/pmix-2.1.1/gcc7/include'
opal_pmix_ext2x_LDFLAGS='-L/share/apps/pmix-2.1.1/gcc7/lib'
opal_pmix_ext2x_LIBS='-lpmix'
opal_pmix_pmix2x_CPPFLAGS=''
opal_pmix_pmix2x_DEPENDENCIES=''
opal_pmix_pmix2x_LDFLAGS=''
opal_pmix_pmix2x_LIBS=''
pmix_alps_CPPFLAGS=''
pmix_alps_LDFLAGS=''
pmix_alps_LIBS=''
pmix_cray_CPPFLAGS=''
pmix_cray_LDFLAGS=''
pmix_cray_LIBS=''
#define OPAL_PMIX_V1 0

Looks pretty promising, the biggest difference I see between this non-working setup and the working setups is that the working setups have:

#define OPAL_PMIX_V1 1

So when I try to run the above compiled slurm + openmpi-3.1 I get:

bill@demon:~/relay$ srun -N 2 -n 2 -t 1 ./relay 1
[c2-33:02763] OPAL ERROR: Not initialized in file ext2x_client.c at line 109
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[c2-33:02763] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[c2-31:17377] OPAL ERROR: Not initialized in file ext2x_client.c at line 109

Any ideas on how to debug the above? I was trying to use ldd to double check what libraries things were compiled against, but I couldn't find any, even on the working clusters.

It's possible of course that it's entirely an openmpi problem. I'll be investigating and posting there if I can't find a solution.