Hi, Ralph,

Thanks for the reply, and sorry for the missing information. I hope this
fills in the picture better.

$ srun --version
slurm 17.11.7

$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

We have pmix configured as the default in /opt/slurm/etc/slurm.conf:

    MpiDefault=pmix

and on the x86_64 system, which is configured the same way, a bare
'srun ./test_mpi' is sufficient and runs. On the aarch64 cluster I have
tried all of the following srun variations with no joy:

    srun ./test_mpi
    srun --mpi=pmix ./test_mpi
    srun --mpi=pmi2 ./test_mpi
    srun --mpi=openmpi ./test_mpi
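In case the exact reproducer matters: since the failure is already in
MPI_Init (see the quoted error below), a minimal init/rank/finalize program
should behave the same way as my test_mpi. The file name and source here are
illustrative rather than the literal program, but an equivalent reproducer
can be rebuilt roughly like this:

$ cat > test_mpi.c <<'EOF'
/* minimal MPI program: initialize, report rank and size, finalize */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

$ mpicc test_mpi.c -o test_mpi    # mpicc from the Open MPI 3.1.0 build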
I believe we are using the spec files that come with pmix and with slurm,
and the following commands to build the .rpm files used at installation:

$ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
    -ba pmix-2.0.2.spec

$ rpmbuild --define '_prefix /opt/slurm' \
    --define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
    -ta slurm-17.11.7.tar.bz2

I did use the '--with-pmix=/opt/pmix/2.0.2' option when building OpenMPI.

In case it helps, we have these libraries on the aarch64 cluster in
/opt/slurm/lib64/slurm/mpi*:

-rwxr-xr-x 1 root root 257288 May 30 15:27 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 257240 May 30 15:27 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 668808 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root     16 Jun  1 08:38 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 841312 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmix_v2.so

and on the x86_64 cluster, where it runs, we have a comparable list:

-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 622848 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root     16 Jun  1 08:32 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 828232 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
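If it would help, I can also send the output of the following checks from
both clusters; they should show which PMIx library the Slurm plugin is
actually linked against, and what PMIx support ompi_info reports for the
Open MPI build (paths taken from the listings above):

# which libpmix the Slurm PMIx plugin resolves at run time
$ ldd /opt/slurm/lib64/slurm/mpi_pmix_v2.so | grep -i pmix

# what PMIx support the Open MPI 3.1.0 install reports
$ ompi_info | grep -i pmix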
Let me know if anything else would be helpful.

Thanks,
-- bennet


On Thu, Jun 7, 2018 at 8:56 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> You didn’t show your srun direct launch cmd line or what version of Slurm is
> being used (and how it was configured), so I can only provide some advice.
> If you want to use PMIx, then you have to do two things:
>
> 1. Slurm must be configured to use PMIx - depending on the version, that
> might be there by default in the rpm
>
> 2. you have to tell srun to use the pmix plugin (IIRC you add --mpi pmix to
> the cmd line - you should check that)
>
> If your intent was to use Slurm’s PMI-1 or PMI-2, then you need to configure
> OMPI --with-pmi=<path-to-those-libraries>
>
> Ralph
>
>
>> On Jun 7, 2018, at 5:21 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>
>> We are trying out MPI on an aarch64 cluster.
>>
>> Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
>>
>> I compiled OpenMPI with the ARM-distributed gcc/7.1.0, using the
>> configure flags shown in this snippet from the top of config.log:
>>
>> It was created by Open MPI configure 3.1.0, which was
>> generated by GNU Autoconf 2.69. Invocation command line was
>>
>>   $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
>>   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
>>   --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>>   --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
>>
>> ## --------- ##
>> ## Platform. ##
>> ## --------- ##
>>
>> hostname = cavium-hpc.arc-ts.umich.edu
>> uname -m = aarch64
>> uname -r = 4.11.0-45.4.1.el7a.aarch64
>> uname -s = Linux
>> uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
>>
>> /usr/bin/uname -p = aarch64
>>
>> It checks for PMIx and reports it found:
>>
>> configure:12680: checking if user requested external PMIx support(/opt/pmix/2.0.2)
>> configure:12690: result: yes
>> configure:12701: checking --with-external-pmix value
>> configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
>> configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
>> configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
>> configure:12794: checking PMIx version
>> configure:12804: result: version file found
>>
>> It fails on the test for PMIx 3, which is expected, but then reports:
>>
>> configure:12843: checking version 2x
>> configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
>> configure:12861: $? = 0
>> configure:12862: result: found
>>
>> I have a small test MPI program, and it runs when launched with mpirun.
>> The processes running on the first node of a two-node job are
>>
>> [bennet@cav02 ~]$ ps -ef | grep bennet | egrep 'test_mpi|srun'
>> bennet 20340 20282  0 08:04 ?  00:00:00 mpirun ./test_mpi
>> bennet 20346 20340  0 08:04 ?  00:00:00 srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri "3609657344.0;tcp://10.242.15.36:58681"
>> bennet 20347 20346  0 08:04 ?  00:00:00 srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1 --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri "3609657344.0;tcp://10.242.15.36:58681"
>> bennet 20352 20340 98 08:04 ?  00:01:50 ./test_mpi
>> bennet 20353 20340 98 08:04 ?  00:01:50 ./test_mpi
>>
>> However, when I run it using srun directly, I get the following output:
>>
>> srun: Step created for job 87
>> [cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file pmix2x_client.c at line 109
>> --------------------------------------------------------------------------
>> The application appears to have been direct launched using "srun",
>> but OMPI was not built with SLURM's PMI support and therefore cannot
>> execute. There are several options for building PMI support under
>> SLURM, depending upon the SLURM version you are using:
>>
>>   version 16.05 or later: you can use SLURM's PMIx support. This
>>   requires that you configure and build SLURM --with-pmix.
>>
>>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>   PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>   install PMI-2. You must then build Open MPI using --with-pmi pointing
>>   to the SLURM PMI library location.
>>
>> Please configure as appropriate and try again.
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT completed completed
>> successfully, but am not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>>
>> Using the same scheme to set this up on x86_64 worked, and I am taking
>> installation parameters, test files, and job parameters from the working
>> x86_64 installation.
>>
>> Other than the architecture, the main difference between the two clusters
>> is that the aarch64 cluster has only Ethernet networking, whereas the
>> x86_64 cluster also has InfiniBand. I removed --with-verbs from the
>> configure line, though, and I thought that would be sufficient.
>>
>> Does anyone have suggestions about what might be wrong, how to fix it, or
>> what further diagnostics to try?
>>
>> Thank you,
>> -- bennet

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users