Jeff,

Hmm.  Maybe I had insufficient error checking in our installation process.

Can you run make and make install after configure fails?  I somehow got an
installation despite the configure status, perhaps?

-- bennet

On Fri, Jun 8, 2018 at 11:32 AM Jeff Squyres (jsquyres) via users
<users@lists.open-mpi.org> wrote:
> Hmm.  I'm confused -- can we clarify?
>
> I just tried configuring Open MPI v3.1.0 on a RHEL 7.4 system with the
> RHEL hwloc RPM installed, but *not* the hwloc-devel RPM.  Hence, no
> hwloc.h (for example).
>
> When specifying an external hwloc, configure did fail, as expected:
>
> -----
> $ ./configure --with-hwloc=external ...
> ...
>
> +++ Configuring MCA framework hwloc
> checking for no configure components in framework hwloc...
> checking for m4 configure components in framework hwloc... external, hwloc1117
>
> --- MCA component hwloc:external (m4 configuration macro, priority 90)
> checking for MCA component hwloc:external compile mode... static
> checking --with-hwloc-libdir value... simple ok (unspecified value)
> checking looking for external hwloc in... (default search paths)
> checking hwloc.h usability... no
> checking hwloc.h presence... no
> checking for hwloc.h... no
> checking if MCA component hwloc:external can compile... no
> configure: WARNING: MCA component "external" failed to configure properly
> configure: WARNING: This component was selected as the default
> configure: error: Cannot continue
> $
> -----
>
> Are you seeing something different?
>
>
> > On Jun 8, 2018, at 11:16 AM, r...@open-mpi.org wrote:
> >
> >
> >> On Jun 8, 2018, at 8:10 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>
> >> Further testing shows that it was the failure to find the hwloc-devel
> >> files that seems to be the cause of the failure.  I compiled and ran
> >> without the additional configure flags, and it still seems to work.
> >>
> >> I think it issued a two-line warning about this.  Is that something
> >> that should result in an error if --with-hwloc=external is specified
> >> but not found?  Just a thought.
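As a side note, the probe that configure runs can be reproduced by hand before
configuring.  This is only a sketch: HWLOC_PREFIX=/usr is the typical RHEL
location for the hwloc-devel headers and is an assumption here; adjust it for
your install.

```shell
# Reproduce configure's "checking for hwloc.h" probe by hand.
# HWLOC_PREFIX is an assumed example path; point it at your hwloc install.
HWLOC_PREFIX=/usr
if [ -f "$HWLOC_PREFIX/include/hwloc.h" ]; then
    echo "hwloc.h: present"
else
    echo "hwloc.h: missing (install hwloc-devel or fix the prefix)"
fi
```

If the second branch fires, --with-hwloc=external has nothing to find, which
is the situation Jeff's transcript shows.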
>
> > Yes - that is a bug in our configury.  It should have immediately
> > error'd out.
> >
> >>
> >> My immediate problem is solved.  Thanks very much, Ralph and Artem,
> >> for your help!
> >>
> >> -- bennet
> >>
> >>
> >> On Thu, Jun 7, 2018 at 11:06 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >> Odd - Artem, do you have any suggestions?
> >>
> >> > On Jun 7, 2018, at 7:41 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >> >
> >> > Thanks, Ralph,
> >> >
> >> > I just tried it with
> >> >
> >> >     srun --mpi=pmix_v2 ./test_mpi
> >> >
> >> > and got these messages:
> >> >
> >> > srun: Step created for job 89
> >> > [cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
> >> > client/pmix_client.c at line 234
> >> > [cav02.arc-ts.umich.edu:92286] OPAL ERROR: Error in file
> >> > pmix2x_client.c at line 109
> >> > [cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
> >> > client/pmix_client.c at line 234
> >> > [cav02.arc-ts.umich.edu:92287] OPAL ERROR: Error in file
> >> > pmix2x_client.c at line 109
> >> > --------------------------------------------------------------------------
> >> > The application appears to have been direct launched using "srun",
> >> > but OMPI was not built with SLURM's PMI support and therefore cannot
> >> > execute.  There are several options for building PMI support under
> >> > SLURM, depending upon the SLURM version you are using:
> >> >
> >> >   version 16.05 or later: you can use SLURM's PMIx support.  This
> >> >   requires that you configure and build SLURM --with-pmix.
> >> >
> >> >   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> >> >   PMI-2 support.  SLURM builds PMI-1 by default, or you can manually
> >> >   install PMI-2.  You must then build Open MPI using --with-pmi pointing
> >> >   to the SLURM PMI library location.
> >> >
> >> > Please configure as appropriate and try again.
> >> > --------------------------------------------------------------------------
> >> >
> >> > Just to be complete, I checked the library path,
> >> >
> >> > $ ldconfig -p | egrep 'slurm|pmix'
> >> >         libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
> >> >         libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
> >> >         libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
> >> >         libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
> >> >         libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
> >> >         libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so
> >> >
> >> > and libpmi* does appear there.
> >> >
> >> > I also tried explicitly listing the slurm directory from the slurm
> >> > library installation in LD_LIBRARY_PATH, just in case it wasn't
> >> > traversing correctly; that is, both
> >> >
> >> > $ echo $LD_LIBRARY_PATH
> >> > /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
> >> >
> >> > and
> >> >
> >> > $ echo $LD_LIBRARY_PATH
> >> > /opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
> >> >
> >> > I don't have a saved build log, but I can rebuild this and save the
> >> > build logs, in case any information in those logs would help.
> >> >
> >> > I will also mention that we have, in the past, used the
> >> > --disable-dlopen and --enable-shared flags, which we did not use
> >> > here, just in case that makes any difference.
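That linker-cache check can also be scripted.  A small sketch, run here
against a few sample lines in the style of the listing above rather than a
live `ldconfig -p` (on a real system, pipe `ldconfig -p` into the grep
instead):

```shell
# Filter dynamic-linker cache entries for slurm/pmix libraries.
# The here-doc holds sample ldconfig -p style output (paths as in the
# listing above); substitute `ldconfig -p |` on a live system.
cat <<'EOF' | grep -E 'slurm|pmix'
        libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
        libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
        libc.so.6 (libc6,AArch64) => /lib64/libc.so.6
EOF
```

Only the two libpmix lines survive the filter, confirming the PMIx
libraries are resolvable to the loader.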
> >> >
> >> > -- bennet
> >> >
> >> >
> >> > On Thu, Jun 7, 2018 at 10:01 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> >> >> I think you need to set your MpiDefault to pmix_v2, since you are
> >> >> using a PMIx v2 library.
> >> >>
> >> >>
> >> >>> On Jun 7, 2018, at 6:25 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >> >>>
> >> >>> Hi, Ralph,
> >> >>>
> >> >>> Thanks for the reply, and sorry for the missing information.  I hope
> >> >>> this fills in the picture better.
> >> >>>
> >> >>> $ srun --version
> >> >>> slurm 17.11.7
> >> >>>
> >> >>> $ srun --mpi=list
> >> >>> srun: MPI types are...
> >> >>> srun: pmix_v2
> >> >>> srun: openmpi
> >> >>> srun: none
> >> >>> srun: pmi2
> >> >>> srun: pmix
> >> >>>
> >> >>> We have pmix configured as the default in /opt/slurm/etc/slurm.conf
> >> >>>
> >> >>>     MpiDefault=pmix
> >> >>>
> >> >>> and on the x86_64 system configured the same way, a bare 'srun
> >> >>> ./test_mpi' is sufficient and runs.
> >> >>>
> >> >>> I have tried all of the following srun variations with no joy:
> >> >>>
> >> >>>     srun ./test_mpi
> >> >>>     srun --mpi=pmix ./test_mpi
> >> >>>     srun --mpi=pmi2 ./test_mpi
> >> >>>     srun --mpi=openmpi ./test_mpi
> >> >>>
> >> >>> I believe we are using the spec files that come with both pmix and
> >> >>> with slurm, and the following to build the .rpm files used at
> >> >>> installation:
> >> >>>
> >> >>>     $ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
> >> >>>         -ba pmix-2.0.2.spec
> >> >>>
> >> >>>     $ rpmbuild --define '_prefix /opt/slurm' \
> >> >>>         --define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
> >> >>>         -ta slurm-17.11.7.tar.bz2
> >> >>>
> >> >>> I did use the '--with-pmix=/opt/pmix/2.0.2' option when building
> >> >>> Open MPI.
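A minimal sketch of the slurm.conf change Ralph suggests, with the plugin
name taken from the `srun --mpi=list` output above (the path is this
installation's; adjust for yours):

```
# /opt/slurm/etc/slurm.conf (fragment)
# Select the PMIx v2 plugin explicitly, matching the installed
# mpi_pmix_v2.so, rather than the generic "pmix" alias.
MpiDefault=pmix_v2
```

A changed MpiDefault takes effect for new job steps after the Slurm daemons
re-read their configuration.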
> >> >>>
> >> >>> In case it helps, we have these libraries on the aarch64 in
> >> >>> /opt/slurm/lib64/slurm/mpi*
> >> >>>
> >> >>> -rwxr-xr-x 1 root root 257288 May 30 15:27 /opt/slurm/lib64/slurm/mpi_none.so
> >> >>> -rwxr-xr-x 1 root root 257240 May 30 15:27 /opt/slurm/lib64/slurm/mpi_openmpi.so
> >> >>> -rwxr-xr-x 1 root root 668808 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmi2.so
> >> >>> lrwxrwxrwx 1 root root     16 Jun  1 08:38 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
> >> >>> -rwxr-xr-x 1 root root 841312 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
> >> >>>
> >> >>> and on the x86_64, where it runs, we have a comparable list,
> >> >>>
> >> >>> -rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_none.so
> >> >>> -rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_openmpi.so
> >> >>> -rwxr-xr-x 1 root root 622848 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmi2.so
> >> >>> lrwxrwxrwx 1 root root     16 Jun  1 08:32 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
> >> >>> -rwxr-xr-x 1 root root 828232 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
> >> >>>
> >> >>> Let me know if anything else would be helpful.
> >> >>>
> >> >>> Thanks,
> >> >>> -- bennet
> >> >>>
> >> >>> On Thu, Jun 7, 2018 at 8:56 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> >> >>>> You didn't show your srun direct-launch command line or what version
> >> >>>> of Slurm is being used (and how it was configured), so I can only
> >> >>>> provide some advice.  If you want to use PMIx, then you have to do
> >> >>>> two things:
> >> >>>>
> >> >>>> 1. Slurm must be configured to use PMIx - depending on the version,
> >> >>>>    that might be there by default in the rpm.
> >> >>>>
> >> >>>> 2.
> >> >>>> you have to tell srun to use the pmix plugin (IIRC you add
> >> >>>>    --mpi pmix to the command line - you should check that).
> >> >>>>
> >> >>>> If your intent was to use Slurm's PMI-1 or PMI-2, then you need to
> >> >>>> configure OMPI --with-pmi=<path-to-those-libraries>
> >> >>>>
> >> >>>> Ralph
> >> >>>>
> >> >>>>
> >> >>>>> On Jun 7, 2018, at 5:21 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >> >>>>>
> >> >>>>> We are trying out MPI on an aarch64 cluster.
> >> >>>>>
> >> >>>>> Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
> >> >>>>>
> >> >>>>> I compiled Open MPI using the ARM-distributed gcc/7.1.0 with the
> >> >>>>> configure flags shown in this snippet from the top of config.log:
> >> >>>>>
> >> >>>>> It was created by Open MPI configure 3.1.0, which was
> >> >>>>> generated by GNU Autoconf 2.69.  Invocation command line was
> >> >>>>>
> >> >>>>>   $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
> >> >>>>>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
> >> >>>>>     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >> >>>>>     --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
> >> >>>>>
> >> >>>>> ## --------- ##
> >> >>>>> ## Platform.
> >> >>>>> ##
> >> >>>>> ## --------- ##
> >> >>>>>
> >> >>>>> hostname = cavium-hpc.arc-ts.umich.edu
> >> >>>>> uname -m = aarch64
> >> >>>>> uname -r = 4.11.0-45.4.1.el7a.aarch64
> >> >>>>> uname -s = Linux
> >> >>>>> uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
> >> >>>>>
> >> >>>>> /usr/bin/uname -p = aarch64
> >> >>>>>
> >> >>>>> It checks for PMIx and reports it found:
> >> >>>>>
> >> >>>>> configure:12680: checking if user requested external PMIx
> >> >>>>> support(/opt/pmix/2.0.2)
> >> >>>>> configure:12690: result: yes
> >> >>>>> configure:12701: checking --with-external-pmix value
> >> >>>>> configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
> >> >>>>> configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
> >> >>>>> configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
> >> >>>>> configure:12794: checking PMIx version
> >> >>>>> configure:12804: result: version file found
> >> >>>>>
> >> >>>>> It fails on the test for PMIx 3, which is expected, but then reports
> >> >>>>>
> >> >>>>> configure:12843: checking version 2x
> >> >>>>> configure:12861: gcc -E -I/opt/pmix/2.0.2/include conftest.c
> >> >>>>> configure:12861: $? = 0
> >> >>>>> configure:12862: result: found
> >> >>>>>
> >> >>>>> I have a small test MPI program, and it runs when launched with
> >> >>>>> mpirun.  The processes running on the first node of a two-node
> >> >>>>> job are
> >> >>>>>
> >> >>>>> [bennet@cav02 ~]$ ps -ef | grep bennet | egrep 'test_mpi|srun'
> >> >>>>>
> >> >>>>> bennet 20340 20282  0 08:04 ?  00:00:00 mpirun ./test_mpi
> >> >>>>>
> >> >>>>> bennet 20346 20340  0 08:04 ?
> >> >>>>> 00:00:00 srun
> >> >>>>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
> >> >>>>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
> >> >>>>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
> >> >>>>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
> >> >>>>> "3609657344.0;tcp://10.242.15.36:58681"
> >> >>>>>
> >> >>>>> bennet 20347 20346  0 08:04 ?  00:00:00 srun
> >> >>>>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
> >> >>>>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
> >> >>>>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
> >> >>>>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
> >> >>>>> "3609657344.0;tcp://10.242.15.36:58681"
> >> >>>>>
> >> >>>>> bennet 20352 20340 98 08:04 ?  00:01:50 ./test_mpi
> >> >>>>>
> >> >>>>> bennet 20353 20340 98 08:04 ?  00:01:50 ./test_mpi
> >> >>>>>
> >> >>>>> However, when I run it using srun directly, I get the following output:
> >> >>>>>
> >> >>>>> srun: Step created for job 87
> >> >>>>> [cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file
> >> >>>>> pmix2x_client.c at line 109
> >> >>>>> --------------------------------------------------------------------------
> >> >>>>> The application appears to have been direct launched using "srun",
> >> >>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
> >> >>>>> execute.  There are several options for building PMI support under
> >> >>>>> SLURM, depending upon the SLURM version you are using:
> >> >>>>>
> >> >>>>>   version 16.05 or later: you can use SLURM's PMIx support.  This
> >> >>>>>   requires that you configure and build SLURM --with-pmix.
> >> >>>>>
> >> >>>>>   Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> >> >>>>>   PMI-2 support.  SLURM builds PMI-1 by default, or you can manually
> >> >>>>>   install PMI-2.
> >> >>>>> You must then build Open MPI using --with-pmi pointing
> >> >>>>>   to the SLURM PMI library location.
> >> >>>>>
> >> >>>>> Please configure as appropriate and try again.
> >> >>>>> --------------------------------------------------------------------------
> >> >>>>> *** An error occurred in MPI_Init
> >> >>>>> *** on a NULL communicator
> >> >>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >> >>>>> *** and potentially your MPI job)
> >> >>>>> [cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT completed
> >> >>>>> completed successfully, but am not able to aggregate error messages,
> >> >>>>> and not able to guarantee that all other processes were killed!
> >> >>>>>
> >> >>>>> Using the same scheme to set this up on x86_64 worked, and I am taking
> >> >>>>> installation parameters, test files, and job parameters from the
> >> >>>>> working x86_64 installation.
> >> >>>>>
> >> >>>>> Other than the architecture, the main difference between the two
> >> >>>>> clusters is that the aarch64 cluster has only Ethernet networking,
> >> >>>>> whereas there is InfiniBand on the x86_64 cluster.  I removed
> >> >>>>> --with-verbs from the configure line, though, and I thought that
> >> >>>>> would be sufficient.
> >> >>>>>
> >> >>>>> Does anyone have suggestions about what might be wrong, how to fix
> >> >>>>> it, or further diagnostics to try?
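For completeness, the direct-launch sequence that the advice in this thread
converges on is sketched below.  It assumes a Slurm 17.11 build configured
--with-pmix and an Open MPI built against the same external PMIx, as
described above:

```
# Confirm which MPI plugins this Slurm build offers:
$ srun --mpi=list

# Then direct-launch, naming the PMIx v2 plugin explicitly:
$ srun --mpi=pmix_v2 ./test_mpi
```

If the plugin list does not include pmix_v2, the Slurm rpm was likely built
without PMIx support, and the "OMPI was not built with SLURM's PMI support"
message above is the symptom to expect.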
> >> >>>>>
> >> >>>>> Thank you,
> >> >>>>> -- bennet
> >> >>>>> _______________________________________________
> >> >>>>> users mailing list
> >> >>>>> users@lists.open-mpi.org
> >> >>>>> https://lists.open-mpi.org/mailman/listinfo/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com