Jeff,

Hmm.  Maybe I had insufficient error checking in our installation process.

Can you still run make and make install after configure fails?  Perhaps that's
how I somehow ended up with an installation despite the configure failure.
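If it helps, here is the kind of guard I have in mind for our install script --
just a sketch, with the rest of our configure flags elided:

    ./configure --with-hwloc=external ... || exit 1   # bail out if configure fails
    make || exit 1
    make install

That way make and make install can never run on top of a failed configure.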

-- bennet




On Fri, Jun 8, 2018 at 11:32 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> Hmm.  I'm confused -- can we clarify?
>
> I just tried configuring Open MPI v3.1.0 on a RHEL 7.4 system with the
> RHEL hwloc RPM installed, but *not* the hwloc-devel RPM.  Hence, no hwloc.h
> (for example).
>
> When specifying an external hwloc, configure did fail, as expected:
>
> -----
> $ ./configure --with-hwloc=external ...
> ...
>
> +++ Configuring MCA framework hwloc
> checking for no configure components in framework hwloc...
> checking for m4 configure components in framework hwloc... external,
> hwloc1117
>
> --- MCA component hwloc:external (m4 configuration macro, priority 90)
> checking for MCA component hwloc:external compile mode... static
> checking --with-hwloc-libdir value... simple ok (unspecified value)
> checking looking for external hwloc in... (default search paths)
> checking hwloc.h usability... no
> checking hwloc.h presence... no
> checking for hwloc.h... no
> checking if MCA component hwloc:external can compile... no
> configure: WARNING: MCA component "external" failed to configure properly
> configure: WARNING: This component was selected as the default
> configure: error: Cannot continue
> $
> ---
>
> Are you seeing something different?
>
>
>
> > On Jun 8, 2018, at 11:16 AM, r...@open-mpi.org wrote:
> >
> >
> >
> >> On Jun 8, 2018, at 8:10 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>
> >> Further testing shows that it was the failure to find the hwloc-devel
> files that seems to be the cause of the failure.  I compiled and ran
> without the additional configure flags, and it still seems to work.
> >>
> >> I think it issued a two-line warning about this.  Is that something
> that should result in an error if --with-hwloc=external is specified but
> not found?  Just a thought.
> >
> > Yes - that is a bug in our configury. It should have immediately error’d
> out.
> >
> >>
> >> My immediate problem is solved. Thanks very much Ralph and Artem for
> your help!
> >>
> >> -- bennet
> >>
> >>
> >> On Thu, Jun 7, 2018 at 11:06 AM r...@open-mpi.org <r...@open-mpi.org>
> wrote:
> >> Odd - Artem, do you have any suggestions?
> >>
> >> > On Jun 7, 2018, at 7:41 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >> >
> >> > Thanks, Ralph,
> >> >
> >> > I just tried it with
> >> >
> >> >    srun --mpi=pmix_v2 ./test_mpi
> >> >
> >> > and got these messages
> >> >
> >> >
> >> > srun: Step created for job 89
> >> > [cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
> >> > client/pmix_client.c at line 234
> >> > [cav02.arc-ts.umich.edu:92286] OPAL ERROR: Error in file
> >> > pmix2x_client.c at line 109
> >> > [cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
> >> > client/pmix_client.c at line 234
> >> > [cav02.arc-ts.umich.edu:92287] OPAL ERROR: Error in file
> >> > pmix2x_client.c at line 109
> >> >
> --------------------------------------------------------------------------
> >> > The application appears to have been direct launched using "srun",
> >> > but OMPI was not built with SLURM's PMI support and therefore cannot
> >> > execute. There are several options for building PMI support under
> >> > SLURM, depending upon the SLURM version you are using:
> >> >
> >> >  version 16.05 or later: you can use SLURM's PMIx support. This
> >> >  requires that you configure and build SLURM --with-pmix.
> >> >
> >> >  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> >> >  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> >> >  install PMI-2. You must then build Open MPI using --with-pmi pointing
> >> >  to the SLURM PMI library location.
> >> >
> >> > Please configure as appropriate and try again.
> >> >
> --------------------------------------------------------------------------
> >> >
> >> >
> >> > Just to be complete, I checked the library path,
> >> >
> >> >
> >> > $ ldconfig -p | egrep 'slurm|pmix'
> >> >    libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
> >> >    libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
> >> >    libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
> >> >    libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
> >> >    libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
> >> >    libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so
> >> >
> >> >
> >> > and libpmi* does appear there.
> >> >
> >> >
> >> > I also tried explicitly listing the slurm directory from the slurm
> >> > library installation in LD_LIBRARY_PATH, just in case it wasn't
> >> > traversing correctly.  that is, both
> >> >
> >> > $ echo $LD_LIBRARY_PATH
> >> >
> /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
> >> >
> >> > and
> >> >
> >> > $ echo $LD_LIBRARY_PATH
> >> >
> /opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
> >> >
> >> >
> >> > I don't have a saved build log, but I can rebuild this and save the
> >> > build logs, in case any information in those logs would help.
> >> >
> >> > I will also mention that we have, in the past, used the
> >> > --disable-dlopen and --enable-shared flags, which we did not use here.
> >> > Just in case that makes any difference.
> >> >
> >> > -- bennet
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Jun 7, 2018 at 10:01 AM, r...@open-mpi.org <r...@open-mpi.org>
> wrote:
> >> >> I think you need to set your MPIDefault to pmix_v2 since you are
> using a PMIx v2 library
> >> >>
> >> >>
> >> >>> On Jun 7, 2018, at 6:25 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >> >>>
> >> >>> Hi, Ralph,
> >> >>>
> >> >>> Thanks for the reply, and sorry for the missing information.  I hope
> >> >>> this fills in the picture better.
> >> >>>
> >> >>> $ srun --version
> >> >>> slurm 17.11.7
> >> >>>
> >> >>> $ srun --mpi=list
> >> >>> srun: MPI types are...
> >> >>> srun: pmix_v2
> >> >>> srun: openmpi
> >> >>> srun: none
> >> >>> srun: pmi2
> >> >>> srun: pmix
> >> >>>
> >> >>> We have pmix configured as the default in /opt/slurm/etc/slurm.conf
> >> >>>
> >> >>>   MpiDefault=pmix
> >> >>>
> >> >>> and on the x86_64 system configured the same way, a bare 'srun
> >> >>> ./test_mpi' is sufficient and runs.
> >> >>>
> >> >>> I have tried all of the following srun variations with no joy
> >> >>>
> >> >>>
> >> >>> srun ./test_mpi
> >> >>> srun --mpi=pmix ./test_mpi
> >> >>> srun --mpi=pmi2 ./test_mpi
> >> >>> srun --mpi=openmpi ./test_mpi
> >> >>>
> >> >>>
> >> >>> I believe we are using the spec files that come with both pmix and
> >> >>> with slurm, and the following to build the .rpm files used at
> >> >>> installation
> >> >>>
> >> >>>
> >> >>> $ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
> >> >>>   -ba pmix-2.0.2.spec
> >> >>>
> >> >>> $ rpmbuild --define '_prefix /opt/slurm' \
> >> >>>   --define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
> >> >>>   -ta slurm-17.11.7.tar.bz2
> >> >>>
> >> >>>
> >> >>> I did use the '--with-pmix=/opt/pmix/2.0.2' option when building
> OpenMPI.
> >> >>>
> >> >>>
> >> >>> In case it helps, we have these libraries on the aarch64 in
> >> >>> /opt/slurm/lib64/slurm/mpi*
> >> >>>
> >> >>> -rwxr-xr-x 1 root root 257288 May 30 15:27
> /opt/slurm/lib64/slurm/mpi_none.so
> >> >>> -rwxr-xr-x 1 root root 257240 May 30 15:27
> /opt/slurm/lib64/slurm/mpi_openmpi.so
> >> >>> -rwxr-xr-x 1 root root 668808 May 30 15:27
> /opt/slurm/lib64/slurm/mpi_pmi2.so
> >> >>> lrwxrwxrwx 1 root root     16 Jun  1 08:38
> >> >>> /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
> >> >>> -rwxr-xr-x 1 root root 841312 May 30 15:27
> /opt/slurm/lib64/slurm/mpi_pmix_v2.so
> >> >>>
> >> >>> and on the x86_64, where it runs, we have a comparable list,
> >> >>>
> >> >>> -rwxr-xr-x 1 root root 193192 May 30 15:20
> /opt/slurm/lib64/slurm/mpi_none.so
> >> >>> -rwxr-xr-x 1 root root 193192 May 30 15:20
> /opt/slurm/lib64/slurm/mpi_openmpi.so
> >> >>> -rwxr-xr-x 1 root root 622848 May 30 15:20
> /opt/slurm/lib64/slurm/mpi_pmi2.so
> >> >>> lrwxrwxrwx 1 root root     16 Jun  1 08:32
> >> >>> /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
> >> >>> -rwxr-xr-x 1 root root 828232 May 30 15:20
> /opt/slurm/lib64/slurm/mpi_pmix_v2.so
> >> >>>
> >> >>>
> >> >>> Let me know if anything else would be helpful.
> >> >>>
> >> >>> Thanks,    -- bennet
> >> >>>
> >> >>> On Thu, Jun 7, 2018 at 8:56 AM, r...@open-mpi.org <r...@open-mpi.org>
> wrote:
> >> >>>> You didn’t show your srun direct launch cmd line or what version
> of Slurm is being used (and how it was configured), so I can only provide
> some advice. If you want to use PMIx, then you have to do two things:
> >> >>>>
> >> >>>> 1. Slurm must be configured to use PMIx - depending on the
> version, that might be there by default in the rpm
> >> >>>>
> >> >>>> 2. you have to tell srun to use the pmix plugin (IIRC you add
> --mpi pmix to the cmd line - you should check that)
> >> >>>>
> >> >>>> If your intent was to use Slurm’s PMI-1 or PMI-2, then you need to
> configure OMPI --with-pmi=<path-to-those-libraries>
> >> >>>>
> >> >>>> Ralph
> >> >>>>
> >> >>>>
> >> >>>>> On Jun 7, 2018, at 5:21 AM, Bennet Fauber <ben...@umich.edu>
> wrote:
> >> >>>>>
> >> >>>>> We are trying out MPI on an aarch64 cluster.
> >> >>>>>
> >> >>>>> Our system administrators installed SLURM and PMIx 2.0.2 from
> .rpm.
> >> >>>>>
> >> >>>>> I compiled OpenMPI using the ARM distributed gcc/7.1.0 using the
> >> >>>>> configure flags shown in this snippet from the top of config.log
> >> >>>>>
> >> >>>>> It was created by Open MPI configure 3.1.0, which was
> >> >>>>> generated by GNU Autoconf 2.69.  Invocation command line was
> >> >>>>>
> >> >>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
> >> >>>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
> >> >>>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >> >>>>> --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
> >> >>>>>
> >> >>>>> ## --------- ##
> >> >>>>> ## Platform. ##
> >> >>>>> ## --------- ##
> >> >>>>>
> >> >>>>> hostname = cavium-hpc.arc-ts.umich.edu
> >> >>>>> uname -m = aarch64
> >> >>>>> uname -r = 4.11.0-45.4.1.el7a.aarch64
> >> >>>>> uname -s = Linux
> >> >>>>> uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
> >> >>>>>
> >> >>>>> /usr/bin/uname -p = aarch64
> >> >>>>>
> >> >>>>>
> >> >>>>> It checks for pmi and reports it found,
> >> >>>>>
> >> >>>>>
> >> >>>>> configure:12680: checking if user requested external PMIx
> >> >>>>> support(/opt/pmix/2.0.2)
> >> >>>>> configure:12690: result: yes
> >> >>>>> configure:12701: checking --with-external-pmix value
> >> >>>>> configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
> >> >>>>> configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
> >> >>>>> configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
> >> >>>>> configure:12794: checking PMIx version
> >> >>>>> configure:12804: result: version file found
> >> >>>>>
> >> >>>>>
> >> >>>>> It fails on the test for PMIx 3, which is expected, but then
> reports
> >> >>>>>
> >> >>>>>
> >> >>>>> configure:12843: checking version 2x
> >> >>>>> configure:12861: gcc -E -I/opt/pmix/2.0.2/include  conftest.c
> >> >>>>> configure:12861: $? = 0
> >> >>>>> configure:12862: result: found
> >> >>>>>
> >> >>>>>
> >> >>>>> I have a small test MPI program, and it runs fine when launched
> >> >>>>> with mpirun.  The processes running on the first node of a
> >> >>>>> two-node job are
> >> >>>>>
> >> >>>>>
> >> >>>>> [bennet@cav02 ~]$ ps -ef | grep bennet | egrep 'test_mpi|srun'
> >> >>>>>
> >> >>>>> bennet   20340 20282  0 08:04 ?        00:00:00 mpirun ./test_mpi
> >> >>>>>
> >> >>>>> bennet   20346 20340  0 08:04 ?        00:00:00 srun
> >> >>>>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
> >> >>>>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca
> ess_base_jobid
> >> >>>>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2"
> -mca
> >> >>>>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
> >> >>>>> "3609657344.0;tcp://10.242.15.36:58681"
> >> >>>>>
> >> >>>>> bennet   20347 20346  0 08:04 ?        00:00:00 srun
> >> >>>>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
> >> >>>>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca
> ess_base_jobid
> >> >>>>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2"
> -mca
> >> >>>>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
> >> >>>>> "3609657344.0;tcp://10.242.15.36:58681"
> >> >>>>>
> >> >>>>> bennet   20352 20340 98 08:04 ?        00:01:50 ./test_mpi
> >> >>>>>
> >> >>>>> bennet   20353 20340 98 08:04 ?        00:01:50 ./test_mpi
> >> >>>>>
> >> >>>>>
> >> >>>>> However, when I run it using srun directly, I get the following
> output:
> >> >>>>>
> >> >>>>>
> >> >>>>> srun: Step created for job 87
> >> >>>>> [cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in
> file
> >> >>>>> pmix2x_client.c at line 109
> >> >>>>>
> --------------------------------------------------------------------------
> >> >>>>> The application appears to have been direct launched using "srun",
> >> >>>>> but OMPI was not built with SLURM's PMI support and therefore
> cannot
> >> >>>>> execute. There are several options for building PMI support under
> >> >>>>> SLURM, depending upon the SLURM version you are using:
> >> >>>>>
> >> >>>>> version 16.05 or later: you can use SLURM's PMIx support. This
> >> >>>>> requires that you configure and build SLURM --with-pmix.
> >> >>>>>
> >> >>>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
> >> >>>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
> >> >>>>> install PMI-2. You must then build Open MPI using --with-pmi
> pointing
> >> >>>>> to the SLURM PMI library location.
> >> >>>>>
> >> >>>>> Please configure as appropriate and try again.
> >> >>>>>
> --------------------------------------------------------------------------
> >> >>>>> *** An error occurred in MPI_Init
> >> >>>>> *** on a NULL communicator
> >> >>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> >> >>>>> ***    and potentially your MPI job)
> >> >>>>> [cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT
> completed
> >> >>>>> completed successfully, but am not able to aggregate error
> messages,
> >> >>>>> and not able to guarantee that all other processes were killed!
> >> >>>>>
> >> >>>>>
> >> >>>>> Using the same scheme to set this up on x86_64 worked, and I am
> taking
> >> >>>>> installation parameters, test files, and job parameters from the
> >> >>>>> working x86_64 installation.
> >> >>>>>
> >> >>>>> Other than the architecture, the main difference between the two
> >> >>>>> clusters is that the aarch64 has only ethernet networking, whereas
> >> >>>>> there is infiniband on the x86_64 cluster.  I removed the
> --with-verbs
> >> >>>>> from the configure line, though, and I thought that would be
> >> >>>>> sufficient.
> >> >>>>>
> >> >>>>> Anyone have suggestions what might be wrong, how to fix it, or for
> >> >>>>> further diagnostics?
> >> >>>>>
> >> >>>>> Thank you,    -- bennet
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
