Odd - Artem, do you have any suggestions?

> On Jun 7, 2018, at 7:41 AM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> Thanks, Ralph,
> 
> I just tried it with
> 
>    srun --mpi=pmix_v2 ./test_mpi
> 
> and got these messages
> 
> 
> srun: Step created for job 89
> [cav02.arc-ts.umich.edu:92286] PMIX ERROR: OUT-OF-RESOURCE in file
> client/pmix_client.c at line 234
> [cav02.arc-ts.umich.edu:92286] OPAL ERROR: Error in file
> pmix2x_client.c at line 109
> [cav02.arc-ts.umich.edu:92287] PMIX ERROR: OUT-OF-RESOURCE in file
> client/pmix_client.c at line 234
> [cav02.arc-ts.umich.edu:92287] OPAL ERROR: Error in file
> pmix2x_client.c at line 109
> --------------------------------------------------------------------------
> The application appears to have been direct launched using "srun",
> but OMPI was not built with SLURM's PMI support and therefore cannot
> execute. There are several options for building PMI support under
> SLURM, depending upon the SLURM version you are using:
> 
>  version 16.05 or later: you can use SLURM's PMIx support. This
>  requires that you configure and build SLURM --with-pmix.
> 
>  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>  install PMI-2. You must then build Open MPI using --with-pmi pointing
>  to the SLURM PMI library location.
> 
> Please configure as appropriate and try again.
> --------------------------------------------------------------------------
> 
> 
> Just to be complete, I checked the library path,
> 
> 
> $ ldconfig -p | egrep 'slurm|pmix'
>    libpmi2.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so.1
>    libpmi2.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi2.so
>    libpmix.so.2 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so.2
>    libpmix.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmix.so
>    libpmi.so.1 (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so.1
>    libpmi.so (libc6,AArch64) => /opt/pmix/2.0.2/lib/libpmi.so
> 
> 
> and libpmi* does appear there.
> 
> 
> I also tried explicitly listing the slurm plugin directory from the
> slurm installation (and the PMIx lib directory) in LD_LIBRARY_PATH,
> just in case it wasn't being traversed correctly.  That is, both
> 
> $ echo $LD_LIBRARY_PATH
> /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
> 
> and
> 
> $ echo $LD_LIBRARY_PATH
> /opt/slurm/lib64/slurm:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/sw/arcts/centos7/hpc-utils/lib
> 
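> One further check that might be worth adding here (just a sketch of
> the commands, I have not captured their output) is to see which
> libpmix the Slurm plugin and Open MPI's pmix component actually
> resolve to at run time:
> 
> $ ldd /opt/slurm/lib64/slurm/mpi_pmix_v2.so | grep -i pmi
> $ ldd /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/lib/openmpi/mca_pmix_pmix2x.so | grep -i pmi
> 
> (The mca_pmix_pmix2x.so name is my guess based on the pmix2x errors
> above; the exact component file under lib/openmpi may differ.)  Both
> should resolve to /opt/pmix/2.0.2/lib if things are wired up as
> intended.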
> 
> I don't have a saved build log, but I can rebuild this and save the
> build logs, in case any information in those logs would help.
> 
> I will also mention that, in the past, we have used the
> --disable-dlopen and --enable-shared flags, which we did not use
> here, in case that makes any difference.
> 
> -- bennet
> 
> On Thu, Jun 7, 2018 at 10:01 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> I think you need to set your MpiDefault to pmix_v2 since you are using a 
>> PMIx v2 library
>> 
>> 
>>> On Jun 7, 2018, at 6:25 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>> 
>>> Hi, Ralph,
>>> 
>>> Thanks for the reply, and sorry for the missing information.  I hope
>>> this fills in the picture better.
>>> 
>>> $ srun --version
>>> slurm 17.11.7
>>> 
>>> $ srun --mpi=list
>>> srun: MPI types are...
>>> srun: pmix_v2
>>> srun: openmpi
>>> srun: none
>>> srun: pmi2
>>> srun: pmix
>>> 
>>> We have pmix configured as the default in /opt/slurm/etc/slurm.conf
>>> 
>>>   MpiDefault=pmix
>>> 
>>> and on the x86_64 system configured the same way, a bare 'srun
>>> ./test_mpi' is sufficient and runs.
>>> 
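>>> (A further sanity check, sketched here without its output: scontrol
>>> should report what the running slurmctld actually has for the default
>>> MPI plugin.)
>>> 
>>> $ scontrol show config | grep -i MpiDefault
>>> 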
>>> I have tried all of the following srun variations with no joy
>>> 
>>> 
>>> srun ./test_mpi
>>> srun --mpi=pmix ./test_mpi
>>> srun --mpi=pmi2 ./test_mpi
>>> srun --mpi=openmpi ./test_mpi
>>> 
>>> 
>>> I believe we are using the spec files that come with pmix and with
>>> slurm, and the following commands to build the .rpm files used for
>>> installation
>>> 
>>> 
>>> $ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
>>>   -ba pmix-2.0.2.spec
>>> 
>>> $ rpmbuild --define '_prefix /opt/slurm' \
>>>   --define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
>>>   -ta slurm-17.11.7.tar.bz2
>>> 
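>>> (As a sanity check on those builds, sketched here and not verified,
>>> the exact sub-package that carries the plugin may differ: the rpms
>>> and the installed file can be queried for the pmix plugin.)
>>> 
>>> $ rpm -qlp slurm-*.rpm | grep mpi_pmix
>>> $ rpm -qf /opt/slurm/lib64/slurm/mpi_pmix_v2.so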
>>> 
>>> I did use the '--with-pmix=/opt/pmix/2.0.2' option when building OpenMPI.
>>> 
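>>> (Another check I can run, sketched here without its output: ompi_info
>>> from that install should list the pmix component that was built in,
>>> which would confirm whether the external /opt/pmix/2.0.2 support made
>>> it into OpenMPI.)
>>> 
>>> $ /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/bin/ompi_info | grep -i pmix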
>>> 
>>> In case it helps, we have these libraries on the aarch64 in
>>> /opt/slurm/lib64/slurm/mpi*
>>> 
>>> -rwxr-xr-x 1 root root 257288 May 30 15:27 
>>> /opt/slurm/lib64/slurm/mpi_none.so
>>> -rwxr-xr-x 1 root root 257240 May 30 15:27 
>>> /opt/slurm/lib64/slurm/mpi_openmpi.so
>>> -rwxr-xr-x 1 root root 668808 May 30 15:27 
>>> /opt/slurm/lib64/slurm/mpi_pmi2.so
>>> lrwxrwxrwx 1 root root     16 Jun  1 08:38
>>> /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
>>> -rwxr-xr-x 1 root root 841312 May 30 15:27 
>>> /opt/slurm/lib64/slurm/mpi_pmix_v2.so
>>> 
>>> and on the x86_64, where it runs, we have a comparable list,
>>> 
>>> -rwxr-xr-x 1 root root 193192 May 30 15:20 
>>> /opt/slurm/lib64/slurm/mpi_none.so
>>> -rwxr-xr-x 1 root root 193192 May 30 15:20 
>>> /opt/slurm/lib64/slurm/mpi_openmpi.so
>>> -rwxr-xr-x 1 root root 622848 May 30 15:20 
>>> /opt/slurm/lib64/slurm/mpi_pmi2.so
>>> lrwxrwxrwx 1 root root     16 Jun  1 08:32
>>> /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
>>> -rwxr-xr-x 1 root root 828232 May 30 15:20 
>>> /opt/slurm/lib64/slurm/mpi_pmix_v2.so
>>> 
>>> 
>>> Let me know if anything else would be helpful.
>>> 
>>> Thanks,    -- bennet
>>> 
>>> On Thu, Jun 7, 2018 at 8:56 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>> You didn’t show your srun direct launch cmd line or what version of Slurm 
>>>> is being used (and how it was configured), so I can only provide some 
>>>> advice. If you want to use PMIx, then you have to do two things:
>>>> 
>>>> 1. Slurm must be configured to use PMIx - depending on the version, that 
>>>> might be there by default in the rpm
>>>> 
>>>> 2. you have to tell srun to use the pmix plugin (IIRC you add --mpi=pmix
>>>> to the cmd line - you should check that)
>>>> 
>>>> If your intent was to use Slurm’s PMI-1 or PMI-2, then you need to 
>>>> configure OMPI --with-pmi=<path-to-those-libraries>
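>>>> 
>>>> Roughly, the two routes look like this (the paths are placeholders
>>>> for wherever those libraries live on your system, not a recipe I
>>>> have verified there):
>>>> 
>>>> # Route 1: Slurm 16.05+ built --with-pmix; build OMPI against the same PMIx
>>>> ./configure --with-slurm --with-pmix=/opt/pmix/2.0.2 ...
>>>> 
>>>> # Route 2: use Slurm's PMI-1/PMI-2 libraries instead
>>>> ./configure --with-slurm --with-pmi=/path/to/slurm/pmi ...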
>>>> 
>>>> Ralph
>>>> 
>>>> 
>>>>> On Jun 7, 2018, at 5:21 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>>> 
>>>>> We are trying out MPI on an aarch64 cluster.
>>>>> 
>>>>> Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
>>>>> 
>>>>> I compiled OpenMPI with the ARM-distributed gcc/7.1.0, using the
>>>>> configure flags shown in this snippet from the top of config.log
>>>>> 
>>>>> It was created by Open MPI configure 3.1.0, which was
>>>>> generated by GNU Autoconf 2.69.  Invocation command line was
>>>>> 
>>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
>>>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
>>>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>>>>> --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
>>>>> 
>>>>> ## --------- ##
>>>>> ## Platform. ##
>>>>> ## --------- ##
>>>>> 
>>>>> hostname = cavium-hpc.arc-ts.umich.edu
>>>>> uname -m = aarch64
>>>>> uname -r = 4.11.0-45.4.1.el7a.aarch64
>>>>> uname -s = Linux
>>>>> uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
>>>>> 
>>>>> /usr/bin/uname -p = aarch64
>>>>> 
>>>>> 
>>>>> It checks for PMIx and reports that it was found,
>>>>> 
>>>>> 
>>>>> configure:12680: checking if user requested external PMIx
>>>>> support(/opt/pmix/2.0.2)
>>>>> configure:12690: result: yes
>>>>> configure:12701: checking --with-external-pmix value
>>>>> configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
>>>>> configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
>>>>> configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
>>>>> configure:12794: checking PMIx version
>>>>> configure:12804: result: version file found
>>>>> 
>>>>> 
>>>>> It fails on the test for PMIx 3, which is expected, but then reports
>>>>> 
>>>>> 
>>>>> configure:12843: checking version 2x
>>>>> configure:12861: gcc -E -I/opt/pmix/2.0.2/include  conftest.c
>>>>> configure:12861: $? = 0
>>>>> configure:12862: result: found
>>>>> 
>>>>> 
>>>>> I have a small test MPI program, and it runs correctly when launched
>>>>> with mpirun.  The processes running on the first node of a two-node
>>>>> job are
>>>>> 
>>>>> 
>>>>> [bennet@cav02 ~]$ ps -ef | grep bennet | egrep 'test_mpi|srun'
>>>>> 
>>>>> bennet   20340 20282  0 08:04 ?        00:00:00 mpirun ./test_mpi
>>>>> 
>>>>> bennet   20346 20340  0 08:04 ?        00:00:00 srun
>>>>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
>>>>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
>>>>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
>>>>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
>>>>> "3609657344.0;tcp://10.242.15.36:58681"
>>>>> 
>>>>> bennet   20347 20346  0 08:04 ?        00:00:00 srun
>>>>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
>>>>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
>>>>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
>>>>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
>>>>> "3609657344.0;tcp://10.242.15.36:58681"
>>>>> 
>>>>> bennet   20352 20340 98 08:04 ?        00:01:50 ./test_mpi
>>>>> 
>>>>> bennet   20353 20340 98 08:04 ?        00:01:50 ./test_mpi
>>>>> 
>>>>> 
>>>>> However, when I run it using srun directly, I get the following output:
>>>>> 
>>>>> 
>>>>> srun: Step created for job 87
>>>>> [cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file
>>>>> pmix2x_client.c at line 109
>>>>> --------------------------------------------------------------------------
>>>>> The application appears to have been direct launched using "srun",
>>>>> but OMPI was not built with SLURM's PMI support and therefore cannot
>>>>> execute. There are several options for building PMI support under
>>>>> SLURM, depending upon the SLURM version you are using:
>>>>> 
>>>>> version 16.05 or later: you can use SLURM's PMIx support. This
>>>>> requires that you configure and build SLURM --with-pmix.
>>>>> 
>>>>> Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>>>> PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>>>> install PMI-2. You must then build Open MPI using --with-pmi pointing
>>>>> to the SLURM PMI library location.
>>>>> 
>>>>> Please configure as appropriate and try again.
>>>>> --------------------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init
>>>>> *** on a NULL communicator
>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> ***    and potentially your MPI job)
>>>>> [cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT completed
>>>>> completed successfully, but am not able to aggregate error messages,
>>>>> and not able to guarantee that all other processes were killed!
>>>>> 
>>>>> 
>>>>> Using the same scheme to set this up on x86_64 worked, and I am taking
>>>>> installation parameters, test files, and job parameters from the
>>>>> working x86_64 installation.
>>>>> 
>>>>> Other than the architecture, the main difference between the two
>>>>> clusters is that the aarch64 has only ethernet networking, whereas
>>>>> there is infiniband on the x86_64 cluster.  I removed the --with-verbs
>>>>> option from the configure line, though, and I thought that would be
>>>>> sufficient.
>>>>> 
>>>>> Does anyone have suggestions about what might be wrong, how to fix it,
>>>>> or further diagnostics to try?
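>>>>> 
>>>>> One thing I can try next (just a guess at useful knobs, I have not
>>>>> run this yet) is a direct launch with the PMIx framework verbosity
>>>>> turned up, to see where the client initialization goes wrong:
>>>>> 
>>>>> $ OMPI_MCA_pmix_base_verbose=10 srun --mpi=pmix ./test_mpi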
>>>>> 
>>>>> Thank you,    -- bennet

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
