Hi, Ralph,

Thanks for the reply, and sorry for the missing information.  I hope
this fills in the picture better.

$ srun --version
slurm 17.11.7

$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

We have pmix configured as the default in /opt/slurm/etc/slurm.conf

    MpiDefault=pmix

and on the x86_64 system, which is configured the same way, a bare
'srun ./test_mpi' runs as expected.
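
If a check of the live configuration would help, something like

    $ scontrol show config | grep -i MpiDefault

should confirm the value the daemons actually loaded (I am assuming
the scontrol from the same Slurm install is first on the PATH).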

I have tried all of the following srun variations, with no joy:


srun ./test_mpi
srun --mpi=pmix ./test_mpi
srun --mpi=pmi2 ./test_mpi
srun --mpi=openmpi ./test_mpi
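
(One variation I have not yet tried, since --mpi=list reports it as a
separate type, is the explicit v2 name,

    srun --mpi=pmix_v2 ./test_mpi

which I can run as well if that would be a useful data point.)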


I believe we are using the spec files that ship with pmix and with
slurm, and the following commands to build the .rpm files used for
installation:


$ rpmbuild --define '_prefix /opt/pmix/2.0.2' \
    -ba pmix-2.0.2.spec

$ rpmbuild --define '_prefix /opt/slurm' \
    --define '_with-pmix --with-pmix=/opt/pmix/2.0.2' \
    -ta slurm-17.11.7.tar.bz2


I did use the '--with-pmix=/opt/pmix/2.0.2' option when building Open MPI.
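
If it helps, I can also report how the installed Open MPI sees PMIx;
as I understand it,

    $ ompi_info | grep -i pmix

should list the pmix component(s) that were built (I would expect the
external /opt/pmix/2.0.2 build to show up there, but that is an
assumption on my part).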


In case it helps, these are the MPI plugin libraries on the aarch64
system in /opt/slurm/lib64/slurm/mpi*:

-rwxr-xr-x 1 root root 257288 May 30 15:27 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 257240 May 30 15:27 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 668808 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root     16 Jun  1 08:38 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 841312 May 30 15:27 /opt/slurm/lib64/slurm/mpi_pmix_v2.so

and on the x86_64 system, where it runs, we have a comparable list:

-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_none.so
-rwxr-xr-x 1 root root 193192 May 30 15:20 /opt/slurm/lib64/slurm/mpi_openmpi.so
-rwxr-xr-x 1 root root 622848 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmi2.so
lrwxrwxrwx 1 root root     16 Jun  1 08:32 /opt/slurm/lib64/slurm/mpi_pmix.so -> ./mpi_pmix_v2.so
-rwxr-xr-x 1 root root 828232 May 30 15:20 /opt/slurm/lib64/slurm/mpi_pmix_v2.so
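
If it would help narrow things down, I can also check what the pmix
plugin actually links against on each architecture, e.g.

    $ ldd /opt/slurm/lib64/slurm/mpi_pmix_v2.so | grep -i pmix

on the assumption that it should resolve to the libpmix from
/opt/pmix/2.0.2 on both machines.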


Let me know if anything else would be helpful.

Thanks,    -- bennet

On Thu, Jun 7, 2018 at 8:56 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> You didn’t show your srun direct launch cmd line or what version of Slurm is 
> being used (and how it was configured), so I can only provide some advice. If 
> you want to use PMIx, then you have to do two things:
>
> 1. Slurm must be configured to use PMIx - depending on the version, that 
> might be there by default in the rpm
>
> 2. you have to tell srun to use the pmix plugin (IIRC you add --mpi pmix to 
> the cmd line - you should check that)
>
> If your intent was to use Slurm’s PMI-1 or PMI-2, then you need to configure 
> OMPI --with-pmi=<path-to-those-libraries>
>
> Ralph
>
>
>> On Jun 7, 2018, at 5:21 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>
>> We are trying out MPI on an aarch64 cluster.
>>
>> Our system administrators installed SLURM and PMIx 2.0.2 from .rpm.
>>
>> I compiled Open MPI with the ARM-distributed gcc/7.1.0, using the
>> configure flags shown in this snippet from the top of config.log:
>>
>> It was created by Open MPI configure 3.1.0, which was
>> generated by GNU Autoconf 2.69.  Invocation command line was
>>
>>  $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0
>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0/share/man
>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>> --with-hwloc=external --with-slurm CC=gcc CXX=g++ FC=gfortran
>>
>> ## --------- ##
>> ## Platform. ##
>> ## --------- ##
>>
>> hostname = cavium-hpc.arc-ts.umich.edu
>> uname -m = aarch64
>> uname -r = 4.11.0-45.4.1.el7a.aarch64
>> uname -s = Linux
>> uname -v = #1 SMP Fri Feb 2 17:11:57 UTC 2018
>>
>> /usr/bin/uname -p = aarch64
>>
>>
>> The configure script checks for PMIx and reports that it is found:
>>
>>
>> configure:12680: checking if user requested external PMIx
>> support(/opt/pmix/2.0.2)
>> configure:12690: result: yes
>> configure:12701: checking --with-external-pmix value
>> configure:12725: result: sanity check ok (/opt/pmix/2.0.2/include)
>> configure:12768: checking libpmix.* in /opt/pmix/2.0.2/lib64
>> configure:12774: checking libpmix.* in /opt/pmix/2.0.2/lib
>> configure:12794: checking PMIx version
>> configure:12804: result: version file found
>>
>>
>> It fails on the test for PMIx 3, which is expected, but then reports
>>
>>
>> configure:12843: checking version 2x
>> configure:12861: gcc -E -I/opt/pmix/2.0.2/include  conftest.c
>> configure:12861: $? = 0
>> configure:12862: result: found
>>
>>
>> I have a small test MPI program, and it runs when launched with
>> mpirun.  The processes running on the first node of a two-node job
>> are
>>
>>
>> [bennet@cav02 ~]$ ps -ef | grep bennet | egrep 'test_mpi|srun'
>>
>> bennet   20340 20282  0 08:04 ?        00:00:00 mpirun ./test_mpi
>>
>> bennet   20346 20340  0 08:04 ?        00:00:00 srun
>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
>> "3609657344.0;tcp://10.242.15.36:58681"
>>
>> bennet   20347 20346  0 08:04 ?        00:00:00 srun
>> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=1
>> --nodelist=cav03 --ntasks=1 orted -mca ess "slurm" -mca ess_base_jobid
>> "3609657344" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca
>> orte_node_regex "cav[2:2-3]@0(2)" -mca orte_hnp_uri
>> "3609657344.0;tcp://10.242.15.36:58681"
>>
>> bennet   20352 20340 98 08:04 ?        00:01:50 ./test_mpi
>>
>> bennet   20353 20340 98 08:04 ?        00:01:50 ./test_mpi
>>
>>
>> However, when I run it using srun directly, I get the following output:
>>
>>
>> srun: Step created for job 87
>> [cav02.arc-ts.umich.edu:19828] OPAL ERROR: Not initialized in file
>> pmix2x_client.c at line 109
>> --------------------------------------------------------------------------
>> The application appears to have been direct launched using "srun",
>> but OMPI was not built with SLURM's PMI support and therefore cannot
>> execute. There are several options for building PMI support under
>> SLURM, depending upon the SLURM version you are using:
>>
>>  version 16.05 or later: you can use SLURM's PMIx support. This
>>  requires that you configure and build SLURM --with-pmix.
>>
>>  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
>>  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
>>  install PMI-2. You must then build Open MPI using --with-pmi pointing
>>  to the SLURM PMI library location.
>>
>> Please configure as appropriate and try again.
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***    and potentially your MPI job)
>> [cav02.arc-ts.umich.edu:19828] Local abort before MPI_INIT completed
>> completed successfully, but am not able to aggregate error messages,
>> and not able to guarantee that all other processes were killed!
>>
>>
>> Using the same scheme to set this up on x86_64 worked, and I am taking
>> installation parameters, test files, and job parameters from the
>> working x86_64 installation.
>>
>> Other than the architecture, the main difference between the two
>> clusters is that the aarch64 cluster has only Ethernet networking,
>> whereas the x86_64 cluster has InfiniBand.  I removed --with-verbs
>> from the configure line, though, and I thought that would be
>> sufficient.
>>
>> Anyone have suggestions what might be wrong, how to fix it, or for
>> further diagnostics?
>>
>> Thank you,    -- bennet