Well, this is kind of interesting.  I can strip the configure line
back to a minimal set and get mpirun to work on one node, but then
neither srun nor mpirun will run inside a SLURM job.  I can add
configure options back until I get to

./configure \
    --prefix=${PREFIX} \
    --mandir=${PREFIX}/share/man \
    --with-pmix=/opt/pmix/2.0.2 \
    --with-slurm

and the situation does not seem to change.  Then I add libevent,

./configure \
    --prefix=${PREFIX} \
    --mandir=${PREFIX}/share/man \
    --with-pmix=/opt/pmix/2.0.2 \
    --with-libevent=external \
    --with-slurm

and it works again with srun but fails to run the binary with mpirun.
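
One thing I have not yet confirmed is which PMIx and libevent each of
these builds actually picked up.  A rough check, assuming the freshly
built ompi_info is first in PATH and ${PREFIX} is the prefix from the
configure lines above, would be:

$ ompi_info | grep -i pmix
$ ldd ${PREFIX}/bin/orted | grep -iE 'pmix|event'

If libevent is what changes the mpirun behavior, I would expect the
linked libevent to differ between the two builds.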

It is late, and I am baffled.

On Mon, Jun 18, 2018 at 9:02 PM Bennet Fauber <ben...@umich.edu> wrote:
>
> Ryan,
>
> With srun it's fine.  Only with mpirun is there a problem, and that is
> both on a single node and on multiple nodes.  SLURM was built against
> pmix 2.0.2, and I am pretty sure that SLURM's default is pmix.  We are
> running a recent patch of SLURM, I think.  SLURM and OMPI are both
> being built using the same installation of pmix.
>
> [bennet@cavium-hpc etc]$ srun --version
> slurm 17.11.7
>
> [bennet@cavium-hpc etc]$ grep pmi slurm.conf
> MpiDefault=pmix
>
> [bennet@cavium-hpc pmix]$ srun --mpi=list
> srun: MPI types are...
> srun: pmix_v2
> srun: openmpi
> srun: none
> srun: pmi2
> srun: pmix
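>
> If it helps the diagnosis, the plugin can also be forced explicitly
> instead of relying on MpiDefault; something along the lines of
> (pmix_v2 being one of the types listed above):
>
> [bennet@cavium-hpc ~]$ srun --mpi=pmix_v2 ./test_mpi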
>
> I think I said that I was pretty sure I had got this to work with both
> mpirun and srun at one point, but I am unable to find the magic a
> second time.
>
>
>
>
> On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
> >
> > Which MPI is SLURM set to use, and how was it compiled? Out of the box,
> > SLURM's MpiDefault is set to "none" (or it was, last I checked), so srun
> > isn't necessarily doing any MPI setup at all. I did try this with OpenMPI
> > 2.1.1 and it looked right either way (OpenMPI built with "--with-pmi"),
> > but for MVAPICH2 it definitely made a difference:
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
> > srun: error: slepner032: task 10: Bus error
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
> >
> > > On Jun 17, 2018, at 5:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> > >
> > > I rebuilt with --enable-debug, then ran with
> > >
> > > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > > salloc: Pending job allocation 158
> > > salloc: job 158 queued and waiting for resources
> > > salloc: job 158 has been allocated resources
> > > salloc: Granted job allocation 158
> > >
> > > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > > The sum = 0.866386
> > > Elapsed time is:  5.426759
> > > The sum = 0.866386
> > > Elapsed time is:  5.424068
> > > The sum = 0.866386
> > > Elapsed time is:  5.426195
> > > The sum = 0.866386
> > > Elapsed time is:  5.426059
> > > The sum = 0.866386
> > > Elapsed time is:  5.423192
> > > The sum = 0.866386
> > > Elapsed time is:  5.426252
> > > The sum = 0.866386
> > > Elapsed time is:  5.425444
> > > The sum = 0.866386
> > > Elapsed time is:  5.423647
> > > The sum = 0.866386
> > > Elapsed time is:  5.426082
> > > The sum = 0.866386
> > > Elapsed time is:  5.425936
> > > The sum = 0.866386
> > > Elapsed time is:  5.423964
> > > Total time is:  59.677830
> > >
> > > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> > >
> > > The zipped debug log should be attached.
> > >
> > > I did that after using systemctl to turn off the firewall on the login
> > > node from which the mpirun is executed, as well as on the host on
> > > which it runs.
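> > >
> > > On CentOS 7 that is normally the firewalld service, so the commands were
> > > roughly the following, run on both the login node and the compute node:
> > >
> > > $ sudo systemctl stop firewalld
> > > $ systemctl status firewalld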
> > >
> > > [bennet@cavium-hpc ~]$ mpirun hostname
> > > --------------------------------------------------------------------------
> > > An ORTE daemon has unexpectedly failed after launch and before
> > > communicating back to mpirun. This could be caused by a number
> > > of factors, including an inability to create a connection back
> > > to mpirun due to a lack of common network interfaces and/or no
> > > route found between them. Please check network connectivity
> > > (including firewalls and network routing requirements).
> > > --------------------------------------------------------------------------
> > >
> > > [bennet@cavium-hpc ~]$ squeue
> > >             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> > >               158  standard     bash   bennet  R      14:30      1 cav01
> > > [bennet@cavium-hpc ~]$ srun hostname
> > > cav01.arc-ts.umich.edu
> > > [ repeated 23 more times ]
> > >
> > > As always, your help is much appreciated,
> > >
> > > -- bennet
> > >
> > > On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> > >>
> > >> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> > >> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
> > >> daemon isn’t starting - this will give you some info as to why.
> > >>
> > >>
> > >>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> > >>>
> > >>> I have a compiled binary that will run with srun but not with mpirun.
> > >>> The attempts to run with mpirun all result in failures to initialize.
> > >>> I have tried this on one node, and on two nodes, with firewall turned
> > >>> on and with it off.
> > >>>
> > >>> Am I missing some command line option for mpirun?
> > >>>
> > >>> OMPI built from this configure command
> > >>>
> > >>> $ ./configure \
> > >>>     --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b \
> > >>>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man \
> > >>>     --with-pmix=/opt/pmix/2.0.2 \
> > >>>     --with-libevent=external \
> > >>>     --with-hwloc=external \
> > >>>     --with-slurm \
> > >>>     --disable-dlopen \
> > >>>     CC=gcc CXX=g++ FC=gfortran
> > >>>
> > >>> All tests from `make check` passed, see below.
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpicc --show
> > >>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -Wl,--enable-new-dtags -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> > >>>
> > >>> The test_mpi was compiled with
> > >>>
> > >>> $ gcc -o test_mpi test_mpi.c -lm
> > >>>
> > >>> This is the runtime library path
> > >>>
> > >>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> > >>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> > >>>
> > >>>
> > >>> These commands are given in exact sequence in which they were entered
> > >>> at a console.
> > >>>
> > >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > >>> salloc: Pending job allocation 156
> > >>> salloc: job 156 queued and waiting for resources
> > >>> salloc: job 156 has been allocated resources
> > >>> salloc: Granted job allocation 156
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > >>> --------------------------------------------------------------------------
> > >>> An ORTE daemon has unexpectedly failed after launch and before
> > >>> communicating back to mpirun. This could be caused by a number
> > >>> of factors, including an inability to create a connection back
> > >>> to mpirun due to a lack of common network interfaces and/or no
> > >>> route found between them. Please check network connectivity
> > >>> (including firewalls and network routing requirements).
> > >>> --------------------------------------------------------------------------
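> > >>>
> > >>> If that "no common network interfaces" message is literal, would pinning
> > >>> the interfaces be a reasonable workaround?  I was thinking of something
> > >>> along these lines, where eth0 is only a placeholder for whatever network
> > >>> the login and compute nodes actually share:
> > >>>
> > >>> $ mpirun --mca oob_tcp_if_include eth0 \
> > >>>          --mca btl_tcp_if_include eth0 ./test_mpi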
> > >>>
> > >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.425439
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.427427
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.422579
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.424168
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.423951
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.422414
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.427156
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.424834
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.425103
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.422415
> > >>> The sum = 0.866386
> > >>> Elapsed time is:  5.422948
> > >>> Total time is:  59.668622
> > >>>
> > >>> Thanks,    -- bennet
> > >>>
> > >>>
> > >>> make check results
> > >>> ----------------------------------------------
> > >>>
> > >>> make  check-TESTS
> > >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>> PASS: predefined_gap_test
> > >>> PASS: predefined_pad_test
> > >>> SKIP: dlopen_test
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 3
> > >>> # PASS:  2
> > >>> # SKIP:  1
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> PASS: atomic_cmpset_noinline
> > >>>   - 5 threads: Passed
> > >>> PASS: atomic_cmpset_noinline
> > >>>   - 8 threads: Passed
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 8
> > >>> # PASS:  8
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> > >>> PASS: ompi_rb_tree
> > >>> PASS: opal_bitmap
> > >>> PASS: opal_hash_table
> > >>> PASS: opal_proc_table
> > >>> PASS: opal_tree
> > >>> PASS: opal_list
> > >>> PASS: opal_value_array
> > >>> PASS: opal_pointer_array
> > >>> PASS: opal_lifo
> > >>> PASS: opal_fifo
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 10
> > >>> # PASS:  10
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make  opal_thread opal_condition
> > >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> CC       opal_thread.o
> > >>> CCLD     opal_thread
> > >>> CC       opal_condition.o
> > >>> CCLD     opal_condition
> > >>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> make  check-TESTS
> > >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 0
> > >>> # PASS:  0
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> > >>> PASS: opal_datatype_test
> > >>> PASS: unpack_hetero
> > >>> PASS: checksum
> > >>> PASS: position
> > >>> PASS: position_noncontig
> > >>> PASS: ddt_test
> > >>> PASS: ddt_raw
> > >>> PASS: unpack_ooo
> > >>> PASS: ddt_pack
> > >>> PASS: external32
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 10
> > >>> # PASS:  10
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> > >>> PASS: opal_bit_ops
> > >>> PASS: opal_path_nfs
> > >>> PASS: bipartite_graph
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 3
> > >>> # PASS:  3
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> > >>> PASS: dss_buffer
> > >>> PASS: dss_cmp
> > >>> PASS: dss_payload
> > >>> PASS: dss_print
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 4
> > >>> # PASS:  4
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>
> > > <debug2.log.gz>
> >
> > --
> > ____
> > || \\UTGERS,     |---------------------------*O*---------------------------
> > ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
> >      `'
> >
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
