To eliminate possibilities, I removed all other versions of Open MPI
from the system and rebuilt using the same build script that was used
to generate the prior report.
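
For reference, the script is basically a thin wrapper around module setup,
configure, and make; a rough sketch of what it does (reconstructed from the
output below rather than from the script itself, so details such as the build
directory and make parallelism are guesses) is:

# rough sketch of ompi-3.1.0bd.sh -- not the actual script
PREFIX=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
module load gcc/7.1.0
mkdir -p /tmp/build && cd /tmp/build
md5sum /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
tar xzf /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
cd openmpi-3.1.0
./configure --prefix=$PREFIX --mandir=$PREFIX/share/man \
    --with-pmix=/opt/pmix/2.0.2 --with-libevent=external --with-hwloc=external \
    --with-slurm --disable-dlopen --enable-debug CC=gcc CXX=g++ FC=gfortran
make -j && make install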

[bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
Checking compilers and things
OMPI is ompi
COMP_NAME is gcc_7_1_0
SRC_ROOT is /sw/arcts/centos7/src
PREFIX_ROOT is /sw/arcts/centos7
PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
CONFIGURE_FLAGS are
COMPILERS are CC=gcc CXX=g++ FC=gfortran

Currently Loaded Modules:
  1) gcc/7.1.0

 gcc (ARM-build-14) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the following configure command

./configure     --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
--with-pmix=/opt/pmix/2.0.2     --with-libevent=external
--with-hwloc=external     --with-slurm     --disable-dlopen
--enable-debug          CC=gcc CXX=g++ FC=gfortran

The tar ball is

2e783873f6b206aa71f745762fa15da5
/sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz

I still get the same behavior: srun works, but mpirun fails.

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 165
salloc: job 165 queued and waiting for resources
salloc: job 165 has been allocated resources
salloc: Granted job allocation 165
[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  5.425549
The sum = 0.866386
Elapsed time is:  5.422826
The sum = 0.866386
Elapsed time is:  5.427676
The sum = 0.866386
Elapsed time is:  5.424928
The sum = 0.866386
Elapsed time is:  5.422060
The sum = 0.866386
Elapsed time is:  5.425431
The sum = 0.866386
Elapsed time is:  5.424350
The sum = 0.866386
Elapsed time is:  5.423037
The sum = 0.866386
Elapsed time is:  5.427727
The sum = 0.866386
Elapsed time is:  5.424922
The sum = 0.866386
Elapsed time is:  5.424279
Total time is:  59.672992

[bennet@cavium-hpc ~]$ mpirun ./test_mpi
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I reran with

[bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
2>&1 | tee debug3.log

and the gzipped log is attached.

I then tried a different test program, which produces the following error:

[cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
[cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  store DAEMON URI failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS


I am almost certain that mpirun from this OMPI did work at one point,
and I am at a loss to explain why it no longer does.
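
In case something stale is still being picked up, the obvious checks of which
mpirun/orted are on the path and which libmpi (if any) the test binary resolves
to would be something like:

command -v mpirun orted
mpirun --version
ldd ./test_mpi | grep -i mpi   # see which libmpi, if any, the binary links against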

I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
and we'll try downgrading SLURM to a prior version.
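
Rebuilding the test with the wrapper compiler instead of bare gcc, as was
suggested earlier in the thread, should also rule out a mismatch against an
older library, e.g.:

mpicc -o test_mpi test_mpi.c -lm
srun ./test_mpi
mpirun ./test_mpi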

-- bennet


On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org
<r...@open-mpi.org> wrote:
>
> Hmmm...well, the error has changed from your initial report. Turning off the 
> firewall was the solution to that problem.
>
> This problem is different - it isn’t the orted that failed in the log you 
> sent, but the application proc that couldn’t initialize. It looks like that 
> app was compiled against some earlier version of OMPI? It is looking for 
> something that no longer exists. I saw that you compiled it with a simple 
> “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My 
> guess is that your compile picked up some older version of OMPI on the system.
>
> Ralph
>
>
> > On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.426759
> > The sum = 0.866386
> > Elapsed time is:  5.424068
> > The sum = 0.866386
> > Elapsed time is:  5.426195
> > The sum = 0.866386
> > Elapsed time is:  5.426059
> > The sum = 0.866386
> > Elapsed time is:  5.423192
> > The sum = 0.866386
> > Elapsed time is:  5.426252
> > The sum = 0.866386
> > Elapsed time is:  5.425444
> > The sum = 0.866386
> > Elapsed time is:  5.423647
> > The sum = 0.866386
> > Elapsed time is:  5.426082
> > The sum = 0.866386
> > Elapsed time is:  5.425936
> > The sum = 0.866386
> > Elapsed time is:  5.423964
> > Total time is:  59.677830
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> > 2>&1 | tee debug2.log
> >
> > The zipped debug log should be attached.
> >
> > I did that after using systemctl to turn off the firewall on the login
> > node from which the mpirun is executed, as well as on the host on
> > which it runs.
> >
> > [bennet@cavium-hpc ~]$ mpirun hostname
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [bennet@cavium-hpc ~]$ squeue
> >             JOBID PARTITION     NAME     USER ST       TIME  NODES
> > NODELIST(REASON)
> >               158  standard     bash   bennet  R      14:30      1 cav01
> > [bennet@cavium-hpc ~]$ srun hostname
> > cav01.arc-ts.umich.edu
> > [ repeated 23 more times ]
> >
> > As always, your help is much appreciated,
> >
> > -- bennet
> >
> > On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>
> >> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> >> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
> >> daemon isn’t starting - this will give you some info as to why.
> >>
> >>
> >>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>
> >>> I have a compiled binary that will run with srun but not with mpirun.
> >>> The attempts to run with mpirun all result in failures to initialize.
> >>> I have tried this on one node, and on two nodes, with firewall turned
> >>> on and with it off.
> >>>
> >>> Am I missing some command line option for mpirun?
> >>>
> >>> OMPI built from this configure command
> >>>
> >>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> >>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> >>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> >>> FC=gfortran
> >>>
> >>> All tests from `make check` passed, see below.
> >>>
> >>> [bennet@cavium-hpc ~]$ mpicc --show
> >>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> >>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> >>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> >>> -Wl,--enable-new-dtags
> >>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >>>
> >>> The test_mpi was compiled with
> >>>
> >>> $ gcc -o test_mpi test_mpi.c -lm
> >>>
> >>> This is the runtime library path
> >>>
> >>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> >>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >>>
> >>>
> >>> These commands are given in exact sequence in which they were entered
> >>> at a console.
> >>>
> >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>> salloc: Pending job allocation 156
> >>> salloc: job 156 queued and waiting for resources
> >>> salloc: job 156 has been allocated resources
> >>> salloc: Granted job allocation 156
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> >>> --------------------------------------------------------------------------
> >>> An ORTE daemon has unexpectedly failed after launch and before
> >>> communicating back to mpirun. This could be caused by a number
> >>> of factors, including an inability to create a connection back
> >>> to mpirun due to a lack of common network interfaces and/or no
> >>> route found between them. Please check network connectivity
> >>> (including firewalls and network routing requirements).
> >>> --------------------------------------------------------------------------
> >>>
> >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.425439
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.427427
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422579
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.424168
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.423951
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422414
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.427156
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.424834
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.425103
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422415
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422948
> >>> Total time is:  59.668622
> >>>
> >>> Thanks,    -- bennet
> >>>
> >>>
> >>> make check results
> >>> ----------------------------------------------
> >>>
> >>> make  check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> PASS: predefined_gap_test
> >>> PASS: predefined_pad_test
> >>> SKIP: dlopen_test
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  2
> >>> # SKIP:  1
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> PASS: atomic_cmpset_noinline
> >>>   - 5 threads: Passed
> >>> PASS: atomic_cmpset_noinline
> >>>   - 8 threads: Passed
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 8
> >>> # PASS:  8
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> >>> PASS: ompi_rb_tree
> >>> PASS: opal_bitmap
> >>> PASS: opal_hash_table
> >>> PASS: opal_proc_table
> >>> PASS: opal_tree
> >>> PASS: opal_list
> >>> PASS: opal_value_array
> >>> PASS: opal_pointer_array
> >>> PASS: opal_lifo
> >>> PASS: opal_fifo
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make  opal_thread opal_condition
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> CC       opal_thread.o
> >>> CCLD     opal_thread
> >>> CC       opal_condition.o
> >>> CCLD     opal_condition
> >>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make  check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 0
> >>> # PASS:  0
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> >>> PASS: opal_datatype_test
> >>> PASS: unpack_hetero
> >>> PASS: checksum
> >>> PASS: position
> >>> PASS: position_noncontig
> >>> PASS: ddt_test
> >>> PASS: ddt_raw
> >>> PASS: unpack_ooo
> >>> PASS: ddt_pack
> >>> PASS: external32
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> >>> PASS: opal_bit_ops
> >>> PASS: opal_path_nfs
> >>> PASS: bipartite_graph
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  3
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> >>> PASS: dss_buffer
> >>> PASS: dss_cmp
> >>> PASS: dss_payload
> >>> PASS: dss_print
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 4
> >>> # PASS:  4
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================


