To eliminate possibilities, I removed all other versions of Open MPI from the system and rebuilt using the same build script that was used to generate the prior report.
[bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
Checking compilers and things
OMPI is ompi
COMP_NAME is gcc_7_1_0
SRC_ROOT is /sw/arcts/centos7/src
PREFIX_ROOT is /sw/arcts/centos7
PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
CONFIGURE_FLAGS are
COMPILERS are CC=gcc CXX=g++ FC=gfortran

Currently Loaded Modules:
  1) gcc/7.1.0

gcc (ARM-build-14) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Using the following configure command

./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man --with-pmix=/opt/pmix/2.0.2 --with-libevent=external --with-hwloc=external --with-slurm --disable-dlopen --enable-debug CC=gcc CXX=g++ FC=gfortran

The tar ball is

2e783873f6b206aa71f745762fa15da5  /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz

I still get

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 165
salloc: job 165 queued and waiting for resources
salloc: job 165 has been allocated resources
salloc: Granted job allocation 165

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.425549
The sum = 0.866386
Elapsed time is: 5.422826
The sum = 0.866386
Elapsed time is: 5.427676
The sum = 0.866386
Elapsed time is: 5.424928
The sum = 0.866386
Elapsed time is: 5.422060
The sum = 0.866386
Elapsed time is: 5.425431
The sum = 0.866386
Elapsed time is: 5.424350
The sum = 0.866386
Elapsed time is: 5.423037
The sum = 0.866386
Elapsed time is: 5.427727
The sum = 0.866386
Elapsed time is: 5.424922
The sum = 0.866386
Elapsed time is: 5.424279
Total time is: 59.672992

[bennet@cavium-hpc ~]$ mpirun ./test_mpi
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

I reran with

[bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug3.log

and the gzipped log is attached.

I also tried a different test program, which emits this error:

[cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not found in file base/ess_base_std_app.c at line 219
[cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not found in file base/ess_base_std_app.c at line 219
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  store DAEMON URI failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS

At one point I am almost certain that mpirun did work with this OMPI build,
and I am at a loss to explain why it no longer does.

I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0, and
we'll try downgrading SLURM to a prior version.
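For reference, the test_mpi.c source never appears in this thread. The following is only a minimal sketch of an MPI timing test that would produce output of the shape shown above; the summed quantity, loop bounds, and the final reduction are illustrative assumptions, not the actual program. It would be built with this install's wrapper compiler (mpicc) rather than plain gcc, so the matching Open MPI headers and libraries are picked up.

/* test_mpi_sketch.c -- illustrative sketch only, not the original test_mpi.c.
 * Build: mpicc -o test_mpi_sketch test_mpi_sketch.c -lm
 * Run:   srun ./test_mpi_sketch   or   mpirun ./test_mpi_sketch
 */
#include <math.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();

    /* Placeholder per-rank floating-point work. */
    double sum = 0.0;
    for (long i = 1; i <= 50000000L; i++)
        sum += sin((double) i) / (double) i;

    double elapsed = MPI_Wtime() - t0;
    printf("The sum = %f\n", sum);
    printf("Elapsed time is: %f\n", elapsed);

    /* Sum of per-rank elapsed times, reported by rank 0 (assumed). */
    double total = 0.0;
    MPI_Reduce(&elapsed, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Total time is: %f\n", total);

    MPI_Finalize();
    return 0;
}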
-- bennet

On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> Hmmm...well, the error has changed from your initial report. Turning off the
> firewall was the solution to that problem.
>
> This problem is different - it isn’t the orted that failed in the log you
> sent, but the application proc that couldn’t initialize. It looks like that
> app was compiled against some earlier version of OMPI? It is looking for
> something that no longer exists. I saw that you compiled it with a simple
> “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My
> guess is that your compile picked up some older version of OMPI on the system.
>
> Ralph
>
>
> > On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is: 5.426759
> > The sum = 0.866386
> > Elapsed time is: 5.424068
> > The sum = 0.866386
> > Elapsed time is: 5.426195
> > The sum = 0.866386
> > Elapsed time is: 5.426059
> > The sum = 0.866386
> > Elapsed time is: 5.423192
> > The sum = 0.866386
> > Elapsed time is: 5.426252
> > The sum = 0.866386
> > Elapsed time is: 5.425444
> > The sum = 0.866386
> > Elapsed time is: 5.423647
> > The sum = 0.866386
> > Elapsed time is: 5.426082
> > The sum = 0.866386
> > Elapsed time is: 5.425936
> > The sum = 0.866386
> > Elapsed time is: 5.423964
> > Total time is: 59.677830
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> > 2>&1 | tee debug2.log
> >
> > The zipped debug log should be attached.
> >
> > I did that after using systemctl to turn off the firewall on the login
> > node from which the mpirun is executed, as well as on the host on
> > which it runs.
> >
> > [bennet@cavium-hpc ~]$ mpirun hostname
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [bennet@cavium-hpc ~]$ squeue
> >   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >     158  standard     bash   bennet  R      14:30      1 cav01
> > [bennet@cavium-hpc ~]$ srun hostname
> > cav01.arc-ts.umich.edu
> > [ repeated 23 more times ]
> >
> > As always, your help is much appreciated,
> >
> > -- bennet
> >
> > On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>
> >> Add --enable-debug to your OMPI configure cmd line, and then add --mca
> >> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote
> >> daemon isn’t starting - this will give you some info as to why.
> >>
> >>
> >>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>
> >>> I have a compiled binary that will run with srun but not with mpirun.
> >>> The attempts to run with mpirun all result in failures to initialize.
> >>> I have tried this on one node, and on two nodes, with firewall turned
> >>> on and with it off.
> >>>
> >>> Am I missing some command line option for mpirun?
> >>>
> >>> OMPI built from this configure command
> >>>
> >>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> >>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> >>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> >>> FC=gfortran
> >>>
> >>> All tests from `make check` passed, see below.
> >>>
> >>> [bennet@cavium-hpc ~]$ mpicc --show
> >>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> >>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> >>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> >>> -Wl,--enable-new-dtags
> >>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >>>
> >>> The test_mpi was compiled with
> >>>
> >>> $ gcc -o test_mpi test_mpi.c -lm
> >>>
> >>> This is the runtime library path
> >>>
> >>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> >>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >>>
> >>>
> >>> These commands are given in exact sequence in which they were entered
> >>> at a console.
> >>>
> >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>> salloc: Pending job allocation 156
> >>> salloc: job 156 queued and waiting for resources
> >>> salloc: job 156 has been allocated resources
> >>> salloc: Granted job allocation 156
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> >>> --------------------------------------------------------------------------
> >>> An ORTE daemon has unexpectedly failed after launch and before
> >>> communicating back to mpirun. This could be caused by a number
> >>> of factors, including an inability to create a connection back
> >>> to mpirun due to a lack of common network interfaces and/or no
> >>> route found between them. Please check network connectivity
> >>> (including firewalls and network routing requirements).
> >>> --------------------------------------------------------------------------
> >>>
> >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.425439
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.427427
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.422579
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.424168
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.423951
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.422414
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.427156
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.424834
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.425103
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.422415
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.422948
> >>> Total time is: 59.668622
> >>>
> >>> Thanks,  -- bennet
> >>>
> >>>
> >>> make check results
> >>> ----------------------------------------------
> >>>
> >>> make check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> PASS: predefined_gap_test
> >>> PASS: predefined_pad_test
> >>> SKIP: dlopen_test
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  2
> >>> # SKIP:  1
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> PASS: atomic_cmpset_noinline
> >>>    - 5 threads: Passed
> >>> PASS: atomic_cmpset_noinline
> >>>    - 8 threads: Passed
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 8
> >>> # PASS:  8
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> >>> PASS: ompi_rb_tree
> >>> PASS: opal_bitmap
> >>> PASS: opal_hash_table
> >>> PASS: opal_proc_table
> >>> PASS: opal_tree
> >>> PASS: opal_list
> >>> PASS: opal_value_array
> >>> PASS: opal_pointer_array
> >>> PASS: opal_lifo
> >>> PASS: opal_fifo
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make opal_thread opal_condition
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>   CC       opal_thread.o
> >>>   CCLD     opal_thread
> >>>   CC       opal_condition.o
> >>>   CCLD     opal_condition
> >>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 0
> >>> # PASS:  0
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> >>> PASS: opal_datatype_test
> >>> PASS: unpack_hetero
> >>> PASS: checksum
> >>> PASS: position
> >>> PASS: position_noncontig
> >>> PASS: ddt_test
> >>> PASS: ddt_raw
> >>> PASS: unpack_ooo
> >>> PASS: ddt_pack
> >>> PASS: external32
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> >>> PASS: opal_bit_ops
> >>> PASS: opal_path_nfs
> >>> PASS: bipartite_graph
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  3
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> >>> PASS: dss_buffer
> >>> PASS: dss_cmp
> >>> PASS: dss_payload
> >>> PASS: dss_print
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 4
> >>> # PASS:  4
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> > <debug2.log.gz>
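As a possible follow-up to Ralph's suggestion above that the application may have been compiled against, or may be resolving, an older Open MPI, a small runtime probe along these lines could confirm which MPI library a process actually links at run time. This sketch is not from the original thread; the file name is made up, and it is only a suggested diagnostic, built with the 3.1.0 install's mpicc and launched under both srun and mpirun.

/* version_probe.c -- illustrative sketch only, not part of the original thread.
 * Build: mpicc -o version_probe version_probe.c
 * Run it under both srun and mpirun to compare which MPI library is resolved. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, major, minor, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_version(&major, &minor);          /* MPI standard level */
    MPI_Get_library_version(version, &len);   /* library identification string */
    if (rank == 0)
        printf("MPI standard %d.%d; library: %s\n", major, minor, version);
    MPI_Finalize();
    return 0;
}

Running ldd ./test_mpi and checking that it resolves against /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/lib would serve the same purpose for the already-built binary.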
Attachment: debug3.log.gz (application/gzip)