Well, this is kind of interesting.  I can strip the configure line back and
get mpirun to work on one node, but then neither srun nor mpirun within a
SLURM job will run.  I can add back configure options to get to

    ./configure \
        --prefix=${PREFIX} \
        --mandir=${PREFIX}/share/man \
        --with-pmix=/opt/pmix/2.0.2 \
        --with-slurm

and the situation does not seem to change.  Then I add libevent,

    ./configure \
        --prefix=${PREFIX} \
        --mandir=${PREFIX}/share/man \
        --with-pmix=/opt/pmix/2.0.2 \
        --with-libevent=external \
        --with-slurm

and it works again with srun but fails to run the binary with mpirun.  It is
late, and I am baffled.
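
When juggling configure variants like these, it can help to confirm what the
installed Open MPI actually recorded at build time.  A minimal check with
ompi_info might look like this (the grep patterns are only illustrative):

    # A sketch for cross-checking a build: ompi_info records the configure
    # line and lists the PMIx components the installation can actually use.
    ompi_info | grep -i "configure command line"
    ompi_info | grep -i pmix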

On Mon, Jun 18, 2018 at 9:02 PM Bennet Fauber <ben...@umich.edu> wrote:
>
> Ryan,
>
> With srun it's fine.  Only with mpirun is there a problem, and that is
> both on a single node and on multiple nodes.  SLURM was built against
> pmix 2.0.2, and I am pretty sure that SLURM's default is pmix.  We are
> running a recent patch of SLURM, I think.  SLURM and OMPI are both
> being built using the same installation of pmix.
>
> [bennet@cavium-hpc etc]$ srun --version
> slurm 17.11.7
>
> [bennet@cavium-hpc etc]$ grep pmi slurm.conf
> MpiDefault=pmix
>
> [bennet@cavium-hpc pmix]$ srun --mpi=list
> srun: MPI types are...
> srun: pmix_v2
> srun: openmpi
> srun: none
> srun: pmi2
> srun: pmix
>
> I think I said that I was pretty sure I had got this to work with both
> mpirun and srun at one point, but I am unable to find the magic a
> second time.
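
Since srun --mpi=list above reports both pmix and pmix_v2, and slurm.conf
already sets MpiDefault=pmix, a minimal batch script that requests the PMIx
plugin explicitly might look like the sketch below (the resource sizes and
binary name simply mirror the examples later in this thread):

    #!/bin/bash
    #SBATCH --job-name=test_mpi
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=24

    # Select the PMIx plugin explicitly instead of relying on MpiDefault;
    # 'srun --mpi=list' shows pmix and pmix_v2 are available on this cluster.
    srun --mpi=pmix ./test_mpi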

> On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
> >
> > What MPI is SLURM set to use/how was that compiled?  Out of the box, the
> > SLURM MPI is set to "none", or was last I checked, and so isn't necessarily
> > doing MPI.  Now, I did try this with OpenMPI 2.1.1 and it looked right
> > either way (OpenMPI built with "--with-pmi"), but for MVAPICH2 this
> > definitely made a difference:
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
> > srun: error: slepner032: task 10: Bus error
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
> >
> > > On Jun 17, 2018, at 5:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> > >
> > > I rebuilt with --enable-debug, then ran with
> > >
> > > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > > salloc: Pending job allocation 158
> > > salloc: job 158 queued and waiting for resources
> > > salloc: job 158 has been allocated resources
> > > salloc: Granted job allocation 158
> > >
> > > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > > The sum = 0.866386
> > > Elapsed time is: 5.426759
> > > The sum = 0.866386
> > > Elapsed time is: 5.424068
> > > The sum = 0.866386
> > > Elapsed time is: 5.426195
> > > The sum = 0.866386
> > > Elapsed time is: 5.426059
> > > The sum = 0.866386
> > > Elapsed time is: 5.423192
> > > The sum = 0.866386
> > > Elapsed time is: 5.426252
> > > The sum = 0.866386
> > > Elapsed time is: 5.425444
> > > The sum = 0.866386
> > > Elapsed time is: 5.423647
> > > The sum = 0.866386
> > > Elapsed time is: 5.426082
> > > The sum = 0.866386
> > > Elapsed time is: 5.425936
> > > The sum = 0.866386
> > > Elapsed time is: 5.423964
> > > Total time is: 59.677830
> > >
> > > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> > >
> > > The zipped debug log should be attached.
> > >
> > > I did that after using systemctl to turn off the firewall on the login
> > > node from which the mpirun is executed, as well as on the host on
> > > which it runs.
> > >
> > > [bennet@cavium-hpc ~]$ mpirun hostname
> > > --------------------------------------------------------------------------
> > > An ORTE daemon has unexpectedly failed after launch and before
> > > communicating back to mpirun. This could be caused by a number
> > > of factors, including an inability to create a connection back
> > > to mpirun due to a lack of common network interfaces and/or no
> > > route found between them. Please check network connectivity
> > > (including firewalls and network routing requirements).
> > > --------------------------------------------------------------------------
> > >
> > > [bennet@cavium-hpc ~]$ squeue
> > >   JOBID PARTITION     NAME     USER  ST    TIME  NODES NODELIST(REASON)
> > >     158  standard     bash   bennet   R   14:30      1 cav01
> > > [bennet@cavium-hpc ~]$ srun hostname
> > > cav01.arc-ts.umich.edu
> > > [ repeated 23 more times ]
> > >
> > > As always, your help is much appreciated,
> > >
> > > -- bennet
> > >
> > > On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> > >>
> > >> Add --enable-debug to your OMPI configure cmd line, and then add --mca
> > >> plm_base_verbose 10 to your mpirun cmd line.  For some reason, the remote
> > >> daemon isn't starting - this will give you some info as to why.
> > >>
> > >>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> > >>>
> > >>> I have a compiled binary that will run with srun but not with mpirun.
> > >>> The attempts to run with mpirun all result in failures to initialize.
> > >>> I have tried this on one node, and on two nodes, with firewall turned
> > >>> on and with it off.
> > >>>
> > >>> Am I missing some command line option for mpirun?
> > >>>
> > >>> OMPI built from this configure command
> > >>>
> > >>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> > >>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> > >>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > >>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> > >>> FC=gfortran
> > >>>
> > >>> All tests from `make check` passed, see below.
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpicc --show
> > >>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> > >>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> > >>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> > >>> -Wl,--enable-new-dtags
> > >>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> > >>>
> > >>> The test_mpi was compiled with
> > >>>
> > >>> $ gcc -o test_mpi test_mpi.c -lm
> > >>>
> > >>> This is the runtime library path
> > >>>
> > >>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> > >>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
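
With several pmix and libevent copies on that library path, one quick sanity
check is to see which shared objects the launcher actually resolves at run
time.  A hedged sketch, assuming mpirun is the binary on this PATH and that
the relevant library names contain "pmix" or "event" (output varies by system):

    # Illustrative only: list the pmix/libevent shared objects that mpirun
    # and libmpi resolve against the current LD_LIBRARY_PATH.
    ldd "$(which mpirun)" | grep -E -i 'pmix|event'
    ldd /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib/libmpi.so | grep -E -i 'pmix|event'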
> > >>>
> > >>> These commands are given in exact sequence in which they were entered
> > >>> at a console.
> > >>>
> > >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > >>> salloc: Pending job allocation 156
> > >>> salloc: job 156 queued and waiting for resources
> > >>> salloc: job 156 has been allocated resources
> > >>> salloc: Granted job allocation 156
> > >>>
> > >>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > >>> --------------------------------------------------------------------------
> > >>> An ORTE daemon has unexpectedly failed after launch and before
> > >>> communicating back to mpirun. This could be caused by a number
> > >>> of factors, including an inability to create a connection back
> > >>> to mpirun due to a lack of common network interfaces and/or no
> > >>> route found between them. Please check network connectivity
> > >>> (including firewalls and network routing requirements).
> > >>> --------------------------------------------------------------------------
> > >>>
> > >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.425439
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.427427
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.422579
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.424168
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.423951
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.422414
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.427156
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.424834
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.425103
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.422415
> > >>> The sum = 0.866386
> > >>> Elapsed time is: 5.422948
> > >>> Total time is: 59.668622
> > >>>
> > >>> Thanks,
> > >>> -- bennet
> > >>>
> > >>> make check results
> > >>> ----------------------------------------------
> > >>>
> > >>> make  check-TESTS
> > >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> > >>> PASS: predefined_gap_test
> > >>> PASS: predefined_pad_test
> > >>> SKIP: dlopen_test
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 3
> > >>> # PASS:  2
> > >>> # SKIP:  1
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> PASS: atomic_cmpset_noinline
> > >>>     - 5 threads: Passed
> > >>> PASS: atomic_cmpset_noinline
> > >>>     - 8 threads: Passed
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 8
> > >>> # PASS:  8
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> > >>> PASS: ompi_rb_tree
> > >>> PASS: opal_bitmap
> > >>> PASS: opal_hash_table
> > >>> PASS: opal_proc_table
> > >>> PASS: opal_tree
> > >>> PASS: opal_list
> > >>> PASS: opal_value_array
> > >>> PASS: opal_pointer_array
> > >>> PASS: opal_lifo
> > >>> PASS: opal_fifo
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 10
> > >>> # PASS:  10
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make  opal_thread opal_condition
> > >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>>   CC       opal_thread.o
> > >>>   CCLD     opal_thread
> > >>>   CC       opal_condition.o
> > >>>   CCLD     opal_condition
> > >>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> make  check-TESTS
> > >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 0
> > >>> # PASS:  0
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> > >>> PASS: opal_datatype_test
> > >>> PASS: unpack_hetero
> > >>> PASS: checksum
> > >>> PASS: position
> > >>> PASS: position_noncontig
> > >>> PASS: ddt_test
> > >>> PASS: ddt_raw
> > >>> PASS: unpack_ooo
> > >>> PASS: ddt_pack
> > >>> PASS: external32
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 10
> > >>> # PASS:  10
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> > >>> PASS: opal_bit_ops
> > >>> PASS: opal_path_nfs
> > >>> PASS: bipartite_graph
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 3
> > >>> # PASS:  3
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > >>> [ elided ]
> > >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> > >>> PASS: dss_buffer
> > >>> PASS: dss_cmp
> > >>> PASS: dss_payload
> > >>> PASS: dss_print
> > >>> ============================================================================
> > >>> Testsuite summary for Open MPI 3.1.0
> > >>> ============================================================================
> > >>> # TOTAL: 4
> > >>> # PASS:  4
> > >>> # SKIP:  0
> > >>> # XFAIL: 0
> > >>> # FAIL:  0
> > >>> # XPASS: 0
> > >>> # ERROR: 0
> > >>> ============================================================================
> > > <debug2.log.gz>
> >
> > --
> > ____
> > || \\UTGERS,     |---------------------------*O*---------------------------
> > ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
> >      `'