What MPI is SLURM set to use, and how was it compiled? Out of the box, the SLURM 
MPI default is set to "none", or was last I checked, and so srun isn't 
necessarily wiring up MPI for the tasks at all. I did try this with OpenMPI 
2.1.1 and it looked right either way (that OpenMPI was built with 
"--with-pmi"), but for MVAPICH2 it definitely made a difference (a couple of 
quick checks are sketched after the output below):

[novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
[slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
srun: error: slepner032: task 10: Bus error

[novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
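
The two quick checks I mentioned (both are standard Slurm commands; the exact plugin list and default will of course depend on how your Slurm was built and configured):

# list the MPI plugin types this Slurm installation supports
srun --mpi=list

# show the cluster-wide default (MpiDefault in slurm.conf)
scontrol show config | grep -i MpiDefault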

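For completeness, the test binary above is just the usual MPI hello world. A minimal sketch of what such a source looks like, reconstructed from the output format rather than the exact file used here:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                      /* start the MPI runtime      */
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* total number of ranks      */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* this process's rank        */
    MPI_Get_processor_name(processor_name, &name_len);

    /* When launched without PMI wire-up (srun --mpi=none above), each task
       saw a world of size 1 and printed "rank 0 out of 1". */
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

Built with the MPI wrapper compiler (mpicc -o mpi_hello_world mpi_hello_world.c) and launched with srun as shown above.
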
> On Jun 17, 2018, at 5:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> I rebuilt with --enable-debug, then ran with
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 158
> salloc: job 158 queued and waiting for resources
> salloc: job 158 has been allocated resources
> salloc: Granted job allocation 158
> 
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.426759
> The sum = 0.866386
> Elapsed time is:  5.424068
> The sum = 0.866386
> Elapsed time is:  5.426195
> The sum = 0.866386
> Elapsed time is:  5.426059
> The sum = 0.866386
> Elapsed time is:  5.423192
> The sum = 0.866386
> Elapsed time is:  5.426252
> The sum = 0.866386
> Elapsed time is:  5.425444
> The sum = 0.866386
> Elapsed time is:  5.423647
> The sum = 0.866386
> Elapsed time is:  5.426082
> The sum = 0.866386
> Elapsed time is:  5.425936
> The sum = 0.866386
> Elapsed time is:  5.423964
> Total time is:  59.677830
> 
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> 
> The zipped debug log should be attached.
> 
> I did that after using systemctl to turn off the firewall on the login
> node from which the mpirun is executed, as well as on the host on
> which it runs.
> 
> [bennet@cavium-hpc ~]$ mpirun hostname
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> 
> [bennet@cavium-hpc ~]$ squeue
>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>               158  standard     bash   bennet  R      14:30      1 cav01
> [bennet@cavium-hpc ~]$ srun hostname
> cav01.arc-ts.umich.edu
> [ repeated 23 more times ]
> 
> As always, your help is much appreciated,
> 
> -- bennet
> 
> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
>> 
>> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
>> daemon isn’t starting - this will give you some info as to why.
>> 
>> 
>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>> 
>>> I have a compiled binary that will run with srun but not with mpirun.
>>> The attempts to run with mpirun all result in failures to initialize.
>>> I have tried this on one node, and on two nodes, with firewall turned
>>> on and with it off.
>>> 
>>> Am I missing some command line option for mpirun?
>>> 
>>> OMPI built from this configure command
>>> 
>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
>>> FC=gfortran
>>> 
>>> All tests from `make check` passed, see below.
>>> 
>>> [bennet@cavium-hpc ~]$ mpicc --show
>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
>>> -Wl,--enable-new-dtags
>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
>>> 
>>> The test_mpi was compiled with
>>> 
>>> $ gcc -o test_mpi test_mpi.c -lm
>>> 
>>> This is the runtime library path
>>> 
>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
>>> 
>>> 
>>> These commands are given in exact sequence in which they were entered
>>> at a console.
>>> 
>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
>>> salloc: Pending job allocation 156
>>> salloc: job 156 queued and waiting for resources
>>> salloc: job 156 has been allocated resources
>>> salloc: Granted job allocation 156
>>> 
>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
>>> --------------------------------------------------------------------------
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --------------------------------------------------------------------------
>>> 
>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
>>> The sum = 0.866386
>>> Elapsed time is:  5.425439
>>> The sum = 0.866386
>>> Elapsed time is:  5.427427
>>> The sum = 0.866386
>>> Elapsed time is:  5.422579
>>> The sum = 0.866386
>>> Elapsed time is:  5.424168
>>> The sum = 0.866386
>>> Elapsed time is:  5.423951
>>> The sum = 0.866386
>>> Elapsed time is:  5.422414
>>> The sum = 0.866386
>>> Elapsed time is:  5.427156
>>> The sum = 0.866386
>>> Elapsed time is:  5.424834
>>> The sum = 0.866386
>>> Elapsed time is:  5.425103
>>> The sum = 0.866386
>>> Elapsed time is:  5.422415
>>> The sum = 0.866386
>>> Elapsed time is:  5.422948
>>> Total time is:  59.668622
>>> 
>>> Thanks,    -- bennet
>>> 
>>> 
>>> make check results
>>> ----------------------------------------------
>>> 
>>> make  check-TESTS
>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>> PASS: predefined_gap_test
>>> PASS: predefined_pad_test
>>> SKIP: dlopen_test
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 3
>>> # PASS:  2
>>> # SKIP:  1
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>>> [ elided ]
>>> PASS: atomic_cmpset_noinline
>>>   - 5 threads: Passed
>>> PASS: atomic_cmpset_noinline
>>>   - 8 threads: Passed
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 8
>>> # PASS:  8
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
>>> PASS: ompi_rb_tree
>>> PASS: opal_bitmap
>>> PASS: opal_hash_table
>>> PASS: opal_proc_table
>>> PASS: opal_tree
>>> PASS: opal_list
>>> PASS: opal_value_array
>>> PASS: opal_pointer_array
>>> PASS: opal_lifo
>>> PASS: opal_fifo
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 10
>>> # PASS:  10
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>>> [ elided ]
>>> make  opal_thread opal_condition
>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> CC       opal_thread.o
>>> CCLD     opal_thread
>>> CC       opal_condition.o
>>> CCLD     opal_condition
>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> make  check-TESTS
>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 0
>>> # PASS:  0
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
>>> PASS: opal_datatype_test
>>> PASS: unpack_hetero
>>> PASS: checksum
>>> PASS: position
>>> PASS: position_noncontig
>>> PASS: ddt_test
>>> PASS: ddt_raw
>>> PASS: unpack_ooo
>>> PASS: ddt_pack
>>> PASS: external32
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 10
>>> # PASS:  10
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
>>> PASS: opal_bit_ops
>>> PASS: opal_path_nfs
>>> PASS: bipartite_graph
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 3
>>> # PASS:  3
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
>>> PASS: dss_buffer
>>> PASS: dss_cmp
>>> PASS: dss_payload
>>> PASS: dss_print
>>> ============================================================================
>>> Testsuite summary for Open MPI 3.1.0
>>> ============================================================================
>>> # TOTAL: 4
>>> # PASS:  4
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> ============================================================================
>> 
> <debug2.log.gz>

--
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
