Here are the results when logging in to the compute node via ssh and running as 
you suggest:

[binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

Here are the results when executing over Torque (shell launched with "qsub -l 
nodes=2 -I"):

[binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:21584] *** Process received signal ***
[binf316:21584] Signal: Aborted (6)
[binf316:21584] Signal code:  (-6)
[binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0]
[binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55]
[binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131]
[binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10]
[binf316:21584] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b]
[binf316:21584] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474]
[binf316:21584] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316]
[binf316:21584] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817]
[binf316:21584] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e]
[binf316:21584] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42]
[binf316:21584] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fe33a531d1b]
[binf316:21584] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fe3344e7739]
[binf316:21584] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7fe33a5589b2]
[binf316:21584] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fe33a4c533c]
[binf316:21584] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fe33a4fa386]
[binf316:21584] [15] ring_c[0x40096f]
[binf316:21584] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fe339efbc36]
[binf316:21584] [17] ring_c[0x400889]
[binf316:21584] *** End of error message ***
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:21583] *** Process received signal ***
[binf316:21583] Signal: Aborted (6)
[binf316:21583] Signal code:  (-6)
[binf316:21583] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0]
[binf316:21583] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55]
[binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131]
[binf316:21583] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f3b5830ea10]
[binf316:21583] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f3b539af84b]
[binf316:21583] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f3b539ae474]
[binf316:21583] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f3b539a7316]
[binf316:21583] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f3b53990817]
[binf316:21583] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f3b58938a5e]
[binf316:21583] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f3b53bd2d42]
[binf316:21583] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f3b58937d1b]
[binf316:21583] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f3b528ed739]
[binf316:21583] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f3b5895e9b2]
[binf316:21583] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c]
[binf316:21583] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386]
[binf316:21583] [15] ring_c[0x40096f]
[binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36]
[binf316:21583] [17] ring_c[0x400889]
[binf316:21583] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21583 on node xxxx316 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, June 05, 2014 7:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Hmmm...I'm not sure how that is going to run with only one proc (I don't know 
if the program is protected against that scenario). If you run with -np 2 -mca 
btl openib,sm,self, is it happy?


On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:


Here's the command I'm invoking and the terminal output.  (Some of this 
information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code:  (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 4549 on node xxxx316 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

Open MPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC 4.8.3 and 
recompiled Open MPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical bash environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled Open MPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x00007f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x00007f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x00007f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x00007f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x00007f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x00007f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x00007f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x00007f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x00007f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x00007f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x00007f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x00007f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, requested=0, provided=0x7fff906aa7d8) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 0x00007f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) at pinit.c:84
#14 0x000000000040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19

Greg
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
