> I need the backtrace from the process which generates the
> segfault. Second, in order to understand the backtrace, it is
> better to have run a debug version of Open MPI. Without the
> debug version we only see the address where the fault occurs,
> without access to the line number ...
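In practice, following that advice looks roughly like the sketch below (the install prefix is only an example; --enable-debug is the Open MPI configure switch that keeps debugging symbols, so backtraces get line numbers):

  # rebuild Open MPI with debugging info
  ./configure --prefix=/usr/local/openmpi-1.2b3-debug --enable-debug
  make all install

  # let the failing process dump core, then read the backtrace from the core
  ulimit -c unlimited
  mpirun -np 6 ./cpi
  gdb ./cpi core            # then: (gdb) backtrace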
How about this: below is the section I was stepping through when I hit the first error I usually run into:

  "mx_connect fail for node-1:0 with key aaaaffff (error Endpoint closed or not connectable!)"

// gdb output
Breakpoint 1, 0x00002ac856bd92e0 in opal_progress () from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, which has no line number information.
0x00002ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) s
Single stepping until exit from function sched_yield, which has no line number information.
opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:80
80          while (c->c_signaled == 0) {
(gdb) s
81              opal_progress();
(gdb) s
Breakpoint 1, 0x00002ac856bd92e0 in opal_progress () from /usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0
(gdb) s
Single stepping until exit from function opal_progress, which has no line number information.
0x00002ac857361540 in sched_yield () from /lib/libc.so.6
(gdb) backtrace
#0  0x00002ac857361540 in sched_yield () from /lib/libc.so.6
#1  0x0000000000402f60 in opal_condition_wait (c=0x5098e0, m=0x5098a0) at condition.h:81
#2  0x0000000000402b3c in orterun (argc=17, argv=0x7fff54151088) at orterun.c:427
#3  0x0000000000402713 in main (argc=17, argv=0x7fff54151088) at main.c:13

---

This is the mpirun output from the run I was stepping through. At the end of it is the error that goes with the backtrace above.

[node-2:11909] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11909] tmp: /tmp
[node-1:10719] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/0
[node-1:10719] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10719] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10719] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10719] tmp: /tmp
[juggernaut:17414] spawn: in job_state_callback(jobid = 1, state = 0x4)
[juggernaut:17414] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 6
  MPIR_proctable:
    (i, host, exe, pid) = (0, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10719)
    (i, host, exe, pid) = (1, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10720)
    (i, host, exe, pid) = (2, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10721)
    (i, host, exe, pid) = (3, node-1, /home/ggrobe/Projects/ompi/cpi/./cpi, 10722)
    (i, host, exe, pid) = (4, node-2, /home/ggrobe/Projects/ompi/cpi/./cpi, 11908)
    (i, host, exe, pid) = (5, node-2, /home/ggrobe/Projects/ompi/cpi/./cpi, 11909)
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10718] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10721] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1/2
[node-1:10721] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414/1
[node-1:10721] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-17414
[node-1:10721] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10721] tmp: /tmp
[node-1:10720] mx_connect fail for node-1:0 with key aaaaffff (error Endpoint closed or not connectable!)
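Since the point of the quoted advice is to get the backtrace from the process that fails rather than from mpirun, one option is to attach gdb directly to the rank that later prints the mx_connect error, using the host/pid pairs from the MPIR_proctable above. A sketch, assuming rank 1 on node-1 (pid 10720) is the one that fails, as in the log here:

  # on node-1: attach to rank 1 before it reaches the failure
  ssh node-1
  gdb /home/ggrobe/Projects/ompi/cpi/./cpi 10720
  (gdb) backtrace             # where is it right now?
  (gdb) continue              # run on to the failure, then backtrace again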