> >> PS: Is there any way you can attach to the processes with gdb? I
> >> would like to see the backtrace as shown by gdb in order to be able
> >> to figure out what's wrong there.
> >
> > When I can get more detailed dbg, I'll send. Though I'm not clear on
> > what executable is being searched for below.
> >
> > $ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca pml
> > cm --mca mtl mx ./cpi
>
> FWIW, note that "-dbg" is not a recognized Open MPI mpirun command
> line switch -- after all the debugging information, Open MPI finally
> gets to telling you:
Sorry, wrong mpi, ok ... Fwiw, here's a crash reproduced with just the -d option. The problem I'm trying to get to right now is how to dbg the 2nd process on the 2nd node, since that's where the crash always happens. One process past the 1st node works fine (5 procs w/ 4 per node), but as soon as a second process starts on the 2nd node, or anything more than that, the crash occurs. (A sketch of one way to attach gdb to that process is included after the log output below.)

$ mpirun -d --prefix /usr/local/openmpi-1.2b3r13030 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 6 --mca pml cm --mca mtl mx ./cpi > dbg.out 2>&1

[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] [0,0,0] setting up session dir with
[juggernaut:15087] universe default-universe-15087
[juggernaut:15087] user ggrobe
[juggernaut:15087] host juggernaut
[juggernaut:15087] jobid 0
[juggernaut:15087] procid 0
[juggernaut:15087] procdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0/0
[juggernaut:15087] jobdir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/0
[juggernaut:15087] unidir: /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087
[juggernaut:15087] top: openmpi-sessions-ggrobe@juggernaut_0
[juggernaut:15087] tmp: /tmp
[juggernaut:15087] [0,0,0] contact_file /tmp/openmpi-sessions-ggrobe@juggernaut_0/default-universe-15087/universe-setup.txt
[juggernaut:15087] [0,0,0] wrote setup file
[juggernaut:15087] pls:rsh: local csh: 0, local sh: 1
[juggernaut:15087] pls:rsh: assuming same remote shell as local shell
[juggernaut:15087] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:15087] pls:rsh: final template argv:
[juggernaut:15087] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe ggrobe@juggernaut:default-universe-15087 --nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica "0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-1
[juggernaut:15087] pls:rsh: node-1 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-1 PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename node-1 --universe ggrobe@juggernaut:default-universe-15087 --nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica "0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-2
[juggernaut:15087] pls:rsh: node-2 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-2 PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename node-2 --universe ggrobe@juggernaut:default-universe-15087 --nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica "0.0.0;tcp://192.168.2.10:52099"
[node-2:11499] [0,0,2] setting up session dir with
[node-2:11499] universe default-universe-15087
[node-2:11499] user ggrobe
[node-2:11499] host node-2
[node-2:11499] jobid 0
[node-2:11499] procid 2
[node-1:10307] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0/1
[node-1:10307] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/0
[node-1:10307] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10307] top: openmpi-sessions-ggrobe@node-1_0
[node-2:11499] procdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0/2
[node-2:11499] jobdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/0
[node-2:11499] unidir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087
[node-2:11499] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11499] tmp: /tmp
[node-1:10307] tmp: /tmp
[node-2:11500] [0,1,4] setting up session dir with
[node-2:11500] universe default-universe-15087
[node-2:11500] user ggrobe
[node-2:11500] host node-2
[node-2:11500] jobid 1
[node-2:11500] procid 4
[node-2:11501] [0,1,5] setting up session dir with
[node-2:11501] universe default-universe-15087
[node-2:11501] user ggrobe
[node-2:11501] host node-2
[node-2:11501] jobid 1
[node-2:11501] procid 5
[node-1:10308] [0,1,0] setting up session dir with
[node-1:10308] universe default-universe-15087
[node-1:10308] user ggrobe
[node-1:10308] host node-1
[node-1:10308] jobid 1
[node-1:10308] procid 0
[node-2:11500] procdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/1/4
[node-2:11500] jobdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/1
[node-2:11500] unidir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087
[node-2:11500] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11500] tmp: /tmp
[node-2:11501] procdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/1/5
[node-2:11501] jobdir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087/1
[node-2:11501] unidir: /tmp/openmpi-sessions-ggrobe@node-2_0/default-universe-15087
[node-2:11501] top: openmpi-sessions-ggrobe@node-2_0
[node-2:11501] tmp: /tmp
[node-1:10308] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1/0
[node-1:10308] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1
[node-1:10308] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10308] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10308] tmp: /tmp
[node-1:10311] [0,1,3] setting up session dir with
[node-1:10311] universe default-universe-15087
[node-1:10311] user ggrobe
[node-1:10311] host node-1
[node-1:10311] jobid 1
[node-1:10311] procid 3
[node-1:10310] [0,1,2] setting up session dir with
[node-1:10310] universe default-universe-15087
[node-1:10310] user ggrobe
[node-1:10310] host node-1
[node-1:10310] jobid 1
[node-1:10310] procid 2
[node-1:10311] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1/3
[node-1:10311] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1
[node-1:10311] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10311] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10311] tmp: /tmp
[node-1:10310] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1/2
[node-1:10310] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1
[node-1:10310] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10310] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10310] tmp: /tmp
[node-1:10309] [0,1,1] setting up session dir with
[node-1:10309] universe default-universe-15087
[node-1:10309] user ggrobe
[node-1:10309] host node-1
[node-1:10309] jobid 1
[node-1:10309] procid 1
[node-1:10309] procdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1/1
[node-1:10309] jobdir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087/1
[node-1:10309] unidir: /tmp/openmpi-sessions-ggrobe@node-1_0/default-universe-15087
[node-1:10309] top: openmpi-sessions-ggrobe@node-1_0
[node-1:10309] tmp: /tmp
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8b99905d3f]
[1] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0 [0x2b8b99904891]
[2] func:/lib/libpthread.so.0 [0x2b8b99ec6d00]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8b9cb072af]
[4] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_module_init+0x20) [0x2b8b9c9fcb50]
[5] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so [0x2b8b9c9fccb5]
[6] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mtl_base_select+0x6f) [0x2b8b9966165f]
[7] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_pml_cm.so [0x2b8b9c6d1aa6]
[8] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8b99663ef3]
[9] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mpi_init+0x45e) [0x2b8b9962c7de]
[10] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8b9964d903]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b8b99fed134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
[0] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b548c138d3f]
[1] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0 [0x2b548c137891]
[2] func:/lib/libpthread.so.0 [0x2b548c6f9d00]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b548f33a2af]
[4] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_module_init+0x20) [0x2b548f22fb50]
[5] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so [0x2b548f22fcb5]
[6] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mtl_base_select+0x6f) [0x2b548be9465f]
[7] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_pml_cm.so [0x2b548ef04aa6]
[8] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b548be96ef3]
[9] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mpi_init+0x45e) [0x2b548be5f7de]
[10] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(MPI_Init+0x83) [0x2b548be80903]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b548c820134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
[node-1:10307] sess_dir_finalize: proc session dir not empty - leaving
[juggernaut:15087] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 15.
[node-1:10307] sess_dir_finalize: job session dir not empty - leaving
[node-2:11499] sess_dir_finalize: job session dir not empty - leaving
5 additional processes aborted (not shown)
[juggernaut:15087] sess_dir_finalize: proc session dir not empty - leaving
[node-1:10307] sess_dir_finalize: proc session dir not empty - leaving
[node-2:11499] sess_dir_finalize: proc session dir not empty - leaving
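
Since the question above is how to attach gdb to the second process on node-2, and the backtrace shows the segfault happening inside MPI_Init itself (mx_open_endpoint called from ompi_mtl_mx_module_init), one common trick is to make the suspect processes spin *before* MPI_Init until a debugger attaches. The following is a minimal sketch only, not part of Open MPI or the cpi example; the helper wait_for_gdb() and the environment variable WAIT_FOR_GDB_HOST are hypothetical names.

/* Sketch: spin before MPI_Init on a chosen host so gdb can attach.
 * WAIT_FOR_GDB_HOST is a hypothetical variable propagated with mpirun -x. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void wait_for_gdb(void)
{
    const char *target = getenv("WAIT_FOR_GDB_HOST");   /* e.g. "node-2" */
    char host[256] = "";
    volatile int hold = 1;

    gethostname(host, sizeof(host));
    if (target == NULL || strcmp(target, host) != 0)
        return;

    fprintf(stderr, "pid %d on %s waiting for gdb; attach with "
            "`gdb -p %d`, select the wait_for_gdb frame, "
            "`set var hold = 0`, then `continue`\n",
            (int)getpid(), host, (int)getpid());
    while (hold)        /* cleared by hand from inside gdb */
        sleep(1);
}

int main(int argc, char **argv)
{
    wait_for_gdb();     /* must run before MPI_Init, where the crash occurs */
    MPI_Init(&argc, &argv);
    /* ... rest of the program (e.g. the cpi computation) ... */
    MPI_Finalize();
    return 0;
}

With something like this compiled into cpi, a run along the lines of "mpirun -x WAIT_FOR_GDB_HOST=node-2 ..." should leave the node-2 processes spinning and printing their PIDs; ssh to node-2, attach gdb -p to the second one, release it as described, and continue. gdb will then catch the SIGSEGV inside MPI_Init, and bt should show the same mx_open_endpoint frames as above, with full symbol information if the libraries were built with debug info.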