On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:
> Is it possible for you to get a stack trace where this is hanging?
>
> You might try:
>
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 -d xterm -e gdb PMB-MPI1
>
I did that, and when it was hanging I control-C'd in each gdb and asked for a bt.

Here's the debug output from the mpirun command:
==================================================================================================
mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2 -d xterm -e gdb PMB-MPI1
[bench1:16017] procdir: (null)
[bench1:16017] jobdir: (null)
[bench1:16017] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] connect_uni: contact info read
[bench1:16017] connect_uni: connection not allowed
[bench1:16017] [0,0,0] setting up session dir with
[bench1:16017]  tmpdir /tmp
[bench1:16017]  universe default-universe-16017
[bench1:16017]  user root
[bench1:16017]  host bench1
[bench1:16017]  jobid 0
[bench1:16017]  procid 0
[bench1:16017] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/0
[bench1:16017] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16017] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] [0,0,0] contact_file /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/universe-setup.txt
[bench1:16017] [0,0,0] wrote setup file
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x1)
[bench1:16017] pls:rsh: local csh: 0, local bash: 1
[bench1:16017] pls:rsh: assuming same remote shell as local shell
[bench1:16017] pls:rsh: remote csh: 0, remote bash: 1
[bench1:16017] pls:rsh: final template argv:
[bench1:16017] pls:rsh:     /usr/bin/ssh -X <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root@bench1:default-universe-16017 --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16017] pls:rsh: launching on node bench2
[bench1:16017] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench2 is a REMOTE node
[bench1:16017] pls:rsh: executing: /usr/bin/ssh -X bench2 PATH=/opt/ompi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ompi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/ompi/bin/orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename bench2 --universe root@bench1:default-universe-16017 --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench2:16980] [0,0,1] setting up session dir with
[bench2:16980]  universe default-universe-16017
[bench2:16980]  user root
[bench2:16980]  host bench2
[bench2:16980]  jobid 0
[bench2:16980]  procid 1
[bench2:16980] procdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0/1
[bench2:16980] jobdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0
[bench2:16980] unidir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017
[bench2:16980] top: openmpi-sessions-root@bench2_0
[bench2:16980] tmp: /tmp
[bench1:16017] pls:rsh: launching on node bench1
[bench1:16017] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench1 is a LOCAL node
[bench1:16017] pls:rsh: reset PATH: /opt/ompi/bin:/sbin:/usr/sbin:/usr/local/sbin:/opt/gnome/sbin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/ompi/bin
[bench1:16017] pls:rsh: reset LD_LIBRARY_PATH: /opt/ompi/lib
[bench1:16017] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename bench1 --universe root@bench1:default-universe-16017 --nsreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica "0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16021] [0,0,2] setting up session dir with
[bench1:16021]  universe default-universe-16017
[bench1:16021]  user root
[bench1:16021]  host bench1
[bench1:16021]  jobid 0
[bench1:16021]  procid 2
[bench1:16021] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/2
[bench1:16021] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16021] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16021] top: openmpi-sessions-root@bench1_0
[bench1:16021] tmp: /tmp
Warning: translation table syntax error: Unknown keysym name: DRemove
Warning: ... found while parsing '<Key>DRemove: ignore()'
Warning: String to TranslationTable conversion encountered errors
Warning: translation table syntax error: Unknown keysym name: DRemove
Warning: ... found while parsing '<Key>DRemove: ignore()'
Warning: String to TranslationTable conversion encountered errors
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x3)
[bench1:16017] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, bench1, /usr/bin/xterm, 16025)
    (i, host, exe, pid) = (1, bench2, /usr/bin/xterm, 16984)
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x4)

Here's the output in one xterm:
=========================================================================
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /root/SRC_PMB/PMB-MPI1
[Thread debugging using libthread_db enabled]
[New Thread 46912509890080 (LWP 16984)]
[bench2:16984] [0,1,0] setting up session dir with
[bench2:16984]  universe default-universe-16017
[bench2:16984]  user root
[bench2:16984]  host bench2
[bench2:16984]  jobid 1
[bench2:16984]  procid 0
[bench2:16984] procdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/1/0
[bench2:16984] jobdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/1
[bench2:16984] unidir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017
[bench2:16984] top: openmpi-sessions-root@bench2_0
[bench2:16984] tmp: /tmp
[bench2:16984] [0,1,0] ompi_mpi_init completed
#---------------------------------------------------
#    PALLAS MPI Benchmark Suite V2.2, MPI-1 part
#---------------------------------------------------
# Date       : Thu Feb  2 09:51:32 2006
# Machine    : x86_64
# System     : Linux
# Release    : 2.6.13-15-smp
# Version    : #7 SMP Mon Jan 30 12:05:45 PST 2006
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier

Program received signal SIGINT, Interrupt.
[Switching to Thread 46912509890080 (LWP 16984)]
mthca_poll_cq (ibcq=0x718820, ne=1, wc=0x7fffffd071a0) at cq.c:469
469     cq.c: No such file or directory.
        in cq.c
(gdb) inf st
#0  mthca_poll_cq (ibcq=0x718820, ne=1, wc=0x7fffffd071a0) at cq.c:469
#1  0x00002aaaadef1a85 in mca_btl_openib_component_progress () from /opt/ompi/lib/openmpi/mca_btl_openib.so
#2  0x00002aaaadde8f62 in mca_bml_r2_progress () from /opt/ompi/lib/openmpi/mca_bml_r2.so
#3  0x00002aaaaaec79c0 in opal_progress () from /opt/ompi/lib/libopal.so.0
#4  0x00002aaaadac7255 in mca_pml_ob1_recv () from /opt/ompi/lib/openmpi/mca_pml_ob1.so
#5  0x00002aaaaea434c2 in ompi_coll_tuned_reduce_intra_basic_linear () from /opt/ompi/lib/openmpi/mca_coll_tuned.so
#6  0x00002aaaaea405e6 in ompi_coll_tuned_allreduce_intra_nonoverlapping () from /opt/ompi/lib/openmpi/mca_coll_tuned.so
#7  0x00002aaaaac06b17 in ompi_comm_nextcid () from /opt/ompi/lib/libmpi.so.0
#8  0x00002aaaaac0513b in ompi_comm_split () from /opt/ompi/lib/libmpi.so.0
#9  0x00002aaaaac2bcd8 in PMPI_Comm_split () from /opt/ompi/lib/libmpi.so.0
#10 0x0000000000403b81 in Set_Communicator ()
#11 0x000000000040385e in Init_Communicator ()
#12 0x0000000000402e06 in main ()
(gdb)

Here's the output in the other xterm:
================================================================================================
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) run
Starting program: /root/SRC_PMB/PMB-MPI1
[Thread debugging using libthread_db enabled]
[New Thread 46912509889280 (LWP 16025)]
[bench1:16025] [0,1,1] setting up session dir with
[bench1:16025]  universe default-universe-16017
[bench1:16025]  user root
[bench1:16025]  host bench1
[bench1:16025]  jobid 1
[bench1:16025]  procid 1
[bench1:16025] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/1/1
[bench1:16025] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/1
[bench1:16025] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16025] top: openmpi-sessions-root@bench1_0
[bench1:16025] tmp: /tmp
[bench1:16025] [0,1,1] ompi_mpi_init completed

Program received signal SIGINT, Interrupt.
[Switching to Thread 46912509889280 (LWP 16025)]
0x00002aaaab493fc5 in pthread_spin_lock () from /lib64/tls/libpthread.so.0
(gdb) inf st
#0  0x00002aaaab493fc5 in pthread_spin_lock () from /lib64/tls/libpthread.so.0
#1  0x00002aaaaeb50d3e in mthca_poll_cq (ibcq=0x8b4990, ne=1, wc=0x7fffffbe0290) at cq.c:454
#2  0x00002aaaaddf0ce0 in mca_btl_openib_component_progress () from /opt/ompi/lib/openmpi/mca_btl_openib.so
#3  0x00002aaaadce7f62 in mca_bml_r2_progress () from /opt/ompi/lib/openmpi/mca_bml_r2.so
#4  0x00002aaaaaec79c0 in opal_progress () from /opt/ompi/lib/libopal.so.0
#5  0x00002aaaad9c6255 in mca_pml_ob1_recv () from /opt/ompi/lib/openmpi/mca_pml_ob1.so
#6  0x00002aaaae940ba9 in ompi_coll_tuned_bcast_intra_basic_linear () from /opt/ompi/lib/openmpi/mca_coll_tuned.so
#7  0x00002aaaae52830f in mca_coll_basic_allgather_intra () from /opt/ompi/lib/openmpi/mca_coll_basic.so
#8  0x00002aaaaac04ee3 in ompi_comm_split () from /opt/ompi/lib/libmpi.so.0
#9  0x00002aaaaac2bcd8 in PMPI_Comm_split () from /opt/ompi/lib/libmpi.so.0
#10 0x0000000000403b81 in Set_Communicator ()
#11 0x000000000040385e in Init_Communicator ()
#12 0x0000000000402e06 in main ()
(gdb)
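
For what it's worth, both ranks are stuck inside PMPI_Comm_split (the collectives Open MPI runs to agree on the new communicator) while spinning in mthca_poll_cq via the openib BTL, so the hang happens before PMB sends any benchmark traffic. If it would help narrow this down, I can run a minimal test that goes through the same MPI_Comm_split path without the rest of the benchmark. This is my own sketch, not the PMB source; the file name and the split arguments (everyone in color 0, keyed by rank) are just assumptions I made for the test:

/* comm_split_test.c -- minimal sketch (hypothetical test program, not PMB code)
 * exercising the MPI_Comm_split path the backtraces show hanging.
 * Build: mpicc comm_split_test.c -o comm_split_test
 * Run:   mpirun -prefix /opt/ompi -machinefile /root/machines -np 2 ./comm_split_test
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* All ranks use color 0, so they all land in one new communicator.
     * In the traces above, the hang occurs inside this call, in the
     * communicator-creation collectives progressed over the openib BTL. */
    MPI_Comm_split(MPI_COMM_WORLD, 0, rank, &newcomm);

    printf("rank %d of %d: MPI_Comm_split completed\n", rank, size);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}

If that also hangs over the openib BTL with two nodes, it would suggest the problem is in communicator creation itself rather than anything PMB-specific.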