On Thu, 2006-02-02 at 15:19 -0700, Galen M. Shipman wrote:

> Is it possible for you to get a stack trace where this is hanging?
> 
> You might try:
> 
> 
> mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np  
> 2 -d xterm -e gdb PMB-MPI1
> 
> 

I did that, and when it hung I hit Ctrl-C in each gdb and asked for a
backtrace (bt).
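
For what it's worth, the same backtraces can also be captured
non-interactively by attaching to the hung PIDs from the MPIR proctable
below (assuming gdb is permitted to attach; the command-file name is
just an example):

  echo bt > cmds.gdb
  gdb -batch -x cmds.gdb -p 16984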

Here's the debug output from the mpirun command:
==================================================================================================

mpirun -prefix /opt/ompi -wdir `pwd` -machinefile /root/machines -np 2
-d xterm -e gdb PMB-MPI1
[bench1:16017] procdir: (null)
[bench1:16017] jobdir: (null)
[bench1:16017] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] connect_uni: contact info read
[bench1:16017] connect_uni: connection not allowed
[bench1:16017] [0,0,0] setting up session dir with
[bench1:16017]  tmpdir /tmp
[bench1:16017]  universe default-universe-16017
[bench1:16017]  user root
[bench1:16017]  host bench1
[bench1:16017]  jobid 0
[bench1:16017]  procid 0
[bench1:16017] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/0
[bench1:16017] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16017] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16017] top: openmpi-sessions-root@bench1_0
[bench1:16017] tmp: /tmp
[bench1:16017] [0,0,0] contact_file /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/universe-setup.txt
[bench1:16017] [0,0,0] wrote setup file
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x1)
[bench1:16017] pls:rsh: local csh: 0, local bash: 1
[bench1:16017] pls:rsh: assuming same remote shell as local shell
[bench1:16017] pls:rsh: remote csh: 0, remote bash: 1
[bench1:16017] pls:rsh: final template argv:
[bench1:16017] pls:rsh:     /usr/bin/ssh -X <template> orted --debug
--bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename
<template> --universe root@bench1:default-universe-16017 --nsreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16017] pls:rsh: launching on node bench2
[bench1:16017] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench2 is a REMOTE node
[bench1:16017] pls:rsh: executing: /usr/bin/ssh -X bench2
PATH=/opt/ompi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/ompi/lib:
$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/ompi/bin/orted --debug
--bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename
bench2 --universe root@bench1:default-universe-16017 --nsreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench2:16980] [0,0,1] setting up session dir with
[bench2:16980]  universe default-universe-16017
[bench2:16980]  user root
[bench2:16980]  host bench2
[bench2:16980]  jobid 0
[bench2:16980]  procid 1
[bench2:16980] procdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0/1
[bench2:16980] jobdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/0
[bench2:16980] unidir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017
[bench2:16980] top: openmpi-sessions-root@bench2_0
[bench2:16980] tmp: /tmp
[bench1:16017] pls:rsh: launching on node bench1
[bench1:16017] pls:rsh: not oversubscribed -- setting
mpi_yield_when_idle to 0
[bench1:16017] pls:rsh: bench1 is a LOCAL node
[bench1:16017] pls:rsh: reset PATH: /opt/ompi/bin:/sbin:/usr/sbin:/usr/local/sbin:/opt/gnome/sbin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/ompi/bin
[bench1:16017] pls:rsh: reset LD_LIBRARY_PATH: /opt/ompi/lib
[bench1:16017] pls:rsh: executing: orted --debug --bootproxy 1 --name
0.0.2 --num_procs 3 --vpid_start 0 --nodename bench1 --universe
root@bench1:default-universe-16017 --nsreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --gprreplica
"0.0.0;tcp://10.1.40.61:32793;tcp://10.2.40.61:32793" --mpi-call-yield 0
[bench1:16021] [0,0,2] setting up session dir with
[bench1:16021]  universe default-universe-16017
[bench1:16021]  user root
[bench1:16021]  host bench1
[bench1:16021]  jobid 0
[bench1:16021]  procid 2
[bench1:16021] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0/2
[bench1:16021] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/0
[bench1:16021] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16021] top: openmpi-sessions-root@bench1_0
[bench1:16021] tmp: /tmp
Warning: translation table syntax error: Unknown keysym name:  DRemove
Warning: ... found while parsing '<Key>DRemove: ignore()'
Warning: String to TranslationTable conversion encountered errors
Warning: translation table syntax error: Unknown keysym name:  DRemove
Warning: ... found while parsing '<Key>DRemove: ignore()'
Warning: String to TranslationTable conversion encountered errors
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x3)
[bench1:16017] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, bench1, /usr/bin/xterm, 16025)
    (i, host, exe, pid) = (1, bench2, /usr/bin/xterm, 16984)
[bench1:16017] spawn: in job_state_callback(jobid = 1, state = 0x4)


Here's the output in one xterm:
=========================================================================
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db
library "/lib64/tls/libthread_db.so.1".

(gdb) run
Starting program: /root/SRC_PMB/PMB-MPI1 
[Thread debugging using libthread_db enabled]
[New Thread 46912509890080 (LWP 16984)]
[bench2:16984] [0,1,0] setting up session dir with
[bench2:16984]  universe default-universe-16017
[bench2:16984]  user root
[bench2:16984]  host bench2
[bench2:16984]  jobid 1
[bench2:16984]  procid 0
[bench2:16984] procdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/1/0
[bench2:16984] jobdir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017/1
[bench2:16984] unidir: /tmp/openmpi-sessions-root@bench2_0/default-universe-16017
[bench2:16984] top: openmpi-sessions-root@bench2_0
[bench2:16984] tmp: /tmp
[bench2:16984] [0,1,0] ompi_mpi_init completed
#---------------------------------------------------
#    PALLAS MPI Benchmark Suite V2.2, MPI-1 part    
#---------------------------------------------------
# Date       : Thu Feb  2 09:51:32 2006
# Machine    : x86_64
# System     : Linux
# Release    : 2.6.13-15-smp
# Version    : #7 SMP Mon Jan 30 12:05:45 PST 2006

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier

Program received signal SIGINT, Interrupt.
[Switching to Thread 46912509890080 (LWP 16984)]
mthca_poll_cq (ibcq=0x718820, ne=1, wc=0x7fffffd071a0) at cq.c:469
469     cq.c: No such file or directory.
        in cq.c
(gdb) inf st
#0  mthca_poll_cq (ibcq=0x718820, ne=1, wc=0x7fffffd071a0) at cq.c:469
#1  0x00002aaaadef1a85 in mca_btl_openib_component_progress ()
   from /opt/ompi/lib/openmpi/mca_btl_openib.so
#2  0x00002aaaadde8f62 in mca_bml_r2_progress ()
   from /opt/ompi/lib/openmpi/mca_bml_r2.so
#3  0x00002aaaaaec79c0 in opal_progress () from /opt/ompi/lib/libopal.so.0
#4  0x00002aaaadac7255 in mca_pml_ob1_recv ()
   from /opt/ompi/lib/openmpi/mca_pml_ob1.so
#5  0x00002aaaaea434c2 in ompi_coll_tuned_reduce_intra_basic_linear ()
   from /opt/ompi/lib/openmpi/mca_coll_tuned.so
#6  0x00002aaaaea405e6 in ompi_coll_tuned_allreduce_intra_nonoverlapping ()
   from /opt/ompi/lib/openmpi/mca_coll_tuned.so
#7  0x00002aaaaac06b17 in ompi_comm_nextcid () from /opt/ompi/lib/libmpi.so.0
#8  0x00002aaaaac0513b in ompi_comm_split () from /opt/ompi/lib/libmpi.so.0
#9  0x00002aaaaac2bcd8 in PMPI_Comm_split () from /opt/ompi/lib/libmpi.so.0
#10 0x0000000000403b81 in Set_Communicator ()
#11 0x000000000040385e in Init_Communicator ()
#12 0x0000000000402e06 in main ()
(gdb) 

Here's the output in the other xterm:
================================================================================================
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...Using host libthread_db
library "/lib64/tls/libthread_db.so.1".

(gdb) run
Starting program: /root/SRC_PMB/PMB-MPI1 
[Thread debugging using libthread_db enabled]
[New Thread 46912509889280 (LWP 16025)]
[bench1:16025] [0,1,1] setting up session dir with
[bench1:16025]  universe default-universe-16017
[bench1:16025]  user root
[bench1:16025]  host bench1
[bench1:16025]  jobid 1
[bench1:16025]  procid 1
[bench1:16025] procdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/1/1
[bench1:16025] jobdir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017/1
[bench1:16025] unidir: /tmp/openmpi-sessions-root@bench1_0/default-universe-16017
[bench1:16025] top: openmpi-sessions-root@bench1_0
[bench1:16025] tmp: /tmp
[bench1:16025] [0,1,1] ompi_mpi_init completed

Program received signal SIGINT, Interrupt.
[Switching to Thread 46912509889280 (LWP 16025)]
0x00002aaaab493fc5 in pthread_spin_lock () from /lib64/tls/libpthread.so.0
(gdb) inf st
#0  0x00002aaaab493fc5 in pthread_spin_lock () from /lib64/tls/libpthread.so.0
#1  0x00002aaaaeb50d3e in mthca_poll_cq (ibcq=0x8b4990, ne=1,
    wc=0x7fffffbe0290) at cq.c:454
#2  0x00002aaaaddf0ce0 in mca_btl_openib_component_progress ()
   from /opt/ompi/lib/openmpi/mca_btl_openib.so
#3  0x00002aaaadce7f62 in mca_bml_r2_progress ()
   from /opt/ompi/lib/openmpi/mca_bml_r2.so
#4  0x00002aaaaaec79c0 in opal_progress () from /opt/ompi/lib/libopal.so.0
#5  0x00002aaaad9c6255 in mca_pml_ob1_recv ()
   from /opt/ompi/lib/openmpi/mca_pml_ob1.so
#6  0x00002aaaae940ba9 in ompi_coll_tuned_bcast_intra_basic_linear ()
   from /opt/ompi/lib/openmpi/mca_coll_tuned.so
#7  0x00002aaaae52830f in mca_coll_basic_allgather_intra ()
   from /opt/ompi/lib/openmpi/mca_coll_basic.so
#8  0x00002aaaaac04ee3 in ompi_comm_split () from /opt/ompi/lib/libmpi.so.0
#9  0x00002aaaaac2bcd8 in PMPI_Comm_split () from /opt/ompi/lib/libmpi.so.0
#10 0x0000000000403b81 in Set_Communicator ()
#11 0x000000000040385e in Init_Communicator ()
#12 0x0000000000402e06 in main ()
(gdb) 
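
So both ranks are stuck under PMPI_Comm_split, spinning in the openib
BTL's progress loop and polling the InfiniBand completion queue for a
completion that apparently never arrives. For anyone not familiar with
that path, here is a minimal sketch (mine, not Open MPI's actual code)
of what the innermost polling amounts to through the verbs API;
drain_cq is a hypothetical helper name:

  /* Illustrative sketch only -- not Open MPI source.  opal_progress()
   * repeatedly drains the IB completion queue via the verbs API;
   * ibv_poll_cq() is what libmthca implements as mthca_poll_cq(). */
  #include <infiniband/verbs.h>

  static int drain_cq(struct ibv_cq *cq)    /* hypothetical helper */
  {
      struct ibv_wc wc;
      int n = ibv_poll_cq(cq, 1, &wc);      /* -> mthca_poll_cq() */
      if (n > 0 && wc.status != IBV_WC_SUCCESS)
          return -1;                        /* completion with error */
      return n;                             /* 0 = nothing done yet */
  }

If neither rank's expected completion ever shows up, both just spin
here forever, which matches the hang.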



