Unfortunately, we don't have much visibility into the PSM device (meaning: Open MPI is a thin shim on top of the underlying libpsm, which handles all the MPI point-to-point semantics itself). So we can't even ask you to run padb to look at the message queues, because we don't have access to them. :-\
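In the meantime, one thing you could try from inside the application is to temporarily swap the suspect MPI_Waitall() for a polling loop that reports which requests are still outstanding. Below is a minimal sketch (untested; the name debug_waitall() and the 60-second reporting interval are just placeholders, and statuses are discarded for brevity). It busy-polls MPI_Test(), which still drives the progress engine, and periodically prints the indices of any requests that have not completed, so they can be matched up against the sends/receives the application posted.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical drop-in for MPI_Waitall(): poll with MPI_Test() and, if
     * the requests have not all completed after roughly 60 seconds, print
     * which indices are still outstanding.  Assumes ordinary (non-persistent)
     * nonblocking requests, which MPI_Test() sets to MPI_REQUEST_NULL once
     * they complete.  Statuses are discarded for brevity. */
    static void debug_waitall(int count, MPI_Request reqs[])
    {
        int done = 0;
        double last_report = MPI_Wtime();

        while (done < count) {
            int i, flag;

            done = 0;
            for (i = 0; i < count; i++) {
                if (reqs[i] == MPI_REQUEST_NULL) {  /* already completed */
                    done++;
                    continue;
                }
                MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);
                if (flag)
                    done++;
            }

            if (done < count && MPI_Wtime() - last_report > 60.0) {
                int rank;
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                for (i = 0; i < count; i++)
                    if (reqs[i] != MPI_REQUEST_NULL)
                        fprintf(stderr, "[rank %d] request %d of %d still pending\n",
                                rank, i, count);
                last_report = MPI_Wtime();  /* report again in another 60 s */
            }
        }
    }

If the same request indices stay pending on a few ranks while their peers have long since moved on, that points more toward a mismatch in the application's communication pattern than toward the MTL.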
Can you try running with TCP and see if that also deadlocks? If it does, you can at least run padb to have a look at the message queues.

On Mar 21, 2012, at 11:15 AM, Brock Palen wrote:

> Forgotten stack, as promised; it keeps changing at the lower level (opal_progress) but never moves above that.
>
> [yccho@nyx0817 ~]$ padb -Ormgr=orte --all --stack-trace --tree --all
> Stack trace(s) for thread: 1
> -----------------
> [0-63] (64 processes)
> -----------------
> main() at ?:?
> Loci::makeQuery(Loci::rule_db const&, Loci::fact_db&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)() at ?:?
> Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_loop::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_loop::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> Loci::execute_rule::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
> streamUns::HypreSolveUnit::compute(Loci::sequence const&)() at ?:?
> hypre_BoomerAMGSetup() at ?:?
> hypre_BoomerAMGBuildInterp() at ?:?
> -----------------
> [0,2-3,5-16,18-19,21-24,27-34,36-63] (57 processes)
> -----------------
> hypre_ParCSRMatrixExtractBExt() at ?:?
> hypre_ParCSRMatrixExtractBExt_Arrays() at ?:?
> hypre_ParCSRCommHandleDestroy() at ?:?
> PMPI_Waitall() at ?:?
> -----------------
> [0,2-3,5,7-16,18-19,21-24,27-34,36-63] (56 processes)
> -----------------
> ompi_request_default_wait_all() at ?:?
> opal_progress() at ?:?
> -----------------
> [6] (1 processes)
> -----------------
> ompi_mtl_psm_progress() at ?:?
> -----------------
> [1,4,17,20,25-26,35] (7 processes)
> -----------------
> hypre_ParCSRCommHandleDestroy() at ?:?
> PMPI_Waitall() at ?:?
> ompi_request_default_wait_all() at ?:?
> opal_progress() at ?:?
> Stack trace(s) for thread: 2
> -----------------
> [0-63] (64 processes)
> -----------------
> start_thread() at ?:?
> ips_ptl_pollintr() at ptl_rcvthread.c:324
> poll() at ?:?
>
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>
> On Mar 21, 2012, at 11:14 AM, Brock Palen wrote:
>
>> I have a user's code that appears to be hanging sometimes on MPI_Waitall(); stack trace from padb is below. It is on QLogic IB using the PSM MTL. Without knowing which requests go to which rank, how can I check that this code didn't just get itself into a deadlock? Is there a way to get a readable list of every rank's posted sends, and then query a waiting MPI_Waitall() of a running job to see which sends/recvs it is waiting on?
>>
>> Thanks!
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/