We unfortunately don't have much visibility into the PSM device (meaning: Open 
MPI is a thin shim on top of the underlying libpsm, which handles all the MPI 
point-to-point semantics itself).  So we can't even ask you to run padb to look 
at the message queues, because we don't have access to them.  :-\

Can you try running with TCP and see if that also deadlocks?  If it does, you 
can at least run padb to have a look at the message queues.
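A sketch of what that could look like (a hypothetical example: the MCA parameter names assume a 1.4/1.5-era Open MPI, and `--mpi-queue` assumes your padb build has message-queue support; `./your_app` is a placeholder):

```shell
# Force the ob1 PML with the TCP BTL so the psm MTL is out of the picture
mpirun --mca pml ob1 --mca btl tcp,sm,self -np 64 ./your_app

# If it hangs again, dump the MPI message queues from another shell --
# same flags you used for the stack traces, but with --mpi-queue
padb -Ormgr=orte --all --mpi-queue --tree
```

With ob1/TCP the match/unexpected/posted-receive queues live in Open MPI itself, which is what lets padb introspect them.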


On Mar 21, 2012, at 11:15 AM, Brock Palen wrote:

> The forgotten stack trace, as promised.  It keeps changing at the lower 
> level (opal_progress), but never moves above that.
> 
> [yccho@nyx0817 ~]$ padb -Ormgr=orte --all --stack-trace --tree --all 
> Stack trace(s) for thread: 1
> -----------------
> [0-63] (64 processes)
> -----------------
> main() at ?:?
>  Loci::makeQuery(Loci::rule_db const&, Loci::fact_db&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)() at ?:?
>    Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>      Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>        Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>          Loci::execute_loop::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>            Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>              Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>                Loci::execute_loop::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>                  Loci::execute_list::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>                    Loci::execute_rule::execute(Loci::fact_db&, Loci::sched_db&)() at ?:?
>                      streamUns::HypreSolveUnit::compute(Loci::sequence const&)() at ?:?
>                        hypre_BoomerAMGSetup() at ?:?
>                          hypre_BoomerAMGBuildInterp() at ?:?
>                            -----------------
>                            [0,2-3,5-16,18-19,21-24,27-34,36-63] (57 processes)
>                            -----------------
>                            hypre_ParCSRMatrixExtractBExt() at ?:?
>                              hypre_ParCSRMatrixExtractBExt_Arrays() at ?:?
>                                hypre_ParCSRCommHandleDestroy() at ?:?
>                                  PMPI_Waitall() at ?:?
>                                    -----------------
>                                    [0,2-3,5,7-16,18-19,21-24,27-34,36-63] (56 processes)
>                                    -----------------
>                                    ompi_request_default_wait_all() at ?:?
>                                      opal_progress() at ?:?
>                                    -----------------
>                                    [6] (1 processes)
>                                    -----------------
>                                    ompi_mtl_psm_progress() at ?:?
>                            -----------------
>                            [1,4,17,20,25-26,35] (7 processes)
>                            -----------------
>                            hypre_ParCSRCommHandleDestroy() at ?:?
>                              PMPI_Waitall() at ?:?
>                                ompi_request_default_wait_all() at ?:?
>                                  opal_progress() at ?:?
> Stack trace(s) for thread: 2
> -----------------
> [0-63] (64 processes)
> -----------------
> start_thread() at ?:?
>  ips_ptl_pollintr() at ptl_rcvthread.c:324
>    poll() at ?:?
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On Mar 21, 2012, at 11:14 AM, Brock Palen wrote:
> 
>> I have a user's code that appears to be hanging sometimes on MPI_Waitall();  
>> stack trace from padb below.  It is on QLogic IB using the PSM MTL.
>> Without knowing which requests go to which rank, how can I check that this 
>> code didn't just get itself into a deadlock?  Is there a way to get a 
>> readable list of every rank's posted sends, and then query a waiting 
>> MPI_Waitall() of a running job to see which sends/recvs it is waiting on?
>> 
>> Thanks!
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/