On Mon, 2009-05-18 at 17:05 -0400, Noam Bernstein wrote:
> The code is complicated, the input files are big and lead to long computation
> times, so I don't think I'll be able to make a simple test case.  Instead
> I attached to the hanging processes (all 8 of them) with gdb during the hang.
> The stack trace is below.  Nodes seem to spend most of their time in the
> btl_openib_component_progress(), and occasionally in mca_pml_ob1_progress().
> I.e. not completely stuck, but not making progress.

Can you confirm that *all* processes are in PMPI_Allreduce at some
point?  The collectives commonly get blamed for a lot of hangs, and it's
not always the correct place to look.
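A hang "in the Allreduce" is very often really one rank that never reached
it.  A contrived sketch of what I mean (the divergent branch is invented,
it has nothing to do with your code): rank 0 goes off and blocks in a
receive that is never matched, so every other rank sits in MPI_Allreduce,
which is exactly where their stack traces will point.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double)rank;

    /* Hypothetical divergence: rank 0 takes a different code path and
     * blocks in a receive that is never matched, while everyone else
     * blocks in the collective. */
    if (rank == 0) {
        MPI_Status st;
        MPI_Recv(&global, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD, &st);
    } else {
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}

If even one rank's trace shows something other than PMPI_Allreduce, that
rank is usually the one to look at.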

> P.S. I get a similar hang with MVAPICH, in a nearby but different part of the
> code (on an MPI_Bcast, specifically), increasing my tendency to believe
> that it's OFED's fault.  But maybe the stack trace will suggest to someone
> where it might be stuck, and therefore perhaps an mca flag to try?

This strikes me as a filesystem problem more than MPI per se.  Again,
with MVAPICH, are all your processes in MPI_Bcast or just some of them?
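One cheap way to check, if you can rebuild: have each rank announce itself
just before the suspect collective, then attach gdb only to the ranks that
never print the second line.  A rough sketch, obviously not your code and
only meant to show the idea:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    char host[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    /* Announce rank, host and pid so stragglers are easy to spot and
     * easy to attach to with gdb. */
    fprintf(stderr, "rank %d on %s (pid %d) entering MPI_Bcast\n",
            rank, host, (int)getpid());
    MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    fprintf(stderr, "rank %d left MPI_Bcast\n", rank);

    MPI_Finalize();
    return 0;
}

If some ranks never report entering the broadcast, the bug is upstream of
it rather than in the collective itself.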

Ashley,
