On Mon, 2009-05-18 at 17:05 -0400, Noam Bernstein wrote:
> The code is complicated, the input files are big and lead to long
> computation times, so I don't think I'll be able to make a simple test
> case. Instead I attached to the hanging processes (all 8 of them) with
> gdb during the hang. The stack trace is below. Nodes seem to spend most
> of their time in btl_openib_component_progress(), and occasionally in
> mca_pml_ob1_progress(). I.e. not completely stuck, but not making
> progress.
Can you confirm that *all* processes are in PMPI_Allreduce at some point? The collectives commonly get blamed for a lot of hangs, and they're not always the correct place to look.

> P.S. I get a similar hang with MVAPICH, in a nearby but different part
> of the code (on an MPI_Bcast, specifically), increasing my tendency to
> believe that it's OFED's fault. But maybe the stack trace will suggest
> to someone where it might be stuck, and therefore perhaps an mca flag
> to try?

This strikes me as a filesystem problem more than MPI per se. Again, with MVAPICH, are all your processes in MPI_Bcast, or just some of them?

Ashley.
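P.S. In case a concrete illustration of what I mean by "not always the correct place to look" is useful: a collective only completes once every rank on the communicator has entered it, so a single rank taking a different code path is enough to leave all the others spinning in the progress engine. Below is a minimal sketch of that situation; the rank-0 early-exit branch is invented purely for illustration and is not taken from your code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local = 1.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hypothetical bug: rank 0 takes an early-exit path and never
     * reaches the collective.  The remaining ranks sit inside
     * MPI_Allreduce, and their backtraces show time spent in the
     * progress functions, i.e. "not stuck, but not making progress". */
    if (rank != 0) {
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        printf("rank %d: total = %f\n", rank, total);
    }

    MPI_Finalize();
    return 0;
}

If you run something like this, the ranks that did call MPI_Allreduce should sit in the progress loop much as your backtraces do, while the rank that didn't is the interesting one. That's why a backtrace from every process, not just a sample, is what distinguishes an application-level mismatch from a genuine transport problem.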