On May 19, 2009, at 12:13 PM, Ashley Pittman wrote:

On Tue, 2009-05-19 at 11:01 -0400, Noam Bernstein wrote:

I'd suspect the filesystem too, except that it's hung up in an MPI call.
As I said before, the whole thing is bizarre.  It doesn't matter where
the executable is, just what CWD is (i.e. I can do mpirun /scratch/exec
or mpirun /home/bernstei/exec, but if CWD is sitting in /scratch it'll
hang).  And I've been running other codes both from NFS and from scratch
directories for months, and never had a problem.

That is indeed odd, but it shouldn't be too hard to track down.  How often
does the failure occur?  Presumably when you say you have three
invocations of the program, they communicate via files; is the location
of these files changing?

The hang is completely repeatable.  Every time I run in the configuration
that hangs (i.e. running from the scratch directory), the third invocation
hangs.  The three invocations happen like this:

1. The serial code creates a subdirectory, writes input files, calls
   "mpirun ...", and reads the output files.
2. It repeats this in a new subdirectory (except that the input files are
   configured to do a rather different calculation).
3. It repeats again, in a new subdirectory (back to the first type of
   calculation).

I can try to save the input and output files and recreate the process
by running mpirun myself, rather than via the serial code.
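Concretely, the by-hand reproduction I have in mind would look something like the sketch below.  The directory names, executable name, process count, and saved-input locations are all placeholders, not the actual names from my setup:

```shell
# Sketch of the three-invocation loop, driven by hand instead of by the
# serial code.  run1..run3, saved_inputs_*, and mpi_exec are placeholders.
cd /scratch                # the CWD that triggers the hang

for i in 1 2 3; do
    mkdir run$i
    cp saved_inputs_$i/* run$i/
    cd run$i
    # hangs on the third pass when CWD is on /scratch
    mpirun -np 4 /home/bernstei/mpi_exec
    cd ..
done
```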


I assume you're certain it's actually hanging and not just failing to
converge?

Yes - no output is generated for a very long time, and attaching
with gdb shows that it's stuck in basically one place.
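For the record, the way I'm checking where it's stuck is roughly the following (the PID is a placeholder for one of the hung ranks):

```shell
# Attach to one hung MPI rank, dump its stack, and detach without
# killing it.  12345 is a placeholder PID; find the real one with
# e.g. "pgrep <exec-name>".
gdb -p 12345 -ex bt -ex detach -ex quit
```

Repeating this a few times shows the backtrace is basically always in the same MPI call.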



Finally, if you could run it with "--mca btl ^ofed" to rule out the ofed
stack causing the problem, that would be useful.  You'd need to check the
syntax here.
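For what it's worth, assuming this is Open MPI, the OpenFabrics transport component is usually named "openib" rather than "ofed", so the exclusion would look something like the sketch below; check the component names on your installation with ompi_info before relying on this:

```shell
# Exclude the OpenFabrics BTL so traffic falls back to other transports
# (e.g. TCP).  "openib" is the usual Open MPI component name; verify with:
#   ompi_info | grep btl
mpirun --mca btl ^openib -np 4 ./mpi_exec
```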


I'll try that (or actually the corrected syntax in the next message).


This isn't so suspicious; if there is a problem with some processes, it's
common for the other processes to continue until the next collective call.

Yeah, I guess so.

                                Noam
