On Jul 1, 2009, at 3:10 PM, Daniel Miles wrote:
Hi, everybody.
I’m having trouble where one of my client nodes crashes while I have
an MPI job on it. When this happens, the mpirun process on the head
node never returns.
This shouldn't happen - we should cleanly abort. What version are you
using?
I can kill it with a SIGINT (ctrl-c) and it still cleans up its
child processes on the remaining healthy client nodes but I don’t
get any of the results from those client processes.
At the moment, we SIGTERM the remaining healthy children when you hit
ctrl-c. I believe that Rolf (Sun) put some code in our development trunk
that first hits the procs with a signal they can catch to clean up
before being whacked, but that isn't in a release yet (assuming I
remember it right, anyway). If I'm misremembering, I can certainly add
that capability.
Sounds like something we should do, assuming the MPI standard allows it
(and the mechanics work out).
Does anybody have any ideas about how I could create a more
fault-tolerant MPI job? In an ideal world, my head node would report that
it lost the connection to a client node and keep going as if that
client never existed (so that the results of the job are what they
would have been if the crashed node hadn't been part of the job to begin
with).
That would be nice...but I'm not sure anyone knows how to do that
right now. The problem is that MPI operations involving ranks on that
client node will suddenly hang without warning, and there is no way to
know that something is wrong.
There is work going on to enable what you describe, but it is still in
the research phase.