Re: [OMPI users] Disconnections

Ralph Castain Wed, 1 Jul 2009 18:01:00 -0400


On Jul 1, 2009, at 3:10 PM, Daniel Miles wrote:

Hi, everybody.
I’m having trouble where one of my client nodes crashes while I have an MPI job on it. When this happens, the mpirun process on the head node never returns.

This shouldn't happen - we should cleanly abort. What version are you using?

I can kill it with a SIGINT (ctrl-c) and it still cleans up its child processes on the remaining healthy client nodes, but I don’t get any of the results from those client processes.

At the moment, we send SIGTERM to the remaining healthy children when you ctrl-c. I do believe that Rolf (Sun) put some code in our development trunk that first hits the procs with a signal that they can catch to clean up before being whacked, but that isn't in a release yet (assuming I remember it right anyway). If I'm mis-remembering, I can certainly add that capability.

Sounds like something we should do, assuming the MPI std allows it (and mechanics work out).
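
To make the "signal they can catch" idea concrete, here is a minimal application-side sketch in plain Python (no MPI calls). It assumes the process really does receive SIGTERM and is given enough time to react before being killed, which this thread does not guarantee. The names PartialResults, run, and on_item are made up for illustration and are not part of Open MPI.

```python
import json
import os
import signal
import tempfile


class PartialResults:
    """Hypothetical helper: collects results and saves them on request."""

    def __init__(self, path):
        self.path = path
        self.items = []
        self.stop_requested = False

    def request_stop(self, signum, frame):
        # Keep the handler tiny: only record that a stop was requested.
        self.stop_requested = True

    def flush(self):
        # Write whatever has been computed so far.
        with open(self.path, "w") as fh:
            json.dump(self.items, fh)


def run(work_items, results, on_item=None):
    """Process items until done or until SIGTERM arrives, then save."""
    signal.signal(signal.SIGTERM, results.request_stop)
    for item in work_items:
        if results.stop_requested:
            break  # stop early, but still fall through to flush()
        results.items.append(item * item)
        if on_item is not None:
            on_item(item)  # hook used only to simulate a signal in a demo
    results.flush()
    return results.items


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "partial.json")
    print(run(range(5), PartialResults(path)), "saved to", path)
```

The handler only sets a flag and the main loop does the actual saving, which keeps the work done inside the signal handler to a minimum.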

Does anybody have any ideas about how I could create a more fault-tolerant MPI job? In an ideal world, my head node would report that it lost the connection to a client node and keep going as if that client never existed (so that the results of the job are what they would have been if the crashed node wasn’t part of the job to begin with).

That would be nice...but I'm not sure anyone knows how to do that right now. The problem is that MPI operations involving ranks on that client node will suddenly hang without warning, and there is no way to know that something is wrong.

There is work going on to enable what you describe, but it is still in the research phase.
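
Purely as an illustration of the behavior being asked for, and not something MPI or Open MPI provides, here is a plain-Python sketch that gathers results with a deadline and carries on without any worker that fails to answer. It does nothing about the MPI-level hang described above. The function gather_with_deadline and the node names are hypothetical, and threads stand in for client nodes.

```python
import concurrent.futures as cf


def gather_with_deadline(tasks, timeout_s):
    """tasks maps a worker name to a zero-argument callable.

    Returns (results, missing): results holds the answers that arrived
    within timeout_s seconds; missing lists workers that did not answer
    in time or that raised an exception.
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {pool.submit(fn): name for name, fn in tasks.items()}
    done, not_done = cf.wait(futures, timeout=timeout_s)

    results = {}
    missing = [futures[f] for f in not_done]  # treated as lost workers
    for f in done:
        name = futures[f]
        if f.exception() is None:
            results[name] = f.result()
        else:
            missing.append(name)  # a crashed worker is also skipped

    # Do not wait for stragglers; the caller continues without them.
    pool.shutdown(wait=False)
    return results, sorted(missing)


if __name__ == "__main__":
    tasks = {"node0": lambda: 10, "node1": lambda: 1 / 0, "node2": lambda: 32}
    results, missing = gather_with_deadline(tasks, timeout_s=1.0)
    print("total:", sum(results.values()), "missing:", missing)
```

The deadline is what turns a silent hang into a reportable loss: the caller can name the workers that dropped out and compute a result from the rest.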

