The current version of Open MPI does not support continued operation of an MPI 
application after a process failure within a job. If a process dies, so will 
the MPI job. The same is true of many other MPI implementations at the moment.
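
On the group question: standard MPI does have calls for building a group that 
leaves out a list of ranks and turning it into a communicator (MPI_Comm_group, 
MPI_Group_excl, MPI_Comm_create). Below is a minimal sketch of those mechanics 
on a healthy job only. It is not a recovery scheme: MPI_Comm_create is 
collective over the parent communicator, so every rank of MPI_COMM_WORLD has 
to call it, and that is not possible once a process has really died. The 
excluded[] list, the rank_after_exclusion() helper, and the USE_MPI build 
macro are assumptions made up for illustration.

```c
/* Sketch only: build a communicator that leaves out a hypothetical list of
 * ranks in a healthy job. Build with MPI enabled (USE_MPI is a made-up macro):
 *   mpicc -DUSE_MPI sketch.c -o sketch && mpirun -np 4 ./sketch
 */
#include <assert.h>
#include <stdio.h>
#ifdef USE_MPI
#include <mpi.h>
#endif

/* Hypothetical helper: the rank a process is expected to get in the new
 * group, or -1 if it is excluded. MPI_Group_excl keeps the remaining
 * processes in their original order, so the new rank is the old rank minus
 * the number of excluded ranks below it. Entries of excluded[] must be
 * distinct, as MPI_Group_excl also requires. */
static int rank_after_exclusion(int rank, const int *excluded, int n_excluded)
{
    int shift = 0;
    for (int i = 0; i < n_excluded; i++) {
        if (excluded[i] == rank)
            return -1;
        if (excluded[i] < rank)
            shift++;
    }
    return rank - shift;
}

#ifdef USE_MPI
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Assumption: rank 1 stands in for a rank we want to leave out. */
    int excluded[] = {1};
    int n_excluded = (world_size > 1) ? 1 : 0;  /* ranks must be valid */

    MPI_Group world_group, survivors_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_excl(world_group, n_excluded, excluded, &survivors_group);

    /* Collective over MPI_COMM_WORLD: every rank calls this, including the
     * excluded one, which gets MPI_COMM_NULL back. */
    MPI_Comm survivors;
    MPI_Comm_create(MPI_COMM_WORLD, survivors_group, &survivors);

    if (survivors != MPI_COMM_NULL) {
        int new_rank;
        MPI_Comm_rank(survivors, &new_rank);
        assert(new_rank == rank_after_exclusion(world_rank, excluded, n_excluded));
        printf("world rank %d is rank %d in the new communicator\n",
               world_rank, new_rank);
        MPI_Comm_free(&survivors);
    }

    MPI_Group_free(&survivors_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
#endif
```

Note that the excluded rank still has to take part in MPI_Comm_create here, 
which is exactly the step that cannot happen after a real process failure.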

At Oak Ridge National Laboratory, we are working on a version of Open MPI that 
will be able to run through process failure, if the application wishes to do 
so. The semantics and interfaces needed to support this functionality are being 
actively developed by the MPI Forum's Fault Tolerance Working Group, and can be 
found at the wiki page below:
  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization

This work is ongoing, but once we have a stable prototype we will assess how 
to bring it back to the mainline Open MPI trunk. For the moment, there is no 
public release of this branch. Once there is, we will announce it on the 
appropriate Open MPI mailing list so that folks can start trying it out.

-- Josh

On Jan 27, 2011, at 9:11 AM, Kirk Stako wrote:

> Hi,
> 
> I was wondering what support Open MPI has for allowing a job to
> continue running when one or more processes in the job die
> unexpectedly? Is there a special mpirun flag for this? Any other ways?
> 
> It seems obvious that collectives will fail once a process dies, but
> would it be possible to create a new group (if you knew which ranks
> are dead) that excludes the dead processes - then turn this group into
> a working communicator?
> 
> Thanks,
> Kirk

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

