Actually, I honestly don't remember even having that discussion. In looking at it, this would be relatively easy to implement if someone really wanted it.
Only issue: user would bear full responsibility for OMPI not cleaning up failed jobs since we wouldn't terminate upon seeing a proc fail. Definitely not something you'd want to do in production!

On Sep 16, 2011, at 6:55 AM, Josh Hursey wrote:

> Though I do not share George's pessimism about acceptance to the Open
> MPI community, it has been slightly difficult to add such a
> non-standard feature to the code base for various reasons.
>
> At ORNL, I have been developing a prototype for the MPI Forum Fault
> Tolerance Working Group [1] of the Run-Through Stabilization proposal
> [2,3]. This would allow the application to continue running and using
> MPI functions even though processes fail during execution. We have
> been doing some limited alpha releases for some friendly application
> developers desiring to play with the prototype for a while now. We are
> hoping to do a more public beta release in the coming months. I'll
> likely post a message to the ompi-devel list once it is ready.
>
> -- Josh
>
> [1] http://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage
> [2] See PDF on
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> [3] See PDF on
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization_2
>
> On Thu, Sep 15, 2011 at 4:14 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>> Rob,
>>
>> The Open MPI community did consider such as option, but it deemed it as
>> uninteresting. However, we (UTK team) have a patched version supporting
>> several fault tolerant modes, including the one you described in your email.
>> If you are interested please contact me directly.
>>
>> Thanks,
>> george.
>>
>>
>> On Sep 12, 2011, at 20:43 , Ralph Castain wrote:
>>
>>> We don't have anything similar in OMPI. There are fault tolerance modes,
>>> but not like the one you describe.
>>>
>>> On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote:
>>>
>>>> Hi,
>>>>
>>>> I have implemented a simple fault tolerant ping pong C program with MPI,
>>>> here: http://pastebin.com/7mtmQH2q
>>>>
>>>> MPICH2 offers a parameter with mpiexec:
>>>> $ mpiexec -disable-auto-cleanup
>>>>
>>>> .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421
>>>>
>>>> It is fault tolerant in the respect that, when I ssh to one of the nodes
>>>> in the hosts file, and kill the relevant process, the MPI job is not
>>>> terminated. Simply, the ping will not prompt a pong from the dead node,
>>>> but the ping-pong runs forever on the remaining live nodes.
>>>>
>>>> Is such an feature available for openMPI, either via mpiexec or some other
>>>> means?
>>>>
>>>>
>>>> --
>>>> Rob Stewart
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey