On Jan 27, 2011, at 7:47 AM, Reuti wrote:

> On 27.01.2011 at 15:23, Joshua Hursey wrote:
>
>> The current version of Open MPI does not support continued operation of an
>> MPI application after process failure within a job. If a process dies, so
>> will the MPI job. Note that this is true of many MPI implementations out
>> there at the moment.
>>
>> At Oak Ridge National Laboratory, we are working on a version of Open MPI
>> that will be able to run through process failure, if the application wishes
>> to do so. The semantics and interfaces needed to support this functionality
>> are being actively developed by the MPI Forum's Fault Tolerance Working
>> Group, and can be found at the wiki page below:
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>
> I had a look at this document, but what is really covered - does the
> application have to react to the notification of a failed rank and act
> appropriately on its own?
>
> Having a true ability to survive a dying process (i.e. rank) which might
> already have been computing for hours would mean having some kind of
> "rank RAID" or "rank Parchive". E.g. start 12 ranks when you need 10 -
> whichever 2 ranks fail, your job will be ready in time.
We have the run-time part of this done - of course, figuring out the MPI part of the problem is harder ;-)

>
> -- Reuti
>
>
>> This work is on-going, but once we have a stable prototype we will assess
>> how to bring it back to the mainline Open MPI trunk. For the moment, there
>> is no public release of this branch, but once there is we will be sure to
>> announce it on the appropriate Open MPI mailing list for folks to start
>> playing around with it.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 9:11 AM, Kirk Stako wrote:
>>
>>> Hi,
>>>
>>> I was wondering what support Open MPI has for allowing a job to
>>> continue running when one or more processes in the job die
>>> unexpectedly? Is there a special mpirun flag for this? Any other ways?
>>>
>>> It seems obvious that collectives will fail once a process dies, but
>>> would it be possible to create a new group (if you knew which ranks
>>> are dead) that excludes the dead processes - then turn this group into
>>> a working communicator?
>>>
>>> Thanks,
>>> Kirk
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
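
To make Kirk's question about a group without the dead ranks a bit more concrete, here is a minimal sketch in plain Python. It does not call MPI at all: `surviving_group` is a hypothetical helper that only models the renumbering you would get from excluding known-dead ranks from a group (MPI_Group_excl keeps the remaining processes in their original order). It assumes the list of dead ranks is already known, and, as Josh points out above, the current Open MPI would abort the job on a failure, so this shows the bookkeeping only, not a working recovery path.

```python
def surviving_group(world_size, dead_ranks):
    """Model excluding dead ranks from a group of `world_size` ranks.

    Returns a dict mapping each surviving old rank to its new rank.
    Survivors keep their relative order, as MPI_Group_excl would.
    (Hypothetical helper for illustration; not an MPI call.)
    """
    dead = set(dead_ranks)
    for r in dead:
        if not 0 <= r < world_size:
            raise ValueError(f"rank {r} is outside 0..{world_size - 1}")
    survivors = [r for r in range(world_size) if r not in dead]
    return {old: new for new, old in enumerate(survivors)}


# Reuti's example: start 12 ranks when 10 are needed, then lose two.
# The dead ranks 3 and 7 are made up for the illustration.
mapping = surviving_group(12, [3, 7])
print(len(mapping))  # how many ranks are still available
print(mapping[8])    # new rank of the process that was rank 8
```

In a real application each surviving process would need to agree on the same list of dead ranks before building the new communicator, which is exactly the hard part the Fault Tolerance Working Group proposal is meant to address.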