On Jan 27, 2011, at 7:47 AM, Reuti wrote:

> On 27.01.2011 at 15:23, Joshua Hursey wrote:
> 
>> The current version of Open MPI does not support continued operation of an 
>> MPI application after process failure within a job. If a process dies, so 
>> will the MPI job. Note that this is true of many MPI implementations out 
>> there at the moment.
>> 
>> At Oak Ridge National Laboratory, we are working on a version of Open MPI 
>> that will be able to run through process failure, if the application wishes 
>> to do so. The semantics and interfaces needed to support this functionality 
>> are being actively developed by the MPI Forum's Fault Tolerance Working 
>> Group, and can be found at the wiki page below:
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> 
> I had a look at this document, but what does it really cover? Does the 
> application have to react to the notification of a failed rank and act 
> appropriately on its own?
> 
> Having a true ability to survive a dying process (i.e. rank) which might 
> already have been computing for hours would require some kind of "rank RAID" 
> or "rank Parchive". E.g. start 12 ranks when you need 10 - whichever 2 ranks 
> fail, your job will still finish on time.

We have the run-time part of this done - of course, figuring out the MPI part 
of the problem is harder ;-)

> 
> -- Reuti
> 
> 
>> This work is ongoing, but once we have a stable prototype we will assess 
>> how to bring it back to the mainline Open MPI trunk. For the moment, there 
>> is no public release of this branch, but once there is we will be sure to 
>> announce it on the appropriate Open MPI mailing list for folks to start 
>> playing around with it.
>> 
>> -- Josh
>> 
>> On Jan 27, 2011, at 9:11 AM, Kirk Stako wrote:
>> 
>>> Hi,
>>> 
>>> I was wondering what support Open MPI has for allowing a job to
>>> continue running when one or more processes in the job die
>>> unexpectedly? Is there a special mpirun flag for this? Any other ways?
>>> 
>>> It seems obvious that collectives will fail once a process dies, but
>>> would it be possible to create a new group (if you knew which ranks
>>> are dead) that excludes the dead processes - then turn this group into
>>> a working communicator?
>>> 
>>> Thanks,
>>> Kirk
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> 
>> 
> 

