Thanks for the feedback. More below:

Are there any MPI implementations that meet the following requirements:

1. It does not terminate the whole job when a node dies.

2. It allows a spare node to replace the dead node and take over the dead
node's work.
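
To make the two requirements concrete, here is a toy model of the behaviour
I have in mind. It is plain Python, not MPI code, and every name in it (Job,
rank_to_node, node_failed, the node labels) is made up for illustration; it
only sketches the bookkeeping, under the assumption that each rank runs on
exactly one node and that spares are handed out in order.

```python
class Job:
    """Toy model of a job whose ranks run on nodes, with a pool of spares."""

    def __init__(self, nodes, spares):
        # Rank i initially runs on nodes[i] (hypothetical layout).
        self.rank_to_node = dict(enumerate(nodes))
        self.spares = list(spares)

    def node_failed(self, node):
        """Handle a dead node; return the ranks that were moved to a spare."""
        moved = []
        for rank, host in self.rank_to_node.items():
            if host != node:
                continue
            if not self.spares:
                raise RuntimeError("no spare node left for rank %d" % rank)
            # Requirement 2: a spare takes over the dead node's rank.
            self.rank_to_node[rank] = self.spares.pop(0)
            moved.append(rank)
        # Requirement 1: the other ranks are untouched, so the job goes on.
        return moved


job = Job(nodes=["n0", "n1", "n2"], spares=["s0"])
moved = job.node_failed("n1")
print(moved, job.rank_to_node)
```

A real implementation would also have to restore the lost rank's state and
repair the communicator, which this sketch deliberately leaves out.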

As far as I know, FT-MPI meets both requirements, but it has not been updated
since 2004. Open MPI is said to combine several projects, including FT-MPI,
but so far it only provides checkpoint/restart as a means of fault tolerance.


Best Regards
Rui

2010/6/29 Jeff Squyres <jsquy...@cisco.com>

> On Jun 29, 2010, at 3:44 AM, 王睿 wrote:
>
> > 1. Suppose an MPI program involves several nodes. If one node dies, will
> the program terminate?
>
> Open MPI will terminate the whole job, yes.
>
> > 2. Is it possible to extend or shrink the size of an MPI
> communicator? If so, could we use a spare node to replace the dead node?
>
> Currently, no.
>
> Fault tolerance and resiliency is an active topic of research and
> discussion in the MPI-3 forum.  But for the moment, most MPI implementations
> -- including Open MPI -- have fairly draconian responses to the loss of a
> process and/or node (i.e., kill the rest of the job).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
