Thanks for the feedback. More below:

Are there any MPI implementations that meet the following requirements?
1. It doesn't terminate the whole job when a node dies.
2. It allows a spare node to replace the dead node and take over its work.

As far as I know, FT-MPI meets both requirements, but it hasn't been updated since 2004. Open MPI is said to combine several projects, including FT-MPI, but so far it only provides checkpoint/restart as a means of fault tolerance.

Best Regards
Rui

2010/6/29 Jeff Squyres <jsquy...@cisco.com>

> On Jun 29, 2010, at 3:44 AM, 王睿 wrote:
>
> > 1, suppose a MPI program involves several nodes, if one node dead, will the program terminate?
>
> Open MPI will terminate the whole job, yes.
>
> > 2, Is there any possibility to extend or shrink the size of MPI communicator size? If so, we can use spare node to replace the dead node?
>
> Currently, no.
>
> Fault tolerance and resiliency is an active topic of research and discussion in the MPI-3 forum. But for the moment, most MPI implementations -- including Open MPI -- have fairly draconian responses to the loss of a process and/or node (i.e., kill the rest of the job).
>
> --
> Jeff Squyres
> jsquy...@cisco.com