The official position, as clearly stated by a majority of the Open MPI developer community, is that fault tolerance is not needed, so we (read: the official version of Open MPI) do not support it.
However, a group of researchers has been working toward a version of Open MPI that supports the most recent fault tolerance proposal submitted to the MPI Forum for consideration. You can access it at https://bitbucket.org/jjhursey/ompi-ulfm-rts.

George.

On Jun 19, 2012, at 09:58, 陈松 wrote:

> Hi all,
>
> Can anyone explain to me the fault tolerance features in Open MPI? I've read
> the FAQs and some of the papers on this topic listed on open-mpi.org, but I
> still can't figure out what the fault tolerance mechanism in Open MPI would
> do when one node of my supercomputer system fails during a computation, and
> what we system administrators should do after the failure (or before it).
>
> Can anyone help me? My boss wants me to deploy Open MPI on our system because
> he wants the fault tolerance feature.
>
> Thanks very much.
>
> ---------------
> CHEN Song
> R&D Department
> National Supercomputer Center in Tianjin
> Binhai New Area, Tianjin, China