You are correct that the Open MPI project combined the efforts of a few preexisting MPI implementations towards building a single, extensible MPI implementation with the best features of the prior MPI implementations. From the beginning of the project the Open MPI developer community has desired to provide a solid MPI 2 (soon MPI 3) compliant MPI implementation. Features outside of the MPI standard, such as fault tolerance, have been (and are) goals as well.
The fault tolerance efforts in Open MPI have been mostly pursued by the research side of the community. As such, maintenance support for these features is often challenging and a point of frequent discussion in the core developer community. There are users for each of these fault tolerance features/techniques, so they are important to provide. Integrating these features into Open MPI without diminishing performance, scalability, and usability is often a delicate software engineering challenge. Per the prior comments on this thread, it can often lead to heated debate. :)

In the Open MPI trunk and 1.6 release series there are a few fault tolerance features that you might be interested in, all with various degrees of functionality and support. Each of these features is an advancement on the fault tolerance features from the LAM/MPI, MPICH-V, FT-MPI, and LA-MPI projects.

Checkpoint/Restart support allows a user to manually (via a command line tool) checkpoint and restart an MPI application, migrate processes in the machine, and/or ask Open MPI to automatically restart failed processes on spare resources. Additionally, the application can use APIs to checkpoint/restart/migrate processes without using the command line tools. This C/R technique is similar to the feature provided by LAM/MPI, and was developed by Indiana University (for my PhD work). For more details see the link below:
  http://www.open-mpi.org/faq/?category=ft#cr-support

Message logging support was added a while back by UTK, but I am uncertain about its current state. This technique is similar to the features provided by the MPICH-V project. For more details, I think the wiki page below describes the functionality:
  https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR

The MPI Forum standardization body's Fault Tolerance Working Group has a proposal for application managed fault tolerance. In essence this is similar to the FT-MPI work, although the interface is quite a bit different.
This feature is not yet in the Open MPI trunk, but you can find a beta release and more information at the link below:
  http://www.open-mpi.org/~jjhursey/projects/ft-open-mpi/

End-to-end data reliability worked at one point in time, but I do not know if it is being maintained. This is similar to the fault tolerance features found in LA-MPI. For information about that project see the link below:
  http://www.open-mpi.org/faq/?category=ft#dr-support

There are also research projects that are exploring other fault tolerance techniques above MPI, such as peer based checkpointing and replication. So far, these projects have tried to stay above the MPI layer for portability, and have not requested any specific extensions of Open MPI (maybe with the exception of the work in the MPI Forum, cited above). Below are links to two such projects, though there are many others out there:
  http://sourceforge.net/projects/scalablecr/
  http://prod.sandia.gov/techlib/access-control.cgi/2011/112488.pdf

So that should give you an overview of the current state of fault tolerance techniques in Open MPI.

As for your question about what you can expect if a process crashes in your Open MPI job: by default, Open MPI will kill your entire MPI job, and the user will have to restart the job from either the beginning of execution or from any checkpoint files that the application has written. Open MPI defaults to killing the entire MPI job since that is what is often expected by MPI applications, as most use the default MPI error handler MPI_ERRORS_ARE_FATAL:
  http://www.netlib.org/utk/papers/mpi-book/node177.html

Last I checked, the current Open MPI trunk will terminate the entire job even if the user set MPI_ERRORS_RETURN on their communicators. A reason for this is that the behavior of MPI after returning such an error is undefined. The MPI Forum Fault Tolerance working group is working to define this behavior. So if this is of interest see the MPI Forum work cited above.
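
To make "restart from checkpoint files that the application has written" a bit more concrete, here is a minimal sketch of the application-level pattern in plain Python. It makes no MPI calls and is not how Open MPI's own C/R support works; it only illustrates an application saving and reloading its own state. The file name and the names `save_checkpoint`, `load_checkpoint`, `run`, and `crash_at` are hypothetical, made up for this illustration. In a real MPI job each rank would presumably write its own file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never corrupts the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, crash_at=None):
    """Sum 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    crash_at is a hypothetical knob that simulates the job being killed.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated process failure")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "app.ckpt")
    try:
        run(ckpt, n_steps=100, crash_at=57)  # first attempt is "killed"
    except RuntimeError:
        pass
    # The restarted job resumes from step 50, the last checkpoint written.
    print(run(ckpt, n_steps=100))  # prints 4950
```

The work done between the last checkpoint (step 50) and the failure (step 57) is simply redone after the restart, which is the usual trade-off when choosing how often to checkpoint.
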
If you want a fault tolerance feature, such as automatic checkpoint/restart recovery, you will need to create a build of Open MPI with that feature enabled. The various links above include instructions on how to do so.

If you are particularly interested in one feature or have a strong use case for a set of features, then that is important information for the Open MPI developer community. It will help us as a project prioritize the maintenance of the various features in Open MPI.

Best of luck,
Josh

On Wed, Jun 20, 2012 at 2:59 AM, 陈松 <chens...@nscc-tj.gov.cn> wrote:
> As far as I know, OMPI combines the fault tolerant features in FT-MPI,
> LA-MPI and LAM/MPI, is this statement still correct now? Or as you say, OMPI
> supports checkpoint/restart (like in LAM/MPI) only? I don't know the details
> of FT-MPI or LA-MPI, aren't they useful or necessary?
>
> In fact, what I really want to know is, suppose I run a job on N processors
> with OMPI, and one (or some) of these processors crashes, then what would be
> done by the fault-tolerant mechanism of OMPI? Meanwhile what should the
> sys-admin do (like restart the crashed node)?
>
> In my understanding, after the crash, the sys-admin should restart the
> crashed node (if it can be restarted), and then do the rollback by some sort
> of command, while the OMPI would help hang up all the computing process,
> waiting for rollback command, is this correct?
>
> thanks again.
>
>
> --------- Original Message ---------
> From: "Open MPI Users" <us...@open-mpi.org>
> To: "Open MPI Users" <us...@open-mpi.org>
> Subject: Re: [OMPI users] 2012/06/18 14:35:07 auto-saved draft
> Date: 2012/06/20 01:26:08, Wednesday
>
>
> That's a little bit strong - OMPI still supports checkpoint/restart as a
> fault tolerance mechanism. There really isn't anything the sys admin has to
> do, though - what is required is that users periodically order their
> programs to checkpoint so they can be restarted after a failure.
>
> Checkpointing is typically done either by the app itself (say, when it
> reaches some point it feels is a good one to save), or using a script that
> just orders a checkpoint every so many seconds.
>
> What we have said is that we don't believe the FT "run thru failure"
> position pushed by UTK is particularly required at this time. Partly a
> question of impact vs benefit, mostly due to competing approaches offering
> equivalent fault recovery capability with less impact. But that's a separate
> discussion.
>
>
> On Jun 19, 2012, at 11:16 AM, George Bosilca wrote:
>
> It has been clearly stated that the official position pushed forward by a
> majority of the Open MPI developer community is that fault tolerance is not
> needed so we (read this as the official version of Open MPI) do not support
> it.
>
> However, a group of researchers have been working toward a version of Open
> MPI that supports the last fault tolerance proposal submitted for
> consideration to the MPI Forum. You can access it
> at https://bitbucket.org/jjhursey/ompi-ulfm-rts.
>
> george.
>
> On Jun 19, 2012, at 09:58 , 陈松 wrote:
>
> Hi all,
>
> Can anyone explain me the fault tolerant features in OpenMPI? I've read the
> FAQs and some papers about this topic listed in open-mpi.org, but still
> can't figure out when one node of my supercomputer system fails down during
> computing, what would happen with the fault tolerant mechanism in OpenMPI,
> and what should we system administrator do after the failure (or before).
>
> Can anyone help me? My boss want me to deploy OpenMPI in our system cuz he
> want the fault tolerant feature.
>
> Thanks very much.
>
> ---------------
> CHEN Song
> R&D Department
> National Supercomputer Center in Tianjin
> Binhai New Area, Tianjin, China
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey