The official support page for the C/R features is hosted by Indiana University (linked from the Open MPI FAQs): http://osl.iu.edu/research/ft/ompi-cr/
The instructions probably need to be cleaned up (some of the release references are no longer quite correct), but the following should give you a build of Open MPI with C/R support:
  shell$ ./configure --with-ft=cr --enable-opal-multi-threads

You will also need to enable it on the command line with mpirun:
  shell$ mpirun -am ft-enable-cr my-app

Best,
Josh

On Mon, Jun 25, 2012 at 6:21 AM, 陈松 <chens...@nscc-tj.gov.cn> wrote:
> THANK YOU for your detailed answer.
>
> [quote]If you want a fault tolerance feature, such as automatic
> checkpoint/restart recovery, you will need to create a build of Open
> MPI with that feature enabled. There are instructions on the various
> links above about how to do so.[/quote]
>
> Could you give me some kind of official guide to enable the C/R feature? I
> googled some articles, but there seem to be problems with those methods.
>
> Best wishes.
>
> --------- Original Message ---------
> From: "Open MPI Users" <us...@open-mpi.org>
> To: "Open MPI Users" <us...@open-mpi.org>
> Subject: [OMPI users] Re: [OMPI users] Reply: Re: [OMPI users] 2012/06/18 14:35:07
> auto-saved draft
> Date: 2012/06/20 21:43:27, Wednesday
>
> You are correct that the Open MPI project combined the efforts of a
> few preexisting MPI implementations towards building a single,
> extensible MPI implementation with the best features of the prior MPI
> implementations. From the beginning of the project the Open MPI
> developer community has desired to provide a solid MPI 2 (soon MPI 3)
> compliant MPI implementation. Features outside of the MPI standard,
> such as fault tolerance, have been (and are) goals as well.
>
> The fault tolerance efforts in Open MPI have been mostly pursued by
> the research side of the community. As such, maintenance support for
> these features is often challenging and a point of frequent discussion
> in the core developer community. There are users for each of these
> fault tolerance features/techniques, so they are important to provide.
> Integrating these features into Open MPI without diminishing
> performance, scalability, and usability is often a delicate software
> engineering challenge. Per the prior comments on this thread, it can
> often lead to heated debate. :)
>
> In the Open MPI trunk and 1.6 release series there are a few fault
> tolerance features that you might be interested in, all with various
> degrees of functionality and support. Each of these features is an
> advancement on the fault tolerance features from the LAM/MPI,
> MPICH-V, FT-MPI, and LA-MPI projects.
>
> Checkpoint/Restart support allows a user to manually (via a command
> line tool) checkpoint and restart an MPI application, migrate
> processes in the machine, and/or ask Open MPI to automatically restart
> failed processes on spare resources. Additionally, the application can
> use APIs to checkpoint/restart/migrate processes without using the
> command line tools. This C/R technique is similar to the feature
> provided by LAM/MPI, and was developed by Indiana University (for my
> PhD work). For more details see the link below:
> http://www.open-mpi.org/faq/?category=ft#cr-support
>
> Message logging support was added a while back by UTK, but I am
> uncertain about its current state. This technique is similar to the
> features provided by the MPICH-V project. For more details, I think
> the wiki page below describes the functionality:
> https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR
>
> The MPI Forum standardization body's Fault Tolerance Working Group has
> a proposal for application-managed fault tolerance. In essence this is
> similar to the FT-MPI work, although the interface is quite a bit
> different. This feature is not yet in the Open MPI trunk, but you can
> find a beta release and more information at the link below:
> http://www.open-mpi.org/~jjhursey/projects/ft-open-mpi/
>
> End-to-end data reliability worked at one point in time, but I do not
> know if it is being maintained.
> This is similar to the fault tolerance features found in LA-MPI. For
> information about that project see the link below:
> http://www.open-mpi.org/faq/?category=ft#dr-support
>
> There are also research projects that are exploring other fault
> tolerance techniques above MPI, such as peer-based checkpointing and
> replication. So far, these projects have tried to stay above the MPI
> layer for portability, and have not requested any specific extensions
> of Open MPI (maybe with the exception of the work in the MPI Forum,
> cited above). Below are links to two such projects, though there are
> many others out there:
> http://sourceforge.net/projects/scalablecr/
> http://prod.sandia.gov/techlib/access-control.cgi/2011/112488.pdf
>
> So that should give you an overview of the current state of fault
> tolerance techniques in Open MPI. As to your question about what you
> can expect if a process crashes in your Open MPI job: by default, Open
> MPI will kill your entire MPI job and the user will have to restart the
> job from either the beginning of execution or from any checkpoint
> files that the application has written. Open MPI defaults to killing
> the entire MPI job since that is what is often expected by MPI
> applications, as most use the default MPI error handler
> MPI_ERRORS_ARE_FATAL:
> http://www.netlib.org/utk/papers/mpi-book/node177.html
>
> Last I checked, the current Open MPI trunk will terminate the entire
> job even if the user set MPI_ERRORS_RETURN on their communicators. A
> reason for this is that the behavior of MPI after returning such an
> error is undefined. The MPI Forum Fault Tolerance working group is
> working to define this behavior. So if this is of interest see the MPI
> Forum work cited above.
>
> If you want a fault tolerance feature, such as automatic
> checkpoint/restart recovery, you will need to create a build of Open
> MPI with that feature enabled. There are instructions on the various
> links above about how to do so.
>
> If you are particularly interested in one feature or have a strong use
> case for a set of features, then that is important information for the
> Open MPI developer community. This will help us as a project
> prioritize the maintenance of various features in the Open MPI
> project.
>
> Best of luck,
> Josh
>
> On Wed, Jun 20, 2012 at 2:59 AM, 陈松 <chens...@nscc-tj.gov.cn> wrote:
>> As far as I know, OMPI combines the fault tolerant features in FT-MPI,
>> LA-MPI and LAM/MPI. Is this statement still correct now? Or, as you say,
>> does OMPI support checkpoint/restart (like in LAM/MPI) only? I don't know
>> the details of FT-MPI or LA-MPI; aren't they useful or necessary?
>>
>> In fact, what I really want to know is: suppose I run a job on N
>> processors with OMPI, and one (or some) of these processors crashes. What
>> would then be done by the fault-tolerant mechanism of OMPI? Meanwhile,
>> what should the sys-admin do (like restart the crashed node)?
>>
>> In my understanding, after the crash, the sys-admin should restart the
>> crashed node (if it can be restarted), and then do the rollback by some
>> sort of command, while OMPI would help suspend all the computing
>> processes, waiting for the rollback command. Is this correct?
>>
>> Thanks again.
>>
>> --------- Original Message ---------
>> From: "Open MPI Users" <us...@open-mpi.org>
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Subject: Re: [OMPI users] 2012/06/18 14:35:07 auto-saved draft
>> Date: 2012/06/20 01:26:08, Wednesday
>>
>> That's a little bit strong - OMPI still supports checkpoint/restart as a
>> fault tolerance mechanism. There really isn't anything the sys admin has
>> to do, though - what is required is that users periodically order their
>> programs to checkpoint so they can be restarted after a failure.
>>
>> Checkpointing is typically done either by the app itself (say, when it
>> reaches some point it feels is a good one to save), or using a script that
>> just orders a checkpoint every so many seconds.
>>
>> What we have said is that we don't believe the FT "run thru failure"
>> position pushed by UTK is particularly required at this time. Partly a
>> question of impact vs benefit, mostly due to competing approaches offering
>> equivalent fault recovery capability with less impact. But that's a
>> separate discussion.
>>
>> On Jun 19, 2012, at 11:16 AM, George Bosilca wrote:
>>
>> It has been clearly stated that the official position pushed forward by a
>> majority of the Open MPI developer community is that fault tolerance is
>> not needed, so we (read this as the official version of Open MPI) do not
>> support it.
>>
>> However, a group of researchers have been working toward a version of Open
>> MPI that supports the last fault tolerance proposal submitted for
>> consideration to the MPI Forum. You can access it
>> at https://bitbucket.org/jjhursey/ompi-ulfm-rts.
>>
>> george.
>>
>> On Jun 19, 2012, at 09:58 , 陈松 wrote:
>>
>> Hi all,
>>
>> Can anyone explain to me the fault tolerant features in OpenMPI? I've read
>> the FAQs and some papers about this topic listed on open-mpi.org, but I
>> still can't figure out what would happen with the fault tolerant mechanism
>> in OpenMPI when one node of my supercomputer system fails during
>> computing, and what we system administrators should do after the failure
>> (or before).
>>
>> Can anyone help me? My boss wants me to deploy OpenMPI in our system
>> because he wants the fault tolerant feature.
>>
>> Thanks very much.
>>
>> ---------------
>> CHEN Song
>> R&D Department
>> National Supercomputer Center in Tianjin
>> Binhai New Area, Tianjin, China
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
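
The thread mentions that, by default, a killed job has to be restarted either from the beginning or from checkpoint files the application itself has written, and that such checkpoints are often taken by the app at a point it considers safe. A minimal sketch of that application-level pattern is below. It is plain Python with no MPI calls and is not Open MPI's C/R implementation; the names save_checkpoint, load_checkpoint, run, and CHECKPOINT_EVERY, as well as the JSON file format, are assumptions made only for illustration.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 10  # hypothetical interval, counted in loop iterations


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target.

    The rename keeps the previous checkpoint intact if the process
    dies in the middle of writing the new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, fail_at=None):
    """Sum 0..n_steps-1, resuming from the checkpoint file if present.

    fail_at simulates a crash (illustration only): the loop raises
    before executing that step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

After a crash, calling run() again with the same path resumes from the last saved step instead of step 0. In an MPI job each rank would presumably write its own file and all ranks would need to agree on which step to resume from; that coordination is left out of this sketch.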