I know that checkpoint/restart has been supported in the past via toolkits like BLCR, but to be honest, I don't know the current level of support. I think I heard somewhere that the checkpoint/restart support in Open MPI was going away in some fashion.
In any case, if you have the ability to set up application-aware, application-specific checkpointing, it will be a much better solution than something that's application-agnostic. The checkpoint files will be smaller (the application knows what in memory is important and what isn't), coordination between processes will be better, you have some assurance that you won't hit PID conflicts or problems when a restarted process ends up with a different PID, etc.

I suspect someone on the list can answer your question about the built-in checkpoint/restart code better than I can. But in general, if you have a choice between checkpointing external to your application and checkpointing internal to it, choose the application-internal checkpointing.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 07/19/2013 01:34 PM, Erik Nelson wrote:
> I run mpi on an NSF computer. One of the conditions of use is that jobs
> are limited to 24 hr duration to provide democratic allotment to its users.
>
> A long program can require many restarts, so it becomes necessary to
> store the state of the program in memory, print it, recompile, and read
> the state to start again.
>
> I seem to remember a simpler approach (check point restart?) in which
> the state of the .exe code is saved and then simply restarted from its
> current position.
>
> Is there something like this for restarting an mpi program?
>
> Thanks, Erik
>
> --
> Erik Nelson
>
> Howard Hughes Medical Institute
> 6001 Forest Park Blvd., Room ND10.124
> Dallas, Texas 75235-9050
>
> p : 214 645 5981
> f : 214 645 5948
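
P.S. To make the application-internal approach recommended above a bit more concrete, here is a minimal sketch of the pattern in Python. It is an illustration under assumptions, not code from any real application: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the file name `ckpt_rank<N>.pkl`, and the toy "accumulate a sum" workload are all hypothetical, and `rank` is passed in as a plain integer. In a real MPI program the rank would come from `MPI_Comm_rank`, and all ranks would need to checkpoint at the same step (for example after a barrier) so the set of files is mutually consistent.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temp file and rename, so a job that is
    killed mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(rank, total_steps, checkpoint_every, workdir, stop_after=None):
    """Toy per-rank loop (hypothetical workload): sum the step indices,
    saving only the state that matters every checkpoint_every steps.

    stop_after simulates the job hitting its wall-clock limit.
    """
    path = os.path.join(workdir, f"ckpt_rank{rank}.pkl")
    # On a fresh start this is the initial state; on a restart it is
    # whatever the previous job last saved.
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break  # pretend the scheduler killed the job here
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

Running `run(0, 10, 3, workdir, stop_after=7)` and then `run(0, 10, 3, workdir)` in a second job resumes from the last saved step rather than from zero; the work done after the last checkpoint is simply repeated. The checkpoint holds only two integers here, which is the point: the application decides what is worth saving, and nothing about the process (PID, open sockets, memory layout) has to be restored.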