Re: [OMPI users] check point restart

Ralph Castain Fri, 19 Jul 2013 15:51:38 -0400

On Jul 19, 2013, at 12:41 PM, Lloyd Brown <lloyd_br...@byu.edu> wrote:


> I know that in the past it has been supported via toolkits like BLCR,
> but I don't know the current level of support, to be honest.  I think I
> heard somewhere that the checkpoint/restart support in OpenMPI was going
> away in some fashion.

It is still partly there through the 1.6 series, but may have suffered some
bitrot in the latest 1.6 release(s). The developer who maintained that
functionality has taken on another position, so support isn't as strong as it
was. It currently isn't available in the 1.7 series.

> 
> In any case, if you have the ability to set up application-aware,
> application-specific checkpointing, it will be a much better solution
> than something that's application-agnostic.  The checkpoint files will
> be smaller (the application knows what in memory is important, and what
> isn't), coordination will be better between processes, you have some
> level of assurance that you won't have PID conflicts or problems when
> the PID ends up different, etc.
> 
> I suspect someone on the list can answer your question about the
> built-in checkpoint/restart code better than I can.  But in general, if
> you have a choice between checkpointing external and internal to your
> application, choose the application-internal checkpointing.

Definitely agree - internal is much better. I don't understand the comment
about printing and recompiling. Usually, people just have the app write its
intermediate results to a file, and provide a command-line option pointing to
that file upon restart so the app knows to read it and start from that point.
The app needs a routine to read the file and set itself up to continue, but
that's a one-time implementation cost.
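
To make that concrete, here is a rough sketch of the pattern in Python. It is
only an illustration under my own assumptions, not anything Open MPI provides:
the function names, the --restart/--checkpoint options, the JSON file format,
and the trivial "work" loop are all hypothetical stand-ins for whatever your
application really does. The stop_after argument just simulates the job being
cut off at the wall-clock limit.

```python
import argparse
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint,
    so a job killed mid-write still leaves the previous checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back the state written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


def run(total_steps, checkpoint_path, restart_path=None, every=100,
        stop_after=None):
    """Run (or resume) the computation, checkpointing every `every` steps."""
    if restart_path:
        state = load_checkpoint(restart_path)   # resume from saved state
    else:
        state = {"step": 0, "value": 0.0}       # fresh start
    steps_this_run = 0
    while state["step"] < total_steps:
        state["value"] += state["step"] * 0.5   # stand-in for the real work
        state["step"] += 1
        steps_this_run += 1
        if state["step"] % every == 0:
            save_checkpoint(checkpoint_path, state)
        if stop_after is not None and steps_this_run >= stop_after:
            break                               # simulated time limit
    save_checkpoint(checkpoint_path, state)
    return state


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--steps", type=int, default=1000)
    parser.add_argument("--checkpoint", default="state.json")
    parser.add_argument("--restart", default=None,
                        help="checkpoint file to resume from")
    args = parser.parse_args(argv)
    final = run(args.steps, args.checkpoint, restart_path=args.restart)
    print("finished at step", final["step"])


if __name__ == "__main__":
    main()
```

The first job would run with no --restart option, and each following job
would pass --restart pointing at the file the previous job left behind. In an
MPI code you would presumably have each rank write its own file (for example,
with the rank number in the file name) or gather the state to one rank first;
which is better depends on the application. No recompiling is involved at any
point.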

> 
> 
> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 07/19/2013 01:34 PM, Erik Nelson wrote:
>> I run mpi on an NSF computer. One of the conditions of use is that jobs
>> are limited to 24 hr
>> duration to provide democratic allotment to its users.
>> 
>> A long program can require many restarts, so it becomes necessary to
>> store the state of the 
>> program in memory, print it, recompile, and read the state to start
>> again.
>> 
>> I seem to remember a simpler approach (check point restart?) in which
>> the state of the .exe
>> code is saved and then simply restarted from its current position.
>> 
>> Is there something like this for restarting an mpi program?
>> 
>> Thanks, Erik
>> 
>> 
>> -- 
>> Erik Nelson
>> 
>> Howard Hughes Medical Institute
>> 6001 Forest Park Blvd., Room ND10.124
>> Dallas, Texas 75235-9050
>> 
>> p : 214 645 5981
>> f : 214 645 5948
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
