Re: [gmx-users] Checkpointing GROMACS jobs

David van der Spoel Mon, 28 Jan 2008 11:09:51 -0800

Steven Kirk wrote:

Hello,
I have been using GROMACS for some very long (in wall clock terms) simulations, and am curious as to how other users on this list solve the problem of checkpointing long MD runs. It's a problem because of the tendency of computational nodes in large HPC facilities (the more processors, the more prevalent the problem, it seems) to keel over near the end of a very time consuming run. Intermittent disk and scheduler faults can also trigger such conditions.
Checkpointing at the operating system level is very system-specific, and occasionally compilers can produce executable 'dump' files that continue from where your program left off, but I'm thinking that someone must have automated this process directly using conventionally-compiled GROMACS executables.
Of course, it is possible to do an exact continuation from a crashed run using .edr and trajectory (.trr) files by generating a new .tpr from the last trajectory frame that had both position and velocity data. This seems to be, by necessity, an entirely interactive process (unless someone out there has a cool auto-restart script ...).
I am thinking more in terms of 'proactive' checkpointing for long jobs, by the following process:
A script parses the desired .mdp file describing the user's MD run of T timesteps, then asks the user how many sections (N) to split the run into. The script will then auto-generate a shell script containing all the necessary GROMACS commands to:
* Generate a new .mdp file almost identical to the original, but with the number of timesteps set to T/N.
* Run N successive mdrun commands, where the output .trr and .edr files from each short run using the modified .mdp file are used to generate an 'exact restart' .tpr file for the next 'mdrun' command, with the appropriate continuation flag set.
* Log (to a file) how many of the N partial runs have been completed, in such a way that if the shell script containing the commands is restarted, it will jump to the correct point in the sequence, restarting from the most recently completed partial run.
Has anyone else already solved this problem, or have a method implementing some of the desirable properties above that I can then extend to do exactly the things described above?
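
[Illustrative sketch, not part of the original message.] The logging step in the list above could be sketched roughly as follows. Everything here is an assumption made for illustration: the names N, LOG, RUN_PART, completed_parts and run_all are hypothetical, and RUN_PART stands in for whatever GROMACS commands would actually prepare and run one section. The only idea shown is "count the completed parts in a log file and resume from the next one".

```shell
#!/bin/sh
# Sketch of a resumable N-part run. All names are hypothetical.
N=${N:-4}                  # number of sections the run is split into
LOG=${LOG:-progress.log}   # one line is appended per completed part
# Placeholder for the real commands that build the restart .tpr and
# call mdrun for one part; it receives the part number as its argument.
RUN_PART=${RUN_PART:-echo would run part}

# Print how many parts have completed (= number of lines in the log).
completed_parts() {
    if [ -f "$LOG" ]; then
        wc -l < "$LOG" | tr -d ' '
    else
        echo 0
    fi
}

# Run the remaining parts, starting after the last one that was logged.
run_all() {
    i=$(( $(completed_parts) + 1 ))
    while [ "$i" -le "$N" ]; do
        # A failed part is not logged, so a restart will repeat it.
        $RUN_PART "$i" || return 1
        echo "part $i done" >> "$LOG"
        i=$(( i + 1 ))
    done
}
```

Calling run_all at the end of such a script (and again after a crash) would skip every part already recorded in the log. A part that died halfway is simply rerun from its start, so the work lost is at most one section.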

Most queue systems allow you to chain jobs, that is, let the next one start after the previous one has finished. In PBS this is done like this:


qsub -Wdepend=afterok:prev_jobid

Combine this with a script to start the jobs and you are all set. I presume you are aware of tpbconv -extend, or tpbconv -until?
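
[Illustrative sketch, not part of the original message.] A chain of dependent PBS jobs might be submitted along these lines. It assumes that qsub prints the new job's identifier on standard output, which is then passed to the next submission. JOB_SCRIPT, N, QSUB and submit_chain are hypothetical names, and the job script itself (which would hold the tpbconv and mdrun calls for one section) is not shown.

```shell
#!/bin/sh
# Sketch: submit N runs of one job script as a PBS dependency chain.
# JOB_SCRIPT, N and QSUB are hypothetical names for this example.
QSUB=${QSUB:-qsub}                      # override for a dry run
JOB_SCRIPT=${JOB_SCRIPT:-run_part.sh}   # hypothetical per-section job
N=${N:-4}                               # number of chained sections

submit_chain() {
    prev=""
    i=1
    while [ "$i" -le "$N" ]; do
        if [ -z "$prev" ]; then
            # The first job has nothing to wait for.
            prev=$($QSUB "$JOB_SCRIPT") || return 1
        else
            # afterok: start only if the previous job exited without error.
            prev=$($QSUB -W depend=afterok:"$prev" "$JOB_SCRIPT") || return 1
        fi
        echo "submitted part $i as $prev"
        i=$(( i + 1 ))
    done
}
```

With afterok, a section that fails leaves the rest of the chain unstarted, so the remaining sections have to be resubmitted once the cause of the failure is fixed.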


--
David.
________________________________________________________________________
David van der Spoel, PhD, Assoc. Prof., Molecular Biophysics group,
Dept. of Cell and Molecular Biology, Uppsala University.
Husargatan 3, Box 596,          75124 Uppsala, Sweden
phone:  46 18 471 4205          fax: 46 18 511 755
[EMAIL PROTECTED]       [EMAIL PROTECTED]   http://folding.bmc.uu.se
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++