Steven Kirk wrote:
> Hello,
>
> I have been using GROMACS for some very long (in wall clock terms)
> simulations, and am curious as to how other users on this list solve the
> problem of checkpointing long MD runs. It's a problem because of the
> tendency of computational nodes in large HPC facilities (the more
> processors, the more prevalent the problem, it seems) to keel over near
> the end of a very time consuming run. Intermittent disk and scheduler
> faults can also trigger such conditions.
>
> Checkpointing at the operating system level is very system-specific, and
> occasionally compilers can produce executable 'dump' files that continue
> from where your program left off, but I'm thinking that someone must
> have automated this process directly using conventionally-compiled
> GROMACS executables.
>
> Of course, it is possible to do an exact continuation from a crashed run
> using .edr and trajectory (.trr) files by generating a new .tpr from the
> last trajectory frame that had both position and velocity data. This
> seems to be, by necessity, an entirely interactive process (unless
> someone out there has a cool auto-restart script ..).
>
> I am thinking more in terms of 'proactive' checkpointing for long jobs,
> by the following process:
>
> A script parses the desired .mdp file describing the user's MD run of T
> timesteps, then asks the user how many sections (N) to split the run
> into. The script will then auto-generate a shell script containing all
> the necessary GROMACS commands to:
>
> * Generate a new .mdp file almost identical to the original, but with
> the number of timesteps set to T/N.
>
> * Run N successive mdrun commands, where the output .trr and .edr files
> from each short run using the modified .mdp file are used, to generate
> an 'exact restart' .tpr file for the next 'mdrun' command, with the
> appropriate continuation flag set.
>
> * Log (to a file) how many of the N partial runs have been completed, in
> such a way that if the shell script containing the commands is
> restarted, it will jump to the correct point in the sequence, restarting
> from the most recently completed partial run.
>
> Has anyone else already solved this problem, or have a method
> implementing some of the desirable properties above that I can then
> extend to do exactly the things described above?

I have just posted some of my scripts to the wiki:

http://wiki.gromacs.org/index.php/Checkpointing_Jobs
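
For anyone who only wants the general shape of the split-and-resume scheme described in the quoted message, here is a rough Python sketch. It is not one of the wiki scripts, only an illustration under a few assumptions: the .mdp file has a plain "nsteps = <integer>" line, one line in the progress log means one finished partial run, and all function names (split_steps, read_nsteps, with_nsteps, completed_chunks, run_all) are made up for this example. The GROMACS commands themselves are deliberately left to a user-supplied run_chunk callable, since the exact restart procedure depends on the GROMACS version in use.

```python
import re
from pathlib import Path

# Matches an uncommented "nsteps = <integer>" line in .mdp text.
NSTEPS_RE = re.compile(r"^(\s*nsteps\s*=\s*)(\d+)", re.MULTILINE)


def split_steps(total, n):
    """Split `total` steps into n chunk sizes that sum exactly to total."""
    base, extra = divmod(total, n)
    return [base + 1 if i < extra else base for i in range(n)]


def read_nsteps(mdp_text):
    """Return the integer nsteps value found in the .mdp text."""
    match = NSTEPS_RE.search(mdp_text)
    if match is None:
        raise ValueError("no 'nsteps = <integer>' line found")
    return int(match.group(2))


def with_nsteps(mdp_text, nsteps):
    """Return the .mdp text with only the nsteps value replaced."""
    return NSTEPS_RE.sub(lambda m: m.group(1) + str(nsteps), mdp_text, count=1)


def completed_chunks(log_path):
    """Count finished partial runs: one non-empty log line per run."""
    path = Path(log_path)
    if not path.exists():
        return 0
    return sum(1 for line in path.read_text().splitlines() if line.strip())


def run_all(mdp_path, n, log_path, run_chunk):
    """Run the partial runs not yet recorded in the progress log.

    run_chunk(index, chunk_mdp_text) is a hypothetical user-supplied
    callable that performs one partial run and raises on failure.
    """
    text = Path(mdp_path).read_text()
    chunks = split_steps(read_nsteps(text), n)
    for i in range(completed_chunks(log_path), n):
        run_chunk(i, with_nsteps(text, chunks[i]))
        # Record the run only after it finished without raising.
        with open(log_path, "a") as log:
            log.write(f"chunk {i} done\n")
```

Calling run_all again after a crash skips every partial run already listed in the log and resumes with the first one that is missing. The log line is written only after run_chunk returns, so an interrupted partial run is repeated rather than skipped.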
