On 3/26/13 11:13 PM, Christopher Neale wrote:
Dear Matthew:
Thank you for noticing the file size. This is a very good lead.
I had not noticed that this was special. Indeed, here is the complete listing
for the truncated/corrupt .cpt files (note that the sizes are exactly 1 MiB and 2 MiB):
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:53 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
I will contact my sysadmins and let them know about your suggestions.
Nevertheless, I respectfully reject the idea that there is really nothing that
can be done about this inside gromacs. About 6 years ago, I worked on a cluster
with massive sporadic NFS delays. The only way to automate runs on that machine
was, for example, to use sed to create an .mdp file from a template .mdp that
had ;;;EOF as its last line, and then to poll the generated .mdp for ;;;EOF
until it appeared before running grompp (at the time I was using mdrun -sort
and desorting with an in-house script prior to domain decomposition, so I had
to stop/start gromacs every couple of hours). This is not to say that such
things are ideal, but I think gromacs would be all the better if it were able
to avoid problems like this regardless of the cluster setup.
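For concreteness, a minimal sketch of that kind of workaround might look like
this (the template name, the sed substitution, and the grompp arguments are
placeholders, not the actual script I used):

#!/bin/bash
# Sketch of the workaround described above. An .mdp is generated from a
# template whose last line is the sentinel ";;;EOF"; the job then waits until
# that sentinel is actually visible (the file contents may lag behind on NFS)
# before running grompp. All names below are placeholders.

sed 's/NSTEPS_PLACEHOLDER/500000/' template.mdp > run.mdp

# In practice the polling happened on the node that would run grompp,
# which might not yet see the complete file.
while ! grep -q '^;;;EOF' run.mdp 2>/dev/null; do
    sleep 5
done

grompp -f run.mdp -c conf.gro -p topol.top -o run.tpr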
Please note that, over the years, I have seen this on 4 different clusters
(albeit with different versions of gromacs), which is to say that it's not
just one setup that is to blame.
Matthew, please don't take my comments the wrong way. I deeply appreciate your
help. I just want to put it out there that I believe gromacs would be better if
it never overwrote good .cpt files with truncated/corrupt ones, even if the
cluster catches on fire or the earth's magnetic field reverses, etc.
Also, I suspect that sysadmins don't have a lot of time to test their clusters
for a graceful exit under chiller-failure conditions, so a super-careful regime
of .cpt updates will always be useful.
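To make that concrete, the kind of super-careful update I have in mind is the
usual write-then-rename pattern. This is only an illustration of the general
pattern (produce_checkpoint is a hypothetical stand-in), not a claim about how
mdrun currently writes its checkpoints:

#!/bin/bash
# Illustration of a crash-safe file-update pattern: write the new data to a
# temporary name, flush it to disk, then atomically rename it over the old
# file. An interrupted write then only ever leaves a truncated temporary file;
# the previous good md3.cpt is never clobbered.

set -e
tmp=md3.cpt.tmp

produce_checkpoint > "$tmp"   # hypothetical command that writes the new state
sync                          # flush dirty buffers; a real writer would fsync the fd
mv "$tmp" md3.cpt             # rename() replaces the old file atomically (POSIX)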
Thank you again for your help; I'll take it to my sysadmins, who are very good
and may be able to remedy this on their cluster, but who knows what cluster I
will be using in 5 years.
Perhaps this is a case where the -cpnum option would be useful? That may cause
a lot of checkpoint files to accumulate, depending on the length of the run,
but a scripted cleanup routine could preserve some subset of them as backups.
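A minimal sketch of such a cleanup routine, assuming the numbered checkpoints
match a glob like md3_step*.cpt (the actual -cpnum naming may differ) and
keeping only the five newest:

#!/bin/bash
# Keep the 5 newest numbered checkpoint files and delete the rest.
# The glob is an assumption; adjust it to whatever your run actually produces.

keep=5
ls -1t md3_step*.cpt 2>/dev/null \
    | tail -n +$((keep + 1)) \
    | xargs -r rm --          # -r: do nothing if the list is empty (GNU xargs)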
-Justin
--
========================================
Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================