Dear Users:

A cluster that I use went down today with a chiller failure, and I lost all 16 jobs (running gromacs 4.6.1). For 13 of these jobs, not only is the .cpt file truncated but the _prev.cpt file is as well, which means I will have to go back through the output files, extract a frame, build a new .tpr file (using a new, custom .mdp file to get the start time right), restart the runs, and later join the trajectory fragments.
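For reference, the recovery I have in mind for each affected run looks roughly like the following (file names are placeholders, using md3 as the example; restart.mdp would be a copy of the production .mdp with tinit set to the time of the extracted frame and gen_vel = no; this assumes the .trr actually contains velocities at the frame I dump):

T=391000   # time (ps) of the last intact frame, e.g. reported by gmxcheck -f md3.trr
# pull that frame (coordinates and velocities) out of the full-precision trajectory
trjconv -f md3.trr -s md3.tpr -dump $T -o restart.gro
# build a new .tpr that continues the time axis where the old run stopped
grompp -f restart.mdp -c restart.gro -p topol.top -o md3b.tpr
mdrun -deffnm md3b
# ... and once md3b has run, join the fragments
trjcat  -f md3.xtc md3b.xtc -o md3_joined.xtc
eneconv -f md3.edr md3b.edr -o md3_joined.edr

It works, but it is tedious to do for 13 runs, hence this post.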
I have run into this a number of times over the years with different versions of gromacs (see, for example, http://redmine.gromacs.org/issues/790 ) and wonder whether anybody else has experienced it, and whether anybody has advice on how to handle it. For now, my idea is to run a script in the background that periodically checks the .cpt file and makes a copy whenever it is not corrupted/truncated, so that I always have a valid restart point (a sketch of such a script is at the end of this message).

In case it is useful information: the corrupted .cpt and _prev.cpt files have the same size and essentially the same timestamp as each other, but are smaller than non-corrupted .cpt files -- in fact exactly 1 MiB and 2 MiB in the two examples below, which looks like truncation at a block or buffer boundary. E.g.:

$ ls -ltr --full-time *cpt
-rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d_prev.cpt
-rw-r----- 1 cneale cneale 1963536 2013-03-22 17:18:04.000000000 -0700 md2d.cpt
-rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:02.000000000 -0700 md3.cpt
-rw-r----- 1 cneale cneale 1048576 2013-03-26 12:46:03.000000000 -0700 md3_prev.cpt

where md2d.cpt is from the last stage of my equilibration and md3.cpt is from my production run. Here is another example from a different run with corruption:

$ ls -ltr --full-time *cpt
-rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d_prev.cpt
-rw-r----- 1 cneale cneale 2209508 2013-03-21 08:24:33.000000000 -0700 md2d.cpt
-rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3_prev.cpt
-rw-r----- 1 cneale cneale 2097152 2013-03-26 12:46:01.000000000 -0700 md3.cpt

I detect the corruption/truncation in the .cpt file like this:

$ gmxcheck -f md3.cpt
Fatal error:
Checkpoint file corrupted/truncated, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors

and I confirmed the problem by trying to restart with mdrun:

$ mdrun -nt 1 -deffnm md3 -cpi md3.cpt -nsteps 5000000000
Fatal error:
Checkpoint file corrupted/truncated, or maybe you are out of disk space?
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors

(I get the same thing using md3_prev.cpt.)

I am not out of disk space, although some condition like that may well have existed when the chiller failed and the system went down:

$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
                342T   57T  281T  17% /global/scratch

Nor am I over quota (although I have no command to show that here). There is no corruption of the .edr, .trr, or .xtc files. The .log files end like this:

Writing checkpoint, step 194008250 at Tue Mar 26 10:49:31 2013
Writing checkpoint, step 194932330 at Tue Mar 26 11:49:31 2013

           Step           Time         Lambda
      195757661   391515.32200        0.00000

Writing checkpoint, step 195757661 at Tue Mar 26 12:46:02 2013

I am motivated to help solve this problem, but I have no idea how to stop gromacs from copying corrupted/truncated checkpoint files to _prev.cpt. I presume that one could write a magic number to the end of the .cpt file and test that it exists before moving .cpt to _prev.cpt, but perhaps I misunderstand the problem. If need be, perhaps mdrun could call gmxcheck, since that tool does detect the corruption/truncation; done once every 30 minutes, it should not affect performance.
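Here is a rough sketch of the watchdog script mentioned above. It is only a sketch: the file names are placeholders, and it relies on gmxcheck printing the "corrupted/truncated" fatal error shown above when the file is bad.

#!/bin/bash
# Every 30 minutes, snapshot each checkpoint, validate the snapshot with
# gmxcheck, and keep only snapshots that pass, so a verified restart point
# always exists even if mdrun later rotates a truncated .cpt into _prev.cpt.
while true; do
    for cpt in md*.cpt; do
        [ -e "$cpt" ] || continue
        cp "$cpt" "$cpt.tmp"    # copy first; mdrun may rewrite the live file at any time
        if gmxcheck -f "$cpt.tmp" 2>&1 | grep -q "corrupted/truncated"; then
            rm -f "$cpt.tmp"                # failed the check: keep the previous verified copy
        else
            mv "$cpt.tmp" "$cpt.verified"   # passed the check: this is the safe restart point
        fi
    done
    sleep 1800
done

The same ordering (verify the newly written checkpoint, whether via a trailing magic number or a gmxcheck-style read-back, and only then rotate the old file to _prev.cpt) is what I would hope mdrun could do internally, so that a truncated .cpt can never replace a good _prev.cpt.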
Thank you for any advice,
Chris.