Dear Matthew:

Thank you for noticing the file sizes. This is a very good lead; I had not noticed that they were special. Indeed, here is the complete listing of the truncated/corrupt .cpt files:
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:53 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:54 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:50 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
-rw-r----- 1 cneale cneale 1048576 Mar 26 18:51 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt
-rw-r----- 1 cneale cneale 2097152 Mar 26 18:52 md3.cpt

I will contact my sysadmins and let them know about your suggestions. Nevertheless, I respectfully reject the idea that nothing can really be done about this inside gromacs. About 6 years ago, I worked on a cluster with massive sporadic NFS delays. The only way to automate runs on that machine was, for example, to use sed to create a .mdp from a template .mdp file that had ;;;EOF as its last line, and then to poll the created .mdp file for ;;;EOF until it appeared before running grompp (at the time I was using mdrun -sort and desorting with an in-house script prior to domain decomposition, so I had to stop and restart gromacs every couple of hours). This is not to say that such things are ideal, but I think gromacs would be all the better if it were able to avoid problems like this regardless of the cluster setup. Please note that, over the years, I have seen this on 4 different clusters (albeit with different versions of gromacs), which is to say that it's not just one setup that is to blame.

Matthew, please don't take my comments the wrong way. I deeply appreciate your help. I just want to put it out there that I believe gromacs would be better if it never overwrote good .cpt files with truncated/corrupt .cpt files, even if the cluster catches on fire or the earth's magnetic field reverses, etc. Also, I suspect that sysadmins don't have a lot of time to test their clusters for a graceful exit under chiller-failure conditions, so a super-careful regime of .cpt updates will always be useful.

Thank you again for your help. I'll take this to my sysadmins, who are very good and may be able to remedy it on their cluster, but who knows what cluster I will be using in 5 years.

Again, thank you for your assistance, it is very useful,
Chris.
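As a rough illustration of the polling workaround described above (not Chris's actual sed/shell script, which is not shown), a minimal Python sketch might look like the following. The template name, the NSTEPS placeholder, and the grompp command line are assumptions made purely for illustration.

import subprocess
import time

SENTINEL = ";;;EOF"

def write_mdp(template_path, mdp_path, substitutions):
    # Fill in the template and make the sentinel the very last line,
    # so a later reader can tell when the whole file has arrived.
    with open(template_path) as f:
        text = f.read()
    for key, value in substitutions.items():
        text = text.replace(key, value)
    with open(mdp_path, "w") as f:
        f.write(text.rstrip("\n") + "\n" + SENTINEL + "\n")

def wait_for_sentinel(mdp_path, interval=5.0, timeout=600.0):
    # Poll until the sentinel line is visible, i.e. until the (possibly
    # laggy) filesystem has actually delivered the complete file.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(mdp_path) as f:
                if any(line.strip() == SENTINEL for line in f):
                    return True
        except OSError:
            pass  # the file may not even be visible yet
        time.sleep(interval)
    return False

# Hypothetical usage; file names and grompp arguments are illustrative only.
write_mdp("template.mdp", "md.mdp", {"NSTEPS": "500000"})
if wait_for_sentinel("md.mdp"):
    subprocess.run(["grompp", "-f", "md.mdp", "-c", "conf.gro",
                    "-p", "topol.top", "-o", "md.tpr"], check=True)

The same idea applies to any file that has to cross a slow filesystem before the next tool consumes it: write a sentinel last, and refuse to proceed until the sentinel is visible.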
-- original message --

Dear Chris,

While it's always possible that GROMACS can be improved (or debugged), this smells more like a system-level problem. The corrupt checkpoint files are precisely 1 MiB or 2 MiB, which suggests strongly either

1) GROMACS was in the middle of a buffer flush when it was killed (but the filesystem did everything right; it was just sent incomplete data), or
2) the filesystem itself wrote a truncated file (but GROMACS wrote it successfully, the data was buffered, and GROMACS went on its merry way).

#1 could happen, for example, if GROMACS was killed with SIGKILL while copying .cpt to _prev.cpt -- if GROMACS even copies, rather than renames -- its checkpoint files. #2 could happen in any number of ways, depending on precisely how your disks, filesystems, and network filesystems are all configured (for example, if a RAID array goes down hard with per-drive writeback caches enabled, or your NFS system is soft-mounted and either client or server goes down).

With the sizes of the truncated checkpoint files being very convenient numbers, my money is on #2. Have you contacted your sysadmins to report this? They may be able to take some steps to try to prevent this, and (if this is indeed a system problem) doing so would provide all their users an increased measure of safety for their data.

Cheers,
MZ
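On the question of a "super-careful regime of .cpt update": the usual pattern for never replacing a good file with a truncated one is to write the new data to a temporary file, flush and fsync it, and only then rename it over the old file, since a POSIX rename swaps the complete new file into place atomically. The sketch below is a generic Python illustration of that pattern under the assumption of a POSIX-style filesystem; it is not GROMACS's actual checkpointing code, and the file name and payload are made up. Over a soft-mounted or misbehaving NFS setup (Matthew's case #2) even this cannot help if the server loses data it has already acknowledged.

import os

def write_checkpoint_atomically(path, data):
    # Write the new checkpoint beside the old one, force it to disk,
    # and only then atomically rename it into place. A crash at any
    # point leaves either the old complete file or the new complete
    # file, never a truncated mixture.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # push the bytes out of the OS buffers
    os.replace(tmp_path, path)    # atomic rename on POSIX filesystems
    # Best effort: fsync the directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    except OSError:
        pass                      # not all filesystems support this
    finally:
        os.close(dir_fd)

# Hypothetical usage with a made-up payload:
write_checkpoint_atomically("md3.cpt", b"checkpoint contents go here")

If a _prev.cpt backup is also wanted, renaming the old checkpoint aside before moving the new one into place avoids the interruptible copy that Matthew mentions.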