On Fri, Mar 29, 2013 at 12:25 PM, Berk Hess <g...@hotmail.com> wrote:
> Hi,
>
> I don't know enough about the details of Lustre to understand what's going
> on exactly.
> But I think mdrun can't do more than check the return value of fsync and
> believe that the file is completely flushed to disk. Possibly Lustre does
> some syncing, but doesn't actually flush the file physically to disk, which
> could lead to corruption when power goes down unexpectedly.
> But I hope this would happen so infrequently that you can take your losses
> (of up to the queue time, which is, hopefully, around 24 hours).

Full loss of checkpoint data might occur if fsync() has incorrectly returned 0 at least twice and writing to disk has not actually completed. But that does not sound like what happened to Chris in the few minutes before the power loss.

When writing a new checkpoint, GROMACS writes the new checkpoint to a temporary name, copies the old checkpoint file to _prev.cpt, and only then renames the temporary checkpoint file to the normal name. This way there is always a valid checkpoint file on disk, even if there's a failure during any stage. But if the filesystem is coded/configured to cheat such that any of those operations are not as atomic as they should be (and that's a tempting thing to do, to "speed up" the filesystem), and more than one file operation is actually still pending when a power failure occurs, there will be a problem. Chris's .cpt timestamps are perhaps consistent with this scenario.

If such a scenario could exist, the best thing mdrun could do is to re-load the checkpoint files after closing or renaming, and re-compute the md5sum (like gmxcheck does). However, the filesystem would probably just give mdrun the copy that is sitting around in a memory buffer waiting to be flushed, so that is false security (and costs execution time on all the correctly-functioning filesystems). If the return from fsync() was a lie, then there's nothing upon which GROMACS can rely, I'm afraid.
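For reference, the write-then-rotate sequence described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual GROMACS C code; the file names, the data argument, and the md5 verification helper are assumptions for the example:

```python
import hashlib
import os
import shutil

def write_checkpoint(path, data):
    """Crash-tolerant checkpoint update: write under a temporary name,
    fsync, copy the old checkpoint to _prev, then rename into place."""
    tmp = path + ".tmp"
    prev = path.replace(".cpt", "_prev.cpt")

    # 1. Write the new checkpoint under a temporary name and fsync it,
    #    so the data is (reportedly) on disk before any rename happens.
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. Preserve the previous checkpoint as _prev.cpt.
    if os.path.exists(path):
        shutil.copy2(path, prev)

    # 3. Move the new file into place; rename is atomic on a POSIX
    #    filesystem, so a complete checkpoint exists under some name
    #    at every step.
    os.replace(tmp, path)

def md5sum(path):
    """Re-read a file and hash it, gmxcheck-style. Caveat from the
    thread: the read may be served from the page cache, not the disk,
    so this does not prove the data survived a power loss."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```

If fsync() returns success without the data actually being durable, or if the rename/copy steps are reordered by the filesystem, this scheme can still lose both files, which is exactly the failure mode being discussed.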
Mark

> I assume your problem is that you don't even have the checkpoint file of
> the previous simulation part left. Another option would then be using mdrun
> -noappend
>
> Cheers,
>
> Berk
>
> ----------------------------------------
> > From: chris.ne...@mail.utoronto.ca
> > To: gmx-users@gromacs.org
> > Date: Fri, 29 Mar 2013 01:15:06 +0000
> > Subject: [gmx-users] chiller failure leads to truncated .cpt and
> > _prev.cpt files using gromacs 4.6.1
> >
> > Thank you, Berk, Justin, and Matthew, for your assistance.
> >
> > I checked with my sysadmin, who said:
> >
> > The /global/scratch FS is Lustre. It is fully POSIX, and fsync etc.
> > are fully and well implemented. However, when the 'power off' command
> > is issued, there is no way the OS can finish I/O in a controlled way.
> >
> > Note that the power off command was given when they realized that they
> > had lost all cooling in the data room, and they had just a few minutes
> > to react, forcing them to shut down all compute nodes.
> >
> > Justin's suggestion to use -cpnum is good, although I think it will be
> > easier to simply have a script that runs gmxcheck once every 12 hours
> > and backs up the .cpt file if it is ok.
> >
> > I don't know enough about computer OSs to say if there is any possible
> > way for gromacs to avoid this in the future, but if it was possible,
> > then it would be useful.
> >
> > Thank you again,
> > Chris.
> >
> > -- original message --
> >
> > Gromacs calls fsync for every checkpoint file written:
> >
> > fsync() transfers ("flushes") all modified in-core data of (i.e.,
> > modified buffer cache pages for) the file referred to by the file
> > descriptor fd to the disk device (or other permanent storage device)
> > so that all changed information can be retrieved even after the system
> > crashed or was rebooted. This includes writing through or flushing a
> > disk cache if present. The call blocks until the device reports that
> > the transfer has completed.
> > It also flushes metadata information associated with the file (see
> > stat(2)).
> >
> > If fsync fails, mdrun exits with a fatal error.
> > We have experience with unreliable AFS file systems, where mdrun could
> > wait for hours on fsync and then fail, for which we added an
> > environment variable.
> > So either fsync is not supported on your system (highly unlikely),
> > or your file system returns 0, indicating the file was synced, when it
> > actually didn't fully sync.
> >
> > Note that we first write a new checkpoint file with a number, fsync
> > that, then move the current to _prev (thereby losing the old prev),
> > and then the numbered one to the current.
> > So you should never end up with only corrupted files, unless fsync
> > doesn't do what it's supposed to do.
> >
> > Cheers,
> >
> > Berk
> >
> > --
> > gmx-users mailing list gmx-users@gromacs.org
> > http://lists.gromacs.org/mailman/listinfo/gmx-users
> > * Please search the archive at
> > http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
> > * Please don't post (un)subscribe requests to the list. Use the
> > www interface or send it to gmx-users-requ...@gromacs.org.
> > * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists