James Andrewartha <jam...@daa.com.au> wrote:

> Recently there's been discussion [1] in the Linux community about how
> filesystems should deal with rename(2), particularly in the case of a crash.
> ext4 was found to truncate files after a crash, that had been written with
> open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
> because ext4 uses delayed allocation and may not write the contents to disk
> immediately, but commits metadata changes quite frequently. So when
> rename("foo.tmp","foo") is committed to disk, it has a length of zero which
> is later updated when the data is written to disk. This means after a crash,
> "foo" is zero-length, and both the new and the old data has been lost, which
> is undesirable. This doesn't happen when using ext3's default settings
> because ext3 writes data to disk before metadata (which has performance
> problems, see Firefox 3 and fsync[2])
>
> Ted T'so's (the main author of ext3 and ext4) response is that applications
> which perform open(),write(),close(),rename() in the expectation that they
> will either get the old data or the new data, but not no data at all, are
> broken, and instead should call open(),write(),fsync(),close(),rename().
> Most other people are arguing that POSIX says rename(2) is atomic, and while
> POSIX doesn't specify crash recovery, returning no data at all after a crash
> is clearly wrong, and excessive use of fsync is overkill and
> counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
> fsync). I've omitted a lot of detail, but I think this is the core of the
> argument.
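For reference, the open(), write(), fsync(), close(), rename() sequence
recommended in the quoted text could be wrapped up roughly as follows. This
is only a sketch under assumptions: the helper name replace_file(), the loop
that retries short writes, and the unlink() of the temporary file on failure
are illustrative additions, not something taken from the thread, and it
covers only the calls named there.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and cleanup policy are assumptions):
 * write len bytes to tmp_path, force them to stable storage, then
 * rename tmp_path over dst_path.
 * Returns 0 on success, -1 on failure with errno from the failing call.
 */
int replace_file(const char *tmp_path, const char *dst_path,
                 const void *data, size_t len)
{
    const char *p = data;
    int saved;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0666);

    if (fd < 0)
        return -1;

    /* write() may transfer fewer bytes than requested, so loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }

    /* The data has to be on disk before the rename makes it visible. */
    if (fsync(fd) < 0)
        goto fail;

    if (close(fd) < 0) {
        fd = -1;                /* descriptor is gone either way */
        goto fail;
    }
    fd = -1;

    if (rename(tmp_path, dst_path) < 0)
        goto fail;
    return 0;

fail:
    saved = errno;              /* keep the errno of the failing call */
    if (fd >= 0)
        close(fd);
    unlink(tmp_path);           /* do not leave a partial temp file */
    errno = saved;
    return -1;
}
```

A caller would then check the single return value, for example:
if (replace_file("new", "old", "dat", 3) < 0) perror("replace_file");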
The problem in this case is not whether rename() is atomic, but whether the
file that replaces the old file in an atomic rename() operation is in a
stable state on the disk before rename() is called.

The calling sequence of the failing code was:

    f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    write(f, "dat", size);
    close(f);
    rename("new", "old");

The only guaranteed way to have the file "new" in a stable state on the disk
is to call:

    f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    write(f, "dat", size);
    fsync(f);
    close(f);

Do not forget to check error codes.

If the application calls:

    f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (write(f, "dat", size) != size)
        fail();
    if (fsync(f) < 0)
        fail();
    if (close(f) < 0)
        fail();
    if (rename("new", "old") < 0)
        fail();

and if after a crash there is neither the old file nor the new file on the
disk in a consistent state, then you may blame the file system.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       j...@cs.tu-berlin.de (uni)
       joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss