James Andrewartha <jam...@daa.com.au> wrote:

> Recently there's been discussion [1] in the Linux community about how
> filesystems should deal with rename(2), particularly in the case of a crash.
> ext4 was found to truncate files after a crash, that had been written with
> open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). This is
> because ext4 uses delayed allocation and may not write the contents to disk
> immediately, but commits metadata changes quite frequently. So when
> rename("foo.tmp","foo") is committed to disk, it has a length of zero which
> is later updated when the data is written to disk. This means after a crash,
> "foo" is zero-length, and both the new and the old data has been lost, which
> is undesirable. This doesn't happen when using ext3's default settings
> because ext3 writes data to disk before metadata (which has performance
> problems, see Firefox 3 and fsync[2])
>
> Ted Ts'o's (the main author of ext3 and ext4) response is that applications
> which perform open(),write(),close(),rename() in the expectation that they
> will either get the old data or the new data, but not no data at all, are
> broken, and instead should call open(),write(),fsync(),close(),rename().
> Most other people are arguing that POSIX says rename(2) is atomic, and while
> POSIX doesn't specify crash recovery, returning no data at all after a crash
> is clearly wrong, and excessive use of fsync is overkill and
> counter-productive (Ted later proposes a "yes-I-really-mean-it" flag for
> fsync). I've omitted a lot of detail, but I think this is the core of the
> argument.

The problem in this case is not whether rename() is atomic but whether the
file that replaces the old file in an atomic rename() operation is in a 
stable state on the disk before calling rename().

The calling sequence of the failing code was:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, "dat", size);
close(f);
rename("new", "old");

The only guaranteed way to get the file "new" into a stable state on the disk
is to call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, "dat", size);
fsync(f);
close(f);

Do not forget to check the return codes.

If the application calls:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
if (f < 0)
        fail();
if (write(f, "dat", size) != size)
        fail();
if (fsync(f) < 0)
        fail();
if (close(f) < 0)
        fail();
if (rename("new", "old") < 0)
        fail();

and if, after a crash, neither the old file nor the new file is on
the disk in a consistent state, then you may blame the file system.
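
For reference, the sequence above could be wrapped up roughly as follows.
This is only a sketch under a few assumptions: the names replace_file() and
write_all() are made up for this example, a short write is retried rather
than treated as a failure, and the temporary file is removed on error.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
        while (len > 0) {
                ssize_t n = write(fd, buf, len);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;
                        return -1;
                }
                buf += n;
                len -= (size_t)n;
        }
        return 0;
}

/* Remove the temporary file but keep the errno of the failing call. */
static int fail_cleanup(const char *tmp)
{
        int saved = errno;
        unlink(tmp);
        errno = saved;
        return -1;
}

/*
 * Hypothetical helper: replace "path" with new contents written via "tmp".
 * The data is forced to disk with fsync() before rename() is called.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int replace_file(const char *tmp, const char *path,
                 const char *data, size_t len)
{
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0)
                return -1;

        if (write_all(fd, data, len) < 0 || fsync(fd) < 0) {
                int saved = errno;
                close(fd);
                errno = saved;
                return fail_cleanup(tmp);
        }
        if (close(fd) < 0)
                return fail_cleanup(tmp);
        if (rename(tmp, path) < 0)
                return fail_cleanup(tmp);
        return 0;
}
```

A caller would use it as replace_file("new", "old", "dat", 3) and treat a
return value of -1 as the fail() case from the sequence above.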


Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       j...@cs.tu-berlin.de                (uni)  
       joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily