@Chris

I hate to keep repeating myself, but the 2.6.30 patches will cause
open-write-close-rename (what I call "replace via rename") to have the
semantics you want.  They do that by forcing a block allocation on the
rename, and then when you do the journal commit, it will block waiting
for the data writes to complete.  So it will do what you want.
Please note that this is an ext4-specific hack; there is no guarantee
that btrfs, ZFS, tux3, or reiser4 will implement anything like this.
And all of these filesystems do implement delayed allocation, and will
have exactly the same issue.  You and others keep talking about how this
is a MUST implement, but the reality is that it is not mandated by
POSIX, and implementing these sorts of things will hurt benchmarks and
real-life server workloads.  So don't count on other filesystems
implementing the
same hacks.
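
For anyone who wants "replace via rename" to be safe without relying on
a filesystem-specific heuristic, here is a minimal sketch of the pattern
with an explicit fsync() before the rename.  The function name and the
".tmp" suffix are hypothetical, chosen only for illustration, and the
sketch assumes a POSIX system where rename() atomically replaces the
target.

```python
import os


def replace_via_rename(path, data):
    """Replace the file at `path` with `data` (bytes).

    Hypothetical helper: the new contents are flushed to disk
    before the rename makes them visible under the final name.
    """
    tmp_path = path + ".tmp"  # assumed temp-file naming scheme
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            # os.write() may write fewer bytes than requested.
            offset += os.write(fd, data[offset:])
        # Force the data blocks out before the rename is issued.
        os.fsync(fd)
    finally:
        os.close(fd)
    # Atomically swap the new file into place.
    os.rename(tmp_path, path)
```

With the fsync() in place, a crash should leave either the old contents
or the new contents under the final name, on any filesystem that honours
fsync(), at the cost of waiting for the write to reach the disk.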

@CowbowTim,

Actually ext4's fsync() is smarter; it won't force out other files' data
blocks, because of delayed allocation.   If you write a new 1G file,
thanks to delayed allocation, the blocks aren't allocated, so an fsync()
of some other file will not cause that 1G file to be forced out to disk.
What will happen instead is that the VM subsystem will gradually dribble
out that 1G file over a period of time controlled by
/proc/sys/vm/dirty_expire_centisecs and
/proc/sys/vm/dirty_writeback_centisecs.
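
To see what those two knobs are set to on a given machine, something
like the following sketch can be used.  The helper name and its `base`
parameter are hypothetical; the only thing taken from the above is the
two /proc paths.  The values are in hundredths of a second, so the
sketch converts them to seconds.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def read_centisecs(name, base=VM_DIR):
    """Return the named tunable in seconds, or None if it is unreadable.

    Hypothetical helper; `base` exists only so the directory can be
    overridden (for example on a system without /proc).
    """
    try:
        raw = (Path(base) / name).read_text()
    except OSError:
        return None
    return int(raw.strip()) / 100.0


if __name__ == "__main__":
    for name in ("dirty_expire_centisecs", "dirty_writeback_centisecs"):
        print(name, read_centisecs(name))
```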

This problem you describe with fsync() and ext3's data=ordered mode is
unique to ext3; no other filesystem has it.  Fortunately or
unfortunately, ext3 is the most commonly used filesystem, so people
have gotten used to its quirks, and worse yet, seem to assume that they
are true for all other filesystems.  One of the reasons why we
implemented delayed allocation was precisely to solve this problem.  Of
course, we're now running into the issue that there are people who have
been avoiding fsync() at all costs thanks to ext3, so now we're trying
to implement some hacks so that ext4 will behave somewhat similarly to
ext3 in at least some circumstances.

The problem here is really balance; if I implement a
data=alloc-on-commit mode, it will have all of the downsides of ext3
with respect to fsync() being slow for "entangled writes" (where you
have both a large file which you are copying and a small file which you
are fsync()'ing).  So it will encourage the same bad behaviour, which
means people will still have the same bad habits when they decide they
want to switch to some new, more featureful filesystem like btrfs.  The
one good thing about "alloc-on-replace-via-truncate" and
"alloc-on-replace-via-rename" is that they handle the most annoying set
of problems (an existing file that is being rewritten turning into a
zero-length file on a crash), without necessarily causing an implied
fsync() on commit for all dirty files (which is what ext3 was doing).
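
For reference, "replace via truncate" is the open-with-O_TRUNC, write,
close sequence.  A minimal sketch of it with an explicit fsync() added
is below; the function name is hypothetical.  Note the assumption built
into the pattern itself: even with the fsync(), a crash between the
truncate and the fsync() can still leave an empty or partial file, which
is why the rename variant is generally the safer of the two.

```python
import os


def rewrite_in_place(path, data):
    """Overwrite `path` with `data` (bytes) via truncate-and-write.

    Hypothetical helper illustrating "replace via truncate".
    """
    # O_TRUNC discards the old contents as soon as the file is opened.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            # os.write() may write fewer bytes than requested.
            offset += os.write(fd, data[offset:])
        # Ask for the new contents to be on disk before we return.
        os.fsync(fd)
    finally:
        os.close(fd)
```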

It's interesting that some people keep talking about how the implied
fsync() is so terrible, while simultaneously arguing that ext3's
behaviour is what they want.  What ext3 was doing was effectively a
forced fsync() for all dirty files at each commit (which happens every 5
seconds by default).  Maybe people didn't realize that was what was
going on, but that's precisely what ext3's data=ordered means.

-- 
Ext4 data loss
https://bugs.launchpad.net/bugs/317781
