On Thu, Sep 1, 2016 at 6:35 PM, Kai Krakow <hurikha...@gmail.com> wrote:
> Am Tue, 30 Aug 2016 17:59:02 -0400
> schrieb Rich Freeman <ri...@gentoo.org>:
>
>>
>> That depends on the mode of operation.  In data=journal I believe
>> everything gets written twice, which should make it fairly immune to
>> most forms of corruption.
>
> No, journal != data integrity. The journal only ensures that data is
> written transactionally. You won't end up with messed-up metadata,
> and from the API perspective, with data=journal, a partially written
> block of data will be rewritten after recovering from a crash - up to
> the last fsync. If that last fsync happened halfway into a file,
> well, then only the first half of your work is on disk.

Well, sure, but all an application needs to do is make sure it calls
write() on whole files, not half-files.  It doesn't need to fsync as
far as I'm aware.  It just needs to write each consistent file in one
system call.  Then that write either will or won't make it to disk,
but you won't get half of a write.
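
Roughly this kind of thing, as a minimal sketch (the helper name is
made up, not from any library):

#include <fcntl.h>
#include <unistd.h>

/* Write the whole buffer with a single write() call, so the kernel
 * sees one consistent update rather than a series of partial ones.
 * Returns 0 on success, -1 on any failure (including a short write,
 * which we treat as a failed update). */
int write_whole_file(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len) {
        close(fd);
        return -1;
    }
    return close(fd);
}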

> Journals only ensure consistency on API level, not integrity.

Correct, but this is way better than only journaling metadata or
ordering data writes, which protects the metadata but doesn't ensure
your files aren't garbled even if the application is careful.

>
> If you need integrity, so then file system can tell you if your file is
> broken or not, you need checksums.
>

Btrfs and zfs fail in exactly the same way in this particular regard.
If you call write() with half of a file, btrfs/zfs will tell you that
half of the file was successfully written.  But they can't vouch for
the other half of the file that the kernel was never told about.

The checksumming in these filesystems really only protects data from
modification after it is written.  Sectors that were only half-written
during an outage, and therefore have inconsistent checksums, probably
won't even be looked at during an fsck/mount, because the filesystem
is just going to replay the journal and write right over them (or to
some new block, still treating the half-written data as unallocated).
These filesystems don't go scrubbing the disk to figure out what
happened; they just replay the log back to the last checkpoint.  The
checksums are just used during routine reads to ensure the data wasn't
somehow corrupted after it was written, in which case a good copy is
used, assuming one exists.  If not, at least you'll know about the
problem.
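
As a rough illustration of that read-side behavior (purely my sketch,
not actual btrfs/zfs code; read_copy() and block_checksum() are
made-up stand-ins):

#include <stddef.h>

/* Hypothetical stand-ins for "read mirror copy i" and "checksum the
 * buffer"; real filesystems have their own versions of these. */
extern int read_copy(int copy, void *buf, size_t len);   /* 0 = success */
extern unsigned long block_checksum(const void *buf, size_t len);

/* Try each stored copy in turn; return the first one whose checksum
 * matches what the metadata says it should be. */
int read_block_checked(void *buf, size_t len, int ncopies,
                       unsigned long expected)
{
    for (int i = 0; i < ncopies; i++) {
        if (read_copy(i, buf, len) != 0)
            continue;                  /* I/O error: try the next copy */
        if (block_checksum(buf, len) == expected)
            return 0;                  /* good copy found */
        /* checksum mismatch: this copy was corrupted after it was
         * written, so fall through and try another one */
    }
    return -1;                         /* no good copy: at least you know */
}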

> If you need a way to recover from a half-written file, you need a CoW
> file system where you could, by luck, go back some generations.

Only if you've kept snapshots, or plan to hex-edit your disk, etc.  The
solution here is to use the system calls correctly.
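
For completeness, the usual shape of "using the system calls
correctly" when replacing a file looks something like this sketch
(helper names are mine; whether the fsync is strictly required depends
on the filesystem's ordering guarantees, as above): write the new
contents to a temporary file, then rename() it over the original,
since rename() is atomic.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace path with new contents so that a crash leaves either the
 * old file or the new one on disk, never a mix of the two. */
int replace_file(const char *path, const char *tmp_path,
                 const char *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }
    /* Atomic swap: readers see either the old file or the new one. */
    return rename(tmp_path, path);
}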

>
>> f2fs would also have this benefit.  Data is not overwritten in-place
>> in a log-based filesystem; they're essentially journaled by their
>> design (actually, they're basically what you get if you ditch the
>> regular part of the filesystem and keep nothing but the journal).
>
> This is log-structured, not journalled. You pointed that out, yes, but
> you weakened it by writing "basically the same". I think the
> difference is important. Mostly because the journal is a fixed area on
> the disk, while a log-structured file system has no such journal.

My point was that they're equivalent from the standpoint that every
write either completes or fails and you don't get half-written data.
Yes, I know how f2fs actually works, and this wasn't intended to be a
primer on log-based filesystems.  The COW filesystems have similar
benefits since they don't overwrite data in place, other than maybe
their superblocks (or whatever you call them).  I don't know what the
on-disk format of zfs is, but btrfs has multiple copies of the tree
root with a generation number, so if something dies partway through it
is really easy for it to figure out where it left off (if none of the
roots were updated then any partial tree structures laid down are in
unallocated space and just get rewritten on the next commit, and if
any were written then you have a fully consistent new tree that is
used to update the remaining roots).
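
To illustrate the mount-time part of that (my simplification, not
btrfs's actual code or on-disk structures):

#include <stddef.h>

/* Simplified stand-in for a superblock/tree-root copy. */
struct root_copy {
    unsigned long long generation;  /* bumped on every commit */
    int checksum_ok;                /* assume verified elsewhere */
};

/* Pick the newest copy that verified cleanly; half-written copies
 * fail their checksum and are simply ignored, and anything written
 * after the surviving root is just unreferenced space. */
static const struct root_copy *
pick_root(const struct root_copy *roots, int n)
{
    const struct root_copy *best = NULL;
    for (int i = 0; i < n; i++) {
        if (!roots[i].checksum_ok)
            continue;
        if (best == NULL || roots[i].generation > best->generation)
            best = &roots[i];
    }
    return best;    /* NULL if no copy survived */
}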

One of these days I'll have to read up on the on-disk format of zfs as
I suspect it would make an interesting contrast with btrfs.

>
> This point was raised because it supports checksums, not because it
> supports CoW.

Sure, but both provide benefits in these contexts.  And the COW
filesystems are also the only ones I'm aware of (at least in popular
use) that have checksums.

>
> Log-structured file systems are, btw, interesting for write-mostly
> workloads on spinning disks because head movements are minimized.
> They do not automatically help dumb/simple flash translation layers.
> That requires a little more logic, exploiting the internal structure
> of flash (writing only sequentially in page-sized blocks, garbage
> collection and reuse only at the erase-block level). F2fs and bcache
> (as a caching layer) do this. Not sure about the others.

Sure.  It is just really easy to do big block erases in a log-based
filesystem since everything tends to be written (and overwritten)
sequentially.  You can of course build a log-based filesystem that
doesn't perform well on flash.  It would still tend to have the
benefits of data journaling for free; the cost is fragmentation, which
is of course a bigger issue on spinning disks.

-- 
Rich
