On Thu, Sep 1, 2016 at 6:35 PM, Kai Krakow <hurikha...@gmail.com> wrote:
> On Tue, 30 Aug 2016 17:59:02 -0400, Rich Freeman <ri...@gentoo.org>
> wrote:
>
>>
>> That depends on the mode of operation.  In journal=data I believe
>> everything gets written twice, which should make it fairly immune to
>> most forms of corruption.
>
> No, journal != data integrity.  A journal only ensures that data is
> written transactionally.  You won't end up with messed-up metadata,
> and from an API perspective and with journal=data, a partially written
> block of data will be rewritten after recovering from a crash - up to
> the last fsync.  If it happens that this last fsync was halfway into a
> file: well, then only your work up to that half of the file is written.
Well, sure, but all an application needs to do is make sure it calls
write on whole files, and not half-files.  It doesn't need to fsync as
far as I'm aware.  It just needs to write consistent files in one
system call.  Then that write either will or won't make it to disk,
but you won't get half of a write (there's a rough sketch of what I
mean further down).

> Journals only ensure consistency on API level, not integrity.

Correct, but this is way better than not journaling or ordering data
at all, which protects the metadata but doesn't ensure your files
aren't garbled even if the application is careful.

> If you need integrity, so that the file system can tell you if your
> file is broken or not, you need checksums.

Btrfs and zfs fail in the exact same way in this particular regard.
If you call write with half of a file, btrfs/zfs will tell you that
half of that file was successfully written.  But they won't hold up
for the other half of the file that the kernel hasn't been told about.

The checksumming in these filesystems really only protects data from
modification after it is written.  Sectors that were only half-written
during an outage and have inconsistent checksums probably won't even
be looked at during an fsck/mount, because the filesystem is just
going to replay the journal and write right over them (or to some new
block, still treating the half-written data as unallocated).  These
filesystems don't go scrubbing the disk to figure out what happened;
they just replay the log back to the last checkpoint.  The checksums
are only used during routine reads to ensure the data wasn't somehow
corrupted after it was written, in which case a good copy is used,
assuming one exists.  If not, at least you'll know about the problem.

> If you need a way to recover from a half-written file, you need a CoW
> file system where you could, by luck, go back some generations.

Only if you've kept snapshots, or plan to hex-edit your disk/etc.  The
solution here is to use the system calls correctly.

>> f2fs would also have this benefit.  Data is not overwritten in-place
>> in a log-based filesystem; they're essentially journaled by their
>> design (actually, they're basically what you get if you ditch the
>> regular part of the filesystem and keep nothing but the journal).
>
> This is log-structured, not journaled.  You pointed that out, yes, but
> you weakened that by writing "basically the same".  I think the
> difference is important, mostly because the journal is a fixed area on
> the disk, while a log-structured file system has no such journal.

My point was that they're equivalent from the standpoint that every
write either completes or fails and you don't get half-written data.
Yes, I know how f2fs actually works, and this wasn't intended to be a
primer on log-based filesystems.

The COW filesystems have similar benefits since they don't overwrite
data in place, other than maybe their superblocks (or whatever you
call them).  I don't know what the on-disk format of zfs is, but btrfs
has multiple copies of the tree root with a generation number, so if
something dies partway it is really easy for it to figure out where it
left off (if none of the roots were updated, then any partial tree
structures laid down are in unallocated space and just get rewritten
on the next commit; if any were written, then you have a fully
consistent new tree that is used to update the remaining roots).  One
of these days I'll have to read up on the on-disk format of zfs, as I
suspect it would make an interesting contrast with btrfs.
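To make the "whole file in one call" point concrete, here is a rough
sketch of the pattern I have in mind.  The helper name and the minimal
error handling are purely illustrative; a more defensive variant writes
the new contents to a temporary file and rename()s it over the original,
which gives you whole-file atomicity on any POSIX filesystem, data
journal or not.

#include <fcntl.h>
#include <unistd.h>

/* Write the complete new contents of a file with a single write() call,
 * rather than dribbling it out in pieces.  With ext4 data=journal the
 * whole write should either land or not land after a crash, not be
 * half-applied. */
int save_whole_file(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* One write() covering the entire buffer.  (A fully careful caller
     * would loop on short writes, which POSIX permits even for regular
     * files.) */
    ssize_t n = write(fd, buf, len);

    if (close(fd) < 0 || n < 0 || (size_t)n != len)
        return -1;
    return 0;
}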
>
> This point was raised because it supports checksums, not because it
> supports CoW.

Sure, but both provide benefits in these contexts.  And the COW
filesystems are also the only ones I'm aware of (at least in popular
use) that have checksums.

> Log-structured file systems are, btw, interesting for write-mostly
> workloads on spinning disks because head movements are minimized.
> They do not automatically help dumb/simple flash translation layers.
> That takes a little more logic, exploiting the internal structure of
> flash (writing only sequentially in page-sized blocks, with garbage
> collection and reuse only at the erase-block level).  F2fs and bcache
> (as a caching layer) do this.  Not sure about the others.

Sure.  It is just really easy to do big block erases in a log-based
filesystem, since everything tends to be written (and overwritten)
sequentially.  You can of course build a log-based filesystem that
doesn't perform well on flash.  It would still tend to have the
benefits of data journaling for free (the cost is fragmentation, which
is of course a bigger issue on spinning disks).

--
Rich