On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
> Didn't try zfs-on-linux or other platforms yet.

I think ZFS will be an outlier here, at least in a pure write()/fsync() test. (1) It doesn't even use the OS page cache, except when you mmap()*. (2) Its idea of syncing data is to journal it, and its journal presumably isn't in the OS page cache. In other words it doesn't use Linux's usual write-back code paths.

While contemplating what exactly it would do (not sure), I came across an interesting old thread on the freebsd-current mailing list that discusses UFS, ZFS and the meaning of POSIX fsync(). Here we see a report of FreeBSD + UFS doing exactly what the code suggests:

https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html

That is, it keeps the pages dirty so it tells the truth later. Apparently like Solaris/Illumos (based on drive-by code inspection, see explicit treatment of retrying, though I'm not entirely sure if the retry flag is set just for async write-back), and apparently unlike every other kernel I've tried to grok so far (things descended from ancestral BSD but not descended from FreeBSD, with macOS/Darwin apparently in the first category for this purpose).

Here's a new ticket in the NetBSD bug database for this stuff:

http://gnats.netbsd.org/53152

As mentioned in that ticket and by Andres earlier in this thread, keeping the page dirty isn't the only strategy that would work, and it may be problematic in different ways (it tells the truth but floods your cache with unflushable stuff until eventually you force unmount it and your buffers are eventually invalidated after ENXIO errors? I don't know.). I have no qualified opinion on that. I just know that we need a way for fsync() to tell the truth about all preceding writes or our checkpoints are busted.

*We mmap() + msync() in pg_flush_data() if you don't have sync_file_range(), and I see now that that is probably not a great idea on ZFS because you'll finish up double-buffering (or is that triple-buffering?), flooding your page cache with transient data. Oops. That is off-topic and not relevant for the checkpoint correctness topic of this thread though, since pg_flush_data() is advisory only.

-- 
Thomas Munro
http://www.enterprisedb.com
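
To make the "fsync() must tell the truth about all preceding writes" requirement concrete, here is a minimal C sketch of the caller's side of that contract. The helper name write_and_sync() is hypothetical and is not PostgreSQL code; it only shows a caller treating the first fsync() failure as final rather than retrying, on the assumption that the kernel may have dropped the dirty state of the failed pages.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): write len bytes to path and
 * fsync() them.  Returns 0 only if every write() and the fsync()
 * succeeded, otherwise -1.
 */
static int
write_and_sync(const char *path, const char *buf, size_t len)
{
    int     fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    size_t  done = 0;

    if (fd < 0)
        return -1;

    /* write() may be partial or interrupted, so loop until all is written. */
    while (done < len)
    {
        ssize_t n = write(fd, buf + done, len - done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        done += (size_t) n;
    }

    /*
     * Treat the first fsync() failure as final.  Calling fsync() again and
     * trusting a later 0 is only safe on kernels that keep the failed pages
     * dirty; elsewhere the second call may report success even though the
     * data never reached stable storage.
     */
    if (fsync(fd) != 0)
    {
        int saved_errno = errno;

        fprintf(stderr, "fsync(\"%s\") failed: %s\n",
                path, strerror(saved_errno));
        close(fd);
        errno = saved_errno;
        return -1;
    }

    return close(fd) == 0 ? 0 : -1;
}
```

A caller would do something like `if (write_and_sync(path, buf, len) != 0)` and then abandon the operation (for a checkpoint, that means not advancing past the unsynced data) instead of looping on fsync() until it returns 0.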