On Tue, Apr 11, 2023 at 2:15 PM Andres Freund <and...@anarazel.de> wrote:
> And the fix has been merged into
> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next
>
> I think that means it'll have to wait for 6.4 development to open (in a few
> weeks), and then will be merged into the stable branches from there.

Great!  Let's hope/assume for now that that'll fix phenomenon #2.

That still leaves the checksum-vs-concurrent-modification thing that I
called phenomenon #1, which we've not actually hit with PostgreSQL yet
but is clearly possible and can be seen with the stand-alone
repro-program I posted upthread.  You wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <and...@anarazel.de> wrote:
> I think we really need to think about whether we eventually we want to do
> something to avoid modifying pages while IO is in progress. The only
> alternative is for filesystems to make copies of everything in the IO path,
> which is far from free (and obviously prevents from using DMA for the whole
> IO). The copy we do to avoid the same problem when checksums are enabled,
> shows up quite prominently in write-heavy profiles, so there's a "purely
> postgres" reason to avoid these issues too.

+1

I wonder what the other file systems that maintain checksums (see list
at [1]) do when the data changes underneath a write.  ZFS's policy is
conservative[2], while BTRFS took the demons-will-fly-out-of-your-nose
route.  I can see arguments for both approaches (ZFS can only reach
zero-copy optimum by turning off checksums completely, while BTRFS is
happy to assume that if you break this programming rule that is not
written down anywhere then you must never want to see your data ever
again).  What about ReFS?  CephFS?

I tried to find out what POSIX says about this WRT synchronous
pwrite() (as Tom suggested, maybe we're doing something POSIX doesn't
allow), but couldn't find it in my first attempt.  It *does* say it's
undefined for aio_write() (which means that my prototype
io_method=posix_aio code that uses that stuff is undefined in the
presence of hintbit modifications).
I don't really see why it should vary between synchronous and
asynchronous interfaces (considering the existence of threads, shared
memory etc, the synchronous interface only removes one thread from the
list of possible suspects that could flip some bits).  But yeah, in any
case, it doesn't seem great that we do that.

[1] https://en.wikipedia.org/wiki/Comparison_of_file_systems#Block_capabilities
[2] https://openzfs.topicbox.com/groups/developer/T950b02acdf392290/odirect-semantics-in-zfs
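
For illustration only, here is a minimal sketch of the copy-before-write
idea referred to in the quoted text: copy the shared page to a private
buffer, checksum the copy, and hand the copy to pwrite(), so the bytes
being checksummed and written cannot change underneath the I/O.  This is
not PostgreSQL code and not the repro program from upthread.  PAGE_SZ,
toy_checksum(), write_page_via_copy() and demo() are hypothetical names,
and the FNV-1a hash merely stands in for a real page checksum.  The
"modification" in demo() happens after the write returns, so it is a
stand-in for a concurrent hint bit update rather than a real race.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 8192    /* hypothetical page size for this sketch */

/* Toy FNV-1a hash, standing in for a real page checksum. */
static uint32_t toy_checksum(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Copy the shared page, checksum the private copy, and write the copy.
 * Later changes to "shared" cannot affect the bytes handed to pwrite().
 */
static ssize_t write_page_via_copy(int fd, const unsigned char *shared,
                                   off_t off, uint32_t *sum_out)
{
    unsigned char copy[PAGE_SZ];

    assert(shared != NULL && sum_out != NULL);
    memcpy(copy, shared, PAGE_SZ);      /* this memcpy is the extra cost */
    *sum_out = toy_checksum(copy, PAGE_SZ);
    return pwrite(fd, copy, PAGE_SZ, off);
}

/*
 * Write one page via the copy, modify the shared page afterwards, and
 * check that the file contents still match the recorded checksum.
 * Returns 0 on a match, 1 on a mismatch, -1 on an I/O error.
 */
static int demo(const char *path)
{
    unsigned char shared[PAGE_SZ], back[PAGE_SZ];
    uint32_t sum;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(shared, 0xAB, PAGE_SZ);
    if (write_page_via_copy(fd, shared, 0, &sum) != PAGE_SZ) {
        close(fd);
        unlink(path);
        return -1;
    }
    shared[0] ^= 1;     /* stand-in for a later hint bit update */

    ssize_t n = pread(fd, back, PAGE_SZ, 0);
    close(fd);
    unlink(path);
    if (n != PAGE_SZ)
        return -1;
    return toy_checksum(back, PAGE_SZ) == sum ? 0 : 1;
}
```

The memcpy() is exactly the per-write overhead that the quoted text says
shows up in write-heavy profiles; avoiding it would require the page not
to change while the I/O is in flight.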