Hi,

As we learned in the build farm[1], at least one checksumming
filesystem doesn't like it if you write data from memory that is still
changing while direct I/O is enabled.  That was BTRFS, but the next
version of OpenZFS seems to have a similar allergy.  In Andres's
thread about fixing that (among other problems) for hint bits, he
pointed out that WAL must have the same problem[2].  I am splitting
that case off to a new thread, given that the problem and potential
solutions seem pretty independent.
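To see the hazard in isolation, here's a rough standalone sketch of
the sort of repro I mean (untested and hypothetical; the file name,
sizes and loop count are arbitrary): one thread pwrite()s an O_DIRECT
buffer while another keeps scribbling on it, and on an affected
filesystem the later read-back can fail with EIO.  Build with
something like "cc -O2 -pthread repro.c" and run it on a filesystem
with user data checksums enabled.

#define _GNU_SOURCE				/* for O_DIRECT on Linux */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 8192				/* one WAL-page-sized write */

static char *buf;
static volatile int stop;

/* Keep modifying the buffer while the main thread writes it out. */
static void *
scribbler(void *arg)
{
	while (!stop)
		buf[rand() % BUFSZ]++;
	return NULL;
}

int
main(void)
{
	pthread_t	thread;
	char	   *readbuf;
	int			fd;

	/* O_DIRECT requires aligned buffers and transfer sizes. */
	buf = aligned_alloc(4096, BUFSZ);
	readbuf = aligned_alloc(4096, BUFSZ);
	memset(buf, 0x42, BUFSZ);

	fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0600);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	pthread_create(&thread, NULL, scribbler, NULL);
	for (int i = 0; i < 1000; i++)
	{
		/* Write the page while it is changing under our feet. */
		if (pwrite(fd, buf, BUFSZ, 0) != BUFSZ)
			perror("pwrite");
	}
	stop = 1;
	pthread_join(thread, NULL);

	/* On an affected filesystem this read-back may fail with EIO. */
	if (pread(fd, readbuf, BUFSZ, 0) != BUFSZ)
		perror("pread");

	close(fd);
	return 0;
}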
Concretely, on BTRFS you could write WAL apparently successfully, but
later if you tried to read it you'd get EIO, because the checksum was
computed badly over moving data.  On ZFS 2.3 (out soon) I think a
backend would crash/panic during the write, so different but also bad.
Some other obvious candidates in that general COW + integrity family
would be (1) APFS, but no, it doesn't seem to have user data
checksums, (2) ReFS, which has user data checksums but doesn't enable
them by default, and I have no idea what it would do in the same
circumstances (but Windows already explicitly tells us never to touch
the buffer during an I/O, so we're already breaking the rules
regardless), (3) bcachefs, which has user data checksums by default,
but it looks like it has its own bounce buffer to defend against this
type of problem[3].  (Anyone got clues about other exotic systems?)

Here's an early experimental patch to try to do something about that,
by making a temporary copy of the moving trailing buffer.  It's not
free, but the behaviour is only activated when you turn on
debug_io_direct=wal.  Hopefully no significant change to default
behaviour, but direct I/O WAL writes look like this:

postgres=# insert into t values (42);

pwrite(14,"\^X\M-Q\^E\0\^A\0\0\0\0\M-`\M^P"...,8192,0x90e000) = 8192

postgres=# insert into t select generate_series(1, 1000);

pwritev(14,[{"\^X\M-Q\^E\0\^A\0\0\0\0`\M^O\r\0"...,57344},{"\^X\M-Q\^E\0\^A\0\0\0\0@\M^P\r\0"...,8192}],2,0x8f6000) = 65536

postgres=# begin;
postgres=*# insert into t select generate_series(1, 10000000);

[walwriter] pwrite(5,"\^X\M-Q\^E\0\^A\0\0\0\0@`\^Z\0\0"...,2080768,0x604000) = 2080768

In the first case it had to take a copy, as it was only writing the
tail, but otherwise it looks like unpatched.  In the second case, it
wrote as much as it could safely directly from WAL buffers, but then
had to write the tail page out with a temporary copy.  In the third
case, bulk writing activity doesn't do partial pages, so it gets the
traditional behaviour.

It might also be an idea to copy only the part we really care about
and zero the rest, and in either case perhaps only out to the nearest
PG_IO_ALIGN_SIZE (the step size at which it is OK to end a direct I/O
write), though that last bit might not be worth the hassle if we plan
to just use smaller blocks anyway.  (A rough sketch of that idea
follows the links below.)

Interaction with other potential future changes: obviously AIO might
require something more... bouncy than a stack buffer.  4K WAL buffers
would reduce the size of the copied region (along with other
benefits[4]).  A no-overwrite WAL buffer mode (to match the behaviour
of other databases) would avoid the cost completely (along with other
benefits).

What else could we do, if not something like this?

[1] https://www.postgresql.org/message-id/ca+hukgksbaz78fw3wtf3q8arqkcz1ggstfrfidpbu-j9ofz...@mail.gmail.com
[2] https://www.postgresql.org/message-id/jo5p5nthb3hxwfj7qifiu2rxk5y3hexbwxs5d6x2jotsvj3bq5%40jhtrriubncjb
[3] https://bcachefs-docs.readthedocs.io/en/latest/feat-checksumming.html
[4] https://www.postgresql.org/message-id/flat/20231009230805.funj5ipoggjyzjz6%40awork3.anarazel.de
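To make that copy-and-zero idea concrete, the tail branch of the
attached patch might instead look something like the sketch below.
This is hypothetical only: valid_tail_bytes is a made-up stand-in for
however much of the tail page is actually filled in up to the write
request, and the surrounding nleft/written accounting would need
adjusting to match the shorter write.

				if (copy_tail)
				{
					size_t		tail_nleft = Min(nleft, XLOG_BLCKSZ);
					size_t		copy_len;

					/*
					 * Hypothetical variant, not what the attached patch
					 * does: copy only the bytes we actually need to write,
					 * zero the remainder, and end the write at the next
					 * PG_IO_ALIGN_SIZE boundary rather than a full page.
					 * valid_tail_bytes stands for the number of bytes of
					 * the tail page filled in up to the write request.
					 */
					copy_len = TYPEALIGN(PG_IO_ALIGN_SIZE, valid_tail_bytes);
					memcpy(tail_page, from + nleft - tail_nleft,
						   valid_tail_bytes);
					memset(tail_page + valid_tail_bytes, 0,
						   copy_len - valid_tail_bytes);
					iov[iovcnt].iov_base = tail_page;
					iov[iovcnt].iov_len = copy_len;
					++iovcnt;
				}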
From bdbcaf7f8ba970e947590d86e6a19d4cb0087d70 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 1 Oct 2024 06:06:57 +1300
Subject: [PATCH] Don't write changing WAL buffers with direct I/O.

Filesystems that have checksums for user data sometimes don't like it
if you modify the data in memory while it is being written to disk.
XLogWrite() allowed that with trailing partially filled WAL pages,
because it always writes the whole WAL buffer, and yet other backends
are allowed to fill in the rest of the buffer at the same time.

Defend against corruption by taking a temporary copy of any trailing
partial page.  Use a vectored write if necessary to avoid an extra
system call.

Unfortunately on Windows pg_pwritev() is implemented with a loop and
the default wal_sync_method will therefore sometimes wait for
synchronous I/O twice, but that is mitigated by the fact that we don't
activate the new behavior for buffered WAL writes, still the default.
(It'd be possible to use WriteFileGather() but that'd take more
infrastructural changes to deal with overlapped I/O.)

XXX Zeroes in unstable region instead?

Per observation from Andres Freund that the WAL must have the same
problem that we'd already diagnosed for relation data with
concurrently modified hint bits.

Discussion: https://postgr.es/m/CA%2BhUKGL-sZrfwcdme8jERPbn%2BsGbY13sRCvq8b9Hp%3DhaWpC6fw%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 43 ++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..c8994ad8aad 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2426,6 +2426,16 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nleft;
 			ssize_t		written;
 			instr_time	start;
+			bool		copy_tail;
+
+			/*
+			 * Do we need to treat the tail page specially to avoid writing
+			 * data out while the buffer is still potentially changing?  We
+			 * enable this only for direct I/O currently, where that is known
+			 * to cause checksum failures on some file systems.
+			 */
+			copy_tail = (io_direct_flags & IO_DIRECT_WAL) != 0 &&
+				ispartialpage;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2433,6 +2443,37 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			nleft = nbytes;
 			do
 			{
+				char		tail_page[XLOG_BLCKSZ];
+				struct iovec iov[2];
+				int			iovcnt = 0;
+
+				if (!copy_tail || nleft > XLOG_BLCKSZ)
+				{
+					/*
+					 * Write as many pages as possible directly from WAL
+					 * buffers.
+					 */
+					iov[iovcnt].iov_base = from;
+					iov[iovcnt].iov_len = nleft - (copy_tail ? XLOG_BLCKSZ : 0);
+					++iovcnt;
+				}
+
+				if (copy_tail)
+				{
+					size_t		tail_nleft = Min(nleft, XLOG_BLCKSZ);
+
+					/*
+					 * Copy the tail page into a temporary buffer and write
+					 * that part from there instead, because the WAL buffer is
+					 * potentially still changing underneath our feet and our
+					 * defense against that is enabled.
+					 */
+					memcpy(tail_page, from + nleft - tail_nleft, tail_nleft);
+					iov[iovcnt].iov_base = tail_page;
+					iov[iovcnt].iov_len = tail_nleft;
+					++iovcnt;
+				}
+
 				errno = 0;
 
 				/* Measure I/O timing to write WAL data */
@@ -2442,7 +2483,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 					INSTR_TIME_SET_ZERO(start);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				written = pg_pwritev(openLogFile, iov, iovcnt, startoffset);
 				pgstat_report_wait_end();
 
 				/*
-- 
2.47.0