Hi,

As we learned in the build farm[1], at least one checksumming
filesystem doesn't like it if you write data from memory that is still
changing, when direct I/O is enabled.  That was BTRFS, but the next
version of OpenZFS seems to have a similar allergy.  In Andres's
thread about fixing that (among other problems) for hint bits, he
pointed out that WAL must have the same problem[2].  I am splitting
that case off to a new thread given that the problem and potential
solutions seem pretty independent.

Concretely, on BTRFS you could write WAL apparently successfully, but
later if you tried to read it back you'd get EIO, because the checksum
had been computed over data that was still changing mid-write.  On ZFS
2.3 (out soon) I think a backend would crash/panic during the write,
which is different but also bad.  Some other obvious candidates in that
general COW + integrity family:

(1) APFS: no, it doesn't seem to have user data checksums.

(2) ReFS: it has user data checksums but doesn't enable them by
default, and I have no idea what it would do in the same circumstances
(but Windows already explicitly tells us never to touch the buffer
during an I/O, so we're breaking the rules there regardless).

(3) bcachefs: it has user data checksums by default, but it looks like
it has its own bounce buffer to defend against this type of
problem[3].

(Anyone got clues about other exotic systems?)
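
To make the failure mode concrete, here's a minimal standalone sketch
of the race (not PostgreSQL code; the file name, sizes and alignment
are invented for illustration): one thread issues a direct I/O write
while another keeps scribbling on the same buffer, which is exactly the
situation XLogWrite() can be in with a partially filled tail page:

#define _GNU_SOURCE				/* for O_DIRECT on Linux */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ 8192

static char *buf;

static void *
scribbler(void *arg)
{
	volatile char *p = buf;

	(void) arg;
	/* Keep mutating the buffer, like backends filling in a WAL page. */
	for (;;)
		p[BUFSZ - 1]++;
}

int
main(void)
{
	pthread_t	thread;
	int			fd;

	/* Direct I/O requires an aligned buffer. */
	if (posix_memalign((void **) &buf, 4096, BUFSZ) != 0)
		return 1;
	memset(buf, 0, BUFSZ);

	fd = open("test_file", O_CREAT | O_WRONLY | O_DIRECT, 0600);
	if (fd < 0)
		return 1;

	pthread_create(&thread, NULL, scribbler, NULL);

	/*
	 * With direct I/O there is no stable page-cache copy in between: the
	 * filesystem checksums the user buffer and DMAs it to disk, so if the
	 * scribbler changes a byte in the window between those two steps, the
	 * stored checksum won't match the stored data, and a later read fails
	 * with EIO (on BTRFS-like filesystems).
	 */
	pwrite(fd, buf, BUFSZ, 0);

	close(fd);
	return 0;
}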

Here's an early experimental patch to try to do something about that,
by making a temporary copy of the moving trailing buffer.  It's not
free, but the behaviour is only activated when you turn on
debug_io_direct=wal.  No significant change to default behaviour
hopefully, but direct I/O WAL writes look like this:

postgres=# insert into t values (42);
pwrite(14,"\^X\M-Q\^E\0\^A\0\0\0\0\M-`\M^P"...,8192,0x90e000) = 8192

postgres=# insert into t select generate_series(1, 1000);
pwritev(14,[{"\^X\M-Q\^E\0\^A\0\0\0\0`\M^O\r\0"...,57344},
            {"\^X\M-Q\^E\0\^A\0\0\0\0@\M^P\r\0"...,8192}],2,0x8f6000) = 65536

postgres=# begin;
postgres=*# insert into t select generate_series(1, 10000000);
[walwriter]
pwrite(5,"\^X\M-Q\^E\0\^A\0\0\0\0@`\^Z\0\0"...,2080768,0x604000) = 2080768

In the first case it had to take a copy because it was writing only the
tail page, but otherwise it looks the same as unpatched.  In the second
case, it wrote as much as it could safely directly from the WAL
buffers, but then had to write the tail page from a temporary copy.  In
the third case, bulk writing activity doesn't produce partial pages, so
it gets the traditional behaviour.
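
For anyone wanting to reproduce this, the GUC mentioned above goes in
postgresql.conf and can only be set at server start:

debug_io_direct = 'wal'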

It might also be an idea to copy only the part we really care about and
zero the rest, and in either case perhaps only out to the nearest
PG_IO_ALIGN_SIZE boundary (the step size at which it is OK to end a
direct I/O write), though that last part might not be worth the hassle
if we plan to just use smaller blocks anyway.
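
For illustration, here's a sketch of that idea as a hypothetical helper
(the name and parameters are invented here, not from the patch): copy
only the used prefix of the tail page into the bounce buffer, zero-pad
out to the next aligned boundary, and shorten the write accordingly:

#include <string.h>

/*
 * Hypothetical helper, not part of the attached patch.  Copy only the
 * "used" prefix of the tail page into the bounce buffer "dst", zero the
 * padding out to the next "io_align" boundary (e.g. PG_IO_ALIGN_SIZE,
 * assumed to be a power of two), and return the number of bytes to
 * write, so the direct I/O write can end there instead of covering the
 * whole page.
 */
static size_t
copy_tail_zero_padded(char *dst, const char *tail, size_t used,
					  size_t io_align)
{
	size_t		padded = (used + io_align - 1) & ~(io_align - 1);

	memcpy(dst, tail, used);
	memset(dst + used, 0, padded - used);
	return padded;
}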

Interaction with other potential future changes:  Obviously AIO might
require something more... bouncy than a stack buffer.  4K WAL buffers
would reduce the size of the copied region (along with other
benefits[4]).  A no-overwrite WAL buffer mode (to match the behaviour
of other databases) would avoid the cost completely (along with other
benefits).

What else could we do, if not something like this?

[1] https://www.postgresql.org/message-id/ca+hukgksbaz78fw3wtf3q8arqkcz1ggstfrfidpbu-j9ofz...@mail.gmail.com
[2] https://www.postgresql.org/message-id/jo5p5nthb3hxwfj7qifiu2rxk5y3hexbwxs5d6x2jotsvj3bq5%40jhtrriubncjb
[3] https://bcachefs-docs.readthedocs.io/en/latest/feat-checksumming.html
[4] https://www.postgresql.org/message-id/flat/20231009230805.funj5ipoggjyzjz6%40awork3.anarazel.de
From bdbcaf7f8ba970e947590d86e6a19d4cb0087d70 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 1 Oct 2024 06:06:57 +1300
Subject: [PATCH] Don't write changing WAL buffers with direct I/O.

Filesystems that have checksums for user data sometimes don't like it if
you modify the data in memory while it is being written to disk.
XLogWrite() allowed that for trailing partially filled WAL pages,
because it always writes whole WAL buffer pages, and yet other backends
are allowed to fill in the rest of such a page at the same time.

Defend against corruption by taking a temporary copy of any trailing
partial page.  Use a vectored write if necessary to avoid an extra
system call.

Unfortunately on Windows pg_pwritev() is implemented with a loop, so
the default wal_sync_method will sometimes wait for synchronous I/O
twice, but that is mitigated by the fact that we don't activate the new
behavior for buffered WAL writes, which remain the default.  (It'd be
possible to use WriteFileGather(), but that'd take more infrastructural
changes to deal with overlapped I/O.)

XXX Zeroes in unstable region instead?

Per observation from Andres Freund that the WAL must have the same
problem that we'd already diagnosed for relation data with concurrently
modified hint bits.

Discussion: https://postgr.es/m/CA%2BhUKGL-sZrfwcdme8jERPbn%2BsGbY13sRCvq8b9Hp%3DhaWpC6fw%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 43 ++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..c8994ad8aad 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2426,6 +2426,16 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nleft;
 			ssize_t		written;
 			instr_time	start;
+			bool		copy_tail;
+
+			/*
+			 * Do we need to treat the tail page specially to avoid writing
+			 * data out while the buffer is still potentially changing?  We
+			 * enable this only for direct I/O currently, where that is known
+			 * to cause checksum failures on some file systems.
+			 */
+			copy_tail = (io_direct_flags & IO_DIRECT_WAL) != 0 &&
+				ispartialpage;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2433,6 +2443,37 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			nleft = nbytes;
 			do
 			{
+				PGAlignedXLogBlock tail_page;	/* I/O-aligned, for direct I/O */
+				struct iovec iov[2];
+				int			iovcnt = 0;
+
+				if (!copy_tail || nleft > XLOG_BLCKSZ)
+				{
+					/*
+					 * Write as many pages as possible directly from WAL
+					 * buffers.
+					 */
+					iov[iovcnt].iov_base = from;
+					iov[iovcnt].iov_len = nleft - (copy_tail ? XLOG_BLCKSZ : 0);
+					++iovcnt;
+				}
+
+				if (copy_tail)
+				{
+					size_t		tail_nleft = Min(nleft, XLOG_BLCKSZ);
+
+					/*
+					 * Copy the tail page into a temporary buffer and write
+					 * that part from there instead, because the WAL buffer is
+					 * potentially still changing underneath our feet and our
+					 * defense against that is enabled.
+					 */
+					memcpy(tail_page.data, from + nleft - tail_nleft, tail_nleft);
+					iov[iovcnt].iov_base = tail_page.data;
+					iov[iovcnt].iov_len = tail_nleft;
+					++iovcnt;
+				}
+
 				errno = 0;
 
 				/* Measure I/O timing to write WAL data */
@@ -2442,7 +2483,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 					INSTR_TIME_SET_ZERO(start);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				written = pg_pwritev(openLogFile, iov, iovcnt, startoffset);
 				pgstat_report_wait_end();
 
 				/*
-- 
2.47.0
