On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@gmail.com> writes:
> > On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.mu...@gmail.com> 
> > wrote:
> > As for what to do about it, some ideas:
> > 2.  Retry after a short time on checksum failure.  The probability is
> > already minuscule, and becomes pretty close to 0 if we read thrice
> > 100ms apart.
>
> > First thought is that 2 is the appropriate level of complexity for
> > this rare and stupid problem.
>
> Yeah, I was thinking the same.  A variant could be "repeat until
> we see the same calculated checksum twice".

Hmm.  While writing a comment to explain why that's good enough, I
realised that it isn't really true on a standby that control file
writes are always far apart in time: XLogFlush->
UpdateMinRecoveryPoint() could coincide badly with our N attempts for
any small N and any nap time, which I think makes your idea better
than mine.
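
For reference, the writer side (UpdateMinRecoveryPoint() in xlog.c) is
essentially the following, so on a busy standby writes can come
arbitrarily close together (just a sketch, not the exact code; note
that the lock doesn't help readers that never take it, ie frontend
programs and the SQL functions):

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->minRecoveryPoint = lsn;  /* advanced by ongoing replay */
    UpdateControlFile();                  /* rewrites pg_control in place */
    LWLockRelease(ControlFileLock);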

With some cartoon-level understanding of what's going on (to wit: I
think the kernel just pins the page but doesn't take a page-level
content lock or range lock, so what you're seeing is raw racing
memcpy() calls and unsynchronised cache-line effects), I'd guess
you'd be fairly likely to make "progress" in seeing more new data
even if you didn't sleep in between, but who knows.  So I added a
10ms sleep to make progress very likely; given your algorithm it
doesn't matter if we don't make all the progress in one go, just
some.  Since this is reachable from SQL, I think we also need a CFI
call so you can't get uninterruptibly stuck here?
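
In other words, the core of the fix is a loop of this shape (distilled
from the attached patch; read_control_file_and_compute_crc() is just a
stand-in name for the existing open/read/close/CRC code in
get_controlfile()):

    first_try = true;
    INIT_CRC32C(last_crc);
    for (;;)
    {
        crc = read_control_file_and_compute_crc();  /* stand-in */
        if (EQ_CRC32C(crc, ControlFile->crc))
            break;      /* CRC validates: we saw a consistent copy */
        if (!first_try && EQ_CRC32C(crc, last_crc))
            break;      /* same bad CRC twice: report real corruption */
        first_try = false;
        last_crc = crc;
        pg_usleep(10000);   /* 10ms nap so a concurrent write can finish */
#ifndef FRONTEND
        CHECK_FOR_INTERRUPTS();  /* stay interruptible from SQL */
#endif
    }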

I wrote a stupid throw-away function to force a write.  If you have
an ext4 system to hand (xfs, zfs, apfs, ufs and others don't suffer
from this), you can do:

 do $$ begin for i in 1..100000000 loop perform
pg_update_control_file(); end loop; end; $$;

... while you also do:

 select pg_control_system();
 \watch 0.001

... and you'll soon see:

ERROR:  calculated CRC checksum does not match value stored in file

The attached draft patch fixes it.
From 9f1856aeb56be888123702fe471d8388da66439f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 24 Nov 2022 00:55:03 +0000
Subject: [PATCH 1/2] XXX Dirty hack to clobber control file for testing

---
 src/backend/access/transam/xlog.c | 10 ++++++++++
 src/include/catalog/pg_proc.dat   |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..88de05ab35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2502,6 +2502,16 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	LWLockRelease(ControlFileLock);
 }
 
+Datum
+pg_update_control_file(PG_FUNCTION_ARGS)
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->minRecoveryPoint++;		/* XXX change something to affect CRC! */
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+	PG_RETURN_VOID();
+}
+
 /*
  * Ensure that all XLOG data through the given position is flushed to disk.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f15aa2dbb1..8177c1657c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11679,6 +11679,12 @@
   proargnames => '{pg_control_version,catalog_version_no,system_identifier,pg_control_last_modified}',
   prosrc => 'pg_control_system' },
 
+{ oid => '8888',
+  descr => 'update control file',
+  proname => 'pg_update_control_file', provolatile => 'v',
+  prorettype => 'void', proargtypes => '',
+  prosrc => 'pg_update_control_file' },
+
 { oid => '3442',
   descr => 'pg_controldata checkpoint state information as a function',
   proname => 'pg_control_checkpoint', provolatile => 'v',
-- 
2.35.1

From a63818a32d661dba563cedfdb85731e522b3c6a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 24 Nov 2022 13:28:22 +1300
Subject: [PATCH 2/2] Try to tolerate concurrent reads and writes of control
 file.

Various frontend programs and SQL-callable backend functions read the
control file without any kind of interlocking against concurrent writes.
Linux ext4 doesn't implement the atomicity required by POSIX here, so a
concurrent reader can see only partial effects of an in-progress write.

Tolerate this by retrying until we get two reads in a row with the same
checksum, after an idea from Tom Lane.

Reported-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/common/controldata_utils.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 2d1f35bbd1..200d24df02 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -56,12 +56,19 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 	char		ControlFilePath[MAXPGPATH];
 	pg_crc32c	crc;
 	int			r;
+	bool		first_try;
+	pg_crc32c	last_crc;
 
 	Assert(crc_ok_p);
 
 	ControlFile = palloc_object(ControlFileData);
 	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
 
+	first_try = true;
+	INIT_CRC32C(last_crc);
+
+retry:
+
 #ifndef FRONTEND
 	if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
 		ereport(ERROR,
@@ -117,6 +124,24 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 
 	*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
 
+	/*
+	 * With unlucky timing on filesystems that don't implement atomicity of
+	 * concurrent reads and writes (such as Linux ext4), we might have seen
+	 * garbage if the server was writing to the file at the same time.  Keep
+	 * retrying until the CRC validates or we see the same CRC twice.
+	 */
+	if (!*crc_ok_p && (first_try || !EQ_CRC32C(crc, last_crc)))
+	{
+		first_try = false;
+		last_crc = crc;
+		pg_usleep(10000);
+
+#ifndef FRONTEND
+		CHECK_FOR_INTERRUPTS();
+#endif
+		goto retry;
+	}
+
 	/* Make sure the control file is valid byte order. */
 	if (ControlFile->pg_control_version % 65536 == 0 &&
 		ControlFile->pg_control_version / 65536 != 0)
-- 
2.35.1
