On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@gmail.com> writes:
> > On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.mu...@gmail.com>
> > wrote:
> > As for what to do about it, some ideas:
> > 2. Retry after a short time on checksum failure.  The probability is
> > already miniscule, and becomes pretty close to 0 if we read thrice
> > 100ms apart.
>
> > First thought is that 2 is appropriate level of complexity for this
> > rare and stupid problem.
>
> Yeah, I was thinking the same.  A variant could be "repeat until
> we see the same calculated checksum twice".
Hmm.  While writing a comment to explain why that's good enough, I
realised that it's not really true, for a standby, that control file
writes are always expected to be far apart in time.  XLogFlush->
UpdateMinRecoveryPoint() could coincide badly with our N attempts for
any small N and for any nap time, which I think makes your idea better
than mine.

With some cartoon-level understanding of what's going on (to wit: I
think the kernel just pins the page but doesn't use a page-level
content lock or range lock, so what you're seeing is raw racing memcpy
calls and unsynchronised cache line effects), I guess you'd be fairly
likely to make "progress" in seeing more new data even if you didn't
sleep in between, but who knows.  So I have a 10ms sleep to make
progress very likely; given your algorithm it doesn't matter if you
didn't make all the progress, just some.  Since this is reachable from
SQL, I think we also need a CFI call so you can't get uninterruptibly
stuck here?

I wrote a stupid throw-away function to force a write.  If you have an
ext4 system to hand (xfs, zfs, apfs, ufs and others don't suffer from
this) you can do:

do $$
begin
  for i in 1..100000000 loop
    perform pg_update_control_file();
  end loop;
end;
$$;

... while you also do:

select pg_control_system();
\watch 0.001

... and you'll soon see:

ERROR:  calculated CRC checksum does not match value stored in file

The attached draft patch fixes it.
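In case it helps to see the shape of that retry loop outside the tree,
here's a rough standalone sketch of the same idea.  Everything concrete
in it is invented for illustration (the demo file name, the 512-byte
record with a trailing little-endian CRC-32, the plain bitwise CRC);
none of that is what pg_control or the attached patch actually uses,
the retry-until-the-checksum-repeats logic is the only point:

/*
 * Standalone sketch of "retry until two consecutive reads compute the
 * same checksum".  The record layout and file name are made up for
 * illustration; this is not the real pg_control format.
 */
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DEMO_FILE	"control.demo"
#define DEMO_SIZE	512

/* Plain bitwise CRC-32 (reflected, poly 0xEDB88320); stands in for pg_crc32c. */
static uint32_t
crc32_buf(const unsigned char *buf, size_t len)
{
	uint32_t	crc = 0xFFFFFFFF;

	for (size_t i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (int b = 0; b < 8; b++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
	}
	return crc ^ 0xFFFFFFFF;
}

/*
 * Read the record and verify its checksum.  A mismatch is assumed to be
 * a torn read and retried after a short nap, until two consecutive
 * attempts compute the same (still wrong) value; only then is it
 * reported as real corruption.
 */
static bool
read_stable(unsigned char *out)
{
	bool		first_try = true;
	uint32_t	last_crc = 0;

	for (;;)
	{
		uint32_t	crc;
		uint32_t	stored;
		int			fd = open(DEMO_FILE, O_RDONLY);

		if (fd < 0 || read(fd, out, DEMO_SIZE) != DEMO_SIZE)
		{
			fprintf(stderr, "could not read %s\n", DEMO_FILE);
			if (fd >= 0)
				close(fd);
			return false;
		}
		close(fd);

		crc = crc32_buf(out, DEMO_SIZE - 4);
		memcpy(&stored, out + DEMO_SIZE - 4, sizeof(stored));

		if (crc == stored)
			return true;		/* checksum OK */
		if (!first_try && crc == last_crc)
			return false;		/* stable mismatch: treat as real corruption */

		first_try = false;
		last_crc = crc;
		usleep(10000);			/* 10ms nap, like the patch */
	}
}

int
main(void)
{
	unsigned char buf[DEMO_SIZE];

	printf("read %s\n", read_stable(buf) ? "ok" : "failed");
	return 0;
}

The writer side isn't shown; the point is just that a torn read yields
a checksum that keeps changing across attempts, while genuine
corruption yields the same bad value twice in a row.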
From 9f1856aeb56be888123702fe471d8388da66439f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 24 Nov 2022 00:55:03 +0000
Subject: [PATCH 1/2] XXX Dirty hack to clobber control file for testing

---
 src/backend/access/transam/xlog.c | 10 ++++++++++
 src/include/catalog/pg_proc.dat   |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..88de05ab35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2502,6 +2502,16 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	LWLockRelease(ControlFileLock);
 }
 
+Datum
+pg_update_control_file()
+{
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+	ControlFile->minRecoveryPoint++;	/* XXX change something to affect CRC! */
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+	PG_RETURN_VOID();
+}
+
 /*
  * Ensure that all XLOG data through the given position is flushed to disk.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f15aa2dbb1..8177c1657c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -11679,6 +11679,14 @@
   proargnames => '{pg_control_version,catalog_version_no,system_identifier,pg_control_last_modified}',
   prosrc => 'pg_control_system' },
 
+{ oid => '8888',
+  descr => 'update control file',
+  proname => 'pg_update_control_file', provolatile => 'v', prorettype => 'void',
+  proargtypes => '', proallargtypes => '',
+  proargmodes => '{}',
+  proargnames => '{}',
+  prosrc => 'pg_update_control_file' },
+
 { oid => '3442',
   descr => 'pg_controldata checkpoint state information as a function',
   proname => 'pg_control_checkpoint', provolatile => 'v',
-- 
2.35.1
From a63818a32d661dba563cedfdb85731e522b3c6a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 24 Nov 2022 13:28:22 +1300
Subject: [PATCH 2/2] Try to tolerate concurrent reads and writes of control file.

Various frontend programs and SQL-callable backend functions read the
control file without any kind of interlocking against concurrent
writes.  Linux ext4 doesn't implement the atomicity required by POSIX
here, so a concurrent reader can see only partial effects of an
in-progress write.  Tolerate this by retrying until we get two reads
in a row with the same checksum, after an idea from Tom Lane.

Reported-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/common/controldata_utils.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index 2d1f35bbd1..200d24df02 100644
--- a/src/common/controldata_utils.c
+++ b/src/common/controldata_utils.c
@@ -56,12 +56,19 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 	char		ControlFilePath[MAXPGPATH];
 	pg_crc32c	crc;
 	int			r;
+	bool		first_try;
+	pg_crc32c	last_crc;
 
 	Assert(crc_ok_p);
 
 	ControlFile = palloc_object(ControlFileData);
 	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
 
+	first_try = true;
+	INIT_CRC32C(last_crc);
+
+retry:
+
 #ifndef FRONTEND
 	if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
 		ereport(ERROR,
@@ -117,6 +124,24 @@ get_controlfile(const char *DataDir, bool *crc_ok_p)
 
 	*crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);
 
+	/*
+	 * With unlucky timing on filesystems that don't implement atomicity of
+	 * concurrent reads and writes (such as Linux ext4), we might have seen
+	 * garbage if the server was writing to the file at the same time.  Keep
+	 * retrying until we see the same CRC twice.
+	 */
+	if (!*crc_ok_p && (first_try || !EQ_CRC32C(crc, last_crc)))
+	{
+		first_try = false;
+		last_crc = crc;
+		pg_usleep(10000);
+
+#ifndef FRONTEND
+		CHECK_FOR_INTERRUPTS();
+#endif
+		goto retry;
+	}
+
 	/* Make sure the control file is valid byte order. */
 	if (ControlFile->pg_control_version % 65536 == 0 &&
 		ControlFile->pg_control_version / 65536 != 0)
-- 
2.35.1