Re: sync_standbys_defined read/write race on startup

Michael Paquier Wed, 09 Apr 2025 19:33:34 -0700

On Wed, Apr 09, 2025 at 06:21:50PM +0300, Maksim.Melnikov wrote:
> Test 041_checkpoint_at_promote.pl is really good example
> of using injection points, but anyway, I still don't see
> how injection points can help us here. In failed test case
> we started postgres, immediately open psql connection and commit
> prepared transaction. Postgres, on start, before accepting
> connections, fork checkpointer and checkpointer initialize
> shared parameter sync_standbys_defined in CheckpointerMain,
> before process loop. So, in most attempts, we can't call
> injection_points_attach in psql before
> macro INJECTION_POINT() code will be executed in
> checkpointer(I mean INJECTION_POINT() code instead sleep()),
> bacause checkpointer process is forking before database
> system is ready to accept connections.


Confirmed.  One thing where it would be possible to make things work
is to introduce some persistency of the injection points, say these
are flushed at shutdown.  We could do that without touching at the
backend code and only in the module injection_points, but perhaps this
thread is not enough to justify this extra complexity.

> The manual execution of checkpoint can't help us too.
> 
> P.S.
> sorry for style errors, I am new in postgres, so can miss
> some rules. Anyway, to avoid any community rules mistakes,
> attached patch with correct naming.

No worries, we are all here to learn, as am I.

+    sync_standbys_defined = WalSndCtl->sync_standbys_defined & 
SYNCSTANDBYSINITIALIZED
+        ? (WalSndCtl->sync_standbys_defined & SYNCSTANDBYSDEFINED) : 
SyncStandbysDefined();

So this is the core point of the patch: if the sync standby data has
not been initialized, we try the only thing we can do with a check on
the GUC value.  Then:
- If the GUC has no value, then we know for sure that there is no
point to wait, so the patch skips the wait.
- If the GUC has a value, it may be possible that we have to wait but
we are not sure yet without letting the checkpointer process the GUC
and update the information in shared memory, so we decide that it's
better to wait and let the checkpointer process the setup until it
wakes up the queue, relying on the fact that there should be no
processes that wait as long as the GUC is not set.

This is mentioned in your commit message but there is close to zero
explanation about the reason of this choice, and even worse this fact
is hidden to the reader of the code because of the overly-complex
setup around sync_standbys_defined and the fact that there is no
comment to document this fact.

+/*
+ * WalSndCtlData->sync_standbys_defined bit flags. 7th bit check 
standbys_defined or not,
+ * 8th bit check WalSndCtlData->sync_standbys_defined field has been 
initialized.
+ */
+#define SYNCSTANDBYSINITIALIZED    (1 << 0)
+#define SYNCSTANDBYSDEFINED        (1 << 1)
+
+#define SET_SYNCSTANDBYS_INITIALIZED(sync_standbys)    \
+    (sync_standbys | SYNCSTANDBYSINITIALIZED)
+#define SET_SYNCSTANDBYS_DEFINED(sync_standbys)    \
+    (sync_standbys | SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)
+#define RESET_SYNCSTANDBYS_DEFINED(sync_standbys)    \
+    (SET_SYNCSTANDBYS_INITIALIZED(sync_standbys) & (~SYNCSTANDBYSDEFINED))
+#define IS_SYNCSTANDBYS_INITIALIZED_ONLY(sync_standbys)    \
+    ((sync_standbys & (SYNCSTANDBYSINITIALIZED | SYNCSTANDBYSDEFINED)) == 
SYNCSTANDBYSINITIALIZED)
+

The use of the flags is a good idea to protect the atomicity of the
lookup in the fast path, but hiding that in a set of macros, with some
of them being used in only one location reduces the readability of the
code in walsender.c and syncrep.c because we would need to do a
permanent mapping with what walsender_private.h uses.  The reset one
depends on the set one, which is also confusing.

A couple of comments in walsender.c and syncrep.c also need a refresh
around the manipulations of sync_standbys_defined.

Also, the name "sync_standbys_defined" is not adapted anymore, as it
stores a status of the data in shared memory associated to the sync
standbys, so I've switched that to sync_standbys_status, with bits8 as
type so as we have the same size as a boolean.

I have taken a stab at all that, finishing with the attached.  The
patch has a pg_usleep() that I have used to stress the CI.  That
should be removed in the final result, obviously.

Side note: I am pretty sure that this issue has been the origin of
some spurious failures in the buildfarm.  I cannot point my finger to
a particular one now, but it would not go unnoticed especially on the
slow animals..

What do you think?
--
Michael

From f60a004116bc21440b8769c75f11126ac4ef667f Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Thu, 10 Apr 2025 11:12:41 +0900
Subject: [PATCH v5] Fix sync_standbys_defined race on startup.

WalSndCtl->sync_standbys_defined, can be updated by checkpointer process
only and in backends it can be read. So, on postgres startup, we have
race between checkpointer process and backends.

Notes: blah..
---
 src/include/replication/walsender_private.h | 23 +++++-
 src/backend/replication/syncrep.c           | 85 +++++++++++++++++----
 src/backend/replication/walsender.c         |  6 +-
 3 files changed, 92 insertions(+), 22 deletions(-)

diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 0fc77f1b4af7..92243cce1d04 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -96,11 +96,11 @@ typedef struct
 	XLogRecPtr	lsn[NUM_SYNC_REP_WAIT_MODE];
 
 	/*
-	 * Are any sync standbys defined?  Waiting backends can't reload the
-	 * config file safely, so checkpointer updates this value as needed.
-	 * Protected by SyncRepLock.
+	 * Status of data related to the synchronous standbys.  Waiting backends
+	 * can't reload the config file safely, so checkpointer updates this value
+	 * as needed. Protected by SyncRepLock.
 	 */
-	bool		sync_standbys_defined;
+	bits8		sync_standbys_status;
 
 	/* used as a registry of physical / logical walsenders to wake */
 	ConditionVariable wal_flush_cv;
@@ -116,6 +116,21 @@ typedef struct
 	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
 } WalSndCtlData;
 
+/* Flags for WalSndCtlData->sync_standbys_status */
+
+/*
+ * Is the synchronous standby data initialized from the GUC?  This is flipped
+ * the first time synchronous_standby_names is processed by the checkpointer.
+ */
+#define SYNC_STANDBY_INIT			(1 << 0)
+
+/*
+ * Is the synchronous standby data defined?  This is set when
+ * synchronous_standby_names has some data, after being processed by the
+ * checkpointer.
+ */
+#define SYNC_STANDBY_DEFINED		(1 << 1)
+
 extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
 
 
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index d75e39680355..4c9f550f7dc3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -161,16 +161,23 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	 * sync replication standby names defined.
 	 *
 	 * Since this routine gets called every commit time, it's important to
-	 * exit quickly if sync replication is not requested. So we check
-	 * WalSndCtl->sync_standbys_defined flag without the lock and exit
-	 * immediately if it's false. If it's true, we need to check it again
-	 * later while holding the lock, to check the flag and operate the sync
-	 * rep queue atomically. This is necessary to avoid the race condition
-	 * described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
-	 * it's false, the lock is not necessary because we don't touch the queue.
+	 * exit quickly if sync replication is not requested.
+	 *
+	 * We check WalSndCtl->sync_standbys_status flag without the lock and exit
+	 * immediately if SYNC_STANDBY_INIT is set (the checkpointer has
+	 * initialized this data) but SYNC_STANDBY_DEFINED is missing (no sync
+	 * replication requested).
+	 *
+	 * If SYNC_STANDBY_DEFINED is set, we need to check the status again later
+	 * while holding the lock, to check the flag and operate the sync rep
+	 * queue atomically.  This is necessary to avoid the race condition
+	 * described in SyncRepUpdateSyncStandbysDefined().  On the other hand, if
+	 * SYNC_STANDBY_DEFINED is not set, the lock is not necessary because we
+	 * don't touch the queue.
 	 */
 	if (!SyncRepRequested() ||
-		!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		((((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status) &
+		 (SYNC_STANDBY_INIT | SYNC_STANDBY_DEFINED)) == SYNC_STANDBY_INIT)
 		return;
 
 	/* Cap the level for anything other than commit to remote flush only. */
@@ -186,16 +193,38 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
 	/*
-	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
-	 * set.  See SyncRepUpdateSyncStandbysDefined.
+	 * We don't wait for sync rep if SYNC_STANDBY_DEFINED is not set.  See
+	 * SyncRepUpdateSyncStandbysDefined().
 	 *
 	 * Also check that the standby hasn't already replied. Unlikely race
 	 * condition but we'll be fetching that cache line anyway so it's likely
 	 * to be a low cost check.
+	 *
+	 * If the sync standby data has not been initialized yet
+	 * (SYNC_STANDBY_INIT is not set), then fall back to a direct GUC check.
 	 */
-	if (!WalSndCtl->sync_standbys_defined ||
-		lsn <= WalSndCtl->lsn[mode])
+	if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
 	{
+		if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 ||
+			lsn <= WalSndCtl->lsn[mode])
+		{
+			LWLockRelease(SyncRepLock);
+			return;
+		}
+	}
+	else if (!SyncStandbysDefined())
+	{
+		/*
+		 * If we are here, the sync standby data has not been initialized yet,
+		 * so fall back to the best thing we can do: a check on
+		 * SyncStandbysDefined() to see if the GUC is set or not.
+		 *
+		 * If the GUC has a value, we wait until the checkpointer updates the
+		 * status data because we cannot be sure yet if we should wait or not.
+		 * If the GUC has *no* value, we are sure that there is no point to
+		 * wait; this matters for example when initializing a cluster, where
+		 * we should never wait, and no sync standbys is the default behavior.
+		 */
 		LWLockRelease(SyncRepLock);
 		return;
 	}
@@ -912,7 +941,7 @@ SyncRepWakeQueue(bool all, int mode)
 
 /*
  * The checkpointer calls this as needed to update the shared
- * sync_standbys_defined flag, so that backends don't remain permanently wedged
+ * sync_standbys_status flag, so that backends don't remain permanently wedged
  * if synchronous_standby_names is unset.  It's safe to check the current value
  * without the lock, because it's only ever updated by one process.  But we
  * must take the lock to change it.
@@ -922,8 +951,11 @@ SyncRepUpdateSyncStandbysDefined(void)
 {
 	bool		sync_standbys_defined = SyncStandbysDefined();
 
-	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	if (sync_standbys_defined !=
+		((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) != 0))
 	{
+		pg_usleep(2 * 1000 * 1000L);	/* 2s */
+
 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
 
 		/*
@@ -946,7 +978,30 @@ SyncRepUpdateSyncStandbysDefined(void)
 		 * backend that hasn't yet reloaded its config might go to sleep on
 		 * the queue (and never wake up).  This prevents that.
 		 */
-		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+		WalSndCtl->sync_standbys_status = SYNC_STANDBY_INIT |
+			(sync_standbys_defined ? SYNC_STANDBY_DEFINED : 0);
+
+		LWLockRelease(SyncRepLock);
+	}
+	else if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT) == 0)
+	{
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+		/*
+		 * Note that there is no need to wake up the queues here.  We would
+		 * reach this path only if SyncStandbysDefined() returns false, or it
+		 * would mean that some backends are waiting with the GUC set.  See
+		 * SyncRepWaitForLSN().
+		 */
+		Assert(!SyncStandbysDefined());
+
+		/*
+		 * Even if there is no sync standby defined, let the readers of this
+		 * information know that the sync standby data has been initialized.
+		 * This can just be done once, hence the previous check on
+		 * SYNC_STANDBY_INIT to avoid useless work.
+		 */
+		WalSndCtl->sync_standbys_status |= SYNC_STANDBY_INIT;
 
 		LWLockRelease(SyncRepLock);
 	}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 216baeda5cd2..987a7336ec84 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1674,13 +1674,13 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
 	 * When skipping empty transactions in synchronous replication, we send a
 	 * keepalive message to avoid delaying such transactions.
 	 *
-	 * It is okay to check sync_standbys_defined flag without lock here as in
-	 * the worst case we will just send an extra keepalive message when it is
+	 * It is okay to check sync_standbys_status without lock here as in the
+	 * worst case we will just send an extra keepalive message when it is
 	 * really not required.
 	 */
 	if (skipped_xact &&
 		SyncRepRequested() &&
-		((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
+		(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status & SYNC_STANDBY_DEFINED))
 	{
 		WalSndKeepalive(false, lsn);
 
-- 
2.49.0

signature.asc
Description: PGP signature

Re: sync_standbys_defined read/write race on startup

Reply via email to