Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues

2025-03-31 Thread Alexey Makhmutov
> We’ve prepared two test patches on top of current master to address
> both issues:

> ...
> * 0002-Implement-batching-for-cascade-replication.patch – a test patch
> implementing a possible batching approach in xlogrecovery.c with a
> timer. Currently it uses GUC variables to allow testing of different
> batch sizes and timeout values.


I've played with the second patch a little more and made some 
adjustments to it:
1. Set up the timer only if we have actually applied records which are 
(potentially) not yet signaled to walsenders (see the sketch below).

2. Notify logical walsenders without delay if the timeline has changed.
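
The timer handling from the first adjustment boils down to the following 
(a simplified excerpt of the attached patch, shown here only for 
readability; the case with batching disabled is omitted):

	/* Count the applied record and check whether the batch is full */
	bool		batchFlushRequired = ++appliedRecords >=
		cascadeReplicationMaxBatchSize;

	if (batchFlushRequired)
	{
		/* Flushing the batch, so the pending timeout is no longer needed */
		appliedRecords = 0;
		if (cascadeReplicationMaxBatchDelay > 0)
			disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
	}

	WalSndWakeup(switchedTLI, batchFlushRequired);

	/* Arm the timeout only when the first record enters an empty batch */
	if (appliedRecords == 1 && cascadeReplicationMaxBatchDelay > 0)
		enable_timeout_after(STANDBY_CASCADE_WAL_SEND_TIMEOUT,
							 cascadeReplicationMaxBatchDelay);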

The modified patch is attached.

Thanks,
Alexey
From c691724332dd5a78b98d606188383d6b37c98021 Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin 
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each WAL
record during cascade replication. This creates a bottleneck in WalSndWakeup
when the number of walsender processes is large. This change introduces
batching for such notifications, which are now sent either after a certain
number of applied records or after a specified time interval, whichever
comes first.

Co-authored-by: Alexey Makhmutov 
---
 src/backend/access/transam/xlogrecovery.c | 61 ++-
 src/backend/postmaster/startup.c  |  1 +
 src/backend/utils/misc/guc_tables.c   | 22 
 src/include/access/xlogrecovery.h |  4 ++
 src/include/utils/timeout.h   |  1 +
 5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..3e3e43d5888 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -64,6 +64,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
@@ -147,6 +148,8 @@ bool		InArchiveRecovery = false;
 static bool StandbyModeRequested = false;
 bool		StandbyMode = false;
 
+#define StandbyWithCascadeReplication() (AmStartupProcess() && StandbyMode && AllowCascadeReplication())
+
 /* was a signal file present at startup? */
 static bool standby_signal_file_found = false;
 static bool recovery_signal_file_found = false;
@@ -298,6 +301,14 @@ bool		reachedConsistency = false;
 static char *replay_image_masked = NULL;
 static char *primary_image_masked = NULL;
 
+/* Maximum number of applied records in batch before notifying walsender during cascade replication */
+int cascadeReplicationMaxBatchSize;
+
+/* Maximum batching delay before notifying walsender during cascade replication */
+int cascadeReplicationMaxBatchDelay;
+
+/* Counter for collected records during cascade replication */
+static volatile sig_atomic_t appliedRecords = 0;
 
 /*
  * Shared-memory state for WAL recovery.
@@ -1839,6 +1850,17 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Ensure that notification for batched messages is sent */
+		if (StandbyWithCascadeReplication() &&
+			cascadeReplicationMaxBatchSize > 1 &&
+			appliedRecords > 0)
+		{
+			if (cascadeReplicationMaxBatchDelay > 0)
+				disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
+			appliedRecords = 0;
+			WalSndWakeup(false, true);
+		}
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -2037,8 +2059,30 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	 *be created otherwise)
 	 * --
 	 */
-	if (AllowCascadeReplication())
-		WalSndWakeup(switchedTLI, true);
+
+	if (StandbyWithCascadeReplication())
+	{
+		if (cascadeReplicationMaxBatchSize <= 1)
+			WalSndWakeup(switchedTLI, true);
+		else
+		{
+			bool batchFlushRequired = ++appliedRecords >=
+				cascadeReplicationMaxBatchSize;
+			if (batchFlushRequired)
+			{
+				appliedRecords = 0;
+				if (cascadeReplicationMaxBatchDelay > 0)
+					disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
+			}
+
+			WalSndWakeup(switchedTLI, batchFlushRequired);
+
+			/* Setup timeout to limit maximum delay for notifications */
+			if (appliedRecords == 1 && cascadeReplicationMaxBatchDelay > 0)
+				enable_timeout_after(STANDBY_CASCADE_WAL_SEND_TIMEOUT,
+									 cascadeReplicationMaxBatchDelay);
+		}
+	}
 
 	/*
 	 * If rm_redo called XLogRequestWalReceiverReply, then we wake up the
@@ -5064,3 +5108,16 @@ assign_recovery_target_xid(const char *newval, void *extra)
 	else
 		recoveryTarget = RECOVERY_TARGET_UNSET;
 }
+
+/*
+ * Timer handler for batch notifications in cascade replication
+ */
+void
+StandbyWalSendTimeoutHandler(void)
+{
+	if (appliedRecords > 0)
+	{
+		appliedRe

High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues

2025-03-29 Thread Alexey Makhmutov
 ourselves on the 
list of processes which need to receive our signal). I think that the 
correct approach here is to simply return control to the caller right 
after the wait if ‘cv_sleep_target’ has changed. We do not need to add 
ourselves to the wakeup list in that case, as the caller will check the 
exit condition in its loop and then either invoke 
ConditionVariableCancelSleep or ConditionVariableTimedSleep again. Both 
functions will correctly update the CV state, so we won’t miss any 
notifications or leave the CV in an improper state.
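
For reference, the caller-side wait pattern we rely on looks roughly like 
this (a sketch only; the CV field, predicate and wait event names are 
placeholders rather than anything taken from the patch):

	ConditionVariablePrepareToSleep(&shared->cv);
	while (!exit_condition(shared))
	{
		/*
		 * A "spurious" false return from the sleep call is harmless here:
		 * the loop re-checks the predicate and, if the wait is still
		 * needed, sleeps again, which re-registers us on the wait list.
		 */
		if (ConditionVariableTimedSleep(&shared->cv, timeout_ms,
										WAIT_EVENT_SOME_CONDITION))
			break;				/* timed out */
	}
	ConditionVariableCancelSleep();

So returning early when ‘cv_sleep_target’ no longer points to our CV simply 
makes the loop iterate once more, which callers already have to tolerate.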


We’ve prepared two test patches on top of current master to address both 
issues:
* 0001-CV-correctly-handle-cv_sleep_target-change.patch – adjusts the CV 
logic to correctly handle the case where ‘cv_sleep_target’ changes during 
sleep.
* 0002-Implement-batching-for-cascade-replication.patch – a test patch 
implementing a possible batching approach in xlogrecovery.c with a 
timer. Currently it uses GUC variables to allow testing of different 
batch sizes and timeout values.


With both patches applied we’ve noticed a significant reduction in CPU 
consumption while using the synthetic test program mentioned above. If 
only the second patch is applied to the build, the problem with the CV 
can be reproduced by building the server with cassert enabled and 
running the TAP tests in the ‘src/test/recovery’ directory.


It would be great to hear any thoughts on these observations and the 
fixing approaches, as well as possible pitfalls of the proposed changes.


Thanks,
Alexey
From e5cfabfa7fd8f05f1a7c8421e60087d672fb59e4 Mon Sep 17 00:00:00 2001
From: Alexey Makhmutov 
Date: Fri, 28 Mar 2025 16:16:36 +0300
Subject: [PATCH 1/2] Ensure that content of the wait list on CV is aligned
 with cv_sleep_target value

A wait on a ConditionVariable could be interleaved with interrupt handlers,
such as timers. If such a handler uses CV calls (e.g. ConditionVariableBroadcast),
then cv_sleep_target could be NULL or point to a different CV after wakeup. In
this case we should not try to add ourselves to the wakeup wait list, as a
subsequent call to ConditionVariableCancelSleep won't be able to find our CV to
clear its list. We should instead return control to the caller, so the exit
condition of the CV's external wait loop can be checked. If the wait is still
required, the next invocation of ConditionVariableTimedSleep will perform the
required setup procedures for the CV. The behavior before this change results
in a mismatch between the cv_sleep_target value and the contents of the wakeup
list, so the next invocation of ConditionVariableBroadcast may fail due to the
presence of the current process in the list.
---
 src/backend/storage/lmgr/condition_variable.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 228303e77f7..3b5d9764403 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -165,6 +165,17 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		/* Reset latch before examining the state of the wait list. */
 		ResetLatch(MyLatch);
 
+		/*
+		 * First, check that we are still expected to wait on this variable.
+		 * If sleep target has changed, then we need to return spuriously to
+		 * allow caller to check the exit condition and invoke the wait
+		 * function again to properly prepare for the next sleep.
+		 * Note, that we don't need to change wait list in this case, as we
+		 * should be deleted from the list in case of cv_sleep_target change.
+		 */
+		if (cv != cv_sleep_target)
+			return false;
+
 		/*
 		 * If this process has been taken out of the wait list, then we know
 		 * that it has been signaled by ConditionVariableSignal (or
-- 
2.49.0

From 75d3e41f0833845b7abc9dff6dc9d4857c1c340c Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin 
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each WAL
record during cascade replication. This creates a bottleneck in WalSndWakeup
when the number of walsender processes is large. This change introduces
batching for such notifications, which are now sent either after a certain
number of applied records or after a specified time interval, whichever
comes first.

Co-authored-by: Alexey Makhmutov 
---
 src/backend/access/transam/xlogrecovery.c | 68 ++-
 src/backend/postmaster/startup.c  |  1 +
 src/backend/utils/misc/guc_tables.c   | 22 
 src/include/access/xlogrecovery.h |  4 ++
 src/include/utils/timeout.h   |  1 +
 5 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..84a6c176593 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/back

Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues

2025-04-22 Thread Alexey Makhmutov
After playing with the patch a little more, we’ve come to the 
conclusion that the idea of signaling the CV broadcast from the timer 
handler is a risky one, as it creates a large number of execution paths 
which are not present in the current code base. It's hard to prove the 
correctness of application behavior in each such case, so we've decided 
to use a different approach.


In the new version of the patch we use the timer handler only to set a 
flag, which is then checked in the ProcessStartupProcInterrupts 
function. This allows us to send the signal on timeout if the startup 
process is waiting for the arrival of new WAL records (in ReadRecord), 
while the notification may be delayed while a record is being applied 
(during redo handler invocation from ApplyWalRecord). This could 
increase the delay in some corner cases with non-trivial WAL records 
like ‘drop database’, but such cases should be rare, and the walsender 
process has its own limit on the wait time, so the delay won’t be 
indefinite even then.
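
Roughly, the new scheme looks like this (a sketch only; the flag and the 
check-helper names are hypothetical, and the actual patch wires the check 
into ProcessStartupProcInterrupts):

	/* Set from the timeout handler, checked by the startup process */
	static volatile sig_atomic_t walSndWakeupPending = false;

	void
	StandbyWalSendTimeoutHandler(void)
	{
		/* Signal-safe: just remember that a batched notification is due */
		walSndWakeupPending = true;
	}

	/* Hypothetical helper invoked from ProcessStartupProcInterrupts() */
	static void
	CheckPendingWalSndWakeup(void)
	{
		if (walSndWakeupPending)
		{
			walSndWakeupPending = false;
			appliedRecords = 0;			/* assumed: start a new batch */
			WalSndWakeup(false, true);	/* wake only logical walsenders */
		}
	}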


A new variant of the patch is attached in case anybody else wants to 
play with this approach. A slightly improved test case and the 
formatted CV patch (which is no longer strictly required for this case) 
are attached as well.


Thanks,
Alexey

From b6749b210cceabcc0c382935ec412182b047fbb4 Mon Sep 17 00:00:00 2001
From: Alexey Makhmutov 
Date: Fri, 28 Mar 2025 16:16:36 +0300
Subject: [PATCH 1/2] Ensure that content of the wait list on CV is aligned
 with cv_sleep_target value

A wait on a ConditionVariable could be interleaved with interrupt handlers,
such as timers. If such a handler uses CV calls (e.g. ConditionVariableBroadcast),
then cv_sleep_target could be NULL or point to a different CV after wakeup. In
this case we should not try to add ourselves to the wakeup wait list, as a
subsequent call to ConditionVariableCancelSleep won't be able to find our CV to
clear its list. We should instead return control to the caller, so the exit
condition of the CV's external wait loop can be checked. If the wait is still
required, the next invocation of ConditionVariableTimedSleep will perform the
required setup procedures for the CV. The behavior before this change results
in a mismatch between the cv_sleep_target value and the contents of the wakeup
list, so the next invocation of ConditionVariableBroadcast may fail due to the
presence of the current process in the list.
---
 src/backend/storage/lmgr/condition_variable.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 228303e77f7..ab4ad7e5641 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -165,6 +165,17 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		/* Reset latch before examining the state of the wait list. */
 		ResetLatch(MyLatch);
 
+		/*
+		 * First, check that we are still expected to wait on this variable.
+		 * If sleep target has changed, then we need to return spuriously to
+		 * allow caller to check the exit condition and invoke the wait
+		 * function again to properly prepare for the next sleep. Note, that
+		 * we don't need to change wait list in this case, as we should be
+		 * deleted from the list in case of cv_sleep_target change.
+		 */
+		if (cv != cv_sleep_target)
+			return false;
+
 		/*
 		 * If this process has been taken out of the wait list, then we know
 		 * that it has been signaled by ConditionVariableSignal (or
-- 
2.49.0

From 6eaeb7a7fbb40fc3a8fb04a2ea22df9a54747a0a Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin 
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each WAL
record during cascade replication. This creates a bottleneck in WalSndWakeup
when the number of walsender processes is large. This change introduces
batching for such notifications, which are now sent either after a certain
number of applied records or after a specified time interval, whichever
comes first.

Co-authored-by: Alexey Makhmutov 
---
 src/backend/access/transam/xlogrecovery.c | 99 ++-
 src/backend/postmaster/startup.c  |  5 ++
 src/backend/utils/misc/guc_tables.c   | 22 +
 src/include/access/xlogrecovery.h |  8 ++
 src/include/utils/timeout.h   |  1 +
 5 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..43e95602223 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -64,6 +64,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsu