Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
> We’ve prepared two test patches on top of current master to address both issues:
> ...
> * 0002-Implement-batching-for-cascade-replication.patch – test patch to implement a possible batching approach in xlogrecovery.c with a timer. Currently it uses GUC variables to allow testing of different batch sizes and timeout values.

I've played with the second patch a little more and made some adjustments to it:

1. Set up the timer only if we actually have applied records which are (potentially) not yet signaled to walsenders.
2. Notify logical walsenders without delay if the timeline has changed.

The modified patch is attached.

Thanks,
Alexey

From c691724332dd5a78b98d606188383d6b37c98021 Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each
WAL record during cascade replication. This creates a bottleneck in the
case of a large number of sender processes during WalSndWakeup
invocation. This change introduces batching for such notifications,
which are now sent either after a certain number of applied records or
after a specified time interval (whichever comes first).

Co-authored-by: Alexey Makhmutov
---
 src/backend/access/transam/xlogrecovery.c | 61 ++-
 src/backend/postmaster/startup.c          |  1 +
 src/backend/utils/misc/guc_tables.c       | 22 +
 src/include/access/xlogrecovery.h         |  4 ++
 src/include/utils/timeout.h               |  1 +
 5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..3e3e43d5888 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -64,6 +64,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE "recovery.conf"
@@ -147,6 +148,8 @@ bool InArchiveRecovery = false;
 static bool StandbyModeRequested = false;
 bool StandbyMode = false;
 
+#define StandbyWithCascadeReplication() (AmStartupProcess() && StandbyMode && AllowCascadeReplication())
+
 /* was a signal file present at startup? */
 static bool standby_signal_file_found = false;
 static bool recovery_signal_file_found = false;
@@ -298,6 +301,14 @@ bool reachedConsistency = false;
 
 static char *replay_image_masked = NULL;
 static char *primary_image_masked = NULL;
 
+/* Maximum number of applied records in batch before notifying walsender during cascade replication */
+int cascadeReplicationMaxBatchSize;
+
+/* Maximum batching delay before notifying walsender during cascade replication */
+int cascadeReplicationMaxBatchDelay;
+
+/* Counter for collected records during cascade replication */
+static volatile sig_atomic_t appliedRecords = 0;
 
 /*
  * Shared-memory state for WAL recovery.
@@ -1839,6 +1850,17 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Ensure that notification for batched messages is sent */
+		if (StandbyWithCascadeReplication() &&
+			cascadeReplicationMaxBatchSize > 1 &&
+			appliedRecords > 0)
+		{
+			if (cascadeReplicationMaxBatchDelay > 0)
+				disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
+			appliedRecords = 0;
+			WalSndWakeup(false, true);
+		}
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -2037,8 +2059,30 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *replayTLI)
 	 *	  be created otherwise)
 	 * ------
 	 */
-	if (AllowCascadeReplication())
-		WalSndWakeup(switchedTLI, true);
+
+	if (StandbyWithCascadeReplication())
+	{
+		if (cascadeReplicationMaxBatchSize <= 1)
+			WalSndWakeup(switchedTLI, true);
+		else
+		{
+			bool		batchFlushRequired = ++appliedRecords >=
+				cascadeReplicationMaxBatchSize;
+			if (batchFlushRequired)
+			{
+				appliedRecords = 0;
+				if (cascadeReplicationMaxBatchDelay > 0)
+					disable_timeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, false);
+			}
+
+			WalSndWakeup(switchedTLI, batchFlushRequired);
+
+			/* Setup timeout to limit maximum delay for notifications */
+			if (appliedRecords == 1 && cascadeReplicationMaxBatchDelay > 0)
+				enable_timeout_after(STANDBY_CASCADE_WAL_SEND_TIMEOUT,
+									 cascadeReplicationMaxBatchDelay);
+		}
+	}
 
 	/*
 	 * If rm_redo called XLogRequestWalReceiverReply, then we wake up the
@@ -5064,3 +5108,16 @@ assign_recovery_target_xid(const char *newval, void *extra)
 	else
 		recoveryTarget = RECOVERY_TARGET_UNSET;
 }
+
+/*
+ * Timer handler for batch notifications in cascade replication
+ */
+void
+StandbyWalSendTimeoutHandler(void)
+{
+	if (appliedRecords > 0)
+	{
+		appliedRecords = 0;
+		WalSndWakeup(false, true);
+	}
+}
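For readers following along: the patch drives its timer through the generic timeout facility in src/include/utils/timeout.h (RegisterTimeout, enable_timeout_after, disable_timeout). The startup.c and timeout.h hunks are not visible in the excerpt above, but given the one-line diffstats they presumably just add the new STANDBY_CASCADE_WAL_SEND_TIMEOUT id and register the handler. A minimal sketch of that wiring, assuming the registration happens in StartupProcessMain, would be:

    /* src/include/utils/timeout.h: extend the TimeoutId enum */
    STANDBY_CASCADE_WAL_SEND_TIMEOUT,

    /*
     * src/backend/postmaster/startup.c, in StartupProcessMain(): bind the id
     * to the handler, so that enable_timeout_after() in ApplyWalRecord() can
     * arm it.
     */
    RegisterTimeout(STANDBY_CASCADE_WAL_SEND_TIMEOUT, StandbyWalSendTimeoutHandler);

With this in place, enable_timeout_after(STANDBY_CASCADE_WAL_SEND_TIMEOUT, cascadeReplicationMaxBatchDelay) schedules a single SIGALRM-driven callback after the given number of milliseconds, and disable_timeout() cancels it if the batch fills up first.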
High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
ourselves on the list of processes which need to receive our signal). I think that the correct approach here is to just return control to the caller right after the wait if ‘cv_sleep_target’ has changed. We do not need to add ourselves to the wakeup list in such a case, as the caller will check the exit condition in its loop and then either invoke ConditionVariableCancelSleep or ConditionVariableTimedSleep again. Both functions will correctly update the CV state, so we won’t miss any notifications or leave the CV in an improper state.

We’ve prepared two test patches on top of current master to address both issues:

* 0001-CV-correctly-handle-cv_sleep_target-change.patch – adjusts the CV logic to correctly process the case where ‘cv_sleep_target’ changes during sleep.
* 0002-Implement-batching-for-cascade-replication.patch – test patch to implement a possible batching approach in xlogrecovery.c with a timer. Currently it uses GUC variables to allow testing of different batch sizes and timeout values.

With both patches applied we’ve noticed a significant reduction in CPU consumption while using the synthetic test program mentioned above. If only the second patch is applied to the build, then the problem with the CV could be reproduced by building the server with cassert enabled and running the TAP tests in the ‘src/test/recovery’ directory.

It would be great to hear any thoughts on these observations and fixing approaches, as well as possible pitfalls of the proposed changes.

Thanks,
Alexey

From e5cfabfa7fd8f05f1a7c8421e60087d672fb59e4 Mon Sep 17 00:00:00 2001
From: Alexey Makhmutov
Date: Fri, 28 Mar 2025 16:16:36 +0300
Subject: [PATCH 1/2] Ensure that content of the wait list on CV is aligned
 with cv_sleep_target value

A wait on a ConditionVariable could be interleaved with interrupt
handlers, such as timers. If such a handler itself uses CV calls (e.g.
ConditionVariableBroadcast), then cv_sleep_target could be null or
point to a different CV after wakeup. In this case we should not try to
add ourselves to the wakeup wait list, as a subsequent call to
ConditionVariableCancelSleep won't be able to find our CV to clear its
list. We should instead return control to the caller, so the exit
condition of the external CV wait loop can be checked. If the wait is
still required, then the next invocation of ConditionVariableTimedSleep
will perform the required setup procedures for the CV.

The behavior before this change results in a mismatch between the
cv_sleep_target value and the content of the wakeup list, so the next
invocation of ConditionVariableBroadcast may fail due to the presence
of the current process in the list.
---
 src/backend/storage/lmgr/condition_variable.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 228303e77f7..3b5d9764403 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -165,6 +165,17 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		/* Reset latch before examining the state of the wait list. */
 		ResetLatch(MyLatch);
 
+		/*
+		 * First, check that we are still expected to wait on this variable.
+		 * If the sleep target has changed, then we need to return spuriously
+		 * to allow the caller to check the exit condition and invoke the
+		 * wait function again to properly prepare for the next sleep. Note
+		 * that we don't need to change the wait list in this case, as we
+		 * should already be deleted from the list on cv_sleep_target change.
+		 */
+		if (cv != cv_sleep_target)
+			return false;
+
 		/*
 		 * If this process has been taken out of the wait list, then we know
 		 * that it has been signaled by ConditionVariableSignal (or
-- 
2.49.0

From 75d3e41f0833845b7abc9dff6dc9d4857c1c340c Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each
WAL record during cascade replication. This creates a bottleneck in the
case of a large number of sender processes during WalSndWakeup
invocation. This change introduces batching for such notifications,
which are now sent either after a certain number of applied records or
after a specified time interval (whichever comes first).

Co-authored-by: Alexey Makhmutov
---
 src/backend/access/transam/xlogrecovery.c | 68 ++-
 src/backend/postmaster/startup.c          |  1 +
 src/backend/utils/misc/guc_tables.c       | 22 +
 src/include/access/xlogrecovery.h         |  4 ++
 src/include/utils/timeout.h               |  1 +
 5 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 0aa3ab59085..84a6c176593 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
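To see why returning early here is safe, recall the usage contract for condition variables: every caller sleeps inside a retry loop and re-checks its own exit condition after each wakeup. A minimal sketch of that canonical loop is below; WaitForMyCondition and shared_flag_is_set are hypothetical names for illustration, while the ConditionVariable* calls are the real API from storage/condition_variable.h:

    #include "storage/condition_variable.h"
    #include "utils/wait_event.h"

    /* Hypothetical caller: wait until shared_flag_is_set() becomes true. */
    static void
    WaitForMyCondition(ConditionVariable *cv)
    {
        ConditionVariablePrepareToSleep(cv);
        while (!shared_flag_is_set())   /* exit condition rechecked on every wakeup */
        {
            /*
             * A spurious return from the sleep (e.g. because cv_sleep_target
             * changed under an interrupt handler) is harmless here: we just
             * loop around, and ConditionVariableSleep() re-prepares the
             * sleep if we are no longer registered on this CV.
             */
            ConditionVariableSleep(cv, WAIT_EVENT_MQ_INTERNAL);
        }
        ConditionVariableCancelSleep();
    }

Because both ConditionVariableSleep and ConditionVariableCancelSleep reconcile the process's wait-list registration with cv_sleep_target, returning false on a target change cannot lose a notification: the caller either observes its condition already satisfied or re-registers before sleeping again.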
Re: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
After playing with the patch a little more, we’ve come to the conclusion that the idea of signaling the CV broadcast from the timer handler is a risky one, as it creates a large number of execution paths which are not present in the current code base. It’s hard to prove the correctness of application behavior in each such case, so we’ve decided to use a different approach. In the new version of the patch we use the timer handler only to set a flag, which is then checked in the ProcessStartupProcInterrupts function. This allows us to send the signal on timeout if the startup process is waiting for the arrival of new WAL records (in ReadRecord), but the notification may be delayed while a record is being applied (during redo handler invocation from ApplyWalRecord). This could increase the delay for some corner cases with non-trivial WAL records like ‘drop database’, but this should be a rare case, and walsender processes have their own limit on the wait time, so the delay won’t be indefinite even in this case. A sketch of this flag-based flow is included after the patch below.

A new variant of the patch is attached in case anybody else wants to play with this approach. A slightly improved test case and the formatted CV patch (which is not strictly required for this case anymore) are attached as well.

Thanks,
Alexey

From b6749b210cceabcc0c382935ec412182b047fbb4 Mon Sep 17 00:00:00 2001
From: Alexey Makhmutov
Date: Fri, 28 Mar 2025 16:16:36 +0300
Subject: [PATCH 1/2] Ensure that content of the wait list on CV is aligned
 with cv_sleep_target value

A wait on a ConditionVariable could be interleaved with interrupt
handlers, such as timers. If such a handler itself uses CV calls (e.g.
ConditionVariableBroadcast), then cv_sleep_target could be null or
point to a different CV after wakeup. In this case we should not try to
add ourselves to the wakeup wait list, as a subsequent call to
ConditionVariableCancelSleep won't be able to find our CV to clear its
list. We should instead return control to the caller, so the exit
condition of the external CV wait loop can be checked. If the wait is
still required, then the next invocation of ConditionVariableTimedSleep
will perform the required setup procedures for the CV.

The behavior before this change results in a mismatch between the
cv_sleep_target value and the content of the wakeup list, so the next
invocation of ConditionVariableBroadcast may fail due to the presence
of the current process in the list.
---
 src/backend/storage/lmgr/condition_variable.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 228303e77f7..ab4ad7e5641 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -165,6 +165,17 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
 		/* Reset latch before examining the state of the wait list. */
 		ResetLatch(MyLatch);
 
+		/*
+		 * First, check that we are still expected to wait on this variable.
+		 * If the sleep target has changed, then we need to return spuriously
+		 * to allow the caller to check the exit condition and invoke the
+		 * wait function again to properly prepare for the next sleep. Note
+		 * that we don't need to change the wait list in this case, as we
+		 * should already be deleted from the list on cv_sleep_target change.
+		 */
+		if (cv != cv_sleep_target)
+			return false;
+
 		/*
 		 * If this process has been taken out of the wait list, then we know
 		 * that it has been signaled by ConditionVariableSignal (or
-- 
2.49.0

From 6eaeb7a7fbb40fc3a8fb04a2ea22df9a54747a0a Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin
Date: Fri, 14 Mar 2025 18:18:34 +0700
Subject: [PATCH 2/2] Implement batching for WAL records notification during
 cascade replication

Currently the standby server notifies walsenders after applying each
WAL record during cascade replication. This creates a bottleneck in the
case of a large number of sender processes during WalSndWakeup
invocation. This change introduces batching for such notifications,
which are now sent either after a certain number of applied records or
after a specified time interval (whichever comes first).

Co-authored-by: Alexey Makhmutov
---
 src/backend/access/transam/xlogrecovery.c | 99 ++-
 src/backend/postmaster/startup.c          |  5 ++
 src/backend/utils/misc/guc_tables.c       | 22 +
 src/include/access/xlogrecovery.h         |  8 ++
 src/include/utils/timeout.h               |  1 +
 5 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..43e95602223 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -64,6 +64,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
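Since the remainder of the patch is cut off above, here is a rough sketch of the flag-based shape described in the mail. The handler and counter names follow the earlier patch version; walSndWakeupTimeoutPending and HandleCascadeWalSendTimeout are assumed names for illustration, not taken from the visible diff:

    /*
     * In xlogrecovery.c: the timer handler only records that the timeout
     * fired and pokes our own latch; no CV traffic from signal context.
     */
    static volatile sig_atomic_t walSndWakeupTimeoutPending = false;

    void
    StandbyWalSendTimeoutHandler(void)
    {
        walSndWakeupTimeoutPending = true;
        SetLatch(MyLatch);      /* wake the startup process if it sleeps in ReadRecord */
    }

    /*
     * Called from ProcessStartupProcInterrupts(), i.e. in normal backend
     * context, where invoking ConditionVariableBroadcast via WalSndWakeup
     * is safe.
     */
    void
    HandleCascadeWalSendTimeout(void)
    {
        if (walSndWakeupTimeoutPending && appliedRecords > 0)
        {
            walSndWakeupTimeoutPending = false;
            appliedRecords = 0;
            WalSndWakeup(false, true);
        }
    }

This keeps the broadcast out of the signal handler entirely, which is exactly the property the message above argues for: the timer may fire at any point, but the wakeup itself happens only at a well-defined interrupt-processing point.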