Hi, On Mon, Jun 08, 2026 at 08:47:48AM +0000, Bertrand Drouvot wrote: > Attached: > > 0001: To fix this race > > Fix by checking GetWALInsertionTimeLineIfSet() when RecoveryInProgress() > returns true. If InsertTimeLineID is already set (non-zero), the new timeline > is > established and we use it directly, avoiding attempts to read from segments > that > may have been removed. > > 0002: Adding a test in 035_standby_logical_decoding.pl > > It makes use of a new injection point "promotion-after-wal-segment-cleanup" in > StartupXLOG(), right after CleanupAfterArchiveRecovery() removes old timeline > WAL segments but before SharedRecoveryState is set to RECOVERY_STATE_DONE. > > The test fails without the fix in 0001 so it also somehow proves that the > diagnostic is right. > > 0003: Apply the same timeline fix to read_local_xlog_page_guts() > > Indeed, it could hit the same race as mentioned by Xuneng-San. > > 0004: Add a test for 0003
Re-attaching v1 patches here to have the cfbot focusing on those. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
>From 16f5402418e4115c1308e64fa6b6d1e5db778789 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Mon, 8 Jun 2026 05:52:09 +0000 Subject: [PATCH v1 1/4] Fix race condition in logical decoding timeline selection during promotion During promotion, there is a window where RecoveryInProgress() still returns true but old timeline WAL segments have already been removed or recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery(). This is because, in StartupXLOG(), WAL segments are cleaned up before SharedRecoveryState transitions to RECOVERY_STATE_DONE. If a walsender performing logical decoding calls logical_read_xlog_page() during this window, it would get the old timeline from GetXLogReplayRecPtr(), then attempt to open a WAL segment on that old timeline which no longer exists, resulting in: ERROR: requested WAL segment ... has already been removed Fix by checking GetWALInsertionTimeLineIfSet() when RecoveryInProgress() returns true. If InsertTimeLineID is already set (non-zero), the new timeline is established and we use it directly, avoiding attempts to read from segments that may have been removed. Reported-by: Alexander Lakhin <[email protected]> Author: Bertrand Drouvot <[email protected]> Reviewed-by: Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- src/backend/replication/walsender.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) 100.0% src/backend/replication/ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 04aa770d981..e80ed052077 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1104,7 +1104,24 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req am_cascading_walsender = RecoveryInProgress(); if (am_cascading_walsender) - GetXLogReplayRecPtr(&currTLI); + { + TimeLineID insertTLI; + + /* + * If the insertion timeline has already been set, use it. + * InsertTimeLineID is set before old timeline WAL segments are + * removed and before SharedRecoveryState flips to + * RECOVERY_STATE_DONE. So there is a window where + * RecoveryInProgress() still returns true but the old timeline's WAL + * segments have already been removed or recycled. Using the insertion + * timeline avoids attempting to read from those removed segments. + */ + insertTLI = GetWALInsertionTimeLineIfSet(); + if (insertTLI != 0) + currTLI = insertTLI; + else + GetXLogReplayRecPtr(&currTLI); + } else currTLI = GetWALInsertionTimeLine(); -- 2.34.1
>From 4fdf81fc15f89f6c53cb1780ccf69f96b72de623 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Mon, 8 Jun 2026 07:04:16 +0000 Subject: [PATCH v1 2/4] Add injection-point test for logical decoding timeline race during promotion Add an injection point "promotion-after-wal-segment-cleanup" in StartupXLOG(), right after CleanupAfterArchiveRecovery() removes old timeline WAL segments but before SharedRecoveryState is set to RECOVERY_STATE_DONE. Add a test in 035_standby_logical_decoding.pl that uses this injection point to deterministically reproduce the race condition where a walsender doing logical decoding would pick the old timeline after segment removal, resulting in: ERROR: requested WAL segment ... has already been removed The test pauses the startup process at the injection point, starts pg_recvlogical (whose walsender must read WAL from the removed segment), then resumes promotion and verifies decoding succeeds. Author: Bertrand Drouvot <[email protected]> Reviewed-by: Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- src/backend/access/transam/xlog.c | 2 + .../t/035_standby_logical_decoding.pl | 66 +++++++++++++++++++ 2 files changed, 68 insertions(+) 97.5% src/test/recovery/t/ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d69d03b2ef3..6c2304fef33 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6571,6 +6571,8 @@ StartupXLOG(void) if (ArchiveRecoveryRequested) CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI); + INJECTION_POINT("promotion-after-wal-segment-cleanup", NULL); + /* * Local WAL inserts enabled, so it's time to finish initialization of * commit timestamp. diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 4421059f100..ce80123844d 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -1060,4 +1060,70 @@ is($cascading_stdout, $expected, 'got same expected output from pg_recvlogical decoding session on cascading standby' ); +################################################## +# Test that logical decoding on standby correctly handles the timeline +# change during promotion. There is a window during promotion where +# RecoveryInProgress() still returns true but old timeline WAL segments +# have already been removed. Verify the walsender uses the correct +# timeline in this window. +################################################## + +# Create a logical slot on the cascading standby for this test. +$node_cascading_standby->create_logical_slot_on_standby($node_standby, + 'race_slot', 'testdb'); + +# Insert data so the slot has WAL to decode. +$node_standby->safe_psql('testdb', + qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(10,13) s;] +); +$node_standby->wait_for_replay_catchup($node_cascading_standby); + +# Create the injection_points extension on the cascading standby. +$node_standby->safe_psql('testdb', 'CREATE EXTENSION injection_points;'); +$node_standby->wait_for_replay_catchup($node_cascading_standby); + +# Attach injection point to pause startup after WAL segment cleanup +# but before RecoveryInProgress() flips to false. +$node_cascading_standby->safe_psql('testdb', + "SELECT injection_points_attach('promotion-after-wal-segment-cleanup', 'wait');" +); + +# Promote with no-wait so we can synchronize with the injection point. +$node_cascading_standby->safe_psql('testdb', "SELECT pg_promote(false)"); + +# Wait for startup to pause after removing old timeline WAL segments. +$node_cascading_standby->wait_for_event('startup', + 'promotion-after-wal-segment-cleanup'); + +# Start pg_recvlogical +my ($stdout2, $stderr2); +my $handle2 = IPC::Run::start( + [ + 'pg_recvlogical', + '--dbname' => $node_cascading_standby->connstr('testdb'), + '--slot' => 'race_slot', + '--option' => 'include-xids=0', + '--option' => 'skip-empty-xacts=1', + '--file' => '-', + '--no-loop', + '--start', + ], + '>' => \$stdout2, + '2>' => \$stderr2, + IPC::Run::timeout($default_timeout)); + +# Wait for the walsender to acquire the slot. +$node_cascading_standby->poll_query_until('testdb', + qq[SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'race_slot' AND active_pid IS NOT NULL)] +) or die "slot race_slot never became active"; + +# Resume promotion. +$node_cascading_standby->safe_psql('testdb', + "SELECT injection_points_wakeup('promotion-after-wal-segment-cleanup');"); + +# Verify pg_recvlogical successfully decodes the data. +$pump_timeout = IPC::Run::timer($default_timeout); +ok( pump_until($handle2, $pump_timeout, \$stdout2, qr/COMMIT/s), + 'pg_recvlogical works during promotion timeline switch'); + done_testing(); -- 2.34.1
>From 70374fd88eebae77b53f65957624b1e9929f1685 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Mon, 8 Jun 2026 07:04:43 +0000 Subject: [PATCH v1 3/4] Apply the same timeline fix to read_local_xlog_page_guts() read_local_xlog_page_guts() has the same race as logical_read_xlog_page() (that has been fixed in commit xxxxx): when RecoveryInProgress() returns true during promotion, it gets the old timeline from GetXLogReplayRecPtr() while old WAL segments may already be removed. This affects SQL-callable logical decoding functions (pg_logical_slot_get_changes, pg_logical_slot_peek_changes) running on a standby during promotion. Apply the same fix: use GetWALInsertionTimeLineIfSet() to detect that the insertion timeline is already established and use it directly. Reported-by: Xuneng Zhou <[email protected]> Author: Bertrand Drouvot <[email protected]> Reviewed-by: Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- src/backend/access/transam/xlogutils.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) 100.0% src/backend/access/transam/ diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 5fbe39133b8..fdc341d8fa4 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -896,7 +896,19 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr, if (!RecoveryInProgress()) read_upto = GetFlushRecPtr(&currTLI); else + { + TimeLineID insertTLI; + read_upto = GetXLogReplayRecPtr(&currTLI); + + /* + * If the insertion timeline has already been set, use it. See + * logical_read_xlog_page() for details. + */ + insertTLI = GetWALInsertionTimeLineIfSet(); + if (insertTLI != 0) + currTLI = insertTLI; + } tli = currTLI; /* -- 2.34.1
>From 307322a340e2257e8a814479489fc0ea4be19af6 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Mon, 8 Jun 2026 07:05:35 +0000 Subject: [PATCH v1 4/4] Add SQL-path test for read_local_xlog_page_guts() timeline race Extend the injection-point test in 035_standby_logical_decoding.pl to also cover the SQL-callable decoding path (pg_logical_slot_peek_changes) which exercises read_local_xlog_page_guts(). The test uses the same promotion and injection point as the walsender test: a background psql session is connected before promotion, then issues pg_logical_slot_peek_changes while startup is paused at the injection point. Author: Bertrand Drouvot <[email protected]> Reviewed-by: Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- .../t/035_standby_logical_decoding.pl | 28 +++++++++++++++++-- 1 file changed, 25 insertions(+), 3 deletions(-) 100.0% src/test/recovery/t/ diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index ce80123844d..146c33ce39d 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -1068,11 +1068,13 @@ is($cascading_stdout, $expected, # timeline in this window. ################################################## -# Create a logical slot on the cascading standby for this test. +# Create logical slots on the cascading standby for this test. $node_cascading_standby->create_logical_slot_on_standby($node_standby, 'race_slot', 'testdb'); +$node_cascading_standby->create_logical_slot_on_standby($node_standby, + 'race_slot_sql', 'testdb'); -# Insert data so the slot has WAL to decode. +# Insert data so the slots have WAL to decode. $node_standby->safe_psql('testdb', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(10,13) s;] ); @@ -1082,6 +1084,9 @@ $node_standby->wait_for_replay_catchup($node_cascading_standby); $node_standby->safe_psql('testdb', 'CREATE EXTENSION injection_points;'); $node_standby->wait_for_replay_catchup($node_cascading_standby); +# Open a background psql session BEFORE promotion for the SQL decoding test. +my $decode_session = $node_cascading_standby->background_psql('testdb'); + # Attach injection point to pause startup after WAL segment cleanup # but before RecoveryInProgress() flips to false. $node_cascading_standby->safe_psql('testdb', @@ -1112,6 +1117,14 @@ my $handle2 = IPC::Run::start( '2>' => \$stderr2, IPC::Run::timeout($default_timeout)); +# Issue SQL decoding (read_local_xlog_page_guts path) on the pre-connected +# session. +$decode_session->query_until( + qr/decoding_started/, qq( + \\echo decoding_started + SELECT count(*) FROM pg_logical_slot_peek_changes('race_slot_sql', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); +)); + # Wait for the walsender to acquire the slot. $node_cascading_standby->poll_query_until('testdb', qq[SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'race_slot' AND active_pid IS NOT NULL)] @@ -1121,9 +1134,18 @@ $node_cascading_standby->poll_query_until('testdb', $node_cascading_standby->safe_psql('testdb', "SELECT injection_points_wakeup('promotion-after-wal-segment-cleanup');"); -# Verify pg_recvlogical successfully decodes the data. +# Verify pg_recvlogical successfully decodes the data (walsender path). $pump_timeout = IPC::Run::timer($default_timeout); ok( pump_until($handle2, $pump_timeout, \$stdout2, qr/COMMIT/s), 'pg_recvlogical works during promotion timeline switch'); +# Verify SQL decoding returns a numeric result (read_local_xlog_page_guts path). +my $sql_timeout = IPC::Run::timer($default_timeout); +ok( pump_until( + $decode_session->{run}, $sql_timeout, + \$decode_session->{stdout}, qr/\d+/m), + 'pg_logical_slot_peek_changes works during promotion timeline switch'); + +$decode_session->quit; + done_testing(); -- 2.34.1
