Hi, On Tue, Jun 09, 2026 at 03:49:50PM +0800, Xuneng Zhou wrote: > On Mon, Jun 8, 2026 at 10:34 PM Xuneng Zhou <[email protected]> wrote: > > > > On Mon, Jun 8, 2026 at 10:22 PM Bertrand Drouvot > > <[email protected]> wrote: > > I've readed through the patch set. They look good overall.
Thanks for the review! > Here're > some comments on them: > > 1) In the commit messages and comments for all four patches, the > reason why the target WAL segment cannot be found on the old timeline > is described as follows: > > "old timeline WAL segments have already been removed or > recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery()." > > Is mentioning the 'remove' case only a bit narrow? > > The timeline-selection comment says this explicitly: > "there's no guarantee the old segment will still exist. It may have been > deleted or renamed with a .partial suffix" > > How about phrasing it like: > old timeline WAL files may have been removed, recycled, or renamed to > .partial. > > After running the reproducer provided by Hayato-san, the standby’s > pg_wal directory looked like this following the failure: > 000000010000000000000003.partial > 00000002.history > 000000020000000000000003 > 000000020000000000000004 > > So in this repro, the requested file: > > 000000010000000000000003 > > was not unlinked as a regular "removed" file. It had been renamed to: > > 000000010000000000000003.partial > > but the log says this explicitly: > ERROR: requested WAL segment 000000010000000000000003 has already been removed > > It appears inconsistent to me... I'm not sure. The error message says "has already been removed" and the commit messages and comments says"removed or recycled": those are consistent with the error message. We're describing the symptom from the walsender's perspective, not the exact file operation that caused it. > 2) Injection points in tests 0002 and 0004 > > It does not prove this: > walsender has reached logical_read_xlog_page() while startup is paused > > 3) Stricter synchronization point in both tests > Both tests use this condition "active_pid IS NOT NULL" for > synchronization at the walsender side. However, it only proves that > pg_recvlogical has connected walsender has acquired the logical slot, > not necessarily the walsender is paused after acquiring the slot and > before the promotion window is set. There are several potential states > for walsender in this condition: > > walsender is just after ReplicationSlotAcquire() > it has called XLogBeginRead() > it is already inside logical_read_xlog_page() > it already opened the WAL segment > it already failed or succeeded > > The test cannot distinguish those states reliably. > > So we may still need another injection point for synchronization at > the walsender side I agree that with v1 the test could have been fragile. It's fixed in v2 without having to add a second injection point. All we have to do is to ensure that the decoding occurred while the startup is paused on the new injection point. 0002 does that by starting the new walsender and doing the decoding while the startup is paused 0004 does that by ensuring the pre-connected session triggers the decoding while the startup is paused > 4) Stricter result checks in test files > The surrounding 035 test is stricter than the test in 0002. It first > waits for COMMITs, then compares exact decoded output. Should we > adhere to this pattern too? That's done in v2 (and also adress Hayato-san feedback). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
>From 3c32323624425559c256b3ec59853029609e1b65 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Tue, 9 Jun 2026 09:53:44 +0000 Subject: [PATCH v2 1/4] Fix race condition in logical decoding timeline selection during promotion During promotion, there is a window where RecoveryInProgress() still returns true but old timeline WAL segments have already been removed or recycled by RemoveNonParentXlogFiles() in CleanupAfterArchiveRecovery(). This is because, in StartupXLOG(), WAL segments are cleaned up before SharedRecoveryState transitions to RECOVERY_STATE_DONE. If a walsender performing logical decoding calls logical_read_xlog_page() during this window, it would get the old timeline from GetXLogReplayRecPtr(), then attempt to open a WAL segment on that old timeline which no longer exists, resulting in: ERROR: requested WAL segment ... has already been removed Fix by checking GetWALInsertionTimeLineIfSet() when RecoveryInProgress() returns true. If InsertTimeLineID is already set (non-zero), the new timeline is established and we use it directly, avoiding attempts to read from segments that may have been removed. Reported-by: Alexander Lakhin <[email protected]> Author: Bertrand Drouvot <[email protected]> Reviewed-by: Xuneng Zhou <[email protected]> Reviewed-by: Hayato Kuroda <[email protected]> Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- src/backend/replication/walsender.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) 100.0% src/backend/replication/ diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 04aa770d981..e80ed052077 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1104,7 +1104,24 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req am_cascading_walsender = RecoveryInProgress(); if (am_cascading_walsender) - GetXLogReplayRecPtr(&currTLI); + { + TimeLineID insertTLI; + + /* + * If the insertion timeline has already been set, use it. + * InsertTimeLineID is set before old timeline WAL segments are + * removed and before SharedRecoveryState flips to + * RECOVERY_STATE_DONE. So there is a window where + * RecoveryInProgress() still returns true but the old timeline's WAL + * segments have already been removed or recycled. Using the insertion + * timeline avoids attempting to read from those removed segments. + */ + insertTLI = GetWALInsertionTimeLineIfSet(); + if (insertTLI != 0) + currTLI = insertTLI; + else + GetXLogReplayRecPtr(&currTLI); + } else currTLI = GetWALInsertionTimeLine(); -- 2.34.1
>From 3247f796350791b9b2fcf48ec9360e2ef2bb4140 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Tue, 9 Jun 2026 09:54:28 +0000 Subject: [PATCH v2 2/4] Add injection-point test for logical decoding timeline race during promotion Add an injection point "promotion-after-wal-segment-cleanup" in StartupXLOG(), right after CleanupAfterArchiveRecovery() removes old timeline WAL segments but before SharedRecoveryState is set to RECOVERY_STATE_DONE. Add a test in 035_standby_logical_decoding.pl that uses this injection point to deterministically reproduce the race condition where a walsender doing logical decoding would pick the old timeline after segment removal, resulting in: ERROR: requested WAL segment ... has already been removed The test pauses the startup process at the injection point, starts pg_recvlogical (whose walsender must read WAL from the removed segment), verifies decoding succeeds while startup is still paused, then resumes promotion. Author: Bertrand Drouvot <[email protected]> Reviewed-by: Xuneng Zhou <[email protected]> Reviewed-by: Hayato Kuroda <[email protected]> Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- src/backend/access/transam/xlog.c | 2 + .../t/035_standby_logical_decoding.pl | 74 +++++++++++++++++++ 2 files changed, 76 insertions(+) 97.8% src/test/recovery/t/ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d69d03b2ef3..6c2304fef33 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6571,6 +6571,8 @@ StartupXLOG(void) if (ArchiveRecoveryRequested) CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog, newTLI); + INJECTION_POINT("promotion-after-wal-segment-cleanup", NULL); + /* * Local WAL inserts enabled, so it's time to finish initialization of * commit timestamp. diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 4421059f100..9b45c819241 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -1060,4 +1060,78 @@ is($cascading_stdout, $expected, 'got same expected output from pg_recvlogical decoding session on cascading standby' ); +################################################## +# Test that logical decoding on standby correctly handles the timeline +# change during promotion. There is a window during promotion where +# RecoveryInProgress() still returns true but old timeline WAL segments +# have already been removed. Verify the walsender uses the correct +# timeline in this window. +################################################## + +# Create a logical slot on the cascading standby for this test. +$node_cascading_standby->create_logical_slot_on_standby($node_standby, + 'race_slot', 'testdb'); + +# Insert data so the slot has WAL to decode. +$node_standby->safe_psql('testdb', + qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(10,13) s;] +); +$node_standby->wait_for_replay_catchup($node_cascading_standby); + +$expected = q{BEGIN +table public.decoding_test: INSERT: x[integer]:10 y[text]:'10' +table public.decoding_test: INSERT: x[integer]:11 y[text]:'11' +table public.decoding_test: INSERT: x[integer]:12 y[text]:'12' +table public.decoding_test: INSERT: x[integer]:13 y[text]:'13' +COMMIT}; + +# Create the injection_points extension on the cascading standby. +$node_standby->safe_psql('testdb', 'CREATE EXTENSION injection_points;'); +$node_standby->wait_for_replay_catchup($node_cascading_standby); + +# Attach injection point to pause startup after WAL segment cleanup +# but before RecoveryInProgress() flips to false. +$node_cascading_standby->safe_psql('testdb', + "SELECT injection_points_attach('promotion-after-wal-segment-cleanup', 'wait');" +); + +# Promote with no-wait so we can synchronize with the injection point. +$node_cascading_standby->safe_psql('testdb', "SELECT pg_promote(false)"); + +# Wait for startup to pause after removing old timeline WAL segments. +$node_cascading_standby->wait_for_event('startup', + 'promotion-after-wal-segment-cleanup'); + +# Start pg_recvlogical. +my ($stdout2, $stderr2); +my $handle2 = IPC::Run::start( + [ + 'pg_recvlogical', + '--dbname' => $node_cascading_standby->connstr('testdb'), + '--slot' => 'race_slot', + '--option' => 'include-xids=0', + '--option' => 'skip-empty-xacts=1', + '--file' => '-', + '--no-loop', + '--start', + ], + '>' => \$stdout2, + '2>' => \$stderr2, + IPC::Run::timeout($default_timeout)); + +# Verify pg_recvlogical successfully decodes the data while startup is still +# paused. This proves the walsender went through logical_read_xlog_page() +# and selected the correct timeline in the race window. +$pump_timeout = IPC::Run::timer($default_timeout); +ok( pump_until($handle2, $pump_timeout, \$stdout2, qr/COMMIT/s), + 'pg_recvlogical works during promotion timeline switch'); +chomp($stdout2); +is($stdout2, $expected, + 'got expected output from pg_recvlogical during promotion timeline switch' +); + +# Resume promotion. +$node_cascading_standby->safe_psql('testdb', + "SELECT injection_points_wakeup('promotion-after-wal-segment-cleanup');"); + done_testing(); -- 2.34.1
>From 2b8cac32063b1977ed42f15a867e6ff22aba9c40 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Tue, 9 Jun 2026 09:54:45 +0000 Subject: [PATCH v2 3/4] Fix race condition in read_local_xlog_page_guts() timeline selection during promotion read_local_xlog_page_guts() has the same race as logical_read_xlog_page() (that has been fixed in commit xxxxx): when RecoveryInProgress() returns true during promotion, it gets the old timeline from GetXLogReplayRecPtr() while old WAL segments may already be removed. This affects SQL-callable logical decoding functions (pg_logical_slot_get_changes, pg_logical_slot_peek_changes) running on a standby during promotion. Apply the same fix: use GetWALInsertionTimeLineIfSet() to detect that the insertion timeline is already established and use it directly. Reported-by: Xuneng Zhou <[email protected]> Author: Bertrand Drouvot <[email protected]> Reviewed-by: Xuneng Zhou <[email protected]> Reviewed-by: Hayato Kuroda <[email protected]> Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- src/backend/access/transam/xlogutils.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) 100.0% src/backend/access/transam/ diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 5fbe39133b8..fdc341d8fa4 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -896,7 +896,19 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr, if (!RecoveryInProgress()) read_upto = GetFlushRecPtr(&currTLI); else + { + TimeLineID insertTLI; + read_upto = GetXLogReplayRecPtr(&currTLI); + + /* + * If the insertion timeline has already been set, use it. See + * logical_read_xlog_page() for details. + */ + insertTLI = GetWALInsertionTimeLineIfSet(); + if (insertTLI != 0) + currTLI = insertTLI; + } tli = currTLI; /* -- 2.34.1
>From 1e4fca4590258d6f1a70660f3b6da241ec82d080 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot <[email protected]> Date: Tue, 9 Jun 2026 09:56:05 +0000 Subject: [PATCH v2 4/4] Add SQL-path test for read_local_xlog_page_guts() timeline race Extend the injection-point test in 035_standby_logical_decoding.pl to also cover the SQL-callable decoding path (pg_logical_slot_peek_changes) which exercises read_local_xlog_page_guts(). The test uses the same promotion and injection point as the walsender test: a background psql session is connected before promotion, then issues pg_logical_slot_peek_changes while startup is paused at the injection point. Author: Bertrand Drouvot <[email protected]> Reviewed-by: Xuneng Zhou <[email protected]> Reviewed-by: Hayato Kuroda <[email protected]> Discussion: https://postgr.es/m/7daef094-abf3-4672-bc23-3df4763b16a3%40gmail.com --- .../recovery/t/035_standby_logical_decoding.pl | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) 100.0% src/test/recovery/t/ diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl index 9b45c819241..6ec338ebf32 100644 --- a/src/test/recovery/t/035_standby_logical_decoding.pl +++ b/src/test/recovery/t/035_standby_logical_decoding.pl @@ -1068,11 +1068,13 @@ is($cascading_stdout, $expected, # timeline in this window. ################################################## -# Create a logical slot on the cascading standby for this test. +# Create logical slots on the cascading standby for this test. $node_cascading_standby->create_logical_slot_on_standby($node_standby, 'race_slot', 'testdb'); +$node_cascading_standby->create_logical_slot_on_standby($node_standby, + 'race_slot_sql', 'testdb'); -# Insert data so the slot has WAL to decode. +# Insert data so the slots have WAL to decode. $node_standby->safe_psql('testdb', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(10,13) s;] ); @@ -1089,6 +1091,9 @@ COMMIT}; $node_standby->safe_psql('testdb', 'CREATE EXTENSION injection_points;'); $node_standby->wait_for_replay_catchup($node_cascading_standby); +# Open a background psql session BEFORE promotion for the SQL decoding test. +my $decode_session = $node_cascading_standby->background_psql('testdb'); + # Attach injection point to pause startup after WAL segment cleanup # but before RecoveryInProgress() flips to false. $node_cascading_standby->safe_psql('testdb', @@ -1130,6 +1135,14 @@ is($stdout2, $expected, 'got expected output from pg_recvlogical during promotion timeline switch' ); +# Verify SQL decoding (read_local_xlog_page_guts path) on the pre-connected +# session. +my $sql_out = $decode_session->query_safe( + "SELECT data FROM pg_logical_slot_peek_changes('race_slot_sql', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1')" +); +is($sql_out, $expected, + 'pg_logical_slot_peek_changes works during promotion timeline switch'); + # Resume promotion. $node_cascading_standby->safe_psql('testdb', "SELECT injection_points_wakeup('promotion-after-wal-segment-cleanup');"); -- 2.34.1
