Hi! Thanks for the review! v2 patch attached: comment changed, count(1) replaced with count(*).
Hi Sergey,

Thanks for the report and patch. I think the analysis is right, and the fix is in the right place.

The gap traces back to commit 7185eddf, which deliberately dropped the wait_for_catchup() and switched the primary from teardown_node() to a clean stop(), on the grounds that a clean stop flushes all WAL to both standbys before exiting. That's true, but only for standbys whose walsender is *connected* at shutdown time -- and ->start() only waits for the postmaster to accept connections, not for the standby's walreceiver to have connected back to the primary. So if a standby hasn't connected yet when the primary stops, the clean-shutdown flush skips it, and we're back to the exact "standbys received different amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was meant to fix.

Polling pg_stat_replication until both walsenders are present closes that hole: it re-establishes the precondition the clean-stop design silently assumed. And connection is enough here -- the walsender shutdown path sends all WAL up to the shutdown checkpoint regardless of catchup state -- so there's no need to additionally check state = 'streaming'.

One small thing: the rest of this file uses count(*), so I'd write count(*) = 2 rather than count(1) = 2 just for local consistency. And the comment reads a little better as something like "Wait until both standbys have connected to the primary", since by this point they've already started -- what we're waiting for is the connection.

Regards,
Ewan

On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev <[email protected]> wrote:

Hi hackers!

I found that after commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 we have a small omission in the TAP test 004_timeline_switch:

...
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
...
$node_primary->stop;

There is no guarantee that standby_1 and standby_2 have successfully connected to the primary and started streaming before the primary stops.

I think we must ensure that the primary knows about standby_1 and standby_2.

--
With best regards,
Sergey Tatarintsev,
PostgresPro
--
With best regards,
Sergey Tatarintsev,
PostgresPro
From dd449396060c71c34bcb03e5c1c5de0cb0868da0 Mon Sep 17 00:00:00 2001
From: Sergey Tatarintsev <[email protected]>
Date: Tue, 16 Jun 2026 11:57:39 +0700
Subject: [PATCH] Fix 004_timeline_switch TAP test: wait for standbys starts before primary stops

---
 src/test/recovery/t/004_timeline_switch.pl | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index e0b3851927c..c87f079cef8 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -30,6 +30,10 @@ $node_standby_2->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# Wait until both standbys have connected to the primary
+$node_primary->poll_query_until('postgres',
+	"SELECT count(*) = 2 FROM pg_stat_replication");
+
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-- 
2.43.0
