On Thursday, July 11, 2024 6:21 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin <exclus...@gmail.com> > wrote: > > > > Please look at another failure of the test [1]: > > [13:28:05.647](2.460s) not ok 26 - failover slot is synced > > [13:28:05.648](0.001s) # Failed test 'failover slot is synced' > > # at > /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/04 > 0_pg_createsubscriber.pl line 307. > > [13:28:05.648](0.000s) # got: '' > > # expected: 'failover_slot' > > > > with 040_pg_createsubscriber_node_s.log containing: > > 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG: > > statement: SELECT pg_sync_replication_slots() > > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG: > > could not sync slot "failover_slot" as remote slot precedes local slot > > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL: > > Remote slot has LSN 0/30047B8 and catalog xmin 743, but local slot has LSN > 0/30047B8 and catalog xmin 744. > > > > I could not reproduce it locally, but I've discovered that that > > subtest somehow depends on pg_createsubscriber executed for the > > 'primary contains unmet conditions on node P' check. For example with > > this test modification: > > @@ -249,7 +249,7 @@ command_fails( > > $node_p->connstr($db1), '--socket-directory', > > $node_s->host, '--subscriber-port', > > $node_s->port, '--database', > > - $db1, '--database', > > + 'XXX', '--database', > > $db2 > > ], > > 'primary contains unmet conditions on node P'); > > > > I see the same failure: > > 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG: > > statement: SELECT pg_sync_replication_slots() > > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG: > > could not sync slot "failover_slot" as remote slot precedes local slot > > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl > > DETAIL: Remote slot has LSN 0/3004780 and catalog xmin 743, but local > slot has LSN 0/3004780 and catalog xmin 744. > > > > Thus maybe even a normal pg_createsubscriber run can affect the > > primary server (it's catalog xmin) differently? > > > > Yes, pg_createsubscriber can affect the primary server's catalog xmin because > it starts the standby server that can send HSFeedback (See > XLogWalRcvSendHSFeedback()), which can advance the physical slot's xmin > corresponding the following Insert in the test: > > # Insert another row on node P and wait node S to catch up > $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')"); > $node_p->wait_for_replay_catchup($node_s); > > In the success case, pg_createsubscriber is able to send HSFeedback and in > the failure case, it won't. We can see the following logs in > 040_pg_createsubscriber_node_p.log: > > 2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL: the database > system is starting up > 2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG: database system > was shut down at 2024-07-08 13:28:00 UTC > 2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG: database > system is ready to accept connections > > This shows that when the test 'primary contains unmet conditions on node P' > starts the standby server the corresponding primary node was not ready > because we just restarted node_p before that test and didn't ensure that the > node_p is up and ready to accept connections before starting the > pg_createsubscriber test. > > Even in the successful cases where the standby is able to connect to primary > for test 'primary contains unmet conditions on node P', there is no guarantee > that xmin of the physical slot will be updated at least, we don't have > anything in > the test to ensure the same. > > Now as before creating logical replication, we didn't ensure that the physical > slot's xmin has been caught up to the latest value, the test can lead to > failure > like: "Remote slot has LSN 0/3004780 and catalog xmin 743, but local slot has > LSN 0/3004780 and catalog xmin 744". > > The xmin on standby could have been advanced due to the following Insert: > # Insert another row on node P and wait node S to catch up > $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')"); > $node_p->wait_for_replay_catchup($node_s); > > We don't wait for the xmin to catch up corresponding to this insert and I > don't > know if there is a way to do that. So, we should move this Insert to after > the call > to pg_sync_replication_slots(). It won't impact the general test of > pg_createsubscriber.
The analysis and suggestion look reasonable to me. Here is a small patch which does the same. > Thanks to Hou-San for helping me in the analysis of this BF failure. Best Regards, Hou zj
0001-fix-unstable-test-in-040_pg_createsubscriber.patch
Description: 0001-fix-unstable-test-in-040_pg_createsubscriber.patch