Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
>
> Looks like sanitizer failures. There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs. So maybe something else.

It seems that there are several types of failures, [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between the walreceiver exiting and
pg_is_in_recovery() starting to return false.

According to the output [1], the tool failed in wait_for_end_recovery() with
the message "standby server disconnected from the primary". Also, the lines
"redo done at..." and "terminating walreceiver process due to administrator
command" mean that the walreceiver was requested to shut down by
XLogShutdownWalRcv().

According to the source, the walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). SharedRecoveryState
is changed to RECOVERY_STATE_DONE (after which pg_is_in_recovery() returns
false) only in the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and the state change, the
check in wait_for_end_recovery() fails during that window. Since we allow the
walreceiver to be missing 10 times and the check runs once per second, the
failure occurs if the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
larger, but that is not a fundamental solution.

## Analysis for failure 2

According to [2], the physical replication slot specified as
primary_slot_name was not in use by a walsender process. At that time, no
walsender existed yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
...
```

Currently the standby must be stopped before running the command, and the
current code does not wait to ensure that replication has started. So the
check may run before the walsender is launched.

One possible approach is to wait until the replication starts.
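
A minimal, untested sketch of the waiting approach could look like the
following. The names wait_for_slot_active() and fake_slot_probe() are
hypothetical and are not existing pg_createsubscriber code; the probe stands
in for the pg_replication_slots query shown in the log, and a real version
would sleep between polls and run the query on the publisher.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical probe: returns true once the slot is in use by a walsender. */
typedef bool (*slot_probe_fn) (void *arg);

/*
 * Hypothetical helper, not existing pg_createsubscriber code: poll the probe
 * up to max_attempts times and report whether the slot ever became active.
 */
static bool
wait_for_slot_active(slot_probe_fn probe, void *arg, int max_attempts)
{
	assert(probe != NULL && max_attempts > 0);

	for (int attempt = 0; attempt < max_attempts; attempt++)
	{
		if (probe(arg))
			return true;
		/* Real code would sleep here before the next poll. */
	}

	return false;				/* gave up: the walsender never showed up */
}

/*
 * Stand-in for the SQL check (assumption, for illustration only): reports
 * "inactive" for the first *remaining polls, then "active".
 */
static bool
fake_slot_probe(void *arg)
{
	int		   *remaining = (int *) arg;

	if (*remaining > 0)
	{
		(*remaining)--;
		return false;
	}

	return true;
}
```

With a bounded number of attempts, the tool would still fail if the standby
never connects, but it would no longer fail just because the check ran before
the walsender was launched.
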

An alternative is to relax the condition. What do you think?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-03-25%2013%3A03%3A07
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/