On Wed, 22 May 2024 at 16:50, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 4:30 PM Shlok Kyal <shlok.kyal....@gmail.com> wrote:
> > >
> > > I was trying to test this utility when 'sync_replication_slots' is on
> > > and it gets into an ERROR loop [1] and never finishes. Please find the
> > > postgresql.auto used on the standby attached. I think if the standby
> > > has enabled sync_slots, you need to pass dbname in
> > > GenerateRecoveryConfig(). I couldn't test it further, but I wonder, if
> > > there are already synced slots on the standby (either due to
> > > 'sync_replication_slots' or because users have used
> > > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > > whether those would be retained as is on the new subscriber and lead to
> > > unnecessary WAL retention and dead rows.
> > >
> > > [1]
> > > 2024-04-30 11:50:43.239 IST [12536] LOG: slot sync worker started
> > > 2024-04-30 11:50:43.247 IST [12536] ERROR: slot synchronization
> > > requires dbname to be specified in primary_conninfo
> >
> > Hi,
> >
> > I tested the scenario posted by Amit in [1], in which retaining synced
> > slots leads to unnecessary WAL retention and an ERROR. This is raised as
> > the second open point in [2].
> >
> > The steps to reproduce the issue:
> > (1) Set up physical replication with the slot sync feature turned on,
> > either by setting sync_replication_slots = 'true' or by using
> > pg_sync_replication_slots() on the standby node. For the physical
> > replication setup, run pg_basebackup with the -R and -d options.
> > (2) Create a logical replication slot on the primary node with the
> > failover option set to true. A corresponding slot is created on the
> > standby as part of the slot sync feature.
> > (3) Run pg_createsubscriber on the standby node.
> > (4) On checking the replication slots on the standby node, I noticed
> > that the logical slots created in step 2 are retained.
> > I have attached the script to reproduce the issue.
> >
> > Kuroda-san and I worked to resolve the open points. Here are patches to
> > solve the second and third points in [2]. The patches proposed by Euler
> > are also attached just in case, but they were not modified.
> >
> > v2-0001: not changed
> >
>
> Shouldn't we modify it as per the suggestion given in the email [1]? I am
> wondering if we can entirely get rid of the whole business of checking the
> primary and simply rely on recovery_timeout, checking
> server_is_in_recovery() in a loop. If so, we can modify the test to use a
> non-default recovery_timeout (say 180s, or something similar if we have
> used it at any other place). As an additional check, we can ensure that
> consistent_lsn is present on the standby.
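
The reproduction steps quoted above correspond roughly to the following shell
sketch (ports, paths, slot names, and connection strings are illustrative
assumptions; this is an approximation, not the script attached to the quoted
mail):

# (1) Primary with wal_level = logical; standby created with pg_basebackup -R -d
#     so that primary_conninfo carries a dbname, plus slot sync settings.
initdb -D primary
echo "wal_level = logical" >> primary/postgresql.conf
pg_ctl -D primary -l primary.log start
psql -p 5432 -d postgres -c "SELECT pg_create_physical_replication_slot('sb_slot')"

pg_basebackup -D standby -R -d "port=5432 dbname=postgres"
cat >> standby/postgresql.conf <<EOF
port = 5433
primary_slot_name = 'sb_slot'
hot_standby_feedback = on
sync_replication_slots = on
EOF
pg_ctl -D standby -l standby.log start

# (2) Logical slot on the primary with failover = true; the slot sync worker
#     then creates a matching slot on the standby.
psql -p 5432 -d postgres -c "SELECT pg_create_logical_replication_slot('failover_slot', 'pgoutput', false, false, true)"
sleep 5   # give the slot sync worker time to create the synced slot

# (3) Convert the standby into a subscriber (it must be shut down cleanly first).
pg_ctl -D standby stop
pg_createsubscriber -D standby -d postgres -P "port=5432 dbname=postgres" -p 5433

# (4) After restarting the new subscriber, the slot synced in step (2) is
#     still present.
pg_ctl -D standby -l standby.log start
psql -p 5433 -d postgres -c "SELECT slot_name, synced FROM pg_replication_slots"
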
I have made changes as per your suggestion. I have used
pg_last_wal_receive_lsn() to get the LSN of the last WAL received on the
standby and to check whether the consistent_lsn is already present on the
standby; the sketch below illustrates the kind of check I mean.

I have only made changes in v3-0001. v3-0002, v3-0003, v3-0004 and v3-0005
are not modified.

Thanks and Regards,
Shlok Kyal
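
A minimal sketch of that check, run against the standby (an illustration
only, not the patch itself; the port and the $CONSISTENT_LSN placeholder are
assumptions for the example):

# On the standby, recovery can be considered caught up once the LSN taken as
# consistent_lsn on the primary has been received. $CONSISTENT_LSN stands for
# the LSN pg_createsubscriber recorded when setting up the replication slots;
# 5433 is an assumed standby port.
psql -p 5433 -d postgres -Atc \
  "SELECT pg_is_in_recovery(),
          pg_last_wal_receive_lsn() >= '$CONSISTENT_LSN'::pg_lsn"
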
v3-0005-Avoid-outputing-some-messages-in-dry_run-mode.patch
v3-0003-Disable-slot-sync-during-the-convertion.patch
v3-0001-Improve-the-code-that-checks-if-the-recovery-is-f.patch
v3-0004-Drop-replication-slots-which-had-been-synchronize.patch
v3-0002-Improve-the-code-that-checks-if-the-primary-slot-.patch