> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>

I tried to reproduce this and found a case where pre-existing
replication objects can cause an unwanted scenario:

Suppose we have a setup of nodes N1, N2 and N3. N1 and N2 are in
streaming replication, where N1 is the primary and N2 is the standby.
N3 and N1 are in logical replication, where N3 is the publisher and N1
is the subscriber. The subscription created on N1 is replicated to N2
by streaming replication.

Now, after we run pg_createsubscriber on N2 and start the N2 server, we
get the following logs repeatedly:

2024-05-22 11:37:18.619 IST [27344] ERROR: could not start WAL streaming: ERROR: replication slot "test1" is active for PID 27202
2024-05-22 11:37:18.622 IST [27317] LOG: background worker "logical replication apply worker" (PID 27344) exited with exit code 1
2024-05-22 11:37:23.610 IST [27349] LOG: logical replication apply worker for subscription "test1" has started
2024-05-22 11:37:23.624 IST [27349] ERROR: could not start WAL streaming: ERROR: replication slot "test1" is active for PID 27202
2024-05-22 11:37:23.627 IST [27317] LOG: background worker "logical replication apply worker" (PID 27349) exited with exit code 1
2024-05-22 11:37:28.616 IST [27382] LOG: logical replication apply worker for subscription "test1" has started

Note: 'test1' is the name of the subscription created on N1 initially,
and by default the slot name is the same as the subscription name.

Once the N2 server is started after running pg_createsubscriber, the
subscription that was earlier replicated by streaming replication will
try to connect to the publisher. Since the subscription on N2 has the
same name as the subscription created on N1, it cannot start streaming
from the replication slot, because the slot with that name is already
active for logical replication between N3 and N1.

There is also the case where N1 goes down for some time. In that case
the subscription on N2 will connect to the publication on N3, and data
from N3 will be replicated to N2 instead of N1. Once N1 is up again,
the subscription on N1 will not be able to connect to the publication
on N3, because the slot is already in use by N2.
This can lead to data inconsistency.

This error did not happen before running pg_createsubscriber on the
standby node N2, because there is no 'logical replication launcher'
process on a standby node.

Thanks and Regards,
Shlok Kyal
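
PS: For illustration only, here is a toy Python model of the slot contention described above. It is not PostgreSQL code. The names ReplicationSlot, SlotBusyError, try_start and demo, and the PIDs 100 and 200, are made up for this sketch. It assumes only what the logs show: a slot that is active for one PID rejects any other process that tries to stream from it.

```python
class SlotBusyError(Exception):
    """Raised when the slot is already active for another PID."""


class ReplicationSlot:
    """Toy stand-in for the logical slot "test1" on the publisher N3."""

    def __init__(self, name):
        self.name = name
        self.active_pid = None  # PID currently streaming from this slot

    def acquire(self, pid):
        # Only one process may stream from the slot at a time.
        if self.active_pid is not None and self.active_pid != pid:
            raise SlotBusyError(
                f'replication slot "{self.name}" is active for PID {self.active_pid}'
            )
        self.active_pid = pid

    def release(self, pid):
        # Models the holder disconnecting, e.g. its node going down.
        if self.active_pid == pid:
            self.active_pid = None


def try_start(slot, pid):
    """Return "streaming" on success, or an error string on conflict."""
    try:
        slot.acquire(pid)
        return "streaming"
    except SlotBusyError as exc:
        return f"ERROR: {exc}"


def demo():
    slot = ReplicationSlot("test1")
    n1_pid, n2_pid = 100, 200  # hypothetical PIDs serving N1 and N2
    events = []
    events.append(("N1", try_start(slot, n1_pid)))  # N1 holds the slot
    events.append(("N2", try_start(slot, n2_pid)))  # N2 is rejected
    slot.release(n1_pid)                            # N1 goes down
    events.append(("N2", try_start(slot, n2_pid)))  # N2 takes over the slot
    events.append(("N1", try_start(slot, n1_pid)))  # N1 is now rejected
    return events


if __name__ == "__main__":
    for node, result in demo():
        print(f"{node}: {result}")
```

The second and fourth events correspond to the two scenarios above: N2 failing repeatedly while N1 is connected, and N1 being locked out after N2 has taken over the slot.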