Hi,
On 10/24/23 7:44 AM, Ajin Cherian wrote:
On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
<bertranddrouvot...@gmail.com> wrote:
@@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
}
+ /* set failover in the slot, as requested */
+ slot->data.failover = ctx->failover;
+
I think we can get rid of this change in CreateDecodingContext().
Yes, I too noticed this in my testing, however just removing this from
CreateDecodingContext will not allow us to change the slot's failover flag
using Alter subscription.
Oh right.
I am thinking of moving this change to
StartLogicalReplication prior to calling CreateDecodingContext by
parsing the command options in StartReplicationCmd
without adding it to the LogicalDecodingContext.
Yeah, that looks like a good place to update "failover".
Doing more testing and I have a couple of remarks about he current behavior.
1) Let's imagine that:
- there is no standby
- standby_slot_names is set to a valid slot on the primary (but due to the
above, not linked to any standby)
- then a create subscription on a subscriber WITH (failover = true) would start
the
synchronisation but never finish (means leaving a "synchronisation" slot like
"pg_32811_sync_24576_7293415241672430356"
in place coming from ReplicationSlotNameForTablesync()).
That's expected, but maybe we should emit a warning in
WalSndWaitForStandbyConfirmation() on the primary when there is
a slot part of standby_slot_names which is not active/does not have an
active_pid attached to it?
2) When we create a subscription, another slot is created during the
subscription synchronization, namely
like "pg_16397_sync_16388_7293447291374081805" (coming from
ReplicationSlotNameForTablesync()).
This extra slot appears to have failover also set to true.
So, If the standby refresh the list of slot to sync when the subscription is
still synchronizing we'd see things like
on the standby:
LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756)
to pass local slot LSN (0/C0034840) and and catalog xmin (756)
LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog
xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)
LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN
(0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog
xmin (756)
WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the
primary, aborting slot creation
I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have
failover set to true. If there is a failover
during the subscription creation, better to re-launch the subscription instead?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com