On Thursday, January 11, 2024 11:42 PM Bertrand Drouvot <bertranddrouvot...@gmail.com> wrote:
Hi, > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot > > > +*remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ else { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback > > > is temporarily set to off, the above shouldn't happen. It can only > > > lead to invalidated slots on standby. > > I also think so. > > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. > > > If we follow this then we probably need to have a check for > > > catalog_xmin as well. > > IIUC, this would be a sync slot (so not usable until promotion) that could > not be > used anyway (invalidated), so I'll vote for drop / re-create then. Such race can happen when user drop and re-create the same failover slot on primary as well. For example, user dropped one failover slot and them immediately created a new one by copying from an old slot(using pg_copy_logical_replication_slot). Then the slotsync worker will find the restart_lsn of this slot go backwards. The steps: ---- SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput', false, false, true); SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput', false, false, true); - Advance the restart_lsn of 'test' slot CREATE TABLE test2(a int); INSERT INTO test2 SELECT generate_series(1,10000,1); SELECT slot_name FROM pg_replication_slot_advance('test', pg_current_wal_lsn()); - re-create the test slot but based on the old isolation_slot. SELECT pg_drop_replication_slot('test'); SELECT 'copy' FROM pg_copy_logical_replication_slot('isolation_slot', 'test'); Then the restart_lsn of 'test' slot will go backwards. Best Regards, Hou zj