On Wed, Mar 4, 2026 at 1:06 PM shveta malik <[email protected]> wrote:
>
> On Wed, Mar 4, 2026 at 12:26 PM Zhijie Hou (Fujitsu)
> <[email protected]> wrote:
> >
> > On Tuesday, February 17, 2026 12:16 PM Amit Kapila
> > <[email protected]> wrote:
> > > On Tue, Feb 17, 2026 at 9:13 AM shveta malik <[email protected]>
> > > wrote:
> > > >
> > > > On Mon, Feb 16, 2026 at 4:35 PM Amit Kapila <[email protected]>
> > > > wrote:
> > > > >
> > > > > On Fri, Feb 13, 2026 at 7:54 AM Zhijie Hou (Fujitsu)
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Thanks for pushing! Here are the remaining patches.
> > > > > >
> > > > >
> > > > > One thing that bothers me about the remaining patch is that it could
> > > > > lead to infinite retries in the worst case. For example, in the first
> > > > > try, slot-1 is not synced, say due to physical replication delays in
> > > > > flushing WALs up to the confirmed_flush_lsn of that slot; then in the
> > > > > next (re-)try, the same thing happens for slot-2; then in the next
> > > > > (re-)try, slot-3 appears to be invalidated on the standby but is
> > > > > valid on the primary, and so on. What do you think?
> > > >
> > > > Yes, that is a possibility we cannot rule out. This can also happen
> > > > during the first invocation of the API (even without the new changes):
> > > > when we attempt to create new slots, they may remain in a temporary
> > > > state indefinitely. However, that risk is limited to the initial sync,
> > > > until the slots are persisted, which is somewhat expected behavior.
> > > >
> > >
> > > Right.
> > >
> > > > With the current changes though, the possibility of an indefinite wait
> > > > exists during every run. So the question becomes: what would be more
> > > > desirable for users -- for the API to finish with the risk that a few
> > > > slots are not synced, or for the API to wait longer to ensure that all
> > > > slots are properly synced?
> > > >
> > > > I think that if the primary use case of this API is when a user plans
> > > > to run it before a scheduled failover, then it would be better for the
> > > > API to wait and ensure everything is properly synced.
> > > >
> > >
> > > I don't think we can guarantee that all slots are synced as per the
> > > latest primary state in one invocation, because some newly created
> > > slots can anyway be missed. So why take the risk of infinite waits in
> > > the API? I think it may be better to extend the usage of this API
> > > (probably with more parameters) based on more user feedback.
> > >
> >
> > I agree that we could wait for more user feedback before extending the
> > retry logic to persisted slots.
> >
> > > > > Independent of whether we consider the entire patch, the following
> > > > > bit in the patch is useful as we retry to sync the slots via API.
> > > > > @@ -218,7 +219,7 @@ update_local_synced_slot(RemoteSlot
> > > > > *remote_slot, Oid remote_dbid)
> > > > > * Can get here only if GUC 'synchronized_standby_slots' on the
> > > > > * primary server was not configured correctly.
> > > > > */
> > > > > - ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
> > > > > + ereport(LOG,
> > > > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > > > errmsg("skipping slot synchronization because the received slot sync"
> > > > > " LSN %X/%08X for slot \"%s\" is ahead of the standby position
> > > > > %X/%08X",
> > > > >
> > > >
> > > > Yes, I agree.
> > > >
> > >
> > > Let's wait for Hou-San's opinion on this one.
> >
> > +1 for changing this.
> >
> > Here is the patch set to convert the elevel to LOG so that the function
> > cyclically retries until the standby catches up and the slot is
> > successfully persisted.
> >
> > patch001 has:
> > - logical decoding and must be dropped after promotion. See
> > + logical decoding and must be dropped after promotion. This function
> > + retries cyclically until all the failover slots that existed on
> > + primary at the start of the function call are synchronized. See
>
> IIUC, this is not completely true though. If the slot is persisted, we
> do not retry cyclically now; we skip the sync. Isn't it?
>
> Right, but even if one of the slots is not persisted, it will try to
> sync again, no? Shall this be changed to:
>
> It retries cyclically until all the failover slots that existed on
> primary at the start of the function call are persisted and are
> sync-ready.
>

This is too internal-specific. I find Hou-san's version better w.r.t.
user-facing docs.

 * RS_TEMPORARY. Once the decoding from corresponding LSNs can reach a
 * consistent point, they will be marked as RS_PERSISTENT.
 *
+ * If the WAL prior to the remote slot's confirmed_flush_lsn has not been
+ * flushed on the standby, the slot is marked as RS_TEMPORARY. Once the
+ * standby catches up and flushes that WAL, the slot is promoted to
+ * RS_PERSISTENT.

BTW, I don't think we need these additional comments.

--
With Regards,
Amit Kapila.
