On Fri, Jan 19, 2024 at 10:35 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> Thank you for updating the patch. I have some comments:
>
> ---
> +    latestWalEnd = GetWalRcvLatestWalEnd();
> +    if (remote_slot->confirmed_lsn > latestWalEnd)
> +    {
> +        elog(ERROR, "exiting from slot synchronization as the received slot sync"
> +             " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X",
> +             LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
> +             remote_slot->name,
> +             LSN_FORMAT_ARGS(latestWalEnd));
> +    }
>
> IIUC GetWalRcvLatestWalEnd() returns walrcv->latestWalEnd, which is
> typically the primary server's flush position and doesn't mean the LSN
> where the walreceiver received/flushed up to.

Yes. I think it makes more sense to use something that actually
reflects the flushed position. I gave it a try by replacing
GetWalRcvLatestWalEnd() with GetWalRcvFlushRecPtr(), but I see a
problem here.

Let's say I have enabled the slot-sync feature in a running standby;
in that case we are all good (flushedUpto is the same as the actual
flush position indicated by LogstreamResult.Flush). But if I restart
the standby, I observe that the startup process sets flushedUpto to
some value 'x' (see [1]), while the walreceiver, when it starts, sets
'LogstreamResult.Flush' to another value (see [2]) which is always
greater than 'x'. And we do not update flushedUpto with the
'LogstreamResult.Flush' value in the walreceiver until we actually
perform an operation on the primary. Performing a data change on the
primary sends WAL to the standby, which then hits XLogWalRcvFlush()
and updates flushedUpto to match LogstreamResult.Flush. Until then we
have a situation where slots received on the standby are ahead of
flushedUpto, and thus the slotsync worker keeps on erroring out.

I am yet to find out why flushedUpto is set to a lower value than
'LogstreamResult.Flush' at the start of the standby. Or maybe I am
using the wrong function, GetWalRcvFlushRecPtr(), and should be using
something else instead?

[1]:
Startup process sets 'flushedUpto' here:
ReadPageInternal-->XLogPageRead-->WaitForWALToBecomeAvailable-->RequestXLogStreaming

[2]:
Walreceiver sets 'LogstreamResult.Flush' here but does not update
'flushedUpto':
WalReceiverMain(): LogstreamResult.Write = LogstreamResult.Flush =
GetXLogReplayRecPtr(NULL)

> Does it really happen
> that the slot's confirmed_flush_lsn is higher than the primary's flush
> lsn?

It may happen if we have not configured standby_slot_names on the
primary. In such a case, slots may get updated without confirming that
the standby has received the change, and thus the slot-sync worker may
fetch slots which have LSNs ahead of the latest WAL position on the
standby.

thanks
Shveta
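
PS: to make the restart scenario above easier to follow, here is a
minimal standalone sketch of it. This is a toy model and not
PostgreSQL code: ToyStandby, ToyLsn and all toy_* functions are
hypothetical names made up for illustration, and the flush logic is a
simplification of what XLogWalRcvFlush() does. It only models the two
values discussed above (the shared flushedUpto and the walreceiver's
local LogstreamResult.Flush) and the "slot is ahead" check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for XLogRecPtr (assumption: a plain 64-bit position). */
typedef uint64_t ToyLsn;

/* Toy model of the two values discussed in this mail. */
typedef struct ToyStandby
{
    ToyLsn shared_flushed_upto; /* models walrcv->flushedUpto */
    ToyLsn local_flush;         /* models LogstreamResult.Flush */
} ToyStandby;

/* Models [1]: the startup process requests streaming from 'start'. */
void
toy_request_streaming(ToyStandby *s, ToyLsn start)
{
    s->shared_flushed_upto = start;
}

/*
 * Models [2]: the walreceiver initializes its local flush position from
 * the replay position, without touching the shared value.
 */
void
toy_walreceiver_start(ToyStandby *s, ToyLsn replay)
{
    s->local_flush = replay;
}

/*
 * Models WAL arriving from the primary and being flushed: only now does
 * the shared value catch up with the local one.
 */
void
toy_flush(ToyStandby *s, ToyLsn upto)
{
    if (upto > s->local_flush)
    {
        s->local_flush = upto;
        if (s->shared_flushed_upto < s->local_flush)
            s->shared_flushed_upto = s->local_flush;
    }
}

/* Models the check in the patch, done against the shared value. */
bool
toy_slot_is_ahead(const ToyStandby *s, ToyLsn confirmed_lsn)
{
    return confirmed_lsn > s->shared_flushed_upto;
}
```

With made-up positions, say the startup process requests streaming at
100 and the walreceiver starts with replay position 150: a slot with
confirmed_lsn 150 is reported as "ahead" until toy_flush() is called
with a later position, which mirrors the behaviour I see after a
standby restart.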