On Tue, Dec 5, 2023 at 2:18 PM Drouvot, Bertrand <bertranddrouvot...@gmail.com> wrote: > > Hi, > > On 12/5/23 6:08 AM, shveta malik wrote: > > On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > > <bertranddrouvot...@gmail.com> wrote: > >> Maybe another option could be to have the walreceiver a way to let the > >> slot sync > >> worker knows that it (the walreceiver) was not able to start due to non > >> existing > >> replication slot on the primary? (that way we'd avoid the slot sync worker > >> having > >> to talk to the primary). > > > > Few points: > > 1) I think if we do it, we should do it in generic way i.e. slotsync > > worker should go to no-op if walreceiver is not able to start due to > > any reason and not only due to invalid primary_slot_name. > > Agree. > > > 2) Secondly, slotsync worker needs to make sure it has synced the > > slots so far i.e. worker should not go to no-op immediately on seeing > > missing WalRcv process if there are pending slots to be synced. > > Agree. > > > So the generic way I see to have this optimization is: > > 1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver > > is running or not. > > Not sure that would work because the walreceiver keeps try re-starting > and so get a pid before reaching the "could not start WAL streaming: ERROR: > replication slot "XXXX" does not exist" > error. >
yes, right. pid will keep on toggling. > We may want to add an extra check on walrcv->walRcvState (or should/could be > enough by its own). > But walrcv->walRcvState is set to WALRCV_STREAMING way before > walrcv_startstreaming(). > Agree. Check on 'walrcv->walRcvState' alone should suffice. > Wouldn't that make sense to move it once we are sure that > walrcv_startstreaming() returns true and first_stream is true, here? > > " > if (first_stream) > + { > ereport(LOG, > (errmsg("started streaming > WAL from primary at %X/%X on timeline %u", > > LSN_FORMAT_ARGS(startpoint), startpointTLI))); > + SpinLockAcquire(&walrcv->mutex); > + walrcv->walRcvState = WALRCV_STREAMING; > + SpinLockRelease(&walrcv->mutex); > + } > " > Yes, it makes sense and is the basis for current slot-sync worker changes being discussed. > > 2) Slotsync worker should check null 'WalRcv->pid' only when > > no-activity is observed for threshold time i.e. it can do it during > > existing logic of increasing naptime. > > 3) On finding null 'WalRcv->pid', worker can mark a flag to go to > > no-op unless WalRcv->pid becomes valid again. Marking this flag during > > increasing naptime will guarantee that the worker has taken all the > > changes so far i.e. standby is not lagging in terms of slots. > > > > 2) and 3) looks good to me but with a check on walrcv->walRcvState > looking for WALRCV_STREAMING state instead of looking for a non null > WalRcv->pid. yes. But I think, the worker should enter no-op, when walRcvState is WALRCV_STOPPED and not when walRcvState != WALRCV_STREAMING as it is okay to have WALRCV_WAITING/STARTING/RESTARTING. But the worker should exit no-op only when it finds walRcvState switched back to WALRCV_STREAMING. > > And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING > as > mentioned above (I think it does). > yes, I agree. thanks Shveta