On Tue, Nov 28, 2023 at 3:10 PM shveta malik <shveta.ma...@gmail.com> wrote: > > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand > <bertranddrouvot...@gmail.com> wrote: > > > > Hi, > > > > On 11/28/23 4:13 AM, shveta malik wrote: > > > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapil...@gmail.com> > > > wrote: > > >> > > >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > > >> <houzj.f...@fujitsu.com> wrote: > > >>> > > >>> Here is the updated version(v39_2) which include all the changes made > > >>> in 0002. > > >>> Please use for review, and sorry for the confusion. > > >>> > > >> > > >> --- a/src/backend/replication/logical/launcher.c > > >> +++ b/src/backend/replication/logical/launcher.c > > >> @@ -8,20 +8,27 @@ > > >> * src/backend/replication/logical/launcher.c > > >> * > > >> * NOTES > > >> - * This module contains the logical replication worker launcher which > > >> - * uses the background worker infrastructure to start the logical > > >> - * replication workers for every enabled subscription. > > >> + * This module contains the replication worker launcher which > > >> + * uses the background worker infrastructure to: > > >> + * a) start the logical replication workers for every enabled > > >> subscription > > >> + * when not in standby_mode. > > >> + * b) start the slot sync worker for logical failover slots > > >> synchronization > > >> + * from the primary server when in standby_mode. > > >> > > >> I was wondering do we really need a launcher on standby to invoke > > >> sync-slot worker. If so, why? I guess it may be required for previous > > >> versions where we were managing work for multiple slot-sync workers > > >> which is also questionable in the sense of whether launcher is the > > >> right candidate for the same but now with the single slot-sync worker, > > >> it doesn't seem worth having it. What do you think? > > >> > > >> -- > > > > > > Yes, earlier a manager process was needed to manage multiple slot-sync > > > workers and distribute load among them, but now that does not seem > > > necessary. I gave it a try (PoC) and it seems to work well. If there > > > are no objections to this approach, I can share the patch soon. > > > > > > > +1 on this new approach, thanks! > > PFA v40. This patch has removed Logical Replication Launcher support > to launch slotsync worker. The slot-sync worker is now registered as > bgworker with postmaster, with > bgw_start_time=BgWorkerStart_ConsistentState and > bgw_restart_time=60sec. > > On removal of launcher, now all the validity checks have been shifted > to slot-sync worker itself. This brings us to some point of concerns: > > a) We still need to maintain RecoveryInProgress() check in slotsync > worker. Since worker has the start time of > BgWorkerStart_ConsistentState, it will be started on non-standby as > well. So to ensure that it exists on non-standby, "RecoveryInProgress" > has been introduced at the beginning of the worker. But once it exits, > postmaster will not restart it since it will be clean-exist i.e. > proc_exit(0) (the restart logic of postmaster comes into play only > when there is an abnormal exit). But to exit for the first time on > non-standby, we need that Recovery related check in worker. > > b) "enable_syncslot" check is moved to slotsync worker now. Since > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to > exit the worker if 'enable_syncslot' is found to be disabled. > 'proc_exit(1)' has been used in order to ensure that the worker is > restarted and GUCs are checked again after restart_time. Downside of > this approach is, if someone has kept "enable_syncslot" as disabled > permanently even on standby, slotsync worker will keep on restarting > and exiting. > > So to overcome the above pain-points, I think a potential approach > will be to start slotsync worker only if 'enable_syncslot' is on and > the system is non-standby. Potential ways (each with some issues) are: >
Correction here: start slotsync worker only if 'enable_syncslot' is on and the system is standby. > 1) Use the current way i.e. register slot-sync worker as bgworker with > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But > this seems more like a hack. This will need extra changes as currently > once 'maybe_start_bgworkers' is attempted by postmaster, it will > attempt again to start any worker only if the worker had abnormal exit > and restart_time !=0. The current postmatser will not attempt to start > worker on any GUC change. > > 2) Another way maybe to treat slotsync worker as special case and > separate out the start/restart of slotsync worker from bgworker, and > follow what we do for autovacuum launcher(StartAutoVacLauncher) to > keep starting it in the postmaster loop(ServerLoop). In this way, we > may be able to add more checks before starting worker. But by opting > this approach, we will have to manage slotsync worker completely by > ourself as it will be no longer be part of existing > bgworker-registration infra. If this seems okay and there are no other > better options, it can be analyzed further in detail. > > 3) Another approach could be, in order to solve issue (a), introduce a > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means > start a bgworker only if consistent state is reached and the system is > standby. And for issue (b), lets retain check of enable_syncslot in > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled > and postmaster will not restart it. But I'm not sure if making it > "PGC_POSTMASTER" is acceptable from the user's perspective. > > thanks > Shveta