On Thu, Aug 3, 2023 at 12:28 AM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Tue, Aug 1, 2023 at 5:01 PM shveta malik <shveta.ma...@gmail.com> wrote:
> >
> > > The work division amongst the sync workers can
> > > be simple, the logical replication launcher builds a shared memory
> > > structure based on number of slots to sync and starts the sync workers
> > > dynamically, and each sync worker picks {dboid, slot name, conninfo}
> > > from the shared memory, syncs it and proceeds with other slots.
> >
> > Do you mean the logical replication launcher builds a shared memory
> > structure based on the number of 'dbs' to sync, as I understood from
> > your initial comment?
>
> Yes. I haven't looked at the 0003 patch posted upthread. However, the
> standby must do the following at a minimum:
>
> - Make GUCs synchronize_slot_names and max_slot_sync_workers of
> PGC_POSTMASTER type needing postmaster restart when changed as they
> affect the number of slot sync workers.

I agree that max_slot_sync_workers should be allowed to change only
during startup, but I strongly feel that synchronize_slot_names should
be runtime modifiable. We should give that flexibility to the user.

> - LR (logical replication) launcher connects to primary to fetch the
> logical slots specified in synchronize_slot_names. This is a one-time
> task.

If synchronize_slot_names = '*', we need to fetch the slots' info at
regular intervals even if the GUC is not runtime modifiable, since new
slots may be created on the primary at any time. For the
runtime-modifiable case, it is obvious that we need to refetch it at
regular intervals.

> - LR launcher prepares a dynamic shared memory (created via
> dsm_create) with some state like locks for IPC and an array of
> {slot_name, dboid_associated_with_slot, is_sync_in_progress} - maximum
> number of elements in the array is the number of slots specified in
> synchronize_slot_names. This is a one-time task.

Yes, we need dynamic shared memory, but it is not a one-time
allocation. If it were a one-time allocation, there would have been no
need for DSM; a plain shared-memory allocation at startup would have
been enough. It is not a one-time allocation in any of the designs: in
a slot-based design, the set of slots may keep varying for the '*'
case, and in a DB-based design, the number of DBs may grow beyond the
initially allocated memory, so we may need reallocation and a relaunch
of workers, and thus the need for DSM.
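To make the reallocation point concrete, here is a minimal sketch of
what such a DSM segment could look like. All the names here
(SlotSyncEntry, SlotSyncSharedState, slot_sync_dsm_create) are
hypothetical and not from any posted patch; only the dsm_create() and
LWLock APIs are the real core ones.

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

/*
 * Hypothetical shared state for the slot sync workers; names are
 * illustrative only.
 */
typedef struct SlotSyncEntry
{
    NameData    slot_name;          /* logical slot to sync */
    Oid         dboid;              /* database the slot belongs to */
    bool        sync_in_progress;   /* already claimed by a worker? */
} SlotSyncEntry;

typedef struct SlotSyncSharedState
{
    LWLock      lock;           /* protects the fields below */
    int         num_slots;      /* currently valid entries */
    int         max_slots;      /* capacity of this segment */
    SlotSyncEntry entries[FLEXIBLE_ARRAY_MEMBER];
} SlotSyncSharedState;

/*
 * Launcher side: size and create the DSM segment for nslots entries.
 * If the slot list later outgrows max_slots (e.g. with
 * synchronize_slot_names = '*'), the launcher has to create a bigger
 * segment and relaunch the workers, which is exactly why plain
 * (fixed-size) shared memory is not enough here.
 */
static dsm_segment *
slot_sync_dsm_create(int nslots)
{
    Size        size;
    dsm_segment *seg;
    SlotSyncSharedState *state;

    size = add_size(offsetof(SlotSyncSharedState, entries),
                    mul_size(sizeof(SlotSyncEntry), nslots));
    seg = dsm_create(size, 0);
    state = (SlotSyncSharedState *) dsm_segment_address(seg);

    LWLockInitialize(&state->lock, LWLockNewTrancheId());
    state->num_slots = 0;
    state->max_slots = nslots;

    return seg;
}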
> - LR launcher decides the *best* number of slot sync workers - (based
> on some perf numbers) it can just launch, say, one worker per 2 or 4
> or 8 etc. slots.
>
> - Each slot sync worker then picks up a slot from the DSM, connects to
> primary using primary conn info, syncs it, and moves to another slot.

The design based on slots, i.e. the launcher dividing the slots among
the available workers, could prove beneficial over DB-based division
for a case where the number of slots per DB varies largely and we
would otherwise end up assigning all the DBs with fewer slots to one
worker while all the heavily loaded DBs go to another. But other than
this, I see a lot of pain points:

1) Since we are going to do slot-based syncing, query construction
will be complex. We will have a query with a long 'where' clause:
WHERE slot_name IN (slot1, slot2, ...). (A rough sketch of such query
construction is at the end of this mail.)

2) The number of pings to the primary will be higher, as we are
querying it per slot instead of per DB. So the information which we
could have fetched collectively in one query (if it were DB-based) is
now split across multiple queries, given that slots belonging to the
same DB may end up divided among different workers.

3) If the number of slots is less than the max number of workers, how
are we going to assign the workers? One slot per worker, or all slots
in one worker? If it is one slot per worker, it will again not be that
efficient, as it will result in more network traffic. This needs more
thought, and the design may have to vary from case to case.

> Not having the capability of on-demand stop/launch of slot sync
> workers makes the above design simple IMO.

We anyway need to relaunch workers when the DSM is reallocated, in
case the DBs (or, say, slots) exceed some initial allocation limit.

thanks
Shveta
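PS: to illustrate pain point 1 above, a minimal sketch of how a
worker's remote query could be built, assuming each worker holds its
assigned slot names in a List of C strings. The helper name
build_slot_fetch_query and the selected columns are illustrative only;
the StringInfo routines and the pg_replication_slots view are the real
ones.

#include "postgres.h"

#include "lib/stringinfo.h"
#include "nodes/pg_list.h"
#include "utils/builtins.h"     /* quote_literal_cstr() */

/*
 * Hypothetical helper: build the remote query for one worker's share
 * of slots.  The IN list grows with the number of slots assigned to
 * the worker, which is the complexity referred to in pain point 1.
 */
static char *
build_slot_fetch_query(List *slot_names)
{
    StringInfoData buf;
    ListCell   *lc;
    bool        first = true;

    initStringInfo(&buf);
    appendStringInfoString(&buf,
                           "SELECT slot_name, database, restart_lsn, "
                           "confirmed_flush_lsn "
                           "FROM pg_catalog.pg_replication_slots "
                           "WHERE slot_name IN (");

    foreach(lc, slot_names)
    {
        if (!first)
            appendStringInfoString(&buf, ", ");
        appendStringInfoString(&buf,
                               quote_literal_cstr((char *) lfirst(lc)));
        first = false;
    }
    appendStringInfoChar(&buf, ')');

    return buf.data;
}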