On Mon, Feb 19, 2024 at 9:59 PM shveta malik <shveta.ma...@gmail.com> wrote: > > On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > Few comments on 0001 > > Thanks for the feedback. > > > ==================== > > 1. I think it is better to error out when the valid GUC or option is > > not set in ensure_valid_slotsync_params() and > > ensure_valid_remote_info() instead of waiting. And we shouldn't start > > the worker in the first place if not all required GUCs are set. This > > will additionally simplify the code a bit. > > Sure, removed 'ensure' functions. Moved the validation checks to the > postmaster before starting the slot sync worker. > > > 2. > > +typedef struct SlotSyncWorkerCtxStruct > > { > > - /* prevents concurrent slot syncs to avoid slot overwrites */ > > + pid_t pid; > > + bool stopSignaled; > > bool syncing; > > + time_t last_start_time; > > slock_t mutex; > > -} SlotSyncCtxStruct; > > +} SlotSyncWorkerCtxStruct; > > > > I think we don't need to change the name of this struct as this can be > > used both by the worker and the backend. We can probably add the > > comment to indicate that all the fields except 'syncing' are used by > > slotsync worker. > > Modified. > > > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > > be changed to SlotSyncWorkerShmemInit(). > > Modified. > > > 4. > > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > > { > > ... > > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > > ... > > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > > ... > > } > > > > Do we need two separate callbacks? Can't we have just one (say > > slotsync_failure_callback) that cleans additional things in case of > > slotsync worker? And, if we need both those callbacks then please add > > some comments for both and why one is before_shmem_exit() and the > > other is on_shmem_exit(). > > I think we can merge these now. Earlier 'on_shmem_exit' was needed to > avoid race-condition between startup and slot sync worker process to > drop 'i' slots on promotion. Now we do not have any such scenario. > But I need some time to analyze it well. Will do it in the next > version. > > > In addition to the above, I have made a few changes in the comments > > and code (cosmetic code changes). Please include those in the next > > version if you find them okay. You need to rename .txt file to remove > > .txt and apply atop v90-0001*. > > Sure, included these. > > Please find the patch001 attached.
I've reviewed the v91 patch. Here are random comments: --- /* * Checks the remote server info. * - * We ensure that the 'primary_slot_name' exists on the remote server and the - * remote server is not a standby node. + * Check whether we are a cascading standby. For non-cascading standbys, it + * also ensures that the 'primary_slot_name' exists on the remote server. */ IIUC what the validate_remote_info() does doesn't not change by this patch, so the previous comment seems to be clearer to me. --- if (remote_in_recovery) + { + /* + * If we are a cascading standby, no need to check further for + * 'primary_slot_name'. + */ ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot synchronize replication slots from a standby server")); + } + else + { + bool primary_slot_valid; - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); - Assert(!isnull); + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); - if (!primary_slot_valid) - ereport(ERROR, - errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("bad configuration for slot synchronization"), - /* translator: second %s is a GUC variable name */ - errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", - PrimarySlotName, "primary_slot_name")); + if (!primary_slot_valid) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + /* translator: second %s is a GUC variable name */ + errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", + PrimarySlotName, "primary_slot_name")); + } I think it's a refactoring rather than changes required by the slotsync worker. We can do that in a separate patch but why do we need this change in the first place? --- + ValidateSlotSyncParams(ERROR); + /* Load the libpq-specific functions */ load_file("libpqwalreceiver", false); - ValidateSlotSyncParams(); + (void) CheckDbnameInConninfo(); Is there any reason why we move where to check the parameters? Some comments not related to the patch but to the existing code: --- It might have already been discussed but is the src/backend/replication/logical the right place for the slocsync.c? If it's independent of logical decoding/replication, is under src/backend/replication could be more appropriate? --- /* Construct query to fetch slots with failover enabled. */ appendStringInfo(&s, "SELECT slot_name, plugin, confirmed_flush_lsn," " restart_lsn, catalog_xmin, two_phase, failover," " database, conflict_reason" " FROM pg_catalog.pg_replication_slots" " WHERE failover and NOT temporary"); /* Execute the query */ res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); pfree(s.data); We don't need 's' as the query is constant. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com