Hi, here are a few more review comments for the patch v47-0002 (plus my review comments of v45-0002 [1] are yet to be addressed)
====== 1. General For consistency and readability, try to use variables of the same names whenever they have the same purpose, even when they declared are in different functions. A few like this were already mentioned in the previous review but there are more I keep noticing. For example, 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. ====== src/backend/replication/logical/slotsync.c 2. +/* + * + * Validates the primary server info. + * + * Using the specified primary server connection, it verifies whether the master + * is a standby itself and returns true in that case to convey the caller that + * we are on the cascading standby. + * But if master is the primary server, it goes ahead and validates + * primary_slot_name. It emits error if the physical slot in primary_slot_name + * does not exist on the primary server. + */ +static bool +validate_primary_info(WalReceiverConn *wrconn) 2a. Extra line top of that comment? ~ 2b. IMO it is too tricky to have a function called "validate_xxx", when actually you gave that return value some special unintuitive meaning other than just validation. IMO it is always better for the returned value to properly match the function name so the expectations are very obvious. So, In this case, I think a better function signature would be like this: SUGGESTION static void validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) or static void validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) ~~~ 3. + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch recovery and primary_slot_name \"%s\" info from the " + "primary: %s", PrimarySlotName, res->err))); I'm not sure that including "recovery and" in the error message is meaningful to the user, is it? ~~~ 4. slotsync_reread_config +/* + * Re-read the config file. + * + * If any of the slot sync GUCs have changed, re-validate them. The + * worker will exit if the check fails. + * + * Returns TRUE if primary_slot_name is changed, let the caller re-verify it. + */ +static bool +slotsync_reread_config(WalReceiverConn *wrconn) Hm. This is another function where the return value has been butchered to have a special meaning unrelated the the function name. IMO it makes the code unnecessarily confusing. IMO a better function signature here would be: static void slotsync_reread_config(WalReceiverConn *wrconn, bool *primary_slot_name_changed) ~~~ 5. ProcessSlotSyncInterrupts +/* + * Interrupt handler for main loop of slot sync worker. + */ +static bool +ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool check_cascading_standby) +{ There is no function comment describing the meaning of the return value. But actually, IMO this is an example of how conflating the meanings of validation VERSUS are_we_cascading_standby in the lower-down function has propagated up to become a big muddle. The code + if (primary_slot_changed || check_cascading_standby) + return validate_primary_info(wrconn); seems unnecessarily hard to understand because, false -- doesn't mean invalid true -- doesn't mean valid Please, consider changing this signature also so the functions return what you would intuitively expect them to return without surprisingly different meanings. SUGGESTION static void ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool check_cascading_standby, bool *am_cascading_standby) ~~~ 6. ReplSlotSyncWorkerMain + int rc; + long naptime = WORKER_DEFAULT_NAPTIME_MS; + TimestampTz now; + bool slot_updated; + + /* + * The transaction env is needed by walrcv_exec() in both the slot + * sync and primary info validation flow. + */ + StartTransactionCommand(); + + if (!am_cascading_standby) + { + slot_updated = synchronize_slots(wrconn); + + /* + * If any of the slots get updated in this sync-cycle, use default + * naptime and update 'last_update_time'. But if no activity is + * observed in this sync-cycle, then increase naptime provided + * inactivity time reaches threshold. + */ + now = GetCurrentTimestamp(); + if (slot_updated) + last_update_time = now; + else if (TimestampDifferenceExceeds(last_update_time, + now, WORKER_INACTIVITY_THRESHOLD_MS)) + naptime = WORKER_INACTIVITY_NAPTIME_MS; + } + else + naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */ + + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + naptime, + WAIT_EVENT_REPL_SLOTSYNC_MAIN); + + if (rc & WL_LATCH_SET) + ResetLatch(MyLatch); + + am_cascading_standby = + ProcessSlotSyncInterrupts(wrconn, am_cascading_standby); + + CommitTransactionCommand(); IMO it is more natural to avoid negative conditions, so just reverse these. Also, some comment is needed to explain why the longer naptime is needed in this special case. SUGGESTION if (am_cascading_standby) { /* comment the reason .... */ naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */ } else { /* Normal standby */ ... } ====== [1] review of v45-0002. https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia