Re: Synchronizing slots from primary to standby

Peter Smith Wed, 13 Dec 2023 17:30:54 -0800

Hi, here are a few more review comments for the patch v47-0002

(plus my review comments of v45-0002 [1] are yet to be addressed)


======
1. General

For consistency and readability, try to use variables of the same
names whenever they have the same purpose, even when they declared are
in different functions. A few like this were already mentioned in the
previous review but there are more I keep noticing.

For example,
'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller.


======
src/backend/replication/logical/slotsync.c

2.
+/*
+ *
+ * Validates the primary server info.
+ *
+ * Using the specified primary server connection, it verifies whether
the master
+ * is a standby itself and returns true in that case to convey the caller that
+ * we are on the cascading standby.
+ * But if master is the primary server, it goes ahead and validates
+ * primary_slot_name. It emits error if the physical slot in primary_slot_name
+ * does not exist on the primary server.
+ */
+static bool
+validate_primary_info(WalReceiverConn *wrconn)

2a.
Extra line top of that comment?

~

2b.
IMO it is too tricky to have a function called "validate_xxx", when
actually you gave that return value some special unintuitive meaning
other than just validation. IMO it is always better for the returned
value to properly match the function name so the expectations are very
obvious. So, In this case, I think a better function signature would
be like this:

SUGGESTION

static void
validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby)

or

static void
validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)

~~~

3.
+ if (res->status != WALRCV_OK_TUPLES)
+ ereport(ERROR,
+ (errmsg("could not fetch recovery and primary_slot_name \"%s\" info from the "
+ "primary: %s", PrimarySlotName, res->err)));

I'm not sure that including "recovery and" in the error message is
meaningful to the user, is it?

~~~

4. slotsync_reread_config

+/*
+ * Re-read the config file.
+ *
+ * If any of the slot sync GUCs have changed, re-validate them. The
+ * worker will exit if the check fails.
+ *
+ * Returns TRUE if primary_slot_name is changed, let the caller re-verify it.
+ */
+static bool
+slotsync_reread_config(WalReceiverConn *wrconn)

Hm. This is another function where the return value has been butchered
to have a special meaning unrelated the the function name. IMO it
makes the code unnecessarily confusing.

IMO a better function signature here would be:

static void
slotsync_reread_config(WalReceiverConn *wrconn, bool *primary_slot_name_changed)

~~~

5. ProcessSlotSyncInterrupts

+/*
+ * Interrupt handler for main loop of slot sync worker.
+ */
+static bool
+ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool
check_cascading_standby)
+{

There is no function comment describing the meaning of the return
value. But actually, IMO this is an example of how conflating the
meanings of validation VERSUS are_we_cascading_standby in the
lower-down function has propagated up to become a big muddle.

The code
+ if (primary_slot_changed || check_cascading_standby)
+ return validate_primary_info(wrconn);

seems unnecessarily hard to understand because,
false -- doesn't mean invalid
true -- doesn't mean valid

Please, consider changing this signature also so the functions return
what you would intuitively expect them to return without surprisingly
different meanings.

SUGGESTION

static void
ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool
check_cascading_standby, bool *am_cascading_standby)

~~~

6. ReplSlotSyncWorkerMain

+ int rc;
+ long naptime = WORKER_DEFAULT_NAPTIME_MS;
+ TimestampTz now;
+ bool slot_updated;
+
+ /*
+ * The transaction env is needed by walrcv_exec() in both the slot
+ * sync and primary info validation flow.
+ */
+ StartTransactionCommand();
+
+ if (!am_cascading_standby)
+ {
+ slot_updated = synchronize_slots(wrconn);
+
+ /*
+ * If any of the slots get updated in this sync-cycle, use default
+ * naptime and update 'last_update_time'. But if no activity is
+ * observed in this sync-cycle, then increase naptime provided
+ * inactivity time reaches threshold.
+ */
+ now = GetCurrentTimestamp();
+ if (slot_updated)
+ last_update_time = now;
+ else if (TimestampDifferenceExceeds(last_update_time,
+ now, WORKER_INACTIVITY_THRESHOLD_MS))
+ naptime = WORKER_INACTIVITY_NAPTIME_MS;
+ }
+ else
+ naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */
+
+ rc = WaitLatch(MyLatch,
+    WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+    naptime,
+    WAIT_EVENT_REPL_SLOTSYNC_MAIN);
+
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+
+ am_cascading_standby =
+ ProcessSlotSyncInterrupts(wrconn, am_cascading_standby);
+
+ CommitTransactionCommand();

IMO it is more natural to avoid negative conditions, so just reverse
these. Also, some comment is needed to explain why the longer naptime
is needed in this special case.


SUGGESTION
if (am_cascading_standby)
{
  /* comment the reason .... */
  naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */
}
else
{
  /* Normal standby */
  ...
}

======
[1] review of v45-0002.
https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

Reply via email to