Re: Introduce XID age and inactive timeout based replication slot invalidation

Peter Smith Mon, 16 Sep 2024 18:28:14 -0700

Here are a few comments for the patch v46-0001.

======
src/backend/replication/slot.c

1. ReportSlotInvalidation

On Mon, Sep 16, 2024 at 8:01 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 1:11 PM Peter Smith <smithpb2...@gmail.com> wrote:
> > 3. ReportSlotInvalidation
> >
> > I didn't understand why there was a hint for:
> > "You might need to increase \"%s\".", "max_slot_wal_keep_size"
> >
> > Why aren't these similar cases consistent?
>
> It looks misleading and not very useful. What happens if the removed
> WAL (that's needed for the slot) is put back into pg_wal somehow (by
> manually copying from archive or by some tool/script)? Can the slot
> invalidated due to wal_removed start sending WAL to its clients?
>
> > But you don't have an equivalent hint for timeout invalidation:
> > "You might need to increase \"%s\".", "replication_slot_inactive_timeout"
>
> I removed this per review comments upthread.

IIUC the errors are quite similar, so my previous review comment was
mostly about the unexpected inconsistency of why one of them has a
hint and the other one does not. I don't have a strong opinion about
whether they should both *have* or *not have* hints, so long as they
are treated the same.

If you think the current code hint is not useful then maybe we need a
new thread to address that existing issue. For example, maybe it
should be removed or reworded.

~~~

2. InvalidatePossiblyObsoleteSlot:

+ case RS_INVAL_INACTIVE_TIMEOUT:
+
+ if (!SlotInactiveTimeoutCheckAllowed(s))
+ break;
+
+ /*
+ * Check if the slot needs to be invalidated due to
+ * replication_slot_inactive_timeout GUC.
+ */
+ if (TimestampDifferenceExceeds(s->inactive_since, now,
+    replication_slot_inactive_timeout * 1000))

nit - it might be tidier to avoid multiple breaks by just combining
these conditions. See the nitpick attachment.

~~~

3.
  * - RS_INVAL_WAL_LEVEL: is logical
+ * - RS_INVAL_INACTIVE_TIMEOUT: inactive timeout occurs

nit - use comment wording "inactive slot timeout has occurred", to
make it identical to the comment in slot.h

======
src/test/recovery/t/050_invalidate_slots.pl

4.
+# Despite inactive timeout being set, the synced slot won't get invalidated on
+# its own on the standby. So, we must not see invalidation message in server
+# log.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+ok( !$standby1->log_contains(
+ "invalidating obsolete replication slot \"sync_slot1\"",
+ $logstart),
+ 'check that synced slot sync_slot1 has not been invalidated on standby'
+);
+

It seems kind of brittle to check the logs for something that is NOT
there because any change to the message will make this accidentally
pass. Apart from that, it might anyway be more efficient just to check
the pg_replication_slots again to make sure the 'invalidation_reason
remains' still NULL.

======

Please see the attachment which implements some of the nit changes
mentioned above.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 851120e..0076e4b 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1716,15 +1716,12 @@ 
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
                                                invalidation_cause = cause;
                                        break;
                                case RS_INVAL_INACTIVE_TIMEOUT:
-
-                                       if (!SlotInactiveTimeoutCheckAllowed(s))
-                                               break;
-
                                        /*
                                         * Check if the slot needs to be 
invalidated due to
                                         * replication_slot_inactive_timeout 
GUC.
                                         */
-                                       if 
(TimestampDifferenceExceeds(s->inactive_since, now,
+                                       if (SlotInactiveTimeoutCheckAllowed(s) 
&&
+                                               
TimestampDifferenceExceeds(s->inactive_since, now,
                                                                                
                   replication_slot_inactive_timeout * 1000))
                                        {
                                                invalidation_cause = cause;
@@ -1894,7 +1891,7 @@ 
InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
  * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
  *   db; dboid may be InvalidOid for shared relations
  * - RS_INVAL_WAL_LEVEL: is logical
- * - RS_INVAL_INACTIVE_TIMEOUT: inactive timeout occurs
+ * - RS_INVAL_INACTIVE_TIMEOUT: inactive slot timeout has occurred
  *
  * NB - this runs as part of checkpoint, so avoid raising errors if possible.
  */
diff --git a/src/test/recovery/t/050_invalidate_slots.pl 
b/src/test/recovery/t/050_invalidate_slots.pl
index c53b5b3..e2fdd52 100644
--- a/src/test/recovery/t/050_invalidate_slots.pl
+++ b/src/test/recovery/t/050_invalidate_slots.pl
@@ -87,7 +87,7 @@ is( $standby1->safe_psql(
        'logical slot sync_slot1 is synced to standby');
 
 # Give enough time
-sleep($inactive_timeout+1);
+sleep($inactive_timeout + 1);
 
 # Despite inactive timeout being set, the synced slot won't get invalidated on
 # its own on the standby. So, we must not see invalidation message in server

Re: Introduce XID age and inactive timeout based replication slot invalidation

Reply via email to