On Thu, 5 Dec 2024 at 06:44, Peter Smith <smithpb2...@gmail.com> wrote:
>
> On Wed, Dec 4, 2024 at 9:27 PM vignesh C <vignes...@gmail.com> wrote:
> > ...
> >
> > Currently, replication slots are invalidated based on the
> > replication_slot_inactive_timeout only during a checkpoint. This
> > means that if checkpoint_timeout is set to a higher value than
> > replication_slot_inactive_timeout, slot invalidation will occur only
> > when the checkpoint is triggered, so identifying the slots to
> > invalidate might be slightly delayed. As an alternative, users can
> > forcefully invalidate inactive slots that have exceeded
> > replication_slot_inactive_timeout by forcing a checkpoint. I was
> > thinking we could suggest this in the documentation.
> >
> > +      <para>
> > +       Slot invalidation due to inactive timeout occurs during checkpoint.
> > +       The duration of slot inactivity is calculated using the slot's
> > +       <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
> > +       value.
> > +      </para>
> > +
> >
> > We could invalidate the slots accurately from the checkpointer
> > process by calculating the invalidation time from the slot's
> > inactive_since timestamp and replication_slot_inactive_timeout, and
> > then setting the checkpointer's main wait-latch timeout accordingly
> > to trigger the next checkpoint. Ideally, a different process would
> > handle this task, but there is currently no dedicated daemon capable
> > of identifying and managing slots across streaming replication,
> > logical replication, and other slots used by plugins. Additionally,
> > overloading the checkpointer with this responsibility may not be
> > ideal. As an alternative, we could document this delay in
> > identification and mention that invalidation can be triggered by a
> > forced manual checkpoint.
>
> Hi Vignesh.
>
> I felt that manipulating the checkpoint timing behind the scenes
> without the user's consent might be a bit of an overreach.
Agree

> But there might still be something else we could do:
>
> 1. We can add the documentation note like you suggested ("we could
> document this delay in identification and mention that invalidation
> can be triggered by a forced manual checkpoint").

Yes, that makes sense

> 2. We can also detect such delays in the code. When the invalidation
> occurs (e.g. the code fragment below) we could check whether there was
> some excessive lag between the slot becoming idle and it being
> invalidated. If the lag is too much (whatever "too much" means) we can
> log a hint for the user to increase the checkpoint frequency (or
> whatever else we might advise them to do).
>
> +         /*
> +          * Check if the slot needs to be invalidated due to
> +          * replication_slot_inactive_timeout GUC.
> +          */
> +         if (IsSlotInactiveTimeoutPossible(s) &&
> +             TimestampDifferenceExceeds(s->inactive_since, now,
> +                                        replication_slot_inactive_timeout_ms))
> +         {
> +             invalidation_cause = cause;
> +             inactive_since = s->inactive_since;
>
> pseudo-code:
> if (slot invalidation occurred long after the
>     replication_slot_inactive_timeout GUC elapsed)
> {
>     elog(LOG, "This slot was inactive for a period of %s. Slot timeout
>     invalidation only occurs at a checkpoint, so if you want inactive
>     slots to be invalidated in a more timely manner consider reducing
>     the time between checkpoints or executing a manual checkpoint.
>     (replication_slot_inactive_timeout = %s; checkpoint_timeout = %s,
>     ....)"
> }
>
> +         }

Determining the correct time may be challenging for users, as it
depends on when the slot's inactive_since value is set, as well as on
when checkpoint_timeout elapses and the subsequent checkpoint is
triggered. Even if the user sets it to an appropriate value, there is
still a possibility of delayed identification because of when the
slot's inactive_since happens to be set. Including this information in
the documentation should be sufficient.

Regards,
Vignesh
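
PS: In case it helps, here is a rough, untested sketch of what such a
hint might look like at the point where the fragment above decides to
invalidate the slot. The names are simply reused from that fragment,
and the factor of two is an arbitrary stand-in for whatever we would
define "too much" lag to be; this is only to illustrate the idea, not
part of any posted patch.

    /*
     * Hypothetical sketch: warn when the slot is being invalidated long
     * after replication_slot_inactive_timeout had already elapsed.  The
     * "2 *" slack is an arbitrary placeholder for "too much" lag.
     */
    if (TimestampDifferenceExceeds(s->inactive_since, now,
                                   2 * replication_slot_inactive_timeout_ms))
        ereport(LOG,
                errmsg("replication slot \"%s\" has been inactive since %s, well beyond replication_slot_inactive_timeout",
                       NameStr(s->data.name),
                       timestamptz_to_str(s->inactive_since)),
                errhint("Slot invalidation due to inactive timeout occurs only at checkpoints. Consider reducing checkpoint_timeout or running a manual CHECKPOINT for more timely invalidation."));

That said, choosing a reasonable threshold runs into the same timing
questions as above, which is why I still feel the documentation note
alone should be sufficient.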