On Thu, 5 Dec 2024 at 06:44, Peter Smith <smithpb2...@gmail.com> wrote:
>
> On Wed, Dec 4, 2024 at 9:27 PM vignesh C <vignes...@gmail.com> wrote:
> > ...
> >
> > Currently, replication slots are invalidated based on the
> > replication_slot_inactive_timeout only during a checkpoint. This
> > means that if checkpoint_timeout is set to a higher value than
> > replication_slot_inactive_timeout, slot invalidation will occur only
> > when the checkpoint is triggered, so identifying the slots to
> > invalidate might be slightly delayed. As an alternative, users can
> > forcefully invalidate inactive slots that have exceeded
> > replication_slot_inactive_timeout by forcing a checkpoint. I was
> > thinking we could suggest this in the documentation.
> >
> > +      <para>
> > +       Slot invalidation due to inactive timeout occurs during checkpoint.
> > +       The duration of slot inactivity is calculated using the slot's
> > +       <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>inactive_since</structfield>
> > +       value.
> > +      </para>
> > +
> >
> > We could invalidate the slots accurately from the checkpointer
> > process by calculating the invalidation time from the slot's
> > inactive_since timestamp and replication_slot_inactive_timeout, and
> > then setting the checkpointer's main wait-latch timeout accordingly
> > to trigger the next checkpoint. Ideally, a different process would
> > handle this task, but there is currently no dedicated daemon capable
> > of identifying and managing slots across streaming replication,
> > logical replication, and other slots used by plugins. Additionally,
> > overloading the checkpointer with this responsibility may not be
> > ideal. As an alternative, we could document this delay in
> > identification and mention that invalidation can be triggered by a
> > forced manual checkpoint.
>
> Hi Vignesh.
>
> I felt that manipulating the checkpoint timing behind the scenes
> without the user's consent might be a bit of an overreach.
Agree

> But there might still be something else we could do:
>
> 1. We can add the documentation note like you suggested ("we could
> document this delay in identification and mention that invalidation
> can be triggered by a forced manual checkpoint").

Yes, that makes sense

> 2. We can also detect such delays in the code. When the invalidation
> occurs (e.g. the code fragment below) we could check whether there was
> some excessive lag between the slot becoming idle and it being
> invalidated. If the lag is too much (whatever "too much" means) we can
> log a hint for the user to increase the checkpoint frequency (or
> whatever else we might advise them to do).
>
> +         /*
> +          * Check if the slot needs to be invalidated due to
> +          * replication_slot_inactive_timeout GUC.
> +          */
> +         if (IsSlotInactiveTimeoutPossible(s) &&
> +             TimestampDifferenceExceeds(s->inactive_since, now,
> +                                        replication_slot_inactive_timeout_ms))
> +         {
> +             invalidation_cause = cause;
> +             inactive_since = s->inactive_since;
>
> pseudo-code:
> if (slot invalidation occurred long after the
>     replication_slot_inactive_timeout GUC elapsed)
> {
>     elog(LOG, "This slot was inactive for a period of %s. Slot timeout
>     invalidation only occurs at a checkpoint, so if you want inactive
>     slots to be invalidated in a more timely manner consider reducing
>     the time between checkpoints or executing a manual checkpoint.
>     (replication_slot_inactive_timeout = %s; checkpoint_timeout = %s,
>     ....)"
> }
>
> +         }

Determining the correct time may be challenging for users, as it
depends on when the slot's inactive_since value is set, as well as on
when checkpoint_timeout elapses and the subsequent checkpoint is
triggered. Even if the user sets it to an appropriate value, there is
still a possibility of delayed identification because of when the
slot's inactive_since happens to be set. Including this information in
the documentation should be sufficient.

Regards,
Vignesh
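
PS: In case it helps, here is a rough, untested sketch of what such a
hint might look like at the point where the fragment above decides to
invalidate the slot. The names are simply reused from that fragment,
and the factor of two is an arbitrary stand-in for whatever we would
define "too much" lag to be; this is only to illustrate the idea, not
part of any posted patch.

    /*
     * Hypothetical sketch: warn when the slot is being invalidated long
     * after replication_slot_inactive_timeout had already elapsed.  The
     * "2 *" slack is an arbitrary placeholder for "too much" lag.
     */
    if (TimestampDifferenceExceeds(s->inactive_since, now,
                                   2 * replication_slot_inactive_timeout_ms))
        ereport(LOG,
                errmsg("replication slot \"%s\" has been inactive since %s, well beyond replication_slot_inactive_timeout",
                       NameStr(s->data.name),
                       timestamptz_to_str(s->inactive_since)),
                errhint("Slot invalidation due to inactive timeout occurs only at checkpoints. Consider reducing checkpoint_timeout or running a manual CHECKPOINT for more timely invalidation."));

That said, choosing a reasonable threshold runs into the same timing
questions as above, which is why I still feel the documentation note
alone should be sufficient.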