Hi,

Thanks for looking into this.

On Mon, Aug 26, 2024 at 4:35 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> Few comments on 0001:
> 1.
> @@ -651,6 +651,13 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid
>
> + /*
> + * Skip the sync if the local slot is already invalidated. We do this
> + * beforehand to avoid slot acquire and release.
> + */
>
> I was wondering why you have added this new check as part of this
> patch. If you see the following comments in the related code, you will
> know why we haven't done this previously.

Removed. Can deal with optimization separately.

> 2.
> + */
> + InvalidateObsoleteReplicationSlots(RS_INVAL_INACTIVE_TIMEOUT,
> +    0, InvalidOid, InvalidTransactionId);
>
> Why do we want to call this for shutdown case (when is_shutdown is
> true)? I understand trying to invalidate slots during regular
> checkpoint but not sure if we need it at the time of shutdown.

Changed it to invalidate only for non-shutdown checkpoints.
inactive_timeout invalidation isn't critical for shutdown unlike
wal_removed which can help shutdown by freeing up some disk space.

> The
> other point is can we try to check the performance impact with 100s of
> slots as mentioned in the code comments?

I first checked how much does the wal_removed invalidation check add to the
checkpoint (see 2nd and 3rd column). I then checked how much
inactive_timeout invalidation check adds to the checkpoint (see 4th
column), it is not more than wal_remove invalidation check. I then checked
how much the wal_removed invalidation check adds for replication slots that
have already been invalidated due to inactive_timeout (see 5th column),
looks like not much.

| # of slots | HEAD (no invalidation) ms | HEAD (wal_removed) ms | PATCHED
(inactive_timeout) ms | PATCHED (inactive_timeout+wal_removed) ms |
|------------|----------------------------|-----------------------|-------------------------------|------------------------------------------|
| 100        | 18.591                     | 370.586               | 359.299
                      | 373.882                                  |
| 1000       | 15.722                     | 4834.901              |
5081.751                      | 5072.128                                 |
| 10000      | 19.261                     | 59801.062             |
61270.406                     | 60270.099                                |

Having said that, I'm okay to implement the optimization specified.
Thoughts?

+ /*
+ * NB: We will make another pass over replication slots for
+ * invalidation checks to keep the code simple. Testing shows that
+ * there is no noticeable overhead (when compared with wal_removed
+ * invalidation) even if we were to do inactive_timeout invalidation
+ * of thousands of replication slots here. If it is ever proven that
+ * this assumption is wrong, we will have to perform the invalidation
+ * checks in the above for loop with the following changes:
+ *
+ * - Acquire ControlLock lock once before the loop.
+ *
+ * - Call InvalidatePossiblyObsoleteSlot for each slot.
+ *
+ * - Handle the cases in which ControlLock gets released just like
+ * InvalidateObsoleteReplicationSlots does.
+ *
+ * - Avoid saving slot info to disk two times for each invalidated
+ * slot.

Please see the attached v43 patches addressing the above review comments.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment: v43-0002-Add-XID-age-based-replication-slot-invalidation.patch
Description: Binary data

Attachment: v43-0001-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data

Reply via email to