On Tue, Jan 14, 2025 at 7:14 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > I don't think we can avoid accumulating garbage, especially when the
> > workload on the publisher is high. Consider the case currently being
> > discussed: on the publisher, we have 30 clients performing read-write
> > operations, and there is only one pair of reader (walsender) and
> > writer (apply worker) to perform all those write operations on the
> > subscriber. A single pair can never match that speed, and the
> > subscriber side is bound to show lower performance (or accumulate
> > more bloat) irrespective of its workload. If there is one client on
> > the publisher performing operations, we won't see much degradation,
> > but as the number of clients increases, the performance degradation
> > (and bloat) will keep increasing.
> >
> > There are other scenarios that can lead to the same situation, such
> > as a large table sync or the subscriber node being down for some
> > time. Basically, any case where the apply side lags behind the
> > remote node by a large amount.
> >
> > One idea to prevent the performance degradation or bloat increase is
> > to invalidate the slot once we notice that the subscriber lags (in
> > terms of WAL apply) behind the publisher by a certain threshold. Say
> > we have a max_lag (or max_lag_behind_remote) subscription option,
> > defined in seconds, which allows us to stop calculating
> > oldest_nonremovable_xid for that subscription. We can indicate that
> > via some worker-level parameter. Once all the subscriptions on a
> > node that have enabled retain_conflict_info have stopped calculating
> > oldest_nonremovable_xid, we can invalidate the slot. Users can then
> > check this and disable/re-enable retain_conflict_info to start
> > retaining the required information again. The other way could be
> > that, instead of invalidating the slot, we directly drop/re-create
> > the slot or advance its xmin. If we choose to advance the slot
> > automatically without user intervention, we need to let users know
> > via a LOG message and/or via information in the view.
> >
> > I think such a mechanism via the new max_lag option will address
> > your concern: "It's reasonable behavior for this approach but it
> > might not be a reasonable outcome for users if they could be
> > affected by such a performance dip with no way to avoid it." It
> > will provide a way to avoid the performance dip only when there is
> > a possibility of such a dip.
> >
> > I mentioned max_lag as a subscription option instead of a GUC
> > because it applies only to subscriptions that have enabled
> > retain_conflict_info, but we can make it a GUC if you and others
> > think so, provided the above proposal sounds reasonable. Also,
> > max_lag could be defined in terms of LSN as well, but I think time
> > would be easier to configure.
> >
> > Thoughts?
>
> I agree that we cannot avoid accumulating dead tuples when the
> workload on the publisher is high, which affects subscriber
> performance. What we need to do is to update the slot's xmin as
> quickly as possible to minimize dead tuple accumulation, at least
> when the subscriber is not much behind. If there is a tradeoff in
> doing so (e.g., against publisher performance), we need to provide a
> way for users to balance it.
>
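To make the max_lag idea quoted above a bit more concrete, here is a
rough standalone sketch of the decision logic I have in mind. All the
names and structures in it (SubWorkerState, stop_nonremovable_xid_calc,
apply_lag_secs, and so on) are illustrative only and do not correspond
to any existing code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the proposed max_lag mechanism; not actual PostgreSQL
 * code.  Each apply worker stops maintaining oldest_nonremovable_xid
 * once its apply lag exceeds max_lag, and the launcher invalidates
 * the slot only after every retain_conflict_info subscription has
 * stopped.
 */
typedef struct SubWorkerState
{
    double  apply_lag_secs;             /* how far apply is behind remote */
    bool    stop_nonremovable_xid_calc; /* the proposed worker-level flag */
} SubWorkerState;

/* Per-subscription check against the hypothetical max_lag option. */
static void
check_max_lag(SubWorkerState *w, double max_lag_secs)
{
    if (w->apply_lag_secs > max_lag_secs)
        w->stop_nonremovable_xid_calc = true;
}

/* Launcher-side check: invalidate only when all workers have stopped. */
static bool
should_invalidate_slot(const SubWorkerState *workers, int n)
{
    for (int i = 0; i < n; i++)
        if (!workers[i].stop_nonremovable_xid_calc)
            return false;
    return n > 0;
}

int
main(void)
{
    /* Two subscriptions: one lagging 120s, one only 5s behind. */
    SubWorkerState workers[2] = {{120.0, false}, {5.0, false}};
    double  max_lag_secs = 60.0;    /* hypothetical subscription option */

    for (int i = 0; i < 2; i++)
        check_max_lag(&workers[i], max_lag_secs);

    /* Simulates the LOG message the proposal says users should get. */
    if (should_invalidate_slot(workers, 2))
        printf("LOG: slot invalidated due to max_lag\n");
    else
        printf("LOG: slot retained; a worker is still within max_lag\n");
    return 0;
}

Running this, the slot is retained because the second subscription is
still within the threshold; only once every subscription with
retain_conflict_info has given up would the launcher invalidate the
slot (or, alternatively, advance its xmin).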
As of now, I can't think of a way to throttle the publisher when the
apply worker lags. Basically, we need some way to throttle the
backends (reduce their speed) when the apply worker lags behind by
more than a threshold margin. Can you think of some way? I thought
that if one notices frequent invalidations of the launcher's slot due
to max_lag, one could rebalance the workload on the publisher.

> The max_lag idea sounds interesting for the case
> where the subscriber is much behind. Probably we can revisit this
> idea as a new feature after completing this feature.
>

Sure, but what will be our answer to users for cases where performance
tanks due to bloat accumulation? The tests show that once the apply
lag becomes large, it is almost impossible for the apply worker to
catch up (or it takes a very long time to do so) and advance the
slot's xmin. Users can disable retain_conflict_info to bring back the
performance and get rid of the bloat, but I thought it would be easier
for them if we had some knob so that they don't need to wait until the
bloat/performance problem actually happens.

--
With Regards,
Amit Kapila.