On Tue, Jan 14, 2025 at 7:14 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > I don't think we can avoid accumulating garbage, especially when the
> > workload on the publisher is high. Consider the case currently being
> > discussed: on the publisher, we have 30 clients performing read-write
> > operations, and there is only one pair of reader (walsender) and
> > writer (apply worker) to perform all those write operations on the
> > subscriber. A single pair can never match that speed, and the
> > subscriber side is bound to show lower performance (or accumulate
> > more bloat) irrespective of its workload. If there is one client on
> > the publisher performing operations, we won't see much degradation,
> > but as the number of clients increases, the performance degradation
> > (and bloat) will keep increasing.
> >
> > There are other scenarios that can lead to the same situation, such
> > as a large table sync or the subscriber node being down for some
> > time. Basically, any case where the apply side lags behind the
> > remote node by a large amount.
> >
> > One idea to prevent the performance degradation or bloat increase is
> > to invalidate the slot once we notice that the subscriber lags (in
> > terms of WAL apply) behind the publisher by a certain threshold. Say
> > we have a max_lag (or max_lag_behind_remote) subscription option,
> > defined in seconds, which allows us to stop calculating
> > oldest_nonremovable_xid for that subscription. We can indicate that
> > via some worker-level parameter. Once all the subscriptions on a
> > node that have enabled retain_conflict_info have stopped calculating
> > oldest_nonremovable_xid, we can invalidate the slot. Users can then
> > check this and disable/re-enable retain_conflict_info to start
> > retaining the required information again. The other way could be
> > that, instead of invalidating the slot, we directly drop/re-create
> > the slot or advance its xmin. If we choose to advance the slot
> > automatically without user intervention, we need to let users know
> > via a LOG message and/or via information in the view.
> >
> > I think such a mechanism via the new max_lag option will address
> > your concern: "It's reasonable behavior for this approach but it
> > might not be a reasonable outcome for users if they could be
> > affected by such a performance dip with no way to avoid it." It
> > will provide a way to avoid the performance dip only when there is
> > a possibility of such a dip.
> >
> > I mentioned max_lag as a subscription option instead of a GUC
> > because it applies only to subscriptions that have enabled
> > retain_conflict_info, but we can make it a GUC if you and others
> > think so, provided the above proposal sounds reasonable. Also,
> > max_lag could be defined in terms of LSN as well, but I think time
> > would be easier to configure.
> >
> > Thoughts?
>
> I agree that we cannot avoid accumulating dead tuples when the
> workload on the publisher is high, which affects subscriber
> performance. What we need to do is to update the slot's xmin as
> quickly as possible to minimize dead tuple accumulation, at least
> when the subscriber is not much behind. If there is a tradeoff in
> doing so (e.g., against publisher performance), we need to provide a
> way for users to balance it.
>
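To make the max_lag idea quoted above a bit more concrete, here is a
rough standalone sketch of the decision logic I have in mind. All the
names and structures in it (SubWorkerState, stop_nonremovable_xid_calc,
apply_lag_secs, and so on) are illustrative only and do not correspond
to any existing code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the proposed max_lag mechanism; not actual PostgreSQL
 * code.  Each apply worker stops maintaining oldest_nonremovable_xid
 * once its apply lag exceeds max_lag, and the launcher invalidates
 * the slot only after every retain_conflict_info subscription has
 * stopped.
 */
typedef struct SubWorkerState
{
    double  apply_lag_secs;             /* how far apply is behind remote */
    bool    stop_nonremovable_xid_calc; /* the proposed worker-level flag */
} SubWorkerState;

/* Per-subscription check against the hypothetical max_lag option. */
static void
check_max_lag(SubWorkerState *w, double max_lag_secs)
{
    if (w->apply_lag_secs > max_lag_secs)
        w->stop_nonremovable_xid_calc = true;
}

/* Launcher-side check: invalidate only when all workers have stopped. */
static bool
should_invalidate_slot(const SubWorkerState *workers, int n)
{
    for (int i = 0; i < n; i++)
        if (!workers[i].stop_nonremovable_xid_calc)
            return false;
    return n > 0;
}

int
main(void)
{
    /* Two subscriptions: one lagging 120s, one only 5s behind. */
    SubWorkerState workers[2] = {{120.0, false}, {5.0, false}};
    double  max_lag_secs = 60.0;    /* hypothetical subscription option */

    for (int i = 0; i < 2; i++)
        check_max_lag(&workers[i], max_lag_secs);

    /* Simulates the LOG message the proposal says users should get. */
    if (should_invalidate_slot(workers, 2))
        printf("LOG: slot invalidated due to max_lag\n");
    else
        printf("LOG: slot retained; a worker is still within max_lag\n");
    return 0;
}

Running this, the slot is retained because the second subscription is
still within the threshold; only once every subscription with
retain_conflict_info has given up would the launcher invalidate the
slot (or, alternatively, advance its xmin).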
As of now, I can't think of a way to throttle the publisher when the
apply worker lags. Basically, we need some way to throttle the
backends (reduce their speed) when the apply worker lags behind by
more than a threshold margin. Can you think of some way? I thought
that if one notices frequent invalidations of the launcher's slot due
to max_lag, one could rebalance the workload on the publisher.

> The max_lag idea sounds interesting for the case
> where the subscriber is much behind. Probably we can revisit this
> idea as a new feature after completing this feature.
>

Sure, but what will be our answer to users for cases where performance
tanks due to bloat accumulation? The tests show that once the apply
lag becomes large, it is almost impossible for the apply worker to
catch up (or it takes a very long time to do so) and advance the
slot's xmin. Users can disable retain_conflict_info to bring back the
performance and get rid of the bloat, but I thought it would be easier
for them if we had some knob so that they don't need to wait until the
bloat/performance problem actually happens.

--
With Regards,
Amit Kapila.