On Fri, Jan 10, 2025 at 6:13 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> 3. If the apply worker cannot catch up, it could enter to a bad loop;
> the publisher sends huge amount of data -> the apply worker cannot
> catch up -> it needs to wait for a longer time to advance its
> oldest_nonremovable_xid -> more garbage are accumulated and then have
> the apply more slow -> (looping). I'm not sure how to deal with this
> point TBH. We might be able to avoid entering this bad loop once we
> resolve the other two points.
>
I don't think we can avoid accumulating garbage, especially when the
workload on the publisher is high. Consider the case being discussed: on
the publisher, we have 30 clients performing read-write operations, and
there is only one pair of reader (walsender) and writer (apply worker) to
perform all those write operations on the subscriber. It can never match
that speed, and the subscriber side is bound to have lower performance (or
accumulate more bloat) irrespective of its own workload. If there is one
client on the publisher performing operations, we won't see much
degradation, but as the number of clients increases, the performance
degradation (and bloat) will keep increasing. Other scenarios can lead to
the same situation, such as a large table sync, the subscriber node being
down for some time, etc. Basically, any case where the apply side lags
behind the remote node by a large amount.

One idea to prevent the performance degradation or bloat increase is to
invalidate the slot once we notice that the subscriber lags (in terms of
WAL apply) behind the publisher by a certain threshold. Say we have a
max_lag (or max_lag_behind_remote) subscription option (defined in terms
of seconds) which allows us to stop calculating oldest_nonremovable_xid
for that subscription. We can indicate that via some worker-level
parameter (see the rough sketch at the end of this mail). Once all the
subscriptions on a node that have enabled retain_conflict_info have
stopped calculating oldest_nonremovable_xid, we can invalidate the slot.
Users can then check this and need to disable/enable retain_conflict_info
to start retaining the required information again. The other way could be
that instead of invalidating the slot, we directly drop/re-create the slot
or advance its xmin. If we choose to advance the slot automatically
without user intervention, we need to let users know via a LOG message
and/or via information in the view.

I think such a mechanism via the new max_lag option will address your
concern: "It's reasonable behavior for this approach but it might not be a
reasonable outcome for users if they could be affected by such a
performance dip without no way to avoid it.", as it will provide a way to
avoid the performance dip when there is a possibility of such a dip.

I mentioned max_lag as a subscription option instead of a GUC because it
applies only to subscriptions that have enabled retain_conflict_info, but
we can consider making it a GUC if you and others think so, provided the
above proposal sounds reasonable. Also, max_lag could be defined in terms
of LSN as well, but I think time would be easier to configure.

Thoughts?

--
With Regards,
Amit Kapila.
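
PS: To make the worker-level part a bit more concrete, here is a rough,
untested sketch of what the check in the apply worker could look like.
The max_lag option (MySubscription->maxlag here), the stopped_xid_retention
flag, and the last_recv_commit_ts field are invented names purely for
illustration; only the general shape is what I am proposing.

#include "postgres.h"

#include "catalog/pg_subscription.h"
#include "replication/worker_internal.h"
#include "utils/timestamp.h"

/*
 * Stop contributing to the slot's xmin once the apply lag exceeds the
 * subscription's max_lag (in seconds).  Called from the apply worker's
 * maintenance path.
 */
static void
maybe_stop_conflict_info_retention(void)
{
	TimestampTz	last_commit_ts;
	long		secs;
	int			usecs;

	/* Commit timestamp of the last remote transaction we applied (invented field) */
	last_commit_ts = MyLogicalRepWorker->last_recv_commit_ts;

	TimestampDifference(last_commit_ts, GetCurrentTimestamp(), &secs, &usecs);

	if (secs < MySubscription->maxlag)	/* invented option */
		return;

	/*
	 * Mark that this worker no longer maintains oldest_nonremovable_xid, so
	 * the launcher can ignore it when computing the slot's xmin.
	 */
	MyLogicalRepWorker->oldest_nonremovable_xid = InvalidTransactionId;
	MyLogicalRepWorker->stopped_xid_retention = true;	/* invented flag */

	ereport(LOG,
			(errmsg("logical replication worker for subscription \"%s\" stopped retaining conflict information",
					MySubscription->name),
			 errdetail("The apply lag exceeded max_lag (%d seconds).",
					   MySubscription->maxlag)));
}

The launcher-side counterpart would then skip such workers and invalidate
(or advance) the slot once every retain_conflict_info-enabled subscription
has set the flag, as described above.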