On Wed, Mar 9, 2022 at 11:26 AM wangw.f...@fujitsu.com <wangw.f...@fujitsu.com> wrote: > > On Tue, Mar 8, 2022 at 3:52 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > I've looked at the patch and have a question: > Thanks for your review and comments. > > > +void > > +SendKeepaliveIfNecessary(LogicalDecodingContext *ctx, bool skipped) { > > + static int skipped_changes_count = 0; > > + > > + /* > > + * skipped_changes_count is reset when processing changes that do > > not > > + * need to be skipped. > > + */ > > + if (!skipped) > > + { > > + skipped_changes_count = 0; > > + return; > > + } > > + > > + /* > > + * After continuously skipping SKIPPED_CHANGES_THRESHOLD > > changes, try to send a > > + * keepalive message. > > + */ > > + #define SKIPPED_CHANGES_THRESHOLD 10000 > > + > > + if (++skipped_changes_count >= SKIPPED_CHANGES_THRESHOLD) > > + { > > + /* Try to send a keepalive message. */ > > + OutputPluginUpdateProgress(ctx, true); > > + > > + /* After trying to send a keepalive message, reset the > > flag. */ > > + skipped_changes_count = 0; > > + } > > +} > > > > Since we send a keepalive after continuously skipping 10000 changes, the > > originally reported issue can still occur if skipping 10000 changes took > > more than > > the timeout and the walsender didn't send any change while that, is that > > right? > Yes, theoretically so. > But after testing, I think this value should be conservative enough not to > reproduce > this bug.
But it really depends on the workload, the server condition, and the timeout value, right? The logical decoding might involve disk I/O much to spill/load intermediate data and the system might be under the high-load condition. Why don't we check both the count and the time? That is, I think we can send a keep-alive either if we skipped 10000 changes or if we didn't sent anything for wal_sender_timeout / 2. Also, the patch changes the current behavior of wal senders; with the patch, we send keep-alive messages even when wal_sender_timeout = 0. But I'm not sure it's a good idea. The subscriber's wal_receiver_timeout might be lower than wal_sender_timeout. Instead, I think it's better to periodically check replies and send a reply to the keep-alive message sent from the subscriber if necessary, for example, every 10000 skipped changes. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/