On Thu, Sep 30, 2021 at 5:56 PM Kyotaro Horiguchi <horikyota....@gmail.com> wrote: > > After the patch applied, that keepalive is sent only when the loop is > actually going to sleep some time. In case the next WAL doesn't come > for KEEPALIVE_TIMEOUT milliseconds, it sends a keepalive. There's a > dubious behavior when sleeptime <= KEEPALIVE_TIMEOUT that it sends a > keepalive immediately. It was (as far as I recall) intentional in > order to make the code simpler. However, on second thought, we will > have the next chance to send keepalive in that case, and intermittent > frequent keepalives can happen with that behavior. So I came to think > that we can omit keepalives at all that case. > > (I myself haven't see the keepalive flood..) >
I tried your updated patch (avoid_keepalive_flood_at_bleeding_edge_of_wal.patch, rebased) and also manually applied your previous keepalive-counting code (count_keepalives2.diff.txt), adapted to the code updates. I tested both the problem originally reported (which used pg_recvlogical) and similarly using pub/sub of the pgbench_history table, and in both cases I found that your patch very significantly reduced the keepalives, so the keepalive flood is no longer seen. I am still a little unsure about the impact on pg_recvlogical --endpos functionality, which is detected by the regression test failure. I did try to update pg_recvlogical, to not rely on a keepalive for --endpos, but so far haven't been successful in doing that. If the test is altered/removed then I think that the documentation for pg_recvlogical --endpos will need updating in some way. Regards, Greg Nancarrow Fujitsu Australia