Greetings, * Bossart, Nathan (bossa...@amazon.com) wrote: > On 11/27/20, 10:58 AM, "Stephen Frost" <sfr...@snowman.net> wrote: > > If you'd like to show that I'm wrong, and it's entirely possible that I > > am, then retry the above with actual load on the system, and also > > actually look at how much outstanding WAL you end up with given the > > different scenarios which has to be replayed during crash recovery. > > I did a little experiment to show the behavior I'm referring to. I
I'm rather confused why you didn't use your patch to show the actual behavior that you'll get as a result of this change..? That seems like what would be meaningful here. I appreciate that the analysis you did might correlate but I don't really get why we'd try to use a proxy for this. > used these settings: > > checkpoint_completion_target = 0.9 > checkpoint_timeout = 30s > max_wal_size = 20GB > WAL segment size is 64MB That's an exceedingly short and very uncommon checkpoint timeout in my experience.. If anything, folks increase checkpoint timeout from the default (in order to reduce WAL traffic and because they have a replica they can flip to in the event of a crash, avoiding having to go through WAL replay on recovery) and so I'm not really sure that it's a sensible thing to look at? Even so though... > Recovery took 14 seconds and replayed ~3.2 GB of WAL. [ ... ] > Recovery took 7 seconds and replayed ~1.5 GB of WAL. This is showing more-or-less what I expected: there's still a large amount of outstanding WAL, even if you use a very low and unusual timeout and attempt to time it perfectly. A question that is still not clear is what happens when you actually do an immediate checkpoint- there would likely still be some outstanding WAL even in that case but I'd expect it to be a whole lot less, which is why that comment exists in the documentation in the first place. > Granted, I used a rather aggressive checkpoint_timeout, but I think > this demonstrates that waiting for a non-immediate checkpoint to > complete can lower the amount of WAL needed for recovery, even though > it might not lower it as much as waiting for an immediate checkpoint > would. The difference here feels like order of magnitudes to me, between an immediate checkpoint and a non-immediate one, vs. a much smaller difference as you've shown here (though, still, kill'ing the postmaster isn't exactly the same as what your patch would be doing, so I don't really like using this particular analysis to answer this question...). If the use-case here was just that you wanted to add more options to the CHECKPOINT command because we have them internally and maybe they'd be useful to expose then these things probably wouldn't matter so much, but to imply that this is really going to cut down on the amount of WAL replay required a lot isn't really coming through even with these results. Thanks, Stephen
signature.asc
Description: PGP signature