On Wed, Jan 18, 2023 at 12:44 PM Robert Haas <robertmh...@gmail.com> wrote:
> I don't know enough about the specifics of how this works to have an
> intelligent opinion about how likely these particular ideas are to
> work out. However, I think it's risky to look at estimates and try to
> infer whether they are reliable. It's too easy to be wrong. What we
> really want to do is anchor our estimates to some data source that we
> know we can trust absolutely. If you trust possibly-bad data less, it
> screws up your estimates more slowly, but it still screws them up.

Some of what I'm proposing arguably amounts to deliberately adding a
bias. But that's not an unreasonable thing in itself; I think of it as
related to the bias-variance tradeoff, a concept that comes up a lot
in machine learning and statistical inference. We can afford to be
quite imprecise at times, especially if we choose a bias that we know
has much less potential to do us harm -- some mistakes hurt much more
than others. We cannot afford to ever be dramatically wrong, though --
especially in the direction of vacuuming less often.

Besides, there is something that we *can* place a relatively high
degree of trust in that will still be in the loop here: VACUUM itself.
If VACUUM runs, it'll call pgstat_report_vacuum(), which will set the
record straight in the event of overestimating dead tuples. To some
degree the problem of overestimating dead tuples is self-limiting.

> If Andres is correct that what really matters is the number of pages
> we're going to have to dirty, we could abandon counting dead tuples
> altogether and just count not-all-visible pages in the VM map.

That's what matters most from a cost point of view, IMV. So it's a big
part of the overall picture, but not everything. It tells us
relatively little about the benefits, except perhaps when most pages
are all-visible.

--
Peter Geoghegan
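
PS: Here is a toy model of the self-limiting feedback loop described
above (standalone C, not PostgreSQL code; the 25% bias, the threshold,
and the per-round workload are all made-up numbers). The point is just
that a deliberate overestimate cannot compound across cycles, because
each VACUUM resets the estimate to what it actually observed -- so the
bias can only make us vacuum sooner, never later:

#include <stdio.h>

#define VACUUM_THRESHOLD    1000    /* hypothetical autovacuum trigger */

int
main(void)
{
    double      true_dead = 0;  /* dead tuples actually accumulated */
    double      est_dead = 0;   /* deliberately pessimistic estimate */

    for (int round = 1; round <= 20; round++)
    {
        double      new_dead = 100;     /* workload kills 100 tuples/round */

        true_dead += new_dead;
        est_dead += new_dead * 1.25;    /* bias: assume 25% more than seen */

        if (est_dead >= VACUUM_THRESHOLD)
        {
            printf("round %2d: VACUUM (est %.0f, true %.0f, overshoot %.0f%%)\n",
                   round, est_dead, true_dead,
                   100.0 * (est_dead - true_dead) / true_dead);

            /*
             * VACUUM sets the record straight, the way
             * pgstat_report_vacuum() resets the real stats to what the
             * scan actually saw. The accumulated overestimate is wiped
             * out here rather than carried into the next cycle.
             */
            true_dead = 0;
            est_dead = 0;
        }
    }
    return 0;
}

Note that the trigger fires a little early each cycle (25% overshoot,
matching the per-cycle bias exactly) and the error never grows beyond
that.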
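
And a sketch of the VM-based trigger Robert describes, in terms of the
relpages/relallvisible counters that pg_class already caches. The
function name and the threshold parameter are inventions for
illustration; a real implementation might prefer to ask the VM directly
(e.g. via visibilitymap_count()) for an up-to-date count, since the
cached counters can be stale:

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int BlockNumber;   /* stand-in for the real typedef */

static bool
vm_based_vacuum_needed(BlockNumber relpages, BlockNumber relallvisible,
                       double threshold)
{
    BlockNumber not_all_visible;

    if (relpages == 0)
        return false;

    /* The cached counters are maintained independently; clamp staleness */
    if (relallvisible > relpages)
        relallvisible = relpages;

    /* The pages VACUUM would actually have to visit (and likely dirty) */
    not_all_visible = relpages - relallvisible;

    /*
     * Pure cost-side measure: a page with one dead tuple and a page
     * with fifty both count as a single not-all-visible page here.
     */
    return ((double) not_all_visible / relpages) >= threshold;
}

int
main(void)
{
    /* 1000-page table, 900 all-visible: 10% of pages would need a visit */
    printf("%s\n", vm_based_vacuum_needed(1000, 900, 0.05) ?
           "vacuum" : "skip");
    return 0;
}

Which illustrates the caveat above: this tells you roughly what a
VACUUM would cost, but very little about what it would buy you.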