On Thu, Sep 23, 2021 at 10:42 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Thu, Sep 16, 2021 at 7:09 AM Peter Geoghegan <p...@bowt.ie> wrote:
> > Enabling index-only scans is a good enough reason to pursue this
> > project, even on its own.
>
> +1

I was hoping that you might be able to work on opportunistically freezing whole pages for Postgres 15. I think that it would make sense to opportunistically make a page that is about to become all_visible during VACUUM become all_frozen instead. Our goal is to have most pages skip all_visible and go straight to all_frozen. Often the page won't need to be dirtied again, ever. Right now freezing is something that we mostly think about as occurring at the level of individual tuples, which doesn't seem ideal. (I've put a rough sketch of the shape that the per-page decision might take at the end of this mail.)

This seems related to Robert's project, because both projects are connected to the question of how autovacuum scheduling works in general. We will probably need to rethink things like the vacuum_freeze_min_age GUC. (I also think that we might need to reconsider how aggressive/anti-wraparound VACUUMs work, but that's another story.)

Obviously this is a case of performing work eagerly: a form of speculation that tries to lower costs in the aggregate, over time. Heuristics that work well on average seem possible, but even excellent heuristics could be wrong -- in the end we're trying to predict the future, which is inherently impossible to do reliably for all workloads. I think that that will be okay, provided that the cost of being wrong is kept low and *fixed* (the exact meaning of "fixed" will need to be worked out, but the basic idea is that any regression is paid once per page, not once per page per VACUUM or something like that).

Once it's cheap enough to freeze a whole page early (i.e. all tuple headers from all tuples), the implementation can be wrong 95%+ of the time and maybe we'll still win by a lot. That may sound bad, until you realize that it's 95% *per VACUUM* -- the picture is much better once you consider the entire table over time, across many different VACUUM operations, and once you think about FPIs in the WAL stream. We'll also be paying the cost of freezing in smaller and more predictable increments, which can make the whole system more robust. Having many pages all go from all_visible to all_frozen at the same time (just because they crossed some usually-meaningless XID-based threshold) is actually quite risky (this is why I mentioned aggressive VACUUMs in passing).

The hard part is getting the cost way down. lazy_scan_prune() generates xl_heap_freeze_tuple records for each tuple it freezes. These obviously have a lot of redundancy across tuples from the same page in practice, and the WAL overhead is much larger just because they are per-tuple records, not per-page records (there's also a rough sketch of a deduplicated per-page layout at the end of this mail). Getting the cost down is hard because of issues with MultiXacts, freezing xmin without freezing xmax at the same time, etc.

> Logging how vacuum uses and sets VM bits seems a good idea.
> I think that we will end up doubly counting the page as scanned_pages
> and allfrozen_pages due to the newly added latter change. This seems
> wrong to me because we calculate as follows:

I agree that that's buggy. Oops. It was just a prototype that I wrote for my own work. I do think that we should have a patch that has some of this for users, but I am not sure about the details just yet. As it stands this is probably too much information for users; it will take me more time to decide what really does matter to them.

--
Peter Geoghegan
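
Sketch 1 (referenced above): the rough shape of an opportunistic whole-page freezing decision. This is purely hypothetical C, not anything from vacuumlazy.c -- the struct, the function name, and the WAL-budget threshold are all stand-ins, since the heuristic itself is the open question here:

#include <stdbool.h>
#include <stdint.h>

/*
 * Purely hypothetical.  "page_state" stands in for whatever per-page state
 * VACUUM's first pass already tracks; the heuristic is the real question.
 */
typedef struct page_state
{
    bool     all_visible;            /* page would become all_visible */
    bool     all_frozen;             /* every tuple is already frozen */
    int      tuples_needing_freeze;  /* tuples that could be frozen early */
    uint64_t extra_wal_bytes;        /* estimated WAL cost of freezing now */
} page_state;

/*
 * Decide whether to eagerly freeze the whole page, so that it becomes
 * all_frozen instead of merely all_visible.  The budget is a placeholder;
 * picking a good one (and keeping the cost of being wrong fixed) is the
 * hard part.
 */
static bool
should_freeze_whole_page(const page_state *ps, uint64_t wal_budget_bytes)
{
    if (!ps->all_visible || ps->all_frozen)
        return false;            /* nothing to gain on this page */

    if (ps->tuples_needing_freeze == 0)
        return false;            /* nothing left to freeze here */

    /* Freeze eagerly only when the marginal WAL cost is small and bounded */
    return ps->extra_wal_bytes <= wal_budget_bytes;
}

However the details shake out, the point is that the decision is made per page, at the moment the page would otherwise only be marked all_visible.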
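
Sketch 2 (referenced above): the redundancy in per-tuple freeze records, and what a deduplicated per-page layout could look like. Again purely hypothetical -- the real record definitions live in src/include/access/heapam_xlog.h, and every name and field choice below is just for illustration:

#include <stdint.h>

/* Stand-ins for the usual PostgreSQL typedefs, to keep this self-contained */
typedef uint32_t TransactionId;
typedef uint16_t OffsetNumber;

/*
 * Roughly the shape of things today: one entry per frozen tuple, so the
 * xmax/infomask state is repeated for every tuple on the page.
 */
typedef struct per_tuple_freeze_sketch
{
    TransactionId xmax;        /* replacement xmax for this tuple */
    OffsetNumber  offset;      /* the tuple's item offset on the page */
    uint16_t      infomask2;   /* new infomask2 bits for this tuple */
    uint16_t      infomask;    /* new infomask bits for this tuple */
    uint8_t       frzflags;    /* miscellaneous freeze flags */
} per_tuple_freeze_sketch;

/*
 * A per-page alternative: deduplicate the repeated state into a handful of
 * "freeze plans", each followed by the array of offsets it applies to.
 * Most pages would probably need only one or two plans.
 */
typedef struct per_page_freeze_plan_sketch
{
    TransactionId xmax;        /* shared by every tuple using this plan */
    uint16_t      infomask2;
    uint16_t      infomask;
    uint8_t       frzflags;
    uint16_t      ntuples;     /* length of the offsets array that follows */
    /* OffsetNumber offsets[ntuples] follows, in ascending offset order */
} per_page_freeze_plan_sketch;

The second layout amortizes the repeated metadata across the whole page, which also fits naturally with deciding to freeze everything on the page at once.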