On Mon, Jul 18, 2022 at 9:10 PM John Naylor <john.nay...@enterprisedb.com> wrote:
> On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <and...@anarazel.de> wrote:
> > FWIW, I think the best path forward would be to do something similar to the
> > simplehash.h approach, so it can be customized to the specific user.
>
> I figured that would come up at some point. It may be worth doing in the
> future, but I think it's way too much to ask for the first use case.
I have a prototype patch that creates a read-only snapshot of the
visibility map, and has vacuumlazy.c work off of that when determining
which pages to skip. The patch also gets rid of the
SKIP_PAGES_THRESHOLD stuff.

This is very effective with TPC-C, principally because it really cuts
down on the number of scanned_pages that are scanned only because
their VM bit was concurrently unset by DML. The window for this is
very large when the table is large (and so naturally takes a long time
to scan), resulting in many more "dead but not yet removable" tuples
being encountered than necessary. That in turn puts bogus information
in the FSM -- information about the space that VACUUM could free from
each page, which is often highly misleading.

There are remaining questions about how to do this properly. Right now
I'm just copying pages from the VM into local memory, right after
OldestXmin is first acquired -- we "lock in" a snapshot of the VM at
the earliest opportunity, which is what lazy_scan_skip() actually
works off now. Some consideration needs to be given to the resource
management aspects of this -- it needs to use memory sensibly, which
the current prototype patch doesn't do at all.

I'm probably going to seriously pursue this as a project soon, and
will probably need some kind of data structure for the local copy. The
raw VM pages are quite space inefficient for this purpose, considering
that we only need an immutable snapshot of the VM.

I wonder if it makes sense to use this VM snapshot as part of this
project. With this design it will be possible to know the exact set of
heap pages that will become scanned_pages before scanning even one
page (perhaps with caveats about low-memory conditions). It could also
be very effective as a way of speeding up TID lookups in the
reasonably common case where most scanned_pages don't have any LP_DEAD
items -- just look the TID's heap page up in our local/materialized
copy of the VM first. But even when LP_DEAD items are spread fairly
evenly, it could still give us reliable information about the
distribution of LP_DEAD items very early on.

Maybe the two data structures could even be combined in some way? You
can use more memory for the local copy of the VM if you know that you
won't need the memory for dead_items. It's kinda the same problem, in
a way.

--
Peter Geoghegan
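PS: To make the "lock in a snapshot of the VM" idea a bit more
concrete, here is a minimal sketch of the shape I have in mind. All of
the vm_snapshot_* names are invented for illustration (this is not
what the prototype patch does); only visibilitymap_get_status() and
the VISIBILITYMAP_* flags are existing APIs:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/*
 * Immutable local copy of the VM, stored as two bitmaps with one bit
 * per heap page.  Built right after OldestXmin is acquired, so that
 * lazy_scan_skip() can work off a consistent view instead of
 * rereading the live VM.
 */
typedef struct vmsnapshot
{
	BlockNumber	rel_pages;		/* heap size when the snapshot was taken */
	uint8	   *all_visible;	/* 1 bit per heap page */
	uint8	   *all_frozen;		/* 1 bit per heap page */
} vmsnapshot;

static vmsnapshot *
vm_snapshot_acquire(Relation rel, BlockNumber rel_pages)
{
	vmsnapshot *snap = palloc0(sizeof(vmsnapshot));
	Buffer		vmbuffer = InvalidBuffer;

	snap->rel_pages = rel_pages;
	snap->all_visible = palloc0((rel_pages + 7) / 8);
	snap->all_frozen = palloc0((rel_pages + 7) / 8);

	for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
	{
		uint8		status = visibilitymap_get_status(rel, blkno, &vmbuffer);

		if (status & VISIBILITYMAP_ALL_VISIBLE)
			snap->all_visible[blkno / 8] |= 1 << (blkno % 8);
		if (status & VISIBILITYMAP_ALL_FROZEN)
			snap->all_frozen[blkno / 8] |= 1 << (blkno % 8);
	}

	if (BufferIsValid(vmbuffer))
		ReleaseBuffer(vmbuffer);

	return snap;
}

/*
 * The TID-lookup fast path mentioned above: a page that was
 * all-visible in the snapshot was skipped by the heap scan (in the
 * common non-aggressive case), so it can't have contributed any
 * LP_DEAD items -- no need to search dead_items for TIDs that point
 * into it.
 */
static inline bool
vm_snapshot_all_visible(vmsnapshot *snap, BlockNumber blkno)
{
	Assert(blkno < snap->rel_pages);
	return (snap->all_visible[blkno / 8] & (1 << (blkno % 8))) != 0;
}

This particular layout uses the same two bits per heap page as the VM
itself, just without the page headers and without pinning VM buffers
during the scan. A denser or compressed representation may well make
more sense, which is part of why combining it with the dead_items data
structure seems worth thinking about.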