On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I wonder how much it would help to break up that loop into two loops.
> Make the callback into a batch operation that generates state that
> describes what to do with each and every index tuple on the leaf page.
> The first loop would build a list of TIDs, then you'd call into
> vacuumlazy.c and get it to process the TIDs, and finally the second
> loop would physically delete the TIDs that need to be deleted. This
> would mean that there would be only one call per leaf page per
> btbulkdelete(). This would reduce the number of calls to the callback
> by at least 100x, and maybe more than 1000x.
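The two-loop shape might look roughly like this. This is only a sketch: the
names (vac_decide_batch, btvacuumpage_batched) and the flat sorted-array
representation of the dead-TID set are made up for illustration, and are not
the actual nbtree/vacuumlazy.c interfaces.

```c
/* Hypothetical sketch of the "one batched callback per leaf page" idea.
 * All names here are invented; real code would call back into
 * vacuumlazy.c and finish with a single PageIndexMultiDelete(). */
#include <stdbool.h>
#include <stdint.h>

typedef struct ItemPointer
{
    uint32_t    block;
    uint16_t    offset;
} ItemPointer;

/* Stand-in for the vacuumlazy.c side: decide, in one batched call per
 * leaf page, which of the page's TIDs are dead.  For the sketch, the
 * dead-TID set is just a sorted array, so one merge pass suffices. */
static void
vac_decide_batch(const ItemPointer *tids, int ntids,
                 const ItemPointer *dead, int ndead,
                 bool *deletable)
{
    int         j = 0;

    for (int i = 0; i < ntids; i++)
    {
        while (j < ndead &&
               (dead[j].block < tids[i].block ||
                (dead[j].block == tids[i].block &&
                 dead[j].offset < tids[i].offset)))
            j++;
        deletable[i] = (j < ndead &&
                        dead[j].block == tids[i].block &&
                        dead[j].offset == tids[i].offset);
    }
}

/* Loop 1 gathered the page's TIDs (sorted); then one batched callback;
 * then loop 2 physically deletes.  Returns the number deleted. */
static int
btvacuumpage_batched(const ItemPointer *leaf_tids, int ntids,
                     const ItemPointer *dead, int ndead)
{
    bool        deletable[256];     /* sketch assumes ntids <= 256 */
    int         ndeleted = 0;

    vac_decide_batch(leaf_tids, ntids, dead, ndead, deletable);

    for (int i = 0; i < ntids; i++)
        if (deletable[i])
            ndeleted++;             /* real code would collect offsets and
                                     * delete them from the page in one go */
    return ndeleted;
}
```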
Maybe for something like rtbm.c (which is inspired by Roaring bitmaps), you would really want to use an "intersection" operation for this. The TIDs that we need to physically delete from the leaf page inside btvacuumpage() are the intersection of two bitmaps: our bitmap of all TIDs on the leaf page, and our bitmap of all TIDs that need to be deleted by the ongoing btbulkdelete() call.

Obviously the typical case is that most TIDs in the index do *not* get deleted -- needing to delete more than ~20% of all TIDs in the index will be rare. Ideally it would be very cheap to figure out that a TID does not need to be deleted at all. Something a little like a negative cache (but not a true negative cache). This is a little bit like how hash joins can be made faster by adding a Bloom filter -- most hash probes don't need to join a tuple in the real world, and we can make these hash probes even faster by using a Bloom filter as a negative cache.

If you had the list of TIDs from a leaf page sorted for batch processing, and if you had roaring bitmap style "chunks" with "container" metadata stored in the data structure, you could then use merging/intersection -- that has some of the same advantages. I think that this would be a lot more efficient than having one binary search per TID. Most TIDs from the leaf page can be skipped over very quickly, in large groups. It's very rare for VACUUM to need to delete TIDs from completely random heap table blocks in the real world (some kind of pattern is much more common).

When this merging process finds 1 TID that might really be deletable then it's probably going to find much more than 1 -- better to make that cache miss take care of all of the TIDs together. Also seems like the CPU could do some clever prefetching with this approach -- it could prefetch TIDs where the initial chunk metadata is insufficient to eliminate them early -- these are the groups of TIDs that will have many TIDs that we actually need to delete.
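To make the merging idea concrete, here is a rough sketch of chunk-level
skipping. It is not rtbm.c's actual layout: for illustration I'm assuming one
"container" per heap block (a block number plus an offset bitmap, offsets < 64
only), and all names are invented. The point is just that a whole run of
leaf-page TIDs on a block with no container gets skipped with zero per-TID
binary searches.

```c
/* Sketch of merging a sorted list of leaf-page TIDs against a
 * roaring-style delete set with per-block "containers".  Hypothetical
 * structures and names, for illustration only. */
#include <stdbool.h>
#include <stdint.h>

typedef struct ItemPointer
{
    uint32_t    block;
    uint16_t    offset;     /* sketch restricts offsets to < 64 */
} ItemPointer;

typedef struct Container
{
    uint32_t    block;      /* heap block this container covers */
    uint64_t    offbits;    /* bit i set => offset i is dead */
} Container;

/* Intersect sorted leaf-page TIDs with the delete set by merging.
 * Fills deletable[] and returns the number of deletable TIDs. */
static int
intersect_leaf(const ItemPointer *tids, int ntids,
               const Container *cont, int ncont,
               bool *deletable)
{
    int         ndeleted = 0;
    int         c = 0;
    int         i = 0;

    while (i < ntids)
    {
        uint32_t    blk = tids[i].block;

        /* advance container cursor to this TID's heap block */
        while (c < ncont && cont[c].block < blk)
            c++;

        if (c == ncont || cont[c].block > blk)
        {
            /* No container for this block: skip the whole run of TIDs
             * on this block in one go -- the common, cheap case */
            while (i < ntids && tids[i].block == blk)
                deletable[i++] = false;
            continue;
        }

        /* Container match: this is the "cache miss" case, so resolve
         * every TID on the block together against the offset bitmap */
        while (i < ntids && tids[i].block == blk)
        {
            deletable[i] = (cont[c].offbits >> tids[i].offset) & 1;
            if (deletable[i])
                ndeleted++;
            i++;
        }
    }
    return ndeleted;
}
```

With this shape the per-TID work in the common case is one block-number
comparison, and the container metadata consulted up front is exactly what
could drive prefetching of the offset bitmaps that will actually be needed.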
ISTM that improving temporal locality through batching could matter a lot here.

--
Peter Geoghegan