On Fri, Jan 5, 2024 at 12:57 PM Andres Freund <and...@anarazel.de> wrote:
> > I will be astonished if you can make this work well enough to avoid
> > huge regressions in plausible cases. There are plenty of cases where
> > we do a very thorough job opportunistically removing index tuples.
>
> These days the AM is often involved with that, via
> table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to
> happen before physically removing the already-marked-killed index entries. We
> can't rely on being able to actually prune the heap page at that point, there
> might be other backends pinning it, but often we will be able to. If we were
> to prune below heap_index_delete_tuples(), we wouldn't need to recheck that
> index again during "individual tuple pruning", if the to-be-marked-unused heap
> tuple is one of the tuples passed to heap_index_delete_tuples(). Which
> presumably will be very commonly the case.

I don't understand. Making heap_index_delete_tuples() prune heap pages
in passing such that we can ultimately mark dead heap tuples LP_UNUSED
necessitates high-level coordination -- it has to happen at a level
much higher than heap_index_delete_tuples(). In other words, making it
all work safely requires the same high-level context that makes it
safe for VACUUM to set a stub LP_DEAD line pointer to LP_UNUSED (index
tuples must never be allowed to point to TIDs/heap line pointers that
can be concurrently recycled).

Obviously my idea of "a limited form of transaction rollback" has the
required high-level context available, which is the crucial factor
that allows it to safely reverse all bloat -- even line pointer bloat
(which is traditionally something that only VACUUM can do safely). I
have a hard time imagining a scheme that can do that outside of VACUUM
without directly targeting some special case, such as the case that
I'm calling "transaction rollback". In other words, I have a hard time
imagining how this would ever be practical as part of any truly
opportunistic cleanup process. AFAICT the dependency between indexes
and the heap is just too delicate for such a scheme to ever really be
practical.

> At least for nbtree, we are much more aggressive about marking index entries
> as killed, than about actually removing the index entries. "individual tuple
> pruning" would have to look for killed-but-still-present index entries, not
> just for "live" entries.

These days having index tuples directly marked LP_DEAD is surprisingly
unimportant to heap_index_delete_tuples(). The batching optimization
implemented by _bt_simpledel_pass() tends to be very effective in
practice. We only need to have the right *general* idea about which
heap pages to visit -- which heap pages will yield some number of
deletable index tuples.

--
Peter Geoghegan