Hi,

This reply is largely based on a quick IM conversation between Peter and me.
On 2020-03-04 17:13:33 -0800, Peter Geoghegan wrote:
> Both plans are very similar, really. The number of heap accesses and
> B-Tree index page accesses is exactly the same in each case.

Note that bitmap heap scans, currently, have the huge advantage of being able to efficiently prefetch heap data. That can be a *huge* performance boon (I've seen differences of several orders of magnitude on SSDs).

There are other ways in which bitmap heap scans benefit, too. For heap, the "single tid" index->heap lookup path locks the heap page once for each tid, whereas a bitmap heap scan only does that once per page - the additional lock cycles obviously can have a noticeable performance impact. Not having interspersed I/O between index and heap can be beneficial too.

I thought we had optimized the non-lossy bitmap path for heap (i.e. heapam_scan_bitmap_next_block()) to perform visibility checks more efficiently than single tid fetches (i.e. heapam_index_fetch_tuple()). But both use heap_hot_search_buffer(), even though the number of page locks differs.

I'm a bit surprised that neither heap_hot_search_buffer() nor the "lossy" path in heapam_scan_bitmap_next_block() takes advantage of the page's all-visible status? I don't really see a good reason for that; a rough sketch of what I mean is below. The HTSV calls do show up noticeably in profiles, in my experience.

While your recent btree work ensures that we get the heap tids for an equality lookup in heap order (right?), I don't think we currently have the planner infrastructure to know that that's the case (since other index types don't guarantee it), or to take it into account for planning?

> But the index scan plan has one non-obvious advantage, that might
> matter a lot in the real world: it can apply the kill_prior_tuple
> optimization. (It is never possible to use the kill_prior_tuple
> optimization during a bitmap index scan.)

Indeed. I've seen this cause very significant issues a couple of times. Basically, whenever the handful of very common queries that touched most of the data switched to bitmap heap scans, the indexes would explode in size. Due to the index sizes involved there was no way normal vacuum could clean up dead tuples quickly enough to prevent bloat, but with kill_prior_tuple that wasn't a problem.

I have wondered whether we could "just" add some support for kill_prior_tuple to the bitmap index scan infrastructure. Obviously that'd require some way of calling "back" into the index code once (several?) tuples on a page are found to be dead during a bitmap heap scan. That would require keeping track of additional metadata for each tuple in the tid bitmap, which is obviously not free and would have to be conditional.

I don't really like the kill_prior_tuple interface much. But I don't immediately see how to do better without increasing the overhead.
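To make that metadata handwaving a bit more concrete, here's roughly what I'm imagining. This is entirely hypothetical and untested - the struct, the callback, and their names are invented here just for illustration, nothing like it exists today:

typedef struct TBMIndexRef
{
	BlockNumber	index_blkno;	/* index leaf page the entry came from */
	OffsetNumber index_offnum;	/* item offset of the entry on that page */
} TBMIndexRef;

/*
 * Hypothetical additional index AM callback (name made up): once the
 * bitmap heap scan has found that the heap tuples behind some bitmap
 * entries are dead, it could hand those entries back to the index AM to
 * be marked LP_DEAD - the moral equivalent of what kill_prior_tuple does
 * via the killedItems mechanism today.
 */
typedef void (*amkillitems_function) (Relation indexRelation,
									  TBMIndexRef *deadItems,
									  int ndeadItems);

The per-tuple TBMIndexRef is what makes this not free - the tid bitmap would have to carry roughly another 6 bytes per tuple (and couldn't do so for lossy pages at all), so it'd clearly have to be conditional.

Coming back to the all-visible point from further up: what I have in mind for the lossy path in heapam_scan_bitmap_next_block() is roughly the following. Again just an uncompiled sketch - the vmbuffer would have to be kept in the scan state (like index-only scans do), and this glosses over the predicate locking / serializable-conflict handling the real loop does:

	if (IsMVCCSnapshot(snapshot) &&
		VM_ALL_VISIBLE(scan->rs_rd, tbmres->blockno, &vmbuffer))
	{
		/* every LP_NORMAL item on an all-visible page is visible */
		for (offnum = FirstOffsetNumber; offnum <= maxoff;
			 offnum = OffsetNumberNext(offnum))
		{
			if (ItemIdIsNormal(PageGetItemId(dp, offnum)))
				hscan->rs_vistuples[ntup++] = offnum;
		}
	}
	else
	{
		/* ... existing loop doing HeapTupleSatisfiesVisibility() per item ... */
	}

The non-lossy path could at least skip the snapshot checks inside heap_hot_search_buffer() the same way, even though it would still have to follow HOT chains to find the right member tuple.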
> It makes sense that the planner determines that a bitmap index scan is
> faster -- or it used to make sense. Before commit dd299df8, which made
> heap TID a tiebreaker nbtree index column, we might find ourselves
> accessing the same heap page multiple times, pinning it a second or a
> third time within the executor (it depended on very unstable
> implementation details in the nbtree code). These days we should
> *reliably* access the same number of heap pages (and index pages) with
> either plan. (There are a couple of caveats that I'm glossing over for
> now, like pg_upgrade'd indexes.)

Leaving the locking difference pointed out above aside, there still is a significant difference in how many times we indirectly call into the index AM, and how much setup work has to be done, though? There's at least one index_getnext_tid() call for each result tuple, each of which indirectly has to call btgettuple(). And each btgettuple() call has to perform checks (do array keys need to be advanced, has _bt_first() been called?). Whereas btgetbitmap() can amortize that work across all tuples.

And that's without considering that, to my eye, btgetbitmap() could be significantly optimized by adding multiple tuples to the bitmap at once, rather than doing so one-by-one.

Greetings,

Andres Freund
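PS: To make that last point a bit more concrete, the change I have in mind for btgetbitmap() would look roughly like the fragment below - uncompiled, and glossing over the restructuring needed to drive the scan page-at-a-time (via _bt_readpage()) instead of tuple-at-a-time through _bt_next():

	/*
	 * After _bt_readpage() has filled so->currPos for the current leaf
	 * page, hand all of that page's heap tids to the bitmap in one call,
	 * instead of calling tbm_add_tuples() once per tuple.
	 */
	BTScanPos	pos = &so->currPos;
	int			nitems = pos->lastItem - pos->firstItem + 1;
	ItemPointerData *tids = palloc(nitems * sizeof(ItemPointerData));

	for (int i = 0; i < nitems; i++)
		tids[i] = pos->items[pos->firstItem + i].heapTid;

	tbm_add_tuples(tbm, tids, nitems, false);
	ntids += nitems;
	pfree(tids);

Besides getting rid of a lot of per-tuple call overhead, that also lets tbm_add_tuples() look up the PagetableEntry only once for a run of tids on the same heap block, which it already does when handed more than one tid at a time.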