An important goal of the work on nbtree that went into PostgreSQL 12 (and to a lesser extent the work that went into 13) was to make sure that index scans deal with "boundary cases" optimally. The simplest way of explaining what this means is through a practical worked example.
Recap, worked example: Consider a query such as "select count(*) from tenk1 where hundred = $1", where tenk1 is one of the Wisconsin benchmark tables left behind by the standard regression tests. This query will count either 0 or 100 rows for every possible $1 value. In practice query execution only ever needs to scan exactly one leaf page when executing this query -- for any possible $1 value. In particular, it doesn't matter if $1 happens to be a "boundary value". By "boundary value" I mean a value that happens to match the leftmost or the rightmost tuple on some leaf page. In other words, a value that happens to match a key from some internal page seen while descending the tree in _bt_search.

This effect is reliable for the tenk1_hundred index that my example query uses (a single column index) because numerous optimizations are in place, which work together. It starts with _bt_search, which will reliably descend to exactly one leaf page -- the only one where we could possibly find a match (thanks to the !pivotsearch optimization). After that, _bt_readpage() will reliably notice that it doesn't have to visit the page to the right at all (occasionally it'll need to check the high key to determine this, using the optimization introduced in Postgres 12 by commit 29b64d1d).

(It's easy to see this using "EXPLAIN (ANALYZE, BUFFERS)", which will reliably break out index page accesses when we're using a bitmap index scan -- which happens to be the case here. You'll see one root page access and one leaf page access for the tenk1_hundred index.)

This is just one example of a fairly general principle: we ought to expect selective index scans to only need to access one leaf page -- at least for any index scan that only needs to access a single "natural grouping" that is cleanly isolated onto a single leaf page. Another example of such a "natural grouping" can be seen in the TPC-C orderline table's primary key.
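To make the forward-scan boundary case concrete, here is a toy Python model of the idea (illustrative only -- the names, page layout, and helper functions here are my own simplification, not actual PostgreSQL code). A separator key K on the internal level means "keys >= K live in the child to K's right", so an equality search for a value that matches a separator still descends to exactly one leaf, and the leaf's high key tells the scan it need not move right:

```python
from bisect import bisect_left, bisect_right

# Toy two-level B-Tree, loosely modeled on nbtree conventions.
# Separator key K means: keys >= K belong to the child to K's right.
leaves = [[0, 4, 7], [10, 10, 15], [20, 25, 30]]
separators = [10, 20]       # internal page keys (leaf 1 starts at 10, etc.)
high_keys = [10, 20, None]  # each leaf's high key; None = rightmost leaf

def descend(v):
    """Analogous to _bt_search with nextkey=false: a scankey equal to a
    separator descends right, because that's where matching tuples live."""
    return bisect_right(separators, v)

def scan_leaf(v):
    """Analogous to _bt_readpage for an equality scan: return the leaf
    visited, the matches, and whether the right sibling must be read too."""
    i = descend(v)
    matches = [k for k in leaves[i] if k == v]
    hk = high_keys[i]
    must_move_right = hk is not None and v >= hk
    return i, matches, must_move_right

# 10 is a "boundary value" (it matches a separator and a leaf's leftmost
# tuple), yet the scan touches exactly one leaf page and stops there:
print(scan_leaf(10))  # (1, [10, 10], False)
```

The high-key check at the end is the moral equivalent of the commit 29b64d1d optimization: the scan can prove, without visiting the right sibling, that no further matches exist.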
In practice we shouldn't ever need to touch more than a single leaf page when we go to read any individual order's entries from the order lines table/PK. There will never be more than 15 order lines per order in practice (10 on average), which guarantees that suffix truncation will avoid splitting any individual order's order lines across two leaf pages (after a page split). Plus we're careful to exploit the index structure (and basic B-Tree index invariants) to maximum effect during index scans... as long as they're forward scans. (This ends my recap of "boundary case" handling.)

Patch: We still fall short when it comes to handling boundary cases optimally during backwards scans. This is at least true for a subset of backwards scans that request "goback=true" processing inside _bt_first. The attached patch improves matters here. Again, the simplest way of explaining what it does is through a practical worked example.

Consider an example where we don't do as well as might be expected right now:

EXPLAIN (ANALYZE, BUFFERS) select * from tenk1 where hundred < 12 order by hundred desc limit 1;

This particular example shows "Buffers: shared hit=4" -- one more buffer hit than is truly necessary. If I tweak the query by replacing 12 with the adjacent value 11 (i.e. the same query but with "hundred < 11"), then I'll see "Buffers: shared hit=3" instead. Similarly, "hundred < 13" will also show "Buffers: shared hit=3". Why should "hundred < 12" need to access an extra leaf page? It shouldn't.

Sure enough, the problematic case shows "Buffers: shared hit=3" with my patch applied, as expected (a buffer for the root page, a leaf page, and a heap page). The patch makes every variant of my query touch the same number of leaf pages/buffers.

The patch teaches _bt_search (actually, _bt_compare) about the "goback=true" case. This allows it to exploit information about which leaf page accesses are truly necessary.
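The "hundred < 12" anomaly can be sketched with another toy Python model (again, my own simplification for illustration -- not PostgreSQL code). A backwards scan for "x < 12 ORDER BY x DESC" positions on the first key >= 12 and then steps back one position. When 12 is a boundary value (i.e. it is also a separator key), the default descent lands on the page whose leftmost key is 12, so stepping back runs off the page's left edge and forces an extra visit to the left sibling. With goback-aware descent, an equal-to-separator scankey descends left instead, landing directly on the correct page:

```python
from bisect import bisect_left, bisect_right

# Toy model of _bt_first's backward-scan positioning (illustrative only).
# 12 is a "boundary value": it is both a separator and leaf 1's leftmost key.
leaves = [[10, 11, 11], [12, 12, 13]]
separators = [12]

def descend(v, goback):
    # Default: a scankey equal to a separator descends right. With
    # goback=true (the renamed "pivotsearch" behavior), it descends
    # left, since a backward scan wants the last key < v.
    return bisect_left(separators, v) if goback else bisect_right(separators, v)

def first_backward(v, goback):
    """Start of 'WHERE x < v ORDER BY x DESC': position on the last key < v.
    Returns (number of leaf pages touched, key we start the scan at)."""
    i = descend(v, goback)
    pos = bisect_left(leaves[i], v) - 1  # step back one position
    leaf_accesses = 1
    if pos < 0:                          # ran off the page's left edge:
        i -= 1                           # must also read the left sibling
        pos = len(leaves[i]) - 1
        leaf_accesses = 2
    return leaf_accesses, leaves[i][pos]

print(first_backward(12, goback=False))  # (2, 11): extra leaf page access
print(first_backward(12, goback=True))   # (1, 11): lands on leaf 0 directly
```

Non-boundary values like 11 or 13 touch one leaf page either way, which is why only the "hundred < 12" variant of the query shows the extra buffer hit.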
The code in question already knows about "nextkey=true", which is a closely related concept. It feels very natural to also teach it about "goback=true" now.

(Actually, I lied. All the patch really does is rename the existing "pivotsearch" logic, so that the symbol name "goback" is used instead -- the insertion scankey code doesn't need to change at all. The real behavioral change takes place in _bt_first, the higher level calling code, which has been taught to set its insertion/initial positioning scankey's pivotsearch/goback field to "true". Before now, this option was used exclusively during VACUUM, for page deletion. It turns out that so-called "pivotsearch" behavior is far more general than currently supposed.)

Thoughts?
--
Peter Geoghegan
v1-0001-Add-nbtree-goback-boundary-case-optimization.patch
Description: Binary data