On Wed, Dec 6, 2023 at 5:27 AM Matthias van de Meent <boekewurm+postg...@gmail.com> wrote: > 1. When scanning an index in ascending order using scankey a > 1 (so, > one that defines a start point of the scan), we don't need to check > items for consistency with that scankey once we've found the first > value that is consistent with the scankey, as all future values will > also be consistent with the scankey (if we assume no concurrent page > deletions).
BTW, I don't think that page deletion is a concern for these optimizations in the way that it is for the similar idea of "dynamic prefix compression", which works against insertion-type scan keys (used to descend the tree and to do an initial binary search of a leaf page). We already rely on the first call to _bt_readpage (the one that takes place from within _bt_first rather than from _bt_next) passing a page offset number that's exactly at the start of where matches begin -- this is crucial in the case of scans with required equality strategy scan keys (most scans). If we just skipped the _bt_binsrch and passed P_FIRSTDATAKEY(opaque) to _bt_readpage within _bt_first instead, that would break lots of queries. So the _bt_binsrch step within _bt_first isn't just an optimization -- it's crucial. This is nothing new. Recall that _bt_readpage only deals with search-type scan keys, meaning scan keys that use a simple operator (so it uses = operators with the equality strategy, as opposed to using a 3-way ORDER proc/support function 1 that can tell the difference between < and >). In general _bt_readpage doesn't know how to tell the difference between a tuple that's before the start of = matches, and a tuple that's at (or after) the end of any = matches. If it is ever allowed to conflate these two cases, then we'll overlook matching tuples, which is of course wrong (it'll terminate the scan before it even starts). It is up to the caller (really just _bt_first) to never call _bt_readpage in a way that allows this confusion to take place -- which is what makes the _bt_binsrch step crucial. A race condition with page deletion might allow the key space covered by a leaf page to "widen" after we've left its parent, but before we arrive on the leaf page. But the _bt_binsrch step within _bt_first happens *after* we land on and lock that leaf page, in any case. So there is no risk of the scan ever doing anything with concurrently-inserted index tuples. In general we only have to worry about such race conditions when descending the tree -- they're not a problem after the scan has reached the leaf level and established an initial page offset number. (The way that _bt_readpage processes whole pages in one atomic step helps with this sort of thing, too. We can almost pretend that the B-Tree structure is immutable, even though that's obviously not really true at all. We know that we'll always make forward progress through the key space by remembering the next page to visit when processing each page.) My test case broke the required-in-opposite-direction-only optimization by finding a way in which required-in-opposite-direction-only inequalities were not quite the same as required equalities with respect to this business about the precise leaf page offset number that the scan begins at. They make *almost* the same set of guarantees (note in particular that both will be transformed into insertion scan key/3-way ORDER proc scan keys by _bt_first's initial positioning code), but there is at least one special case that applies only to inequalities. I had to play games with weird incomplete opfamilies to actually break the optimization -- that was required to tickle the special case in just the right/wrong way. -- Peter Geoghegan