Suppose that I create the following index on the tenk1 table from the regression tests:
create index on tenk1 (two, four, hundred, thousand, tenthous);

Now the following query will be able to use index quals for each column that appears in my composite index:

select * from tenk1 where two = 1 and four = 3 and hundred = 91 and thousand = 891 and tenthous = 1891;

The query returns one row, and touches 3 buffers/pages (according to EXPLAIN (ANALYZE, BUFFERS)). The overheads here make perfect sense: there's one root page access, one leaf page access, and a single heap page access. Clearly the nbtree initial positioning code is able to descend to the exact leaf page (and page offset) where the first possible match could be found. Pretty standard stuff.

But if I rewrite this query to use an inequality, the picture changes. If I replace "four = 3" with "four > 2", I get a query that is very similar to the original (and that returns the same single row):

select * from tenk1 where two = 1 and four > 2 and hundred = 91 and thousand = 891 and tenthous = 1891;

This time our query touches 16 buffers/pages. That's a total of 15 index pages accessed, of which 14 are leaf pages. We needlessly plow through an extra 13 leaf pages before finally arriving at the first leaf page that might *actually* have a real match for us. We can and should find a way for the second query to descend to the same leaf page directly, so that its physical access patterns match those that we saw with the first query.

Only the first query can use an insertion scan key with all five attributes filled in to find its initial scan position. The second query uses an insertion scan key with values set for the first two index columns (on two and four) only. EXPLAIN offers no hint that this is what happens -- the "Index Cond:" shown for each query is practically identical. It seems to me that Markus Winand had a very good point when he complained that we don't expose this difference directly (e.g., by identifying which columns appear in "Access predicates" and which columns are merely "Index filter predicates") [1]. That would make these kinds of issues a lot more obvious.

The nbtree code is already capable of tricks that are close enough to what I'm thinking of here. Currently, nbtree's initial positioning code will set BTScanInsertData.nextkey=false for the first query (where BTScanInsertData.keysz=5), and BTScanInsertData.nextkey=true for the second query (where BTScanInsertData.keysz=2 right now). So the second query I came up with does at least manage to locate the leaf page where the "four = 3" tuples begin, even today -- its "four > 2" inequality is at least "handled efficiently". The inefficiencies come from how nbtree handles the remaining three index columns (hundred, thousand, and tenthous) when building an insertion scan key for our initial descent. nbtree treats the inequality as making it unsafe to include values for those remaining attributes, which is the real source of the extra leaf page scans (though of course the later attributes are still usable as search-type scan keys).

But it's *not* unsafe to position ourselves on the right leaf page from the start. Not really. All that it would take to fix the problem is per-attribute BTScanInsertData.nextkey values. There is no reason why "nextkey" semantics should only work for the last attribute in the insertion scan key. Under this scheme, _bt_first() would be taught to set up the insertion scan key with (say) nextkey=true for the "four > 2" attribute, and nextkey=false for the other four attributes (since nextkey=false is what we use today whenever >= or = is used).
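To make the size of that miss concrete, here's a toy model of the two initial descents (a standalone C sketch, not nbtree code -- all names here are invented). It relies on the fact that tenk1's two, four, hundred, thousand, and tenthous columns behave like unique1 % 2, % 4, % 100, % 1000, and % 10000, and it applies the comparison rules described above: only the key's keysz attributes participate, and nextkey=true positions the scan past every tuple whose truncated prefix equals the key:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NROWS 10000
#define NATTS 5

typedef struct
{
    int attr[NATTS];
} ToyIndexTuple;

/* lexicographic order, mirroring the composite index's sort order */
static int
tuple_cmp(const void *a, const void *b)
{
    const ToyIndexTuple *ta = a;
    const ToyIndexTuple *tb = b;

    for (int i = 0; i < NATTS; i++)
    {
        if (ta->attr[i] != tb->attr[i])
            return ta->attr[i] < tb->attr[i] ? -1 : 1;
    }
    return 0;
}

/*
 * Return the tuple offset where the scan would start.  Only the first
 * keysz attributes are compared.  With nextkey=false we land on the
 * first tuple >= key; with nextkey=true we land on the first tuple
 * strictly after all tuples whose prefix equals the truncated key.
 */
static int
descend(const ToyIndexTuple *tuples, int keysz, const int *key, bool nextkey)
{
    int lo = 0;
    int hi = NROWS;

    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        int c = 0;

        for (int i = 0; i < keysz && c == 0; i++)
            c = (tuples[mid].attr[i] > key[i]) - (tuples[mid].attr[i] < key[i]);
        if (c < 0 || (c == 0 && nextkey))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

int
main(void)
{
    ToyIndexTuple *tuples = malloc(NROWS * sizeof(ToyIndexTuple));
    const int mods[NATTS] = {2, 4, 100, 1000, 10000};
    const int eqkey[NATTS] = {1, 3, 91, 891, 1891}; /* first query */
    const int gtkey[NATTS] = {1, 2};    /* second query: two = 1, four > 2 */

    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NATTS; j++)
            tuples[i].attr[j] = i % mods[j];
    qsort(tuples, NROWS, sizeof(ToyIndexTuple), tuple_cmp);

    /* first query: all five attributes, nextkey=false */
    printf("equality-only key starts at offset %d\n",
           descend(tuples, 5, eqkey, false));
    /* second query: truncated to two attributes, nextkey=true */
    printf("truncated key starts at offset %d\n",
           descend(tuples, 2, gtkey, true));
    free(tuples);
    return 0;
}

The first descent lands at offset 9781 (the sole match), while the second lands at offset 7500 -- the start of the "two = 1, four = 3" range -- 2281 tuples earlier. Those ~2,300 tuples are what the 13 needlessly-read leaf pages are full of.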
It would probably also make sense to generalize this approach to handle (say) a third query with a "four < 2" inequality that otherwise matched the first two queries. So we wouldn't literally use multiple "nextkey" fields to do this. The most general approach seems to require that we teach insertion scan key routines like _bt_compare() about "goback" semantics, which must also work at the attribute granularity. So we'd probably replace both "nextkey" and "goback" with something new and more general.

I already wrote a patch (still in the CF queue) that teaches nbtree insertion scan keys about "goback" semantics [2] (whose use would still be limited to backwards scans), so that we avoid needlessly accessing extra pages in so-called boundary cases (which seems like a much less important problem than the one I've highlighted here). That existing patch already removed the code in _bt_first that handled "stepping back" once we're on the leaf level. ISTM that the right place to do this kind of thing is in routines like _bt_search, _bt_binsrch, and _bt_compare -- not in _bt_first. The code around _bt_compare seems like it would benefit from having more of this sort of context. Having the full picture matters both when searching internal pages and when searching leaf pages.

Thoughts? Was this issue discussed at some point in the past?

[1] https://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates
[2] https://www.postgresql.org/message-id/flat/CAH2-Wz=xpzm8hzalpq278vms420mvshfgs9wi5tjfkhcapz...@mail.gmail.com
--
Peter Geoghegan