The single largest benefit of the v12 nbtree enhancement was its adaptability to indexes where a portion of the key space contains many duplicates. Successive page splits choose split points in a way that keeps duplicates together on their own pages, and eventually pack pages full of duplicates.
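To illustrate the general idea in miniature, here is a toy model in C (this is not the actual nbtsplitloc.c logic, which weighs many more factors): starting from a naive 50:50 split point, walk to the nearest boundary between distinct values, so that a run of duplicates ends up entirely on one page.

#include <stdio.h>

static int
choose_split(const int *keys, int nkeys)
{
    int split = nkeys / 2;  /* naive 50:50 starting point */
    int left = split;
    int right = split;

    /* Walk outwards to the nearest boundary between distinct values */
    while (left > 1 && keys[left - 1] == keys[left])
        left--;
    while (right < nkeys - 1 && keys[right - 1] == keys[right])
        right++;

    /* Prefer whichever duplicate boundary is closer to the middle */
    return (split - left <= right - split) ? left : right;
}

int
main(void)
{
    int keys[] = {1, 2, 5, 5, 5, 5, 5, 5, 9, 12};
    int nkeys = (int) (sizeof(keys) / sizeof(keys[0]));

    /* Tuples [0, split) go to the left page, [split, nkeys) to the
     * right; here the run of 5s stays together on the right page. */
    printf("split before offset %d\n", choose_split(keys, nkeys));
    return 0;
}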
I have thought up a specific case where the logic can be fooled into consistently doing the wrong thing, leading to very poor space utilization:

drop table if exists descnums;
create table descnums(nums int4);
create index bloat_desc on descnums (nums desc nulls first);
-- Fill first leaf page (leaf root page) with NULL values, to the point
-- where it almost splits:
insert into descnums select null from generate_series(0,400);
-- Insert integers, which will be treated as descending insertions within
-- index:
insert into descnums select i from generate_series(0,10000) i;
-- Observe if we had 50:50 page splits here:
create extension if not exists pgstattuple;
select avg_leaf_density, leaf_fragmentation from pgstatindex('bloat_desc');

The final output looks like this:

 avg_leaf_density | leaf_fragmentation
------------------+--------------------
             1.83 |              99.88
(1 row)

Although the case is contrived, it is clearly not okay that this is possible -- avg_leaf_density should be about 50 here, which is what you'll see on v11. You'll also see an avg_leaf_density of at least 50 if you vary any of the details. For example, if you remove "nulls first" then you'll get an avg_leaf_density of ~50. Or, if you make the index ASC, then avg_leaf_density is almost exactly 90 for the usual reason (the NULLs won't "get in the way" of consistently getting rightmost splits that way).

Note that I've deliberately arranged for the page splits to be as ineffective as possible by almost filling a leaf page with NULLs, leaving only a tiny gap for all future non-NULL integer insertions. This is another case where a bimodal distribution causes trouble when combined with auto-incrementing insertions -- it is slightly similar to the TPC-C issue that the v12 work fixed, IMV.

You could say that the real root of the problem here is one of two things, depending on your perspective:

1. Arguably, nbtsplitloc.c is already doing the right thing here, and the root of the issue is that _bt_truncate() lacks any way of generating a new high key that is "mid way between" the NULL value in the lastleft tuple and the integer in the firstright tuple during the first split. If _bt_truncate() created a "mid point value" of around INT_MAX/2 for the new high key during the first split, then everything would work out -- we wouldn't keep splitting the same leftmost page again and again. The behavior would stabilize in the same way as it does in the ASC + NULLS LAST case, without hurting any other case that already works well. (I sketch this idea in code below.) This isn't an academic point; we might actually need to do that in order to be able to pack the leaf page 90% full with DESC insertions, which ought to be a future goal for nbtsplitloc.c. But that's clearly not in scope for v12.

2. The other way you could look at it (which is likely to be the basis of my fix for v12) is that nbtsplitloc.c has been fooled into treating page splits as "many duplicates" splits, when in fact there are not that many duplicates involved -- there just appear to be many duplicates because they're so unevenly distributed on the page. It would be fine for it to be wrong if there were some way that successive page splits could course correct (see point 1), but that isn't possible here because of the skew -- we get stuck with the same lousy choice of split point again and again. (There also wouldn't be a problem if the integer values were random, since we'd have just one or two uneven splits at the start.)

I've already written a rough patch that fixes the issue by taking this second view of the problem.
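To make point 1 a little more concrete before moving on, here is a hypothetical standalone sketch in C. None of this exists today: midpoint_separator() is made up for illustration, real separator keys are whole index tuples rather than bare int4 values, and _bt_truncate() has nothing like this capability:

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static int32_t
midpoint_separator(bool lastleft_isnull, int32_t lastleft, int32_t firstright)
{
    /*
     * In a DESC NULLS FIRST int4 index, NULL sorts before every integer,
     * and larger integers sort before smaller ones.  Treat NULL as being
     * "just above" INT32_MAX when deriving a separator value that lands
     * mid way between the lastleft and firstright tuples in key space.
     */
    int64_t hi = lastleft_isnull ? (int64_t) INT32_MAX : (int64_t) lastleft;
    int64_t lo = (int64_t) firstright;

    return (int32_t) ((hi + lo) / 2);
}

int
main(void)
{
    /* First split in the test case: lastleft is NULL, firstright ~10000 */
    printf("separator: %d\n", midpoint_separator(true, 0, 10000));
    return 0;
}

With a NULL lastleft tuple and a firstright value of ~10000, this yields roughly INT_MAX/2. Every future integer insertion would sort after such a separator in the DESC index, and so would descend to the right of the page holding the NULLs instead of landing on it again and again.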
My rough patch makes nbtsplitloc.c more skeptical about finishing with the "many duplicates" strategy, avoiding the problem -- it can just fall back on a 50:50 page split when it looks like this is happening (the related "single value" strategy must already do something similar in _bt_strategy()). Currently, the patch simply checks whether the new item's offset number falls immediately to the right of the split point indicated by the "many duplicates" strategy -- within ~10 offset positions -- since that strongly suggests that there aren't that many duplicates after all. I may make the check more careful still, for example by performing additional comparisons on the page to make sure that there really are very few distinct values on the whole page.

My draft fix doesn't cause any regressions in any of my test cases -- it barely affects the splits chosen for my real-world test data and TPC test data. As far as I know, I already have a comprehensive fix. I will need to think about it much more carefully before proceeding, though.

Thoughts?

--
Peter Geoghegan
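PS: For what it's worth, here is the shape of that check as a minimal standalone sketch (again hypothetical -- the typedef, the function name, and the threshold of 10 all stand in for the real definitions in my patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint16_t OffsetNumber;  /* stand-in for the real nbtree type */

#define MAX_MANY_DUPS_DISTANCE 10  /* hypothetical threshold */

/*
 * Does the incoming tuple land within ~10 offset positions to the right
 * of the split point chosen by the "many duplicates" strategy?  If so,
 * the "duplicates" are probably an artifact of skew, and the caller
 * should fall back on a 50:50 page split instead.
 */
static bool
many_dups_split_suspect(OffsetNumber splitoff, OffsetNumber newitemoff)
{
    return newitemoff >= splitoff &&
           newitemoff - splitoff <= MAX_MANY_DUPS_DISTANCE;
}

int
main(void)
{
    printf("%d\n", many_dups_split_suspect(100, 103));  /* prints 1 */
    printf("%d\n", many_dups_split_suspect(100, 250));  /* prints 0 */
    return 0;
}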