nbtdedup.c's "single value strategy" is used when we have a leaf page that is full of duplicates of the same value. We infer that we will have many more in the future, and cooperate with the "single value strategy" used within nbtsplitloc.c. Deduplication should create 6 large posting list tuples that will remain on the page after an anticipated page split, plus some remaining non-dedup'd regular tuples that go on the new right page. So nbtdedup.c anticipates how an eventual page split will need to work to keep space utilization high, but not too high (though only in this specific "single value strategy" case).
(Note that the choice of 6 posting list tuples is arbitrary -- it seemed like limiting the size of posting list tuples to 1/3 of a page (the long established maximum) was a bit too aggressive, so the posting list tuple size limit was halved to 1/6 of the page.) Sometimes, an append-only inserter of low cardinality data (just a few distinct values) can leave some "single value" pages with perhaps 8 or 9 tuples -- not the intended 7 (i.e. not the intended 6 tuples plus 1 high key). This isn't really a problem. It can only happen when a plain tuple is "wedged" between two existing max-sized posting list tuples, where we cannot merge the plain tuple with either adjoining/neighboring posting list tuple (since that violates the 1/6 of a page limit on posting list tuple size). This scenario is rare, and in any case the space utilization is almost exactly what it's supposed to be in the end (the BTREE_SINGLEVAL_FILLFACTOR "fill factor" that was added to Postgres 12 is still respected in the presence of deduplication in Postgres 13). This much is okay. However, I noticed that it's also possible to confuse "single value strategy" in nbtdedup.c into thinking that it has already observed 6 "max sized" tuples, when in fact it has not. This can lead to the occasional page that contains perhaps 50 or 70 tuples. This is rare enough that you have to really be paying attention to the structure of the index to notice it -- most of my test case indexes that use "single value strategy" a lot aren't even affected. I'm not okay with this, because it's not how nbtdedup.c is intended to work -- clearly nbtdedup.c gets confused as a consequence of the "plain tuple gets wedged" issue. I'm not okay with this. Attached patch nails down the behavior of single value strategy. This slightly improves the space utilization in a small minority of my test case indexes, though not by enough to matter. For example, the TPC-H idx_customer_nationkey index goes from 1193 pages to 1182 pages. This is an utterly insignificant issue, but there is no reason to allow it. When I think about space utilization, I don't just look at the index overall -- I also expect a certain amount of consistency within related subsets of the index's key space. -- Peter Geoghegan
v1-0001-Fix-nbtdedup.c-single-value-strategy-issue.patch
Description: Binary data