The well-known rightmost page split optimization (where we apply the leaf page fill factor) usually does a great job of maximizing space utilization with indexes on a single auto-incrementing column or timestamp column, by packing the left half of the rightmost page full. Splitting the rightmost page predicts further splits of the new rightmost page. If we really are inserting ever-increasing values into the index, then we win by packing the left half full -- we get very high space utilization. If we're wrong to assume that another rightmost page split is sure to follow soon, then we lose on space utilization, but only by a tiny amount, and only once in a blue moon -- which doesn't matter at all.
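To make the trade-off concrete, here's a toy standalone simulation (emphatically not nbtree code -- the page capacity and split rule are simplified assumptions) that inserts strictly-increasing keys and compares the utilization of finished leaf pages under a naive 50/50 split against a fillfactor-90 style rightmost split:

/*
 * Toy simulation, not nbtree code: insert strictly-increasing
 * "tuples" into fixed-capacity leaf pages, splitting the rightmost
 * page when it overflows.  Report the average utilization of the
 * finished (non-rightmost) pages.
 */
#include <stdio.h>

#define PAGE_CAPACITY   100     /* tuples per leaf page */

static double
simulate(int ntuples, int fillfactor)
{
    long    finished_pages = 0;
    long    finished_tuples = 0;
    int     rightmost = 0;      /* tuples on current rightmost page */

    for (int i = 0; i < ntuples; i++)
    {
        if (rightmost == PAGE_CAPACITY)
        {
            /* split: leave fillfactor% behind, move the rest right */
            int     left = PAGE_CAPACITY * fillfactor / 100;

            finished_pages++;
            finished_tuples += left;
            rightmost = PAGE_CAPACITY - left;
        }
        rightmost++;
    }

    return (double) finished_tuples / (finished_pages * PAGE_CAPACITY);
}

int
main(void)
{
    printf("50/50 split:   %.0f%% utilization\n", simulate(1000000, 50) * 100);
    printf("fillfactor 90: %.0f%% utilization\n", simulate(1000000, 90) * 100);
    return 0;
}

With strictly-increasing insertions every non-rightmost page stays exactly as full as its split left it, so this prints 90% utilization for the fillfactor-90 strategy and 50% for the naive split.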
This seems to me to be fundamentally based on the assumption that you either have random insertions or sequential insertions, with no gray area between the two. Commit f21668f328c more or less generalized this long-established optimization to work with cases where it makes sense to split at the rightmost page of a *grouping* within the index, rather than the *entire* index, but that is almost unrelated to what I want to bring up now.

My concern right now is sensor data and IoT data, which seem to call the "random or sequential" assumption into question. Such data often consists of timestamps from a large number of low-cost devices -- event data that arrives *approximately* in order. This is more or less the problem that the TimescaleDB extension targets, so it seems likely that a fair number of users care about getting it right, even if they don't know it.

I happened to notice that the largest TPC-E index has this exact problem (it was the largest back when VMware ran TPC-E on Postgres [1], which they specifically complained about at the time). The "trade_history" table's primary key has tuples inserted almost-but-not-quite in order. Sometimes packing the left half of the rightmost page 90% full works out, because the remaining space is (just barely) enough to absorb all future insertions that arrive "slightly late". More often it turns out that this isn't enough space, and the almost-rightmost page is split a second time, which is very inefficient in lots of ways, even compared to the case where we reduce fill factor to 50.

I have found that reducing trade_history_pkey's fill factor to about 70 increases leaf page space utilization, leaving it at about 85% (it was under 50% with a fill factor of 90). I tend to doubt that this is a practical tuning strategy in the real world, though, since varying the TPC-E scale alters which exact setting is effective.

Ideally, nbtsplitloc.c could intelligently adapt to slightly-out-of-order insertions in order to maximize space utilization -- it could *dynamically* tune the fill factor in response to ongoing observations about page splits. Importantly, this would allow nbtree to avoid unnecessarily splitting the same page twice in a row, having packed it almost full on the first split -- this situation seems truly pathological to me (think about the WAL overhead). ISTM that adding such an optimization would require maintaining dedicated bookkeeping metadata, which doesn't seem particularly appealing. It might have acceptable overhead within a single backend, though. I'm thinking of something like the RelationGetTargetBlock() stuff; a toy sketch of what I have in mind appears at the end of this email.

Has anybody else thought about this? What would a robust algorithm look like? Perhaps this is a problem that isn't worth solving right now, but it is definitely a real problem.

[1] https://www.postgresql.org/message-id/66ce997fb523c04e9749452273184c6c137cb88...@exch-mbx-113.vmware.com

-- 
Peter Geoghegan
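To make the dynamic-tuning idea slightly more concrete, here is the toy standalone sketch mentioned above. Every name in it is hypothetical -- none of these fields or functions exist -- and it deliberately glosses over where the per-backend cache would live and how it would be invalidated. It just shows the shape of the feedback loop I have in mind, in the spirit of RelationGetTargetBlock()'s cached target block:

/*
 * Hypothetical sketch only.  Remember which page our previous split
 * packed full; if we're asked to split that same page again, take it
 * as evidence that we left too little free space behind for this
 * index's "slightly late" insertions, and back off the fillfactor
 * we'll apply next time.  The step sizes are arbitrary.
 */
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for Postgres's BlockNumber */

typedef struct SplitFeedback
{
    BlockNumber last_packed_blkno;  /* left page from our previous split */
    int         effective_ff;       /* dynamically tuned fillfactor */
} SplitFeedback;

/* called where nbtsplitloc.c would otherwise apply the fillfactor GUC */
static int
effective_fillfactor(SplitFeedback *fb, BlockNumber target, int gucval)
{
    if (target == fb->last_packed_blkno)
    {
        /*
         * Splitting the page we just packed: the free space we left
         * behind didn't absorb the late arrivals.  Back off,
         * bottoming out at a 50/50 split.
         */
        fb->effective_ff -= 10;
        if (fb->effective_ff < 50)
            fb->effective_ff = 50;
    }
    else if (fb->effective_ff < gucval)
    {
        /* splits are landing elsewhere; drift back toward the GUC */
        fb->effective_ff += 5;
    }

    fb->last_packed_blkno = target;     /* remember what we packed */
    return fb->effective_ff;
}

The attraction is that catching "we're splitting the page our own previous split just packed full" takes only one remembered block number per index, which seems about as cheap as this kind of bookkeeping can get. Whether that's robust when many backends insert into the same index is exactly the open question.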