The well-known rightmost page split optimization (where we apply the leaf page fill factor) usually does a great job of maximizing space utilization with indexes on a single auto-incrementing column or timestamp column, by packing the left half of the rightmost page full. Splitting the rightmost page predicts further splits of the new rightmost page. If we really are inserting ever-increasing values into the index, then we win by packing the left half full -- we get very high space utilization. If we're wrong to assume that another rightmost page split is sure to follow soon, then we lose on space utilization, but only by a tiny amount, and only once in a blue moon -- which doesn't matter at all.
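To make the trade-off concrete, here's a toy standalone simulation (emphatically not nbtree code -- the page capacity and split rule are simplified assumptions) that inserts strictly-increasing keys and compares the utilization of finished leaf pages under a naive 50/50 split against a fillfactor-90 style rightmost split:

/*
 * Toy simulation, not nbtree code: insert strictly-increasing
 * "tuples" into fixed-capacity leaf pages, splitting the rightmost
 * page when it overflows.  Report the average utilization of the
 * finished (non-rightmost) pages.
 */
#include <stdio.h>

#define PAGE_CAPACITY   100     /* tuples per leaf page */

static double
simulate(int ntuples, int fillfactor)
{
    long    finished_pages = 0;
    long    finished_tuples = 0;
    int     rightmost = 0;      /* tuples on current rightmost page */

    for (int i = 0; i < ntuples; i++)
    {
        if (rightmost == PAGE_CAPACITY)
        {
            /* split: leave fillfactor% behind, move the rest right */
            int     left = PAGE_CAPACITY * fillfactor / 100;

            finished_pages++;
            finished_tuples += left;
            rightmost = PAGE_CAPACITY - left;
        }
        rightmost++;
    }

    return (double) finished_tuples / (finished_pages * PAGE_CAPACITY);
}

int
main(void)
{
    printf("50/50 split:   %.0f%% utilization\n", simulate(1000000, 50) * 100);
    printf("fillfactor 90: %.0f%% utilization\n", simulate(1000000, 90) * 100);
    return 0;
}

With strictly-increasing insertions every non-rightmost page stays exactly as full as its split left it, so this prints 90% utilization for the fillfactor-90 strategy and 50% for the naive split.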
This seems to me to be fundamentally based on the assumption that you either have random insertions or sequential insertions, with no gray area between the two. Commit f21668f328c more or less generalized this long-established optimization to work with cases where it makes sense to split at the rightmost page of a *grouping* within the index, rather than the *entire* index, but that is almost unrelated to what I want to bring up now.

My concern right now is sensor data and IoT data, which seem to call the "random or sequential" assumption into question. Such data often consists of timestamps from a large number of low-cost devices -- event data that arrives *approximately* in order. This is more or less the problem that the TimescaleDB extension targets, so it seems likely that a fair number of users care about getting it right, even if they don't know it.

I happened to notice that the largest TPC-E index has this exact problem (it was the largest back when VMware ran TPC-E on Postgres [1], which they specifically complained about at the time). The "trade_history" table's primary key has tuples inserted almost-but-not-quite in order. Sometimes packing the left half of the rightmost page 90% full works out, because the remaining space is (just barely) enough to absorb all future insertions that arrive "slightly late". More often it turns out that this isn't enough space, and the almost-rightmost page is split a second time, which is very inefficient in lots of ways, even compared to the case where we reduce fill factor to 50.

I have found that reducing trade_history_pkey's fill factor to about 70 increases leaf page space utilization, leaving it at about 85% (it was under 50% with a fill factor of 90). I tend to doubt that this is a practical tuning strategy in the real world, though, since varying the TPC-E scale alters which exact setting is effective.

Ideally, nbtsplitloc.c could intelligently adapt to slightly-out-of-order insertions in order to maximize space utilization -- it could *dynamically* tune the fill factor in response to ongoing observations about page splits. Importantly, this would allow nbtree to avoid unnecessarily splitting the same page twice in a row, having packed it almost full on the first split -- this situation seems truly pathological to me (think about the WAL overhead). ISTM that adding such an optimization would require maintaining dedicated bookkeeping metadata, which doesn't seem particularly appealing. It might have acceptable overhead within a single backend, though. I'm thinking of something like the RelationGetTargetBlock() stuff; a toy sketch of what I have in mind appears at the end of this email.

Has anybody else thought about this? What would a robust algorithm look like? Perhaps this is a problem that isn't worth solving right now, but it is definitely a real problem.

[1] https://www.postgresql.org/message-id/66ce997fb523c04e9749452273184c6c137cb88...@exch-mbx-113.vmware.com

-- 
Peter Geoghegan
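To make the dynamic-tuning idea slightly more concrete, here is the toy standalone sketch mentioned above. Every name in it is hypothetical -- none of these fields or functions exist -- and it deliberately glosses over where the per-backend cache would live and how it would be invalidated. It just shows the shape of the feedback loop I have in mind, in the spirit of RelationGetTargetBlock()'s cached target block:

/*
 * Hypothetical sketch only.  Remember which page our previous split
 * packed full; if we're asked to split that same page again, take it
 * as evidence that we left too little free space behind for this
 * index's "slightly late" insertions, and back off the fillfactor
 * we'll apply next time.  The step sizes are arbitrary.
 */
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for Postgres's BlockNumber */

typedef struct SplitFeedback
{
    BlockNumber last_packed_blkno;  /* left page from our previous split */
    int         effective_ff;       /* dynamically tuned fillfactor */
} SplitFeedback;

/* called where nbtsplitloc.c would otherwise apply the fillfactor GUC */
static int
effective_fillfactor(SplitFeedback *fb, BlockNumber target, int gucval)
{
    if (target == fb->last_packed_blkno)
    {
        /*
         * Splitting the page we just packed: the free space we left
         * behind didn't absorb the late arrivals.  Back off,
         * bottoming out at a 50/50 split.
         */
        fb->effective_ff -= 10;
        if (fb->effective_ff < 50)
            fb->effective_ff = 50;
    }
    else if (fb->effective_ff < gucval)
    {
        /* splits are landing elsewhere; drift back toward the GUC */
        fb->effective_ff += 5;
    }

    fb->last_packed_blkno = target;     /* remember what we packed */
    return fb->effective_ff;
}

The attraction is that catching "we're splitting the page our own previous split just packed full" takes only one remembered block number per index, which seems about as cheap as this kind of bookkeeping can get. Whether that's robust when many backends insert into the same index is exactly the open question.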