On Fri, May 7, 2021 at 3:34 PM Greg Stark <st...@mit.edu> wrote:
> We've talked before about buffering inserts even just for disk-based
> indexes. Much like how GIN buffers inserts and periodically flushes
> them out. We talked about doing a local buffer in each session since
> no other session even needs to see these buffered inserts until commit
> anyways. And we can more efficiently merge in multiple keys at once
> than doing them one by one.
Mark Callaghan's high level analysis of the trade-offs here is worth a
read, too.

> That's a good model for a well-designed schema with an efficient
> index. There are plenty of less-than-optimal schemas with indexes on
> longer column lists or fairly large text fields....

Suffix truncation can take care of this -- all you really need is a
minimally distinguishing separator key to delineate which values belong
on which page one level down. It is almost always possible for leaf
page splits to find a way to make the new high key (which is also the
key inserted into the parent level) much smaller than your typical key.
Granted, we don't yet have what I've called "classic" suffix truncation
(truncation within a text column), so this analysis isn't going to work
for long text keys -- we currently only truncate at the granularity of
whole attributes.

Even if we're pessimistic about suffix truncation, the logarithmic rate
of growth still wins -- Tomas' analysis is sound. You cannot
realistically make a Postgres B-Tree have more than about 1% of all
pages be internal pages, unless you make the indexed keys ludicrously
large -- as in several hundred bytes each (~0.5% is typical in
practice). I think that 6 levels is very pessimistic, even for a
massive B-Tree with weirdly large keys (rough sketches of both points
are appended at the end of this mail). My mental model for internal
pages is that they are practically guaranteed to be in shared_buffers
at all times, which is about as accurate as any generalization like
that can ever be.

I once wrote a test harness that deliberately created a B-Tree that was
as tall as possible -- something with the largest possible index tuples
at the leaf level (I had to disable TOAST for this). I think that it
was about 7 or 8 levels deep. The CPU overhead of the test case made it
excruciatingly slow, but it wasn't I/O bound at all (I'm pretty sure it
all fit in shared_buffers).

--
Peter Geoghegan
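
To make the separator-key point concrete, here is a toy Python sketch
of attribute-granularity suffix truncation at a leaf page split. It is
an illustration of the concept only, not nbtree's actual split code;
the example tuples and the use of None as a stand-in for a truncated
("minus infinity") attribute are made up for the sketch.

# Toy illustration of attribute-granularity suffix truncation at a
# leaf page split. Not nbtree's split logic; None stands in for a
# truncated ("minus infinity") attribute.

def truncated_separator(last_left, first_right):
    """Keep only the leading attributes of first_right that are needed
    to sort strictly above last_left; truncate (None out) the rest."""
    for n in range(1, len(first_right) + 1):
        if first_right[:n] > last_left[:n]:
            return first_right[:n] + (None,) * (len(first_right) - n)
    return first_right  # all attributes equal: no truncation possible

last_left = ('Callaghan', 'Mark', 20210507)     # last tuple staying on the left page
first_right = ('Geoghegan', 'Peter', 20210507)  # first tuple going to the right page
print(truncated_separator(last_left, first_right))
# ('Geoghegan', None, None) -- much smaller than a full-width key

The only thing the parent level needs is that minimally distinguishing
prefix, which is why the separator can usually be far smaller than a
typical leaf tuple.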
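
And to put rough numbers on the fanout claims, a back-of-envelope
Python sketch. The page and tuple sizes are assumptions (8KB pages,
small truncated separators, ~40 byte leaf entries), so treat the output
as an order-of-magnitude illustration rather than a measurement; the
point is only how slowly the number of levels and the internal-page
fraction grow.

# Back-of-envelope arithmetic for B-Tree height and internal-page
# fraction. The sizes below are assumptions, not measurements.
import math

PAGE_SIZE = 8192          # default Postgres block size
PAGE_OVERHEAD = 200       # rough allowance for page header/special space
SEP_TUPLE_SIZE = 24       # assumed truncated separator + item overhead
LEAF_TUPLE_SIZE = 40      # assumed typical leaf index tuple + item overhead

fanout = (PAGE_SIZE - PAGE_OVERHEAD) // SEP_TUPLE_SIZE      # ~330 downlinks/page
leaf_fill = (PAGE_SIZE - PAGE_OVERHEAD) // LEAF_TUPLE_SIZE  # ~200 leaf tuples/page

def levels_and_internal_fraction(nkeys):
    leaf_pages = math.ceil(nkeys / leaf_fill)
    levels, pages, internal = 1, leaf_pages, 0
    while pages > 1:
        pages = math.ceil(pages / fanout)  # one downlink per child page
        internal += pages
        levels += 1
    return levels, internal / (leaf_pages + internal)

for n in (10**6, 10**9, 10**12):
    levels, frac = levels_and_internal_fraction(n)
    print(f"{n:>16,} keys: {levels} levels, {frac:.2%} internal pages")

With these assumptions even a trillion keys comes out at about 5
levels, with well under 1% of all pages being internal.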