On Fri, May 7, 2021 at 3:34 PM Greg Stark <st...@mit.edu> wrote:
> We've talked before about buffering inserts even just for disk-based
> indexes. Much like how GIN buffers inserts and periodically flushes
> them out. We talked about doing a local buffer in each session since
> no other session even needs to see these buffered inserts until commit
> anyways. And we can more efficiently merge in multiple keys at once
> than doing them one by one.
Mark Callaghan's high level analysis of the trade-offs here is worth a
read, too.

> That's a good model for a well-designed schema with an efficient
> index. There are plenty of less-than-optimal schemas with indexes on
> longer column lists or fairly large text fields....

Suffix truncation can take care of this -- all you really need is a
minimally distinguishing separator key to delineate which values belong
on which page one level down. It is almost always possible for leaf
page splits to find a way to make the new high key (which is also the
key inserted into the parent level) much smaller than your typical key.
Granted, we don't yet have what I've called "classic" suffix truncation
(truncation within a text column), so this analysis isn't going to work
for long text keys -- we currently only truncate at the granularity of
whole attributes.

Even if we're pessimistic about suffix truncation, the logarithmic rate
of growth still wins -- Tomas' analysis is sound. You cannot
realistically make a Postgres B-Tree have more than about 1% of all
pages be internal pages, unless you make the indexed keys ludicrously
large -- as in several hundred bytes each (~0.5% is typical in
practice). I think that 6 levels is very pessimistic, even for a
massive B-Tree with weirdly large keys (rough sketches of both points
are appended at the end of this mail). My mental model for internal
pages is that they are practically guaranteed to be in shared_buffers
at all times, which is about as accurate as any generalization like
that can ever be.

I once wrote a test harness that deliberately created a B-Tree that was
as tall as possible -- something with the largest possible index tuples
at the leaf level (I had to disable TOAST for this). I think that it
was about 7 or 8 levels deep. The CPU overhead of the test case made it
excruciatingly slow, but it wasn't I/O bound at all (I'm pretty sure it
all fit in shared_buffers).

--
Peter Geoghegan
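
To make the separator-key point concrete, here is a toy Python sketch
of attribute-granularity suffix truncation at a leaf page split. It is
an illustration of the concept only, not nbtree's actual split code;
the example tuples and the use of None as a stand-in for a truncated
("minus infinity") attribute are made up for the sketch.

# Toy illustration of attribute-granularity suffix truncation at a
# leaf page split. Not nbtree's split logic; None stands in for a
# truncated ("minus infinity") attribute.

def truncated_separator(last_left, first_right):
    """Keep only the leading attributes of first_right that are needed
    to sort strictly above last_left; truncate (None out) the rest."""
    for n in range(1, len(first_right) + 1):
        if first_right[:n] > last_left[:n]:
            return first_right[:n] + (None,) * (len(first_right) - n)
    return first_right  # all attributes equal: no truncation possible

last_left = ('Callaghan', 'Mark', 20210507)     # last tuple staying on the left page
first_right = ('Geoghegan', 'Peter', 20210507)  # first tuple going to the right page
print(truncated_separator(last_left, first_right))
# ('Geoghegan', None, None) -- much smaller than a full-width key

The only thing the parent level needs is that minimally distinguishing
prefix, which is why the separator can usually be far smaller than a
typical leaf tuple.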
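
And to put rough numbers on the fanout claims, a back-of-envelope
Python sketch. The page and tuple sizes are assumptions (8KB pages,
small truncated separators, ~40 byte leaf entries), so treat the output
as an order-of-magnitude illustration rather than a measurement; the
point is only how slowly the number of levels and the internal-page
fraction grow.

# Back-of-envelope arithmetic for B-Tree height and internal-page
# fraction. The sizes below are assumptions, not measurements.
import math

PAGE_SIZE = 8192          # default Postgres block size
PAGE_OVERHEAD = 200       # rough allowance for page header/special space
SEP_TUPLE_SIZE = 24       # assumed truncated separator + item overhead
LEAF_TUPLE_SIZE = 40      # assumed typical leaf index tuple + item overhead

fanout = (PAGE_SIZE - PAGE_OVERHEAD) // SEP_TUPLE_SIZE      # ~330 downlinks/page
leaf_fill = (PAGE_SIZE - PAGE_OVERHEAD) // LEAF_TUPLE_SIZE  # ~200 leaf tuples/page

def levels_and_internal_fraction(nkeys):
    leaf_pages = math.ceil(nkeys / leaf_fill)
    levels, pages, internal = 1, leaf_pages, 0
    while pages > 1:
        pages = math.ceil(pages / fanout)  # one downlink per child page
        internal += pages
        levels += 1
    return levels, internal / (leaf_pages + internal)

for n in (10**6, 10**9, 10**12):
    levels, frac = levels_and_internal_fraction(n)
    print(f"{n:>16,} keys: {levels} levels, {frac:.2%} internal pages")

With these assumptions even a trillion keys comes out at about 5
levels, with well under 1% of all pages being internal.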