On Fri, 9 Apr 2021 at 16:58, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>
> On 4/9/21 7:33 PM, James Coleman wrote:
> > A specific area where this is particularly painful is btree index reads.
> > Walking the tree to leaf pages isn't naturally prefetchable, and so for
> > each level you pay the random page cost. Of course higher levels in the
> > tree will almost certainly exhibit emergent behavior such that they
> > (just by fact of the LRU caching) will be in the buffer cache, but for a
> > large index lower levels likely won't be.

We've talked before about buffering inserts even just for disk-based
indexes, much like the way GIN buffers inserts and periodically flushes
them out. We talked about keeping a local buffer in each session, since
no other session even needs to see these buffered inserts until commit
anyway, and we can merge multiple keys into the index at once more
efficiently than inserting them one by one.

But that was just for disk i/o. For something with longer latency it
would be an even bigger win: buffer the inserted keys in local memory in
case you do lookups on them in the same session, and start the i/o to
insert the rows into the index, but handle that in the background or in
a separate process, without blocking the transaction until commit.
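Roughly the shape I have in mind is something like the sketch below. To
be clear, this is not PostgreSQL code and the names (SessionInsertBuffer,
index_insert_sorted_batch, and so on) are made up for illustration; it
just shows keys piling up in session-local memory and going into the
index as one sorted batch:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Toy sketch only: a per-session buffer of pending index inserts.
 * Keys accumulate locally and are sorted and applied to the index in
 * one batch, so nearby keys share the descent to a leaf page.
 */
#define INSERT_BUF_SIZE 1024

typedef struct SessionInsertBuffer
{
    uint64_t    keys[INSERT_BUF_SIZE];
    int         nkeys;
} SessionInsertBuffer;

static int
cmp_keys(const void *a, const void *b)
{
    uint64_t    ka = *(const uint64_t *) a;
    uint64_t    kb = *(const uint64_t *) b;

    return (ka > kb) - (ka < kb);
}

/* Stand-in for the real index insert: given a sorted batch we can walk
 * down to each leaf once and apply every key that belongs there. */
static void
index_insert_sorted_batch(const uint64_t *keys, int nkeys)
{
    (void) keys;
    (void) nkeys;
    /* ... actual tree descent and page updates would go here ... */
}

/* Flush at commit, or in the background once enough keys have piled up. */
static void
buffer_flush(SessionInsertBuffer *buf)
{
    qsort(buf->keys, buf->nkeys, sizeof(uint64_t), cmp_keys);
    index_insert_sorted_batch(buf->keys, buf->nkeys);
    buf->nkeys = 0;
}

/* Called per inserted tuple; the index i/o itself is deferred. */
static void
buffer_insert(SessionInsertBuffer *buf, uint64_t key)
{
    if (buf->nkeys == INSERT_BUF_SIZE)
        buffer_flush(buf);
    buf->keys[buf->nkeys++] = key;
}

/* Lookups in the same session have to check the local buffer too,
 * because these keys aren't visible in the index yet. */
static bool
buffer_contains(const SessionInsertBuffer *buf, uint64_t key)
{
    for (int i = 0; i < buf->nkeys; i++)
    {
        if (buf->keys[i] == key)
            return true;
    }
    return false;
}

The difference from the GIN pending list is that the pending list lives
in the index itself, so every other scan has to search it as well; a
session-local buffer is only ever visible to the session that owns it.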
> What do you consider a large index level?
>
> Consider a 1TB table, with just a single UUID column - that's ~25B rows,
> give or take. Real tables will have more columns, so this seems like a
> reasonable model of the largest number of rows per relation. With ~32B
> per index tuple, that's about 100M leaf pages, and with ~256 branches
> per internal page, that's still only ~5 levels. I think it's quite rare
> to see indexes with more than 6 or 7 levels.

That's a good model for a well-designed schema with an efficient index.
There are plenty of less-than-optimal schemas with indexes on longer
column lists or fairly large text fields....
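For what it's worth, the arithmetic in the quoted model is easy to
reproduce. The snippet below just plugs in those same assumed figures
(they're the example's assumptions, not measurements), and also shows
that even a much fatter key, say ~40 downlinks per internal page on the
same number of leaf pages, only adds a level:

#include <math.h>
#include <stdio.h>

/* Back-of-the-envelope only; constants are the assumptions from the
 * quoted example. */
int
main(void)
{
    double      leaf_pages = 100e6; /* ~100M leaf pages for ~25B 32-byte tuples */
    int         depth;

    /* internal levels needed so that fanout^levels >= leaf_pages, plus leaves */
    depth = (int) ceil(log(leaf_pages) / log(256.0)) + 1;
    printf("fanout 256 -> depth %d\n", depth);  /* 5 */

    depth = (int) ceil(log(leaf_pages) / log(40.0)) + 1;
    printf("fanout  40 -> depth %d\n", depth);  /* 6 */

    return 0;
}

-- 
greg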