On 4/2/25 17:45, Peter Geoghegan wrote:
> On Wed, Apr 2, 2025 at 11:36 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Ouch!  I had no idea it had gotten that big.  Yeah, we ought to
>> do something about that.
>
> Tomas Vondra talked about this recently, in the context of his work on
> prefetching.
>

I might have mentioned it in the context of index prefetching (because
that has to touch this, naturally), but I actually ran into this when
working on the fast-path locking [1].

[1]
https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9...@enterprisedb.com

One of the tests I did was with partitions, and with index scans on
tiny partitions that got pretty awful simply because of malloc() calls.
The struct exceeds ALLOCSET_SEPARATE_THRESHOLD, so it can't be cached,
and even if it could we would not cache it across scans anyway.

>>> And/or perhaps we could allocate BTScanOpaqueData.markPos as a whole
>>> only when mark/restore are used?
>>
>> That'd be an easy way of removing about half of the problem, but
>> 14kB is still too much.  How badly do we need this items array?
>> Couldn't we just reference the on-page items?
>
> I'm not sure what you mean by that. The whole design of _bt_readpage
> is based on the idea that we read a whole page, in one go. It has to
> batch up the items that are to be returned from the page somewhere.
> The worst case is that there are about 1350 TIDs to return from any
> single page (assuming default BLCKSZ). It's very pessimistic to start
> from the assumption that that worst case will be hit, but I don't see
> a way around doing it at least some of the time.
>
> The first thing I'd try is some kind of simple dynamic allocation
> scheme, with a small built-in array that avoided any allocation
> penalty in the common case where there weren't too many tuples to
> return from the page.
>
> The way that we allocate BLCKSZ twice for index-only scans (one for
> so->currTuples, the other for so->markTuples) is also pretty
> inefficient. Especially because any kind of use of mark and restore is
> exceedingly rare.
>

Yeah, something like this (allocating smaller arrays unless more is
actually needed) would help many common cases.

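To make the "small built-in array" idea a bit more concrete, here is a
rough standalone sketch in plain C. It is not PostgreSQL code and not a
proposed patch: ItemBuf, Item, INLINE_ITEMS and the item_buf_* functions
are hypothetical names made up for illustration, it uses malloc() rather
than palloc() and memory contexts, and MAX_ITEMS is only the approximate
worst case quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes for illustration, not PostgreSQL's actual constants. */
#define INLINE_ITEMS 32     /* capacity of the built-in array */
#define MAX_ITEMS    1350   /* approximate per-page worst case quoted above */

/* Simplified stand-in for a TID: block number plus offset. */
typedef struct Item
{
    unsigned        block;
    unsigned short  offset;
} Item;

typedef struct ItemBuf
{
    Item   *items;                      /* inline_items or a heap array */
    size_t  nitems;
    size_t  capacity;
    Item    inline_items[INLINE_ITEMS]; /* no allocation while this suffices */
} ItemBuf;

static void
item_buf_init(ItemBuf *buf)
{
    buf->items = buf->inline_items;
    buf->nitems = 0;
    buf->capacity = INLINE_ITEMS;
}

/* Append one item; returns 0 on success, -1 if out of memory. */
static int
item_buf_add(ItemBuf *buf, Item item)
{
    assert(buf->nitems < MAX_ITEMS);

    if (buf->nitems == buf->capacity)
    {
        size_t  newcap = buf->capacity * 2;
        Item   *newitems;

        if (newcap > MAX_ITEMS)
            newcap = MAX_ITEMS;

        if (buf->items == buf->inline_items)
        {
            /* first overflow: move from the built-in array to the heap */
            newitems = malloc(newcap * sizeof(Item));
            if (newitems == NULL)
                return -1;
            memcpy(newitems, buf->inline_items, buf->nitems * sizeof(Item));
        }
        else
        {
            newitems = realloc(buf->items, newcap * sizeof(Item));
            if (newitems == NULL)
                return -1;
        }
        buf->items = newitems;
        buf->capacity = newcap;
    }

    buf->items[buf->nitems++] = item;
    return 0;
}

/* Release any heap array and go back to the built-in one. */
static void
item_buf_free(ItemBuf *buf)
{
    if (buf->items != buf->inline_items)
        free(buf->items);
    item_buf_init(buf);
}
```

With something along these lines, a scan that returns only a handful of
items per page never allocates at all, and the worst-case array is only
paid for by the scans that actually hit it. The 32-item inline size is
an arbitrary guess; picking the right value would need measurements.
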
Another thing that helped was setting the MALLOC_TOP_PAD_ environment
variable (or the same thing using mallopt() with M_TOP_PAD), so that
glibc keeps a "buffer" of memory for future allocations.


regards

--
Tomas Vondra