On Fri, Jul 18, 2025 at 10:47 PM Andres Freund <and...@anarazel.de> wrote:
> > I think that the table AM probably needs to have its own definition of
> > a batch (or some other distinct phrase/concept) -- it's not
> > necessarily the same group of TIDs that are associated with a batch on
> > the index AM side.
>
> I assume, for heap, it'll always be a narrower definition than for the
> indexam, basically dealing with all the TIDs that fit within one page at once?

Yes, I think so.

> > (Within an index AM, there is a 1:1 correspondence between batches and leaf
> > pages, and batches need to hold on to a leaf page buffer pin for a
> > time. None of this should really matter to the table AM.)
>
> To some degree the table AM will need to care about the index level batching -
> we have to be careful about how many pages we keep pinned overall. Which is
> something that both the table and the index AM have some influence over.

Can't they operate independently? If not (if there must be a
per-executor-node hard limit on pins held or whatever), then I still
see no need for close coordination.

> > At a high level, the table AM (and/or its read stream) asks for so
> > many heap blocks/TIDs. Occasionally, index AM implementation details
> > (i.e. the fact that many index leaf pages have to be read to get very
> > few TIDs) will result in that request not being honored. The interface
> > that the table AM uses must therefore occasionally answer "I'm sorry,
> > I can only reasonably give you so many TIDs at this time". When that
> > happens, the table AM has to make do. That can be very temporary, or
> > it can happen again and again, depending on implementation details
> > known only to the index AM side (though typically it'll never happen
> > even once).
>
> I think that requirement will make things more complicated. Why do we need to
> have it?

What if it turns out that there is a large run of contiguous leaf
pages that contain no more than 2 or 3 matching index tuples? What if
there are no matches at all across many leaf pages? Surely we have to
back off with prefetching when that happens.
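
To make the "I can only give you so many TIDs" case concrete, here is a
minimal standalone sketch. It is not code from either patch: the
LeafScanSketch struct, sketch_request_tids(), and the page_budget
parameter are all hypothetical names invented for illustration. It
assumes that the only thing limiting a request is the number of leaf
pages the index side is willing to read on its behalf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an index scan: matching TIDs per leaf page. */
typedef struct LeafScanSketch
{
	const int  *matches_per_page;	/* matches on each leaf page, scan order */
	size_t		npages;				/* number of leaf pages in the scan */
	size_t		next_page;			/* next leaf page to read */
} LeafScanSketch;

/*
 * Ask for at least 'want' TIDs, reading no more than 'page_budget' leaf
 * pages.  Returns the number of TIDs actually produced.  The result can be
 * smaller than 'want' (sparse matches used up the budget, or the scan
 * ended), and it can be larger, because a leaf page is always consumed
 * whole (one batch per leaf page).
 */
static int
sketch_request_tids(LeafScanSketch *scan, int want, int page_budget)
{
	int			got = 0;
	int			pages_read = 0;

	assert(want >= 0 && page_budget >= 0);

	while (got < want &&
		   pages_read < page_budget &&
		   scan->next_page < scan->npages)
	{
		got += scan->matches_per_page[scan->next_page++];
		pages_read++;
	}

	return got;
}
```

With a sparse run of leaf pages the caller gets back far fewer TIDs than
it asked for and has to make do. With dense leaf pages the budget never
comes into play, which is the typical case.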

> > * The table AM knows essentially nothing about leaf pages/index AM
> > batches -- it just has some general idea that sometimes it cannot have
> > its request honored, in which case it must make do.
>
> Not entirely convinced by this one.

We can probably get away with modelling all costs on the index AM side
as the number of pages read. This isn't all that accurate: some pages
are more expensive to read than others, and it's more expensive to
start a new primitive index scan/index search than it is to just step
to the next page. But it's probably close enough for our purposes. And
I think that it'll generalize reasonably well across all index AMs.

> > * This other index AM layer does still know that it isn't cool to drop
> > leaf page buffer pins before we're done reading the corresponding heap
> > TIDs, due to heapam implementation details around making concurrent
> > heap TID recycling safe.
>
> I'm not sure why this needs to live in the generic code, rather than the
> specific index AM?

Currently, the "complex" patch calls into nbtree to release its buffer
pin -- it does this by calling btfreebatch(). btfreebatch() is not
completely trivial (it also calls _bt_killitems() as needed). But nbtree
doesn't know when or how that'll happen. We're not obligated to do it
in precisely the same order as the order the pages were read in, for
example. In principle, the new indexam.c layer could do this in almost
any order.

> > I'm not really sure how the table AM lets the new index AM layer know "okay,
> > done with all those TIDs now" in a way that is both correct (in terms of
> > avoiding unsafe concurrent TID recycling) and also gives the table AM the
> > freedom to do its own kind of batch access at the level of heap pages.
>
> I'd assume that the table AM has to call some indexam function to release
> index-batches, whenever it doesn't need the reference anymore? And the
> index-batch release can then unpin?

It does. But that can be fairly generic -- btfreebatch will probably
end up looking very similar to (say) hashfreebatch and gistfreebatch.
Again, the indexam.c layer actually gets to decide when it happens --
that's what I meant about it being under its control (I didn't mean
that it literally did everything without involving the index AM).
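
A rough standalone sketch of that division of labor, under my own
assumptions: BatchSketch, freebatch_fn, sketch_am_freebatch() and
sketch_tids_done() are hypothetical names, not the actual interface in
the patch. The generic layer tracks how many TIDs the table AM still
holds for each batch and decides when to release; the index AM callback
only does the release itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical index-side batch: one leaf page and its buffer pin. */
typedef struct BatchSketch
{
	int			leaf_blkno;		/* leaf page this batch came from */
	bool		pinned;			/* is the leaf page pin still held? */
	int			tids_pending;	/* TIDs the table AM isn't done with yet */
} BatchSketch;

/* Hypothetical per-index-AM release callback (in the spirit of btfreebatch). */
typedef void (*freebatch_fn) (BatchSketch *batch);

static void
sketch_am_freebatch(BatchSketch *batch)
{
	/* AM-specific cleanup would go here; this sketch only drops the pin */
	batch->pinned = false;
}

/*
 * Generic layer: the table AM reports that it is done with 'ntids' TIDs
 * from 'batch'.  The batch is released, via the index AM callback, only
 * once no TIDs remain pending.  Returns true if the batch was released.
 */
static bool
sketch_tids_done(BatchSketch *batch, int ntids, freebatch_fn freebatch)
{
	assert(batch->pinned);
	assert(ntids >= 0 && ntids <= batch->tids_pending);

	batch->tids_pending -= ntids;
	if (batch->tids_pending > 0)
		return false;

	freebatch(batch);
	return true;
}
```

Nothing here forces batches to be released in the order their leaf pages
were read. A later batch whose TIDs are finished first can be released
first, and the index AM callback never needs to know.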

-- 
Peter Geoghegan

