On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> But there's a layering problem that I don't know how to solve - I don't
> see how we could make indexam.c entirely oblivious to the prefetching,
> and move it entirely to the executor. Because how else would you know
> what to prefetch?
Yeah, that seems impossible. Some thoughts:

* I think perhaps the subject line of this thread is misleading. It doesn't seem like there is any index prefetching going on here at all, and there couldn't be, unless you extended the index AM API with new methods. What you're actually doing is prefetching heap pages that will be needed by a scan of the index. I think this confusing naming has propagated itself into some parts of the patch, e.g. index_prefetch() reads *from the heap*, which is not at all clear from the comment saying "Prefetch the TID, unless it's sequential or recently prefetched." You're not prefetching the TID: you're prefetching the heap tuple to which the TID points. That's not an academic distinction IMHO -- the TID is stored in the index, so if we were prefetching the TID, we'd have to be reading index pages, not heap pages.

* Regarding layering, my first thought was that the changes to index_getnext_tid() and index_getnext_slot() are sensible: read ahead by some number of TIDs, keep the TIDs you've fetched in an array someplace, use that to drive prefetching of blocks on disk, and return the previously-read TIDs from the queue without letting the caller know that the queue exists. I think that's the obvious design for a feature of this type, to the point where I don't really see that there's a viable alternative design. Driving something down into the individual index AMs would make sense if you wanted to prefetch *from the indexes*, but it's unnecessary otherwise, and best avoided.

* But that said, the skip_all_visible flag passed down to index_prefetch() looks like a VERY strong sign that the layering here is not what it should be. Right now, when some code calls index_getnext_tid(), that function does not need to know or care whether the caller is going to fetch the heap tuple or not. But with this patch, the code does need to care.
So knowledge of the executor concept of an index-only scan trickles down into indexam.c, which now has to be able to make decisions that are consistent with the ones that the executor will make. That doesn't seem good at all.

* I think it might make sense to have two different prefetching schemes. Ideally they could share some structure. If a caller is using index_getnext_slot(), then it's easy for prefetching to be fully transparent. The caller can just ask for TIDs, and the prefetching distance and TID queue can be fully under the control of something that is hidden from the caller. But when using index_getnext_tid(), the caller needs to have an opportunity to evaluate each TID and decide whether we even want the heap tuple. If yes, then we feed that TID to the prefetcher; if no, we don't. That way, we're not replicating executor logic in lower-level code. However, that also means that the IOS logic needs to be aware that this TID queue exists and interact with whatever controls the prefetch distance. Perhaps after calling index_getnext_tid() you call index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and then you call index_prefetcher_get_tid() to drain the queue. Perhaps also the prefetcher has a "fill" callback that gets invoked when the TID queue isn't as full as the prefetcher wants it to be. Then index_getnext_slot() can just install a trivial fill callback that says index_prefetcher_put_tid(prefetcher, index_getnext_tid(...), true), but IOS can use a more sophisticated callback that checks the VM to determine what to pass for the third argument.

* I realize that I'm being a little inconsistent in what I just said, because in the first bullet point I said that this wasn't really index prefetching, and now I'm proposing function names that still start with index_prefetch. It's not entirely clear to me what the best thing to do about the terminology is here -- could it be a heap prefetcher, or a TID prefetcher, or an index scan prefetcher?
I don't really know, but whatever we can do to make the naming more clear seems like a really good idea. Maybe there should be a clearer separation between the queue of TIDs that we're going to return from the index and the queue of blocks that we want to prefetch to get the corresponding heap tuples -- making that separation crisper might ease some of the naming issues.

* Not that I want to be critical, because I think this is a great start on an important project, but it does look like there's an awful lot of stuff here that still needs to be sorted out before it would be reasonable to think of committing this, both in terms of design decisions and just general polish. There's a lot of stuff marked with XXX, and I think that's great because most of those seem to be good questions -- but that does leave the, err, small problem of figuring out the answers. index_prefetch_is_sequential() makes me really nervous because it seems to depend an awful lot on whether the OS is doing prefetching, and how the OS is doing prefetching, and I think those might not be consistent across all systems and kernel versions. Similarly with index_prefetch(). There's a lot of "magical" assumptions here. Even index_prefetch_add_cache() has this problem -- the function assumes that it's OK if we sometimes fail to detect a duplicate prefetch request, which makes sense, but under what circumstances is it necessary to detect duplicates and in what cases is it optional? The function comments are silent about that, which makes it hard to assess whether the algorithm is good enough.

* In terms of polish, one thing I noticed is that index_getnext_slot() calls index_prefetch_tids() even when scan->xs_heap_continue is set, which seems like it must be a waste, since we can't really need to kick off more prefetch requests halfway through a HOT chain referenced by a single index tuple, can we? Also, blks_prefetch_rounds doesn't seem to be used anywhere, and neither that nor blks_prefetches are documented.
In fact there's no new documentation at all, which seems probably not right. That's partly because there are no new GUCs, which is typically where the behavior of a feature like this would be mentioned in the documentation. I don't think it's a good idea to tie the behavior of this feature to effective_io_concurrency, partly because it's usually a bad idea to make one setting control multiple different things, but perhaps even more because effective_io_concurrency doesn't actually work in a useful way AFAICT, and people typically have to set it to some very artificially large value compared to how much real I/O parallelism they have. So probably there should be new GUCs with hopefully-better semantics, but at least the documentation for any existing ones would need updating, I would think.

--
Robert Haas
EDB: http://www.enterprisedb.com