On Wed, Apr 9, 2025 at 11:00 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Wed, Apr 9, 2025 at 1:46 PM James Hunter <james.hunter...@gmail.com> wrote:
> > On Mon, Apr 7, 2025 at 7:34 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
> > > <melanieplage...@gmail.com> wrote:
> > > > Thomas mentioned this to me off-list, and I think he's right. We
> > > > likely need to rethink the way parallel bitmap heap scan workers get
> > > > block assignments for reading and prefetching to make it more similar
> > > > to parallel sequential scan. The workers should probably get
> > > > assignments of a range of blocks. On master, each worker does end up
> > > > issuing reads/fadvises for a bunch of blocks in a row -- even though
> > > > that isn't the intent of the parallel bitmap table scan
> > > > implementation. We are losing some of that with the patch -- but only
> > > > because it is behaving as you would expect given the implementation
> > > > and design. I don't consider that a blocker, though.
> > >
> > > I had a crack at this one a few weeks back, and wanted to share some
> > > thoughts.
> >
> > Fwiw, I've had reasonably good results, in a similar situation, by
> > just batching + double-buffering.
> >
> > Since you're using AIO (as Melanie points out) there should be no real
> > penalty to having a single worker request the next batch of blocks,
> > when the current batch is used up.
> >
> > Then you can separate "reading" from "prefetching": whoever reads the
> > last block in the current batch, prefetches the next batch.
>
> I'm trying to understand. IIUC you're suggesting putting the blocks
> coming *out* of a worker's ReadStream into a single common shmem queue
> for any worker to consume (can't do this for block numbers coming *in*
> to the ReadStreams or you're describing what we already have in
> master, all mechanical sympathy for I/O out the window).
I am looking at the pre-streaming code, in PG 17, as I am not familiar
with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
maintained two shared TBM iterators, for PQ. One of the iterators was
the actual, "fetch" iterator; the other was the "prefetch" iterator,
which kept some distance ahead of the "fetch" iterator (to hide read
latency).

In general, I find PG's "streaming" interface to be an awkward way of
thinking about prefetching. But in this specific case, what are you
trying to solve? Is it contention on the shared "iteration" state? Or
is it finding an efficient way to issue lots of async reads? Or is it
trying to issue async reads for blocks laid out consecutively on disk?
(Is that what you mean by "mechanical sympathy for I/O"?)

> Don't you
> have to release the buffers you just streamed after putting their
> block numbers in a common shmem queue, hope they don't get evicted,
> repin them when you pop one out,

Yes, sure. If they do get evicted, then you were prefetching too far
ahead, right? Unpinning + repinning keeps the buffer pool more "fair,"
at the cost of atomic updates + contention. CPUs are happy to evict
prefetched data; they don't care whether you've had the chance to use
it.

> and also contend on the queue for
> every block?

No, because you can read a batch of blocks out to process-local
memory, at one time. Use the standard pattern: lock shared state, read
a batch of work from shared state to local state, unlock shared state.
Contention is amortized over the batch; by picking a reasonable batch
size, you typically make contention small enough that it doesn't
matter.

> What if you go to that queue to find a block, find it is
> empty, but actually another backend still has an I/O in progress that
> you have no way of knowing about?

Any time you try to read a buffer from the buffer pool, there's a
chance some other backend still has an I/O in progress on it, right?
Existing PG code just has everyone wait for the I/O to complete. This
ends up being what you want to do, anyway...

> Suppose you do know about it
> somehow, so now you have to wait for it to complete the IO before you
> can help, but generally speaking it's against the law to wait for
> parallel workers unless you can prove they'll come back (for example
> there might be another node in the plan where they are waiting for
> *you*, deadlock, game over), and even if you could wait safely.

My confusion here is probably due to my unfamiliarity with how you're
handling async I/Os, but once some backend has issued an async I/O,
don't most async I/O frameworks complete the I/O on their own?
Certainly the O/S is going to memcpy() the block into some destination
address. Why can't the first backend to wait for I/O completion also
be responsible for setting the buffer header bits appropriately?
Basically, let whatever backend first needs the block be responsible
for processing the I/O completion (sketched below). Since waiting for
an I/O completion is a blocking, synchronous operation, how is
deadlock possible?

> If
> you don't wait, you just threw parallel fairness out the window.
> Hence desire to make all relevant materials for progress at
> end-of-scan eligible for work stealing.

I may have misunderstood the context of this discussion, but isn't
this all about *async* I/O? (If it's not async, what does "prefetch"
mean?) With async I/O, there's no reason why the entity that issues
the I/O also has to process its completion. (MySQL, for example, IIRC,
uses a dedicated thread pool to process completions.)
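To make that concrete, here is a minimal sketch of the "whoever first
needs the block completes the I/O" pattern, in portable C11 atomics.
The names (shared_buf, io_state, wait_for_block) are made up, and this
is not PG's actual AIO or buffer-header machinery -- it's just the
shape of the idea:

#include <stdatomic.h>

typedef enum io_state { IO_IN_FLIGHT, IO_COMPLETING, IO_VALID } io_state;

typedef struct shared_buf
{
    _Atomic io_state state;     /* lives in shared memory */
    /* ... buffer header bits, page contents, etc. ... */
} shared_buf;

/*
 * Called by any backend that needs the block.  Whoever gets here
 * first "steals" the completion work; everyone else just waits for
 * the buffer to become valid.
 */
void
wait_for_block(shared_buf *buf)
{
    io_state expected = IO_IN_FLIGHT;

    /* First waiter claims the right to process the completion. */
    if (atomic_compare_exchange_strong(&buf->state, &expected,
                                       IO_COMPLETING))
    {
        /* Reap the completion from the kernel (io_uring, etc.),
         * verify the page, set the buffer-header bits... */
        atomic_store(&buf->state, IO_VALID);
        return;
    }

    /* Someone else is completing it (or already has). */
    while (atomic_load(&buf->state) != IO_VALID)
        ;   /* real code would sleep on a condition variable */
}

The point of the sketch is only that the issuing backend never appears
in it: once the read is in flight, any backend can finish the job, so
no one ever waits *on a worker*, only on the I/O itself.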
You just need *some* backend to issue a batch of async reads, on
behalf of *all* backends. And then whenever a particular backend needs
one of the blocks included in that async batch of reads, it waits for
the async read to complete. And if there's no outstanding async read,
then (I think this can be handled fairly well using existing PG code)
it just issues a new, synchronous read.

I think this is just what "prefetch" means. For example, if you ask
the CPU to prefetch address 0xf00 into L1 cache, the CPU doesn't worry
about whether that address was actually loaded in time, or whether it
gets read by the same or a different thread. You have an in-flight,
async operation; and then whenever someone needs the result, they
block (or not) depending on whether the async operation already
completed.

> What I'm speculating about is not a true generalised work stealing or
> scheduling system (take for example Intel TBB[1] and various other
> parallel computing architectures going back decades), so I'm using the
> term pretty loosely: we can't generally chop all of our executor work
> up into tiny tasks and have any thread run any of them at any time to
> make progress (that'd be a nice goal but you basically have to blow
> the executor up with dynamite and then reassemble all the atoms into
> coroutines/continuations/tasks (or various similar-ish concepts) to
> achieve something like that, probably after you first make all of
> PostgreSQL multi-threaded unless you want to do it with your shoelaces
> tied together, and then take a very long holiday). Probably by
> mentioning work stealing I made it sound more complicated than
> necessary, but hear me out:

I don't have an opinion about chopping up (CPU) work, but my point is
that I/Os go to a different resource -- they have a different queue.
So you don't usually *need* to chop up *I/O* work. Assuming you can
saturate I/O bandwidth / fill the I/O queue using a single thread,
then you don't need to worry about all of the above -- at least not
for I/Os!

And, for CPU, you can typically get away with maintaining a single
global queue, and just copying (sequential) batches of work from the
global queue into thread- or process-local memory.

Why do you want to chop up and reassemble work? Is this CPU work or
I/O work?

> A tiny, delimited, node-local kind of work stealing could replace (now
> non-viable) ramp-down schemes,

I guess, for I/Os, the only work-stealing I think about is to handle
I/O completions. And that part, at least, can be done (as sketched
above) by having backends complete async I/Os inside ReadBuffer().

> I admit this all sounds kinda complicated and maybe there is a much
> simpler way to achieve the twin goals of maximising I/O combining AND
> parallel query fairness.

I tend to think that the two goals are so much in conflict that it's
not worth trying to apply cleverness to get them to agree on things...
What I mean is, if you have to take inputs from multiple parties,
combine them, sort them, and then produce a run-length encoding of the
result, then there's always going to be lots of contention. And you'll
also have to deal with increased latency: should I wait to see if more
blocks come in, to see if it lets me combine I/Os, or should I send
the batch now?

But, if you just do the most naive thing possible (which is my
suggestion!), then whatever backend's turn it is to advance the
prefetch iterator, it advances it by (let's say) 50 blocks. It can
then try to combine those 50 blocks, with no contention or added
complexity. Something like the sketch below.
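(Again, hypothetical names -- shared_iter, claim_prefetch_batch -- and
a pthread mutex standing in for an LWLock over the real shared TBM
iterator state. A sketch of the pattern, not actual executor code.)

#include <pthread.h>
#include <stdint.h>

#define PREFETCH_BATCH 50

typedef uint32_t BlockNumber;

typedef struct shared_iter
{
    pthread_mutex_t lock;       /* an LWLock, in real PG */
    BlockNumber     next_block; /* stand-in for the TBM iterator */
    BlockNumber     last_block;
} shared_iter;

/* Copy up to PREFETCH_BATCH block numbers into local memory. */
static int
claim_prefetch_batch(shared_iter *it, BlockNumber *out)
{
    int n = 0;

    pthread_mutex_lock(&it->lock);
    while (n < PREFETCH_BATCH && it->next_block <= it->last_block)
        out[n++] = it->next_block++;
    pthread_mutex_unlock(&it->lock);

    /*
     * Lock released: the caller can now sort 'out', merge adjacent
     * blocks into larger reads, and issue the async I/Os -- all in
     * local memory, without holding any shared lock.
     */
    return n;
}

Each backend calls claim_prefetch_batch() when its local queue runs
dry, so contention on the shared state is paid once per 50 blocks, and
all the I/O combining happens with no contention at all.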
Might that be sufficient?

I think that, generally, there are only a few techniques that work
consistently for concurrency / async operations. And these techniques
are necessarily primitive; they are blunt tools. It sounds like you
are working on a subtle, complex, refined solution... and I am just
doubtful that any such solution will end up being efficient. It is
hard to create a complex concurrent execution model that's also fast.

For example, simple batching is the solution to a lot of problems. Any
time you process a batch of stuff, you amortize any fixed costs over
the size of the batch. This is a blunt tool, but it always works.

Also, work stealing is great, and one of the nicest ways to steal work
is the "async/await" or "coroutines" model, where someone -- we don't
care who! -- sends an async request; and then, at some point in the
future, someone (else? we don't care!) waits for that async request to
complete. And if there's more work to be done to complete the async
request, then whoever waits first does that work. So you steal work
from the thread/process that needs that work to be done -- very nice.
This is also a blunt tool, but it works.

With PG, specifically, waiting for a read into the buffer pool to
complete seems to go through the same interface as actually performing
that read. So this helps you out for prefetch -- which is always
best-effort; it's not the prefetch operation's job to ensure that the
prefetched data doesn't get evicted -- because you really don't even
need to track async reads after you've issued them! This lets PG
prefetch behave like CPU prefetch.

James