On Wed, Apr 9, 2025 at 1:46 PM James Hunter <james.hunter...@gmail.com> wrote:
> On Mon, Apr 7, 2025 at 7:34 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> > > Thomas mentioned this to me off-list, and I think he's right. We
> > > likely need to rethink the way parallel bitmap heap scan workers get
> > > block assignments for reading and prefetching to make it more similar
> > > to parallel sequential scan. The workers should probably get
> > > assignments of a range of blocks. On master, each worker does end up
> > > issuing reads/fadvises for a bunch of blocks in a row -- even though
> > > that isn't the intent of the parallel bitmap table scan
> > > implementation. We are losing some of that with the patch -- but only
> > > because it is behaving as you would expect given the implementation
> > > and design. I don't consider that a blocker, though.
> >
> > I had a crack at this one a few weeks back, and wanted to share some
> > thoughts.
>
> Fwiw, I've had reasonably good results, in a similar situation, by
> just batching + double-buffering.
>
> Since you're using AIO (as Melanie points out) there should be no real
> penalty to having a single worker request the next batch of blocks,
> when the current block is used up.
>
> Then you can separate "reading" from "prefetching": whoever reads the
> last block in the current batch, prefetches the next batch.
I'm trying to understand. IIUC you're suggesting putting the blocks coming *out* of a worker's ReadStream into a single common shmem queue for any worker to consume (it can't be for block numbers coming *in* to the ReadStreams, or you're describing what we already have in master, with all mechanical sympathy for I/O out the window). Don't you have to release the buffers you just streamed after putting their block numbers in a common shmem queue, hope they don't get evicted, repin them when you pop one out, and also contend on the queue for every block? What if you go to that queue to find a block, find it is empty, but actually another backend still has an I/O in progress that you have no way of knowing about? Suppose you do know about it somehow, so now you have to wait for it to complete the I/O before you can help, but generally speaking it's against the law to wait for parallel workers unless you can prove they'll come back (for example there might be another node in the plan where they are waiting for *you*: deadlock, game over), and even if you could wait safely, you'd be sitting idle instead of making progress. If you don't wait, you just threw parallel fairness out the window. Hence the desire to make all the material needed for progress at end-of-scan eligible for work stealing.

What I'm speculating about is not a true generalised work stealing or scheduling system (take for example Intel TBB[1] and various other parallel computing architectures going back decades), so I'm using the term pretty loosely: we can't generally chop all of our executor work up into tiny tasks and have any thread run any of them at any time to make progress (that'd be a nice goal, but you'd basically have to blow the executor up with dynamite and then reassemble all the atoms into coroutines/continuations/tasks (or various similar-ish concepts) to achieve something like that, probably after you first make all of PostgreSQL multi-threaded unless you want to do it with your shoelaces tied together, and then take a very long holiday). Probably by mentioning work stealing I made it sound more complicated than necessary, but hear me out: a tiny, delimited, node-local kind of work stealing could replace (now non-viable) ramp-down schemes, but workers would only have to fight over the remaining scraps once their own queue can't be refilled.

The two key things I'm interested in here are: (1) per-worker queues preserve block locality in the "first level" WorkQueueSet I was talking about, feeding io_combine_limit-friendly block number streams into the workers' ReadStreams, and that is only disturbed when the data runs out and stealing becomes necessary for the last few blocks in the scan, and (2) at both levels, a known-in-advance finite space can hold the maximum possible unfair allocation to each worker, with certain caveats. The caveats are: this logic only kicks in at end of scan, when you can be guaranteed to "get them all" without ever having to wait for data buffered in the executor state of some other worker (data that is completely unreachable to peers in our executor model), assuming, and this is a hard requirement, the blocks are coming from the same source in all workers (in this case the shared memory bitmap, which you lock while advancing the shared iterator; a subgoal of 0001 was to amortise that locking). It means there's a finite, known number of items that other workers can possibly have in their per-worker queues, and you definitely have enough space for them, so wait-free progress towards the global end-of-scan is guaranteed in every worker after it reaches the point of stealing.
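To make that a bit less abstract, here's the rough shape I have in mind for the first level, in completely made-up code: WorkQueue, WorkQueueSet, WORK_QUEUE_SIZE and the work_queue_* names are all invented on the spot, work_queue_refill() is imaginary (that's where the batched, lock-amortised advance of the shared iterator would go), and work_queue_next_block() is just the thing a worker's ReadStream block number callback would call.

/*
 * Sketch only, nothing like this exists: one small block number queue per
 * worker, in shared memory, so that peers can steal the leftovers once the
 * shared source has run dry.
 */
#define WORK_QUEUE_SIZE 64          /* made-up number */

typedef struct WorkQueue
{
    slock_t     lock;               /* uncontended except at end of scan */
    int         head;
    int         tail;
    BlockNumber blocks[WORK_QUEUE_SIZE];
} WorkQueue;

typedef struct WorkQueueSet
{
    int         nworkers;
    WorkQueue   queues[FLEXIBLE_ARRAY_MEMBER];
} WorkQueueSet;

static bool
work_queue_pop(WorkQueue *q, BlockNumber *blkno)
{
    bool        ok = false;

    SpinLockAcquire(&q->lock);
    if (q->head != q->tail)
    {
        *blkno = q->blocks[q->head];
        q->head = (q->head + 1) % WORK_QUEUE_SIZE;
        ok = true;
    }
    SpinLockRelease(&q->lock);
    return ok;
}

/* End of scan only: take one block from any peer that still has some. */
static bool
work_queue_steal(WorkQueueSet *set, int my_index, BlockNumber *blkno)
{
    for (int i = 0; i < set->nworkers; i++)
    {
        if (i != my_index && work_queue_pop(&set->queues[i], blkno))
            return true;
    }
    return false;
}

/*
 * What this worker's ReadStream block number callback would call.  The
 * refill step is where the shared iterator would be advanced under its
 * lock, a batch at a time, keeping runs of adjacent blocks in one worker
 * so that I/O combining still works.
 */
static BlockNumber
work_queue_next_block(WorkQueueSet *set, int my_index)
{
    BlockNumber blkno;

    for (;;)
    {
        if (work_queue_pop(&set->queues[my_index], &blkno))
            return blkno;
        if (work_queue_refill(set, my_index))   /* imaginary helper */
            continue;
        if (work_queue_steal(set, my_index, &blkno))
            return blkno;
        return InvalidBlockNumber;              /* nothing left anywhere */
    }
}

The point being that a worker's own spinlock is essentially uncontended until the shared source runs dry, and after that no one ever has to wait for a peer to finish an I/O or produce data: the only new interaction is a brief lock on the last few blocks.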
For something a bit like the 0001 patch I posted, that number can't be more than _BATCH_SIZE + io_combine_limit - 1 for the lower WorkQueueSet that feeds block numbers into each worker's ReadStream.

Something like Parallel Seq Scan involves an extra curveball: I don't think you can do it with individual blocks or I/O-sized chunks, because that would be huge and inefficient. PSS works with massive block ranges that could be 1MB or whatever IIRC (good), but I think you could probably represent those ranges as special queue items {blockno, nblocks} and modify them in place until depleted.

For the second-level WorkQueueSet I was handwaving about, well, the ReadStream queue is also of fixed size, so if your own ReadStream can't pull any more blocks out of the lower WorkQueueSet to chew on (including stealing), and it can't steal any buffers from peer ReadStreams connected to the same second-level WorkQueueSet, then the scan truly is finished. You never gave up I/O combining until the end, you never gave up the ability to distribute the final parallel granules of work (here: pages; could be/should be tuples, but let's not get carried away) one at a time to whoever could help, you didn't have to do any new cache line ping-pong until the last few blocks, and no one ever had to wait for a peer.

I admit this all sounds kinda complicated, and maybe there is a much simpler way to achieve the twin goals of maximising I/O combining AND parallel query fairness. I should probably try stuff before speculating much more. My main goal in writing these brain dumps was to highlight what I think the problem to be solved is, as I understand it so far anyway, after a first round of failure at teaching PBHS the new ways. Maybe the particular idea sketched out above sucks or doesn't work for some reason, IDK, but I hope at least the framing might be useful... not too many people have had the privilege and pain of working on both parallel query and the new I/O stuff, and the feature collision is new territory, so I wanted to share some ideas about the problem space and solicit better ones :-)

[1] https://en.wikipedia.org/wiki/Threading_Building_Blocks
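PS: To illustrate the {blockno, nblocks} idea above a bit more concretely, this is the sort of thing I mean, with made-up names again and no claim that it matches how PSS hands out chunks today:

/*
 * Sketch only: a queue item describing a contiguous run of blocks,
 * consumed from the front and shrunk in place until depleted.
 */
typedef struct BlockRangeItem
{
    BlockNumber blockno;    /* next unread block in the range */
    int         nblocks;    /* blocks remaining */
} BlockRangeItem;

/* Owner takes the next block from the front of the range. */
static bool
block_range_next(BlockRangeItem *item, BlockNumber *blkno)
{
    if (item->nblocks == 0)
        return false;       /* depleted: item can be recycled */
    *blkno = item->blockno++;
    item->nblocks--;
    return true;
}

/*
 * A thief could instead take from the far end, so the owner's sequential
 * locality (and thus I/O combining) is preserved:
 *
 *     *blkno = item->blockno + --item->nblocks;
 */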