On Wed, Apr 9, 2025 at 1:46 PM James Hunter <james.hunter...@gmail.com> wrote:
> On Mon, Apr 7, 2025 at 7:34 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> > > Thomas mentioned this to me off-list, and I think he's right. We
> > > likely need to rethink the way parallel bitmap heap scan workers get
> > > block assignments for reading and prefetching to make it more similar
> > > to parallel sequential scan. The workers should probably get
> > > assignments of a range of blocks. On master, each worker does end up
> > > issuing reads/fadvises for a bunch of blocks in a row -- even though
> > > that isn't the intent of the parallel bitmap table scan
> > > implementation. We are losing some of that with the patch -- but only
> > > because it is behaving as you would expect given the implementation
> > > and design. I don't consider that a blocker, though.
> >
> > I had a crack at this one a few weeks back, and wanted to share some
> > thoughts.
>
> Fwiw, I've had reasonably good results, in a similar situation, by
> just batching + double-buffering.
>
> Since you're using AIO (as Melanie points out) there should be no real
> penalty to having a single worker request the next batch of blocks,
> when the current block is used up.
>
> Then you can separate "reading" from "prefetching": whoever reads the
> last block in the current batch, prefetches the next batch.

I'm trying to understand.  IIUC you're suggesting putting the blocks
coming *out* of a worker's ReadStream into a single common shmem queue
for any worker to consume (you can't do this for block numbers coming
*in* to the ReadStreams, or you're describing what we already have in
master, with all mechanical sympathy for I/O out the window).  Don't
you have to release the buffers you just streamed after putting their
block numbers in a common shmem queue, hope they don't get evicted,
repin them when you pop one out, and also contend on the queue for
every block?  What if you go to that queue looking for a block, find
it empty, but another backend still has an I/O in progress that you
have no way of knowing about?  Suppose you do know about it somehow,
so now you have to wait for it to complete the I/O before you can
help; but generally speaking it's against the law to wait for parallel
workers unless you can prove they'll come back (for example there
might be another node in the plan where they are waiting for *you*:
deadlock, game over), and even if you could wait safely, you'd be
sitting idle instead of making progress.  If you don't wait, you've
just thrown parallel fairness out the window.  Hence the desire to
make all the materials needed for progress at end-of-scan eligible for
work stealing.
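
To make the release/repin concern concrete, here's a crude sketch of
the shape I'm picturing.  The shared_blockno_queue type and its
push/pop helpers are made up, just shorthand for some shmem queue; the
buffer manager and read stream calls are the real ones:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/read_stream.h"

    typedef struct shared_blockno_queue shared_blockno_queue;   /* imaginary */

    static void
    stream_then_publish(Relation rel, ReadStream *stream,
                        shared_blockno_queue *queue)
    {
        BlockNumber blkno;

        /* Stream our own blocks, then publish them for any worker. */
        for (;;)
        {
            Buffer      buf = read_stream_next_buffer(stream, NULL);

            if (!BufferIsValid(buf))
                break;

            /*
             * We can't hand a pinned buffer to another backend, so we
             * publish the block number and drop the pin, hoping the page
             * isn't evicted before a consumer gets to it.
             */
            shared_blockno_queue_push(queue, BufferGetBlockNumber(buf));
            ReleaseBuffer(buf);
        }

        /* Any worker pops and re-pins, contending on the queue each time. */
        while (shared_blockno_queue_pop(queue, &blkno))
        {
            Buffer      buf = ReadBuffer(rel, blkno);   /* I/O again if evicted */

            /* ... process the page ... */
            ReleaseBuffer(buf);
        }
    }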

What I'm speculating about is not a true generalised work stealing or
scheduling system (take for example Intel TBB[1] and various other
parallel computing architectures going back decades), so I'm using the
term pretty loosely: we can't generally chop all of our executor work
up into tiny tasks and have any thread run any of them at any time to
make progress.  That'd be a nice goal, but to achieve it you'd
basically have to blow the executor up with dynamite and reassemble
all the atoms into coroutines/continuations/tasks (or various
similar-ish concepts), probably after first making all of PostgreSQL
multi-threaded unless you want to do it with your shoelaces tied
together, and then take a very long holiday.  Probably by mentioning
work stealing I made it sound more complicated than necessary, but
hear me out:

A tiny, delimited, node-local kind of work stealing could replace the
(now non-viable) ramp-down schemes, but workers would only have to
fight over the remaining scraps once their own queue can't be
refilled.  The two key things I'm interested in here are: (1)
per-worker queues preserve block locality in the "first level"
WorkQueueSet I was talking about, feeding io_combine_limit-friendly
block number streams into their ReadStreams, and that is only
disturbed when data runs out and stealing becomes necessary for the
last few blocks in the scan; and (2) at both levels, a
known-in-advance finite space can hold the maximum possible unfair
allocation to each worker, with certain caveats.

Those caveats are: this logic only kicks in at end of scan, when you
can be guaranteed to "get them all" without ever having to wait for
data buffered in the executor state of some other worker (data that is
completely unreachable to peers in our executor model), assuming --
and this is a hard requirement -- that the blocks are coming from the
same source in all workers (in this case the shared memory bitmap,
which you lock while advancing the shared iterator; a subgoal of 0001
was to amortise that locking).  It means there's a finite, known
number of items that other workers can possibly have in their
per-worker queues, which you definitely have enough space for, so
wait-free progress towards the global end-of-scan is guaranteed in
every worker once it reaches the point of stealing.  For something a
bit like the 0001 patch I posted, that can't be more than _BATCH_SIZE
+ io_combine_limit - 1 for the lower WorkQueueSet that feeds block
numbers into each worker's ReadStream.

Something like Parallel Seq Scan involves an extra curveball: I don't
think you can do it with individual blocks or I/O-sized chunks,
because that would be huge and inefficient.  PSS works with massive
block ranges that could be 1MB or whatever IIRC (good), but I think
you could probably represent those ranges as special queue items
{blockno, nblocks} and modify them in place until depleted.

For the second-level WorkQueueSet I was handwaving about, well, the
ReadStream queue is also of fixed size, so if your own ReadStream
can't pull any more blocks out of the lower WorkQueueSet to chew on
(including by stealing), and it can't steal any buffers from peer
ReadStreams connected to the same second-level WorkQueueSet, then the
scan truly is finished.  You never gave up I/O combining until the
end, you never gave up the ability to distribute the final parallel
granules of work (here: pages; could/should be tuples, but let's not
get carried away) one at a time to whoever could help, you didn't have
to do any new cache line ping-pong until the last few blocks, and no
one ever had to wait for a peer.
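
If it helps, here's very roughly the shape I have in mind for the
lower level.  WORK_QUEUE_SIZE, the work_queue_pop_one /
work_queue_refill_from_shared / work_queue_steal_one helpers and the
locking scheme are all placeholders, not a design; the real thing
would sit behind each worker's ReadStream block number callback:

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/spin.h"

    /* A queue item: a run of consecutive blocks, consumed in place. */
    typedef struct BlockRange
    {
        BlockNumber blockno;
        int         nblocks;
    } BlockRange;

    typedef struct WorkQueue
    {
        slock_t     mutex;      /* only contended once stealing starts */
        int         head;
        int         tail;
        BlockRange  items[WORK_QUEUE_SIZE];
    } WorkQueue;

    typedef struct WorkQueueSet
    {
        int         nworkers;
        WorkQueue   queues[FLEXIBLE_ARRAY_MEMBER];
    } WorkQueueSet;

    /*
     * Produce the next block for one worker's ReadStream.  The fast path
     * touches only our own queue, preserving locality for I/O combining;
     * stealing only happens once the shared source is exhausted, near
     * end-of-scan.
     */
    static BlockNumber
    work_queue_next_block(WorkQueueSet *wqset, int my_worker)
    {
        BlockNumber blkno;

        /* 1. Consume (part of) a range from our own queue. */
        if (work_queue_pop_one(&wqset->queues[my_worker], &blkno))
            return blkno;

        /*
         * 2. Refill our own queue in bulk from the shared source (here the
         * shared bitmap iterator; for PSS, the next big chunk of blocks).
         */
        if (work_queue_refill_from_shared(wqset, my_worker))
        {
            work_queue_pop_one(&wqset->queues[my_worker], &blkno);
            return blkno;
        }

        /* 3. End of scan: steal single blocks from peers' queues. */
        for (int i = 0; i < wqset->nworkers; i++)
        {
            if (i != my_worker &&
                work_queue_steal_one(&wqset->queues[i], &blkno))
                return blkno;
        }

        return InvalidBlockNumber;      /* nothing left anywhere */
    }

In the PSS case the items would just start out as the big chunks the
scan already hands out, and a steal would carve a block or two off the
far end of a peer's range rather than taking the whole item.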

I admit this all sounds kinda complicated and maybe there is a much
simpler way to achieve the twin goals of maximising I/O combining AND
parallel query fairness.  I should probably try stuff before
speculating much more.  My main goal in writing these brain dumps was
to highlight what I think the problem to be solved is, at least as
I've understood it so far, after a first round of failure at teaching
PBHS the new ways.  Maybe the particular idea sketched out above sucks
or doesn't work for some reason, IDK, but I hope at least the framing
might be useful...  Not too many people have had the privilege and
pain of working on both parallel query and the new I/O stuff, and the
feature collision is new territory, so I wanted to share some ideas
about the problem space and solicit better ones :-)

[1] https://en.wikipedia.org/wiki/Threading_Building_Blocks

