On Wed, Apr 9, 2025 at 11:00 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> On Wed, Apr 9, 2025 at 1:46 PM James Hunter <james.hunter...@gmail.com> wrote:
> > On Mon, Apr 7, 2025 at 7:34 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > On Thu, Feb 13, 2025 at 1:40 PM Melanie Plageman
> > > <melanieplage...@gmail.com> wrote:
> > > > Thomas mentioned this to me off-list, and I think he's right. We
> > > > likely need to rethink the way parallel bitmap heap scan workers get
> > > > block assignments for reading and prefetching to make it more similar
> > > > to parallel sequential scan. The workers should probably get
> > > > assignments of a range of blocks. On master, each worker does end up
> > > > issuing reads/fadvises for a bunch of blocks in a row -- even though
> > > > that isn't the intent of the parallel bitmap table scan
> > > > implementation. We are losing some of that with the patch -- but only
> > > > because it is behaving as you would expect given the implementation
> > > > and design. I don't consider that a blocker, though.
> > >
> > > I had a crack at this one a few weeks back, and wanted to share some
> > > thoughts.
> >
> > Fwiw, I've had reasonably good results, in a similar situation, by
> > just batching + double-buffering.
> >
> > Since you're using AIO (as Melanie points out) there should be no real
> > penalty to having a single worker request the next batch of blocks,
> > when the current batch is used up.
> >
> > Then you can separate "reading" from "prefetching": whoever reads the
> > last block in the current batch, prefetches the next batch.
>
> I'm trying to understand. IIUC you're suggesting putting the blocks
> coming *out* of a worker's ReadStream into a single common shmem queue
> for any worker to consume (can't do this for block numbers coming *in*
> to the ReadStreams or you're describing what we already have in
> master, all mechanical sympathy for I/O out the window).
I am looking at the pre-streaming code, in PG 17, as I am not familiar
with the PG 18 "streaming" code. Back in PG 17, nodeBitmapHeapscan.c
maintained two shared TBM iterators, for PQ. One of the iterators was
the actual, "fetch" iterator; the other was the "prefetch" iterator,
which kept some distance ahead of the "fetch" iterator (to hide read
latency).

In general, I find PG's "streaming" interface to be an awkward way of
thinking about prefetching. But in this specific case, what are you
trying to solve? Is it contention on the shared "iteration" state? Or
is it finding an efficient way to issue lots of async reads? Or is it
trying to issue async reads for blocks laid out consecutively on disk?
(Is that what you mean by "mechanical sympathy for I/O"?)

> Don't you
> have to release the buffers you just streamed after putting their
> block numbers in a common shmem queue, hope they don't get evicted,
> repin them when you pop one out,

Yes, sure. If they do get evicted, then you were prefetching too far
ahead, right? Unpinning + repinning keeps the buffer pool more "fair,"
at the cost of atomic updates + contention. CPUs are happy to evict
prefetched data; they don't care whether you've had the chance to use
it.

> and also contend on the queue for
> every block?

No, because you can read a batch of blocks out to process-local
memory, at one time. Use the standard pattern: lock shared state, read
a batch of work from shared state to local state, unlock shared state.
Contention is amortized over the batch; by picking a reasonable batch
size, you typically make contention small enough that it doesn't
matter.

> What if you go to that queue to find a block, find it is
> empty, but actually another backend still has an I/O in progress that
> you have no way of knowing about?

Any time you try to read a buffer from the buffer pool, there's a
chance some other backend still has an I/O in progress on it, right?
Existing PG code just has everyone wait for the I/O to complete. This
ends up being what you want to do, anyway...

> Suppose you do know about it
> somehow, so now you have to wait for it to complete the IO before you
> can help, but generally speaking it's against the law to wait for
> parallel workers unless you can prove they'll come back (for example
> there might be another node in the plan where they are waiting for
> *you*, deadlock, game over), and even if you could wait safely.

My confusion here is probably due to my unfamiliarity with how you're
handling async I/Os, but once some backend has issued an async I/O,
don't most async I/O frameworks complete the I/O on their own?
Certainly the O/S is going to memcpy() the block into some destination
address. Why can't the first backend to wait for I/O completion also
be responsible for setting the buffer header bits appropriately?
Basically, let whatever backend first needs the block be responsible
for processing the I/O completion (sketched below). Since waiting for
an I/O completion is a blocking, synchronous operation, how is
deadlock possible?

> If
> you don't wait, you just threw parallel fairness out the window.
> Hence desire to make all relevant materials for progress at
> end-of-scan eligible for work stealing.

I may have misunderstood the context of this discussion, but isn't
this all about *async* I/O? (If it's not async, what does "prefetch"
mean?) With async I/O, there's no reason why the entity that issues
the I/O also has to process its completion. (MySQL, for example, IIRC,
uses a dedicated thread pool to process completions.)
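To make that concrete, here is a minimal sketch of the "whoever first
needs the block completes the I/O" pattern, in portable C11 atomics.
The names (shared_buf, io_state, wait_for_block) are made up, and this
is not PG's actual AIO or buffer-header machinery -- it's just the
shape of the idea:

#include <stdatomic.h>

typedef enum io_state { IO_IN_FLIGHT, IO_COMPLETING, IO_VALID } io_state;

typedef struct shared_buf
{
    _Atomic io_state state;     /* lives in shared memory */
    /* ... buffer header bits, page contents, etc. ... */
} shared_buf;

/*
 * Called by any backend that needs the block.  Whoever gets here
 * first "steals" the completion work; everyone else just waits for
 * the buffer to become valid.
 */
void
wait_for_block(shared_buf *buf)
{
    io_state expected = IO_IN_FLIGHT;

    /* First waiter claims the right to process the completion. */
    if (atomic_compare_exchange_strong(&buf->state, &expected,
                                       IO_COMPLETING))
    {
        /* Reap the completion from the kernel (io_uring, etc.),
         * verify the page, set the buffer-header bits... */
        atomic_store(&buf->state, IO_VALID);
        return;
    }

    /* Someone else is completing it (or already has). */
    while (atomic_load(&buf->state) != IO_VALID)
        ;   /* real code would sleep on a condition variable */
}

The point of the sketch is only that the issuing backend never appears
in it: once the read is in flight, any backend can finish the job, so
no one ever waits *on a worker*, only on the I/O itself.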
You just need *some* backend to issue a batch of async reads, on
behalf of *all* backends. And then whenever a particular backend needs
one of the blocks included in that async batch of reads, it waits for
the async read to complete. And if there's no outstanding async read,
then (I think this can be handled fairly well using existing PG code)
it just issues a new, synchronous read.

I think this is just what "prefetch" means. For example, if you ask
the CPU to prefetch address 0xf00 into L1 cache, the CPU doesn't worry
about whether that address was actually loaded in time, or whether it
gets read by the same or a different thread. You have an in-flight,
async operation; and then whenever someone needs the result, they
block (or not) depending on whether the async operation already
completed.

> What I'm speculating about is not a true generalised work stealing or
> scheduling system (take for example Intel TBB[1] and various other
> parallel computing architectures going back decades), so I'm using the
> term pretty loosely: we can't generally chop all of our executor work
> up into tiny tasks and have any thread run any of them at any time to
> make progress (that'd be a nice goal but you basically have to blow
> the executor up with dynamite and then reassemble all the atoms into
> coroutines/continuations/tasks (or various similar-ish concepts) to
> achieve something like that, probably after you first make all of
> PostgreSQL multi-threaded unless you want to do it with your shoelaces
> tied together, and then take a very long holiday). Probably by
> mentioning work stealing I made it sound more complicated than
> necessary, but hear me out:

I don't have an opinion about chopping up (CPU) work, but my point is
that I/Os go to a different resource -- they have a different queue.
So you don't usually *need* to chop up *I/O* work. Assuming you can
saturate I/O bandwidth / fill the I/O queue using a single thread,
then you don't need to worry about all of the above -- at least not
for I/Os!

And, for CPU, you can typically get away with maintaining a single
global queue, and just copying (sequential) batches of work from the
global queue into thread- or process-local memory.

Why do you want to chop up and reassemble work? Is this CPU work or
I/O work?

> A tiny, delimited, node-local kind of work stealing could replace (now
> non-viable) ramp-down schemes,

I guess, for I/Os, the only work-stealing I think about is to handle
I/O completions. And that part, at least, can be done (as sketched
above) by having backends complete async I/Os inside ReadBuffer().

> I admit this all sounds kinda complicated and maybe there is a much
> simpler way to achieve the twin goals of maximising I/O combining AND
> parallel query fairness.

I tend to think that the two goals are so much in conflict that it's
not worth trying to apply cleverness to get them to agree on things...
What I mean is, if you have to take inputs from multiple parties,
combine them, sort them, and then produce a run-length encoding of the
result, then there's always going to be lots of contention. And you'll
also have to deal with increased latency: should I wait to see if more
blocks come in, to see if it lets me combine I/Os, or should I send
the batch now?

But, if you just do the most naive thing possible (which is my
suggestion!), then whatever backend's turn it is to advance the
prefetch iterator, it advances it by (let's say) 50 blocks. It can
then try to combine those 50 blocks, with no contention or added
complexity. Something like the sketch below.
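(Again, hypothetical names -- shared_iter, claim_prefetch_batch -- and
a pthread mutex standing in for an LWLock over the real shared TBM
iterator state. A sketch of the pattern, not actual executor code.)

#include <pthread.h>
#include <stdint.h>

#define PREFETCH_BATCH 50

typedef uint32_t BlockNumber;

typedef struct shared_iter
{
    pthread_mutex_t lock;       /* an LWLock, in real PG */
    BlockNumber     next_block; /* stand-in for the TBM iterator */
    BlockNumber     last_block;
} shared_iter;

/* Copy up to PREFETCH_BATCH block numbers into local memory. */
static int
claim_prefetch_batch(shared_iter *it, BlockNumber *out)
{
    int n = 0;

    pthread_mutex_lock(&it->lock);
    while (n < PREFETCH_BATCH && it->next_block <= it->last_block)
        out[n++] = it->next_block++;
    pthread_mutex_unlock(&it->lock);

    /*
     * Lock released: the caller can now sort 'out', merge adjacent
     * blocks into larger reads, and issue the async I/Os -- all in
     * local memory, without holding any shared lock.
     */
    return n;
}

Each backend calls claim_prefetch_batch() when its local queue runs
dry, so contention on the shared state is paid once per 50 blocks, and
all the I/O combining happens with no contention at all.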
Might that be sufficient?

I think that, generally, there are only a few techniques that work
consistently for concurrency / async operations. And these techniques
are necessarily primitive; they are blunt tools. It sounds like you
are working on a subtle, complex, refined solution... and I am just
doubtful that any such solution will end up being efficient. It is
hard to create a complex concurrent execution model that's also fast.

For example, simple batching is the solution to a lot of problems. Any
time you process a batch of stuff, you amortize any fixed costs over
the size of the batch. This is a blunt tool, but it always works.

Also, work stealing is great, and one of the nicest ways to steal work
is the "async/await" or "coroutines" model, where someone -- we don't
care who! -- sends an async request; and then, at some point in the
future, someone (else? we don't care!) waits for that async request to
complete. And if there's more work to be done to complete the async
request, then whoever waits first does that work. So you steal work
from the thread/process that needs that work to be done -- very nice.
This is also a blunt tool, but it works.

With PG, specifically, waiting for a read into the buffer pool to
complete seems to go through the same interface as actually performing
that read. So this helps you out for prefetch -- which is always
best-effort; it's not the prefetch operation's job to ensure that the
prefetched data doesn't get evicted -- because you really don't even
need to track async reads after you've issued them! This lets PG
prefetch behave like CPU prefetch.

James