On 11/01/2024 05:19, Thomas Munro wrote:
On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
On 10/01/2024 06:13, Thomas Munro wrote:
Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

How will these functions work in the brave new async I/O world? I assume
PrepareReadBuffer() will initiate the async I/O, and
CompleteReadBuffers() will wait for it to complete. How about
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

  * PrepareReadBuffer() -> pins one buffer, reports if found or I/O/zeroing needed
  * StartReadBuffers() -> starts the I/O for n contiguous !found buffers
  * CompleteReadBuffers() -> waits, completes as necessary

Ok. It feels surprising to have three steps. I understand that you need two steps, one to start the I/O and another to wait for it to finish, but why do you need separate Prepare and Start steps? What can you do in between them? (You explained that; I'm just saying that this was my initial reaction on seeing the API. It is surprising.)

If StartReadBuffer() starts the async I/O, the idea that you can call
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
you in RBM_ZERO_* modes.

Yeah, good thoughts, and topics that have occupied me for some time
now.  I also thought that StartReadBuffer() should take
ReadBufferMode, but I came to the idea that it probably shouldn't like
this:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL.  In
recovery, we can predict that a block will be zeroed using WAL flags,
and pre-existing cross-checks at redo time that enforce that the flags
and redo code definitely agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended().  But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode, but anywhere
near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
the zero modes to use, and then the redo routine needs to pass in the
RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended() and it does it in a separate step
anyway, so we are ignoring ReadBufferMode.  But that feels just wrong
-- we'd be using RBM but implementing them only partially.

I see. When you're about to zero the page, there's not much point in splitting the operation into Prepare/Start/Complete stages anyway. You're not actually doing any I/O. Perhaps it's best to have a separate "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)" function that does the same as ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time and makes sense to process as a batch, so it becomes a
flag.

+1

Putting all that together, I propose:

/*
 * Initiate reading a block from disk to the buffer cache.
 *
 * XXX: Until we have async I/O, this just allocates the buffer in the
 * buffer cache. The actual I/O happens in WaitReadBuffer().
 */
Buffer
StartReadBuffer(BufferManagerRelation bmr,
                ForkNumber forkNum,
                BlockNumber blockNum,
                BufferAccessStrategy strategy,
                ReadBufferMode mode,
                bool *foundPtr);

/*
 * Wait for a read that was started earlier with StartReadBuffer() to
 * finish.
 *
 * XXX: Until we have async I/O, this is the function that actually
 * performs the I/O. If multiple I/Os have been started with
 * StartReadBuffer(), this will try to perform all of them in one
 * syscall. Subsequent calls to WaitReadBuffer(), for those other
 * buffers, will finish quickly.
 */
void
WaitReadBuffer(Buffer buf);

I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O.  In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, and when you try to wait, if it hasn't
already been submitted to the kernel/bgworkers yet it will be, and
before that merging will happen.  So you get bigger I/Os without
having to say so explicitly.

Yeah, I was thinking that there would be a global variable that holds a list of operations started with StartReadBuffer().

For this synchronous version (and hopefully soon a more efficient
improved version in the AIO branch), we want to take advantage of the
client's pre-existing and more cheaply obtained knowledge of ranges.

Ok.

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs).  One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary.  I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one <handle
thing>.

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet.  Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged.  While discussing all this we decided it
was a bit strange to have lower code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present simple and lightweight synchronous proposal lacks the
whole subsystem that could do that by magic.)

Hmm, let's sketch out what this would look like with the approach that you always start one I/O and report the size:

/*
 * Initiate reading a range of blocks from disk to the buffer cache.
 *
 * If the pages were already found in the buffer cache, returns true.
 * Otherwise false, and the caller must call WaitReadBufferRange() to
 * wait for the I/O to finish, before accessing the buffers.
 *
 * 'buffers' is a caller-supplied array large enough to hold (endBlk -
 * startBlk) buffers. It is filled with the buffers that the pages are
 * read into.
 *
 * This always starts a read of at least one block, and tries to
 * initiate one read I/O for the whole range if possible. But if the
 * read cannot be performed as a single I/O, a partial read is started,
 * and *endBlk is updated to reflect the range for which the read was
 * started. The caller can make another call to read the rest of the
 * range. A partial read can occur if some, but not all, of the pages
 * are already in the buffer cache, or because the range crosses a
 * segment boundary.
 *
 * XXX: Until we have async I/O, this just allocates the buffers in the
 * buffer cache. And perhaps calls smgrprefetch(). The actual I/O
 * happens in WaitReadBufferRange().
 */
bool
StartReadBufferRange(BufferManagerRelation bmr,
                     ForkNumber forkNum,
                     BlockNumber startBlk,
                     BlockNumber *endBlk,
                     BufferAccessStrategy strategy,
                     Buffer *buffers);

/*
 * Wait for a read that was started earlier with StartReadBufferRange()
 * to finish.
 *
 * XXX: Until we have async I/O, this is the function that actually
 * performs the I/O. StartReadBufferRange() already checked that the
 * pages can be read in one preadv() syscall. However, it's possible
 * that another backend performed the read for some of the pages in
 * between. In that case this will perform multiple syscalls, after all.
 */
void
WaitReadBufferRange(Buffer *buffers, int nbuffers, bool zero_on_error);

/*
 * Allocates a buffer for the given page in the buffer cache, and locks
 * the page. No I/O is initiated. The caller must initialize it and mark
 * the buffer dirty before releasing the lock.
 *
 * This is equivalent to ReadBuffer(RBM_ZERO_AND_LOCK) or
 * ReadBuffer(RBM_ZERO_AND_CLEANUP_LOCK).
 */
Buffer
ZeroBuffer(BufferManagerRelation bmr,
           ForkNumber forkNum,
           BlockNumber blockNum,
           BufferAccessStrategy strategy,
           bool cleanup_lock);

This range-oriented API fits the callers pretty well: the streaming read API works with block ranges already. There is no need for separate Prepare and Start steps.

One weakness is that if StartReadBufferRange() finds that the range is "chopped up", it needs to return and throw away the work it had to do to look up the next buffer. So in the extreme case that every other block in the range is in the buffer cache, each call would look up two buffers in the buffer cache, startBlk and startBlk + 1, but only return one buffer to the caller.

--
Heikki Linnakangas
Neon (https://neon.tech)
