On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 10/01/2024 06:13, Thomas Munro wrote:
> > Bikeshedding call: I am open to better suggestions for the names
> > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
> > grammatically clumsy.
>
> How will these functions work in the brave new async I/O world? I assume
> PrepareReadBuffer() will initiate the async I/O, and
> CompleteReadBuffers() will wait for it to complete. How about
> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
> WaitBufferRead()?
What we have imagined so far is that the asynchronous version would probably have three steps, like this:

* PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
* StartReadBuffers() -> starts the I/O for n contiguous !found buffers
* CompleteReadBuffers() -> waits, completes as necessary

In the proposed synchronous version, the middle step is missing, and streaming_read.c directly calls smgrprefetch() instead. I thought about shoving that inside a prosthetic StartReadBuffers() function, but I backed out of simulating asynchronous I/O too fancifully. The streaming read API is where we really want to stabilise a nice API, so that we can move things around behind it if required.

A bit of analysis of the one block -> nblocks change and the synchronous -> asynchronous change: Two things are new in a world with nblocks > 1. (1) We needed to be able to set BM_IO_IN_PROGRESS on more than one block at a time, but commit 12f3867f already provided that, and (2) someone else might come along and read a block in the middle of our range, effectively chopping our range into subranges. That's true also in master, but when nblocks == 1 it's all-or-nothing, while now we have partial cases. In the proposed synchronous code, CompleteReadBuffers() claims as many contiguous BM_IO_IN_PROGRESS flags as it can in the range, and then loops to process the rest, skipping over any blocks that are already done. Further down in md.c, you might also cross a segment boundary. So there are two different reasons why a single call to CompleteReadBuffers() might finish up generating zero or more than one I/O system call, though very often it's just one.

Hmm, while spelling that out, I noticed an obvious problem and improvement to make to that part of v5.
If backend #1 is trying to read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but backend #2 comes along and starts reading block 102 first, backend #1's StartBufferIO() call would wait for 102's I/O CV while it still holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third backend #3 that wants to read block 101 even though no I/O is in progress for that block yet! At least that's deadlock-free (because always in block order), but it seems like undesirable lock chaining.

Here is my proposed improvement: StartBufferIO() gains a nowait flag. For the head block we wait, but while trying to build a larger range we don't. We'll try 102 again in the next loop, with a wait. Here is a small fixup for that.

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would take place in StartReadBuffers() instead, which would be responsible for kicking off asynchronous I/Os (= asking a background worker to call pread(), or equivalent fancy kernel async APIs). One question is what it does if it finds a block in the middle that chops the read up, or for that matter a segment boundary. I don't think we have to decide now, but the two options seem to be that it starts one single I/O and reports its size, making it the client's problem to call again with the rest, or that it starts more than one I/O and they are somehow chained together so that the caller doesn't know about that and can later wait for all of them to finish using just one <handle thing>.

(The reason why I'm not 100% certain how it will look is that the real, working code in the aio branch right now doesn't actually expose a vector/nblocks bufmgr interface at all, yet. Andres's original prototype had a single-block Start*(), Complete*() design, but a lower level of the AIO system notices if pending read operations are adjacent and could be merged.
While discussing all this we decided it was a bit strange to have lower-level code deal with allocating, chaining and processing lots of separate I/O objects in shared memory, when higher-level code could often work in bigger ranges up front, and then interact with the AIO subsystem with many fewer objects and steps. Also, the present synchronous proposal is simple and lightweight, and lacks the whole subsystem that could do that by magic.)

> About the signature of those functions: Does it make sense for
> CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
> of buffers? If StartReadBuffer() initiates the async I/O immediately, is
> there any benefit to batching the waiting?
>
> If StartReadBuffer() starts the async I/O, the idea that you can call
> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
> StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
> you in RBM_ZERO_* modes.

Yeah, good thoughts, and topics that have occupied me for some time now. I also thought that StartReadBuffer() should take ReadBufferMode, but I came to the conclusion that it probably shouldn't, by the following route: I started working on all this by trying to implement the most complicated case I could imagine, streaming recovery, and then working back to the easy cases that just do scans with RBM_NORMAL. In recovery, we can predict that a block will be zeroed using WAL flags, and there are pre-existing cross-checks at redo time that enforce that the flags and the redo code definitely agree on that, but we can't predict which exact ReadBufferMode the redo code will use: RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and get_cleanup_lock=true, as the comment warns them not to, but I digress).
That's OK, because we can't take locks while looking ahead in recovery anyway (the redo routine carefully controls lock order/protocol), so the code to actually do the locking needs to be somewhere near the output end of the stream, when the redo code calls XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in these interfaces, it begins to look pretty funny: the streaming callback needs to be able to say which ReadBufferMode to use, but anywhere near Prepare*(), Start*() or even Complete*() is too soon, so maybe we need to invent a new value RBM_WILL_ZERO that doesn't yet say which of the zero modes to use, and then the redo routine needs to pass the RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value into XLogReadBufferForRedoExtended(), which does the zeroing and locking in a separate step anyway, so by that point we are ignoring the ReadBufferMode. But that feels just wrong -- we'd be using RBM values but implementing them only partially.

Another way to put it is that ReadBufferMode actually conflates a bunch of different behaviours that apply at different times, which we are now separating, and recovery reveals this most clearly because it doesn't have all the information needed while looking ahead. It might be possible to shove more information into the WAL to fix the information problem, but it seemed more natural to me to separate the aspects of ReadBufferMode, because that isn't the only problem: groupwise processing of locking doesn't even make sense. So I teased a couple of those aspects out into separate flags.

For example, the streaming read interface has two variants: the "simple" implicit-RBM_NORMAL, single relation, single fork version that is used in most client examples and probably everything involving the executor, and the "extended" version (like the earlier versions in this thread, removed for now based on complaints that most early uses don't use it; I'll bring it back separately later with the streaming recovery patches).
In the extended version, the streaming callback can set *will_zero = true, which is about all the information that recovery can figure out from the WAL anyway, and then XLogReadBufferForRedoExtended() will later call ZeroBuffer(), because at that time we have the ReadBufferMode. The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the right time and it makes sense to process it as a batch, so it becomes a flag.

> Putting all that together, I propose:
>
> /*
>  * Initiate reading a block from disk to the buffer cache.
>  *
>  * XXX: Until we have async I/O, this just allocates the buffer in the buffer
>  * cache. The actual I/O happens in WaitReadBuffer().
>  */
> Buffer
> StartReadBuffer(BufferManagerRelation bmr,
>                 ForkNumber forkNum,
>                 BlockNumber blockNum,
>                 BufferAccessStrategy strategy,
>                 ReadBufferMode mode,
>                 bool *foundPtr);
>
> /*
>  * Wait for a read that was started earlier with StartReadBuffer() to finish.
>  *
>  * XXX: Until we have async I/O, this is the function that actually performs
>  * the I/O. If multiple I/Os have been started with StartReadBuffer(), this
>  * will try to perform all of them in one syscall. Subsequent calls to
>  * WaitReadBuffer(), for those other buffers, will finish quickly.
>  */
> void
> WaitReadBuffer(Buffer buf);

I'm confused about where the extra state would live that would allow the communication required to build a larger I/O. In the AIO branch it does look a little more like that, but there is more magic state and machinery hiding behind the curtain: the backend's pending I/O list builds up a chain of I/Os, and when you try to wait, if it hasn't already been submitted to the kernel/bgworkers yet it will be, and before that, merging will happen. So you get bigger I/Os without having to say so explicitly. For this synchronous version (and hopefully soon a more efficient improved version in the AIO branch), we want to take advantage of the client's pre-existing and more cheaply obtained knowledge of ranges.
> I'm not sure how well this fits with the streaming read API. The
> streaming read code performs grouping of adjacent blocks to one
> CompleteReadBuffers() call. If WaitReadBuffer() does the batching,
> that's not really required. But does that make sense with async I/O?
> With async I/O, will you need a vectorized version of StartReadBuffer() too?

I think so, yes.
From d7b09cab9faf93ce288dd8006c2202cc08897b67 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 11 Jan 2024 16:00:57 +1300
Subject: [PATCH] fixup! Provide vectored variant of ReadBuffer().

Avoid wait-chaining for BM_IO_IN_PROGRESS.
---
 src/backend/storage/buffer/bufmgr.c | 38 ++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d462ef7460..af23d3dda4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int  SyncOneBuffer(int buf_id, bool skip_recently_used,
                           WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
                               uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -1160,7 +1160,7 @@ PrepareReadBuffer(BufferManagerRelation bmr,
 }
 
 static inline bool
-CompleteReadBuffersCanStartIO(Buffer buffer)
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
 {
     if (BufferIsLocal(buffer))
     {
@@ -1169,7 +1169,7 @@ CompleteReadBuffersCanStartIO(Buffer buffer)
         return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
     }
     else
-        return StartBufferIO(GetBufferDescriptor(buffer - 1), true);
+        return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
 /*
@@ -1252,8 +1252,13 @@ CompleteReadBuffers(BufferManagerRelation bmr,
         }
 #endif
 
-        /* Skip this block if someone else has already completed it. */
-        if (!CompleteReadBuffersCanStartIO(buffers[i]))
+        /*
+         * Skip this block if someone else has already completed it.  If an
+         * I/O is already in progress in another backend, this will wait for
+         * the outcome: either done, or something went wrong and we will
+         * retry.
+         */
+        if (!CompleteReadBuffersCanStartIO(buffers[i], false))
         {
             /*
              * Report this as a 'hit' for this backend, even though it must
@@ -1276,10 +1281,13 @@ CompleteReadBuffers(BufferManagerRelation bmr,
 
         /*
          * How many neighboring-on-disk blocks can we can scatter-read into
-         * other buffers at the same time?
+         * other buffers at the same time?  In this case we don't wait if we
+         * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+         * for the head block, so we should get on with that I/O as soon as
+         * possible.  We'll come back to this block again, above.
          */
         while ((i + 1) < nblocks &&
-               CompleteReadBuffersCanStartIO(buffers[i + 1]))
+               CompleteReadBuffersCanStartIO(buffers[i + 1], true))
         {
             /* Must be consecutive block numbers. */
             Assert(BufferGetBlockNumber(buffers[i + 1]) ==
@@ -2160,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
                 buf_state &= ~BM_VALID;
                 UnlockBufHdr(existing_hdr, buf_state);
-            } while (!StartBufferIO(existing_hdr, true));
+            } while (!StartBufferIO(existing_hdr, true, false));
         }
         else
         {
@@ -2183,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
             LWLockRelease(partition_lock);
 
             /* XXX: could combine the locked operations in it with the above */
-            StartBufferIO(victim_buf_hdr, true);
+            StartBufferIO(victim_buf_hdr, true, false);
         }
     }
 
@@ -3580,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
      * someone else flushed the buffer before we could, so we need not do
      * anything.
      */
-    if (!StartBufferIO(buf, false))
+    if (!StartBufferIO(buf, false, false))
         return;
 
     /* Setup error traceback support for ereport() */
@@ -5359,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
     uint32      buf_state;
 
@@ -5374,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
         if (!(buf_state & BM_IO_IN_PROGRESS))
             break;
         UnlockBufHdr(buf, buf_state);
+        if (nowait)
+            return false;
         WaitIO(buf);
     }
-- 
2.39.2