On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:
> On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> > On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> > Reattaching descriptors and memory in each child may work, or one could
> > just block io_method=io_uring under EXEC_BACKEND.
>
> I think the latter option is saner

Works for me.
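The simplest block I can picture is the io_method GUC check hook refusing the
value in EXEC_BACKEND builds, along these lines.  check_io_method() and
IOMETHOD_IO_URING are stand-in names here, not necessarily what the patch
uses:

	bool
	check_io_method(int *newval, void **extra, GucSource source)
	{
	#ifdef EXEC_BACKEND
		/* the postmaster's io_uring state would not survive into exec'd children */
		if (*newval == IOMETHOD_IO_URING)
		{
			GUC_check_errdetail("io_method \"io_uring\" is not supported with EXEC_BACKEND builds.");
			return false;
		}
	#endif
		return true;
	}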
> > > +pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
> > > +{
> >
> > > +		if (ret == -EINTR)
> > > +		{
> > > +			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
> > > +			continue;
> > > +		}
> >
> > Since io_uring_submit() is a wrapper around io_uring_enter(), this should
> > also retry on EAGAIN.  "man io_uring_enter" has:
> >
> >   EAGAIN The kernel was unable to allocate memory for the request, or
> >   otherwise ran out of resources to handle it. The application should
> >   wait for some completions and try again.
>
> Hm. I'm not sure that makes sense. We only allow a limited number of IOs to
> be in flight for each uring instance. That's different to a use of uring to
> e.g. wait for incoming network data on thousands of sockets, where you
> could have an essentially unbounded number of requests outstanding.
>
> What would we wait for? What if we were holding a critical lock in that
> moment? Would it be safe to just block for some completions? What if
> there's actually no IO in progress?

I'd try the following.  First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED.  This might be
unsafe today, but discovering why it's unsafe likely will inform design
beyond EAGAIN returns.  I don't specifically know of a way it's unsafe.  Do
just one pass of that; there may be newer IOs in progress afterward.  If
submit still gets EAGAIN, sleep a bit and retry.  Like we do in
pgwin32_open_handle(), fail after a fixed number of iterations.  This isn't
great if we hold a critical lock, but it beats the alternative of PANIC on
the first EAGAIN.
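To be concrete about the shape I have in mind, a sketch of the submit loop;
pgaio_drain_all_procs_once() is a made-up name for the one-pass scan above,
and the iteration cap, sleep time, and uring_instance variable are arbitrary
placeholders:

	for (int attempt = 0;; attempt++)
	{
		ret = io_uring_submit(uring_instance);

		if (ret == -EINTR)
		{
			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
			continue;
		}

		if (ret != -EAGAIN)
			break;

		/*
		 * The kernel ran out of resources.  Make one pass advancing other
		 * processes' IOs (AHS_DEFINED and later) toward
		 * AHS_COMPLETED_SHARED, then sleep briefly and retry, giving up
		 * after a fixed number of iterations like pgwin32_open_handle().
		 */
		if (attempt >= 100)				/* arbitrary cap */
			elog(PANIC, "io_uring_submit() kept failing with EAGAIN");

		pgaio_drain_all_procs_once();	/* made-up name for the one-pass scan */
		pg_usleep(10 * 1000L);			/* 10ms, arbitrary */
	}

The EINTR branch is just the quoted code; the EAGAIN branch and the bounded
loop are the addition.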
> > > +FileStartWriteV(struct PgAioHandle *ioh, File file,
> > > +				int iovcnt, off_t offset,
> > > +				uint32 wait_event_info)
> > > +{
> > ...
> >
> > FileStartWriteV() gets to state AHS_PREPARED, so let's align with the
> > state name by calling it FilePrepareWriteV (or FileWriteVPrepare or
> > whatever).
>
> Hm - that doesn't necessarily seem right to me. I don't think the caller
> should assume that the IO will just be prepared and not already completed
> by the time FileStartWriteV() returns - we might actually do the IO
> synchronously.

Yes.  Even if it doesn't become synchronous IO, some other process may
advance the IO to AHS_COMPLETED_SHARED by the next wake-up of the process
that defined the IO.  Still, I think this shouldn't use the term "Start"
while no state name uses that term.  What else could remove that mismatch?

> > Is there any specific decision you'd like to settle before patch 6 exits
> > WIP?
>
> Patch 6 specifically? That I really mainly kept separate for review - it

No.  I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

> doesn't seem particularly interesting to commit it earlier than 7, or do
> you think differently?

No, I agree a lone commit of 6 isn't a win.  Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit.  I've not thought through that
grouping much.

> In case you mean 6+7 or 6 to ~11, I can think of the following:
>
> - I am worried about the need for bounce buffers for writes of checksummed
>   buffers. That quickly ends up being a significant chunk of memory,
>   particularly when using a small shared_buffers with a higher than default
>   number of connections. I'm currently hacking up a prototype that'd
>   prevent us from setting hint bits with just a share lock. I'm planning to
>   start a separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker.  I recommend
dealing with the worry by reducing the limit initially (128 blocks?).  We can
always raise it later.

> - The header split doesn't quite seem right yet

I won't have a strong opinion on that one.  The aio.c/aio_io.c split did
catch my attention.  I made a note to check it again once those files get
header comments.

> - I'd like to implement retries in the later patches, to make sure that it
>   doesn't have design implications

Yes, that's a blocker to me.

> - Worker mode needs to be able to automatically adjust the number of
>   running workers, I think - otherwise it's going to be too hard to tune.

Changing that later wouldn't affect much else, so I'd not consider it a
blocker.  (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct.  I'm not saying worker scaling will push AIO from one side
of that line to the other, but that's why I'm fine with commits that omit
self-contained optimizations.)

> - I think the PgAioHandles need to be slimmed down a bit - there's some
>   design evolution visible that should not end up in the tree.

Okay.

> - I'm not sure that I like the name "subject" for the different things AIO
>   is performed for

How about one of these six terms:

- listener, observer [if you view smgr as an observer of IOs in the sense of
  https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO,
  in the object-oriented sense]

> - I am wondering if the need for pgaio_io_set_io_data_32() (to store the
>   set of buffer ids that are affected by one IO) could be replaced by
>   repurposing BufferDesc->freeNext or something along those lines. I don't
>   like the amount of memory required for storing those arrays, even if it's
>   not that much compared to needing space to store struct iovec[PG_IOV_MAX]
>   for each AIO handle.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

> - I'd like to extend the test module to actually test more cases, it's too
>   hard to reach some paths, particularly without [a lot] of users yet.
>   That's not strictly a dependency of the earlier patches - since the
>   initial patches can't actually do much in the way of IO.

Agreed.  Among the post-patch check-world coverage, which uncovered parts
have the most risk?

> - We shouldn't reserve AioHandles etc for io workers - but because
>   different types of aux processes don't use a predetermined ProcNumber,
>   that's not entirely trivial without adding more complexity.  I've
>   actually wondered whether IO workers should be their own "top-level" kind
>   of process, rather than an aux process. But that seems quite costly.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

Of the ones I'm calling non-blockers, which would you most regret deferring?

> - Right now the io_uring mode has each backend's io_uring instance visible
>   to each other process. That ends up using a fair number of FDs. That's OK
>   from an efficiency perspective, but I think we'd need to add code to
>   adjust the soft RLIMIT_NOFILE (it's set to 1024 on most distros because
>   there are various programs that iterate over all possible FDs, causing
>   significant slowdowns when the soft limit defaults to something high). I
>   earlier had a limited number of io_uring instances, but that added a fair
>   amount of overhead because then submitting IO would require a lock.
>
>   That again doesn't have to be solved as part of the earlier patches but
>   might have some minor design impact.

Agreed on raising the soft limit.  Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.
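The adjustment itself should just be the usual getrlimit()/setrlimit() dance
early in startup, roughly as below.  desired_fd_count stands in for whatever
we'd compute from MaxBackends plus the per-backend io_uring FDs; includes and
real error handling elided:

	struct rlimit rlim;

	if (getrlimit(RLIMIT_NOFILE, &rlim) == 0 &&
		rlim.rlim_cur < desired_fd_count &&		/* placeholder target */
		rlim.rlim_cur < rlim.rlim_max)
	{
		/* raise the soft limit toward the hard limit, but no further */
		rlim.rlim_cur = Min(desired_fd_count, rlim.rlim_max);
		if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
			elog(LOG, "could not raise RLIMIT_NOFILE soft limit: %m");
	}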
How far do you see the design impact spreading on that one?

Thanks,
nm