On Thu, May 08, 2025 at 09:06:18PM -0400, Andres Freund wrote: > On 2025-05-02 20:05:11 -0700, Noah Misch wrote: > > On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote:
> We do need to hold interrupts in a few other places, I think - with some debug > infrastructure (things like calling ProcessBarrierSmgrRelease() whenever > interrupts could be processed and calling CFI() in errstart() in its return > false case) it's possible to find state confusions which trigger > assertions. The issue is that pgaio_io_update_state() contains a > pgaio_debug_io() and executing pgaio_closing_fd() in places that call > pgaio_io_update_state() doesn't end well. There's a similar danger with the > debug message in pgaio_io_reclaim(). > > In the attached patch I added an assertion to pgaio_io_update_state() > verifying that interrupts are held and added code to hold interrupts in the > relevant places. Works for me. > > For the "no free IOs despite no in-flight IOs" case, I'd replace the > > ereport(ERROR) with "return;" since we now know interrupt processing > > reclaimed > > an IO. > > Hm - it seems better to me to check if there are now free handles and return > if that's the case, but to keep the error check in case there actually is no > free IO? That seems like a not implausible bug... Works for me. > > Then decide what protection if any, we need against bugs causing an > > infinite loop in caller pgaio_io_acquire(). What's the case motivating the > > unbounded loop in pgaio_io_acquire(), as opposed to capping at two > > pgaio_io_acquire_nb() calls? If the theory is that pgaio_io_acquire() could > > be reentrant, what scenario would reach that reentrancy? > > I do not remember why I wrote this as an endless loop. If you prefer I could > change that as part of this patch. I asked because it would be sad to remove the ereport(ERROR) like I proposed and then have a bug cause a real infinite loop. Removing the loop was one way to prove that can't happen. As you say, another way would be keeping the ereport(ERROR) and guarding it with a free-handles check, like in your patch today. I don't have a strong preference between those. > It does seem rather dangerous that errstart() processes interrupts for debug > messages, but only if the debug message is actually logged. That's really a > recipe for hard to find bugs. I wonder if we should, at least in assertion > mode, process interrupts even if not emitting the message. Yes, that sounds excellent to have.