On Thu, May 08, 2025 at 09:06:18PM -0400, Andres Freund wrote:
> On 2025-05-02 20:05:11 -0700, Noah Misch wrote:
> > On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote:

> We do need to hold interrupts in a few other places, I think - with some debug
> infrastructure (things like calling ProcessBarrierSmgrRelease() whenever
> interrupts could be processed and calling CFI() in errstart() in its return
> false case) it's possible to find state confusions which trigger
> assertions. The issue is that pgaio_io_update_state() contains a
> pgaio_debug_io() and executing pgaio_closing_fd() in places that call
> pgaio_io_update_state() doesn't end well.  There's a similar danger with the
> debug message in pgaio_io_reclaim().
> 
> In the attached patch I added an assertion to pgaio_io_update_state()
> verifying that interrupts are held and added code to hold interrupts in the
> relevant places.

Works for me.

> > For the "no free IOs despite no in-flight IOs" case, I'd replace the
> > ereport(ERROR) with "return;" since we now know interrupt processing 
> > reclaimed
> > an IO.
> 
> Hm - it seems better to me to check if there are now free handles and return
> if that's the case, but to keep the error check in case there actually is no
> free IO? That seems like a not implausible bug...

Works for me.

> > Then decide what protection if any, we need against bugs causing an
> > infinite loop in caller pgaio_io_acquire().  What's the case motivating the
> > unbounded loop in pgaio_io_acquire(), as opposed to capping at two
> > pgaio_io_acquire_nb() calls?  If the theory is that pgaio_io_acquire() could
> > be reentrant, what scenario would reach that reentrancy?
> 
> I do not remember why I wrote this as an endless loop.  If you prefer I could
> change that as part of this patch.

I asked because it would be sad to remove the ereport(ERROR) like I proposed
and then have a bug cause a real infinite loop.  Removing the loop was one way
to prove that can't happen.  As you say, another way would be keeping the
ereport(ERROR) and guarding it with a free-handles check, like in your patch
today.  I don't have a strong preference between those.

> It does seem rather dangerous that errstart() processes interrupts for debug
> messages, but only if the debug message is actually logged. That's really a
> recipe for hard to find bugs.  I wonder if we should, at least in assertion
> mode, process interrupts even if not emitting the message.

Yes, that sounds excellent to have.


Reply via email to