On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote:
> pgaio_io_wait_for_free() does what it says on the tin. For that, after a bunch
> of other things, it finds the oldest in-flight IO and waits for it.
> 
>               PgAioHandle *ioh = dclist_head_element(PgAioHandle, node,
>                                                      &pgaio_my_backend->in_flight_ios);
> 
>               switch (ioh->state)
>               {
> ...
>                       case PGAIO_HS_COMPLETED_IO:
>                       case PGAIO_HS_SUBMITTED:
>                               pgaio_debug_io(DEBUG2, ioh,
>                                              "waiting for free io with %d in flight",
>                                              dclist_count(&pgaio_my_backend->in_flight_ios));
> ...
>                               pgaio_io_wait(ioh, ioh->generation);
>                               break;
> 
> 
> The problem is that, if the log level is low enough, ereport() (which is
> called by pgaio_debug_io()) processes interrupts.  The interrupt processing
> may end up executing ProcessBarrierSmgrRelease(), which in turn needs to wait
> for all in-flight IOs before the IOs are closed.
> 
> Which then leads to the
>                       elog(PANIC, "waiting for own IO in wrong state: %d",
>                                state);
> 
> error.

Printing state 0 (PGAIO_HS_IDLE), right?  I think the chief problem is that
pgaio_io_wait_for_free() is fetching ioh->state, then possibly processing
interrupts in pgaio_debug_io(), then finally fetching ioh->generation.  If it
fetched ioh->generation to a local variable before pgaio_debug_io, I think
that would resolve this one.  Then the pgaio_io_was_recycled() check would
prevent the PANIC:

        if (pgaio_io_was_recycled(ioh, ref_generation, &state))
                return;

        if (am_owner)
        {
                if (state != PGAIO_HS_SUBMITTED
                        && state != PGAIO_HS_COMPLETED_IO
                        && state != PGAIO_HS_COMPLETED_SHARED
                        && state != PGAIO_HS_COMPLETED_LOCAL)
                {
                        elog(PANIC, "waiting for own IO in wrong state: %d",
                                 state);
                }
        }
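
Roughly, the reordering in pgaio_io_wait_for_free() that I have in mind would
look like this (an untested sketch; the elisions match your quote above):

                PgAioHandle *ioh = dclist_head_element(PgAioHandle, node,
                                                       &pgaio_my_backend->in_flight_ios);
                uint64          generation = ioh->generation;   /* before any interrupt processing */

                switch (ioh->state)
                {
...
                        case PGAIO_HS_COMPLETED_IO:
                        case PGAIO_HS_SUBMITTED:
                                /* pgaio_debug_io() may process interrupts and recycle ioh */
                                pgaio_debug_io(DEBUG2, ioh,
                                               "waiting for free io with %d in flight",
                                               dclist_count(&pgaio_my_backend->in_flight_ios));
...
                                /* a stale generation lets pgaio_io_wait() detect the recycle */
                                pgaio_io_wait(ioh, generation);
                                break;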

Is that right?  If that's the solution, pgaio_closing_fd() and
pgaio_shutdown() would need similar care around fetching the generation before
their pgaio_debug_io() calls.  Maybe there's an opportunity for a common inline
function.  Or at least a comment at the "generation" field on how to time a
fetch of it safely and what barrier, if any, is required.
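
For the common helper, I'm picturing something along these lines (the name is
invented here, and I haven't verified the barrier placement against the
existing uses; treat it as a sketch):

        /*
         * Hypothetical helper: snapshot an IO handle's generation before doing
         * anything that might process interrupts (such as emitting a debug
         * message).  Passing the snapshot to pgaio_io_wait() lets
         * pgaio_io_was_recycled() detect a concurrent recycle.
         */
        static inline uint64
        pgaio_io_get_ref_generation(PgAioHandle *ioh)
        {
                uint64          generation = ioh->generation;

                /*
                 * Keep the generation read ordered before any later read of
                 * ioh->state, so we never pair an old state with a newer
                 * generation.
                 */
                pg_read_barrier();

                return generation;
        }

pgaio_io_wait_for_free(), pgaio_closing_fd(), and pgaio_shutdown() could then
call it once, before their pgaio_debug_io() calls, and pass the result to
pgaio_io_wait().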

> A similar set of steps can lead to the "no free IOs despite no in-flight IOs"
> ERROR that Alexander also observed - if pgaio_submit_staged() triggers a debug
> ereport that executes ProcessBarrierSmgrRelease() in an interrupt, we might
> wait for all in-flight IOs during IO submission, triggering the error.

That makes sense.

> I'm not yet sure how to best fix it - locally I have done so by making
> pgaio_debug() do a HOLD_INTERRUPTS()/RESUME_INTERRUPTS() around the call to
> ereport.  But
> that doesn't really seem great - otoh requiring various pieces of code to know
> that anything emitting debug messages needs to hold interrupts etc makes for
> rare and hard to understand bugs.
> 
> We could just make the relevant functions hold interrupts, and that might be
> the best path forward, but we don't really need to hold all interrupts
> (e.g. termination would be fine), so it's a bit coarse grained.  It would need
> to happen in a few places, which isn't great either.
> 
> Other suggestions?

For the "no free IOs despite no in-flight IOs" case, I'd replace the
ereport(ERROR) with "return;" since we now know interrupt processing reclaimed
an IO.  Then decide what protection, if any, we need against bugs causing an
infinite loop in caller pgaio_io_acquire().  What's the case motivating the
unbounded loop in pgaio_io_acquire(), as opposed to capping at two
pgaio_io_acquire_nb() calls?  If the theory is that pgaio_io_acquire() could
be reentrant, what scenario would reach that reentrancy?
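
For concreteness, the capped version I have in mind would look roughly like
this (a sketch only; I lifted the signature from my reading of aio.c, and the
message in the final elog() is invented):

        PgAioHandle *
        pgaio_io_acquire(struct ResourceOwnerData *resowner, PgAioReturn *ret)
        {
                PgAioHandle *ioh;

                /* fast path: a handle may already be free */
                ioh = pgaio_io_acquire_nb(resowner, ret);
                if (ioh != NULL)
                        return ioh;

                /*
                 * Wait for an in-flight IO to complete (or for interrupt
                 * processing to reclaim one), then retry exactly once instead
                 * of looping indefinitely.
                 */
                pgaio_io_wait_for_free();

                ioh = pgaio_io_acquire_nb(resowner, ret);
                if (ioh == NULL)
                        elog(ERROR, "could not acquire a free IO handle after waiting");

                return ioh;
        }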

> Thanks again for finding and reporting this Alexander!

+1!

