Hi,

On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend".  It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that,
> > the issuer didn't ensure that no outdated version of ->distilled_result
> > could be used.
> > 
> > Basically, the problem is that the worker would
> > 
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> > 
> > and then the issuer of the IO would:
> > 
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> > 
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> > 
> > 
> > This also explains why the issue looked so weird - eventually, after
> > fprintfs, after a core dump, etc, the updated ->distilled_result would
> > "arrive" in the issuing process, and suddenly look correct.
> 
> Thank you very much for the explanation.
> Everything seems so simple once explained that it is hard to believe we
> previously thought such behavior could only be caused by "black magic" or
> an "OS bug" :)

Indeed. For a while I was really starting to doubt my sanity - not helped by
the fact that I'd lose most of the mental context between reproductions (the
problem did not re-occur from Saturday to Wednesday morning...).  What finally
made it clearer was moving the failing assertion earlier, into
pgaio_io_call_complete_local() and shared_buffer_readv_complete_local(), as
that took out a lot of the time window in which the problem could occur.


> Certainly using an outdated result can explain such behavior.
> But in which particular place do we lose the read barrier between 4 and 5?
> I see `pgaio_io_wait`, which I expect to be called by the backend to wait
> for completion of the IO.
> It calls `pgaio_io_was_recycled` to get the state, and that in turn enforces
> a read barrier:
> ```
> bool
> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>                       PgAioHandleState *state)
> {
>     *state = ioh->state;
>     pg_read_barrier();
> 
>     return ioh->generation != ref_generation;
> }
> ```

Yes, I don't think that path has the issue.

As far as I can tell the problem only ever happens if we end up reclaiming an
IO while waiting for a free IO handle.

The known problematic path is
pgaio_io_acquire()
-> pgaio_io_wait_for_free()
-> pgaio_io_reclaim()

In that path we don't use pgaio_io_was_recycled().  I couldn't find any other
path with the same issue [1].
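
To make the missing pairing concrete, here's a minimal sketch of the ordering
from steps 1) - 5) above (illustrative only, not the actual code - the wrapper
functions are made up, and the field/state/type names just follow the
descriptions in this thread):

```
/* IO worker side, once the result has been computed: */
static void
worker_publish_result(PgAioHandle *ioh, PgAioResult result)
{
	ioh->distilled_result = result;				/* 1) store the result */
	pg_write_barrier();							/* 2) order result before state */
	ioh->state = PGAIO_HS_COMPLETED_SHARED;		/* 3) publish completion */
}

/* issuer side, e.g. while reclaiming the IO in pgaio_io_wait_for_free(): */
static PgAioResult
issuer_use_result(PgAioHandle *ioh)
{
	PgAioResult result = {0};

	if (ioh->state == PGAIO_HS_COMPLETED_SHARED)	/* 4) observe the state */
	{
		/*
		 * This is the barrier missing in the reclaim path: without it the
		 * read below can return a stale ->distilled_result even though the
		 * state change is already visible.  The pgaio_io_was_recycled()
		 * path quoted above already has the equivalent barrier.
		 */
		pg_read_barrier();
		result = ioh->distilled_result;				/* 5) use the result */
	}

	return result;
}
```

The important part is that 2) and the barrier between 4) and 5) pair up - once
4) observes COMPLETED_SHARED, 5) is guaranteed to see the value stored in 1).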

The issue only happening while waiting for free IO handles is presumably the
reason why it happened in 027_stream_regress.pl: due to the small
shared_buffers used for the test, io_max_concurrency is 1, which means we
constantly need to wait for a free IO handle.
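
Purely for illustration (the values below are hypothetical, not copied from
the test), the effect is roughly what you'd get with a setup like:

```
# With a very small shared_buffers the derived io_max_concurrency ends up
# being 1, so every new IO has to wait for the single in-flight handle to
# be reclaimed, i.e. we go through pgaio_io_wait_for_free() all the time.
shared_buffers = 1MB            # hypothetical small value
io_max_concurrency = -1         # default: derived from shared_buffers
```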

Greetings,

Andres Freund

[1] I think that, theoretically, we are missing a memory barrier in the worker
when starting to process an IO, but the lwlock acquisition for getting the IO
to process provides one. We should probably add a comment about that, though.
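
Roughly what I mean, sketched with made-up lock/helper names (not the actual
identifiers in the worker code):

```
/* In the IO worker, when picking up an IO to process: */
PgAioHandle *ioh;

LWLockAcquire(io_worker_queue_lock, LW_EXCLUSIVE);	/* hypothetical lock */
ioh = worker_dequeue_io();							/* hypothetical helper */
LWLockRelease(io_worker_queue_lock);

/*
 * The lwlock acquisition above implies a memory barrier, so by the time we
 * look at the handle the submitter's writes to it are visible, even though
 * there is no explicit pg_read_barrier() here.  That's worth a code
 * comment, since it's easy to miss.
 */
worker_process_io(ioh);								/* hypothetical */
```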

