Hi,

On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > The problem appears to be in that switch between "when submitted, by the IO
> > > worker" and "then again by the backend".  It's not concurrent access in the
> > > sense of two processes writing to the same value, it's that when switching
> > > from the worker updating ->distilled_result to the issuer looking at that,
> > > the issuer didn't ensure that no outdated version of ->distilled_result
> > > could be used.
> > > 
> > > Basically, the problem is that the worker would
> > > 
> > > 1) set ->distilled_result
> > > 2) perform a write memory barrier
> > > 3) set ->state to COMPLETED_SHARED
> > > 
> > > and then the issuer of the IO would:
> > > 
> > > 4) check ->state is COMPLETED_SHARED
> > > 5) use ->distilled_result
> > > 
> > > The problem is that there currently is no barrier between 4 & 5, which
> > > means an outdated ->distilled_result could be used.
> > > 
> > > 
> > > This also explains why the issue looked so weird - eventually, after
> > > fprintfs, after a core dump, etc, the updated ->distilled_result would
> > > "arrive" in the issuing process, and suddenly look correct.
> > 
> > Thank you very much for the explanation.
> > Everything seems so simple once explained that you can hardly believe you
> > previously thought such behavior could only be caused by "black magic" or
> > an "OS bug" :)
> 
> Indeed. For a while I was really starting to doubt my sanity - not helped by
> the fact that I'd lose most of the mental context between reproductions (the
> problem did not re-occur from Saturday to Wednesday morning...).  What finally
> made it clearer was moving the failing assertion to earlier
> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
> that took out a lot of the time in which the problem could occur.

> 
> > Certainly using an outdated result can explain such behavior.
> > But in which particular place do we lose the read barrier between 4 and 5?
> > I see `pgaio_io_wait`, which I expect to be called by the backend to wait
> > for completion of the IO.
> > And it calls `pgaio_io_was_recycled` to get the state, which in turn
> > enforces a read barrier:
> > ```
> > 
> > bool
> > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> >                       PgAioHandleState *state)
> > {
> >     *state = ioh->state;
> >     pg_read_barrier();
> > 
> >     return ioh->generation != ref_generation;
> > }
> > ```
> 
> Yes, I don't think that path has the issue.
> 
> As far as I can tell the problem only ever happens if we end up reclaiming an
> IO while waiting for a free IO handle.
> 
> The known problematic path is
> pgaio_io_acquire()
> -> pgaio_io_wait_for_free()
> -> pgaio_io_reclaim()
> 
> In that path we don't use pgaio_io_was_recycled().  I couldn't find any other
> path with the same issue [1].

> The issue only happening while waiting for free IO handles is presumably the
> reason why it happened in 027_stream_regress.pl: due to the small
> shared_buffers used for the test, io_max_concurrency is 1, which means we
> constantly need to wait for a free IO handle.
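
To make the required pairing explicit, here is a minimal illustrative sketch,
using a stand-in struct rather than the real PgAioHandle (only
pg_write_barrier()/pg_read_barrier() are the actual primitives involved; the
struct and function names are made up for the example):

```
#include "postgres.h"
#include "port/atomics.h"

typedef struct DemoHandle
{
    int     result;         /* stands in for ->distilled_result */
    int     state;          /* stands in for ->state */
} DemoHandle;

/* completer (IO worker) side, steps 1-3 */
static void
publish_result(DemoHandle *h, int result)
{
    h->result = result;     /* 1) set ->distilled_result */
    pg_write_barrier();     /* 2) order the result before the state change */
    h->state = 1;           /* 3) set ->state to COMPLETED_SHARED */
}

/* issuer side, steps 4-5 */
static bool
consume_result(DemoHandle *h, int *result)
{
    if (h->state != 1)      /* 4) check ->state is COMPLETED_SHARED */
        return false;
    pg_read_barrier();      /* the barrier that was missing between 4 and 5 */
    *result = h->result;    /* 5) use ->distilled_result */
    return true;
}
```

Without the pg_read_barrier() in consume_result(), the load of the result can
be satisfied before (or from an older cache line than) the load of the state
on weakly ordered hardware, which is exactly the stale ->distilled_result
described above.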

Armed with that knowledge it turns out to not be too hard to create a repro
that fails a lot quicker than multiple days.

Starting postgres with

  -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false
  -c restart_after_crash=false -c max_parallel_workers_per_gather=0

and creating a test table with:
  create table large as select generate_series(1, 100000) a, b from
    generate_series(1, 100) b;

makes the following pgbench fail within seconds for me:
  c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10

(interestingly at c=8 it takes closer to a minute on average)

Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
pgaio_io_wait_for_free() has so far prevented a crash.  I'll leave it running
for a while longer, to make sure.
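
For reference, the change amounts to roughly the following at each of the two
call sites (heavily simplified sketch; the surrounding loop in
pgaio_io_wait_for_free() is elided, and I'm assuming the state constant is
spelled PGAIO_HS_COMPLETED_SHARED):

```
/* sketch of one of the two call sites in pgaio_io_wait_for_free() */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
{
    /*
     * Pairs with the write barrier the completer issues between filling in
     * ->distilled_result and advancing ->state; without this we can see the
     * new state but a stale result.
     */
    pg_read_barrier();
    pgaio_io_reclaim(ioh);
}
```

Another option might be to go through pgaio_io_was_recycled() in that path as
well, since it already issues the read barrier after reading ->state.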

We really lack any meaningful stress testing of IO paths, otherwise I suspect
we would have found this much earlier.  I wonder if we could add a smaller
version of the above, with a small fixed runtime (maybe 3-5s), as a tap test.

Greetings,

Andres Freund

