On 13/06/2025 11:20 pm, Andres Freund wrote:
Hi,
On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
On 12/06/2025 4:57 pm, Andres Freund wrote:
The problem appears to be in that switch between "when submitted, by the IO
worker" and "then again by the backend". It's not concurrent access in the
sense of two processes writing to the same value; it's that when switching
from the worker updating ->distilled_result to the issuer looking at that, the
issuer didn't ensure that no outdated version of ->distilled_result could be
used.
Basically, the problem is that the worker would
1) set ->distilled_result
2) perform a write memory barrier
3) set ->state to COMPLETED_SHARED
and then the issuer of the IO would:
4) check ->state is COMPLETED_SHARED
5) use ->distilled_result
The problem is that there currently is no barrier between 4 & 5, which means
an outdated ->distilled_result could be used.
This also explains why the issue looked so weird - eventually, after fprintfs,
after a core dump, etc, the updated ->distilled_result result would "arrive"
in the issuing process, and suddenly look correct.
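To make the ordering requirement concrete, here is a standalone C11 sketch of steps 1-5 (this is not PostgreSQL code: atomic_thread_fence() stands in for pg_write_barrier()/pg_read_barrier(), the variable names merely mirror the description above, and a compiler with <threads.h> support is assumed):
```c
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

enum { IDLE, COMPLETED_SHARED };

static int distilled_result;        /* plays the role of ioh->distilled_result */
static atomic_int state = IDLE;     /* plays the role of ioh->state */

static int
io_worker(void *arg)
{
    (void) arg;
    distilled_result = 42;                          /* 1) set ->distilled_result */
    atomic_thread_fence(memory_order_release);      /* 2) write memory barrier */
    atomic_store_explicit(&state, COMPLETED_SHARED,
                          memory_order_relaxed);    /* 3) set ->state */
    return 0;
}

static int
issuer(void *arg)
{
    (void) arg;
    /* 4) wait until ->state is COMPLETED_SHARED */
    while (atomic_load_explicit(&state, memory_order_relaxed) != COMPLETED_SHARED)
        ;

    /*
     * Without this fence (the analogue of the missing pg_read_barrier()),
     * the load below may return a stale distilled_result on weakly ordered
     * hardware, even though the state change has already been observed.
     */
    atomic_thread_fence(memory_order_acquire);

    printf("result: %d\n", distilled_result);       /* 5) use ->distilled_result */
    return 0;
}

int
main(void)
{
    thrd_t worker, backend;

    thrd_create(&worker, io_worker, NULL);
    thrd_create(&backend, issuer, NULL);
    thrd_join(worker, NULL);
    thrd_join(backend, NULL);
    return 0;
}
```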
Thank you very much for the explanation.
Everything seems so simple after the explanation that it is hard to believe
that, just before it, you thought such behavior could only be caused by
"black magic" or an "OS bug" :)
Indeed. For a while I was really starting to doubt my sanity - not helped by
the fact that I'd lose most of the mental context between reproductions (the
problem did not re-occur from Saturday to Wednesday morning...). What finally
made it clearer was moving the failing assertion earlier (into
pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
that took out a lot of the time in which the problem could occur.
Certainly, using an outdated result can explain such behavior.
But in which particular place do we lose the read barrier between 4 and 5?
I see `pgaio_io_wait`, which I expect is called by the backend to wait for
completion of the IO.
And it calls `pgaio_io_was_recycled` to get the state, which in turn enforces a
read barrier:
```
bool
pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
                      PgAioHandleState *state)
{
    *state = ioh->state;
    pg_read_barrier();

    return ioh->generation != ref_generation;
}
```
Yes, I don't think that path has the issue.
As far as I can tell the problem only ever happens if we end up reclaiming an
IO while waiting for a free IO handle.
The known problematic path is
pgaio_io_acquire()
-> pgaio_io_wait_for_free()
-> pgaio_io_reclaim()
In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
path with the same issue [1].
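Schematically, the difference between the two paths looks roughly like this (a simplified sketch based on the description above, not the actual PostgreSQL source; the enum spelling and surrounding control flow are approximated):
```c
/*
 * Safe shape, e.g. the pgaio_io_wait() path: the pg_read_barrier() inside
 * pgaio_io_was_recycled() orders the load of ->state before any later reads
 * of the handle, such as ->distilled_result.
 */
if (pgaio_io_was_recycled(ioh, ref_generation, &state))
    return;

/*
 * Problematic shape, the pgaio_io_wait_for_free() path before the fix:
 * ->state is read directly, and pgaio_io_reclaim() then consumes
 * ->distilled_result with no read barrier in between, so a stale value
 * can be used.
 */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
    pgaio_io_reclaim(ioh);
```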
The issue only happening while waiting for free IO handles is presumably the
reason why it happened in 027_stream_regress.pl: due to the small
shared_buffers used for the test, io_max_concurrency is 1, which means we
constantly need to wait for a free IO handle.
Armed with that knowledge it turns out to not be too hard to create a repro
that fails a lot quicker than multiple days.
Starting postgres with
-c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false
-c restart_after_crash=false -c max_parallel_workers_per_gather=0
and creating a test table with:
create table large as select generate_series(1, 100000) a, b from
generate_series(1, 100) b;
makes the following pgbench fail within seconds for me:
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM
large;") -T 1000 -P 10
(interestingly at c=8 it takes closer to a minute on average)
Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running
for a while longer, to make sure.
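The shape of the fix is roughly the following (again only a sketch derived from the description above, not the patch itself):
```c
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
{
    /*
     * Ensure the reads that pgaio_io_reclaim() performs on the handle
     * (e.g. ->distilled_result) cannot be satisfied with values older
     * than the ->state value observed above.
     */
    pg_read_barrier();
    pgaio_io_reclaim(ioh);
}
```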
We really lack any meaningful stress testing of IO paths, or I suspect we
would have found this much earlier. I wonder if we could add a smaller
version of the above, with a small fixed runtime (maybe 3-5s), as a
TAP test.
Attached is a patch that fixes the problem for me. Alexander, Konstantin,
could you verify that it also fixes the problem for you?
I have inspected your patch and, from my point of view, it adds all the
necessary read barriers and so eliminates the race.
I also ran the proposed test for more than 4 hours and the problem was not
reproduced on my system (without the patch it reproduced after 100 seconds).
It's a little bit strange that pgbench got "stuck": I specified a 10000-second
timeout (-T 10000), but it didn't stop and kept working until 15000 seconds,
when I terminated it.
There were no stuck queries: all 16 backends were running queries whose
execution takes about 2 seconds:
progress: 14960.0 s, 7.4 tps, lat 2189.006 ms stddev 133.277, 0 failed
progress: 14970.0 s, 7.3 tps, lat 2113.871 ms stddev 164.008, 0 failed
progress: 14980.0 s, 7.3 tps, lat 2038.736 ms stddev 152.869, 0 failed
Looks like there is yet another race condition now in pgbench:)
Given that it does address the problem for me, I'm inclined to push this
fairly soon; the barrier is pretty obviously required.
Greetings,
Andres Freund