Hi,

On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
> On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > > The problem appears to be in that switch between "when submitted, by the
> > > > IO worker" and "then again by the backend".  It's not concurrent access
> > > > in the sense of two processes writing to the same value; it's that when
> > > > switching from the worker updating ->distilled_result to the issuer
> > > > looking at that, the issuer didn't ensure that no outdated version of
> > > > ->distilled_result could be used.
> > > > 
> > > > Basically, the problem is that the worker would
> > > > 
> > > > 1) set ->distilled_result
> > > > 2) perform a write memory barrier
> > > > 3) set ->state to COMPLETED_SHARED
> > > > 
> > > > and then the issuer of the IO would:
> > > > 
> > > > 4) check ->state is COMPLETED_SHARED
> > > > 5) use ->distilled_result
> > > > 
> > > > The problem is that there currently is no barrier between 4 & 5, which
> > > > means an outdated ->distilled_result could be used.
> > > > 
> > > > 
> > > > This also explains why the issue looked so weird - eventually, after
> > > > fprintfs, after a core dump, etc., the updated ->distilled_result would
> > > > "arrive" in the issuing process and suddenly look correct.
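
To make the pairing concrete, here is a minimal sketch of the ordering the
worker and the issuer have to agree on. It uses the pg_write_barrier() /
pg_read_barrier() primitives from port/atomics.h, but the struct, field types
and state constant are simplified stand-ins for PgAioHandle, not the actual
aio code:

```
/* Illustrative sketch only - simplified stand-ins, not the aio code itself. */
#include "postgres.h"

#include "port/atomics.h"

typedef struct SketchHandle
{
	int			state;				/* stand-in for PgAioHandle->state */
	int			distilled_result;	/* stand-in for ->distilled_result */
} SketchHandle;

#define SKETCH_COMPLETED_SHARED 1

/* worker side: publish the result before announcing completion */
static void
sketch_worker_complete(SketchHandle *ioh, int result)
{
	ioh->distilled_result = result;	/* 1) set ->distilled_result */
	pg_write_barrier();				/* 2) order store 1) before store 3) */
	ioh->state = SKETCH_COMPLETED_SHARED;	/* 3) set ->state */
}

/* issuer side: the read barrier pairs with the worker's write barrier */
static bool
sketch_issuer_consume(SketchHandle *ioh, int *result)
{
	if (ioh->state != SKETCH_COMPLETED_SHARED)	/* 4) check ->state */
		return false;

	pg_read_barrier();				/* without this, 5) may read a stale value */

	*result = ioh->distilled_result;	/* 5) use ->distilled_result */
	return true;
}
```

The write barrier on the worker side alone is not enough; unless the issuer
also separates the load in 4) from the load in 5) with a read barrier, the CPU
is free to satisfy 5) with an older value.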
> > > 
> > > Thank you very much for the explanation.
> > > Everything seems so simple once explained that you can hardly believe you
> > > previously thought such behavior could only be caused by "black magic" or
> > > an "OS bug" :)
> > 
> > Indeed. For a while I was really starting to doubt my sanity - not helped by
> > the fact that I'd lose most of the mental context between reproductions (the
> > problem did not re-occur from Saturday to Wednesday morning...).  What
> > finally made it clearer was moving the failing assertion earlier
> > (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()),
> > as that took out a lot of the time in which the problem could occur.
> 
> > 
> > > Certainly using an outdated result can explain such behavior.
> > > But in which particular place do we lose the read barrier between 4 and 5?
> > > I see `pgaio_io_wait`, which I expect should be called by the backend to
> > > wait for completion of the IO.
> > > And it calls `pgaio_io_was_recycled` to get the state, which in turn
> > > enforces a read barrier:
> > > ```
> > > bool
> > > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> > >                       PgAioHandleState *state)
> > > {
> > >     *state = ioh->state;
> > >     pg_read_barrier();
> > > 
> > >     return ioh->generation != ref_generation;
> > > }
> > > ```
> > 
> > Yes, I don't think that path has the issue.
> > 
> > As far as I can tell the problem only ever happens if we end up reclaiming
> > an IO while waiting for a free IO handle.
> > 
> > The known problematic path is
> > pgaio_io_acquire()
> > -> pgaio_io_wait_for_free()
> > -> pgaio_io_reclaim()
> > 
> > In that path we don't use pgaio_io_was_recycled().  I couldn't find any
> > other path with the same issue [1].
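
Abridged, that spot looks roughly like the following (an illustrative
fragment, not the actual pgaio_io_wait_for_free() code; the state constant is
the one discussed above). The point is that after observing a completed state,
a read barrier is needed before pgaio_io_reclaim() looks at the rest of the
handle:

```
/* illustrative fragment - not a drop-in for pgaio_io_wait_for_free() */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
{
	/*
	 * The state load above is not ordered against the subsequent loads of
	 * ->distilled_result etc. inside pgaio_io_reclaim(); this read barrier
	 * (the one the patch below adds) provides that ordering.
	 */
	pg_read_barrier();
	pgaio_io_reclaim(ioh);
}
```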
> 
> > The issue only happening while waiting for free IO handles is presumably the
> > reason it happened in 027_stream_regress.pl: due to the small shared_buffers
> > used for the test, io_max_concurrency is 1, which means we constantly need
> > to wait for a free IO handle.
> 
> Armed with that knowledge it turns out to not be too hard to create a repro
> that fails a lot quicker than multiple days.
> 
> Starting postgres with
> 
>   -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0
> 
> and creating a test table with:
>   create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;
> 
> makes the following pgbench fail within seconds for me:
>   c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10
> 
> (interestingly at c=8 it takes closer to a minute on average)
> 
> Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
> pgaio_io_wait_for_free() has so far prevented a crash.  I'll leave it running
> for a while longer, to make sure.
> 
> We really lack any meaningful stress testing of IO paths; otherwise I suspect
> we would have found this much earlier.  I wonder if we could add a smaller
> version of the above, with a small fixed runtime (maybe 3-5s), as a tap test.

Attached is a patch that fixes the problem for me. Alexander, Konstantin,
could you verify that it also fixes the problem for you?

Given that it does address the problem for me, I'm inclined to push this
fairly soon; the barrier is pretty obviously required.

Greetings,

Andres Freund
From 7d5141043f34278995f1cf3fc2aa4f16e2adde59 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 13 Jun 2025 15:25:03 -0400
Subject: [PATCH v1] aio: Add missing memory barrier when waiting for IO handle

Previously there was no memory barrier enforcing correct memory ordering when
waiting for a free IO handle. However, in the much more common case of waiting
for IO to complete, memory barriers already were present.

On strongly ordered architectures like x86 this had no negative consequences,
but on some armv8 hardware (observed on Apple hardware), it was possible for
the IO worker's update to PgAioHandle->state to become visible before its
update to ->distilled_result, leading to rather confusing assertion
failures. The failures were rare enough that the bug sometimes took days to
reproduce when running 027_stream_regress in a loop.

Once finally debugged, it was easy enough to come up with a much quicker
repro: Trigger a lot of very fast IO by limiting io_combine_limit to 1 and
ensure that we always have to wait for a free handle by setting
io_max_concurrency to 1. Triggering lots of concurrent seqscans in that setup
triggers the issue within seconds.

One reason this was hard to debug was that the assertion failure most commonly
happened in WaitReadBuffers(), rather than in the AIO subsystem itself. The
assertions added in this commit make problems like this easier to understand.

Also add a comment to the IO worker explaining that we rely on the lwlock
acquisition for correct memory ordering.

I think it'd be good to add a tap test that stress tests buffer IO, but that's
material for a separate patch.

Thanks a lot to Alexander and Konstantin for all the debugging help.

Reported-by: Tom Lane <t...@sss.pgh.pa.us>
Reported-by: Alexander Lakhin <exclus...@gmail.com>
Investigated-by: Andres Freund <and...@anarazel.de>
Investigated-by: Alexander Lakhin <exclus...@gmail.com>
Investigated-by: Konstantin Knizhnik <knizh...@garret.ru>
Discussion: https://postgr.es/m/2dkz7azclpeiqcmouamdixyn5xhlzy4rvikxrbovyzvi6rnv5c@pz7o7osv2ahf
---
 src/backend/storage/aio/aio.c           | 17 +++++++++++++++++
 src/backend/storage/aio/aio_callback.c  |  7 +++++++
 src/backend/storage/aio/method_worker.c |  7 ++++++-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 6c6c0a908e2..3643f27ad6e 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -556,6 +556,13 @@ bool
 pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
 {
 	*state = ioh->state;
+
+	/*
+	 * Ensure that we don't see an earlier state of the handle than ioh->state
+	 * due to compiler or CPU reordering. This protects both ->generation as
+	 * directly used here, and other fields in the handle accessed in the
+	 * caller if the handle was not reused.
+	 */
 	pg_read_barrier();
 
 	return ioh->generation != ref_generation;
@@ -773,7 +780,12 @@ pgaio_io_wait_for_free(void)
 			 * Note that no interrupts are processed between the state check
 			 * and the call to reclaim - that's important as otherwise an
 			 * interrupt could have already reclaimed the handle.
+			 *
+			 * Need to ensure that there's no reordering; in the more common
+			 * paths, where we wait for IO, that's done by
+			 * pgaio_io_was_recycled().
 			 */
+			pg_read_barrier();
 			pgaio_io_reclaim(ioh);
 			reclaimed++;
 		}
@@ -852,7 +864,12 @@ pgaio_io_wait_for_free(void)
 				 * check and the call to reclaim - that's important as
 				 * otherwise an interrupt could have already reclaimed the
 				 * handle.
+				 *
+				 * Need to ensure that there's no reordering; in the more
+				 * common paths, where we wait for IO, that's done by
+				 * pgaio_io_was_recycled().
 				 */
+				pg_read_barrier();
 				pgaio_io_reclaim(ioh);
 				break;
 		}
diff --git a/src/backend/storage/aio/aio_callback.c b/src/backend/storage/aio/aio_callback.c
index 0ad9795bb7e..03c9bba0802 100644
--- a/src/backend/storage/aio/aio_callback.c
+++ b/src/backend/storage/aio/aio_callback.c
@@ -256,6 +256,9 @@ pgaio_io_call_complete_shared(PgAioHandle *ioh)
 					   pgaio_result_status_string(result.status),
 					   result.id, result.error_data, result.result);
 		result = ce->cb->complete_shared(ioh, result, cb_data);
+
+		/* the callback should never transition to unknown */
+		Assert(result.status != PGAIO_RS_UNKNOWN);
 	}
 
 	ioh->distilled_result = result;
@@ -290,6 +293,7 @@ pgaio_io_call_complete_local(PgAioHandle *ioh)
 
 	/* start with distilled result from shared callback */
 	result = ioh->distilled_result;
+	Assert(result.status != PGAIO_RS_UNKNOWN);
 
 	for (int i = ioh->num_callbacks; i > 0; i--)
 	{
@@ -306,6 +310,9 @@ pgaio_io_call_complete_local(PgAioHandle *ioh)
 					   pgaio_result_status_string(result.status),
 					   result.id, result.error_data, result.result);
 		result = ce->cb->complete_local(ioh, result, cb_data);
+
+		/* the callback should never transition to unknown */
+		Assert(result.status != PGAIO_RS_UNKNOWN);
 	}
 
 	/*
diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 743cccc2acd..36be179678d 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -461,7 +461,12 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		int			nwakeups = 0;
 		int			worker;
 
-		/* Try to get a job to do. */
+		/*
+		 * Try to get a job to do.
+		 *
+		 * The lwlock acquisition also provides the necessary memory barrier
+		 * to ensure that we don't see outdated data in the handle.
+		 */
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == UINT32_MAX)
 		{
-- 
2.48.1.76.g4e746b1a31.dirty
