Hi,

On 2025-04-15 21:00:00 +0300, Alexander Lakhin wrote:
> Please take a look also at the simple reproducer for the crash inside
> pg_get_aios() I mentioned upthread:
> for i in {1..100}; do
> numjobs=12
> echo "iteration $i"
> date
> for ((j=1;j<=numjobs;j++)); do
> ( createdb db$j; for k in {1..300}; do
> echo "CREATE TABLE t (a INT); CREATE INDEX ON t (a); VACUUM t;
> SELECT COUNT(*) >= 0 AS ok FROM pg_aios; " \
> | psql -d db$j >/dev/null 2>&1;
> done; dropdb db$j; ) &
> done
> wait
> psql -c 'SELECT 1' || break;
> done
>
> it fails for me as follows:
> iteration 20
> Tue Apr 15 07:21:29 PM EEST 2025
> dropdb: error: connection to server on socket "/tmp/.s.PGSQL.55432" failed:
> No such file or directory
> Is the server running locally and accepting connections on that socket?
> ...
> 2025-04-15 19:21:30.675 EEST [3111699] LOG: client backend (PID 3320979) was
> terminated by signal 11: Segmentation fault
> 2025-04-15 19:21:30.675 EEST [3111699] DETAIL: Failed process was running:
> SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
> 2025-04-15 19:21:30.675 EEST [3111699] LOG: terminating any other active
> server processes

Thanks for that.

The bug turns out to be pretty stupid - pgaio_io_reclaim() resets the fields in
PgAioHandle *before* updating the generation/state. That opens up a window in
which pg_get_aios() thinks the copied PgAioHandle is valid, even though the
copy was taken while the fields were being reset.

Once I had figured that out, it was easy to make it more reproducible - put a
pg_usleep() between the fields being reset in pgaio_io_reclaim() and the
generation increase / state update.

The fix is simple: increment the generation and update the state before
resetting the fields.

Will push the fix for that soon.

Greetings,

Andres Freund
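
For readers following along, the ordering problem described above can be illustrated with a small single-threaded sketch. This is not the PostgreSQL code and not the actual patch: `Handle`, `reclaim_buggy()`, `reclaim_fixed()` and `reader_hook()` are hypothetical stand-ins, the payload fields are made up, and the reader's "copy, then recheck the generation and state" logic is an assumption about how a reader like pg_get_aios() might validate its copy. The hook call plays the role of the pg_usleep() window; real concurrent code would additionally need memory barriers, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins; NOT the real PgAioHandle layout. */
typedef enum { HS_IDLE, HS_IN_USE } HandleState;

typedef struct {
    uint64_t    generation;
    HandleState state;
    int         op;      /* hypothetical payload field */
    int         target;  /* hypothetical payload field */
} Handle;

typedef struct {
    bool   accepted;     /* did the reader consider its copy valid? */
    Handle copy;
} Snapshot;

typedef void (*WindowHook)(Handle *live, void *arg);
typedef void (*ReclaimFn)(Handle *h, WindowHook hook, void *arg);

static void reset_fields(Handle *h)
{
    h->op = 0;
    h->target = 0;
}

/* Buggy order: fields are cleared while generation/state still say "valid". */
static void reclaim_buggy(Handle *h, WindowHook hook, void *arg)
{
    reset_fields(h);
    hook(h, arg);        /* a reader runs inside the window */
    h->generation++;
    h->state = HS_IDLE;
}

/* Fixed order: invalidate first, then clear the fields. */
static void reclaim_fixed(Handle *h, WindowHook hook, void *arg)
{
    h->generation++;
    h->state = HS_IDLE;
    hook(h, arg);        /* same window, but the handle is already invalid */
    reset_fields(h);
}

/* Assumed reader logic: copy the handle, then recheck generation and state. */
static void reader_hook(Handle *live, void *arg)
{
    Snapshot *snap = arg;
    uint64_t  start_generation = live->generation;

    memcpy(&snap->copy, live, sizeof(Handle));
    snap->accepted = live->generation == start_generation &&
                     snap->copy.state != HS_IDLE;
}

/* True if the reader accepted a copy whose fields were already reset. */
static bool reader_accepts_reset_handle(ReclaimFn reclaim)
{
    Handle   h = { .generation = 1, .state = HS_IN_USE, .op = 7, .target = 42 };
    Snapshot snap = { 0 };

    reclaim(&h, reader_hook, &snap);
    return snap.accepted && snap.copy.op == 0;
}

void demo(void)
{
    assert(reader_accepts_reset_handle(reclaim_buggy));
    assert(!reader_accepts_reset_handle(reclaim_fixed));
    puts("ok");
}
```

Calling `demo()` from a `main()` exercises both orderings: with the buggy order the reader accepts a copy whose fields are already cleared, while with the fixed order the same reader sees an idle handle and skips it.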