Hello Andres,

24.04.2025 03:40, Andres Freund wrote:
Hi,

On 2025-04-20 18:00:00 +0300, Alexander Lakhin wrote:

2025-04-08 01:41:54.670 UTC [4013251][client backend][0/2:0] DEBUG: waiting for self with 0 pending
I'd like to extend this debug message with the number of in-flight IOs.  I
assume nobody will mind me doing that.
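
For reference, that could be as small a change as the following (just a sketch; I'm assuming the backend's in-flight IOs are kept in a dclist named pgaio_my_backend->in_flight_ios, the actual field name may differ):

/* sketch: also report how many IOs this backend currently has in flight */
pgaio_debug(DEBUG2, "waiting for self with %d pending and %d in-flight",
            pgaio_my_backend->num_staged_ios,
            (int) dclist_count(&pgaio_my_backend->in_flight_ios));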

I printed that number for debugging and always got 1, because
io_max_concurrency == 1 during that test (because of shared_buffers=1MB).
With shared_buffers increased:
--- a/src/test/recovery/t/032_relfilenode_reuse.pl
+++ b/src/test/recovery/t/032_relfilenode_reuse.pl
@@ -18,7 +18,7 @@ log_connections=on
 # to avoid "repairing" corruption
 full_page_writes=off
 log_min_messages=debug2
-shared_buffers=1MB
+shared_buffers=16MB

I got 100 iterations passing.


I could reproduce this error with the following TEMP_CONFIG:
log_min_messages = DEBUG4

log_connections = on
log_disconnections = on
log_statement = 'all'
log_line_prefix = '%m [%d][%p:%l][%b] %q[%a] '

fsync = on
io_method=io_uring
backtrace_functions = 'pgaio_io_wait_for_free'
Hm. Several hundred iterations in without triggering the issue even once, with
that added config. Given your reproducer used fsync, I assume this was on a
real filesystem, not tmpfs? What FS?  I tried on XFS.

It very well could just be a timing-dependent issue; the time for an fsync will
differ heavily between devices...

Yeah, it definitely depends on timing.
Using that TEMP_CONFIG changes the duration of the test significantly for me:
Files=1, Tests=14,  2 wallclock secs ( 0.01 usr  0.01 sys +  0.13 cusr  0.32 csys =  0.47 CPU)
->
Files=1, Tests=14, 14 wallclock secs ( 0.00 usr  0.00 sys +  0.14 cusr  0.33 csys =  0.47 CPU)

I reproduced the failure on ext4 (desktop-grade NVME) and on xfs (cheap
SSD), on two different machines.

It also reproduces on an HDD (with ext4), seemingly even worse, where the test
duration increases even further:
I 25
# +++ tap check in src/test/recovery +++
t/032_relfilenode_reuse.pl .. ok
All tests successful.
Files=1, Tests=14, 28 wallclock secs ( 0.00 usr  0.00 sys +  0.15 cusr  0.39 csys =  0.54 CPU)
Result: PASS
I 26
# +++ tap check in src/test/recovery +++
t/032_relfilenode_reuse.pl .. 7/? Bailout called.  Further testing stopped:  aborting wait: program timed out


When running 032_relfilenode_reuse.pl in a loop, I got failures on
iterations 10, 18, 6:
2025-04-20 15:56:01.310 CEST [][1517068:122][startup] DEBUG: updated min recovery point to 0/4296E90 on timeline 1
2025-04-20 15:56:01.310 CEST [][1517068:123][startup] CONTEXT:  WAL redo at 0/4296E48 for Transaction/COMMIT: 2025-04-20 15:56:01.09363+02; inval msgs: catcache 21; sync
2025-04-20 15:56:01.310 CEST [][1517068:124][startup] DEBUG: waiting for all backends to process ProcSignalBarrier generation 5
2025-04-20 15:56:01.310 CEST [][1517068:125][startup] CONTEXT:  WAL redo at 0/4296E90 for Database/DROP: dir 16409/50001
2025-04-20 15:56:01.312 CEST [postgres][1517075:144][client backend] [032_relfilenode_reuse.pl] DEBUG:  waiting for self with 0 pending
2025-04-20 15:56:01.312 CEST [postgres][1517075:145][client backend] [032_relfilenode_reuse.pl] BACKTRACE:
pgaio_io_wait_for_free at aio.c:703:2
  (inlined by) pgaio_io_acquire at aio.c:186:3
AsyncReadBuffers at bufmgr.c:1854:9
StartReadBuffersImpl at bufmgr.c:1425:18
  (inlined by) StartReadBuffers at bufmgr.c:1497:9
read_stream_start_pending_read at read_stream.c:333:25
read_stream_look_ahead at read_stream.c:475:3
read_stream_next_buffer at read_stream.c:840:6
pg_prewarm at pg_prewarm.c:214:10
ExecInterpExpr at execExprInterp.c:927:7
ExecEvalExprNoReturn at executor.h:447:2
...
2025-04-20 15:56:01.312 CEST [][1517068:126][startup] DEBUG: finished waiting for all backends to process ProcSignalBarrier generation 5
2025-04-20 15:56:01.312 CEST [][1517068:127][startup] CONTEXT:  WAL redo at 0/4296E90 for Database/DROP: dir 16409/50001
2025-04-20 15:56:01.314 CEST [postgres][1517075:146][client backend] [032_relfilenode_reuse.pl] ERROR:  no free IOs despite no in-flight IOs
The earlier report also had a ProcSignalBarrier wait just before. That doesn't
have to be related; the test stresses those pretty intentionally.

But it does seem like it could be related; perhaps we somehow end up processing
interrupts while issuing IOs, which then again submits IOs or something? I
don't immediately see such code, e.g. pgaio_closing_fd() is careful to iterate
over the list while waiting.

I think pgaio_closing_fd() should get a DEBUG2 (or so) message documenting
that we are waiting for an IO due to closing an FD. pgaio_shutdown() as well,
while at it.
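
Something along these lines would already help, I think (a sketch only; the exact placement inside pgaio_closing_fd() and pgaio_shutdown(), and the num_inflight counter, are assumptions on my part):

/* sketch for pgaio_closing_fd(): log before waiting for IOs referencing the fd */
pgaio_debug(DEBUG2, "waiting for in-flight IO before closing fd %d", fd);

/* sketch for pgaio_shutdown(): log before waiting out the remaining in-flight IOs */
pgaio_debug(DEBUG2, "waiting for %d in-flight IO(s) during shutdown", num_inflight);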


I somewhat suspect that this is the same issue that you reported earlier, that
I also couldn't repro.  That involved a lot of CREATE DATABASEs, which will
trigger a lot of procsignal barriers...  I'll try to come up with a more
dedicated stress test and see whether I can trigger a problem that way.

Maybe it does matter that "no free IOs despite no in-flight IOs" follows
"WAL redo at 0/4296E90 for Database/DROP" here (that pairing seems to be constant)...

I've also tried:
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -705,6 +705,7 @@ pgaio_io_wait_for_free(void)
        pgaio_debug(DEBUG2, "waiting for self with %d pending",
                    pgaio_my_backend->num_staged_ios);

+pg_sleep(3000);

and got the test failing on each run with:
t/032_relfilenode_reuse.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

032_relfilenode_reuse_primary.log
2025-04-24 05:29:27.103 CEST [3504475] 032_relfilenode_reuse.pl LOG:  statement: UPDATE large SET datab = 1;
2025-04-24 05:29:27.104 CEST [3504475] 032_relfilenode_reuse.pl DEBUG:  waiting for self with 0 pending
2025-04-24 05:29:27.112 CEST [3504369] DEBUG:  releasing pm child slot 3
2025-04-24 05:29:27.112 CEST [3504369] LOG:  client backend (PID 3504475) was terminated by signal 11: Segmentation fault
2025-04-24 05:29:27.112 CEST [3504369] DETAIL:  Failed process was running: UPDATE large SET datab = 1;
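
As an aside on the injected delay: pg_sleep() is the SQL-callable function (it takes PG_FUNCTION_ARGS), so a delay injected directly in aio.c would more conventionally use pg_usleep(), which takes microseconds; a sketch of that variant:

/* sketch: widen the window after the debug message with an in-backend sleep */
pg_usleep(3000 * 1000L);	/* 3 seconds */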


I also encountered another failure when running this test:
t/032_relfilenode_reuse.pl .. 1/? Bailout called.  Further testing stopped:  aborting wait: program died

Core was generated by `postgres: standby: law postgres [local] SELECT '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000700a5a24527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000700a5a2288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000630ff2c5dcea in errfinish (filename=filename@entry=0x630ff2cdd139 "aio.c", lineno=lineno@entry=559, funcname=funcname@entry=0x630ff2e85da8 <__func__.9> "pgaio_io_wait") at elog.c:599
#6  0x0000630ff2aae79c in pgaio_io_wait (ioh=ioh@entry=0x700a57b59700, ref_generation=3972) at aio.c:559
#7  0x0000630ff2aaefab in pgaio_io_wait_for_free () at aio.c:771
Do you still have the core file for this? Or at least the generated log
message? It'd be useful to know what state the IO was in when the error was
triggered...

I can reproduce it easily, so here is the message:
2025-04-24 06:10:01.924 EEST [postgres][3036378:150][client backend] [032_relfilenode_reuse.pl] PANIC:  waiting for own IO in wrong state: 0
2025-04-24 06:10:01.924 EEST [postgres][3036378:151][client backend] [032_relfilenode_reuse.pl] STATEMENT:  SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

(Sorry for not including it before; I had observed it many times, always with
state 0, so it no longer seemed noteworthy to me.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

