On 17/06/2025 3:22 am, Tom Lane wrote:
Konstantin Knizhnik <knizh...@garret.ru> writes:
On 16/06/2025 6:11 pm, Andres Freund wrote:
I unfortunately can't repro this issue so far.
But unfortunately it means that the problem is not fixed.
FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
using MacPorts' current compiler release (clang version 19.1.7).
The currently-proposed test case fails within a few minutes on
e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.
However, I cannot repro that on a slightly older Mini M1 using Apple's
current release (clang-1700.0.13.5, which per wikipedia is really LLVM
19.1.4). It seems to work fine even without e9a3615a5. So the whole
thing is still depressingly phase-of-the-moon-dependent.
I don't doubt that Konstantin has found a different issue, but
it's hard to be sure about the fix unless we can get it to be
more reproducible. Neither of my machines has ever shown the
symptom he's getting.
regards, tom lane
Unfortunately, I am still able to reproduce the assertion failure (with Andres's
patch, but without my "fixes": uint8 instead of bitfields).
Apple M2 Pro, Ventura 13.7.6, clang 15.0
Postgres build options: --without-icu --enable-debug --enable-cassert CFLAGS=-O0
Postgres config:
io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off
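In case anyone wants to replicate the setup, the GUCs above can be appended to a
cluster's postgresql.conf like this (the data directory path here is a placeholder,
not my actual one; point PGDATA at your own cluster):

```shell
# Append the GUCs used in this repro to a cluster's postgresql.conf.
# PGDATA is a placeholder; replace it with your real data directory.
PGDATA=$(mktemp -d)
cat >> "$PGDATA/postgresql.conf" <<'EOF'
io_max_concurrency = 1
io_combine_limit = 1
synchronize_seqscans = false
restart_after_crash = false
max_parallel_workers_per_gather = 0
fsync = off
EOF
# Sanity check: both io_* settings made it into the file.
grep -c '^io_' "$PGDATA/postgresql.conf"
```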
Scenario proposed by Andres:
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 30000 -P 10
Now it takes me much longer to reproduce the error. An overnight run of 30000-second
iterations passed normally.
I thought it might somehow be caused or triggered by hibernation.
When I last reproduced it, I had not changed the default display setting
"Prevent automatic sleeping on power adapter when the display is off"
(disabled).
That caused the strange timing effect I reported earlier: when the system hibernates,
the alarm is also "hibernated", so the total test time is much longer than specified
with the -T option.
Then I tried to explicitly force hibernation by periodically closing the laptop lid;
that didn't help. But I disabled this option once again and let the test run. Several
times it continued execution normally after wakeup. But finally I got the assertion
failure:
progress: 18290.0 s, 7.3 tps, lat 2229.243 ms stddev 147.211, 0 failed
progress: 18300.0 s, 7.3 tps, lat 2164.216 ms stddev 176.251, 0 failed
progress: 18310.0 s, 7.5 tps, lat 2105.803 ms stddev 202.003, 0 failed
progress: 18320.0 s, 7.7 tps, lat 2162.209 ms stddev 209.344, 0 failed
progress: 18330.0 s, 7.0 tps, lat 2157.891 ms stddev 181.369, 0 failed
progress: 18340.0 s, 7.6 tps, lat 2120.269 ms stddev 169.287, 0 failed
progress: 18350.0 s, 7.3 tps, lat 2178.657 ms stddev 159.984, 0 failed
pgbench: error: client 4 aborted in command 0 (SQL) of script 0; perhaps the
backend died while processing
WARNING: terminating connection because of crash of another server process
Not sure whether it is related to hibernation or not (it looks like it happens at the
moment the computer is about to hibernate; not sure how to verify that).
The backtrace is the usual one:
* frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180
frame #3: 0x00000001034b2080 postgres`ExceptionalCondition(conditionName="ioh->op == PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2
frame #4: 0x00000001031bac04 postgres`pgaio_io_before_start(ioh=0x000000010c9f6e40) at aio_io.c:161:2
frame #5: 0x00000001031baac4 postgres`pgaio_io_start_readv(ioh=0x000000010c9f6e40, fd=10, iovcnt=1, offset=180944896) at aio_io.c:81:2
frame #6: 0x00000001031d985c postgres`FileStartReadV(ioh=0x000000010c9f6e40, file=3, iovcnt=1, offset=180944896, wait_event_info=167772181) at fd.c:2241:2
frame #7: 0x000000010322c894 postgres`mdstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at md.c:1019:8
frame #8: 0x000000010322fe4c postgres`smgrstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at smgr.c:758:2
frame #9: 0x00000001031c2b0c postgres`AsyncReadBuffers(operation=0x000000010f81b910, nblocks_progress=0x000000016d2ddeb4) at bufmgr.c:1959:3
frame #10: 0x00000001031c1ce8 postgres`StartReadBuffersImpl(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0, allow_forwarding=true) at bufmgr.c:1428:18
frame #11: 0x00000001031c182c postgres`StartReadBuffers(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0) at bufmgr.c:1500:9
frame #12: 0x00000001031bee44 postgres`read_stream_start_pending_read(stream=0x000000010f81b378) at read_stream.c:335:14
frame #13: 0x00000001031be528 postgres`read_stream_look_ahead(stream=0x000000010f81b378) at read_stream.c:493:3
frame #14: 0x00000001031be0c0 postgres`read_stream_next_buffer(stream=0x000000010f81b378, per_buffer_data=0x0000000000000000) at read_stream.c:971:2
frame #15: 0x0000000102bb3d34 postgres`heap_fetch_next_buffer(scan=0x000000010f81ae58, dir=ForwardScanDirection) at heapam.c:675:18
frame #16: 0x0000000102ba4a88 postgres`heapgettup_pagemode(scan=0x000000010f81ae58, dir=ForwardScanDirection, nkeys=0, key=0x0000000000000000) at heapam.c:1037:3
frame #17: 0x0000000102ba5058 postgres`heap_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at heapam.c:1391:3
frame #18: 0x0000000102f26448 postgres`table_scan_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at tableam.h:1031:9
frame #19: 0x0000000102f26254 postgres`SeqNext(node=0x000000010f811110) at nodeSeqscan.c:81:6
frame #20: 0x0000000102f2691c postgres`ExecScanFetch(node=0x000000010f811110, epqstate=0x0000000000000000, accessMtd=(postgres`SeqNext at nodeSeqscan.c:52), recheckMtd=(postgres`SeqRecheck at nodeSeqscan.c:91)) at execScan.h:126:9
If you are interested, I have the core file.
Maybe it is a separate problem, not related to the race condition caused by the lack
of a memory barrier. But frankly speaking, that seems unlikely.
I think hibernation just perturbs process scheduling, which triggers this effect.