On Sun, Sep 11, 2022 at 6:42 PM Justin Pryzby <pry...@telsasoft.com> wrote:
> I think you're saying is that this can be explained by the
> io_concurrency bug in recovery_prefetch, if run under 15b3.
>
> But yesterday I started from initdb and restored this cluster from backup, and
> started up sqlsmith, and sent some kill -9, and now got more corruption.
> Looks like it took ~10 induced crashes before this happened.

Have you tested fsync on the system?

The symptoms here are all over the place. This assertion failure seems
like a pretty good sign that the problems happen during recovery, or
because basic guarantees needed for crash safety aren't met:

> #2 0x0000000000962c5c in ExceptionalCondition
> (conditionName=conditionName@entry=0x9ce238 "P_ISLEAF(opaque) &&
> !P_ISDELETED(opaque)", errorType=errorType@entry=0x9bad97 "FailedAssertion",
> fileName=fileName@entry=0x9cdcd1 "nbtpage.c",
> lineNumber=lineNumber@entry=1778) at assert.c:69
> #3 0x0000000000507e34 in _bt_rightsib_halfdeadflag
> (rel=rel@entry=0x7f4138a238a8, leafrightsib=leafrightsib@entry=53) at
> nbtpage.c:1778
> #4 0x0000000000507fba in _bt_mark_page_halfdead
> (rel=rel@entry=0x7f4138a238a8, leafbuf=leafbuf@entry=13637,
> stack=stack@entry=0x144ca20) at nbtpage.c:2121

This shows that the basic rules for page deletion have somehow
seemingly been violated. It's as if a page deletion went ahead, but
didn't work as an atomic operation -- there were some lost writes for
some but not all pages. Actually, it looks like a mix of states from
before and after both the first and the second phases of page deletion
-- so not just one atomic operation.

--
Peter Geoghegan
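To make the "lost writes for some but not all pages" scenario concrete,
here is a toy model of the condition the failed assertion checks. It is
only a sketch, not PostgreSQL code: the flag names follow nbtree.h, but
the Page class and the helper functions are hypothetical, and each
phase of page deletion is reduced to the flag and sibling-link changes
that matter here.

```python
import copy
from dataclasses import dataclass
from typing import Optional

# Flag names follow PostgreSQL's nbtree.h; everything else in this
# file is a hypothetical, heavily simplified model.
BTP_LEAF = 1 << 0
BTP_DELETED = 1 << 2
BTP_HALF_DEAD = 1 << 4


@dataclass
class Page:
    flags: int = BTP_LEAF
    left: Optional[int] = None
    right: Optional[int] = None


def build():
    """Three leaf pages linked as 1 <-> 2 <-> 3."""
    return {1: Page(right=2), 2: Page(left=1, right=3), 3: Page(left=2)}


def mark_half_dead(pages, blk):
    """Phase 1 (simplified): flag the leaf page as half-dead."""
    pages[blk].flags |= BTP_HALF_DEAD


def unlink_page(pages, blk):
    """Phase 2 (simplified): unlink from both siblings, mark deleted."""
    page = pages[blk]
    if page.left is not None:
        pages[page.left].right = page.right
    if page.right is not None:
        pages[page.right].left = page.left
    page.flags = (page.flags & ~BTP_HALF_DEAD) | BTP_DELETED


def rightsib_ok(pages, blk):
    """P_ISLEAF(opaque) && !P_ISDELETED(opaque) for blk's right sibling."""
    sib = pages[pages[blk].right]
    return bool(sib.flags & BTP_LEAF) and not (sib.flags & BTP_DELETED)


# Case 1: both phases reach disk for every page involved.
pages = build()
mark_half_dead(pages, 2)
unlink_page(pages, 2)
atomic_ok = rightsib_ok(pages, 1)

# Case 2: same deletion, but the write to page 1 is lost, so page 1
# keeps its old image and still points at the now-deleted page 2.
torn = build()
mark_half_dead(torn, 2)
unlink_page(torn, 2)
torn[1] = copy.copy(build()[1])
torn_ok = rightsib_ok(torn, 1)

print("all writes durable:", atomic_ok)
print("write to page 1 lost:", torn_ok)
```

In the second case the surviving left sibling's right link leads to a
page already marked deleted, which is the kind of state the assertion
in _bt_rightsib_halfdeadflag rejects. The model says nothing about how
the writes were lost on the real system.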