On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > I've been running some valgrind tests on rpi4/aarch64, and I get a crash > in test_decoding ddl test in ~50% runs. I don't see the same failure > without valgrind or on 32-bit system (hundreds of runs, no crashes), so > I suspect this is a race condition, and with valgrind the timing changes > in a way to make it more likely. > > The crash always happens in the "ddl" test. The backtrace always looks > like this: > > (ExceptionalCondition+0x98)[0x8f6f7c] > (+0x57a7ec)[0x6827ec] > (+0x579edc)[0x681edc] > (ReorderBufferAddNewTupleCids+0x60)[0x686758] > (SnapBuildProcessNewCid+0x94)[0x68b920] > (heap2_decode+0x17c)[0x671584] > (LogicalDecodingProcessRecord+0xbc)[0x670cd0] > (+0x570f88)[0x678f88] > (pg_logical_slot_get_changes+0x1c)[0x6790fc] > (ExecMakeTableFunctionResult+0x29c)[0x4a92c0] > (+0x3be638)[0x4c6638] > (+0x3a2c14)[0x4aac14] > (ExecScan+0x8c)[0x4aaca8] > (+0x3bea14)[0x4c6a14] > (+0x39ea60)[0x4a6a60] > (+0x392378)[0x49a378] > (+0x39520c)[0x49d20c] > (standard_ExecutorRun+0x214)[0x49aad8] > (ExecutorRun+0x64)[0x49a8b8] > (+0x62f53c)[0x73753c] > (PortalRun+0x27c)[0x737198] > (+0x627e78)[0x72fe78] > (PostgresMain+0x9a0)[0x73512c] > (+0x547be8)[0x64fbe8] > (+0x547540)[0x64f540] > (+0x542d30)[0x64ad30] > (PostmasterMain+0x1460)[0x64a574] > (+0x418888)[0x520888] > > I'm unable to get a better backtrace from the valgrind-produces core > usign gdb, for some reason. > > However, I've modified AssertTXNLsnOrder() - which is where the assert > is checked - to also dump toplevel_by_lsn instead of just triggering the > assert, and the result is always like this: > > WARNING: ============================================== > WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0 > WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0 > WARNING: ============================================== > > The LSNs change a bit between the runs, but the failing transactions are > always 848 and 849. Also, both transactions have exactly the same info. > > But the very first WAL record for 849 is > > ASSIGNMENT xtop 848: subxacts: 849 > > so it's strange 849 is in the toplevel_by_lsn list at all, because it > clearly is a subxact of 848. >
There is no guarantee that toplevel_by_lsn won't have subxact. As per my understanding, the problem I reported in the email [1] is the same and we have seen this in BF failures as well. I posted a way to reproduce it in that email. It seems this is possible if the decoding gets XLOG_HEAP2_NEW_CID as the first record (belonging to a subtransaction) after XLOG_RUNNING_XACTS. [1] - https://www.postgresql.org/message-id/CAA4eK1LK1nxOTL32OP%3DejhPoBsUP4Bvwb3Ly%3DfethyJ-KbaXyw%40mail.gmail.com -- With Regards, Amit Kapila.