Hi, I've been running some valgrind tests on rpi4/aarch64, and I get a crash in test_decoding ddl test in ~50% runs. I don't see the same failure without valgrind or on 32-bit system (hundreds of runs, no crashes), so I suspect this is a race condition, and with valgrind the timing changes in a way to make it more likely.
The crash always happens in the "ddl" test. The backtrace always looks like this: (ExceptionalCondition+0x98)[0x8f6f7c] (+0x57a7ec)[0x6827ec] (+0x579edc)[0x681edc] (ReorderBufferAddNewTupleCids+0x60)[0x686758] (SnapBuildProcessNewCid+0x94)[0x68b920] (heap2_decode+0x17c)[0x671584] (LogicalDecodingProcessRecord+0xbc)[0x670cd0] (+0x570f88)[0x678f88] (pg_logical_slot_get_changes+0x1c)[0x6790fc] (ExecMakeTableFunctionResult+0x29c)[0x4a92c0] (+0x3be638)[0x4c6638] (+0x3a2c14)[0x4aac14] (ExecScan+0x8c)[0x4aaca8] (+0x3bea14)[0x4c6a14] (+0x39ea60)[0x4a6a60] (+0x392378)[0x49a378] (+0x39520c)[0x49d20c] (standard_ExecutorRun+0x214)[0x49aad8] (ExecutorRun+0x64)[0x49a8b8] (+0x62f53c)[0x73753c] (PortalRun+0x27c)[0x737198] (+0x627e78)[0x72fe78] (PostgresMain+0x9a0)[0x73512c] (+0x547be8)[0x64fbe8] (+0x547540)[0x64f540] (+0x542d30)[0x64ad30] (PostmasterMain+0x1460)[0x64a574] (+0x418888)[0x520888] I'm unable to get a better backtrace from the valgrind-produces core usign gdb, for some reason. However, I've modified AssertTXNLsnOrder() - which is where the assert is checked - to also dump toplevel_by_lsn instead of just triggering the assert, and the result is always like this: WARNING: ============================================== WARNING: txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0 WARNING: txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0 WARNING: ============================================== The LSNs change a bit between the runs, but the failing transactions are always 848 and 849. Also, both transactions have exactly the same info. But the very first WAL record for 849 is ASSIGNMENT xtop 848: subxacts: 849 so it's strange 849 is in the toplevel_by_lsn list at all, because it clearly is a subxact of 848. Furthermore, the WAL is almost exactly the same in both cases. Attached are two dumps from a failed and successful run (only the part related to these two xids is included). There are very few differences - there is a PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit. Any ideas? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
wal-crash.log.gz
Description: application/gzip
wal-ok.log.gz
Description: application/gzip