Hi, Michael Paquier (running locally I think), and subsequently Thomas Munro (noticing [1]), privately reported that they noticed an assertion failure in GetSnapshotData(). Both reasonably were wondering if that's related to the snapshot scalability patches.
Michael reported the following assertion failure in 023_pitr_prepared_xact.pl: > TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, > RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901) > The failure was triggered by one of the new TAP tests, > 023_pitr_prepared_xact.pl, after recovering a 2PC transaction that > used a transaction ID that matches with RecentXmin: > (gdb) p RecentXmin > $1 = 588 > (gdb) p TransactionXmin > $2 = 589 I tried for a while to reproduce that, but couldn't. Adding a bunch of debugging output and increasing the log level shows the problem pretty clearly however, just not tripping any asserts: 2021-04-21 17:55:54.287 PDT [1829098] [unknown] LOG: setting xmin to 588 ... 2021-04-21 17:55:54.377 PDT [1829049] DEBUG: removing all KnownAssignedXids 2021-04-21 17:55:54.377 PDT [1829049] DEBUG: release all standby locks ... 2021-04-21 17:55:54.396 PDT [1829100] [unknown] LOG: setting xmin to 589 ... 2021-04-21 17:55:55.379 PDT [1829048] LOG: database system is ready to accept connections ... 2021-04-21 17:55:55.380 PDT [1829120] LOG: setting xmin to 588 ... 2021-04-21 17:55:55.386 PDT [1829126] [unknown] LOG: setting xmin to 588 2021-04-21 17:55:55.387 PDT [1829126] 023_pitr_prepared_xact.pl LOG: statement: COMMIT PREPARED 'fooinsert'; ... 2021-04-21 17:55:55.428 PDT [1829128] [unknown] LOG: setting xmin to 589 So there's clear proof for xmin going from 588 to 589 and back and forth. After looking some more the bug isn't even that subtle. And definitely not new - likely it exists since the introduction of hot standby. The sequence in StartupXLOG() leading to the issue is the following: 1) RecoverPreparedTransactions(); 2) ShutdownRecoveryTransactionEnvironment(); 3) XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE; Because 2) resets the KnownAssignedXids machinery, snapshots that happen between 2) and 3) will not actually look at the procarray to compute snapshots, as that only happens within snapshot->takenDuringRecovery = RecoveryInProgress(); if (!snapshot->takenDuringRecovery) and RecoveryInProgress() checks XLogCtl->SharedRecoveryState != RECOVERY_STATE_DONE, which is set in 3). So snapshots within that window will always be "empty", i.e. xmin is latestCompletedXid and xmax is latestCompletedXid + 1. Once we reach 3), we'll look at the procarray, which then leads xmin going back to 588. I think that this can lead to data corruption, because a too new xmin horizon could lead to rows from a prepared transaction getting hint bitted as dead (or perhaps even pruned, although that's probably harder to hit). Due to the too new xmin horizon we won't treat rows by the prepared xact as in-progress, and TransactionIdDidCommit() will return false, as the transaction didn't commit yet. Which afaict can result in row versions created by the prepared transaction being invisible even after the prepared transaction commits. Without prepared transaction there probably isn't an issue, because there shouldn't be any other in-progress xids at that point? I think to fix the issue we'd have to move ShutdownRecoveryTransactionEnvironment() to after XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE. The acquisition of ProcArrayLock() in ShutdownRecoveryTransactionEnvironment()->ExpireAllKnownAssignedTransactionIds() should prevent the data from being removed between the RecoveryInProgress() and the KnownAssignedXidsGetAndSetXmin() calls in GetSnapshotData(). I haven't yet figured out whether there would be a problem with deferring the other tasks in ShutdownRecoveryTransactionEnvironment() until after RECOVERY_STATE_DONE. I think we ought to introduce assertions that have a higher chance to catch cases like this. The window to hit the new assertion that caused Michael to hit this is pretty darn small (xmin needs to move backwards between two snapshot computations inside a single transaction). I *think* we can safely assert that xmin doesn't move backwards globally, if we store it as a 64bit xid, and don't perform that check in walsender? Greetings, Andres Freund [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-04-20%2003%3A04%3A04