On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <and...@anarazel.de> wrote: > > > Imagine if instead of > > > all the hairy logic we have now we just replaced this whole if > > > (IsInRecovery) stanza with this: > > > > > > if (InRecovery) > > > CreateEndOfRecoveryRecord(); > > > > > > That would be WAY easier to reason about than the rat's nest we have > > > here today. Now, I am not sure what it would take to get there, but I > > > think that is the direction we ought to be heading. > > > > What are we going to do in the single user ([1]) case in this awesome > > future? > > I guess we could just not create a checkpoint until single user mode is shut > > down / creates a checkpoint for other reasons? > > It probably depends on who writes this thus-far-hypothetical patch, > but my thought is that we'd unconditionally request a checkpoint after > writing the end-of-recovery record, same as we do now if (promoted). > If we happened to be in single-user mode, then that checkpoint request > would turn into performing a checkpoint on the spot in the one and > only process we've got, but StartupXLOG() wouldn't really need to care > what happens under the hood after it called RequestCheckpoint().
I decided to try writing a patch to use an end-of-recovery record rather than a checkpoint record in all cases. The patch itself was pretty simple but didn't work. There are two different reasons why it didn't work, which I'll explain in a minute. I'm not sure whether there are any other problems; these are the only two things that cause problems with 'make check-world', but that's hardly a guarantee of anything. Anyway, I thought it would be useful to report these issues first and hopefully get some feedback. The first problem I hit was that GetRunningTransactionData() does Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)). While that does seem like a superficially reasonable thing to assert, StartupXLOG() initializes latestCompletedXid by calling TransactionIdRetreat on the value extracted from checkPoint.nextXid, and the first-ever checkpoint that is written at the beginning of WAL has a nextXid of 3, so when starting up from that checkpoint nextXid is 2, which is not a normal XID. When we try to create the next checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls GetRunningTransactionData() and the assertion fails. In the current code, we avoid this more or less accidentally because LogStandbySnapshot() is skipped when starting from a shutdown checkpoint. If we write an end-of-recovery record and then trigger a checkpoint afterwards, then we don't avoid the problem. Although I'm impishly tempted to suggest that we #define SecondNormalTransactionId 4 and then use that to create the first checkpoint instead of FirstNormalTransactionId -- naturally with no comments explaining why we're doing it -- I think the real problem is that the assertion is just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be InvalidTransactionId or BootstrapTransactionId, but FrozenTransactionId is a legal, if corner-case, value, at least as the code stands today. Unfortunately we can't just relax the assertion, because the XLOG_RUNNING_XACTS record will eventually be handed to ProcArrayApplyRecoveryInfo() for processing ... and that function contains a matching assertion which would in turn fail. It in turn passes the value to MaintainLatestCompletedXidRecovery() which contains yet another matching assertion, so the restriction to normal XIDs here looks pretty deliberate. There are no comments, though, so the reader is left to guess why. I see one problem: MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which expects a normal XID. Perhaps it's best to just dodge the entire issue by skipping LogStandbySnapshot() if latestCompletedXid happens to be 2, but that feels like a hack, because AFAICS the real problem is that StartupXLog() doesn't agree with the rest of the code on whether 2 is a legal case, and maybe we ought to be storing a value that doesn't need to be computed via TransactionIdRetreat(). The second problem I hit was a preexisting bug where, under wal_level=minimal, redo of a "create tablespace" record can blow away data of which we have no other copy. See http://postgr.es/m/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zkihxqzbbfrxxgx...@mail.gmail.com for details. That bug happens to make src/test/recovery's 018_wal_optimize.pl fail one of the tests, because the test starts and stops the server repeatedly, with the result that with the attached patch, we just keep writing end-of-recovery records and never getting time to finish a checkpoint before the next shutdown, so every test replays the CREATE TABLESPACE record and everything that previous tests did. The "wal_level = minimal, SET TABLESPACE commit subtransaction" fails because it's the only one that (1) uses the tablespace for a new table, (2) commits, and (3) runs before a checkpoint is manually forced. It's also worth noting that if we go this way, CHECKPOINT_END_OF_RECOVERY should be ripped out entirely. We'd still be triggering a checkpoint at the end of recovery, but because it could be running concurrently with WAL-generating activity, it wouldn't be an end-of-recovery checkpoint in the sense that we now use that term. In particular, you couldn't assume that no other write transactions are running at the point when this checkpoint is performed. I haven't yet tried ripping that out and doing so might reveal other problems. -- Robert Haas EDB: http://www.enterprisedb.com
v1-0001-Always-use-an-end-of-recovery-record-rather-than-.patch
Description: Binary data