using an end-of-recovery record in all cases

Robert Haas Mon, 09 Aug 2021 12:01:10 -0700

On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <and...@anarazel.de> wrote:
> > > Imagine if instead of
> > > all the hairy logic we have now we just replaced this whole if
> > > (IsInRecovery) stanza with this:
> > >
> > > if (InRecovery)
> > >     CreateEndOfRecoveryRecord();
> > >
> > > That would be WAY easier to reason about than the rat's nest we have
> > > here today. Now, I am not sure what it would take to get there, but I
> > > think that is the direction we ought to be heading.
> >
> > What are we going to do in the single user ([1]) case in this awesome 
> > future?
> > I guess we could just not create a checkpoint until single user mode is shut
> > down / creates a checkpoint for other reasons?
>
> It probably depends on who writes this thus-far-hypothetical patch,
> but my thought is that we'd unconditionally request a checkpoint after
> writing the end-of-recovery record, same as we do now if (promoted).
> If we happened to be in single-user mode, then that checkpoint request
> would turn into performing a checkpoint on the spot in the one and
> only process we've got, but StartupXLOG() wouldn't really need to care
> what happens under the hood after it called RequestCheckpoint().


I decided to try writing a patch to use an end-of-recovery record
rather than a checkpoint record in all cases. The patch itself was
pretty simple but didn't work. There are two different reasons why it
didn't work, which I'll explain in a minute. I'm not sure whether
there are any other problems; these are the only two things that cause
problems with 'make check-world', but that's hardly a guarantee of
anything. Anyway, I thought it would be useful to report these issues
first and hopefully get some feedback.

The first problem I hit was that GetRunningTransactionData() does
Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
While that does seem like a superficially reasonable thing to assert,
StartupXLOG() initializes latestCompletedXid by calling
TransactionIdRetreat on the value extracted from checkPoint.nextXid,
and the first-ever checkpoint that is written at the beginning of WAL
has a nextXid of 3, so when starting up from that checkpoint nextXid
is 2, which is not a normal XID. When we try to create the next
checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls
GetRunningTransactionData() and the assertion fails. In the current
code, we avoid this more or less accidentally because
LogStandbySnapshot() is skipped when starting from a shutdown
checkpoint. If we write an end-of-recovery record and then trigger a
checkpoint afterwards, then we don't avoid the problem. Although I'm
impishly tempted to suggest that we #define SecondNormalTransactionId
4 and then use that to create the first checkpoint instead of
FirstNormalTransactionId -- naturally with no comments explaining why
we're doing it -- I think the real problem is that the assertion is
just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be
InvalidTransactionId or BootstrapTransactionId, but
FrozenTransactionId is a legal, if corner-case, value, at least as the
code stands today.

Unfortunately we can't just relax the assertion, because the
XLOG_RUNNING_XACTS record will eventually be handed to
ProcArrayApplyRecoveryInfo() for processing ... and that function
contains a matching assertion which would in turn fail. It in turn
passes the value to MaintainLatestCompletedXidRecovery() which
contains yet another matching assertion, so the restriction to normal
XIDs here looks pretty deliberate. There are no comments, though, so
the reader is left to guess why. I see one problem:
MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which
expects a normal XID. Perhaps it's best to just dodge the entire issue
by skipping LogStandbySnapshot() if latestCompletedXid happens to be
2, but that feels like a hack, because AFAICS the real problem is that
StartupXLog() doesn't agree with the rest of the code on whether 2 is
a legal case, and maybe we ought to be storing a value that doesn't
need to be computed via TransactionIdRetreat().

The second problem I hit was a preexisting bug where, under
wal_level=minimal, redo of a "create tablespace" record can blow away
data of which we have no other copy. See
http://postgr.es/m/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zkihxqzbbfrxxgx...@mail.gmail.com
for details. That bug happens to make src/test/recovery's
018_wal_optimize.pl fail one of the tests, because the test starts and
stops the server repeatedly, with the result that with the attached
patch, we just keep writing end-of-recovery records and never getting
time to finish a checkpoint before the next shutdown, so every test
replays the CREATE TABLESPACE record and everything that previous
tests did. The "wal_level = minimal, SET TABLESPACE commit
subtransaction" fails because it's the only one that (1) uses the
tablespace for a new table, (2) commits, and (3) runs before a
checkpoint is manually forced.

It's also worth noting that if we go this way,
CHECKPOINT_END_OF_RECOVERY should be ripped out entirely. We'd still
be triggering a checkpoint at the end of recovery, but because it
could be running concurrently with WAL-generating activity, it
wouldn't be an end-of-recovery checkpoint in the sense that we now use
that term. In particular, you couldn't assume that no other write
transactions are running at the point when this checkpoint is
performed. I haven't yet tried ripping that out and doing so might
reveal other problems.

--
Robert Haas
EDB: http://www.enterprisedb.com

v1-0001-Always-use-an-end-of-recovery-record-rather-than-.patch
Description: Binary data

using an end-of-recovery record in all cases

Reply via email to