On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sula...@gmail.com> wrote:
> > No, InRecovery flag get cleared before this point. I think, we can use
> > lastReplayedEndRecPtr what you have suggested in other thread.
>
> Hmm, right, that makes sense. Perhaps I should start remembering what
> I said in my own emails.
>
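To make that concrete, the check in StartupXLOG() now boils down to the
following (just a sketch of the relevant hunk; the full comment is in the
attached patch):

    /*
     * InRecovery has already been cleared at this point, so test
     * lastReplayedEndRecPtr instead: it is only set when redo was
     * actually performed.
     */
    if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
        RequestCheckpoint(CHECKPOINT_FORCE);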
Here I ended up with the attached version, where I have dropped the changes
to standby.c and the 018_wal_optimize.pl file. Also, I am not sure whether
the changes to bgwriter.c and slot.c belong in this patch, but I have not
touched those here.

Regards,
Amul
From 8afbd58573f744ef742969575ecf4de7ec5d915d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 6 Oct 2021 09:38:02 -0400
Subject: [PATCH v2] Always use an end-of-recovery record rather than a
 checkpoint.

Insert awful hack to allow LogStandbySnapshot() to not log a standby
snapshot, because it's unprepared to handle the case where
latestCompletedXid is 2, which can happen when starting from the very
first ever checkpoint.

Nuke 018_wal_optimize.pl. Otherwise, the "wal_level = minimal, SET
TABLESPACE commit subtransaction" test fails. The reason for the
failure is that CREATE TABLESPACE nukes the entire directory and then
recreates it, which is a really bad thing to do when wal_level=minimal.
Without the changes made by this patch the problem is masked, because
each server restart necessarily involves a checkpoint, so the CREATE
TABLESPACE and the creation of a table in that tablespace don't manage
to happen in the same checkpoint cycle.

This is not to be taken seriously in its current form; it's just an
illustration of (some of?) the problems that would need to be solved
to make this kind of change viable.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 72 ++++++++++++------------------
 src/backend/postmaster/bgwriter.c |  1 +
 src/backend/replication/slot.c    |  1 +
 3 files changed, 29 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26dcc00ac01..112de230a0d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6549,7 +6549,6 @@ StartupXLOG(void)
     DBState     dbstate_at_startup;
     XLogReaderState *xlogreader;
     XLogPageReadPrivate private;
-    bool        promoted = false;
     struct stat st;
 
     /*
@@ -7955,43 +7954,8 @@ StartupXLOG(void)
 
     if (InRecovery)
     {
-        /*
-         * Perform a checkpoint to update all our recovery activity to disk.
-         *
-         * Note that we write a shutdown checkpoint rather than an on-line
-         * one. This is not particularly critical, but since we may be
-         * assigning a new TLI, using a shutdown checkpoint allows us to have
-         * the rule that TLI only changes in shutdown checkpoints, which
-         * allows some extra error checking in xlog_redo.
-         *
-         * In promotion, only create a lightweight end-of-recovery record
-         * instead of a full checkpoint. A checkpoint is requested later,
-         * after we're fully out of recovery mode and already accepting
-         * queries.
-         */
-        if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-            LocalPromoteIsTriggered)
-        {
-            promoted = true;
-
-            /*
-             * Insert a special WAL record to mark the end of recovery, since
-             * we aren't doing a checkpoint. That means that the checkpointer
-             * process may likely be in the middle of a time-smoothed
-             * restartpoint and could continue to be for minutes after this.
-             * That sounds strange, but the effect is roughly the same and it
-             * would be stranger to try to come out of the restartpoint and
-             * then checkpoint. We request a checkpoint later anyway, just for
-             * safety.
-             */
-            CreateEndOfRecoveryRecord();
-        }
-        else
-        {
-            RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-                              CHECKPOINT_IMMEDIATE |
-                              CHECKPOINT_WAIT);
-        }
+        /* Insert a special WAL record to mark the end of recovery. */
+        CreateEndOfRecoveryRecord();
     }
 
     if (ArchiveRecoveryRequested)
@@ -8177,12 +8141,30 @@ StartupXLOG(void)
     WalSndWakeup();
 
     /*
-     * If this was a promotion, request an (online) checkpoint now. This isn't
-     * required for consistency, but the last restartpoint might be far back,
-     * and in case of a crash, recovering from it might take a longer than is
-     * appropriate now that we're not in standby mode anymore.
-     */
-    if (promoted)
+     * Request an (online) checkpoint now. Note that, until this is complete,
+     * a crash would start replay from the same WAL location we did, or from
+     * the last restartpoint that completed. We don't want to let that
+     * situation persist for longer than necessary, since users don't like
+     * long recovery times. On the other hand, they also want to be able to
+     * start doing useful work again as quickly as possible. Therefore, we
+     * don't pass CHECKPOINT_IMMEDIATE, to avoid bogging down the system.
+     *
+     * Note that the consequence of requesting a checkpoint here only after
+     * we've allowed WAL writes is that a single checkpoint cycle can span
+     * multiple server lifetimes. So, for example, if you want something to
+     * happen at least once per checkpoint cycle or at most once per
+     * checkpoint cycle, you have to consider what happens if the server
+     * is restarted someplace in the middle.
+     *
+     * NB: We generally use the InRecovery flag both to perform the redo and
+     * to perform the operations that must follow the redo. This checkpoint
+     * request is one of the operations that should happen only if redo was
+     * performed. But we cannot check the InRecovery flag here because it has
+     * already been cleared; instead, we can tell whether redo was performed
+     * by checking lastReplayedEndRecPtr, which is set only when redo is
+     * done.
+     */
+    if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
         RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8998,7 +8980,7 @@ CreateCheckPoint(int flags)
         shutdown = false;
 
     /* sanity check */
-    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+    if (RecoveryInProgress())
         elog(ERROR, "can't create a checkpoint during recovery");
 
     /*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..3eeacad50aa 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -294,6 +294,7 @@ BackgroundWriterMain(void)
                 last_snapshot_lsn <= GetLastImportantRecPtr())
             {
                 last_snapshot_lsn = LogStandbySnapshot();
+                Assert(!XLogRecPtrIsInvalid(last_snapshot_lsn));
                 last_snapshot_ts = now;
             }
         }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1c6c0c7ce27..0d9b1bbe293 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1120,6 +1120,7 @@ ReplicationSlotReserveWal(void)
 
         /* make sure we have enough information to start */
         flushptr = LogStandbySnapshot();
+        Assert(!XLogRecPtrIsInvalid(flushptr));
 
         /* and make sure it's fsynced to disk */
         XLogFlush(flushptr);
-- 
2.18.0