On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sula...@gmail.com> wrote:
> > No, InRecovery flag get cleared before this point. I think, we can use
> > lastReplayedEndRecPtr what you have suggested in other thread.
>
> Hmm, right, that makes sense. Perhaps I should start remembering what
> I said in my own emails.
>
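To make that concrete, the check in StartupXLOG() now boils down to the
following (just a sketch of the relevant hunk; the full comment is in the
attached patch):

    /*
     * InRecovery has already been cleared at this point, so test
     * lastReplayedEndRecPtr instead: it is only set when redo was
     * actually performed.
     */
    if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
        RequestCheckpoint(CHECKPOINT_FORCE);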
Here I ended up with the attached version, where I have dropped the changes
to standby.c and the 018_wal_optimize.pl file. Also, I am not sure whether
the changes to bgwriter.c and slot.c belong in this patch, but I have not
touched those here.

Regards,
Amul
From 8afbd58573f744ef742969575ecf4de7ec5d915d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 6 Oct 2021 09:38:02 -0400
Subject: [PATCH v2] Always use an end-of-recovery record rather than a
 checkpoint.

Insert awful hack to allow LogStandbySnapshot() to not log a standby
snapshot, because it's unprepared to handle the case where
latestCompletedXid is 2, which can happen when starting from the very
first ever checkpoint.

Nuke 018_wal_optimize.pl. Otherwise, the "wal_level = minimal, SET
TABLESPACE commit subtransaction" test fails. The reason for the
failure is that CREATE TABLESPACE nukes the entire directory and then
recreates it, which is a really bad thing to do when wal_level=minimal.
Without the changes made by this patch the problem is masked, because
each server restart necessarily involves a checkpoint, so the CREATE
TABLESPACE and the creation of a table in that tablespace don't manage
to happen in the same checkpoint cycle.

This is not to be taken seriously in its current form; it's just an
illustration of (some of?) the problems that would need to be solved
to make this kind of change viable.

Robert Haas, with modifications by Amul Sul.
---
 src/backend/access/transam/xlog.c | 72 ++++++++++++------------------
 src/backend/postmaster/bgwriter.c |  1 +
 src/backend/replication/slot.c    |  1 +
 3 files changed, 29 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26dcc00ac01..112de230a0d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6549,7 +6549,6 @@ StartupXLOG(void)
     DBState     dbstate_at_startup;
     XLogReaderState *xlogreader;
     XLogPageReadPrivate private;
-    bool        promoted = false;
     struct stat st;
 
     /*
@@ -7955,43 +7954,8 @@ StartupXLOG(void)
 
     if (InRecovery)
     {
-        /*
-         * Perform a checkpoint to update all our recovery activity to disk.
-         *
-         * Note that we write a shutdown checkpoint rather than an on-line
-         * one. This is not particularly critical, but since we may be
-         * assigning a new TLI, using a shutdown checkpoint allows us to have
-         * the rule that TLI only changes in shutdown checkpoints, which
-         * allows some extra error checking in xlog_redo.
-         *
-         * In promotion, only create a lightweight end-of-recovery record
-         * instead of a full checkpoint. A checkpoint is requested later,
-         * after we're fully out of recovery mode and already accepting
-         * queries.
-         */
-        if (ArchiveRecoveryRequested && IsUnderPostmaster &&
-            LocalPromoteIsTriggered)
-        {
-            promoted = true;
-
-            /*
-             * Insert a special WAL record to mark the end of recovery, since
-             * we aren't doing a checkpoint. That means that the checkpointer
-             * process may likely be in the middle of a time-smoothed
-             * restartpoint and could continue to be for minutes after this.
-             * That sounds strange, but the effect is roughly the same and it
-             * would be stranger to try to come out of the restartpoint and
-             * then checkpoint. We request a checkpoint later anyway, just for
-             * safety.
-             */
-            CreateEndOfRecoveryRecord();
-        }
-        else
-        {
-            RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
-                              CHECKPOINT_IMMEDIATE |
-                              CHECKPOINT_WAIT);
-        }
+        /* Insert a special WAL record to mark the end of recovery. */
+        CreateEndOfRecoveryRecord();
     }
 
     if (ArchiveRecoveryRequested)
@@ -8177,12 +8141,30 @@ StartupXLOG(void)
     WalSndWakeup();
 
     /*
-     * If this was a promotion, request an (online) checkpoint now. This isn't
-     * required for consistency, but the last restartpoint might be far back,
-     * and in case of a crash, recovering from it might take a longer than is
-     * appropriate now that we're not in standby mode anymore.
-     */
-    if (promoted)
+     * Request an (online) checkpoint now. Note that, until this is complete,
+     * a crash would start replay from the same WAL location we did, or from
+     * the last restartpoint that completed. We don't want to let that
+     * situation persist for longer than necessary, since users don't like
+     * long recovery times. On the other hand, they also want to be able to
+     * start doing useful work again as quickly as possible. Therefore, we
+     * don't pass CHECKPOINT_IMMEDIATE, to avoid bogging down the system.
+     *
+     * Note that the consequence of requesting a checkpoint here only after
+     * we've allowed WAL writes is that a single checkpoint cycle can span
+     * multiple server lifetimes. So, for example, if you want something to
+     * happen at least once per checkpoint cycle or at most once per
+     * checkpoint cycle, you have to consider what happens if the server
+     * is restarted someplace in the middle.
+     *
+     * NB: We generally use the InRecovery flag both to perform the redo and
+     * to perform the operations that must follow the redo. This checkpoint
+     * request is one of the operations that should happen only if redo was
+     * performed. But we cannot check the InRecovery flag here because it has
+     * already been cleared; instead, we can tell whether redo was performed
+     * by checking lastReplayedEndRecPtr, which is set only when redo is
+     * done.
+     */
+    if (!XLogRecPtrIsInvalid(XLogCtl->lastReplayedEndRecPtr))
         RequestCheckpoint(CHECKPOINT_FORCE);
 }
 
@@ -8998,7 +8980,7 @@ CreateCheckPoint(int flags)
         shutdown = false;
 
     /* sanity check */
-    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
+    if (RecoveryInProgress())
         elog(ERROR, "can't create a checkpoint during recovery");
 
     /*
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc241..3eeacad50aa 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -294,6 +294,7 @@ BackgroundWriterMain(void)
                 last_snapshot_lsn <= GetLastImportantRecPtr())
             {
                 last_snapshot_lsn = LogStandbySnapshot();
+                Assert(!XLogRecPtrIsInvalid(last_snapshot_lsn));
                 last_snapshot_ts = now;
             }
         }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1c6c0c7ce27..0d9b1bbe293 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1120,6 +1120,7 @@ ReplicationSlotReserveWal(void)
 
         /* make sure we have enough information to start */
         flushptr = LogStandbySnapshot();
+        Assert(!XLogRecPtrIsInvalid(flushptr));
 
         /* and make sure it's fsynced to disk */
         XLogFlush(flushptr);
-- 
2.18.0