Re: Use of backup_label not noted in log

Andres Freund Sat, 18 Nov 2023 13:49:39 -0800

Hi,

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
> > What about adding it to the "redo starts at" message, something like
> >
> >   redo starts at 12/12345678, taken from control file
> >
> > or
> >
> >   redo starts at 12/12345678, taken from backup label
> 
> I think it'd make sense to log use of backup_label earlier than that - the
> locations from backup_label might end up not being available in the archive,
> the primary or locally, and we'll error out with "could not locate a valid
> checkpoint record".
> 
> I'd probably just do it within the if (read_backup_label()) block in
> InitWalRecovery(), *before* the ReadCheckpointRecord().


Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG:  starting from base backup with redo LSN A/34100028, checkpoint LSN 
A/34100080 on timeline ID 1

When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG:  continuing to start from base backup with redo LSN A/34100028
LOG:  entering standby mode
LOG:  redo starts at A/3954B958

Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.


I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.


We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.


> I do like the idea of expanding the "redo starts at" message
> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
> ControlFile->backupEndPoint would provide information about when the node
> might become consistent.

Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalLRecovery(). If e.g. there's no WAL you'll only get
LOG:  invalid checkpoint record
PANIC:  could not locate a valid checkpoint record

which is delightfully lacking in details.


There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.

And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

  LOG:  consistent recovery state reached at ...
before
  LOG:  redo starts at ...


But that's all clearly just material for HEAD.

Greetings,

Andres Freund

>From 37e9ebf2167892b0e58df04670288b9b8780bfdc Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sat, 18 Nov 2023 13:27:17 -0800
Subject: [PATCH v1] wip: Log when starting up from a base backup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlogrecovery.c | 26 +++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666aa..c25831303fa 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+				(errmsg("starting from base backup with redo LSN %X/%X, checkpoint LSN %X/%X on timeline ID %d",
+						LSN_FORMAT_ARGS(RedoStartLSN),
+						LSN_FORMAT_ARGS(CheckPointLoc),
+						CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				EnableStandbyMode();
 		}
 
+		/*
+		 * For the same reason as when starting up with backup_label present,
+		 * emit a log message when we continue initializing from a base
+		 * backup.
+		 */
+		if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+			ereport(LOG,
+					(errmsg("continuing to start from base backup with redo LSN %X/%X",
+							LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
+
 		/* Get the last valid checkpoint record. */
 		CheckPointLoc = ControlFile->checkPoint;
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
-- 
2.38.0

Re: Use of backup_label not noted in log

Reply via email to