Hi,

On 2023-11-20 11:35:15 +0100, Laurenz Albe wrote:
> On Mon, 2023-11-20 at 17:30 +0900, Michael Paquier wrote:
> > +       if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
> > +           ereport(LOG,
> > +                   (errmsg("continuing to start from base backup with redo 
> > LSN %X/%X",
> > +                           
> > LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
> >
> > "Continuing to start" sounds a bit weird to me, though, considering
> > that there are a few LOGs that say "starting" when there is a signal
> > file, but I don't have a better idea on top of my mind.  So that
> > sounds OK here.
>
> We can only reach that message in recovery or standby mode, right?
> So why not write "continuing to recover from base backup"?

It can be reached without either too, albeit much less commonly.


> If we add a message for starting with "backup_label", shouldn't
> we also add a corresponding message for starting from a checkpoint
> found in the control file?  If you see that in a problem report,
> you immediately know what is going on.

Maybe - the reason I hesitate on that is that emitting an additional log
message when starting from a base backup just adds something "once once the
lifetime of a node". Whereas emitting something every start obviously doesn't
impose any limit.

You also can extrapolate from the messages absence that we started up without
backup_label, it's not like there would be a lot of messages inbetween
  "database system was interrupted; last ..."
and
  "starting backup recovery ..."
(commonly there would be no messages)

We can do more on HEAD of course, but we should be wary of just spamming the
log unnecessarily as well.


I guess we could add this message at the same time, including in the back
branches. Initially I thought that might be unwise, because replacing
                elog(DEBUG1, "end of backup reached");
with a different message could theoretically cause issues, even if unlikely,
given that it's a DEBUG1 message.

But I think we actually want to emit the message a bit later, just *after* we
updated the control file, as that's the actually relevant piece after which we
won't go back to the "backup recovery" state.  I am somewhat agnostic about
whether we should add that in the back branches or not.


Here's the state with my updated patch, when starting up from a base backup:

LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0, 
64-bit
LOG:  listening on IPv6 address "::1", port 5441
LOG:  listening on IPv4 address "127.0.0.1", port 5441
LOG:  listening on Unix socket "/tmp/.s.PGSQL.5441"
LOG:  database system was interrupted; last known up at 2023-11-20 10:55:49 PST
LOG:  starting recovery from base backup with redo LSN E/AFF07F20, checkpoint 
LSN E/B01B17F0, on timeline ID 1
LOG:  entering standby mode
LOG:  redo starts at E/AFF07F20
LOG:  completed recovery from base backup with redo LSN E/AFF07F20
LOG:  consistent recovery state reached at E/B420FC80


Besides the phrasing and the additional log message (I have no opinion about
whether it should be backpatched or not), I used %u for TimelineID as
appropriate, and added a comma before "on timeline".

Greetings,

Andres Freund
>From d80e55942a096d9f1ab10b84ab6f2a9d371cf88d Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sat, 18 Nov 2023 13:27:17 -0800
Subject: [PATCH v2] WIP: Log when starting up from a base backup

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp...@awork3.anarazel.de
Backpatch:
---
 src/backend/access/transam/xlogrecovery.c | 32 +++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666aa..9803b481118 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -603,6 +603,22 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (StandbyModeRequested)
 			EnableStandbyMode();
 
+		/*
+		 * Omitting backup_label when creating a new replica, PITR node etc.
+		 * unfortunately is a common cause of corruption. Logging that
+		 * backup_label was used makes it a bit easier to exclude that as the
+		 * cause of observed corruption.
+		 *
+		 * Do so before we try to read the checkpoint record (which can fail),
+		 * as otherwise it can be hard to understand why a checkpoint other
+		 * than ControlFile->checkPoint is used.
+		 */
+		ereport(LOG,
+				(errmsg("starting recovery from base backup with redo LSN %X/%X, checkpoint LSN %X/%X, on timeline ID %u",
+						LSN_FORMAT_ARGS(RedoStartLSN),
+						LSN_FORMAT_ARGS(CheckPointLoc),
+						CheckPointTLI)));
+
 		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
@@ -742,6 +758,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				EnableStandbyMode();
 		}
 
+		/*
+		 * For the same reason as when starting up with backup_label present,
+		 * emit a log message when we continue initializing from a base
+		 * backup.
+		 */
+		if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+			ereport(LOG,
+					(errmsg("restarting recovery from base backup with redo LSN %X/%X",
+							LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
+
 		/* Get the last valid checkpoint record. */
 		CheckPointLoc = ControlFile->checkPoint;
 		CheckPointTLI = ControlFile->checkPointCopy.ThisTimeLineID;
@@ -2151,6 +2177,8 @@ CheckRecoveryConsistency(void)
 	if (!XLogRecPtrIsInvalid(backupEndPoint) &&
 		backupEndPoint <= lastReplayedEndRecPtr)
 	{
+		XLogRecPtr oldBackupStartPoint = backupStartPoint;
+
 		elog(DEBUG1, "end of backup reached");
 
 		/*
@@ -2161,6 +2189,10 @@ CheckRecoveryConsistency(void)
 		backupStartPoint = InvalidXLogRecPtr;
 		backupEndPoint = InvalidXLogRecPtr;
 		backupEndRequired = false;
+
+		ereport(LOG,
+				(errmsg("completed recovery from base backup with redo LSN %X/%X",
+						LSN_FORMAT_ARGS(oldBackupStartPoint))));
 	}
 
 	/*
-- 
2.38.0

Reply via email to