Re: Use of backup_label not noted in log

Andres Freund Sat, 18 Nov 2023 10:02:10 -0800

Hi,

On 2023-11-17 06:41:46 +0100, Laurenz Albe wrote:
> On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote:
> > I've often had to analyze what caused corruption in PG instances, where the
> > symptoms match not having had backup_label in place when bringing on the
> > node. However that's surprisingly hard - the only log messages that indicate
> > use of backup_label are at DEBUG1.
> >
> > Given how crucial use of backup_label is and how frequently people do get it
> > wrong, I think we should add a LOG message - it's not like use of 
> > backup_label
> > is a frequent thing in the life of a postgres instance and is going to swamp
> > the log.  And I think we should backpatch that addition.
>
> +1
>
> I am not sure about the backpatch: it is not a bug, and we should not wantonly
> introduce new log messages in a minor release.  Some monitoring system may
> get confused.


I think log monitoring need (and do) handle unknown log messages
gracefully. You're constantly encountering them.  If were to change an
existing log message in the back branches it'd be a different story.

The reason for backpatching is that this is by far the most common reason for
corrupted systems in the wild that I have seen. And there's no way to
determine from the logs whether something has gone right or wrong - not really
a bug, but a pretty substantial weakness. And we're going to have to deal with
< 17 releases for 5 years, so making this at least somewhat diagnosable seems
like a good idea.


> What about adding it to the "redo starts at" message, something like
>
>   redo starts at 12/12345678, taken from control file
>
> or
>
>   redo starts at 12/12345678, taken from backup label

I think it'd make sense to log use of backup_label earlier than that - the
locations from backup_label might end up not being available in the archive,
the primary or locally, and we'll error out with "could not locate a valid
checkpoint record".

I'd probably just do it within the if (read_backup_label()) block in
InitWalRecovery(), *before* the ReadCheckpointRecord().

I do like the idea of expanding the "redo starts at" message
though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
ControlFile->backupEndPoint would provide information about when the node
might become consistent.

Greetings,

Andres Freund

Re: Use of backup_label not noted in log

Reply via email to