Hi, On 2023-11-17 06:41:46 +0100, Laurenz Albe wrote: > On Thu, 2023-11-16 at 20:18 -0800, Andres Freund wrote: > > I've often had to analyze what caused corruption in PG instances, where the > > symptoms match not having had backup_label in place when bringing on the > > node. However that's surprisingly hard - the only log messages that indicate > > use of backup_label are at DEBUG1. > > > > Given how crucial use of backup_label is and how frequently people do get it > > wrong, I think we should add a LOG message - it's not like use of > > backup_label > > is a frequent thing in the life of a postgres instance and is going to swamp > > the log. And I think we should backpatch that addition. > > +1 > > I am not sure about the backpatch: it is not a bug, and we should not wantonly > introduce new log messages in a minor release. Some monitoring system may > get confused.
I think log monitoring need (and do) handle unknown log messages gracefully. You're constantly encountering them. If were to change an existing log message in the back branches it'd be a different story. The reason for backpatching is that this is by far the most common reason for corrupted systems in the wild that I have seen. And there's no way to determine from the logs whether something has gone right or wrong - not really a bug, but a pretty substantial weakness. And we're going to have to deal with < 17 releases for 5 years, so making this at least somewhat diagnosable seems like a good idea. > What about adding it to the "redo starts at" message, something like > > redo starts at 12/12345678, taken from control file > > or > > redo starts at 12/12345678, taken from backup label I think it'd make sense to log use of backup_label earlier than that - the locations from backup_label might end up not being available in the archive, the primary or locally, and we'll error out with "could not locate a valid checkpoint record". I'd probably just do it within the if (read_backup_label()) block in InitWalRecovery(), *before* the ReadCheckpointRecord(). I do like the idea of expanding the "redo starts at" message though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint, ControlFile->backupEndPoint would provide information about when the node might become consistent. Greetings, Andres Freund