Re: Should we make scary sounding, but actually routine, errors less scary?

Stephen Frost Wed, 06 Nov 2019 05:23:08 -0800

Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:


Agreed.

> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:

Eh, +1 or +2.

> It seems like we ought to be able to somehow
> 
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.

That'd be nice, but I anticipate the question coming up of "how much is
too much to do before an immediate shutdown?"

> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

Would be nice if we could improve that, agreed.

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:

+1000

> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.

Yes, agreed, this happens all the time and would be good to improve.

> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
> 
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

Something like that does look like an improvement.

> 3) When a standby node is shutdown in immediate mode, we issue:
> 
> 2019-11-05 15:45:58.722 PST [30321][] LOG:  database system was interrupted 
> while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT:  If this has occurred more than 
> once some data might be corrupted and you might need to choose an earlier 
> recovery target.
> 
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
> 
> 
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Yeah... This sounds like it would be more difficult to tackle, though I
agree it'd be nice to improve on this too.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Should we make scary sounding, but actually routine, errors less scary?

Reply via email to