Greetings, * Andres Freund (and...@anarazel.de) wrote: > There's a few errors that we issue that are, often, much less bad than > they sound. The most common cases I immediately can recall are:
Agreed. > 1) Mentioning crash, once for each backend, when shutting down > immediately. Currently the log output for that, with just two sessions > connected, is the following: Eh, +1 or +2. > It seems like we ought to be able to somehow > > a) Signal that the server has been shut down in immediate mode, rather > than actually crashed, and issue a different log message to the user. That'd be nice, but I anticipate the question coming up of "how much is too much to do before an immediate shutdown?" > b) Stop issuing, to the server log, the same message over and over. We > instead just ought to send the message to the client. We however need to > be careful that we don't make it harder to debug a SIGQUIT sent directly > to backend processes. Would be nice if we could improve that, agreed. > 2) At the end of crash recovery, and often when the startup processes > switches between WAL sources, we get scary messages like: +1000 > I've seen countless tickets being raised because PG users looked at > their logs and got scared. We want them to look at the logs, so this > seems counter-productive. Yes, agreed, this happens all the time and would be good to improve. > It seems we, at the very least, could add an error context or something > explaining that a LOG message about the log end is to be expected. Or > perhaps we could should reformulate the message to something like > 'ELEVEL: reached end of valid WAL at XX/XX' > 'DETAIL: end determined due to invalid record length: wanted 24, got 0' > > perhaps with a HINT in the elevel < ERROR case indicating that this is > not a failure. Something like that does look like an improvement. > 3) When a standby node is shutdown in immediate mode, we issue: > > 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted > while in recovery at log time 2019-11-05 15:37:43 PST > 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than > once some data might be corrupted and you might need to choose an earlier > recovery target. > > Which imo doesn't really make sense for at least standbys, which are > expected to be in recovery forever. There's more than enough scenarios > where one would shut down a standby in immediate mode for good reasons, > we shouldn't issue such warnings in that case. > > > The tricky part in improving this would be how to detect when the > message should still be issued for a standby. One idea, which is not > bullet proof but might be good enough, would be to record in the control > file which position recovery was started from last time, and only issue > the error when recovery would start from the same point. Yeah... This sounds like it would be more difficult to tackle, though I agree it'd be nice to improve on this too. Thanks, Stephen
signature.asc
Description: PGP signature