On 7/7/23 01:38, Gurjeet Singh wrote:
> On Thu, Jul 6, 2023 at 12:37 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <m...@joeconway.com> wrote:
>>>
>>> The basic idea is to mirror how pg_stat_statements works, except the
>>> logged messages keyed by filename, lineno, and elevel are saved with an
>>> aggregate count. The format string is displayed (similar to a query
>>> jumble) for context, along with function name and sqlerrcode.
>>>
>>> Part of the thinking is that people with fleets of postgres instances
>>> can use this to scan for various errors that they care about.
>>> Additionally it would be useful to look for "should not happen" errors.

>> I'm concerned that displaying the format string might not be helpful
>> in some cases. For example, when raising an ERROR while reading WAL
>> records, we typically write the error message stored in
>> record->errormsg_buf:
>>
>> in ReadRecord():
>>             if (errormsg)
>>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>>                         (errmsg_internal("%s", errormsg) /* already translated */ ));
>>
>> In this case, the error message stored in pg_stat_logmsg is just '%s'.
>> The filename and line number columns might help identify the actual
>> error, but that requires users to read the source code, and they still
>> cannot know the actual error message.

> I believe that the number of such error messages, the ones with very
> little descriptive content, is very low in the Postgres code. These kinds
> of messages are not prevalent, and tend to be used when the code hits
> obscure conditions, as seen in your example above. These are the kinds of
> errors that Joe is referring to as "should not happen". In these
> cases, even if the error message were descriptive, the user would very
> likely have to dive deep into the code to find out the real cause.

Agreed. As an example, after running `make installcheck`:

8<-----------------
select sum(count) from pg_stat_logmsg;
 sum
------
 6005
(1 row)

select message,sum(count)
from pg_stat_logmsg
where length(message) < 30
  and elevel in ('ERROR','FATAL','PANIC')
  and message like '%\%s%' escape '\'
group by message
order by length(message);
            message            | sum
-------------------------------+-----
 %s                            | 107
 "%s" is a view                |   9
 "%s" is a table               |   4
 %s is a procedure             |   1
 invalid size: "%s"            |  13
 %s at or near "%s"            | 131
 %s at end of input            |   3
...
8<-----------------

I only see three message formats there that are generic enough to be of concern (the first one and the last two shown -- FWIW I did not see any more of them as the fmt strings get longer).

So out of 6005 log events, 241 fit this concern.

Perhaps given the small number of message format strings affected, it would make sense to special case those, but I am not sure it is worth the effort, at least for version 1.
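
That said, if it does turn out to matter in practice, the handful of generic format strings can at least be broken out on the query side rather than in the extension itself. Something along these lines should work (untested sketch; it assumes the view exposes the source location as filename/lineno columns):

8<-----------------
-- query-side workaround (sketch): for the few generic format strings,
-- group by source location so distinct call sites stay visible;
-- the filename/lineno column names are assumed here
select filename, lineno, message, sum(count) as total
from pg_stat_logmsg
where message in ('%s', '%s at or near "%s"', '%s at end of input')
group by filename, lineno, message
order by total desc;
8<-----------------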

> I feel that the unique combination of file name and line number is
> useful information, even in cases where the format string is not very
> descriptive. So I believe the extension's behaviour in this regard is
> acceptable.
>
> In cases where the extension's output is not descriptive enough, the
> user can use the filename:lineno pair to look for fully formed error
> messages in the actual log files; they may have to make appropriate
> changes to log_* parameters, though.

Right
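
For example, setting log_error_verbosity to verbose makes the server log include a LOCATION line with the source file, line number, and function name, which can then be matched up against the view:

8<-----------------
-- include the source location (LOCATION: func, file.c:line) in log
-- entries so they can be correlated with pg_stat_logmsg
alter system set log_error_verbosity = 'verbose';
select pg_reload_conf();
8<-----------------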

> If we still feel strongly about getting the actual message for these
> cases, perhaps we can develop a heuristic to identify such messages,
> and use either the full 'message' field or a prefix of it, instead of
> the 'message_id' field. The heuristic could be: strlen(message_id) is
> too short, or message_id is all/overwhelmingly format specifiers
> (e.g. '%s: %d').

Based on the above analysis (though granted, not all-inclusive), it seems like just special-casing the specific message format strings of interest would work.
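
If we do want to go the heuristic route later, something roughly like this could flag the candidates (a rough sketch only; the regexp and threshold are arbitrary):

8<-----------------
-- sketch of the heuristic: strip conversion specifiers from the format
-- string and flag entries with almost no descriptive text left over
select message, sum(count)
from pg_stat_logmsg
where elevel in ('ERROR','FATAL','PANIC')
  and length(btrim(regexp_replace(message, '%[-0-9.*l]*[a-zA-Z]', '', 'g'))) <= 3
group by message
order by sum(count) desc;
8<-----------------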

> The core benefit of this extension is to make it easy for the user to
> discover error messages affecting their workload. The user may be
> interested in knowing the most common log messages in their server
> logs, so that they can work on reducing those errors or warnings. Or
> they may be interested in knowing when their workload hits
> unexpected/serious error messages, even if they're infrequent. The
> data exposed by the pg_stat_logmsg view would serve as a starting point,
> but I'm guessing it would not be sufficient for their investigation.

Yes, exactly.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


