Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

Robert Haas Wed, 11 Sep 2024 06:01:22 -0700

On Tue, Sep 10, 2024 at 4:58 PM Noah Misch wrote:
> ... a rule of "each wait event appears in one
> pgstat_report_wait_start()" would be a rule I don't want.


As the original committer of the wait event stuff, I intended for the
rule that you do not want to be the actual rule. However, I see that I
didn't spell that out anywhere in the commit message, or the commit
itself.

> I see this level of fine-grained naming
> as making the event name a sort of stable proxy for FILE:LINE.  I'd value
> exposing such a proxy, all else being equal, but I don't think wait event
> names like AuthLdapBindLdapbinddn/AuthLdapBindUser are the right way.  Wait
> event names should be more independent of today's code-level details.

I don't agree with that. One of the most difficult parts of supporting
PostgreSQL, in my experience, is that it's often very difficult to
find out what has gone wrong when a system starts behaving badly. It
is often necessary to ask customers to install a debugger and do stuff
with it, or give them an instrumented build, in order to determine the
root cause of a problem that in some cases is not even particularly
complicated. While needing to refer to specific source code details
may not be a common experience for the typical end user, it is
extremely common for me. This problem commonly arises with error
messages, because we have lots of error messages that are exactly the
same, although thankfully it has become less common now that "could not
find tuple for THINGY %u" is no longer a message that typically
reaches users. But even when someone has a complaint about
an error message and there are multiple instances of that error
message, I know that:

(1) I can ask them to set the error verbosity to verbose. I don't have
that option for wait events.

(2) The primary function of the error message is to be understandable
to the user, which means that it needs to be written in plain English.
The primary function of a wait event is to make it possible to
understand the behavior of the system and troubleshoot problems, and
it becomes much less effective as soon as it starts saying that thing
A and thing B are so similar that nobody will ever care about the
distinction. It is very hard to be certain of that. When somebody
reports that they've got a whole bunch of backends waiting on some wait
event that nobody has ever complained about before, I want to go look
at the code in that specific place and try to figure out what's
happening. If I have to start imagining possible scenarios based on 2
or more call sites, or if I have to start by getting them to install a
modified build with those properly split apart and trying to reproduce
the problem, it's a lot harder.

In my experience, the number of distinct wait events that a particular
installation experiences is rarely very large. It is probably measured
in dozens. A user who wishes to disregard the distinction between
similarly-named wait events won't find it prohibitively difficult to
look over the list of all the wait events they ever see and decide
which ones they'd like to merge for reporting purposes. But a user who
really needs things separated out and finds that they aren't is simply
out of luck.
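
To illustrate how cheap the merging side is, here is a minimal sketch in
Python, assuming the user has already sampled wait event names (for
example from pg_stat_activity) into a list. The mapping, the merged
label "AuthLdapBind", the function name and the sample data are all
hypothetical and exist only for illustration; the two LDAP event names
are the ones proposed upthread.

```python
from collections import Counter

# Hypothetical user-chosen mapping: fine-grained wait event names on the
# left, the coarser label the user wants to see in reports on the right.
MERGE_MAP = {
    "AuthLdapBindLdapbinddn": "AuthLdapBind",
    "AuthLdapBindUser": "AuthLdapBind",
}


def merge_wait_events(samples, merge_map=MERGE_MAP):
    """Count sampled wait events, folding names listed in merge_map together.

    Events that are not in merge_map are counted under their own name.
    """
    counts = Counter()
    for event in samples:
        counts[merge_map.get(event, event)] += 1
    return counts


# Made-up sample data, standing in for periodically sampled wait events.
samples = [
    "AuthLdapBindUser",
    "AuthLdapBindLdapbinddn",
    "AuthLdapBindUser",
    "ClientRead",
]
merged = merge_wait_events(samples)
```

Going the other way is not possible: if the server reports only one
combined name, no amount of post-processing on the reporting side can
recover which call site was actually waiting.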

-- 
Robert Haas
EDB: http://www.enterprisedb.com
