[PATCH] pg_stat_activity: make slow/hanging authentication more visible

Jacob Champion Mon, 06 May 2024 14:24:05 -0700

Hi all,

Recently I dealt with a server where PAM had hung a connection
indefinitely, suppressing our authentication timeout and preventing a
clean shutdown. Worse, the xmin that was pinned by the opening
transaction cascaded to replicas and started messing things up
downstream.


The DBAs didn't know what was going on, because pg_stat_activity
doesn't report the authenticating connection or its open transaction.
It just looked like a Postgres bug. And while talking about it with
Euler, he mentioned he'd seen similar "invisible" hangs with
misbehaving LDAP deployments. I think we can do better to show DBAs
what's happening.

0001, attached, changes InitPostgres() to report a nearly-complete
pgstat entry before entering client authentication, then fills it in
the rest of the way once we know who the user is. Here's a sample
entry for a client that's hung during a SCRAM exchange:

    =# select * from pg_stat_activity where state = 'authenticating';
    -[ RECORD 1 ]----+------------------------------
    datid            |
    datname          |
    pid              | 745662
    leader_pid       |
    usesysid         |
    usename          |
    application_name |
    client_addr      | 127.0.0.1
    client_hostname  |
    client_port      | 38304
    backend_start    | 2024-05-06 11:25:23.905923-07
    xact_start       |
    query_start      |
    state_change     |
    wait_event_type  | Client
    wait_event       | ClientRead
    state            | authenticating
    backend_xid      |
    backend_xmin     | 784
    query_id         |
    query            |
    backend_type     | client backend

0002 goes even further, and adds wait events for various forms of
external authentication, but it's not fully baked. The intent is for a
DBA to be able to see when a bunch of connections are piling up
waiting for PAM/Kerberos/whatever. (I'm also motivated by my OAuth
patchset, where there's a server-side plugin that we have no control
over, and we'd want to be able to correctly point fingers at it if
things go wrong.)

= Open Issues, Idle Thoughts =

Maybe it's wishful thinking, but it'd be cool if a misbehaving
authentication exchange did not impact replicas in any way. Is there a
way to make that opening transaction lighterweight?

0001 may be a little too much code. There are only two parts of
pgstat_bestart() that need to be modified: omit the user ID, and fill
in the state as 'authenticating' rather than unknown. I could just add
the `pre_auth` boolean to the signature of pgstat_bestart() directly,
if we don't mind adjusting all the call sites. We could also avoid
changing the signature entirely, and just assume that we're
authenticating if SessionUserId isn't set. That felt like a little too
much global magic to me, though.

Would anyone like me to be more aggressive, and create a pgstat entry
as soon as we have the opening transaction? Or... as soon as a
connection is made?

0002 is abusing the "IPC" wait event class. If the general idea seems
okay, maybe we could add an "External" class that encompasses the
general idea of "it's not our fault, it's someone else's"?

I had trouble deciding how granular to make the areas that are covered
by the new wait events. Ideally they would kick in only when we call
out to an external system, but for some authentication types, that's a
lot of calls to wrap. On the other extreme, we don't want to go too
high in the call stack and accidentally nest wait events (such as
those generated during pq_getmessage()). What I have now is not very
principled.

I haven't decided how to test these patches. Seems like a potential
use case for injection points, but I think I'd need to preload an
injection library rather than using the existing extension. Does that
seem like an okay way to go?

Thanks,
--Jacob

0001-pgstat-report-in-earlier-with-STATE_AUTHENTICATING.patch
Description: Binary data

0002-WIP-report-external-auth-calls-as-wait-events.patch
Description: Binary data

[PATCH] pg_stat_activity: make slow/hanging authentication more visible

Reply via email to