Hi all, Recently I dealt with a server where PAM had hung a connection indefinitely, suppressing our authentication timeout and preventing a clean shutdown. Worse, the xmin that was pinned by the opening transaction cascaded to replicas and started messing things up downstream.
The DBAs didn't know what was going on, because pg_stat_activity doesn't report the authenticating connection or its open transaction. It just looked like a Postgres bug. And while talking about it with Euler, he mentioned he'd seen similar "invisible" hangs with misbehaving LDAP deployments. I think we can do better to show DBAs what's happening. 0001, attached, changes InitPostgres() to report a nearly-complete pgstat entry before entering client authentication, then fills it in the rest of the way once we know who the user is. Here's a sample entry for a client that's hung during a SCRAM exchange: =# select * from pg_stat_activity where state = 'authenticating'; -[ RECORD 1 ]----+------------------------------ datid | datname | pid | 745662 leader_pid | usesysid | usename | application_name | client_addr | 127.0.0.1 client_hostname | client_port | 38304 backend_start | 2024-05-06 11:25:23.905923-07 xact_start | query_start | state_change | wait_event_type | Client wait_event | ClientRead state | authenticating backend_xid | backend_xmin | 784 query_id | query | backend_type | client backend 0002 goes even further, and adds wait events for various forms of external authentication, but it's not fully baked. The intent is for a DBA to be able to see when a bunch of connections are piling up waiting for PAM/Kerberos/whatever. (I'm also motivated by my OAuth patchset, where there's a server-side plugin that we have no control over, and we'd want to be able to correctly point fingers at it if things go wrong.) = Open Issues, Idle Thoughts = Maybe it's wishful thinking, but it'd be cool if a misbehaving authentication exchange did not impact replicas in any way. Is there a way to make that opening transaction lighterweight? 0001 may be a little too much code. There are only two parts of pgstat_bestart() that need to be modified: omit the user ID, and fill in the state as 'authenticating' rather than unknown. I could just add the `pre_auth` boolean to the signature of pgstat_bestart() directly, if we don't mind adjusting all the call sites. We could also avoid changing the signature entirely, and just assume that we're authenticating if SessionUserId isn't set. That felt like a little too much global magic to me, though. Would anyone like me to be more aggressive, and create a pgstat entry as soon as we have the opening transaction? Or... as soon as a connection is made? 0002 is abusing the "IPC" wait event class. If the general idea seems okay, maybe we could add an "External" class that encompasses the general idea of "it's not our fault, it's someone else's"? I had trouble deciding how granular to make the areas that are covered by the new wait events. Ideally they would kick in only when we call out to an external system, but for some authentication types, that's a lot of calls to wrap. On the other extreme, we don't want to go too high in the call stack and accidentally nest wait events (such as those generated during pq_getmessage()). What I have now is not very principled. I haven't decided how to test these patches. Seems like a potential use case for injection points, but I think I'd need to preload an injection library rather than using the existing extension. Does that seem like an okay way to go? Thanks, --Jacob
0001-pgstat-report-in-earlier-with-STATE_AUTHENTICATING.patch
Description: Binary data
0002-WIP-report-external-auth-calls-as-wait-events.patch
Description: Binary data