Aurora Postgres version 17.4
This application mainly runs 3 queries very frequently- around 12000
executions per second across the 3. Normally the server handles this fine and
at any given time there are between 0 - 10 active sessions running the queries.
The 3 queries are tuned- doing 3-7 logical reads per execute and run in < .1
millisecond.
Periodically there will be a huge spike in active sessions up to 400+ sessions
all on CPU (it can be higher depending on how many app servers are in the mix
at the time). This spike will last 2-5 seconds but it causes an increase in
app response time which is a problem.
We snap info from pg_stat_activity (where state <> ‘idle' ) every 1 second for
troubleshooting issues and a couple observations from that snap data when this
spike happens:
1.
The app sessions will show as wait_event_type and wait_event as null which I
understand means they are on CPU.
2.
Some of those sessions will not have queryid populated. But all the app
sessions do is execute queries.
3.
With aurora Postgres they store plan information in pg_stat_activity in planid
column. For some sessions the queryid will be populated but the planid will
not be populated. According to AWS this can happen if the session has
completed parsing the query but has not finished plan generation or sessions
are in the bind phase where parameter binding occurs before final plan
selection.
With Aurora Postgres we do not have access to the server to look at the unix
processes but we enabled enhanced monitoring in 1 second intervals which does
provide a little info on session processes. One thing that stood out when the
active session spike happens if I looked at the process list (it only shows ~
95 of the processes) I would see a bunch like this:
"vss": 640900784,
"name": "postgres: prcoffrcnsmruser prcoffrcnsmr
172.29.145.159(53026) BIND",
"tgid": 26247,
"parentID": 584,
"memoryUsedPc": 0.02,
"cpuUsedPc": 0.06,
"id": 26247,
"rss": 30496,
"vmlimit": “unlimited"
Where they had BIND next to the process info (vs others will show things like
idle or SELECT). In this case of the 95 processes shown it may have shown 59
of them were idle, 2 were SELECT, 2 were PARSE, and 32 were BIND.
So based on these processes showing BIND and pg_stat_activity either not
showing a queryid or planid if it did show a queryid it seems these sessions
may be getting hung up parsing the queries or specifically at the step it does
binding.
Can anyone come up with any theories what may be going on here or is my
description/explanation above too vague? The app is using prepared statements.
The connection pool is HikariCP which does not support caching prepared
statements. The thing that throws me off is so far I’m unable to see anything
the app is doing to cause this. I don’t see a spike in query activity. I
don’t see a spike in new connections from the pools. I don’t see anything
leading up to this other than this spike in sessions out of nowhere. I don’t
see any resource issues on the server. It will run this same load consistently
throughout the day and the active session spike may happen 0 - 10 times in a
day.
And ideas would be appreciated.