Hi,

I am running Postgres 13 on CentOS 7, installed from the yum.postgresql.org repo.

I have 3 servers in different sites, with asynchronous replication managed by 
pgpool 4.0.11. The highest latency between sites is 16ms RTT, which shouldn’t 
matter as replication is asynchronous. Because the queries are typically very 
fast, my application runs many concurrent connections to keep throughput up 
when latency is high.

Servers have 32GB RAM and 4 cores.

max_connections = 256
max_locks_per_transaction = 256
max_stack_depth = 2MB
max_wal_senders = 5
max_wal_size = 1GB
min_wal_size = 80MB
shared_buffers = 10GB

I am storing internet usage data and attributes from RADIUS messages.
Our ingest does not come directly from the RADIUS server - instead it comes via 
RabbitMQ, which allows database performance to drop without impacting our 
RADIUS services.

I am making pretty heavy use of partitions to manage this. The data usage 
accounting is stored in a partition per hour, which runs through daily 
aggregation - usage older than 3 days is aggregated into usage per day, rather 
than usage per hour. At that point, we have a partition per day rather than per 
hour.
I am doing this because we want hourly usage for 3 days, and RADIUS data comes 
in as a running total rather than a delta per interval, and we want the deltas 
available. For each incoming record, we look at the previous hours to find the 
previous total and calculate a delta from that - this means we are regularly 
(hundreds of times per second) looking at the last 2 hours. Partitioning per 
hour lets us manage that.
I partition on an “interval_start” timestamp column.
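
To make that concrete, here is a minimal sketch of roughly what the hourly 
layout and the delta lookup look like - table and column names are 
illustrative, not our actual schema:

CREATE TABLE usage_hourly (
    session_id     bigint      NOT NULL,
    interval_start timestamptz NOT NULL,
    bytes_total    bigint      NOT NULL,  -- running total as reported by RADIUS
    bytes_delta    bigint                 -- delta we calculate against prior hours
) PARTITION BY RANGE (interval_start);

-- one partition per hour, created ahead of time by a scheduled job
CREATE TABLE usage_hourly_2020120100 PARTITION OF usage_hourly
    FOR VALUES FROM ('2020-12-01 00:00+00') TO ('2020-12-01 01:00+00');

-- delta calculation: find the most recent prior total for a session within
-- the last 2 hours; partition pruning keeps this to the last ~2 partitions
SELECT bytes_total
FROM usage_hourly
WHERE session_id = $1
  AND interval_start >= now() - interval '2 hours'
ORDER BY interval_start DESC
LIMIT 1;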

The aggregation process which runs each night at some time after midnight does 
the following, in a transaction:
1) Create a new “full day” partition for 3 days ago.
2) Look at the hourly data older than 3 days (so, across 24 partitions), 
calculate per-session totals for that day, and insert them into the 
“full day” table.
3) Detach the hourly partitions.
4) Attach the full day partition.
5) Drop the hourly partitions.
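
In rough SQL terms, the nightly job does something like the following - the 
statements and names here are an illustrative sketch, not the exact code we run:

BEGIN;

-- 1) new "full day" table for 3 days ago, not yet attached as a partition
CREATE TABLE usage_daily_20201128 (LIKE usage_hourly INCLUDING ALL);

-- 2) aggregate the 24 hourly partitions for that day into it
INSERT INTO usage_daily_20201128 (session_id, interval_start, bytes_total, bytes_delta)
SELECT session_id,
       date_trunc('day', interval_start),
       max(bytes_total),
       sum(bytes_delta)
FROM usage_hourly
WHERE interval_start >= '2020-11-28 00:00+00'
  AND interval_start <  '2020-11-29 00:00+00'
GROUP BY session_id, date_trunc('day', interval_start);

-- 3) detach the hourly partitions, one statement per hour
ALTER TABLE usage_hourly DETACH PARTITION usage_hourly_2020112800;
-- ... repeated for the other 23 hours ...

-- 4) attach the full day partition covering the same range
ALTER TABLE usage_hourly ATTACH PARTITION usage_daily_20201128
    FOR VALUES FROM ('2020-11-28 00:00+00') TO ('2020-11-29 00:00+00');

-- 5) drop the detached hourly partitions
DROP TABLE usage_hourly_2020112800;
-- ... repeated for the other 23 hours ...

COMMIT;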

This process takes around 12s to run - aggregating around 16.8M rows into 700k 
rows.
Each partition is around 70MB.

Throughout this process, data continues to be ingested - but as the data being 
aggregated is 3+ days old, and the ingest process is only looking at current 
partitions (i.e. last couple hours), we don’t have any conflicts here.



Additionally (and I think less importantly, so skip this if it’s getting 
long-winded), we store all RADIUS attributes from the RADIUS messages as jsonb, 
with deduplication - most RADIUS messages (apart from the usage, session 
timers, etc.) are static, so we have a table of those “static” attributes in a 
jsonb column, and a table of events matching timestamps to sets of RADIUS 
messages.
These tables also make use of partitioning - we have hot, cold, and frozen data 
so that only the hot and cold indexes are in memory most of the time - the 
“frozen” data is very large (100s of GB) so is only used for reporting.
There are a couple of processes which run over this data periodically:
1) Move data from hot -> cold -> frozen as it ages
2) Aggregate “events” (i.e. RADIUS message timestamp to attributes) together so 
that longer term we don’t have a row per message, but rather only a row each 
time the attributes for a RADIUS session change. This means there are always 
dead rows in this table, but they regularly get re-used.
This appears to work very well; these processes run every 5 mins or so. The 
event aggregation process drops around 180k rows.
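
For reference, a minimal sketch of the deduplicated attribute/event layout 
described above - again with illustrative names only, not our actual schema:

-- deduplicated "static" attribute sets, keyed by a hash of the jsonb
CREATE TABLE radius_attribute_sets (
    attrs_hash bytea PRIMARY KEY,   -- e.g. a digest of the canonical jsonb
    attrs      jsonb NOT NULL
);

-- events map a message timestamp (and session) to one attribute set
CREATE TABLE radius_events (
    session_id bigint      NOT NULL,
    event_time timestamptz NOT NULL,
    attrs_hash bytea       NOT NULL REFERENCES radius_attribute_sets (attrs_hash)
) PARTITION BY RANGE (event_time);  -- hot, cold and frozen data live in separate partitions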



The issue I am having is that when the daily data usage aggregation runs, we 
sometimes have a big performance impact, with the following characteristics, 
which happen *after* the aggregation job completes in its usual fast time of 
12s or so:
- The aggregation runs fast as per normal
- Load on the server goes to 30-40 - recall we have a quite high 
max_connections to keep throughput high when the client is far (16ms) from the 
server
- IOWait drops to ~0% (it’s usually around 1-2%), but actual disk IO rates 
seem approximately normal
- User and System CPU increase to a total of 100% - ~86% and ~14% respectively
- Processing time for RADIUS messages increases, and a big processing backlog 
builds
- Swap is not used - it is enabled, but there is very little swap IO
- Memory usage does not change

If I stop the ingest process briefly and then start it up again, the problem 
goes away - the database server drops to 0% CPU, and after starting the ingest 
process again the backlog clears very rapidly and performance returns to normal.


This happens maybe once or twice a week - it’s not every day. It’s not on 
specific days each week.
There is a vague correlation with other aggregation jobs (i.e. event 
aggregation mentioned above) running immediately after the daily data usage 
aggregation. Only one of these jobs runs at a time - so if another scheduled 
job is due to run, it runs immediately after whatever is already running 
finishes.


I am wondering if there’s some sort of problem where we drop all these 
partitions, and Postgres needs to do some work internally to free the space up 
or something, but has a hard time doing so with all the updates going on?
I am not clear why this only happens some days - I am working on seeing if I 
can firm up (or rule out) the correlation with other aggregation jobs running 
immediately afterwards.


Can anyone recommend some things to look at here? I’ve got quite a lot of 
metrics collected every minute - per-table IO (i.e. hit/read), index sizes, 
table sizes, etc. - however everything there seems “normal” given the slow 
ingest rate when the issue occurs, so it’s hard to differentiate between cause 
and symptom in those metrics.


I have bumped up effective_cache_size from the default of 4GB to 16GB since 
this last happened - but given IO doesn’t appear to be an issue, I don’t think 
it will have much effect.
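
For reference, the change itself is just along these lines - 
effective_cache_size can be changed with a configuration reload, no restart 
needed:

ALTER SYSTEM SET effective_cache_size = '16GB';
SELECT pg_reload_conf();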

--
Nathan Ward
