On 09.07.2020 19:19, Nikolay Samokhvalov wrote:
Hi Konstantin, a silly question: do you consider your workload
well-optimized? Can it be optimized further? Reading this thread I
have a strong feeling that a very basic set of regular optimization
actions is missing here (or is not explained): query analysis and
optimization based on pg_stat_statements (and maybe pg_stat_kcache),
some method to analyze the state of the server in general, resource
consumption, etc.
Do you have some monitoring that covers pg_stat_statements?
Before looking under the hood, I would use multiple pg_stat_statements
snapshots (can be analyzed using, say, postgres-checkup or pgCenter)
to understand the workload and identify the heaviest queries -- first
of all, in terms of total_time, calls, shared buffer reads/hits, and
temporary file generation. Which query groups are Top-N in each
category -- have you looked at that?
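For example, a first pass could look something like this (a rough
sketch, assuming the pg_stat_statements extension is installed; the
column is total_time before PG13 and total_exec_time from PG13 on):

    -- Top 10 query groups by total execution time; also shows the other
    -- dimensions worth ranking by: calls, buffer reads/hits, temp files.
    SELECT queryid,
           calls,
           round(total_time::numeric, 1) AS total_ms,
           shared_blks_hit,
           shared_blks_read,
           temp_blks_written,
           left(query, 60) AS query_snippet
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;

Running it at the start and end of an interval and diffing the counters
gives the per-interval snapshots that postgres-checkup and pgCenter
automate.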
You mentioned some crazy numbers for the planning time, but why not
analyze the picture holistically and look at the overall numbers? For
the queries whose planning time has increased, what is their share of
total_time in the overall picture, in %? (Unfortunately, we cannot see
Top-N by planning time in pg_stat_statements until PG13, but that
doesn't mean we cannot get a good understanding of the overall picture
today, it just requires more work).
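For completeness, on PG13+ the new planning columns make Top-N by
planning time direct; a sketch, assuming pg_stat_statements.track_planning
is enabled:

    -- PG13+ only: planning is tracked separately from execution.
    SELECT queryid,
           calls,
           round(total_plan_time::numeric, 1) AS plan_ms,
           round(total_exec_time::numeric, 1) AS exec_ms,
           round((100 * total_plan_time
                  / nullif(total_plan_time + total_exec_time, 0))::numeric, 1)
                                              AS plan_pct
    FROM pg_stat_statements
    ORDER BY total_plan_time DESC
    LIMIT 10;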
If workload analysis & optimization was done holistically already, or
not possible due to some reason — pardon me. But if not and if your
primary goal is to improve this particular setup ASAP, then the topic
could be started in the -performance mailing list first, discussing
the workload and its aspects, and only after it's done, raised in
-hackers. No?
Certainly, both we and the customer have done workload analysis &
optimization. It is not a problem of particular queries, bad plans,
resource exhaustion, ...
Unfortunately there are many scenarios in which Postgres does not
degrade gracefully as the workload increases, but instead suffers a
"snow avalanche", where negative feedback causes very fast paralysis of
the system.
This case is just one of these scenarios. It is hard to say for sure
what triggers the avalanche... a long-lived transaction, a huge number
of tables, aggressive autovacuum settings... But there is a cascade of
negative events which causes a system that normally functions for
months to stop working at all.
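As an aside, the first of those triggers is easy to spot from SQL; a
minimal sketch (the 5-minute threshold is an arbitrary example):

    -- Transactions open longer than 5 minutes: they hold back the xmin
    -- horizon and keep autovacuum generating work.
    SELECT pid, usename, state,
           now() - xact_start AS xact_age,
           left(query, 60)    AS current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '5 minutes'
    ORDER BY xact_age DESC;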
In this particular case we have the following chain:
- a long-lived transaction causes autovacuum to send a lot of
invalidation messages
- these messages overflow the invalidation message queue, forcing
backends to invalidate their caches and reload from the catalog
- the too-small fast-path lock cache causes many concurrent accesses
to the shared lock hash
- contention on the lock-partition LWLocks, caused by the small number
of lock partitions, causes starvation (a diagnostic sketch for these
last two steps follows below)
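For reference, the last two steps are visible from SQL while the
avalanche is happening; a rough diagnostic sketch (the wait event is
named lock_manager before PG13 and LockManager from PG13 on; 16 is the
per-backend fast-path slot count in current releases):

    -- Backends waiting on LWLocks; the lock-manager partition locks
    -- showing up at the top is the starvation symptom.
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE wait_event_type = 'LWLock'
    GROUP BY 1, 2
    ORDER BY count(*) DESC;

    -- Locks per backend that missed the 16 fast-path slots and had to
    -- go through the shared lock hash.
    SELECT pid, count(*) FILTER (WHERE NOT fastpath) AS non_fastpath_locks
    FROM pg_locks
    GROUP BY pid
    ORDER BY non_fastpath_locks DESC
    LIMIT 10;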