On 09.07.2020 19:19, Nikolay Samokhvalov wrote:
Hi Konstantin, a silly question: do you consider the workload you have as well-optimized? Can it be optimized further? Reading this thread I have a strong feeling that a very basic set of regular optimization actions is missing here (or not explained): query analysis and optimization based on pg_stat_statements (and, maybe pg_stat_kcache), some method to analyze the state of the server in general, resource consumption, etc.

Do you have some monitoring that covers pg_stat_statements?

Before looking under the hood, I would use multiple pg_stat_statements snapshots (can be analyzed using, say, postgres-checkup or pgCenter) to understand the workload and identify the heaviest queries -- first of all, in terms of total_time, calls, shared buffers reads/hits, temporary files generation. Which query groups are Top-N in each category, have you looked at it?

You mentioned some crazy numbers for the planning time, but why not to analyze the picture holistically and see the overall numbers? Those queries that have increased planning time, what their part of total_time, on the overall picture, in %? (Unfortunately, we cannot see Top-N by planning time in pg_stat_statements till PG13, but it doesn't mean that we cannot have some good understanding of overall picture today, it just requires more work).

If workload analysis & optimization was done holistically already, or not possible due to some reason — pardon me. But if not and if your primary goal is to improve this particular setup ASAP, then the topic could be started in the -performance mailing list first, discussing the workload and its aspects, and only after it's done, raised in -hackers. No?

Certainly, both we and customer has made workload analysis & optimization.
It is not a problem of particular queries, bad plans, resource exhaustion,...

Unfortunately there many scenarios when Postgres demonstrates not gradual degrade of performance with increasing workload, but "snow avalanche" whennegative feedback cause very fastparalysis of the system.

This case is just one if this scenarios. It is hard to say for sure what triggers the avalanche... Long living transaction, huge number of tables, aggressive autovacuum settings... But there is cascade of negative events which cause system which normally function for months to stop working at all.

In this particular case we have the following chain:

- long living transaction cause autovacuum to send a lot of invalidation message - this messages cause overflow of invalidation message queues, forcing backens to invalidate their caches and reload from catalog. - too small value of fastpath lock cache cause many concurrent accesses to shared lock hash - contention for LW-lock caused by small number of lock partition cause starvation



Reply via email to