Hi Stephen,

Thank you for acknowledging that the problem lies not only in the client system's design (I agree that having thousands of tables and thousands of active backends is not a good idea) but also in Postgres itself.

We have done more investigation and found one more problem in Postgres causing the "invalidation storm". There were some long-living transactions which prevented autovacuum from doing its work and removing dead tuples. So autovacuum was started again and again, and each time made no progress, but still updated statistics and therefore sent invalidation messages. autovacuum_naptime was set to 30 seconds, so every 30 seconds autovacuum processed a huge number of tables and generated a large number of invalidation messages, which quite soon caused overflow of the invalidation message buffers of backends performing long OLAP queries.

It makes me think about two possible optimizations:

1. Provide separate invalidation messages for relation metadata and for its statistics, so that a statistics update does not invalidate the relation cache.
The main problem with this proposal is that pg_class contains the relpages and reltuples columns, which are conceptually part of the relation's statistics but are stored in the relation cache. If a relation's statistics are updated, then most likely these fields have also changed, so we have to evict the relation
from the relation cache in any case.

2. Remember in the relation info the XID of the oldest active transaction at the moment of the last autovacuum. At the next autovacuum iteration, first compare this stored XID with the XID of the current oldest active transaction,
and bypass vacuuming this relation if the XID has not changed.
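A minimal sketch of this check in C, assuming a hypothetical per-relation field last_autovac_oldest_xid; the type and names are illustrative, not the actual Postgres API, and a real patch would also have to deal with XID wraparound and horizon comparison rather than plain equality:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;      /* simplified stand-in for Postgres' type */

typedef struct RelVacInfo
{
    TransactionId last_autovac_oldest_xid;  /* oldest running XID at last pass */
} RelVacInfo;

/* Return true if vacuuming this relation can be skipped: the oldest
 * active transaction is the same one seen last time, so no previously
 * recently-dead tuple can have become removable in the meantime. */
static bool
can_skip_autovacuum(RelVacInfo *rel, TransactionId current_oldest_xid)
{
    if (rel->last_autovac_oldest_xid == current_oldest_xid)
        return true;    /* horizon unchanged: vacuum would make no progress */

    rel->last_autovac_oldest_xid = current_oldest_xid;  /* remember for next time */
    return false;
}
```

Usage would look like: the first pass vacuums and records the horizon; while the same long transaction stays open, subsequent passes skip the relation (and send no invalidation messages); once the horizon advances, vacuuming resumes.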

Thoughts?

>> So, that's really the core of your problem.  We don't promise that
>> you can run several thousand backends at once.  Usually it's recommended
>> that you stick a connection pooler in front of a server with (at most)
>> a few hundred backends.
>
> Sure, but that doesn't mean things should completely fall over when we
> do get up to larger numbers of backends, which is definitely pretty
> common in larger systems.  I'm pretty sure we all agree that using a
> connection pooler is recommended, but if there's things we can do to
> make the system work at least a bit better when folks do use lots of
> connections, provided we don't materially damage other cases, that's
> probably worthwhile.

I also think that Postgres performance should degrade gracefully as the number of active backends increases. Actually, further investigation of this particular case showed that the large number of
database connections was caused by ... a Postgres slowdown.
During normal operation the number of active backends is a few hundred.
But the "invalidation storm" caused queries to hang, so the user application had to initiate more and more new connections to perform the required actions. Yes, this may not be the best application behavior in this case; at the very least it should first terminate the current connection using pg_terminate_backend. I just want to note that the large number of backends was not the root of the problem.


> Making them GUCs does seem like it's a few steps too far... but it'd be
> nice if we could arrange to have values that don't result in the system
> falling over with large numbers of backends and large numbers of tables.
> To get a lot of backends, you'd have to set max_connections up pretty
> high to begin with- perhaps we should contemplate allowing these values
> to vary based on what max_connections is set to?

I think that the optimal number of lock partitions should depend not on the number of connections
but on the number of available CPU cores, and thus the expected level of concurrency.
It is hard to propose a portable way to obtain this number.
This is why I think a GUC is the better solution.
Certainly I realize that it is a very dangerous parameter which should be changed with special care,
and not only because of MAX_SIMUL_LWLOCKS.

There are a few places in Postgres where it tries to lock all partitions (the deadlock detector, logical replication, ...). If there are thousands of partitions, then such a lock becomes too expensive and we get yet another popular Postgres problem: the "deadlock detection storm". Due to high contention between backends, a lock cannot be obtained within deadlock_timeout, which triggers deadlock detection. Simultaneous deadlock detection performed by all backends (each of which tries to take ALL partition locks) paralyzes the system (TPS falls to zero). A proposed patch for this problem was also rejected (once again, the problem can be reproduced only on a powerful server with a large number of cores).

