On Tue, Apr 18, 2017 at 2:42 AM, Moreno Andreo <moreno.and...@evolu-s.it> wrote:
> Hi all,
> As many of you read last Friday (and many tried to help, too, and I
> still thank you very much), I had a bad service outage.
> I was advised to reduce the maximum number of connections using a
> pooler, and that's what I'm building in a test lab, but I'm wondering
> if there's something I can do with my overall architecture design.
> ATM we host one database per customer (about 400 now) and every
> customer has two points of access to the data:
> - Directly to the database, via rubyrep, to replicate the database he
> has on his own machine
> - Via WCF self-hosted web services, to read other customers' data
> Every customer can access (and replicate) his database from a number
> of different positions (max 3).
> Customers are organized in groups (max 10 per group), and there is
> the chance that someone accesses someone else's data via WCF.
> For example, take a group of 5: everyone running rubyrep with only
> one position enabled, and getting data from the others' databases.
> If I'm not wrong, that's 5 connections (rubyrep) plus 5*4 (everyone
> connecting to everyone else's database) for WCF, so 25 connections.
> Now imagine a group of 10: the same arithmetic gives 10 + 10*9 = 100
> connections....
> Last Friday I was told that 350 connections is quite a big number and
> things can begin to slow down. OK. When something slows down I'm used
> to searching for and finding the bottleneck (CPU, RAM, I/O, etc.). If
> everything was apparently running fine (CPU < 10%, RAM used < 20%,
> I/O rarely over 20%), how can I say there's a bottleneck that's
> slowing things down? Am I missing something?
> Another thing is that on a customer server (with a similar, smaller
> architecture) I _do_ have a connection leak problem that's under
> investigation, but when things begin to slow down I simply run
> pg_terminate_backend on all connections with an age > 10 min and
> everything goes back to normal. On my server, last Friday, it did not
> help, so I concluded that was not the main problem.
> I have no problem splitting this architecture across as many servers
> as I need, but I think I need some tips on how to design this, in
> order to avoid major issues in the near future (ask for details if
> needed).
>
> The current PostgreSQL 9.5.6 server is an 8-core VM with 52 GB RAM,
> running Debian 8.
> The WCF server is a 4-core Windows 2012 R2 machine with 16 GB RAM.
>
> While the issue was occurring, neither of them showed any kind of
> overload and their logs were clean.
>
> I'm a bit scared it can happen again.........

The logs being clean doesn't help much if your log settings are too
terse.

Is log_lock_waits on? log_checkpoints? track_io_timing (doesn't show
up in the logs, you have to query database views)? Is
log_min_duration_statement set to a reasonable value?
log_autovacuum_min_duration? Are you using pg_stat_statements (also
doesn't show up in the logs, you have to query it), and perhaps
auto_explain?

Cheers,

Jeff
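
P.S. A quick way to check where all of those settings stand at once
(just a sketch; trim or extend the list of names as needed):

    -- Show the current values of the logging/timing settings above.
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('log_lock_waits', 'log_checkpoints',
                   'track_io_timing', 'log_min_duration_statement',
                   'log_autovacuum_min_duration');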
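
If pg_stat_statements is installed, a first look is usually something
along these lines (column names are the 9.5 ones; newer releases
renamed total_time):

    -- Top 10 statements by cumulative execution time since the last
    -- statistics reset.
    SELECT calls, total_time, mean_time, query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;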
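
And since the "> 10 min" cleanup came up: that is typically a query of
this shape (a sketch, assuming the pid and backend_start columns of
pg_stat_activity on 9.2+; in practice you would filter more carefully
than by age alone):

    -- Terminate every backend older than 10 minutes, except our own.
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE backend_start < now() - interval '10 minutes'
      AND pid <> pg_backend_pid();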
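
As for the pooler you're building: pgbouncer in transaction-pooling
mode is one common way to cut 350 backends down to a few dozen. A
minimal sketch of the idea (values are illustrative, not a
recommendation):

    ; pgbouncer.ini -- hypothetical example, tune to your workload
    [databases]
    * = host=127.0.0.1 port=5432

    [pgbouncer]
    listen_port = 6432
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 10

Note that transaction pooling discards session state (prepared
statements, advisory locks, etc.), so the rubyrep connections may need
session pooling or a direct connection instead.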