On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <bri...@consistentstate.com>wrote:
> Hi all, need some help/clues on tracking down a performance issue. > > PostgreSQL version: 8.3.11 > > I've got a system that has 32 cores and 128 gigs of ram. We have connection > pooling set up, with about 100 - 200 persistent connections open to the > database. Our applications then use these connections to query the database > constantly, but when a connection isn't currently executing a query, it's > <IDLE>. On average, at any given time, there are 3 - 6 connections that are > actually executing a query, while the rest are <IDLE>. > Remember, when you read pg_stat_activity, it is showing you query activity from that exact specific moment in time. Just because it looks like only 3-6 connections are executing, doesn't mean that 200 aren't actually executing < .1ms statements. With such a beefy box, I would see if you can examine any stats from your connection pooler to find out how many connections are actually getting used. > > About once a day, queries that normally take just a few seconds slow way > down, and start to pile up, to the point where instead of just having 3-6 > queries running at any given time, we get 100 - 200. The whole system comes > to a crawl, and looking at top, the CPU usage is 99%. > > Looking at top, I see no SWAP usage, very little IOWait, and there are a > large number of postmaster processes at 100% cpu usage (makes sense, at this > point there are 150 or so queries currently executing on the database). > > Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie > Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si, > 0.2%st > Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers > Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached > > > In the past, we noticed that autovacuum was hitting some large tables at > the same time this happened, so we turned autovacuum off to see if that was > the issue, and it still happened without any vacuums running. > That was my next question :) > > We also ruled out checkpoints being the cause. > > How exactly did you rule this out? Just because a checkpoint is over doesn't mean that it hasn't had a negative effect on the OS cache. If you're stuck going to disk, that could be hurting you (that being said, you do point to a low I/O wait above, so you're probably correct in ruling this out). > > I'm currently digging through some statistics I've been gathering to see if > traffic increased at all, or remained the same when the slowdown occurred. > I'm also digging through the logs from the postgresql cluster (I increased > verbosity yesterday), looking for any clues. Any suggestions or clues on > where to look for this to see what can be causing a slowdown like this would > be greatly appreciated. > > Are you capturing table-level stats from pg_stat_user_[tables | indexes]? Just because a server doesn't look busy doesn't mean that you're not doing 1000 index scans per second returning 1000 tuples each time. --Scott > Thanks, > - Brian F > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/**mailpref/pgsql-general<http://www.postgresql.org/mailpref/pgsql-general> >