We do record perf data, and for each incident we've had, the data looks about the same. Unfortunately, I can't read much into it beyond the fact that we're stuck on a spinlock. But why, and with what?
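For context on the numbers below: s_lock is PostgreSQL's spinlock wait routine. Here is a minimal sketch of a test-and-test-and-set spin loop in that spirit (illustrative only, not PostgreSQL's actual s_lock, which backs off with a randomized sleep). Contended waiters spend essentially all of their time re-checking the lock byte, which is consistent with 99% of the s_lock samples landing on the test %al,%al instruction in the annotate output.

/* Illustrative sketch only -- not PostgreSQL source.  A contended
 * test-and-test-and-set spinlock: waiters loop re-reading the lock
 * byte, so profile samples pile up on the test of that byte. */
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_bool locked;
} spinlock_t;

static void
spin_lock(spinlock_t *lock)
{
    int spins = 0;

    for (;;)
    {
        /* TAS: attempt to take the lock with an atomic exchange. */
        if (!atomic_exchange_explicit(&lock->locked, true,
                                      memory_order_acquire))
            return;

        /* Lock is held by someone else: spin, re-testing the byte.
         * This is where every contended waiter burns its CPU time. */
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed))
        {
            if (++spins > 1000)
            {
                sched_yield();  /* PostgreSQL instead sleeps for an
                                 * increasing, randomized delay */
                spins = 0;
            }
        }
    }
}

static void
spin_unlock(spinlock_t *lock)
{
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}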
### from perf report

 53.28%  postmaster  postgres            [.] s_lock
  6.22%  postmaster  postgres            [.] 0x00000000001b4306
  2.30%  postmaster  [kernel.kallsyms]   [k] _spin_lock
  2.23%  postmaster  postgres            [.] LWLockAcquire
  1.79%  postmaster  postgres            [.] LWLockRelease
  1.39%  postmaster  postgres            [.] hash_search_with_hash_value
  1.20%  postmaster  postgres            [.] SearchCatCache
  0.71%  postmaster  [kernel.kallsyms]   [k] hrtimer_interrupt
  0.56%  postmaster  [kernel.kallsyms]   [k] tick_program_event
  0.54%  postmaster  [kernel.kallsyms]   [k] schedule
  0.44%  postmaster  [kernel.kallsyms]   [k] _spin_lock_irq

### Then zooming in on s_lock

# Annotate s_lock
  ...
  99.04 │ test   %al,%al
  ...

# Zoom into postmaster(81487) thread
 55.84%  postmaster  postgres            [.] s_lock
  6.52%  postmaster  postgres            [.] 0x00000000001b4306
  2.42%  postmaster  [kernel.kallsyms]   [k] _spin_lock
  2.34%  postmaster  postgres            [.] LWLockAcquire
  1.87%  postmaster  postgres            [.] LWLockRelease
  1.46%  postmaster  postgres            [.] hash_search_with_hash_value
  1.26%  postmaster  postgres            [.] SearchCatCache
  0.75%  postmaster  [kernel.kallsyms]   [k] hrtimer_interrupt
  0.59%  postmaster  [kernel.kallsyms]   [k] tick_program_event
  0.57%  postmaster  [kernel.kallsyms]   [k] schedule

# Zoom into postgres DSO
 65.75%  postmaster  [.] s_lock
  7.68%  postmaster  [.] 0x00000000001b4306
  2.75%  postmaster  [.] LWLockAcquire
  2.20%  postmaster  [.] LWLockRelease
  1.72%  postmaster  [.] hash_search_with_hash_value
  1.49%  postmaster  [.] SearchCatCache
  0.54%  postmaster  [.] _bt_compare
  0.51%  postmaster  [.] _bt_checkkeys

On 19 June 2014 15:57, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Thu, Jun 19, 2014 at 5:12 PM, Borislav Ivanov <biva...@atlassian.com>
> wrote:
> > However, most people on our team think that the number of connections is
> > purely a symptom of the actual problem. We would love to be wrong about
> > this. But for now we feel the high number of connections contributes to
> > preserving the problem, but it's not actually triggering the problem.
>
> This is entirely correct. pgbouncer is not about preventing database load
> but about limiting damage when it occurs. This is generally necessary in
> environments where application servers keep piling on connections when
> the database is not clearing queries fast enough.
>
> In your case user% is dominating system load. Along with the high cs
> this really suggests spinlock contention. A 'perf top' is essential for
> identifying the culprit. It's very possible that 9.4 will fix your
> problem; see:
>
> http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html
>
> There was some poorly optimized code in the WAL replay.
>
> merlin