We do record perf data. For each incident we've had, the data looks about
the same. Unfortunately, I can't read much into it beyond the fact that
we're getting stuck on a spinlock. But why, and on what?


### from perf report

 53.28%       postmaster  postgres               [.] s_lock
  6.22%       postmaster  postgres               [.] 0x00000000001b4306
  2.30%       postmaster  [kernel.kallsyms]      [k] _spin_lock
  2.23%       postmaster  postgres               [.] LWLockAcquire
  1.79%       postmaster  postgres               [.] LWLockRelease
  1.39%       postmaster  postgres               [.] hash_search_with_hash_value
  1.20%       postmaster  postgres               [.] SearchCatCache
  0.71%       postmaster  [kernel.kallsyms]      [k] hrtimer_interrupt
  0.56%       postmaster  [kernel.kallsyms]      [k] tick_program_event
  0.54%       postmaster  [kernel.kallsyms]      [k] schedule
  0.44%       postmaster  [kernel.kallsyms]      [k] _spin_lock_irq


### Then zooming in on s_lock

# Annotate s_lock
...
 99.04 │       test   %al,%al
...
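If I'm reading the annotation right, nearly all of the s_lock time sits on
that one test instruction, which is what a test-and-set spin loop looks
like: the CPU just keeps re-checking the lock byte. As a rough illustration
only (a simplified plain-C sketch, not PostgreSQL's actual s_lock, which
adds spin delays and eventually sleeps), the hot instruction would
correspond to the re-test in a loop like this:

#include <stdbool.h>

typedef volatile unsigned char slock_t;

/* Atomically set the lock byte to 1 and return its previous value
 * (illustrative only; PostgreSQL uses its own per-platform TAS macros). */
static bool
tas(slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1) != 0;
}

static void
spin_lock(slock_t *lock)
{
    while (tas(lock))
    {
        /* Busy-wait until the byte looks free, then retry the atomic TAS.
         * While contended, nearly all CPU time goes to re-testing the
         * fetched byte -- the "test %al,%al" hot spot in the annotation. */
        while (*lock)
            ;   /* the real s_lock inserts delays and eventually sleeps */
    }
}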


# Zoom into postmaster(81487) thread

 55.84%  postmaster  postgres               [.] s_lock
  6.52%  postmaster  postgres               [.] 0x00000000001b4306
  2.42%  postmaster  [kernel.kallsyms]      [k] _spin_lock
  2.34%  postmaster  postgres               [.] LWLockAcquire
  1.87%  postmaster  postgres               [.] LWLockRelease
  1.46%  postmaster  postgres               [.] hash_search_with_hash_value
  1.26%  postmaster  postgres               [.] SearchCatCache
  0.75%  postmaster  [kernel.kallsyms]      [k] hrtimer_interrupt
  0.59%  postmaster  [kernel.kallsyms]      [k] tick_program_event
  0.57%  postmaster  [kernel.kallsyms]      [k] schedule


# Zoom into postgres DSO

 65.75%  postmaster  [.] s_lock
  7.68%  postmaster  [.] 0x00000000001b4306
  2.75%  postmaster  [.] LWLockAcquire
  2.20%  postmaster  [.] LWLockRelease
  1.72%  postmaster  [.] hash_search_with_hash_value
  1.49%  postmaster  [.] SearchCatCache
  0.54%  postmaster  [.] _bt_compare
  0.51%  postmaster  [.] _bt_checkkeys
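The LWLockAcquire/LWLockRelease entries would fit the same picture: as far
as I understand the 9.x code, each LWLock guards its internal state with a
spinlock, and s_lock is the slow path entered when the initial test-and-set
on that spinlock fails, so contention on a few hot LWLocks can surface as
s_lock time. Very roughly (again a simplified sketch with hypothetical
names, not the actual PostgreSQL structures):

#include <stdbool.h>

typedef volatile unsigned char slock_t;

/* Same spin idea as the s_lock sketch above. */
static void
spin_lock(slock_t *lock)
{
    while (__sync_lock_test_and_set(lock, 1) != 0)
        while (*lock)
            ;   /* the real code adds spin delays and eventually sleeps */
}

typedef struct
{
    slock_t mutex;      /* spinlock guarding the fields below */
    bool    exclusive;  /* held exclusively? */
    int     shared;     /* number of shared holders */
} lwlock_sketch;

static void
lwlock_acquire_sketch(lwlock_sketch *lock)
{
    /* Take the per-lock spinlock first; under contention this is where
     * the s_lock-style spin loop gets entered. */
    spin_lock(&lock->mutex);

    /* ... examine/update exclusive/shared, queue the backend if busy ... */

    __sync_lock_release(&lock->mutex);  /* release the spinlock */
}

That still leaves the question of which lock is actually hot, which the
flat profile alone doesn't show.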


On 19 June 2014 15:57, Merlin Moncure <mmonc...@gmail.com> wrote:

> On Thu, Jun 19, 2014 at 5:12 PM, Borislav Ivanov <biva...@atlassian.com>
> wrote:
> > However, most people on our team think that the number of connections is
> > purely a symptom of the actual problem. We would love to be wrong about
> > this. But for now we feel the high number of connections contributes to
> > preserving the problem rather than actually triggering it.
>
> This is entirely correct.  pgbouncer is not about preventing database load
> but about limiting damage when it occurs.  This is generally necessary in
> environments where application servers keep piling on connections when
> the database is not clearing queries fast enough.
>
> In your case user% is dominating system load.  Along with the high
> context-switch (cs) count, this really suggests spinlock contention.  A
> 'perf top' is essential for identifying the culprit.  It's very possible
> that 9.4 will fix your problem...see:
>
> http://postgresql.1045698.n5.nabble.com/Cpu-usage-100-on-slave-s-lock-problem-td5768655.html
>
> There was some poorly optimized code in the WAL replay.
>
> merlin
>
