Some additional observations and food for thought. Our app uses connection
caching (Apache::DBI). By disabling Apache::DBI and forcing
client re-connection for every (http) request processed, I eliminated the
stall. User cpu usage jumped (mostly because prepared sql queries are no
longer available, plus some additional overhead on re-connection), but there
was not a single case of a high-sys-cpu stall.

I cannot completely rule out the possibility that some leftovers
(an unfinished transaction?) remain after serving an http request which, in
the absence of connection caching, are certainly discarded...

-- Vlad


On Mon, Nov 19, 2012 at 11:19 AM, Merlin Moncure <mmonc...@gmail.com> wrote:

>
> yeah.  interesting -- contention was much higher this time and that
> changes things.  strange how it was missed earlier.
>
> you're getting bounced around a lot in lwlock especially
> (unfortunately we don't know which one).  I'm going to hazard another
> guess: maybe the trigger here is that, when the number of contending
> backends exceeds some critical number (probably based on the number of
> cores), you see a quick cpu spike (causing more backends to lock and
> pile up) as cache line bouncing sets in.  That spike doesn't last
> long, because the spinlocks quickly accumulate delay counts then punt
> to the scheduler which is unable to cope.  The exact reason why this
> is happening to you in exactly this way (I've never seen it) is
> unclear.  Also the line between symptom and cause is difficult to
> draw.
>
> unfortunately, in your case spinlock re-scheduling isn't helping.  log
> entries like this one:
> 18764 [2012-11-19 10:43:50.124 CST] LOG:  JJ spin delay from file
> sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29
>
> suggest major problems.  you're dangerously close to a stuck
> spinlock which is lights out for the database.
>
> merlin
>
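For context on "spinlocks quickly accumulate delay counts then punt to the scheduler", here is a rough sketch of that spin-then-sleep backoff scheme. It is loosely modeled on PostgreSQL's s_lock.c, but the constants and names below are illustrative, not the backend's real values: spin a while, then sleep with randomized exponential backoff, counting delays; the "delay 212" in the log line above is that accumulated count, and exceeding a hard cap is what gets reported as a stuck spinlock.

```python
import random
import time

# Illustrative constants, loosely modeled on PostgreSQL's s_lock.c
# (the real values and names differ; this is a sketch, not the backend code)
SPINS_PER_DELAY = 100      # busy-spin attempts before yielding to the scheduler
MIN_DELAY_SEC = 0.001      # first sleep: 1 ms
MAX_DELAY_SEC = 1.0        # backoff is capped
NUM_DELAYS_PANIC = 1000    # past this, report a "stuck spinlock" and give up

def spin_acquire(try_lock):
    """Spin briefly; on contention, sleep with randomized exponential backoff.

    Returns the accumulated delay count -- the number a log line like
    "spin delay ... delay 212" is reporting.
    """
    delays = 0
    cur_delay = 0.0
    while True:
        for _ in range(SPINS_PER_DELAY):
            if try_lock():
                return delays
        delays += 1
        if delays > NUM_DELAYS_PANIC:
            # in the backend this is fatal -- "lights out for the database"
            raise RuntimeError("stuck spinlock detected")
        if cur_delay == 0.0:
            cur_delay = MIN_DELAY_SEC
        time.sleep(cur_delay)  # punt to the scheduler
        # roughly double the delay each time, with jitter, up to the cap
        cur_delay = min(cur_delay * (1.5 + random.random()), MAX_DELAY_SEC)
```

Once many backends are in the sleeping branch at once, wakeups pile onto an already contended lock, which matches the described spiral where the scheduler "is unable to cope".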
