Some additional observations and food for thought. Our app uses connection caching (Apache::DBI). By disabling Apache::DBI and forcing the client to re-connect on every (http) request processed, I eliminated the stall. User CPU usage jumped (mostly because prepared SQL queries are no longer available, plus some additional overhead on re-connection), but there was not a single case of a high-sys-cpu stall.
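To make the trade-off concrete, here is a minimal sketch (in Python, with an illustrative stand-in Connection class rather than the real Apache::DBI/DBD::Pg machinery) of what per-worker connection caching versus per-request reconnection implies for session state:

```python
# Illustrative sketch: a per-worker connection cache (roughly what
# Apache::DBI does for Perl DBI handles) versus reconnecting on every
# request. Connection is a hypothetical stand-in, not a real driver.

class Connection:
    def __init__(self, dsn):
        self.dsn = dsn
        self.prepared = {}          # server-side prepared statements live
                                    # only as long as the connection does

    def prepare(self, name, sql):
        self.prepared[name] = sql   # pay the parse/plan cost once

    def close(self):
        self.prepared.clear()       # leftover session state (open
                                    # transaction, prepared statements)
                                    # is discarded with the connection

_cache = {}

def cached_connect(dsn):
    # Apache::DBI-style: one persistent handle per worker process,
    # reused across HTTP requests; leftover session state survives.
    if dsn not in _cache:
        _cache[dsn] = Connection(dsn)
    return _cache[dsn]

def fresh_connect(dsn):
    # Caching disabled: a new handle per request. More connect overhead
    # and no reusable prepared statements, but a clean session each time.
    return Connection(dsn)

conn_a = cached_connect("dbname=app")
conn_b = cached_connect("dbname=app")
assert conn_a is conn_b             # same handle reused across requests

conn_c = fresh_connect("dbname=app")
conn_d = fresh_connect("dbname=app")
assert conn_c is not conn_d         # fresh handle every request
```

The clean-session-per-request property is exactly what makes leftovers (like an unfinished transaction) impossible to carry over once caching is disabled.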
I cannot completely rule out the possibility that some leftovers (an unfinished transaction?) remain after serving an http request, which, in the absence of connection caching, are discarded for sure....

-- Vlad

On Mon, Nov 19, 2012 at 11:19 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
>
> yeah. interesting -- contention was much higher this time and that
> changes things. strange how it was missed earlier.
>
> you're getting bounced around a lot in lwlock especially
> (unfortunately we don't know which one). I'm going to hazard another
> guess: maybe the trigger here is when the number of contending
> backends exceeds some critical number (probably based on the number of
> cores) you see a quick cpu spike (causing more backends to lock and
> pile up) as cache line bouncing sets in. That spike doesn't last
> long, because the spinlocks quickly accumulate delay counts then punt
> to the scheduler which is unable to cope. The exact reason why this
> is happening to you in exactly this way (I've never seen it) is
> unclear. Also the line between symptom and cause is difficult to
> draw.
>
> unfortunately, in your case spinlock re-scheduling isn't helping. log
> entries like this one:
>
> 18764 [2012-11-19 10:43:50.124 CST] LOG: JJ spin delay from file
> sinvaladt.c line 512 delay 212, pointer 0x7f514959a394 at character 29
>
> are suggesting major problems. you're dangerously close to a stuck
> spinlock which is lights out for the database.
>
> merlin
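For readers following the quoted spin-delay discussion: the "delay 212" count refers to how many times a spinlock gave up busy-waiting and punted to the scheduler. Below is a simplified model of that spin-then-sleep backoff (in Python; the constants and function names are illustrative, not PostgreSQL's actual s_lock implementation):

```python
import time

# Simplified model of spin-then-sleep backoff: spin in user space a
# bounded number of times, then hand the CPU back to the scheduler with
# an escalating sleep. If the delay count grows past a limit, report a
# "stuck spinlock". All constants here are illustrative assumptions.

SPINS_BEFORE_SLEEP = 100    # busy-wait iterations between sleeps
MIN_DELAY_USEC = 1000       # first sleep: 1 ms
MAX_DELAY_USEC = 1000000    # sleep cap: 1 s
MAX_DELAYS = 1000           # past this, declare the spinlock stuck

def acquire(try_lock):
    """Spin on try_lock() -> bool; return the number of scheduler
    delays ('punts') taken before the lock was acquired."""
    delays = 0
    cur_delay = 0
    spins = 0
    while not try_lock():
        spins += 1
        if spins < SPINS_BEFORE_SLEEP:
            continue                      # keep busy-waiting
        spins = 0
        delays += 1                       # this is the logged delay count
        if delays > MAX_DELAYS:
            raise RuntimeError("stuck spinlock")
        if cur_delay == 0:
            cur_delay = MIN_DELAY_USEC
        time.sleep(cur_delay / 1e6)       # punt to the scheduler
        cur_delay = min(cur_delay * 2, MAX_DELAY_USEC)
    return delays

# A contended lock that only frees up after 250 failed attempts:
attempts = [0]
def try_lock():
    attempts[0] += 1
    return attempts[0] > 250

delays = acquire(try_lock)
```

Under this model, a logged delay count in the hundreds means the backend repeatedly lost the CPU while waiting on one spinlock, which is why a count approaching the give-up limit reads as "dangerously close to a stuck spinlock".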