On 11/12/2013 10:57 PM, Chris Picton wrote:
Hi all
I have a set of servers running Asterisk and some Java apps which have
(so far) unexplained spikes in load average.
A typical spike, which occurs at "random" times, sees the 1-minute load
average go from around 4 to upwards of 50, sometimes approaching 200,
within one second.
From the proc manpage, the 1-minute load average is the "number of jobs
in the run queue (state R) or waiting for disk I/O (state D) averaged
over 1 minute".
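For reference, the fourth field of /proc/loadavg is an instantaneous
"runnable/total" count of kernel scheduling entities (processes and
threads); note that it counts only runnable (R) tasks, not those in D
state. A rough, untested sketch of a one-second sampler:

    #!/usr/bin/env python
    # Sketch: sample /proc/loadavg once a second.  The fourth field is
    # "runnable/total" kernel scheduling entities (processes + threads).
    import time

    while True:
        with open('/proc/loadavg') as f:
            one, five, fifteen, run_total, last_pid = f.read().split()
        runnable, total = run_total.split('/')
        print('1min=%s runnable=%s total=%s' % (one, runnable, total))
        time.sleep(1)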
I am collecting many different stats from /proc every second, but
nothing I have found correlates with the spike in load average. The
process counts from /proc/stat and /proc/loadavg do not show a
corresponding sudden spike. I have looked at memory paging, IRQs, number
of threads, CPU states (intr/iowait/etc.), network traffic, disk I/O,
and so on, but no metric I have found so far changes behaviour at the
same time as the load average spikes.
As I am writing this, I have realized that I am not actually tracking
the numbers which directly feed the load average: looping through all
processes, extracting the process state from /proc/<pid>/stat, and
summing the counts per state. This would (hopefully) confirm that the
load average numbers are "correct", and may indicate a cause (many
processes waiting for I/O, or lots of Asterisk or Java tasks being
scheduled to run at the same time).
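Something along the lines of the following (untested) sketch should do
it. It walks /proc/<pid>/task/<tid>/stat rather than /proc/<pid>/stat,
since the load average counts individual tasks (threads), which matters
for Java:

    #!/usr/bin/env python
    # Rough sketch: count tasks per state by scanning /proc.
    import glob
    from collections import Counter

    def task_states():
        counts = Counter()
        for path in glob.glob('/proc/[0-9]*/task/[0-9]*/stat'):
            try:
                with open(path) as f:
                    data = f.read()
            except IOError:  # task exited between glob and open
                continue
            # The state field follows the comm field, which is wrapped
            # in parentheses and may itself contain spaces, so split on
            # the last ')'.
            state = data.rsplit(')', 1)[1].split()[0]
            counts[state] += 1
        return counts

    if __name__ == '__main__':
        counts = task_states()
        print(counts)
        print('R + D: %d' % (counts.get('R', 0) + counts.get('D', 0)))

Logging that output every second next to the load average samples should
show whether a burst of R or D tasks lines up with the spike.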
While I do that, does anyone have any other ideas on how to troubleshoot
the cause of very high load spikes?
Take a look at this (presented at LISA):
http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
Once you've read the main blog post, search for "java" in the comments.
Combine the two and you may get some seriously cool insights into your
problem.