Try looking for processes that are in the zombie (defunct) state.
If I'm not mistaken, for some reason they tend to be counted when the
kernel calculates the load average.
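Since the load average on Linux counts tasks that are runnable (R) or in
uninterruptible sleep (D), it may also help to tally processes by state
rather than look for zombies alone. Below is a minimal Python sketch,
assuming the usual /proc/<pid>/stat layout; the function names are made up
for illustration and this is not a polished tool:

```python
import glob
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(stat_lines):
    """Tally process states (R, S, D, Z, ...) over an iterable of stat lines."""
    return Counter(parse_state(line) for line in stat_lines)


def read_stat_lines():
    """Yield the contents of every readable /proc/<pid>/stat file."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                yield f.read()
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
```

Running count_states(read_stat_lines()) on an affected node should show
whether any processes sit in state D or Z. Note that this only looks at
each process's main thread; individual threads are listed under
/proc/<pid>/task/.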
Michael Green wrote:
I have 18 Sun Fire X4100 systems here, all configured identically:
4-way Opteron, 4G RAM, 70G SAS HDD, RHEL AS 4U3, Sun Grid Engine
agents (SGE) v6u7, NIS.
Periodically some of the systems exhibit a high load average while idle,
for no obvious reason. Rebooting solves the problem, but after some
time the symptom returns. Typically the load average reaches 3 and
does not go beyond that. How would you approach such a problem?
One such system shows:
[EMAIL PROTECTED] ~]# w
10:00:55 up 31 days, 17:47, 1 user, load average: 3.00, 3.00, 3.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root pts/1 192.168.1.100 09:31 0.00s 0.02s 0.00s w
Typical top on this system:
top - 10:01:58 up 31 days, 17:49, 1 user, load average: 3.00, 3.00, 3.00
Tasks: 80 total, 1 running, 79 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1% us, 0.0% sy, 0.0% ni, 99.9% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 4051196k total, 891428k used, 3159768k free, 67620k buffers
Swap: 8160912k total, 4776k used, 8156136k free, 667488k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 16 0 4752 444 412 S 0.0 0.0 0:00.62 init
2 root RT 0 0 0 0 S 0.0 0.0 0:00.25 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/0
4 root RT 0 0 0 0 S 0.0 0.0 0:00.19 migration/1
5 root 34 19 0 0 0 S 0.0 0.0 0:13.35 ksoftirqd/1
6 root RT 0 0 0 0 S 0.0 0.0 0:00.19 migration/2
vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0   4776 3159960  67620 667488    0    0     0     6 2007    28  0  1  99  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2007    25  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2005    23  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2018    28  0  1  99  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2023    23  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2008    25  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2009    22  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2006    22  0  1  99  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2008    26  0  0 100  0
What I've noticed here is that the interrupt rate is relatively high:
approximately 2000 per second.
On this particular system the interrupt rate after a reboot is
approximately 1000 per second:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0      0 3840536  12464 128976    0    0   339    42  276   172  2  2  90  7
 0  0      0 3840536  12464 128976    0    0     0     0 1081   122  0  0 100  0
 2  0      0 3840536  12464 128976    0    0     0     0 1083   112  0  1  99  0
 0  0      0 3840672  12472 129036    0    0     0    16 1064   119  0  0 100  0
 0  0      0 3840672  12472 129036    0    0     0     0 1065   112  0  0 100  0
 0  0      0 3840672  12472 129036    0    0     0     0 1064   116  0  0 100  0
 0  0      0 3840672  12472 129036    0    0     0     0 1066   116  0  0 100  0
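
One way to find out which source accounts for the extra interrupts might be
to compare two snapshots of /proc/interrupts taken a few seconds apart.
Below is a minimal Python sketch, assuming the usual Linux layout of that
file (a header row naming the CPUs, then one row per IRQ with per-CPU
counters). The function names are illustrative only:

```python
import time


def parse_interrupts(text):
    """Map each IRQ label to its total count summed over all CPUs."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        # Per-CPU counters come first; stop at the first non-numeric
        # field, since rows such as ERR carry fewer columns.
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def irq_rates(before, after, seconds):
    """Interrupts per second for each IRQ between two snapshots, highest first."""
    rates = {irq: (after[irq] - before.get(irq, 0)) / seconds for irq in after}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


def sample(seconds=2.0, path="/proc/interrupts"):
    """Take two snapshots `seconds` apart and return the per-IRQ rates."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return irq_rates(before, after, seconds)
```

Calling sample() on a node with the high load and on a freshly rebooted
one, then comparing the first few entries, should point at the IRQ line
whose rate differs.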