I have 18 Sun Fire X4100 systems here, all with identical hardware and
configuration: 4-way Opteron, 4G RAM, 70G SAS HDD, RHEL AS 4U3,
Sun Grid Engine (SGE) agents v6u7, NIS.
Periodically some of the systems exhibit a high load average while idle,
for no obvious reason. Rebooting solves the problem, but after some
time the symptom returns. Typically the load average reaches 3 and
does not go beyond that. How would you approach such a problem?

One such system shows:
[EMAIL PROTECTED] ~]# w
10:00:55 up 31 days, 17:47,  1 user,  load average: 3.00, 3.00, 3.00
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/1    192.168.1.100    09:31    0.00s  0.02s  0.00s w

Typical top on this system:
top - 10:01:58 up 31 days, 17:49,  1 user,  load average: 3.00, 3.00, 3.00
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1% us,  0.0% sy,  0.0% ni, 99.9% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   4051196k total,   891428k used,  3159768k free,    67620k buffers
Swap:  8160912k total,     4776k used,  8156136k free,   667488k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  1 root      16   0  4752  444  412 S  0.0  0.0   0:00.62 init
  2 root      RT   0     0    0    0 S  0.0  0.0   0:00.25 migration/0
  3 root      34  19     0    0    0 S  0.0  0.0   0:00.14 ksoftirqd/0
  4 root      RT   0     0    0    0 S  0.0  0.0   0:00.19 migration/1
  5 root      34  19     0    0    0 S  0.0  0.0   0:13.35 ksoftirqd/1
  6 root      RT   0     0    0    0 S  0.0  0.0   0:00.19 migration/2
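
One check that may be worth running on an affected box: on Linux the load
average counts tasks in uninterruptible sleep (state D) as well as runnable
ones, so a few stuck D-state tasks could hold the load at a fixed value while
the CPUs stay idle. Below is a minimal sketch of such a check. The helper name
list_d_state is made up for illustration, and whether D-state tasks are really
the cause on these systems is only an assumption.

```shell
# Hypothetical helper: keep only tasks whose state starts with D
# (uninterruptible sleep). Expects "STATE PID COMMAND" lines on stdin.
list_d_state() {
    awk '$1 ~ /^D/ { print }'
}

# Usage on a live system (standard procps ps output columns):
#   ps -eo state,pid,comm | list_d_state
```

If this prints about three tasks on a system showing a load of 3.00, their
names would be the first thing to look at.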

vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
0  0   4776 3159960  67620 667488    0    0     0     6 2007    28  0  1 99  0
0  0   4776 3159960  67620 667488    0    0     0     0 2007    25  0  0 100  0
0  0   4776 3159960  67620 667488    0    0     0     0 2005    23  0  0 100  0
0  0   4776 3159960  67620 667488    0    0     0     0 2018    28  0  1 99  0
0  0   4776 3159960  67620 667488    0    0     0     0 2023    23  0  0 100  0
0  0   4776 3159960  67620 667488    0    0     0     0 2008    25  0  0 100  0
0  0   4776 3159960  67620 667488    0    0     0     0 2009    22  0  0 100  0
0  0   4776 3159960  67620 667488    0    0     0     0 2006    22  0  1 99  0
0  0   4776 3159960  67620 667488    0    0     0     0 2008    26  0  0 100  0

What I've noticed here is that the interrupt rate (the "in" column) is
relatively high: approximately 2000 per second.
On this same system the interrupt rate after a reboot is
approximately 1000 per second:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
0  0      0 3840536  12464 128976    0    0   339    42  276   172  2  2 90  7
0  0      0 3840536  12464 128976    0    0     0     0 1081   122  0  0 100  0
2  0      0 3840536  12464 128976    0    0     0     0 1083   112  0  1 99  0
0  0      0 3840672  12472 129036    0    0     0    16 1064   119  0  0 100  0
0  0      0 3840672  12472 129036    0    0     0     0 1065   112  0  0 100  0
0  0      0 3840672  12472 129036    0    0     0     0 1064   116  0  0 100  0
0  0      0 3840672  12472 129036    0    0     0     0 1066   116  0  0 100  0
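
To see which interrupt source accounts for the extra ~1000 interrupts per
second, two copies of /proc/interrupts taken a few seconds apart could be
compared. This is only a sketch: the function name irq_delta and the snapshot
file names are made up, and it assumes the usual layout of a header row with
one column per CPU followed by one row per IRQ.

```shell
# Hypothetical helper: print "IRQ: count" for every interrupt line whose
# total across all CPUs grew between two saved copies of /proc/interrupts.
irq_delta() {
    # $1 = earlier snapshot, $2 = later snapshot
    awk '
        FNR == 1 { ncpu = NF; next }      # header row: one column per CPU
        {
            sum = 0
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) sum += $i
            if (NR == FNR)
                before[$1] = sum          # still reading the first file
            else if (sum - before[$1] > 0)
                print $1, sum - before[$1]
        }
    ' "$1" "$2"
}

# Usage on a live system:
#   cat /proc/interrupts > /tmp/irq.1; sleep 10
#   cat /proc/interrupts > /tmp/irq.2
#   irq_delta /tmp/irq.1 /tmp/irq.2
```

Dividing each count by the sleep interval gives a per-second rate that can be
compared between an affected system and a freshly rebooted one.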


--
Warm regards,
Michael Green
