I have 18 identical Sun Fire X4100 systems here, all configured identically: 4-way Opteron, 4G RAM, 70G SAS HDD, RHEL AS 4U3, Sun Grid Engine (SGE) agents v6u7, NIS. Periodically some of the systems exhibit a high load average while idle, for no obvious reason. Rebooting solves the problem, but after some time the symptom returns. Typically the load average reaches 3 and does not go beyond that. How would you approach such a problem?
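One possible starting point: the Linux load average counts tasks in uninterruptible sleep (state D) as well as runnable ones (state R), so a steady load on an idle CPU could come from a few tasks stuck in D. The following is only a minimal sketch of how such tasks might be listed from /proc. It assumes a Python 3 interpreter is available on the node, and the function names parse_stat_line and tasks_in_states are made up for illustration.

```python
import glob


def parse_stat_line(line):
    """Split one /proc/<pid>/task/<tid>/stat line into (tid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    lparen = line.index("(")
    rparen = line.rindex(")")
    tid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return tid, comm, state


def tasks_in_states(states=("R", "D"), proc_root="/proc"):
    """Return sorted (tid, comm, state) for every thread in one of `states`."""
    found = []
    for path in glob.glob(proc_root + "/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                tid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited between listing and reading
        if state in states:
            found.append((tid, comm, state))
    return sorted(found)


if __name__ == "__main__":
    # The script itself normally shows up as one task in state R.
    for tid, comm, state in tasks_in_states():
        print(tid, state, comm)
```

If the same few tasks appear in state D on every run, they would be the first candidates to look at. Nothing in the output above confirms that this is the cause here.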
One such system shows:

[EMAIL PROTECTED] ~]# w
 10:00:55 up 31 days, 17:47,  1 user,  load average: 3.00, 3.00, 3.00
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/1    192.168.1.100    09:31    0.00s  0.02s  0.00s w

Typical top on this system:

top - 10:01:58 up 31 days, 17:49,  1 user,  load average: 3.00, 3.00, 3.00
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1% us,  0.0% sy,  0.0% ni, 99.9% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   4051196k total,   891428k used,  3159768k free,    67620k buffers
Swap:  8160912k total,     4776k used,  8156136k free,   667488k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      16   0  4752  444  412 S  0.0  0.0   0:00.62 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.25 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.14 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.19 migration/1
    5 root      34  19     0    0    0 S  0.0  0.0   0:13.35 ksoftirqd/1
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.19 migration/2

vmstat 2

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0   4776 3159960  67620 667488    0    0     0     6 2007    28  0  1  99  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2007    25  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2005    23  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2018    28  0  1  99  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2023    23  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2008    25  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2009    22  0  0 100  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2006    22  0  1  99  0
 0  0   4776 3159960  67620 667488    0    0     0     0 2008    26  0  0 100  0

What I've noticed here is that the rate of interrupts (the "in" column) is relatively high: approximately 2000 per second.
On this particular system the rate of interrupts after reboot is approximately 1000:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy  id wa
 0  0      0 3840536  12464 128976    0    0   339    42  276   172  2  2  90  7
 0  0      0 3840536  12464 128976    0    0     0     0 1081   122  0  0 100  0
 2  0      0 3840536  12464 128976    0    0     0     0 1083   112  0  1  99  0
 0  0      0 3840672  12472 129036    0    0     0    16 1064   119  0  0 100  0
 0  0      0 3840672  12472 129036    0    0     0     0 1065   112  0  0 100  0
 0  0      0 3840672  12472 129036    0    0     0     0 1064   116  0  0 100  0
 0  0      0 3840672  12472 129036    0    0     0     0 1066   116  0  0 100  0

--
Warm regards,
Michael Green