<snip>

> Yes, let the server act as usual, it just starts happening out of the blue.
> No new hardware has been added or removed, no new programs have been
> installed.
"Me too" [2.] Full description of the problem/report: Last week I had the same issue twice. In short: load goes through the roof from 2 to 950 within five minutes. Never been able to use the console -no response-. These servers where part of a HA nfs cluster so we had to pull the plug (we dont have a mon plugin for extreme load values yet). Once triggered, it seems, there is noway back. But once fail-over has occured the other server does break - so my simplistic conclusion is that this behaviour isnt based on usage patterns by our nfs clients beating the server. # lspci | grep -i giga 02:03.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01) 02:03.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01) # more /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 26571295 26559117 40034487 40909166 IO-APIC-edge timer 1: 9 0 0 0 IO-APIC-edge i8042 2: 0 0 0 0 XT-PIC cascade 8: 1 0 0 0 IO-APIC-edge rtc 12: 59 0 0 0 IO-APIC-edge i8042 15: 20 0 0 0 IO-APIC-edge ide1 153: 0 0 0 0 IO-APIC-level uhci_hcd 161: 0 0 0 0 IO-APIC-level uhci_hcd 169: 0 0 0 0 IO-APIC-level uhci_hcd 177: 176067 319337 329774 239612 IO-APIC-level aic79xx 185: 534649 3958315 6395749 9602510 IO-APIC-level aic79xx 193: 88511141 0 0 0 IO-APIC-level eth0 201: 60 0 24342441 0 IO-APIC-level eth1 NMI: 0 0 0 0 LOC: 134085019 134085069 134084971 134085090 ERR: 0 MIS: 0 [7.1.] Software (add the output of the ver_linux script here) Linux somewhere 2.6.9-11.ELsmp #1 SMP Fri May 20 18:26:27 EDT 2005 i686 i686 i386 GNU/Linux Gnu C 3.4.3 Gnu make 3.80 binutils 2.15.92.0.2 util-linux 2.12a mount 2.12a module-init-tools 3.1-pre5 e2fsprogs 1.35 reiserfsprogs line reiser4progs line pcmcia-cs 3.2.7 quota-tools 3.12. PPP 2.4.2 isdn4k-utils 3.3 nfs-utils 1.0.6 Linux C Library 2.3.4 Dynamic linker (ldd) 2.3.4 Procps 3.2.3 Net-tools 1.60 Kbd 1.12 Sh-utils 5.2.1 Modules Loaded nfsd exportfs drbd nfs lockd sunrpc md5 ipv6 parport_pc lp parport autofs4 w83781d adm1021 i2c_sensor i2c_i801 i2c_core dm_mod uhci_hcd hw_random e1000 floppy sg ext3 jbd aic79xx sd_mod scsi_mod [7.2.] Processor information (from /proc/cpuinfo): processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 2 model name : Intel(R) Xeon(TM) CPU 2.80GHz stepping : 5 cpu MHz : 2799.959 cache size : 512 KB physical id : 0 siblings : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse ss e2 ss ht tm pbe cid xtpr bogomips : 5521.40 and so on... because of two xeon's in HT-mode [7.5.] PCI information ('lspci -vvv' as root) ((Just the ethernet)) 02:03.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01) Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 64 (63750ns min), Cache Line Size 08 Interrupt: pin A routed to IRQ 193 Region 0: Memory at fc200000 (64-bit, non-prefetchable) [size=128K] Region 4: I/O ports at 3000 [size=64] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [e4] PCI-X non-bridge device. 
                Command: DPERE- ERO+ RBC=0 OST=0
                Status: Bus=2 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple,
                        DMMRBC=2, DMOST=0, DMCRS=1, RSCEM-
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000

Is there a method that can give hints about what was going on while the load
was rising so sharply? My guess is that even monitoring/sampling isn't doable
anymore once the bad situation has occurred.

Tips on obtaining information about subjects like:
 - who was using which TCP/UDP connection with what bandwidth
 - who was generating how many reads/writes on which filesystem, incl. location
 - etc. etc.
are more than welcome too.

Does using profile=2 and storing readprofile output to files every 5 seconds
(roughly along the lines of the sketch below) give enough information to
tackle the source of this problem?

Please CC, thanks.

Best regards,
Leroy
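
A minimal sketch of the kind of 5-second sampler meant above, assuming the
kernel was booted with profile=2, that the matching System.map sits in /boot,
and using /var/log/loadspike as a made-up output directory:

#!/bin/sh
# Rough sampling loop (run as root; needs a kernel booted with profile=2 and
# readprofile from util-linux). Every 5 seconds it dumps the top profiled
# kernel functions plus a few /proc snapshots, then resets the profiling
# buffer so each file covers only the last interval.
MAP=/boot/System.map-$(uname -r)
OUT=/var/log/loadspike        # made-up directory, pick anything local
mkdir -p "$OUT"
while true; do
    TS=$(date +%Y%m%d-%H%M%S)
    readprofile -m "$MAP" | sort -nr | head -40 > "$OUT/profile.$TS"
    cat /proc/loadavg     > "$OUT/loadavg.$TS"
    cat /proc/net/dev     > "$OUT/netdev.$TS"
    cat /proc/interrupts  > "$OUT/interrupts.$TS"
    readprofile -r        # zero the profiling counters for the next sample
    sleep 5
done

Whether such a loop keeps running, and whether its output still reaches disk,
once the load has already exploded is of course exactly the open question, so
keeping the samples small and on a local filesystem seems the safer bet.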