Dear list,
I run a Compaq Proliant 1500 (dual Pentium 75.200) with hardware raid
(Smart2) with two ethernet cards 3com905 (b or c, I can't tell you right
now) as a firewall and web/mail virus scanner which (needless to say)
needs to be up 7d/24h.
Recently, during a pretty fast download, the machine locked up (technically
only the ethernet did: you could still log in on the console and even ping
the ethernet IP address) with the following error log:
Oct 9 17:29:02 fwintern kernel: eth0: transmit timed out, tx_status
00 status e681.
Oct 9 17:29:02 fwintern kernel: eth0: Interrupt posted but not
delivered -- IRQ blocked by another device?
Oct 9 17:29:02 fwintern kernel: Flags; bus-master 1, full 0; dirty
6543269 current 6543269.
Oct 9 17:29:02 fwintern kernel: Transmit list 00000000 vs.
cbef4250.
Oct 9 17:29:02 fwintern kernel: 0: @cbef4200 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 1: @cbef4210 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 2: @cbef4220 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 3: @cbef4230 length 800005ea
status 800105ea
Oct 9 17:29:02 fwintern kernel: 4: @cbef4240 length 800005ea
status 800105ea
Oct 9 17:29:02 fwintern kernel: 5: @cbef4250 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 6: @cbef4260 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 7: @cbef4270 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 8: @cbef4280 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 9: @cbef4290 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 10: @cbef42a0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 11: @cbef42b0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 12: @cbef42c0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 13: @cbef42d0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 14: @cbef42e0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 15: @cbef42f0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: eth0: Resetting the Tx ring pointer.
Oct 9 17:29:02 fwintern kernel: Packet log: input DENY eth1 PROTO=17
10.1.1.200:138 10.1.2.2:138 L=257 S=0x00 I=34892 F=0x0000 T=126 (#3)
Oct 9 17:29:12 fwintern kernel: eth0: transmit timed out, tx_status
00 status e601.
Oct 9 17:29:12 fwintern kernel: eth0: Interrupt posted but not
delivered -- IRQ blocked by another device?
Oct 9 17:29:12 fwintern kernel: Flags; bus-master 1, full 0; dirty
6543285 current 6543285.
Oct 9 17:29:12 fwintern kernel: Transmit list 00000000 vs.
cbef4250.
... repeating ...
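For anyone who wants to compare several of these dumps, the counters can be
pulled out mechanically. The following is only a rough Python sketch: the
function name summarize_tx_log and the SAMPLE excerpt are made up for
illustration, and reading the low 16 bits of the "length" word as the byte
count is an assumption about the descriptor layout, not something taken from
the driver source.

```python
import re

# Short excerpt of the log above, still wrapped the way the mail wrapped it.
SAMPLE = """\
Oct 9 17:29:02 fwintern kernel: Flags; bus-master 1, full 0; dirty
6543269 current 6543269.
Oct 9 17:29:02 fwintern kernel: 0: @cbef4200 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 3: @cbef4230 length 800005ea
status 800105ea
"""

# ASSUMPTION: the byte count sits in the low 16 bits of the length word.
LENGTH_MASK = 0xFFFF


def summarize_tx_log(text):
    """Return (counters, ring) parsed from a transmit-timeout dump.

    counters: list of (dirty, current) pairs
    ring:     list of (index, address, length_word, status_word) tuples
    """
    flat = " ".join(text.split())  # undo the mail line wrapping
    counters = [(int(d), int(c))
                for d, c in re.findall(r"dirty (\d+) current (\d+)", flat)]
    ring = [(int(i), int(addr, 16), int(length, 16), int(status, 16))
            for i, addr, length, status in re.findall(
                r"(\d+): @([0-9a-f]+) length ([0-9a-f]+) status ([0-9a-f]+)",
                flat)]
    return counters, ring


counters, ring = summarize_tx_log(SAMPLE)
for dirty, current in counters:
    print("pending descriptors:", current - dirty)
for index, _addr, length_word, _status in ring:
    print("entry", index, "length", length_word & LENGTH_MASK, "bytes")
```

On the excerpt this reports zero pending descriptors (dirty equals current,
which agrees with "full 0") and 1514-byte entries, i.e. full-size ethernet
frames, provided the assumption about the length word holds.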
The problem was reproducible (several times) with the same download (a
300MB file) after a reboot.
There should be no IRQ collision on IRQ 9; /proc/interrupts:
           CPU0       CPU1
  0:    4444438    4611949    IO-APIC-edge   timer
  1:       2077       2163    IO-APIC-edge   keyboard
  2:          0          0          XT-PIC   cascade
  5:     390236     391327   IO-APIC-level   eth1
  8:          2          0    IO-APIC-edge   rtc
  9:     435924     436646   IO-APIC-level   eth0
 10:         17         17   IO-APIC-level   ncr53c8xx
 11:      24525      24737   IO-APIC-level   ida0
 13:          0          0          XT-PIC   fpu
NMI:          0
ERR:          0
No other hardware is installed.
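The sharing check can also be done mechanically. This is a minimal Python
sketch, assuming the 2.2-style layout shown above (per-CPU counters, then a
single-word controller type, then the device names, comma-separated when an
IRQ is shared). The function name shared_irqs and the shared example line
are hypothetical, not taken from this machine.

```python
# The table from this machine, as posted.
SAMPLE = """\
CPU0 CPU1
0: 4444438 4611949 IO-APIC-edge timer
1: 2077 2163 IO-APIC-edge keyboard
2: 0 0 XT-PIC cascade
5: 390236 391327 IO-APIC-level eth1
8: 2 0 IO-APIC-edge rtc
9: 435924 436646 IO-APIC-level eth0
10: 17 17 IO-APIC-level ncr53c8xx
11: 24525 24737 IO-APIC-level ida0
13: 0 0 XT-PIC fpu
NMI: 0
ERR: 0
"""


def shared_irqs(text):
    """Return {irq: [device, ...]} for IRQ lines that list several devices."""
    shared = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # skip the CPU header and the NMI/ERR summary lines
        fields = rest.split()
        while fields and fields[0].isdigit():
            fields.pop(0)  # drop the per-CPU interrupt counters
        # fields[0] is now the controller type; the rest names the devices
        names = " ".join(fields[1:]).split(",")
        devices = [name.strip() for name in names if name.strip()]
        if len(devices) > 1:
            shared[int(head)] = devices
    return shared


print(shared_irqs(SAMPLE))
# Hypothetical line, only to show what a shared IRQ would look like:
print(shared_irqs("9: 10 12 IO-APIC-level eth0, ncr53c8xx"))
```

For the table above the result is empty, which matches the statement that
nothing shares IRQ 9 with eth0.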
Another observation: Normally the free (non buffer) memory of that machine
is 3MB (of 192MB). At the point of the crash it was down to 1.7MB. I don't
know if this is the cause or just a symptom of the locked up ethernet
card.
Is this a known problem? If it's not a hardware/driver bug, I suspect an
SMP-related race condition. Maybe the BIOS set up the CPUs / IO-APICs
incorrectly? I should mention that now, two days later, I can no longer
reproduce the crash.
This might be due to one of two reasons: 1) The bandwidth when downloading
from the web server was measurably lower today, about half of what it was
before (about 600 kbit/s on a 2 Mbit link). 2) The machine was power
cycled. Unfortunately the personnel cannot remember whether the problem was
only reproducible after a warm reboot or also after a power cycle.
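To put the bandwidth difference into numbers, here is a back-of-the-envelope
Python sketch. It rests on several assumptions that are not measurements:
"about 1/2" is read as an earlier rate of roughly 1200 kbit/s, every frame is
taken to be a full-size 1514-byte frame, 300MB is taken as 300 * 10^6 bytes,
and protocol overhead is ignored. The function names are made up for
illustration.

```python
FRAME_BYTES = 1514          # ASSUMPTION: full-size ethernet frames only
FILE_BYTES = 300 * 10**6    # ASSUMPTION: 300MB read as 300 * 10^6 bytes


def frames_per_second(bits_per_second, frame_bytes=FRAME_BYTES):
    """Approximate frame rate needed to sustain the given bit rate."""
    return bits_per_second / (frame_bytes * 8)


def transfer_seconds(size_bytes, bits_per_second):
    """Approximate duration of a transfer, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_second


# 600 kbit/s is the rate seen today; 1200 kbit/s is the assumed earlier rate.
for rate in (600_000, 1_200_000):
    print(rate, "bit/s:",
          round(frames_per_second(rate)), "frames/s,",
          round(transfer_seconds(FILE_BYTES, rate)), "s for the download")
```

Under these assumptions the slower download moves about half as many frames
per second over roughly twice the time, so a timing-dependent problem would
simply get fewer chances per second to show up.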
Finally: /proc/version:
Linux version 2.2.14 (root@fwintern) (gcc version 2.95.2 19991024
(release)) #4 SMP Thu Aug 3 20:32:32 CEST 2000
The recent discussion on linux-kernel seems to indicate that gcc 2.95
cannot compile a kernel at all. Could this be the cause here? (The system
had been running flawlessly since that Aug 3.) Unfortunately there is no
other version available on that system, and the distributor (SuSE) seems to
deny any gcc 2.95 / kernel incompatibilities.
Please advise,
Michael.
--
Michael Weller: [EMAIL PROTECTED], [EMAIL PROTECTED],
or even [EMAIL PROTECTED] If you encounter an eowmob account on
any machine in the net, it's very likely it's me.