Every few days, I get the kernel error "hdX: lost interrupt" where X is
usually c or g.

I haven't been able to find any systematic way of troubleshooting this
problem.

hdg is a brand new drive and ran for a couple of weeks in another system
without a blip, so I don't think it is a problem with the drive itself.

There are also no SMART errors appearing on any drives.

I have replaced the ribbon cable connecting the drive to the controller.

hdc and hdg, which both occasionally get lost interrupts, are on
different controllers--and, in fact, on different sorts of controllers.
One is a VIA vt8235 IDE UDMA133, the other is a RAID Controller Triones
Technologies HPT366/368/370/370A/372.

I was using Debian stock kernel 2.6.8-2-k7; now I'm using a custom-built
vanilla 2.6.15.4. I haven't figured out if there is a real statistical
difference in the number of errors with each--I may be getting them
slightly more frequently with 2.6.15.4 but I don't have enough data
points to be sure.

I also *seemed* to be getting them more frequently when I had a UPS
installed. Since I've taken the UPS out and plugged the machine directly
into a wall socket, they seem to be rarer and are not accompanied by any
DMA timeout errors, but again I'm not certain this is statistically
significant.
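
To get firmer numbers for the comparisons above (old kernel vs. new, UPS
vs. no UPS), the messages could be tallied per drive from the kernel log.
The following is only a rough sketch: the function name is made up for
illustration, and it assumes the log lines contain the text
"hdX: lost interrupt" exactly as quoted at the top of this message.

```python
import re
from collections import Counter

# Assumption: the kernel message appears in the log as "hdX: lost interrupt",
# as quoted in this post; the pattern needs adjusting if the wording differs.
LOST_IRQ = re.compile(r"\b(hd[a-z]): lost interrupt")


def count_lost_interrupts(lines):
    """Return a Counter mapping drive name (e.g. 'hdc') to message count."""
    counts = Counter()
    for line in lines:
        match = LOST_IRQ.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It takes any iterable of lines, so an open log file can be passed in
directly (the log location varies by setup; /var/log/kern.log is a guess).
Running it separately over the logs from each kernel or each power setup
would give per-drive counts to compare.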

/proc/interrupts says:

           CPU0
  0:   32453965          XT-PIC  timer
  1:         16          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  5:          0          XT-PIC  uhci_hcd:usb2
  8:          4          XT-PIC  rtc
 10:    3554483          XT-PIC  ide2, ide3, uhci_hcd:usb3
 11:    9589616          XT-PIC  uhci_hcd:usb1, eth0, eth1
 12:          0          XT-PIC  ehci_hcd:usb4
 14:    2235942          XT-PIC  ide0
 15:    1836402          XT-PIC  ide1
NMI:          0
LOC:   32454287
ERR:      12990
MIS:          0
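
In case shared interrupt lines matter here, this is a small sketch that
picks them out of a table like the one above. The function name is
hypothetical, and the parsing assumes the single-CPU layout shown (one
count column, then a one-word controller type such as XT-PIC, then a
comma-separated device list).

```python
def shared_irqs(text):
    """Return {irq_number: [device, ...]} for IRQs listing more than one device."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # skip the CPU0 header and the NMI/LOC/ERR/MIS rows
        # Assumed layout: <count> <controller> <dev1, dev2, ...>
        fields = rest.split(None, 2)
        if len(fields) < 3:
            continue
        devices = [dev.strip() for dev in fields[2].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared
```

Calling shared_irqs(open("/proc/interrupts").read()) on the table above
reports IRQ 10 (ide2, ide3, uhci_hcd:usb3) and IRQ 11 (uhci_hcd:usb1,
eth0, eth1); ide0 and ide1 each have their own line.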

/proc/ioports:

0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial
0376-0376 : ide1
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial
0cf8-0cff : PCI conf1
4000-407f : 0000:00:11.0
5000-500f : 0000:00:11.0
c000-c0ff : 0000:00:0c.0
  c000-c0ff : r8169
c400-c4ff : 0000:00:0e.0
c800-c807 : 0000:00:0f.0
  c800-c807 : ide2
cc00-cc03 : 0000:00:0f.0
  cc02-cc02 : ide2
d000-d007 : 0000:00:0f.0
  d000-d007 : ide3
d400-d403 : 0000:00:0f.0
  d402-d402 : ide3
d800-d8ff : 0000:00:0f.0
  d800-d807 : ide2
  d808-d80f : ide3
  d810-d8ff : HPT372
dc00-dc1f : 0000:00:10.0
  dc00-dc1f : uhci_hcd
e000-e01f : 0000:00:10.1
  e000-e01f : uhci_hcd
e400-e41f : 0000:00:10.2
  e400-e41f : uhci_hcd
e800-e80f : 0000:00:11.1
  e800-e807 : ide0
  e808-e80f : ide1
ec00-ecff : 0000:00:12.0
  ec00-ecff : via-rhine

I have one drive from each IDE channel in a software RAID-5: hda, hdc,
hde, and hdh.

Any suggestions for how to go about diagnosing the problem?
-- 
Adam Rosi-Kessel
http://adam.rosi-kessel.org

