Re: Read / write timeouts on SATA disks connected to ICH9

Pieter de Boer Sat, 15 May 2010 00:04:40 -0700

Hi Jeremy,

> Lots to say about all of this.

Thanks for your elaborate reply; it was very useful to see smartctl
output explained a bit :) I still think there's something else in play
besides disk failure. I've checked one of the drives I replaced earlier,
but that one doesn't have any of the errors in its SMART output that you
described, although it did drop out of the mirror multiple times during
its lifetime.

> The WD Caviar Black drives have a useful feature called TLER -- it's
> disabled by default, for reasons which I don't want to get into here --
> which can force the drive to internally give up after X seconds (it's
> user-selectable) when dealing with such remapping/errors.  The idea is
> to keep the drive from being deemed dead from the OS/controller's point
> of view.  I believe Seagate, Hitachi, or Samsung (I forget which) have
> this feature as well, but it's not called TLER.

I've read about this feature, but didn't have the time to try to get it
turned on (IIRC you'd need a specific Western Digital DOS-based utility
or something).

> If you want to find out the exact LBA that has the problem (there may be
> more than one), I can step you through performing a selective LBA scan
> using SMART, since this model of disk does support such.  It's easy to
> do, easy to understand the results, and can be done while the drive is
> in operation (though I would recommend trying to keep disk I/O to a
> minimum during this test).  Let me know.

At a certain point in time I had read errors from specific LBAs on ad4.
Using dd I was able to pinpoint those to single sectors. Overwriting
those sectors with what was on ad6 made them readable again. What is odd
is that the 'remapped sector' count of ad4 is 0.
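
For reference, pinpointing single sectors with dd can be sketched
roughly as below. This is only an illustration, not the exact commands
used here: the function name scan_sectors is made up, it assumes
512-byte sectors, and it only reads from the device. Any LBA that does
not return a full sector is printed.

```shell
# Probe each 512-byte sector in the range first..last (inclusive) and
# print the LBAs that cannot be read in full. Read-only; nothing is written.
# scan_sectors is a hypothetical helper name, not an existing tool.
scan_sectors() {
    dev=$1
    lba=$2
    last=$3
    while [ "$lba" -le "$last" ]; do
        # Read exactly one sector at offset $lba and count the bytes returned.
        got=$(dd if="$dev" bs=512 skip="$lba" count=1 2>/dev/null | wc -c | tr -d ' ')
        [ "$got" -eq 512 ] || echo "$lba"
        lba=$((lba + 1))
    done
}
```

Something like "scan_sectors /dev/ad4 FIRST LAST" around the LBA from
the kernel error message would then list the sectors that fail to read.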


Still, I'd like to know how to perform such a scan.
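
From what I can tell from the smartmontools documentation, a selective
self-test is started with "smartctl -t select,FIRST-LAST" and its
progress is shown by "smartctl -l selective". A minimal sketch, where
the wrapper name selective_scan and the DRY_RUN switch are made up for
illustration and the LBA range is whatever span needs checking:

```shell
# Start a SMART selective self-test over LBAs first..last on a device.
# Set DRY_RUN=1 to print the smartctl command instead of running it.
# selective_scan and DRY_RUN are hypothetical names for this sketch.
selective_scan() {
    dev=$1
    first=$2
    last=$3
    cmd="smartctl -t select,${first}-${last} ${dev}"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

After starting the test, "smartctl -l selective /dev/ad4" should show
the span being tested and its status, and "smartctl -l selftest
/dev/ad4" should report the LBA of the first error if the test fails.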

> Finally, your vmstat -i output:
>
> # vmstat -i
> interrupt                          total       rate
> irq23: atapci0                 371021299      10423


> Good to know there's no IRQ sharing going on, but what does worry me is
> the interrupt rate (10K interrupts/second).  That seems *extremely*
> high, but it also depends on what kind of disk I/O is happening on this
> system -- especially since you have 2 disks attached to the same
> controller.

The rate is higher than 10000 at idle as well. During a gmirror sync
from ad6 to ad4, it's about 10670.

> "iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
> what kind of disk I/O is going on.  If actual I/O is very little, then
> something weird is going on with regards to the number of interrupts
> being seen on IRQ 23.  mav@ might have some ideas, otherwise I'd
> recommend rebooting the machine and seeing if the number drops.  If so,
> it may be that the OS has some sort of bug where a disk timing out or
> falling off the bus causes interrupt problems.  (It's too bad you don't
> have AHCI on this system.  It handles stuff like this much more
> elegantly...)

If mav@ or anyone else doesn't have further insight into the interrupt
rate, I guess a reboot will at least show whether it's persistent or
related to the errors. I'll try to do a reboot when convenient (probably
Sunday morning or something).
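
As far as I know, the "rate" column of vmstat -i is the total divided
by the uptime, so it is an average since boot rather than the current
rate. To see what IRQ 23 is doing right now, the total can be sampled
twice and the difference divided by the interval. A rough sketch, with
the helper names irq_total and irq_rate made up for illustration:

```shell
# Print the cumulative interrupt count for one source (e.g. irq23),
# reading `vmstat -i` output on stdin. The total is the second-to-last
# column, regardless of how many words the device name has.
irq_total() {
    awk -v src="$1" '$1 == src":" { print $(NF-1) }'
}

# Sample the total twice, $2 seconds apart (default 10), and print the
# average number of interrupts per second over that interval.
irq_rate() {
    src=$1
    secs=${2:-10}
    a=$(vmstat -i | irq_total "$src")
    sleep "$secs"
    b=$(vmstat -i | irq_total "$src")
    echo $(( (b - a) / secs ))
}
```

Running "irq_rate irq23 10" once at idle and once during a gmirror
sync would show whether the interrupt rate actually follows the disk
I/O.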


Thanks,
Pieter




