Hi Jeremy,
Lots to say about all of this.
Thanks for your elaborate reply; it was very useful to see smartctl
output explained a bit :) I still think there's something else in play
besides disk failure. I've checked one of the drives I replaced earlier,
and its SMART output doesn't show any of the errors you described,
although it did drop out of the mirror multiple times during its
lifetime.
The WD Caviar Black drives have a useful feature called TLER -- it's
disabled by default, for reasons which I don't want to get into here --
which can force the drive to internally give up after X seconds (it's
user-selectable) when dealing with such remapping/errors. The idea is
to keep the drive from being deemed dead from the OS/controller's point
of view. I believe Seagate, Hitachi, or Samsung (I forget which) have
this feature as well, but it's not called TLER.
I've read about this feature, but didn't have the time to try to get it
turned on (iirc you'd need a specific Western Digital DOS-based util or
something).
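As an aside, recent smartmontools releases can query and set this kind
of error recovery timeout through SCT Error Recovery Control, on drives
that support it, without a DOS utility. A minimal sketch, assuming the
disk is /dev/ad4 and a 7 second limit (both are placeholders, and
whether this particular drive accepts the command is not confirmed
here). The helper only prints the command rather than running it:

```sh
#!/bin/sh
# scterc_cmd SECONDS DISK
# Print (not run) the smartctl call that would set the read and write
# error recovery timeouts. smartctl expects the values in tenths of a
# second, so 7 seconds becomes 70.
scterc_cmd() {
    tenths=$(( $1 * 10 ))
    printf 'smartctl -l scterc,%s,%s %s\n' "$tenths" "$tenths" "$2"
}

scterc_cmd 7 /dev/ad4
# The current settings can be shown with: smartctl -l scterc /dev/ad4
```

On many drives the setting does not survive a power cycle, so it would
probably have to be reapplied at boot.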
If you want to find out the exact LBA that has the problem (there may be
more than one), I can step you through performing a selective LBA scan
using SMART, since this model of disk does support such. It's easy to
do, easy to understand the results, and can be done while the drive is
in operation (though I would recommend trying to keep disk I/O to a
minimum during this test). Let me know.
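For reference, a selective self-test of this kind is normally started
with smartctl's "-t select" option. A rough sketch, assuming the disk is
/dev/ad4; the LBA range below is purely a placeholder and not a value
from this thread. The helper prints the command instead of running it:

```sh
#!/bin/sh
# Placeholders only: pick the real range from the LBAs that gave errors.
DISK=/dev/ad4
START=1000000
END=1001000

# selective_cmd START END DISK
# Build the smartctl invocation for a selective self-test over START..END.
selective_cmd() {
    printf 'smartctl -t select,%s-%s %s\n' "$1" "$2" "$3"
}

selective_cmd "$START" "$END" "$DISK"

# Once the test has finished, the outcome (including the LBA of the
# first error, if there was one) is recorded in the self-test logs:
#   smartctl -l selftest /dev/ad4
#   smartctl -l selective /dev/ad4
```

If the test stops at a bad LBA, the next run can start just past it to
look for further ones.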
At a certain point in time I had read errors from specific LBAs on ad4.
Using dd I was able to pinpoint those to single sectors. Overwriting
those sectors with what was on ad6 made them readable again. What is odd
is that the 'remapped sector' count of ad4 is 0.
Still, I'd like to know how to perform such a scan.
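The dd approach described above can be sketched roughly as follows. To
keep it safe to run, the example works on two temporary image files
standing in for ad6 (good copy) and ad4 (damaged copy); the sector
number is hypothetical. On real disk devices "conv=notrunc" would not be
needed, and the exact commands originally used are not shown in this
thread:

```sh
#!/bin/sh
# Stand-ins for the two mirror halves: 64 sectors of 512 bytes each.
good=$(mktemp /tmp/good.XXXXXX)
bad=$(mktemp /tmp/bad.XXXXXX)
dd if=/dev/urandom of="$good" bs=512 count=64 2>/dev/null
cp "$good" "$bad"

lba=37   # hypothetical sector number

# Simulate damage by zeroing one sector of the "bad" copy.
dd if=/dev/zero of="$bad" bs=512 seek="$lba" count=1 conv=notrunc 2>/dev/null

# Read exactly one sector; on a failing disk this is the read that errors.
dd if="$bad" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null

# Overwrite that sector with the same sector from the good copy.
dd if="$good" of="$bad" bs=512 skip="$lba" seek="$lba" count=1 \
    conv=notrunc 2>/dev/null

# Check that both copies are identical again, then clean up.
result=$(cmp -s "$good" "$bad" && echo restored)
echo "sector $lba: $result"
rm -f "$good" "$bad"
```

A write that succeeds in place may not require the drive to reallocate
the sector, which would be consistent with a remapped count of 0.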
> Finally, your vmstat -i output:
# vmstat -i
interrupt                          total       rate
irq23: atapci0                 371021299      10423
Good to know there's no IRQ sharing going on, but what does worry me is
the interrupt rate (10K interrupts/second). That seems *extremely*
high, but it also depends on what kind of disk I/O is happening on this
system -- especially since you have 2 disks attached to the same
controller.
The rate is above 10000 even at idle. During a gmirror sync from
ad6 to ad4, it's about 10670.
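One caveat when reading these numbers: as far as I know, the rate column
of "vmstat -i" is the total divided by the uptime, so it is an average
since boot and not the current rate. A minimal sketch for measuring the
rate over a short interval instead; the helper names are made up for
this example, and the demo runs on the line quoted above so it does not
need a live system:

```sh
#!/bin/sh
# irq_total IRQ
# Read "vmstat -i" output on stdin and print the total column (the
# second to last field) for the given interrupt, e.g. "irq23:".
irq_total() {
    awk -v irq="$1" '$1 == irq { print $(NF - 1) }'
}

# rate_between TOTAL1 TOTAL2 SECONDS
# Interrupts per second over the sampling interval (integer division).
rate_between() {
    echo $(( ($2 - $1) / $3 ))
}

# Demo on the line quoted in this thread:
sample='irq23: atapci0 371021299 10423'
echo "$sample" | irq_total irq23:

# On the live FreeBSD machine, something like:
#   t1=$(vmstat -i | irq_total irq23:); sleep 10
#   t2=$(vmstat -i | irq_total irq23:)
#   rate_between "$t1" "$t2" 10
```

If the average-since-boot reading is right, 371021299 / 10423 works out
to roughly 35600 seconds, i.e. just under 10 hours of uptime at the time
of that output.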
"iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you
what kind of disk I/O is going on. If actual I/O is very little, then
something weird is going on with regards to the number of interrupts
being seen on IRQ 23. mav@ might have some ideas, otherwise I'd
recommend rebooting the machine and seeing if the number drops. If so,
it may be that the OS has some sort of bug where a disk timing out or
falling off the bus causes interrupt problems. (It's too bad you don't
have AHCI on this system. It handles stuff like this much more
elegantly...)
If neither mav@ nor anyone else has further insight into the interrupt
rate, I guess a reboot will at least show whether it's persistent or
related to the errors. I'll try to reboot when convenient (probably
Sunday morning or so).
Thanks,
Pieter