Hi,
<SNIP: disk without errors timing out>
That could be caused by a multitude of other known things. For
example, some Western Digital "Green" drives (including the
Enterprise class ones) are known to perform head parking/offloading
excessively, which could result in the drive spending more time doing
that than actually serving overall I/O requests. There are some
other reports of Samsung Spinpoint drives experiencing other issues
(I've since forgotten and would have to dig up the threads).
If you could provide full SMART stats for that drive, it might help.
Attached the SMART output of both disks I replaced about a month ago. It
appears I replaced perfectly fine drives with the current disks with
errors ;( One of the old disks is in a USB-enclosure now, so 'da0'.
<SNIP: enabling TLER>
Yes, it's a DOS-based utility (like most firmware upgraders these
days). I can provide it if you'd like. I've been meaning to spend
some time trying to reverse-engineer the binary to figure out what
ATA commands it sends to the disk to toggle/adjust the feature (so
that one could do it in real-time rather than have to boot into DOS).
I'd like to try that tool. Since the old WD disks are now lying around
at home, I have some time to get a DOS boot working to try it out. A
FreeBSD-implementation of the WD tool and possibly other brands would be
really useful indeed.
At a certain point in time I had read errors from specific LBA's on
ad4. Using dd I was able to pinpoint those to single sectors.
This isn't very effective (dd will read large chunks/amounts of data
(read: multiple LBAs) from the underlying disk at once, rather than
the disk itself performing a per-LBA test). My opinion is that the
"dd method" should only be used on drives which don't support
selective LBA scanning via SMART.
Will dd read multiple LBAs even when using 'bs=512'? The process I used
was reading using bs=8192, then zooming in on the LBA's mentioned in
the errors in dmesg with bs=512 to find the actual LBA.
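For reference, the bs=512 zoom-in step can be sketched roughly as below. This is only an illustration of the approach, not the exact commands used at the time: probe_sectors is a made-up helper name, and it assumes 512-byte logical sectors, so that dd's skip= value equals the LBA. It does not settle whether the kernel or the drive reads more than one sector underneath.

```shell
#!/bin/sh
# probe_sectors DEVICE FIRST_LBA LAST_LBA   (hypothetical helper)
# Reads each 512-byte sector in the range with its own dd invocation
# and prints the LBAs for which dd reported a failure.
probe_sectors() {
    dev=$1
    lba=$2
    last=$3
    while [ "$lba" -le "$last" ]; do
        # skip= is counted in units of bs, so with bs=512 it is the LBA
        if ! dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "read error at LBA $lba"
        fi
        lba=$((lba + 1))
    done
}
```

Something like `probe_sectors /dev/ad4 1234560 1234575` would then walk the 16 sectors covered by one bs=8192 block (the LBA numbers here are placeholders, not the ones from dmesg).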
A selective scan on ad4 did not reveal any errors today: it 'completed
without error'. On ad6 it's a whole lot slower; at the time of writing
it's at 2/3.
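As a sketch of how a selective scan around one suspect LBA could be started: the helper below only builds and prints the smartctl command line, so it can be reviewed before running it. selective_cmd and the default margin of 1000 sectors are assumptions made for this example; `-t select,N-M` is smartctl's selective self-test option.

```shell
#!/bin/sh
# selective_cmd DEVICE LBA [MARGIN]   (hypothetical helper)
# Prints a smartctl command that starts a selective self-test covering
# MARGIN sectors on either side of LBA (default margin: 1000).
selective_cmd() {
    dev=$1
    lba=$2
    margin=${3:-1000}
    start=$((lba - margin))
    # Clamp the start of the span at sector 0
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    end=$((lba + margin))
    echo "smartctl -t select,${start}-${end} ${dev}"
}
```

Progress and results of the scan should then show up in the output of `smartctl -l selective` for the same device.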
All HD vendors have their own quirks/ordeals right now. You
basically just have to go with one who works well for you, then if
things start going downhill, switch to another. None of them are
perfect.
I figured as much. What irritates though is that I've had consistent
problems with 4 disks in this specific system, but not (such) issues
with any other disk in other systems I've had. I generally replace disks
when I grow out of them, not because they break down.
What this indicates to me is that if a disk falls off the bus on an
ICH9 controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an
absurd number of interrupts generated from the ICH9. My guess is
FreeBSD isn't doing something correctly with the controller when this
happens; maybe certain commands aren't being sent back to the
controller or handling of certain events is being done improperly
when it comes to ICH9 (or possibly earlier ICH revisions too). This
should be *very* easy to reproduce.
Unfortunately I'm not really in a position to help reproduce this or
test possible fixes; downtime is currently very unwelcome. Although
one of the previous disks indeed fell off the bus entirely (couldn't get
it back with atacontrol either), that hasn't happened again so far. I
only see timeouts (and a few days ago read errors on ad4) which gmirror
doesn't like. I guess those aren't that simple to reproduce (apart from
on my system ;).
If you see any of your disks on the ICH9 controller fall off the bus
or report ATA errors (doesn't matter what kind), please make note of
the timestamp (should be in the kernel log), and ASAP run "smartctl
-a" on the disk. You should compare attributes before and after the
event.
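One possible way to do the before/after comparison is to keep timestamped snapshots of the "smartctl -a" output and diff them. This is a minimal sketch; the helper names, the snapshot directory and the file naming scheme are all assumptions made for illustration.

```shell
#!/bin/sh
# smart_snapshot DEVICE DIR   (hypothetical helper)
# Saves the full "smartctl -a" output to a timestamped file in DIR
# and prints the name of that file.
smart_snapshot() {
    dev=$1
    dir=$2
    out="$dir/smart-$(basename "$dev")-$(date +%Y%m%d-%H%M%S).txt"
    # smartctl's exit status is a bit mask and can be nonzero even
    # though the report was written, so it is not treated as fatal.
    smartctl -a "$dev" > "$out" || true
    echo "$out"
}

# smart_changes BEFORE AFTER   (hypothetical helper)
# Prints only the lines that differ between two snapshots:
# "<" lines come from BEFORE, ">" lines from AFTER.
smart_changes() {
    { diff "$1" "$2" || true; } | grep '^[<>]'
}
```

Taking one snapshot while the disk behaves and another right after an event logged by the kernel, then running smart_changes on the two files, should show which attributes moved.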
You might also want to consider using smartd, which can log SMART
attribute changes on its own. Note that you might have to tune the
arguments in smartd.conf to ignore some attributes which fluctuate
naturally (such as drive temperature and seek error rate).
I've configured smartd to poll both disks every 5 minutes. I -think- the
issues happen specifically under load: the periodic scripts of the host
and its 4 jails appear to trigger it sometimes. At that time I'm
normally trying to get some sleep, so smartd will have to do for now.
Although I'll run a "smartctl -a" asap anyway.
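For anyone wanting a similar setup, a smartd.conf along these lines should work. This is a sketch rather than a copy of the running config: the file location is assumed to be where the FreeBSD port installs it, and attribute IDs 194 (temperature) and 7 (seek error rate) are the usual numbers but can differ per vendor, so check them against the "smartctl -a" output first.

```
# /usr/local/etc/smartd.conf (assumed location)
# -a     : monitor health status, error logs and attribute changes
# -I ID  : ignore attribute ID when tracking changes in attribute values
/dev/ad4 -a -I 194 -I 7
/dev/ad6 -a -I 194 -I 7
```

The 5 minute polling interval is not set in this file; it is passed to smartd itself as "-i 300" (the value is in seconds), for example through the smartd flags variable in rc.conf if the port's rc script provides one.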
--
Pieter
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable