On Feb 21, 2007, at 5:20 PM, Eric Schrock wrote:
> On Wed, Feb 21, 2007 at 03:35:06PM -0700, Gregory Shaw wrote:
>> Below is another paper on drive failure analysis; this one won best
>> paper at USENIX:
>>
>> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>>
>> What I found most interesting was the idea that drives don't fail
>> outright most of the time. They can slow down operations and
>> slowly die.
> Seems like there are two pieces you're suggesting here:
>
> 1. Some sort of background process to proactively find errors on disks
>    in use by ZFS. This will be accomplished by a background scrubbing
>    option, dependent on the block-rewriting work Matt and Mark are
>    working on. This will allow something like "zpool set scrub=2weeks",
>    which will tell ZFS to "scrub my data at an interval such that all
>    data is touched over a 2-week period". This will test reading from
>    every block and verifying checksums. Stressing write failures is a
>    little more difficult.
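The pacing math for that is modest. Here's a rough sketch, with a
made-up pool size (this isn't ZFS code, just the arithmetic):

# Rough pacing sketch: the read rate a scrubber must sustain so every
# allocated block gets touched once per interval. Numbers are made up.

SECONDS_PER_WEEK = 7 * 24 * 3600

def scrub_rate(allocated_bytes, interval_weeks):
    """Sustained read rate (bytes/sec) for one full pass per interval."""
    return allocated_bytes / (interval_weeks * SECONDS_PER_WEEK)

# A pool with 2 TiB allocated, scrubbed over 2 weeks:
rate = scrub_rate(2 * 1024**4, 2)
print(f"{rate / 1024**2:.1f} MiB/s")   # ~1.7 MiB/s of background reads

At that rate, a continuous low-priority scrub should be nearly
invisible to foreground I/O.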
I was thinking of something similar to a scrub. An ongoing process
seemed too intrusive. I'd envisioned a cron job, similar to a scrub
(or defrag), that could be run periodically to show any differences
in disk performance over time.
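Something along these lines, as a sketch; the device paths, history
file, and threshold are all made up:

#!/usr/bin/env python3
# Sketch of the cron-job idea: time a fixed sequential read from each
# disk, append the result to a history file, and warn on drift from
# the historical average. Paths and thresholds are hypothetical.

import json
import os
import time

DEVICES = ["/dev/rdsk/c0t0d0s0", "/dev/rdsk/c0t1d0s0"]   # hypothetical
HISTORY = "/var/tmp/disk-perf-history.json"
TEST_BYTES = 64 * 1024 * 1024    # 64 MiB per disk per run
DRIFT = 0.25                     # warn at a 25% slowdown

def time_read(dev):
    """Seconds to read TEST_BYTES sequentially from dev."""
    start = time.monotonic()
    with open(dev, "rb", buffering=0) as f:
        remaining = TEST_BYTES
        while remaining > 0:
            chunk = f.read(min(1 << 20, remaining))
            if not chunk:
                break
            remaining -= len(chunk)
    return time.monotonic() - start

def main():
    history = {}
    if os.path.exists(HISTORY):
        with open(HISTORY) as f:
            history = json.load(f)
    for dev in DEVICES:
        elapsed = time_read(dev)
        past = history.setdefault(dev, [])
        if past:
            avg = sum(past) / len(past)
            if elapsed > avg * (1 + DRIFT):
                print(f"WARNING: {dev} took {elapsed:.2f}s "
                      f"(historical average {avg:.2f}s)")
        past.append(elapsed)
    with open(HISTORY, "w") as f:
        json.dump(history, f)

if __name__ == "__main__":
    main()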
> 2. Distinguish "slow" drives from "normal" drives and proactively mark
>    them faulted. This shouldn't require an explicit "zpool dft", as
>    we should be watching the response times of the various drives and
>    keeping this as a statistic. We want to incorporate this information
>    to allow better allocation amongst slower and faster drives.
>    Determining that a drive is "abnormally slow" is much more difficult,
>    though it could theoretically be done if we had some basis - either
>    historical performance for the same drive or comparison to identical
>    drives (manufacturer/model) within the pool. While we've thought
>    about these same issues, there is currently no active effort to
>    keep track of these statistics or do anything with them.
I thought this would be very difficult to determine, as a slow disk
could be a transient problem.
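That said, the peer-comparison half looks tractable if the samples are
averaged over a window long enough to smooth out transients. A rough
sketch, with invented service times:

# Sketch of the peer-comparison idea: flag any drive whose mean
# service time stands well apart from identical drives in the pool.
# The service times (ms) below are invented.

from statistics import median

def slow_outliers(service_ms, factor=2.0):
    """Drives whose mean service time exceeds factor * the pool median.
    Using the median keeps one bad drive from skewing the baseline."""
    baseline = median(service_ms.values())
    return [dev for dev, ms in service_ms.items() if ms > factor * baseline]

pool = {"c0t0d0": 6.1, "c0t1d0": 5.8, "c0t2d0": 31.4, "c0t3d0": 6.3}
print(slow_outliers(pool))   # ['c0t2d0']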
Me, I like tools that give me information I can work with. Fully
automated systems always seem to cause more problems than they solve.
For instance, if I have a drive on a PC using a shared IDE bus, is it
the disk that is slow, or the connection method? It's obviously the
latter, but finding that programmatically will be very difficult.
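One crude way to tell the two apart: read each disk alone, then both
at once. If the aggregate barely beats a single disk, the shared bus
is the limit. A sketch, with made-up device paths:

# Sketch: read each disk alone, then both concurrently. If aggregate
# throughput is no better than one disk by itself, the shared bus
# (not the disk) is the bottleneck. Paths are made up.

import threading
import time

MB = 1 << 20
TEST_BYTES = 64 * MB

def read_mb_per_sec(dev):
    """Sequential-read throughput of dev in MB/s."""
    start = time.monotonic()
    with open(dev, "rb", buffering=0) as f:
        remaining = TEST_BYTES
        while remaining > 0:
            chunk = f.read(min(MB, remaining))
            if not chunk:
                break
            remaining -= len(chunk)
    return (TEST_BYTES / MB) / (time.monotonic() - start)

def aggregate_mb_per_sec(devs):
    """Total MB/s with every device read at the same time."""
    threads = [threading.Thread(target=read_mb_per_sec, args=(d,))
               for d in devs]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (len(devs) * TEST_BYTES / MB) / (time.monotonic() - start)

devs = ["/dev/dsk/c0d0s0", "/dev/dsk/c0d1s0"]   # master/slave, one cable
solo = [read_mb_per_sec(d) for d in devs]
both = aggregate_mb_per_sec(devs)
if both < 1.5 * min(solo):
    print("aggregate barely beats one disk: suspect the shared bus")
else:
    print("disks scale independently: suspect the disk itself")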
I like the idea of a dft for testing a disk at the operator's
discretion. One benefit of this could be an objective performance
baseline for disks and arrays.
Btw, it does help. :-)
> These two things combined should avoid the need for an explicit
> fitness test.
>
> Hope that helps,
>
> - Eric
> --
> Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
-----
Gregory Shaw, IT Architect
Phone: (303) 272-8817 (x78817)
ITCTO Group, Sun Microsystems Inc.
500 Eldorado Blvd, UBRM02-157    [EMAIL PROTECTED] (work)
Broomfield, CO 80021             [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've won." - Linus Torvalds
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss