On Feb 21, 2007, at 5:20 PM, Eric Schrock wrote:

On Wed, Feb 21, 2007 at 03:35:06PM -0700, Gregory Shaw wrote:
Below is another paper on drive failure analysis; this one won best
paper at USENIX:

http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html

What I found most interesting was the idea that most of the time
drives don't fail outright. They can slow down operations and
slowly die.

Seems like there are two pieces you're suggesting here:

1. Some sort of background process to proactively find errors on disks
   in use by ZFS.  This will be accomplished by a background scrubbing
   option, dependent on the block-rewriting work Matt and Mark are
   working on.  This will allow something like "zpool set scrub=2weeks",
   which will tell ZFS to "scrub my data at an interval such that all
   data is touched over a 2 week period".  This will test reading from
   every block and verifying checksums.  Stressing write failures is a
   little more difficult.


I was thinking of something similar to a scrub. An ongoing process seemed too intrusive. I'd envisioned a cron job similar to a scrub (or defrag) that could be run periodically to show any differences in disk performance over time.
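
Something along the lines of the script below is what I had in mind: a cron-driven job that kicks off a scrub and logs each disk's average service time, so successive runs can be diffed to spot a drive drifting slower. The pool name, log path, and iostat column layout are assumptions for illustration, not anything ZFS provides today.

#!/usr/bin/env python
# Hypothetical cron job: start a scrub, then snapshot per-disk service
# times while the scrub load is running.  Successive CSV rows can be
# compared to watch a disk's latency drift over time.
import csv
import subprocess
import time

POOL = "tank"                       # assumed pool name
LOG = "/var/log/disk-latency.csv"   # assumed log location

def sample_service_times():
    # "iostat -xn 5 2" prints two reports; the first is the since-boot
    # average, the second covers a live 5-second window, so later lines
    # simply overwrite earlier ones in the dict.  On Solaris the 8th
    # column is asvc_t (average service time, ms) and the device name
    # is last; adjust the indexes for other platforms.
    out = subprocess.check_output(["iostat", "-xn", "5", "2"], text=True)
    times = {}
    for line in out.splitlines():
        cols = line.split()
        if len(cols) == 11 and cols[0][0].isdigit():
            times[cols[10]] = float(cols[7])    # device -> asvc_t (ms)
    return times

def main():
    # "zpool scrub" returns immediately and the scrub runs in the
    # background, so the sample below is taken under scrub load.
    subprocess.check_call(["zpool", "scrub", POOL])
    with open(LOG, "a", newline="") as f:
        writer = csv.writer(f)
        for dev, svc_t in sorted(sample_service_times().items()):
            writer.writerow([int(time.time()), dev, svc_t])

if __name__ == "__main__":
    main()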

2. Distinguish "slow" drives from "normal" drives and proactively mark
   them faulted.  This shouldn't require an explicit "zpool dft", as
   we should be watching the response times of the various drives and
   keeping this as a statistic.  We want to incorporate this information
   to allow better allocation amongst slower and faster drives.
   Determining that a drive is "abnormally slow" is much more difficult,
   though it could theoretically be done if we had some basis - either
   historical performance for the same drive or comparison to identical
   drives (manufacturer/model) within the pool.  While we've thought
   about these same issues, there is currently no active effort to keep
   track of these statistics or do anything with them.


I thought this would be very difficult to determine, as a slow disk could be a transient problem.
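
For what it's worth, here is a rough sketch of the heuristic Eric describes, written to tolerate exactly that: compare each drive's service time to the pool median, and only flag a drive after it stays an outlier for several consecutive samples, so a transient blip never trips it. The ratio, the window count, and the sampling interface are all assumptions, not anything ZFS does today.

# Hedged sketch: flag a drive as "abnormally slow" only when it has
# been a persistent outlier relative to its pool-mates.
from collections import defaultdict

SLOW_RATIO = 3.0    # assumed: 3x the pool median counts as an outlier
PERSIST = 5         # assumed: must stay slow for 5 consecutive samples

_strikes = defaultdict(int)

def median(values):
    vals = sorted(values)
    mid = len(vals) // 2
    return vals[mid] if len(vals) % 2 else (vals[mid - 1] + vals[mid]) / 2

def check_sample(svc_times):
    # svc_times: {device: avg service time in ms} for one interval.
    # Returns the devices that have been outliers for PERSIST samples.
    baseline = median(svc_times.values())
    flagged = []
    for dev, t in svc_times.items():
        if baseline > 0 and t > SLOW_RATIO * baseline:
            _strikes[dev] += 1
            if _strikes[dev] >= PERSIST:
                flagged.append(dev)
        else:
            _strikes[dev] = 0   # one normal sample resets: transient, not faulty
    return flagged

Comparing against the pool median rather than an absolute number sidesteps the per-model baseline problem Eric mentions, at the cost of assuming most drives in the pool are healthy at any given moment.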

Me, I like tools that give me information I can work with. Fully automated systems always seem to cause more problems than they solve.

For instance, if I have a drive on a PC using a shared IDE bus, is it the disk that is slow, or the connection method? It's obviously the latter, but finding that programmatically will be very difficult.
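
To make that call programmatically you would probably need topology information: group devices by the bus or controller they share, and see whether the slowness is confined to one disk or common to the whole group. A rough sketch, with the topology map and device names made up for illustration:

# If every device on a shared bus is slow, suspect the bus; if only
# one is, suspect the disk.
BUS_TOPOLOGY = {            # assumed mapping: device -> shared bus
    "c0t0d0": "ide0", "c0t1d0": "ide0",    # master/slave on one cable
    "c1t0d0": "ide1",
}

def blame(slow_devices):
    # Returns ("bus", name) when every device on a multi-disk bus is
    # slow, else ("disk", dev) for each individually slow device.
    by_bus = {}
    for dev, bus in BUS_TOPOLOGY.items():
        by_bus.setdefault(bus, []).append(dev)
    verdicts = []
    for bus, devs in by_bus.items():
        slow = [d for d in devs if d in slow_devices]
        if len(slow) == len(devs) and len(devs) > 1:
            verdicts.append(("bus", bus))      # shared path is the suspect
        else:
            verdicts.extend(("disk", d) for d in slow)
    return verdicts

# e.g. blame({"c0t0d0", "c0t1d0"}) -> [("bus", "ide0")]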

I like the idea of a dft for testing a disk on demand, at the administrator's discretion. One benefit of this could be an objective performance baseline for disks and arrays.

Btw, it does help.  :-)

These two things combined should avoid the need for an explicit fitness
test.

Hope that helps,

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock

-----
Gregory Shaw, IT Architect
Phone: (303) 272-8817 (x78817)
ITCTO Group, Sun Microsystems Inc.
500 Eldorado Blvd, UBRM02-157               [EMAIL PROTECTED] (work)
Broomfield, CO 80021                          [EMAIL PROTECTED] (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds



_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
