On Wed, Feb 21, 2007 at 03:35:06PM -0700, Gregory Shaw wrote:
> Below is another paper on drive failure analysis, this one won best
> paper at usenix:
>
> http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
>
> What I found most interesting was the idea that drives don't fail
> outright most of the time. They can slow down operations, and
> slowly die.
Seems like there are two pieces you're suggesting here:

1. Some sort of background process to proactively find errors on disks in
   use by ZFS.

   This will be accomplished by a background scrubbing option, dependent
   on the block-rewriting work Matt and Mark are working on.  It will
   allow something like "zpool set scrub=2weeks", which will tell ZFS to
   "scrub my data at an interval such that all data is touched over a
   2 week period".  This will test reading from every block and verifying
   checksums (a rough sketch of the pacing arithmetic appears below).
   Stressing write failures is a little more difficult.

2. Distinguish "slow" drives from "normal" drives and proactively mark
   them faulted.

   This shouldn't require an explicit "zpool dft", as we should be
   watching the response times of the various drives and keeping this as
   a statistic.  We want to incorporate this information to allow better
   allocation amongst slower and faster drives.  Determining that a drive
   is "abnormally slow" is much more difficult, though it could
   theoretically be done if we had some basis for comparison - either
   historical performance of the same drive, or comparison to identical
   drives (manufacturer/model) within the pool (see the second sketch
   below).  While we've thought about these same issues, there is
   currently no active effort to keep track of these statistics or do
   anything with them.

These two things combined should avoid the need for an explicit fitness
test.

Hope that helps,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
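Below is a minimal sketch of the pacing arithmetic behind a "scrub
everything once per period" setting like "zpool set scrub=2weeks".  It is
not ZFS code: the 4 TB of allocated data and the 1 MB chunk size are
assumptions chosen only for illustration.

#include <stdint.h>
#include <stdio.h>

#define SCRUB_PERIOD_SECS   (14 * 24 * 3600)    /* "2weeks" */
#define SCRUB_CHUNK_BYTES   (1 << 20)           /* scrub in 1 MB chunks */

int
main(void)
{
        uint64_t allocated = 4ULL << 40;        /* assume 4 TB allocated */

        /* Bytes per second needed to touch every allocated byte once per period. */
        uint64_t rate = allocated / SCRUB_PERIOD_SECS;

        /* Seconds to pause between 1 MB chunks to hold that rate. */
        double interval = (double)SCRUB_CHUNK_BYTES / (double)rate;

        printf("scrub rate: %llu bytes/sec, one 1 MB chunk every %.2f sec\n",
            (unsigned long long)rate, interval);

        /*
         * A real scrubber would walk the allocated blocks, read each chunk,
         * verify its checksum, and sleep roughly "interval" seconds between
         * chunks so the whole pool is covered once per requested period.
         */
        return (0);
}

The point of pacing it this way is just that the scrub rate scales with
how much data is actually allocated, so a mostly empty pool is touched
gently while a nearly full one still completes within the requested
period.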
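And a second sketch of keeping a per-drive response-time statistic and
comparing each drive against its peers to spot an "abnormally slow" one.
Again, this is illustration only, not ZFS code: the EWMA smoothing weight
and the 3x-of-peer-average threshold are arbitrary assumptions.

#include <stdio.h>

#define NDISKS  4
#define ALPHA   0.125           /* weight given to each new latency sample */

struct vdev_stat {
        const char *name;
        double avg_latency_ms;  /* smoothed per-I/O service time */
};

/* Fold one observed I/O latency into the drive's running average. */
static void
record_latency(struct vdev_stat *vs, double latency_ms)
{
        vs->avg_latency_ms = ALPHA * latency_ms +
            (1.0 - ALPHA) * vs->avg_latency_ms;
}

int
main(void)
{
        struct vdev_stat disks[NDISKS] = {
                { "c0t0d0", 6.1 }, { "c0t1d0", 5.8 },
                { "c0t2d0", 47.3 }, { "c0t3d0", 6.4 },
        };
        double sum = 0.0;
        int i;

        record_latency(&disks[2], 60.0);        /* another slow I/O on disk 2 */

        for (i = 0; i < NDISKS; i++)
                sum += disks[i].avg_latency_ms;

        for (i = 0; i < NDISKS; i++) {
                /* Compare each drive against the average of its peers. */
                double peers = (sum - disks[i].avg_latency_ms) / (NDISKS - 1);

                if (disks[i].avg_latency_ms > 3.0 * peers)
                        printf("%s looks abnormally slow: %.1f ms vs peer avg %.1f ms\n",
                            disks[i].name, disks[i].avg_latency_ms, peers);
        }
        return (0);
}

The same running average could instead be compared against the drive's
own history; either way, the statistic has to be kept continuously, which
is why a one-shot fitness test would miss a drive that degrades slowly.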