reply below...

Torrey McMahon wrote:
Richard Elling - PAE wrote:

This question was asked many times in this thread.  IMHO, it is the
single biggest reason we should implement ditto blocks for data.

We did a study of disk failures in an enterprise RAID array a few
years ago.  One failure mode stands head and shoulders above the
others: non-recoverable reads.  A short summary:

  2,919 total errors reported
  1,926 (66.0%) operations succeeded (e.g., write failed, auto reallocated)
    961 (32.9%) unrecovered errors (of all types)
     32 (1.1%) other (e.g., device not ready)
    707 (24.2%) non-recoverable reads (a subset of the unrecovered errors)

In other words, non-recoverable reads account for 73.6% (707 of 961) of the
unrecovered failures, a category that also includes complete drive failures.
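
The percentages can be reproduced directly from the raw counts. A quick
back-of-the-envelope check in Python, using only the figures quoted above
(nothing here is new data):

    # Raw counts from the enterprise RAID array study summarized above.
    total       = 2919   # total errors reported
    succeeded   = 1926   # operations that ultimately succeeded
    unrecovered = 961    # unrecovered errors of all types
    other       = 32     # e.g. device not ready
    nr_reads    = 707    # non-recoverable reads (subset of unrecovered)

    assert succeeded + unrecovered + other == total

    print("non-recoverable reads: %.1f%% of all errors"
          % (100.0 * nr_reads / total))          # 24.2%
    print("non-recoverable reads: %.1f%% of unrecovered failures"
          % (100.0 * nr_reads / unrecovered))    # 73.6%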


Does this take cascading failures into account? How often do you get an unrecoverable read yet are still able to perform operations on the target media? That's where ditto blocks could come in handy, modulo the concerns around utilities and quotas.

No event analysis has been done here; we do have the data, but the task is
time consuming.

Non-recoverable reads do not necessarily represent permanent failures.  In the
case of a RAID array, the data should be reconstructed from redundancy and a
rewrite + verify attempted, with the possibility of sparing the sector.  ZFS
can reconstruct the data and relocate the block.
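
To make the recover-and-rewrite idea concrete, here is a minimal sketch of
that repair loop in Python.  It is illustrative only, not ZFS code; the block
size, the SHA-256 checksum, and the layout of redundant copies are assumptions
made for the example.

    import hashlib

    BLOCK_SIZE = 4096  # assumed block size for this example

    def checksum(data):
        # Stand-in for the checksum stored with the block pointer.
        return hashlib.sha256(data).digest()

    def read_block(dev, offset):
        dev.seek(offset)
        return dev.read(BLOCK_SIZE)

    def write_block(dev, offset, data):
        dev.seek(offset)
        dev.write(data)
        dev.flush()

    def self_healing_read(dev, offsets, expected_sum):
        # offsets: locations of the redundant copies (e.g. ditto blocks)
        # expected_sum: checksum recorded for this block
        good, bad = None, []
        for off in offsets:
            data = read_block(dev, off)
            if checksum(data) == expected_sum:
                good = data
            else:
                bad.append(off)
        if good is None:
            raise IOError("all copies failed verification")
        # Repair: rewrite each damaged copy from the good one, then re-verify.
        for off in bad:
            write_block(dev, off, good)
            if checksum(read_block(dev, off)) != expected_sum:
                # Rewrite did not stick; a real system would relocate the
                # block (or spare the sector) at this point.
                raise IOError("rewrite failed at offset %d" % off)
        return good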

I have some (voluminous) data on disk error rates as reported through kstat,
and I plan to use it to get a better sense of the actual failure rates.  The
disk vendors specify non-recoverable read error rates, but we think those
specifications are overly pessimistic for the first few years of a drive's
life.  We'd like a better sense of how to model this for a variety of
applications concerned with archival periods.
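
As a starting point for that kind of modeling, a simple sketch: treat the
vendor's specified non-recoverable read error rate as a per-bit probability
and ask how likely at least one unrecoverable read is over a given volume of
reads.  The rate (1 error per 1e14 bits) and the read volumes below are
illustrative assumptions, not measured values, and the vendor rate is, as
noted above, probably too pessimistic for young drives.

    import math

    def p_unrecoverable_read(bytes_read, ber=1e-14):
        # bytes_read: total bytes read over the period of interest
        # ber: vendor-specified errors per bit read (assumed 1e-14 here)
        # Assumes independent, uniformly distributed bit errors.
        bits = bytes_read * 8
        return 1.0 - math.exp(-ber * bits)   # Poisson approximation

    # e.g. one full read of a 500 GB drive vs. weekly scrubs for 5 years
    one_pass   = 500e9
    five_years = 500e9 * 52 * 5
    print("single full read : %4.1f%%" % (100 * p_unrecoverable_read(one_pass)))
    print("weekly scrub, 5y : %4.1f%%" % (100 * p_unrecoverable_read(five_years)))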
 -- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
