Daniel Carosone wrote:
Sorry, don't have a thread reference
to hand just now.
http://www.opensolaris.org/jive/thread.jspa?threadID=100296
Note that there's little empirical evidence that this is directly applicable to
the kinds of errors (single bit, or otherwise) that a single failing disk
medium would produce. Modern disks already include and rely on a lot of ECC as
part of ordinary operation, below the level usually seen by the host. These
mechanisms seem unlikely to return a read with just one (or a few) bit errors.
This strikes me, if implemented, as potentially more applicable to errors
introduced from other sources (controller/bus transfer errors, non-ecc memory,
weak power supply, etc). Still handy.
Adding further data protection options is commendable. On the other
hand, I feel there are important gaps in the existing feature set that
deserve a higher priority. Chief among them is automatic recovery from
uberblock / transaction group problems (see Victor Latushkin's recovery
technique, which I linked to in a recent post), followed closely by a
zpool shrink or zpool remove command that lets you resize pools and
disconnect devices without replacing them. I saw postings or blog
entries from about 6 months ago saying this code was 'near' as part of
solving a resilvering bug, but I have not seen anything since. I think
many users would like to see improved resilience in the existing
features, and the addition of long-requested features, before other new
features are added. (Exceptions can readily be made for new features
that are trivially easy to implement and/or do not compete for
developer time with higher priority features.)
In the meantime, there is the copies property, which you can use on
single disks. With today's very large drives, losing half the capacity
to copies isn't as traumatic for many people as it was in days gone by
(e.g. consider a 500 GB hard drive with copies=2 versus a 128 GB SSD).
Of course, if you need all that space then it is a no-go.
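
As a rough sketch of that arithmetic (Python, purely illustrative): the
helper name usable_gb is made up for this example, and it assumes every
data block is stored exactly `copies` times while ignoring metadata and
other overhead.

```python
def usable_gb(raw_gb: float, copies: int) -> float:
    """Rough usable capacity when each data block is stored `copies` times.

    Assumption: metadata and filesystem overhead are ignored.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return raw_gb / copies


# The example from the text: a 500 GB disk with copies=2 vs a 128 GB SSD.
hdd_usable = usable_gb(500, 2)
ssd_usable = usable_gb(128, 1)
print(f"HDD with copies=2: {hdd_usable:.0f} GB, SSD: {ssd_usable:.0f} GB")
```

Even after halving, the large drive still offers roughly twice the
space of the small SSD in this estimate.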
Related threads that also had ideas on using spare CPU cycles for
brute-force recovery of single-bit errors using the checksum:
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data
corruption?
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg14720.html
[zfs-discuss] integrated failure recovery thoughts (single-bit correction)
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg18540.html
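
The brute-force idea discussed in those threads can be sketched roughly
as follows. This is only an illustration in Python, not ZFS code:
SHA-256 from hashlib stands in for whatever checksum the block actually
uses, and the names checksum and recover_single_bit are made up for the
example. It flips each bit in turn and keeps the first candidate whose
checksum matches the stored one.

```python
import hashlib
from typing import Optional


def checksum(data: bytes) -> bytes:
    # Stand-in checksum; SHA-256 is an assumption made for illustration.
    return hashlib.sha256(data).digest()


def recover_single_bit(block: bytes, expected: bytes) -> Optional[bytes]:
    """Return a block whose checksum equals `expected`, or None.

    Tries the block as-is first, then every possible single-bit flip.
    """
    if checksum(block) == expected:
        return block  # nothing to repair
    candidate = bytearray(block)
    for byte_index in range(len(candidate)):
        for bit in range(8):
            candidate[byte_index] ^= 1 << bit   # flip one bit
            if checksum(candidate) == expected:
                return bytes(candidate)
            candidate[byte_index] ^= 1 << bit   # undo the flip
    return None  # damage is wider than a single bit


# Usage: corrupt one bit of a small block, then repair it.
original = b"zfs block payload" * 8
good_sum = checksum(original)
damaged = bytearray(original)
damaged[5] ^= 0x10
repaired = recover_single_bit(bytes(damaged), good_sum)
print(repaired == original)
```

The cost grows with block size: a 128 KiB block has 1,048,576 bits, so
the worst case is about a million checksum computations per damaged
block. That is why the threads frame this as a use for spare CPU cycles.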
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss