Hello all, I have a new "crazy idea" of the day ;)
Some years ago there was an idea proposed on one of the ZFS developers' blogs (maybe Jeff's? sorry, I can't find the link now) that went roughly like this: modern disks keep ECC/CRC codes for each sector and use them to verify read-in data. If the disk fails to read a sector correctly, it retries harder and, if possible, reallocates the LBA to a spare-sector region. This leads to extra random I/O for what should be linearly-numbered LBAs, and it wastes platter space on spare sectors and checksums - at least compared to the stronger error detection and redundancy of ZFS checksums. Besides, attempts to re-read a faulty sector may succeed, may return undetected garbage, or may take some time (maybe seconds) if the retries fail consistently; after that the block is marked bad and the data is lost. The article went on to suggest: "let's get an OEM vendor to sell us the same disks without these kludges, and we'll get (20%?) more platter speed and capacity, put to better use by ZFS's own error-detection and repair mechanisms."

I've recently had something of an opposite thought: yes, ZFS redundancy is good - but it is also expensive in raw disk space. This is especially painful on space-constrained hardware like laptops and home NASes, where doubling the number of HDDs (for mirrors) or adding tens of percent of storage for raidz is often impractical for whatever reason. Current ZFS checksums let us detect errors, but for recovery to actually work there must be a redundant copy and/or parity block that is available and valid.

Hence the question: why not put ECC information into ZFS blocks themselves? IMHO, pluggable ECC (like pluggable compression or the selectable checksums - in this case ECC algorithms able to recover one or two flipped bits, for example) would be much cheaper in disk space than redundancy (a few percent instead of 25-50%), and would still allow recovery from certain errors, such as on-disk or on-wire bit rot, even in single-disk ZFS pools. It could be an inheritable per-dataset attribute, like the compression, encryption, dedup or checksum properties. (A rough sketch of the kind of code and overhead I have in mind is in the P.S. below.)

Rewriting recovered "faulted" blocks into currently free space is already part of ZFS; the new part is that it might have to track a "permanently bad" block list and a shrinking addressable space on each leaf vdev. There should also be a mechanism to retest and clear such blocks, e.g. when a faulty drive or LUN is replaced by a new one (perhaps by dd'ing the old drive onto the new one and swapping it in while the pool is offline) - probably a special scrub-like zpool command, also invoked during a normal scrub.

This could be combined with the wish for OEM disks that lack hardware ECC/spare sectors in exchange for more performance, although I'm not sure how well that would work in practice: the drive maker's in-depth knowledge of how to coax a read out of an initially "faulty" sector - by tweaking voltages or platter speed or whatever - may be invaluable and not replaceable in software.

What do you think? Doable? Useful? Why not, if not? ;)

Thanks,
//Jim Klimov
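
P.S. To make the "few percent" claim a bit more concrete, here is a minimal toy sketch (my own illustration, not anything from the ZFS source) of a Hamming-style single-bit-correcting code over a 512-byte buffer. The function names (ecc_compute, ecc_repair) and the per-sector granularity are just assumptions for the example; a real pluggable ECC would presumably work per ZFS block and sit alongside the existing block checksum, which would also catch the rare miscorrection. The point is the overhead: about 13-14 ECC bits per 512 bytes of data, i.e. well under 1%.

/*
 * Toy single-bit error correction over a 512-byte buffer.
 * ECC = XOR of the 1-based indexes of all set bits; if exactly one
 * bit later flips, the syndrome (stored ECC xor recomputed ECC) is
 * exactly that bit's index. Multi-bit errors can alias and are not
 * handled here - in the scheme above the block checksum would have
 * to catch those.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_BITS (512 * 8)

static uint32_t ecc_compute(const uint8_t *buf)
{
	uint32_t ecc = 0;

	for (uint32_t i = 0; i < SECTOR_BITS; i++) {
		if (buf[i / 8] & (1u << (i % 8)))
			ecc ^= i + 1;	/* 1-based so index 0 is not lost */
	}
	return (ecc);
}

/*
 * Returns 0 if the buffer matches the stored ECC, 1 if a single-bit
 * error was corrected in place, -1 if the syndrome is out of range
 * (uncorrectable here).
 */
static int ecc_repair(uint8_t *buf, uint32_t stored_ecc)
{
	uint32_t syndrome = ecc_compute(buf) ^ stored_ecc;

	if (syndrome == 0)
		return (0);
	if (syndrome > SECTOR_BITS)
		return (-1);

	uint32_t bit = syndrome - 1;
	buf[bit / 8] ^= 1u << (bit % 8);	/* flip the bad bit back */
	return (1);
}

int main(void)
{
	uint8_t sector[512];

	memset(sector, 0xA5, sizeof (sector));
	uint32_t ecc = ecc_compute(sector);	/* ~13 bits per 512 bytes */

	sector[100] ^= 0x10;			/* simulate one bit of rot */

	printf("repair result: %d\n", ecc_repair(sector, ecc));
	printf("byte restored: %s\n", sector[100] == 0xA5 ? "yes" : "no");
	return (0);
}

A real implementation would probably want something stronger than this (say, a Reed-Solomon-style code that can correct a small burst per block), but even that should stay in the low single-digit percent range - versus 25-50% for mirrors or raidz.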