Hello all, I have a new "crazy idea" of the day ;)
Some years ago there was an idea proposed on one of the ZFS developers' blogs (maybe Jeff's? sorry, I can't find the link now) that went roughly like this: modern disks keep ECC/CRC codes for each sector and use them to verify read-in data. If the disk fails to read a sector correctly, it retries harder and, if possible, reallocates the LBA to a spare-sector region. This leads to extra random I/O for what should be linearly-numbered LBAs, and it wastes platter space on spare sectors and checksums - at least compared to the stronger error detection and redundancy of ZFS checksums. Besides, attempts to re-read a faulty sector may succeed, may return undetected garbage, or may take some time (maybe seconds) if the retries fail consistently; after that the block is marked bad and the data is lost. The article went on to suggest: "let's get an OEM vendor to sell us the same disks without these kludges, and we'll get (20%?) more platter speed and capacity, put to better use by ZFS's own error-detection and repair mechanisms."

I've recently had something of an opposite thought: yes, ZFS redundancy is good - but it is also expensive in raw disk space. This is especially painful on space-constrained hardware like laptops and home NASes, where doubling the number of HDDs (for mirrors) or adding tens of percent of storage for raidz is often impractical for whatever reason. Current ZFS checksums let us detect errors, but for recovery to actually work there must be a redundant copy and/or parity block that is available and valid.

Hence the question: why not put ECC information into ZFS blocks themselves? IMHO, pluggable ECC (like pluggable compression or the selectable checksums - in this case ECC algorithms able to recover one or two flipped bits, for example) would be much cheaper in disk space than redundancy (a few percent instead of 25-50%), and would still allow recovery from certain errors, such as on-disk or on-wire bit rot, even in single-disk ZFS pools. It could be an inheritable per-dataset attribute, like the compression, encryption, dedup or checksum properties. (A rough sketch of the kind of code and overhead I have in mind is in the P.S. below.)

Rewriting recovered "faulted" blocks into currently free space is already part of ZFS; the new part is that it might have to track a "permanently bad" block list and a shrinking addressable space on each leaf vdev. There should also be a mechanism to retest and clear such blocks, e.g. when a faulty drive or LUN is replaced by a new one (perhaps by dd'ing the old drive onto the new one and swapping it in while the pool is offline) - probably a special scrub-like zpool command, also invoked during a normal scrub.

This could be combined with the wish for OEM disks that lack hardware ECC/spare sectors in exchange for more performance, although I'm not sure how well that would work in practice: the drive maker's in-depth knowledge of how to coax a read out of an initially "faulty" sector - by tweaking voltages or platter speed or whatever - may be invaluable and not replaceable in software.

What do you think? Doable? Useful? Why not, if not? ;)

Thanks,
//Jim Klimov
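
P.S. To make the "few percent" claim a bit more concrete, here is a minimal toy sketch (my own illustration, not anything from the ZFS source) of a Hamming-style single-bit-correcting code over a 512-byte buffer. The function names (ecc_compute, ecc_repair) and the per-sector granularity are just assumptions for the example; a real pluggable ECC would presumably work per ZFS block and sit alongside the existing block checksum, which would also catch the rare miscorrection. The point is the overhead: about 13-14 ECC bits per 512 bytes of data, i.e. well under 1%.

/*
 * Toy single-bit error correction over a 512-byte buffer.
 * ECC = XOR of the 1-based indexes of all set bits; if exactly one
 * bit later flips, the syndrome (stored ECC xor recomputed ECC) is
 * exactly that bit's index. Multi-bit errors can alias and are not
 * handled here - in the scheme above the block checksum would have
 * to catch those.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_BITS (512 * 8)

static uint32_t ecc_compute(const uint8_t *buf)
{
	uint32_t ecc = 0;

	for (uint32_t i = 0; i < SECTOR_BITS; i++) {
		if (buf[i / 8] & (1u << (i % 8)))
			ecc ^= i + 1;	/* 1-based so index 0 is not lost */
	}
	return (ecc);
}

/*
 * Returns 0 if the buffer matches the stored ECC, 1 if a single-bit
 * error was corrected in place, -1 if the syndrome is out of range
 * (uncorrectable here).
 */
static int ecc_repair(uint8_t *buf, uint32_t stored_ecc)
{
	uint32_t syndrome = ecc_compute(buf) ^ stored_ecc;

	if (syndrome == 0)
		return (0);
	if (syndrome > SECTOR_BITS)
		return (-1);

	uint32_t bit = syndrome - 1;
	buf[bit / 8] ^= 1u << (bit % 8);	/* flip the bad bit back */
	return (1);
}

int main(void)
{
	uint8_t sector[512];

	memset(sector, 0xA5, sizeof (sector));
	uint32_t ecc = ecc_compute(sector);	/* ~13 bits per 512 bytes */

	sector[100] ^= 0x10;			/* simulate one bit of rot */

	printf("repair result: %d\n", ecc_repair(sector, ecc));
	printf("byte restored: %s\n", sector[100] == 0xA5 ? "yes" : "no");
	return (0);
}

A real implementation would probably want something stronger than this (say, a Reed-Solomon-style code that can correct a small burst per block), but even that should stay in the low single-digit percent range - versus 25-50% for mirrors or raidz.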