Haudy Kazemi wrote:
Adding additional data protection options is commendable. On the
other hand, I feel there are important gaps in the existing feature
set that are worthy of a higher priority, not the least of which is
the automatic recovery of uberblock / transaction group problems
(see Victor Latushkin's recovery technique, which I linked to in a
recent post),
This does not seem to be a widespread problem. We do see the
occasional complaint on this forum, but considering the substantial
number of ZFS implementations in existence today, the rate seems
to be quite low. In other words, the impact does not seem to be high.
Perhaps someone at Sun could comment on the call rate for such
conditions?
I counter this. The user impact is very high when the pool is
completely inaccessible due to a minor glitch in the ZFS metadata and
the user is told to restore from backups, particularly if they've been
considering snapshots to be their backups (I know they're not the same
thing). The incidence rate may be low, but the impact is still high,
and anecdotally there have been enough reports on the list to know it
is a real, non-zero event probability.
Impact in my context is statistical. If everyone were hitting this
problem, then recovery would have been automated long ago. Sun does
track such reports and will know their rate.
Think earth-asteroid collisions: they don't happen very often, but they
are catastrophic when they do. Graceful handling of low-incidence,
high-impact events plays a role in real-world robustness and is
important in wide-scale adoption of a filesystem. It is about software
robustness in the face of failure vs. brittleness. (In another area,
I and others found MythTV's dependence on MySQL to be a source of
system brittleness.) Google adopts robustness principles in its Google
File System (GFS) by not trusting the hardware at all and then keeping
a minimum of three copies of everything on three separate computers.
Right, so you also know that the reports of this problem are for
non-mirrored pools. I agree with Google: mirrors work.
Consider the user's or admin's dilemma of choosing between a filesystem
that offers all the great features of ZFS but can be broken (and is
documented to have broken) by a few miswritten bytes, or choosing a
filesystem with no great features but which is generally robust to a
wide variety of minor metadata corruption issues. Complex filesystems
need to take special measures to ensure that their complexity doesn't
compromise their reliability. ZFS's extra metadata copies provide this,
versus simply duplicating the file allocation table as is done in the
FAT16/32 filesystems (a basic design). The extra filesystem complexity
also makes users more dependent upon built-in recovery mechanisms and
makes manual recovery more challenging. (This is an unavoidable result
of a more complicated filesystem design.)
I agree 100%. But the question here is manual vs. automated, not
possible vs. impossible. Even the venerable UFS fsck defers to manual
repair if things are really messed up.
More below.
followed closely by a zpool shrink or zpool remove command that lets
you resize pools and disconnect devices without replacing them. I
saw postings or blog entries from about six months ago saying this code
was 'near', as part of solving a resilvering bug, but I have not seen
anything else since. I think many users would like to see improved
resilience in the existing features, and the addition of frequently and
long-requested features, before other new features are added.
(Exceptions can readily be made for new features that are trivially
easy to implement and/or are not competing for developer time with
higher-priority features.)
In the meantime, there is the copies option that you can use on
single disks. With immense drives, even losing half the capacity to
copies isn't as traumatic for many people as it was in days gone
by (e.g. consider a 500 GB hard drive with copies=2 versus a 128 GB
SSD). Of course, if you need all that space then it is a no-go.
Space, performance, dependability: you can pick any two.
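(For reference, copies is a per-dataset property rather than a
pool-wide flag; it is set with something along the lines of
"zfs set copies=2 <dataset>", and it only applies to blocks written
after it is set, so existing data is not retroactively duplicated.)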
Related threads that also had ideas on using spare CPU cycles for
brute-force recovery of single-bit errors using the checksum:
There is no evidence that the type of unrecoverable read errors we
see are single-bit errors. And while it is possible for an
error-handling code to correct single-bit flips, multi-bit flips would
remain a large problem space. There are error-correcting codes which
can correct multiple flips, but they quickly become expensive. This is
one reason why nobody does RAID-2.
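To make the single-bit case concrete, a brute-force repair using the
block checksum would look something like the sketch below (illustrative
Python only, not ZFS code; the checksum function and names are
placeholders for whatever checksum the block's parent pointer stores):

import hashlib

def checksum(block):
    # Stand-in for the stored block checksum; SHA-256 is used here
    # purely for illustration.
    return hashlib.sha256(block).digest()

def repair_single_bit(block, expected):
    # Flip each bit in turn; return the repaired block if exactly one
    # flip makes the checksum match, otherwise None (multi-bit damage).
    if checksum(block) == expected:
        return block                      # block is already good
    buf = bytearray(block)
    for i in range(len(buf)):
        for bit in range(8):
            buf[i] ^= 1 << bit            # flip one bit
            if checksum(bytes(buf)) == expected:
                return bytes(buf)         # found the single-bit error
            buf[i] ^= 1 << bit            # undo the flip, keep looking
    return None                           # not repairable this way

For a 128 KB block that is roughly a million candidate flips, each
needing a full checksum pass, which is why this is a "spare CPU cycles"
proposition, and why the multi-bit case quickly becomes intractable as
a brute-force search.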
Expensive in CPU cycles, engineering resources, hardware, or
dollars? If the argument is CPU cycles, then that is the same case
made against software RAID as a whole, and an argument increasingly
broken by modern high-performance CPUs. If the argument is
engineering resources, consider the complexity of ZFS itself. If the
argument is hardware (e.g. you need a lot of disks), why not run it at
the block level? Dollars will be a function of engineering resources,
hardware, and licenses.
Not all algorithms are created equal. A CPU can do XOR at
memory-bandwidth rates. Even the special case of BCH called
Reed-Solomon, which is used for raidz2, has a reputation for slowness.
Simple redundancy works pretty well. Space, speed, dependability: pick two.
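To illustrate the cheap end of that spectrum, single XOR parity (the
raidz1/RAID-5 style of redundancy) rebuilds any one missing block by
XORing the survivors; a toy Python sketch, not raidz code:

def xor_blocks(blocks):
    # XOR a list of equal-length byte blocks together.
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"block A.", b"block B.", b"block C."]
parity = xor_blocks(data)                 # stored alongside the data

# Lose any one data block and rebuild it from parity plus survivors:
rebuilt = xor_blocks([parity, data[0], data[2]])
assert rebuilt == data[1]

Reed-Solomon buys correction of more simultaneous failures, but the
encoding and rebuild involve Galois-field multiplies rather than plain
XOR, which is where the reputation for slowness comes from.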
There are many error-correcting codes available. RAID-2 used Hamming
codes, but that's just one of many options out there. Par2 uses
configurable-strength Reed-Solomon to get multi-bit error correction.
The par2 source is available, although from a ZFS perspective it is
hindered by the CDDL-GPL license incompatibility problem.
It is possible to write a FUSE filesystem using Reed-Solomon (like
par2) as the underlying protection. A quick search of the FUSE
website turns up the Reed-Solomon FS (a FUSE-based filesystem):
"Shielding your files with Reed-Solomon codes"
http://ttsiodras.googlepages.com/rsbep.html
While most FUSE work is on Linux (and there is a ZFS-FUSE project for
it), there has also been FUSE work done for OpenSolaris:
http://www.opensolaris.org/os/project/fuse/
BTW, if you do have the case where unprotected data is not
readable, then I have a little DTrace script that I'd like you to run
which would help determine the extent of the corruption. This is
one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
Is this intended as a general monitoring script, or only for use after
one has otherwise experienced corruption problems?
It is intended to try to answer the question of whether the errors we
see in real life might be single-bit errors. I do not believe they will
be single-bit errors, but we don't have the data.
To be pedantic, wouldn't protected data also be affected if all copies
are damaged at the same time, especially if also damaged in the same
place?
Yep, which is why there is RFE CR 6674679: complain if all data
copies are identical and corrupt.
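The idea behind that RFE, roughly (a sketch of the check, not the
actual implementation): if every copy of a block fails its checksum and
the copies are byte-for-byte identical, the damage almost certainly
happened once, before the copies were written (e.g. in memory), and
deserves a different complaint than independent media errors:

def classify_copies(copies, expected, checksum):
    # copies: byte blocks read from each ditto copy / mirror side
    bad = [c for c in copies if checksum(c) != expected]
    if not bad:
        return "ok"
    if len(bad) == len(copies) and len(set(copies)) == 1:
        return "all copies identical and corrupt"   # the CR 6674679 case
    return "some copies corrupt"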
-- richard