Adding additional data protection options is commendable. On the
other hand, I feel there are important gaps in the existing feature
set that are worthy of a higher priority, not the least of which is
the automatic recovery of uberblock / transaction group problems (see
Victor Latushkin's recovery technique, which I linked to in a recent
post),
This does not seem to be a widespread problem. We do see the
occasional complaint on this forum, but considering the substantial
number of ZFS implementations in existence today, the rate seems
to be quite low. In other words, the impact does not seem to be high.
Perhaps someone at Sun could comment on the call rate for such
conditions?
I counter this. The user impact is very high when the pool is
completely inaccessible due to a minor glitch in the ZFS metadata, and
the user is told to restore from backups, particularly if they've been
considering snapshots to be their backups (I know they're not the same
thing). The incidence rate may be low, but the impact is still high,
and anecdotally there have been enough reports on the list to know it
is a real, non-zero event probability. Think earth-asteroid collisions:
they don't happen very often, but they are catastrophic when they do.
Graceful handling of low-incidence, high-impact events plays a role in
real-world robustness and is important for wide-scale adoption of a
filesystem. It is about software robustness in the face of failure
vs. brittleness. (In another area, I and others found MythTV's
dependence on MySQL to be a source of system brittleness.) Google
adopts robustness principles in its Google File System (GFS) by not
trusting the hardware at all and then keeping a minimum of three
copies of everything on three separate computers.
Consider the user's or admin's dilemma of choosing between a filesystem
that offers all the great features of ZFS but can be broken (and is
documented to have broken) by a few miswritten bytes, or a filesystem
with no great features that is nonetheless generally robust to a wide
variety of minor metadata corruption issues. Complex filesystems need
to take special measures so that their complexity doesn't compromise
their efforts at ensuring reliability. ZFS's extra metadata copies
provide this, versus simply duplicating the file allocation table as is
done in the FAT16/32 filesystems (a basic design). The extra filesystem
complexity also makes users more dependent upon built-in recovery
mechanisms and makes manual recovery more challenging. (This is an
unavoidable result of a more complicated filesystem design.)
More below.
followed closely by a zpool shrink or zpool remove command that lets
you resize pools and disconnect devices without replacing them. I
saw postings or blog entries from about six months ago suggesting that
this code was 'near', as part of solving a resilvering bug, but I have
not seen anything else since. I think many users would like to see
improved resilience in the existing features, and the addition of
long-requested features, before other new features are added.
(Exceptions can readily be made for new features that are trivially
easy to implement and/or are not competing for developer time with
higher-priority features.)
In the meantime, there is the copies property that you can use on
single disks. With immense drives, even losing half the capacity to
copies isn't as traumatic for many people as it was in days gone by
(e.g. consider a 500 GB hard drive with copies=2 versus a 128 GB
SSD). Of course, if you need all that space then it is a no-go.
Space, performance, dependability: you can pick any two.
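For anyone who hasn't tried it, setting copies is just a dataset
property (a quick sketch; the pool and dataset names here are made up,
and note that copies only applies to data written after the property
is set):

  # add redundancy to an existing dataset (newly written data only)
  zfs set copies=2 tank/home

  # or create a dataset with the property from the start
  zfs create -o copies=2 tank/important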
Related threads that also had ideas on using spare CPU cycles for
brute-force recovery of single-bit errors using the checksum:
There is no evidence that the type of unrecoverable read errors we
see are single-bit errors. And while it is possible for an
error-handling code to correct single-bit flips, multi-bit flips would
remain a large problem space. There are error-correcting codes which
can correct multiple flips, but they quickly become expensive. This is
one reason why nobody does RAID-2.
Expensive in CPU cycles, engineering resources, hardware, or dollars?
If the argument is CPU cycles, then that is the same case made against
software RAID as a whole, and an argument increasingly undermined by
modern high-performance CPUs. If the argument is engineering resources,
consider the complexity of ZFS itself. If the argument is hardware
(e.g. you need a lot of disks), why not run it at the block level?
Dollars will be a function of engineering resources, hardware, and
licenses.
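To make the CPU-cycles point concrete, here is a rough sketch (my own
illustration, not anything that exists in ZFS) of what brute-force
recovery of a single flipped bit against a known block checksum could
look like; SHA-256 stands in for whatever checksum the pool uses:

  from typing import Optional
  import hashlib

  def brute_force_single_bit(block: bytes, expected: bytes) -> Optional[bytes]:
      """Flip each bit of 'block' in turn until the SHA-256 digest matches
      the expected checksum. Returns the repaired block, or None if no
      single-bit flip matches (i.e. the damage is more than one bit)."""
      if hashlib.sha256(block).digest() == expected:
          return block                      # nothing to repair
      buf = bytearray(block)
      for i in range(len(buf)):
          original = buf[i]
          for bit in range(8):
              buf[i] = original ^ (1 << bit)
              if hashlib.sha256(buf).digest() == expected:
                  return bytes(buf)
          buf[i] = original                 # restore and move on
      return None

For a 128 KB record that is roughly a million candidate flips, each
needing a full checksum pass, so it is certainly not free, but it is
exactly the kind of work spare CPU cycles can absorb, and as noted
above it only ever helps with single-bit damage.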
There are many error-correcting codes available. RAID-2 used Hamming
codes, but that's just one of many options out there. Par2 uses
configurable-strength Reed-Solomon to get multi-bit error correction.
The par2 source is available, although from a ZFS perspective it is
hindered by the CDDL-GPL license incompatibility problem.
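As a toy illustration of the cheap end of that spectrum (the
Hamming-style single-bit correction RAID-2 used, not par2's
Reed-Solomon, and not anything ZFS actually does), here is a
Hamming(7,4) encode/correct sketch:

  def hamming74_encode(d):
      """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword
      laid out as [p1, p2, d1, p3, d2, d3, d4]."""
      d1, d2, d3, d4 = d
      p1 = d1 ^ d2 ^ d4
      p2 = d1 ^ d3 ^ d4
      p3 = d2 ^ d3 ^ d4
      return [p1, p2, d1, p3, d2, d3, d4]

  def hamming74_correct(codeword):
      """Locate and repair at most one flipped bit, then return the 4
      data bits. Two or more flips silently mis-correct, which is the
      multi-bit problem space mentioned above."""
      c = list(codeword)
      s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # parity over positions 1,3,5,7
      s2 = c[1] ^ c[2] ^ c[5] ^ c[6]    # parity over positions 2,3,6,7
      s3 = c[3] ^ c[4] ^ c[5] ^ c[6]    # parity over positions 4,5,6,7
      bad = s1 + 2 * s2 + 4 * s3        # 1-based position of the bad bit; 0 = clean
      if bad:
          c[bad - 1] ^= 1
      return [c[2], c[4], c[5], c[6]]

  # flip one bit "on disk" and recover the original nibble
  word = hamming74_encode([1, 0, 1, 1])
  word[5] ^= 1
  assert hamming74_correct(word) == [1, 0, 1, 1]

Correcting more than one flip per codeword is where Reed-Solomon style
codes come in, and where the per-block cost starts to climb.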
It is possible to write a FUSE filesystem using Reed-Solomon (like par2)
as the underlying protection. A quick search of the FUSE website turns
up the Reed-Solomon FS (a FUSE-based filesystem):
"Shielding your files with Reed-Solomon codes"
http://ttsiodras.googlepages.com/rsbep.html
While most FUSE work is on Linux (and there is a ZFS-FUSE project for
it), there has also been FUSE work done for OpenSolaris:
http://www.opensolaris.org/os/project/fuse/
BTW, if you do have the case where unprotected data is not
readable, then I have a little DTrace script that I'd like you to run
which would help determine the extent of the corruption. This is
one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
Is this intended as a general monitoring script, or only to be run
after one has otherwise experienced corruption problems?
To be pedantic, wouldn't protected data also be affected if all copies
are damaged at the same time, especially if also damaged in the same
place?
-hk
The data we do have suggests that magnetic hard disk failures tend
to be spatially clustered. So there is still the problem of spatial
diversity, which is rather nicely handled by copies today.
-- richard