can you guess? wrote:
> Primarily its checksumming features, since other open source solutions
> support simple disk scrubbing (which given its ability to catch most
> deteriorating disk sectors before they become unreadable probably has a
> greater effect on reliability than checksums in any environment where the
> hardware hasn't been slapped together so sloppily that connections are flaky).

From what I've read on the subject, that premise seems flawed from the start. I don't believe that scrubbing will catch all the types of errors that checksumming will. There is a category of errors that is not caused by firmware or any other software: the hardware simply doesn't write or read the correct bit value this time around. Without a checksum there's no way for the firmware to know, and next time it may well write or read the correct bit value from the exact same spot on the disk, so scrubbing is not going to flag this sector as 'bad'.
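To make the distinction concrete, here's a toy sketch in Python (the ChecksummedStore class is purely my own illustration; it has nothing to do with ZFS's actual implementation, which as I understand it keeps Fletcher or SHA-256 checksums in the parent block pointers). The point is just that the checksum lives apart from the data it covers, so a read the drive considers successful can still be caught as wrong:

import hashlib

class ChecksummedStore:
    # Toy block store: the checksum lives apart from the data block,
    # much as ZFS keeps checksums in the parent block pointer rather
    # than next to the data itself.
    def __init__(self):
        self.blocks = {}     # block number -> bytes ("what the platter holds")
        self.checksums = {}  # block number -> hex digest, kept separately

    def write(self, blkno, data):
        self.checksums[blkno] = hashlib.sha256(data).hexdigest()
        self.blocks[blkno] = data

    def read(self, blkno):
        data = self.blocks[blkno]
        if hashlib.sha256(data).hexdigest() != self.checksums[blkno]:
            raise IOError("checksum mismatch on block %d" % blkno)
        return data

store = ChecksummedStore()
store.write(7, b"important payload")

# Simulate the drive silently returning one wrong bit on this read.
# The drive reports success, so a plain scrub (which just asks the
# drive to re-read every sector) would notice nothing.
damaged = bytearray(store.blocks[7])
damaged[0] ^= 0x01
store.blocks[7] = bytes(damaged)

try:
    store.read(7)
except IOError as err:
    print("end-to-end checksum caught it:", err)

With the check happening above the drive, against a checksum the drive never sees, 'the read succeeded' and 'the data is correct' stop being the same claim.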
Now you may claim that this type of error happens so infrequently that it's not worth it. You may think so since the number of bits you need to read or write before experiencing it is huge. However, hard disk sizes are still increasing exponentially, and the data we users are storing on them is too. I don't believe the drive makers are making corresponding improvements in their bit error rates. So while it may not be a huge benefit today, it's good that we have it today, because its value will only increase as drive sizes and data sizes grow.

> Aside from the problems that scrubbing handles (and you need scrubbing even
> if you have checksums, because scrubbing is what helps you *avoid* data loss
> rather than just discover it after it's too late to do anything about it),
> and aside from problems

Again I think you're wrong on the basis for your point. The checksumming in ZFS (if I understand it correctly) isn't used only for detecting the problem. If the ZFS pool has any redundancy at all, those same checksums can be used to repair the damaged data, thus *avoiding* the data loss (a rough sketch of that read-and-repair path is below). I agree that scrubbing is still a good idea, but as discussed above it won't catch (and avoid) all the types of errors that checksumming can catch *and repair*.

> deriving from sloppy assembly (which tend to become obvious fairly quickly,
> though it's certainly possible for some to be more subtle), checksums
> primarily catch things like bugs in storage firmware and otherwise undetected
> disk read errors (which occur orders of magnitude less frequently than
> uncorrectable read errors).

Sloppy assembly isn't the only place these errors can occur. They can occur between the head and the platter, even with the best drive and controller firmware.

> Robert Milkowski cited some sobering evidence that mid-range arrays may have
> non-negligible firmware problems that ZFS could often catch, but a) those are
> hardly 'consumer' products (to address that sub-thread, which I think is what
> applies in Stefano's case) and b) ZFS's claimed attraction for higher-end
> (corporate) use is its ability to *eliminate* the need for such products
> (hence its ability to catch their bugs would not apply - though I can
> understand why people who needed to use them anyway might like to have ZFS's
> integrity checks along for the ride, especially when using
> less-than-fully-mature firmware).

Every drive has firmware too. If ZFS can be used to detect and repair array firmware problems, then it can be used by consumers to detect and repair drive firmware problems too.

> And otherwise undetected disk errors occur with negligible frequency compared
> with software errors that can silently trash your data in ZFS cache or in
> application buffers (especially in PC environments: enterprise software at
> least tends to be more stable and more carefully controlled - not to mention
> their typical use of ECC RAM).

As I wrote above, the undetected disk error rate is not improving (AFAIK) as fast as the size of the disks and of the data we put on them, so the value of this protection is increasing all the time. Sure, something that could trash your data without checksumming can still trash it with checksumming. But making sure the data gets unmangled whenever it can be is still worth something, and the improvements you say are needed in those other components would be pointless (by your own argument) if something like ZFS didn't also exist.
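Coming back to that read-and-repair point, here is the shape of it as I understand it, again only a toy Python sketch (the read_with_self_heal function is my own invention, and real RAID-Z reconstruction is far more involved; assume a simple two-way mirror):

import hashlib

def read_with_self_heal(copies, expected_checksum):
    # Toy 'self-healing' read over mirror copies of one block.
    # 'copies' is a list of bytearrays standing in for the same block
    # on different disks.  Return the first copy whose checksum
    # matches, and rewrite any copy that didn't match with the good
    # data, which is roughly the idea behind ZFS repairing a bad
    # mirror copy during a read or a scrub.
    good = None
    bad = []
    for i, data in enumerate(copies):
        if hashlib.sha256(bytes(data)).hexdigest() == expected_checksum:
            good = bytes(data)
        else:
            bad.append(i)
    if good is None:
        raise IOError("no copy passed its checksum: genuine data loss")
    for i in bad:                  # repair the damaged copies in place
        copies[i][:] = good
    return good

block = b"payload that matters"
checksum = hashlib.sha256(block).hexdigest()
disk_a = bytearray(block)
disk_b = bytearray(block)
disk_b[3] ^= 0x80                  # silent corruption on one disk

data = read_with_self_heal([disk_a, disk_b], checksum)
assert data == block
assert bytes(disk_b) == block      # the bad copy was rewritten from the good one

That's why I don't think 'detection only' is a fair description: given any redundancy at all, the same checksum that raises the alarm also tells the pool which copy to trust.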
> So depending upon ZFS's checksums to protect your data in most PC
> environments is sort of like leaving on a vacation and locking and bolting
> the back door of your house while leaving the front door wide open: yes, a
> burglar is less likely to enter by the back door, but thinking that the extra
> bolt there made you much safer is likely foolish.
>
>> .. are you
>> just trying to say that without multiple copies of data in multiple
>> physical locations you're not really accomplishing a more complete
>> risk reduction
>
> What I'm saying is that if you *really* care about your data, then you need
> to be willing to make the effort to lock and bolt the front door as well as
> the back door and install an alarm system: if you do that, *then* ZFS's
> additional protection mechanisms may start to become significant (because
> you're eliminated the higher-probability risks and ZFS's extra protection
> then actually reduces the *remaining* risk by a significant percentage).

Agreed. Depending on only one copy of your important data is shortsighted. But using a tool like ZFS on at least the most active copy, if not all copies, will be an improvement if it even once saves you from having to go to your other copies. Also, it's interesting that you use the term 'alarm system'. That's exactly how I view the checksumming features of ZFS: an alarm that goes off if any of my bits have been lost to an invisible 'burglar'. I've also noticed how you happen to skip over the data replication features of ZFS. While they may not be everything you've hoped they would be, they are features that will have value to people who want to do exactly what you suggest: keep multiple copies of their data in multiple places.

> Conversely, if you don't care enough about your data to take those extra
> steps, then adding ZFS's incremental protection won't reduce your net risk by
> a significant percentage (because the other risks that still remain are so
> much larger).
>
> Was my point really that unclear before? It seems as if this must be at
> least the third or fourth time that I've explained it.

On the cost side of things, I think you also miss a point. The data checking *and repair* features of ZFS bring down the cost of storage beyond just the cost of the software itself. They also make it safe to use significantly lower-priced hardware (SATA drives instead of SAS or FC-AL drives, or expensive arrays) by making up for the slightly higher chance of problems that such hardware brings with it. This, in my opinion, fundamentally changes the cost/risk ratio by giving virtually the same or better error rates on the cheaper hardware (see the back-of-envelope sketch at the end of this message).

>> i'd love to see
>> the improvements on the many shortcomings you're pointing to and
>> passionate about written up, proposed, and freely implemented :)
>
> Then ask the ZFS developers to get on the stick: fixing the fragmentation
> problem discussed elsewhere should be easy, and RAID-Z is at least amenable
> to a redesign (though not without changing the on-disk metadata structures a
> bit - but while they're at it, they could include support for data redundancy
> in a manner analogous to ditto blocks so that they could get rid of the
> vestigial LVM-style management in that area).

I think he was suggesting that if it's so important to you, you should go ahead and submit the changes yourself. Though I know not all of us have the skills to do that; I'll admit I don't.
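And to put some numbers behind that cost/risk point, here's a quick back-of-envelope sketch. Every figure in it is invented purely for illustration (the block size, the yearly read volume, and especially the silent bit error rates), and it generously treats the two mirror copies as failing independently:

BITS_PER_TB = 8e12
BLOCK_BITS = 128 * 1024 * 8        # a 128 KiB block, for the sake of argument

def expected_lost_blocks(tb_read, silent_ber, copies):
    # Expected number of blocks per year that end up wrong with no
    # good copy left to repair from, treating copies as independent.
    blocks_read = tb_read * BITS_PER_TB / BLOCK_BITS
    p_bad_copy = silent_ber * BLOCK_BITS   # chance one copy is silently bad
    return blocks_read * p_bad_copy ** copies

TB_READ_PER_YEAR = 500             # hypothetical yearly read volume

# Hypothetical hardware: cheaper drives with a somewhat worse rate of
# undetected errors versus pricier drives with a better one.
print("pricey drives, no end-to-end checksums :",
      expected_lost_blocks(TB_READ_PER_YEAR, 1e-16, copies=1))
print("cheap drives,  no end-to-end checksums :",
      expected_lost_blocks(TB_READ_PER_YEAR, 1e-15, copies=1))
print("cheap drives,  checksums + 2-way mirror:",
      expected_lost_blocks(TB_READ_PER_YEAR, 1e-15, copies=2))

On those invented numbers the cheaper, mirrored, checksummed setup comes out orders of magnitude ahead of the pricier drives that lack end-to-end checks, which is all I mean by the cost/risk ratio changing.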
-Kyle