> > Au contraire: I estimate its worth quite accurately from the
> > undetected error rates reported in the CERN "Data Integrity" paper
> > published last April (first hit if you Google 'cern "data
> > integrity"').
> >
> > > While I have yet to see any checksum error reported by ZFS on
> > > Symmetrix arrays or FC/SAS arrays, with some other "cheap" HW
> > > I've seen many of them
> >
> > While one can never properly diagnose anecdotal issues off the cuff
> > in a Web forum, given CERN's experience you should probably check
> > your configuration very thoroughly for things like marginal
> > connections: unless you're dealing with a far larger data set than
> > CERN was, you shouldn't have seen 'many' checksum errors.
>
> Well, single-bit error rates may be rare in normally operating hard
> drives, but from a systems perspective, data can be corrupted
> anywhere between disk and CPU.
The CERN study found that such errors (if they found any at all, which they couldn't really be sure of) were far less common than the manufacturer's spec for plain old detectable-but-unrecoverable bit errors, and far less common than the one hardware problem that they did discover: a disk firmware bug that appeared related to the unusual demands and perhaps negligent error reporting of their RAID controller, and that caused errors at a rate about an order of magnitude higher than the nominal spec for detectable-but-unrecoverable errors.

This suggests that in a ZFS-style installation without a hardware RAID controller they would have experienced at worst a bit error about every 10^14 bits, or 12 TB (the manufacturer's spec rate for detectable-but-unrecoverable errors) - though some studies suggest that the actual incidence of 'bit rot' is considerably lower than such specs. Furthermore, simply scrubbing the disks in the background (as I believe some open-source LVMs are starting to do, and for that matter some disks are starting to do themselves) would catch virtually all such errors in a manner that would allow a conventional RAID to correct them, leaving a residue of something more like one error per PB that ZFS could catch better than anyone else save WAFL.

> I know you're not interested in anecdotal evidence,

It's less that I'm not interested in it than that I don't find it very convincing when actual quantitative evidence is available that doesn't seem to support its importance. I know very well that things like lost and wild writes occur, as well as the kind of otherwise undetected bus errors that you describe, but the available evidence seems to suggest that they occur in such small numbers that catching them is of at most secondary importance compared to many other issues. All other things being equal, I'd certainly pick a file system that could do so, but when other things are *not* equal I don't think it would be a compelling attraction.
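For concreteness, here's the back-of-the-envelope arithmetic behind the "10^14 bits or 12 TB" figure above (the residual fraction is my own illustrative assumption, not a number from the CERN paper):

```python
# Sanity-check the error-rate figures discussed above.
# Assumption: a consumer-class unrecoverable-bit-error spec of 1 in 10^14 bits.
bits_per_error = 1e14

# Convert to bytes, then decimal terabytes (1 TB = 10^12 bytes).
tb_between_errors = bits_per_error / 8 / 1e12
print(f"~1 unrecoverable bit error per {tb_between_errors:.1f} TB read")

# If background scrubbing lets a conventional RAID repair ~99% of those,
# the residue that only end-to-end checksums would catch is roughly:
residual_fraction = 0.01  # illustrative assumption, not a measured value
pb_between_residual = tb_between_errors / residual_fraction / 1000
print(f"~1 residual error per {pb_between_residual:.2f} PB")
```

That lands at roughly 12.5 TB between spec-rate errors, and on the order of one residual error per PB under the stated (assumed) scrub-effectiveness figure - consistent with the rough numbers quoted above.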
> but I had a box that was randomly corrupting blocks during DMA. The
> errors showed up when doing a ZFS scrub and I caught the problem in
> time.

Yup - that's exactly the kind of error that ZFS and WAFL do a perhaps uniquely good job of catching. Of course, buggy hardware can cause errors that trash your data in RAM beyond any hope of detection by ZFS, but (again, other things being equal) I agree that the more ways you have to detect them, the better. That said, it would be interesting to know who made this buggy hardware.

...

> Like others have said for big business; as a consumer I can
> reasonably comfortably buy off-the-shelf cheap controllers and
> disks, and know that should any part of the system be flaky enough
> to cause data corruption the software layer will catch it, which
> both saves money and creates peace of mind.

CERN was using relatively cheap disks and found that they were more than adequate (at least for any normal consumer use) without that additional level of protection: the incidence of errors, even including the firmware errors which presumably would not have occurred in a normal consumer installation lacking hardware RAID, was on the order of 1 per TB. And given that it's really, really difficult for a consumer to come anywhere near that much data without most of it being video files (which just laugh and keep playing when they discover small errors), that's pretty much tantamount to saying that consumers would encounter no *noticeable* errors at all.

Your position is similar to that of an audiophile enthused about a measurable but marginal increase in music quality and trying to convince the hoi polloi that no other system will do: while other audiophiles may agree with you, most people just won't consider it important - and in fact won't even be able to distinguish it at all.
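Incidentally, the DMA-corruption case mentioned earlier in the thread illustrates nicely why end-to-end checksums catch faults that in-disk ECC can't. A toy sketch of the idea (my own illustration, not ZFS's actual on-disk format or code): the checksum is computed over the data the application handed down and stored apart from the block, so whatever mangles the block on the way back up - disk, controller, cable, or DMA - shows up as a mismatch at scrub or read time.

```python
import hashlib

def write_block(data: bytes) -> tuple[bytes, bytes]:
    """Store a block along with a checksum kept separately from it
    (ZFS keeps it in the parent block pointer; this is just the idea)."""
    return data, hashlib.sha256(data).digest()

def scrub(block: bytes, checksum: bytes) -> bool:
    """Re-read and verify; a mismatch means corruption happened
    somewhere in the path, whichever component caused it."""
    return hashlib.sha256(block).digest() == checksum

block, csum = write_block(b"application data" * 256)
assert scrub(block, csum)                 # clean round trip verifies

corrupted = bytearray(block)
corrupted[100] ^= 0x04                    # simulate one bit flipped in DMA
assert not scrub(bytes(corrupted), csum)  # the scrub flags it
```

The key design point is that the checksum travels a different path than the data it protects, so a single flaky component can't silently corrupt both in a consistent way.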
- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss