On 9-Nov-07, at 2:45 AM, can you guess? wrote:

>>> Au contraire: I estimate its worth quite accurately from the
>>> undetected error rates reported in the CERN "Data Integrity"
>>> paper published last April (first hit if you Google 'cern "data
>>> integrity"').
>>>
>>>> While I have yet to see any checksum error reported by ZFS on
>>>> Symmetrix arrays or FC/SAS arrays, with some other "cheap" HW
>>>> I've seen many of them.
>>>
>>> While one can never properly diagnose anecdotal issues off the
>>> cuff in a Web forum, given CERN's experience you should probably
>>> check your configuration very thoroughly for things like marginal
>>> connections: unless you're dealing with a far larger data set
>>> than CERN was, you shouldn't have seen 'many' checksum errors.
>>
>> Well, single-bit errors may be rare in hard drives under normal
>> operation, but from a systems perspective, data can be corrupted
>> anywhere between disk and CPU.
>
> The CERN study found that such errors (if they found any at all,
> which they couldn't really be sure of) were far less common than
> the manufacturer's spec for plain old detectable but unrecoverable
> bit errors, or than the one hardware problem that they discovered
> (a disk firmware bug that appeared related to the unusual demands
> and perhaps negligent error reporting of their RAID controller and
> caused errors at a rate about an order of magnitude higher than
> the nominal spec for detectable but unrecoverable errors).
>
> This suggests that in a ZFS-style installation without a hardware
> RAID controller they would have experienced at worst a bit error
> about every 10^14 bits, or 12 TB
And how about FAULTS? hw/firmware/cable/controller/RAM/...

> (the manufacturer's spec rate for detectable but unrecoverable
> errors) - though some studies suggest that the actual incidence of
> 'bit rot' is considerably lower than such specs. Furthermore,
> simply scrubbing the disk in the background (as I believe some
> open-source LVMs are starting to do, and for that matter some
> disks are starting to do themselves) would catch virtually all
> such errors in a manner that would allow a conventional RAID to
> correct them, leaving a residue of something more like one error
> per PB that ZFS could catch better than anyone else save WAFL.
>
>> I know you're not interested in anecdotal evidence,
>
> It's less that I'm not interested in it than that I don't find it
> very convincing when actual quantitative evidence is available
> that doesn't seem to support its importance. I know very well
> that things like lost and wild writes occur, as well as the kind
> of otherwise undetected bus errors that you describe, but the
> available evidence seems to suggest that they occur in such small
> numbers that catching them is of at most secondary importance
> compared to many other issues. All other things being equal, I'd
> certainly pick a file system that could do so, but when other
> things are *not* equal I don't think it would be a compelling
> attraction.
>
>> but I had a box that was randomly corrupting blocks during DMA.
>> The errors showed up when doing a ZFS scrub and I caught the
>> problem in time.
>
> Yup - that's exactly the kind of error that ZFS and WAFL do a
> perhaps uniquely good job of catching.

WAFL can't catch all of them: it's isolated from the CPU end of the path.

> Of course, buggy hardware can cause errors that trash your data
> in RAM beyond any hope of detection by ZFS, but (again, other
> things being equal) I agree that the more ways you have to detect
> them, the better. That said, it would be interesting to know who
> made this buggy hardware.
>
> ...
>
>> Like others have said for big business; as a consumer I can
>> reasonably comfortably buy off-the-shelf cheap controllers and
>> disks, and know that should any part of the system be flaky
>> enough to cause data corruption, the software layer will catch
>> it, which both saves money and creates peace of mind.
>
> CERN was using relatively cheap disks

Don't forget every other component in the chain.

> and found that they were more than adequate (at least for any
> normal consumer use) without that additional level of protection:
> the incidence of errors, even including the firmware errors which
> presumably would not have occurred in a normal consumer
> installation lacking hardware RAID, was on the order of 1 per TB -
> and given that it's really, really difficult for a consumer to
> come anywhere near that much data without most of it being video
> files (which just laugh and keep playing when they discover small
> errors), that's pretty much tantamount to saying that consumers
> would encounter no *noticeable* errors at all.
>
> Your position is similar to that of an audiophile enthused about a
> measurable but marginal increase in music quality and trying to
> convince the hoi polloi that no other system will do: while other
> audiophiles may agree with you, most people just won't consider it
> important - and in fact won't even be able to distinguish it at
> all.

Data integrity *is* important.
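For anyone who wants to sanity-check the numbers being traded above,
here's a rough back-of-the-envelope in Python. The 1e-14 spec and the
per-TB/per-PB rates are the ones quoted in this thread; the 2 TB "home
data set" is just my own assumption:

#!/usr/bin/env python
# Back-of-the-envelope only: the 1e-14 figure is the usual consumer-drive
# spec for unrecoverable read errors; the per-TB and per-PB rates are the
# ones quoted in this thread; the 2 TB "home data set" is an assumption.

SPEC_BER = 1e-14                      # unrecoverable bit errors per bit read

bits_per_error  = 1 / SPEC_BER        # ~1e14 bits between spec'd errors
bytes_per_error = bits_per_error / 8  # ~1.25e13 bytes, i.e. roughly 12 TB

print("spec'd interval between unrecoverable errors: %.0f bits = %.1f TB"
      % (bits_per_error, bytes_per_error / 1e12))

home_data = 2e12                      # assumed 2 TB of personal data, in bytes
for label, errors_per_byte in [("CERN observed (~1 per TB)", 1e-12),
                               ("scrubbed-RAID residue (~1 per PB)", 1e-15)]:
    print("%-35s -> %.4f expected events in %.0f TB"
          % (label, errors_per_byte * home_data, home_data / 1e12))

That reproduces the ~12 TB figure above, gives a couple of expected
events in 2 TB at CERN's observed rate, and a few thousandths of an
event at the scrubbed-RAID residue rate.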
--Toby

> - bill
>
> This message posted from opensolaris.org

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss