> On December 13, 2007 12:51:55 PM -0800 "can you guess?"
> <[EMAIL PROTECTED]> wrote:
>
> > ...
> >
> >> when the difference between an unrecoverable single bit error is not
> >> just 1 bit but the entire file, or corruption of an entire database
> >> row (etc), those small and infrequent errors are an "extremely big"
> >> deal.
> >
> > You are confusing unrecoverable disk errors (which are rare but orders
> > of magnitude more common) with otherwise *undetectable* errors (the
> > occurrence of which is at most once in petabytes by the studies I've
> > seen, rather than once in terabytes), despite my attempt to delineate
> > the difference clearly.
>
> No I'm not.  I know exactly what you are talking about.
Then you misspoke in your previous post by referring to "an unrecoverable
single bit error" rather than to "an undetected single-bit error", which I
interpreted as a misunderstanding.

> > Conventional approaches using scrubbing provide as complete protection
> > against unrecoverable disk errors as ZFS does: it's only the far rarer
> > otherwise *undetectable* errors that ZFS catches and they don't.
>
> yes.  far rarer and yet home users still see them.

I'd need to see evidence of that for current hardware.

> that the home user ever sees these extremely rare (undetectable) errors
> may have more to do with poor connection (cables, etc) to the disk,

Unlikely, since transfers over those connections have been protected by
32-bit CRCs since ATA busses went to 33 or 66 MB/sec (SATA has even
stronger protection), and SMART tracks the incidence of these errors
(which result in retries when detected), such that very high error rates
should be noticed before an error is likely to make it through the 2^-32
probability sieve (for that matter, you might well notice the performance
degradation due to the frequent retries).  I can certainly believe that
undetected transfer errors occurred in noticeable numbers in older
hardware, though: that's why the CRCs were introduced.

> and less to do with disk media errors.  enterprise users probably have
> better connectivity and see errors due to high i/o.

As I said, at most once in petabytes transferred.  It takes about 5 years
for a contemporary ATA/SATA disk to transfer 10 PB if it's streaming data
at top speed, 24/7; doing 8 KB random database accesses (the example that
you used) flat out, 24/7, it would take about 500 years (though most such
drives aren't spec'd for 24/7 operation, especially with such a
seek-intensive workload), and for a more realistic random-access database
workload it would take many millennia.  So it would take an extremely
large (on the order of 1,000 disks) and very active database before you'd
be likely to see one of these errors within the lifetime of the disks
involved.

> just thinking out loud.
>
> regardless, zfs on non-raid provides better protection than zfs on raid
> (well, depending on raid configuration), so just from the data integrity
> POV non-raid would generally be preferred.

That was the point I made in my original post here - but *if* the hardware
RAID is scrubbing its disks, the difference in data-integrity protection
is unlikely to be of any real significance, and one might reasonably elect
to use the hardware RAID if it offered any noticeable performance
advantage (e.g., by providing NVRAM that could expedite synchronous
writes).

> the fact that the type of error being prevented is rare doesn't change
> that, and i was further arguing that even though it's rare the impact
> can be high, so you don't want to write it off.

All reliability involves trade-offs, and very seldom are "all other things
equal".  Extremely low-probability risks are often worth taking if it
costs *anything* to avoid them (but of course are never worth taking if it
costs *nothing* to avoid them).

- bill
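
P.S.  Since the 2^-32 figure keeps coming up, here's a minimal Python
sketch of that arithmetic.  It assumes (simplistically) that a corrupted
transfer escapes a 32-bit CRC with probability 2^-32, and the one
detected error per day is a purely hypothetical rate picked for
illustration, not a measurement:

    # How many *detected* (and retried) transfer errors would you expect
    # before one corrupted transfer slips past a 32-bit CRC?  Assumes a
    # corrupted frame is equally likely to produce any CRC value, i.e.
    # escapes detection with probability 2**-32.

    P_UNDETECTED = 2.0 ** -32   # chance a corrupted transfer passes the CRC

    expected_detected_per_miss = 1.0 / P_UNDETECTED
    print("~%.1e detected errors per undetected one"
          % expected_detected_per_miss)

    # If SMART showed, say, one CRC-detected transfer error per day
    # (already a suspiciously flaky cable), the expected wait for a
    # single undetected error would be:
    errors_per_day = 1.0        # hypothetical rate, for illustration only
    days = expected_detected_per_miss / errors_per_day
    print("~%.0f million years at 1 detected error/day"
          % (days / 365.25 / 1e6))

In other words, a link would have to be throwing visibly huge numbers of
detected errors long before an undetected one became at all likely.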
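P.P.S.  And the transfer-time figures above, recomputed under assumed
circa-2007 drive parameters (~64 MB/s sustained streaming, ~80 random
IOPS at 8 KB per access - assumptions for a typical ATA/SATA disk of the
day, not spec-sheet numbers):

    # Rough check of the "years to move 10 PB" figures.

    PB_10 = 10 * 10**15             # 10 petabytes, in bytes
    SECONDS_PER_YEAR = 365.25 * 86400

    stream_rate = 64 * 10**6        # bytes/sec, assumed streaming rate
    print("streaming:   %.0f years"
          % (PB_10 / stream_rate / SECONDS_PER_YEAR))
    # -> roughly 5 years, running 24/7

    random_rate = 80 * 8192         # bytes/sec: ~80 IOPS * 8 KB
    print("8 KB random: %.0f years"
          % (PB_10 / random_rate / SECONDS_PER_YEAR))
    # -> roughly 500 years, running 24/7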