paul wrote:
> Bob wrote:
>> ... Given the many hardware safeguards against single (and several) bit
>> errors, the most common data error will be large. For example, the disk
>> drive may return data from the wrong sector.
>
> - actually, data integrity check bits, as may exist within memory systems
>   and/or communication channels, are rarely propagated beyond their
>   boundaries; thereby data is subject to corruption at every such interface
>   traversal, including, for example, during the simple process of being
>   read and re-written by the CPUs anywhere within the system that touches
>   data, including within the disk drive itself. (Unless a machine with
>   error-detecting/correcting memory is itself detecting uncorrectable
>   2-bit errors, which should kill the process being run, there's no real
>   reason to suspect that 3 or more bit errors are sneaking through with
>   any measurable frequency, although it is possible.)
>
> - personally, I believe that errors such as erroneous sectors being
>   written or read are themselves most likely due to single-bit errors
>   propagating into critical things like sector-address calculations, and
>   thereby ultimately expressing themselves as large, obvious errors,
>   although actually caused by more subtle ones. Short of extremely noisy
>   hardware and/or literal hard failure, most errors will most likely
>   always be expressed as 1 bit out of some very large N number of bits.
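As an aside, the detection paul describes is easy to see with a toy sketch. This is a simplified two-accumulator Fletcher-style sum over 64-bit words (the real ZFS fletcher-2 uses four 64-bit accumulators and processes pairs of words, so treat this only as an illustration of why any single-bit flip changes the checksum):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def fletcher2_sketch(data: bytes):
    """Simplified Fletcher-style checksum: two 64-bit running sums
    over little-endian 64-bit words, wrapping modulo 2^64."""
    a = b = 0
    for i in range(0, len(data), 8):
        word = int.from_bytes(data[i:i + 8].ljust(8, b"\x00"), "little")
        a = (a + word) & MASK64
        b = (b + a) & MASK64
    return a, b

# A 4 KiB block and a copy with a single bit flipped.
block = bytes(range(256)) * 16
corrupt = bytearray(block)
corrupt[100] ^= 0x04  # flip one bit

# The checksums differ, so the single-bit error is detected.
assert fletcher2_sketch(block) != fletcher2_sketch(bytes(corrupt))
```

Since each word feeds both running sums, a flipped bit perturbs every subsequent value of `b`, which is what makes such sums sensitive to isolated bit errors.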
Today, we can detect a large number of these using the current ZFS checksum
(by default, fletcher-2). But we don't record the scope of the corruption
once we correct the data. I filed RFE 6736986, bitwise failure data
collection for zfs. Once implemented, we would get a better idea of how
extensive corruption can be, even though the root cause cannot be determined
from ZFS -- that would be a job for a different FMA DE.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss