paul wrote:
> Bob wrote:
>> ... Given the many hardware safeguards against single (and several) bit
>> errors, the most common data error will be large. For example, the disk
>> drive may return data from the wrong sector.
>
> - actually, data integrity check bits, as may exist within memory systems
>   and/or communication channels, are rarely propagated beyond their
>   boundaries; thereby data is subject to corruption at every such interface
>   traversal, including, for example, during the simple process of being
>   read and re-written by the CPUs anywhere within the system that touches
>   data, including within the disk drive itself. (Unless a machine with
>   error-detecting/correcting memory is itself detecting uncorrectable
>   2-bit errors, which should kill the process being run, there's no real
>   reason to suspect that 3 or more bit errors are sneaking through with
>   any measurable frequency, although it is possible.)
>
> - personally, I believe that errors such as erroneous sectors being
>   written or read are themselves most likely due to single-bit errors
>   propagating into critical things like sector-address calculations, and
>   thereby ultimately expressing themselves as large, obvious errors,
>   although actually caused by more subtle ones. Short of extremely noisy
>   hardware and/or literal hard failure, most errors will most likely
>   always be expressed as 1 bit out of some very large N number of bits.
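As an aside, the detection paul describes is easy to see with a toy sketch. This is a simplified two-accumulator Fletcher-style sum over 64-bit words (the real ZFS fletcher-2 uses four 64-bit accumulators and processes pairs of words, so treat this only as an illustration of why any single-bit flip changes the checksum):

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def fletcher2_sketch(data: bytes):
    """Simplified Fletcher-style checksum: two 64-bit running sums
    over little-endian 64-bit words, wrapping modulo 2^64."""
    a = b = 0
    for i in range(0, len(data), 8):
        word = int.from_bytes(data[i:i + 8].ljust(8, b"\x00"), "little")
        a = (a + word) & MASK64
        b = (b + a) & MASK64
    return a, b

# A 4 KiB block and a copy with a single bit flipped.
block = bytes(range(256)) * 16
corrupt = bytearray(block)
corrupt[100] ^= 0x04  # flip one bit

# The checksums differ, so the single-bit error is detected.
assert fletcher2_sketch(block) != fletcher2_sketch(bytes(corrupt))
```

Since each word feeds both running sums, a flipped bit perturbs every subsequent value of `b`, which is what makes such sums sensitive to isolated bit errors.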
Today, we can detect a large number of these using the current ZFS checksum
(by default, fletcher-2). But we don't record the scope of the corruption
once we correct the data. I filed RFE 6736986, bitwise failure data
collection for zfs. Once implemented, we would get a better idea of how
extensive corruption can be, even though the root cause cannot be determined
from ZFS -- that would be a job for a different FMA DE.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss