Re: [zfs-discuss] integrated failure recovery thoughts (single-bit

Richard Elling Thu, 14 Aug 2008 09:18:38 -0700

paul wrote:
> bob wrote:
>   
>> On Wed, 13 Aug 2008, paul wrote:
>>
>>     
>>>  Short of extremely noisy hardware and/or literal hard failure, most
>>>  errors will most likely be expressed as 1 bit out of some
>>>  very large number N of bits.
>>>       
>> This claim ignores the fact that most computers today are still based 
>> on synchronously clocked parallel bus hardware.  A common failure mode 
>> is clock skew, which causes many bits to be wrong at once.  This can 
>> even happen within the CPU.
>>     
>
> - In my experience, clock skew/drift problems first manifest themselves as
> single-bit errors, even on parallel interfaces. Although all paths are
> logically parallel, the actual physical performance of each of the
> individual transistors and traces composing the data path will be ever so
> slightly different. And although physical CAD layout tools attempt to
> balance clock trees, the actual arrival time of the clock at the latch
> elements of the physical data-path implementation will also be slightly
> different (often differing by as much as a few picoseconds). Therefore, as a
> circuit approaches its maximum frequency threshold (which depends on
> temperature, age, etc.), a very small number of single-bit errors will begin
> to be generated, due to setup/hold time violations on the bit with the
> least physical clock skew tolerance. As the clock frequency and/or
> temperature (etc.) increases, more and more bit paths will begin to fail,
> until the whole path fails. Since all systems have some bits within
> parallel paths that are more sensitive to one type of corruption or
> another, I tend to believe that single-bit failures will show up
> statistically earlier, and in greater number, than multi-bit failures, even
> though the hardware still seems operable.
>


I'm not convinced, but perhaps it is because of the scar near my left ankle.
Long, long ago, in the SunOS 3.2 days, we had a server with two (!) ethernet
interfaces which we used to serve two different subnets (as a router) in
addition to its normal services (NFS, mail, etc.).  If the server couldn't
service the ethernet interrupts fast enough, the ethernet interface would
zero-fill the packets.  This is a really bad idea because the symptom was
random zeros intermixed with legitimate data... but only sometimes. The
lesson here is that you are often dealing with firmware, or other high-level
decisions, on what happens to data as it flows through the system, and I
doubt very seriously that the firmware developers would just flip a single
bit somewhere rather than do something like ZFOD (zero-fill on demand).
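
To make the difference concrete, here is a minimal sketch comparing the two
failure shapes discussed in this thread. It is only an illustration, not a
model of the actual SunOS driver or of any real hardware: the helper names
(flipped_bits, flip_bit, zero_fill), the 64-byte payload, and the offsets
are all assumptions made up for the example. It counts how many bits differ
between a good buffer and a damaged copy.

```python
def flipped_bits(original: bytes, received: bytes) -> int:
    """Count the bit positions that differ between two equal-length buffers."""
    if len(original) != len(received):
        raise ValueError("buffers must be the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(original, received))


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Hypothetical single-bit error: invert exactly one bit of the buffer."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def zero_fill(data: bytes, start: int, length: int) -> bytes:
    """Hypothetical firmware-style error: overwrite a run of bytes with zeros."""
    buf = bytearray(data)
    end = min(start + length, len(buf))
    buf[start:end] = bytes(end - start)  # same-length run of zero bytes
    return bytes(buf)


# Illustrative payload: 64 bytes with values 1..64 (an assumption, not real data).
payload = bytes(range(1, 65))
single = flip_bit(payload, 100)      # one bit inverted somewhere in the buffer
zeroed = zero_fill(payload, 16, 16)  # 16 consecutive bytes replaced by zeros

print("single-bit error, bits changed:", flipped_bits(payload, single))
print("zero-fill error, bits changed: ", flipped_bits(payload, zeroed))
```

The single-bit case changes exactly one bit, while the zero-filled run
changes every bit that happened to be set in the overwritten bytes, so a
recovery scheme that only searches for one flipped bit would not repair it.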
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
