Re: [zfs-discuss] Quantifying ZFS reliability

Bill Sommerfeld Wed, 01 Oct 2008 15:07:03 -0700

On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote:
> > like they are not good enough though, because unless this broken
> > router that Robert and Darren saw was doing NAT, yeah, it should not
> > have touched the TCP/UDP checksum.


NAT was not involved.

> I believe we proved that the problem bit flips were such
> that the TCP checksum was the same, so the original checksum
> still appeared correct.

That's correct.   

The pattern we found in the corrupted data was a pair of offsetting
bit-flips: a 0->1 flip was followed 256, 512, or 1024 bytes later by a
1->0 flip, or vice versa.  (It was always the same bit; in the cases I
analyzed, the corrupted files contained C source code and the bit-flips
were obvious.)  Under the 16-bit one's-complement checksum used by TCP,
these two changes cancel each other out and the resulting packet has the
same checksum.
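
The cancellation can be checked with a small sketch.  The Python below is
illustrative only and is not taken from the original analysis:
internet_checksum is a plain RFC 1071-style 16-bit one's-complement sum,
offsetting_flips is a hypothetical helper, and the payload, offsets, and
bit number are made up.  It sums only the payload bytes; a real TCP
checksum also covers the TCP header and pseudo-header, which are unchanged
here and so would not affect the comparison.  Because 256, 512, and 1024
are all even, both flips land on the same bit of their 16-bit words.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        # fold carries back into the low 16 bits (end-around carry)
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def offsetting_flips(data: bytes, offset: int, distance: int, bit: int) -> bytes:
    """Hypothetical helper: flip the same bit at offset and offset+distance.

    The two bits must start out different, so that one flip is 0->1 and
    the other is 1->0.
    """
    out = bytearray(data)
    mask = 1 << bit
    if (out[offset] & mask) == (out[offset + distance] & mask):
        raise ValueError("bits must differ for the flips to offset each other")
    out[offset] ^= mask
    out[offset + distance] ^= mask
    return bytes(out)


# Made-up payload: all 'a' (0x61, bit 5 set) except one 'A' (0x41, bit 5 clear).
payload = bytearray(b"a" * 2048)
payload[100] = ord("A")
original = bytes(payload)

# 'A' -> 'a' at byte 100 (0->1) and 'a' -> 'A' at byte 356 (1->0), 256 bytes apart.
corrupted = offsetting_flips(original, offset=100, distance=256, bit=5)

# A lone flip with no offsetting partner, for comparison.
single = bytearray(original)
single[100] ^= 1 << 5

print(hex(internet_checksum(original)))
print(hex(internet_checksum(corrupted)))
print(hex(internet_checksum(bytes(single))))
```

The first two values printed are equal even though the data differs, while
the lone flip yields a different checksum.  That is consistent with the
checksum catching most, but not all, of the corrupted packets.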

> > BTW which router was it, or you
> > can't say because you're in the US? :)
> 
> I can't remember; it was aging at the time.

To use excruciatingly precise terminology, I believe the switch in
question is marketed as a combo L2 bridge/L3 router, but in this case it
may have been acting as a bridge rather than a router.

After we noticed the data corruption we looked at TCP counters on hosts
on that subnet and noticed a high rate of failed checksums, so clearly
the TCP checksum was catching *most* of the corrupted packets; we just
didn't look at the counters until after we saw data corruption.

                                        - Bill

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss