On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote:
> > like they are not good enough though, because unless this broken
> > router that Robert and Darren saw was doing NAT, yeah, it should not
> > have touch the TCP/UDP checksum.

NAT was not involved.

> I believe we proved that the problem bit flips were such
> that the TCP checksum was the same, so the original checksum
> still appeared correct.

That's correct. The pattern we found in corrupted data was that there
would be two offsetting bit-flips: a 0->1 was followed 256 or 512 or
1024 bytes later by a 1->0, or vice versa. (It was always the same bit;
in the cases I analyzed, the corrupted files contained C source code and
the bit-flips were obvious.) Under the 16-bit one's-complement checksum
used by TCP, these two changes cancel each other out and the resulting
packet has the same checksum.

> > BTW which router was it, or you
> > can't say because you're in the US? :)
>
> I can't remember; it was aging at the time.

To use excruciatingly precise terminology, I believe the switch in
question is marketed as a combo L2 bridge/L3 router, but in this case it
may have been acting as a bridge rather than a router.

After we noticed the data corruption we looked at TCP counters on hosts
on that subnet and noticed a high rate of failed checksums, so clearly
the TCP checksum was catching *most* of the corrupted packets; we just
didn't look at the counters until after we saw data corruption.

- Bill
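
For anyone who wants to see the cancellation concretely, here is a
minimal sketch of it, not a reconstruction of the actual packets. It
assumes an RFC 1071 style checksum over the payload only (the TCP
pseudo-header and header are left out, since they would be identical in
both cases), and the payload, offsets and bit number are made up for
illustration. Because 256 bytes is a whole number of 16-bit words, both
flips hit the same bit position within their words, so one adds 2^k to
the sum and the other subtracts it.

```python
def internet_checksum(data) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payload: bit 5 (0x20) is clear in 'A' and set in 'a'.
original = bytearray(b"A" * 256 + b"a" * 256)

# Two offsetting flips of the same bit, 256 bytes apart.
corrupted = bytearray(original)
corrupted[10] ^= 0x20        # 0 -> 1 ('A' becomes 'a')
corrupted[10 + 256] ^= 0x20  # 1 -> 0 ('a' becomes 'A')

# A single flip on its own, for comparison.
single = bytearray(original)
single[10] ^= 0x20

print("data differs:      ", corrupted != original)
print("paired flips match:",
      internet_checksum(corrupted) == internet_checksum(original))
print("single flip match: ",
      internet_checksum(single) == internet_checksum(original))
```

The single flip changes the checksum, which fits with the high rate of
failed checksums seen on the hosts; only the paired flips slip through.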