> On Dec 14, 2007 1:12 AM, can you guess? <[EMAIL PROTECTED]> wrote:
> > > yes.  far rarer and yet home users still see them.
> >
> > I'd need to see evidence of that for current hardware.
> What would constitute "evidence"?  Do anecdotal tales from home users
> qualify?  I have two disks (and one controller!) that generate several
> checksum errors per day each.

I assume that you're referring to ZFS checksum errors rather than to transfer 
errors that the CRC catches and that result in retries.

If so, then the next obvious question is, what is causing the ZFS checksum 
errors?  And (possibly of some help in answering that question) is the disk 
seeing CRC transfer errors (which show up in its SMART data)?

If the disk is not seeing CRC errors, then the likelihood that data is being 
'silently' corrupted as it crosses the wire is negligible (a corrupted transfer 
has roughly a 1 in 65,536 chance of passing a 16-bit CRC if you're using ATA 
disks, given your correction below, else 1 in 4.3 billion for the 32-bit CRC 
on SATA).  Controller or disk firmware bugs have been known to cause otherwise 
undetected errors (though I'm not familiar with any recent examples in normal 
desktop environments - e.g., the CERN study discussed earlier found a disk 
firmware bug that seemed to be activated only by the unusual demands placed on 
the disk by a RAID controller, and exacerbated by that controller's propensity 
simply to ignore disk time-outs).  So, for that matter, have buggy file 
systems.  Flaky RAM can also result in ZFS checksum errors (the CERN study 
found correlations there when it used its own checksum mechanisms).

> I've also seen intermittent checksum
> fails that go away once all the cables are wiggled.

Once again, a significant question is whether the checksum errors are 
accompanied by a lot of CRC transfer errors.  If not, that would strongly 
suggest that they're not coming from bad transfers (and while they could 
conceivably be the result of commands corrupted on the wire, so much more of 
the bandwidth goes to data than to commands that you'd really expect to see 
data CRC errors too if commands were getting mangled).  When you wiggle the 
cables, other things wiggle as well (I assume you've checked that your RAM is 
solidly seated).

On the other hand, if you're getting a whole bunch of CRC errors, then with 
only a 16-bit CRC it's entirely conceivable that a few are sneaking by 
unnoticed.
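
To put rough numbers on that, here is a back-of-the-envelope sketch in Python.  
It assumes (my simplification, not anything taken from the ATA or SATA specs) 
that a corrupted transfer looks random to the CRC, so it passes an n-bit check 
with probability 2^-n; real CRCs do better than that on short burst errors.  
The function names are purely illustrative.

```python
def undetected_fraction(crc_bits: int) -> float:
    """Chance that a randomly corrupted transfer still passes an n-bit CRC.

    Assumption: the corruption is effectively random with respect to the CRC.
    """
    return 2.0 ** -crc_bits


def expected_undetected(detected_errors: int, crc_bits: int) -> float:
    """Expected number of silent slips, given a count of detected CRC errors."""
    p = undetected_fraction(crc_bits)
    # detected = corrupted * (1 - p)  =>  undetected = detected * p / (1 - p)
    return detected_errors * p / (1.0 - p)


if __name__ == "__main__":
    for bits in (16, 32):
        print(f"{bits}-bit CRC: about 1 in {2 ** bits:,} corrupted transfers passes")
    # Hypothetical count of detected CRC errors, for illustration only.
    print(expected_undetected(65535, 16))
```

Under that assumption you'd need on the order of 65,000 detected CRC errors 
before expecting a single one to slip past a 16-bit CRC, and billions before 
one slipped past a 32-bit CRC.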

> 
> > Unlikely, since transfers over those connections have been protected
> > by 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has
> > even stronger protection)
> The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC)
> [1].

Yup - my error:  the CRC was indeed introduced in ATA-4 (33 MB/sec. version), 
but was only 16 bits wide back then.

> The serial ata protocol also specifies 32-bit CRCs beneath 8/10b
> coding (1.0a p. 159)[2].  That's not much stronger at all.

The extra strength comes more from its additional coverage (commands as well as 
data).

- bill
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
