On Mon, Sep 19, 2011 at 07:25:42AM -0400, Edward Ned Harvey wrote:
> Bear in mind, the magnetic surface of a disk platter doesn't do ECC either.
> But in response to this, they use FEC chips on the circuit board of the hard
> drive, and encode more bits onto the magnetic surface.  Whenever a checksum
> error occurs, the disk controller will silently retry (indicates a soft
> error, a 1-rotation performance hit) but as long as there's no error on the
> 2nd or 3rd or 4th attempt, the hardware silently hides this condition from
> the OS.  You might get SMART indicating failure predicted.

I still don't trust a single drive.  Mirror them.

Of course, if bits on a hard drive flipped as easily as they do in RAM
(and bits on a hard drive do occasionally flip silently, just not nearly
as often as in RAM), then mirroring would not be enough; I'd need
something like ZFS to checksum and mirror.
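
To make the mirror-versus-checksum point concrete, here is a minimal
Python sketch.  It is a toy, not how ZFS is implemented: the names
(write_block, read_block, mirror_disagrees) are made up for illustration,
and CRC32 from zlib stands in for the stronger checksums a real
filesystem would use.  With a plain two-way mirror, a silent flip leaves
two copies that disagree and no way to tell which one is right; a stored
checksum breaks the tie.

```python
import zlib


def write_block(data: bytes) -> dict:
    """Toy 'checksummed mirror': keep two copies plus a CRC32 of the data."""
    return {
        "copies": [bytearray(data), bytearray(data)],
        "crc": zlib.crc32(data),
    }


def read_block(block: dict) -> bytes:
    """Return the first copy whose checksum matches, else raise."""
    for copy in block["copies"]:
        if zlib.crc32(bytes(copy)) == block["crc"]:
            return bytes(copy)
    raise IOError("every copy failed its checksum")


def mirror_disagrees(block: dict) -> bool:
    """All a checksum-less mirror can tell you: the copies differ."""
    first, second = block["copies"]
    return first != second


if __name__ == "__main__":
    block = write_block(b"important data")
    block["copies"][0][0] ^= 0x01   # simulate a silent single-bit flip
    print(mirror_disagrees(block))  # the mirror alone only knows they differ
    print(read_block(block))        # the checksum picks the good copy
```

If both copies are damaged, read_block raises instead of handing back
bad data, which is the other half of what the checksum buys you.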

> So what if your device only takes non-ECC ram?  Does it have a FEC chip on
> the controller board?  Does your OS/FS do any checksumming?  Have any
> redundant copies with which to restore/recreate data after a checksum error
> occurs?

So you are suggesting that maybe the device does, for its non-ECC RAM,
the sort of error correction that hard drives do for their platters?

I suppose that is possible... but I find it fairly unlikely.  This was
not an 'Enterprise' product, and really, I don't know of any motherboards
that do that sort of error correction on non-ECC RAM, so yeah, I'd bet
money that it had no protection against RAM errors.

> There are so many levels of checksumming and error detection/correction.  I
> agree zero isn't enough, but ...  Is it really zero in the case mentioned?

Eh, I wouldn't trust it.  Bit flips in RAM happen; go poke through
your EDAC logs on any sufficiently large population of servers and you will
see a fairly large number of detected/corrected errors.  Some servers only
have a few, and maybe those are ignorable?  (I am on the 'replace after the
first error' side here, but if there were only one or two, I can respect
those on the other side.)  But most servers will have a lot of corrected
errors.
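
If you want to do that poking on a Linux box, the EDAC driver exposes
per-memory-controller counters in sysfs.  A rough sketch, assuming the
usual layout under /sys/devices/system/edac/mc (mcN/ce_count for
corrected errors, mcN/ue_count for uncorrected ones); the function name
and the root parameter are mine, and a machine without an EDAC driver
loaded will simply return nothing.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict:
    """Map each memory controller (mc0, mc1, ...) to (corrected, uncorrected)."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it across the fleet and sort by the corrected count; where you draw
the replace line is the judgment call described above.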

I mean, yeah, I suppose you could implement some sort of error correction
outside of the DIMM?  But why would you?  I think you'd have a difficult
time doing it both safely and more efficiently than commodity ECC RAM.
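
For a feel of what doing it yourself would involve, here is a textbook
Hamming(7,4) code as a Python sketch.  This is only an illustration of
single-bit correction in general, not a claim about what any particular
device or DIMM does; the function names are mine.  It spends 3 check
bits on every 4 data bits and can fix one flipped bit per codeword.
ECC DIMMs typically get by with 8 check bits per 64 data bits, in
hardware, on every access.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    bits = list(codeword)
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1   # nonzero syndrome is the bad position
    return [bits[2], bits[4], bits[5], bits[6]]


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1                  # simulate a single bit flip
    print(decode(word))
```

Two flips in the same codeword get silently "corrected" to the wrong
data with this code, which is part of why doing it safely is harder
than it first looks.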



-- 
Luke S. Crawford
http://prgmr.com/xen/         -   Hosting for the technically adept
http://nostarch.com/xen.htm   -   We don't assume you are stupid.  
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
