On Mon, Sep 19, 2011 at 07:25:42AM -0400, Edward Ned Harvey wrote:
> Bear in mind, the magnetic surface of a disk platter doesn't do ECC either.
> But in response to this, they use FEC chips on the circuit board of the hard
> drive, and encode more bits onto the magnetic surface. Whenever a checksum
> error occurs, the disk controller will silently retry (indicates a soft
> error, a 1-rotation performance hit) but as long as there's no error on the
> 2nd or 3rd or 4th attempt, the hardware silently hides this condition from
> the OS. You might get SMART indicating failure predicted.

I still don't trust a single drive. Mirror them. Of course, if bits on a
hard drive flipped as easily as they do in RAM (and bits on a hard drive
do occasionally flip silently, just not nearly as often as in RAM), then
mirroring would not be enough; I'd need something like ZFS to checksum
and mirror.

> So what if your device only takes non-ECC ram? Does it have a FEC chip on
> the controller board? Does your OS/FS do any checksumming? Have any
> redundant copies with which to restore/recreate data after a checksum error
> occurs?

So you are suggesting that maybe the device applies, to non-ECC RAM, the
sort of error correction that hard drives do on their platters? I suppose
that is possible... but I find it fairly unlikely. This was not an
'Enterprise' product, and really, I don't know of any motherboards that
do that sort of error correction on non-ECC RAM, so yeah, I'd bet money
that it had no protection against RAM errors.

> There are so many levels of checksumming and error detection/correction. I
> agree zero isn't enough, but ... Is it really zero in the case mentioned?

Eh, I wouldn't trust it. Bit flips in RAM happen; go poke through the
EDAC logs on any sufficiently large population of servers and you will
see a fairly large number of detected/corrected errors. Some servers have
only a few, and maybe those are ignorable? (I am on the 'replace after
the first error' side here, but if there were only one or two, I can
respect those on the other side.) But most servers will have a lot of
corrected errors.

I mean, yeah, I suppose you could implement some sort of error correction
outside of the DIMM, but why would you? I think you'd have a difficult
time doing it both safely and more efficiently than commodity ECC RAM.

-- 
Luke S. Crawford
http://prgmr.com/xen/ - Hosting for the technically adept
http://nostarch.com/xen.htm - We don't assume you are stupid.
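
For anyone who wants to try the EDAC check mentioned above, here is a
minimal sketch, not a definitive tool. It assumes a Linux host with an
EDAC driver loaded, which exposes per-memory-controller counters named
ce_count (corrected errors) and ue_count (uncorrected errors) under
/sys/devices/system/edac/mc/. The function names read_count and
edac_counts are made up for illustration.

```python
import glob
import os


def read_count(path):
    """Return the integer stored in a sysfs counter file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Map each memory controller (mc0, mc1, ...) to (corrected, uncorrected)."""
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        ce = read_count(os.path.join(mc, "ce_count"))
        ue = read_count(os.path.join(mc, "ue_count"))
        # Skip directories that expose neither counter.
        if ce is not None or ue is not None:
            counts[os.path.basename(mc)] = (ce or 0, ue or 0)
    return counts


if __name__ == "__main__":
    counts = edac_counts()
    if not counts:
        # Assumption: an empty result means no EDAC driver or no ECC memory.
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it across a fleet and sort by the corrected count to see which boxes
are quietly racking up errors. The counters normally start over at reboot
or driver reload, so a low number on a freshly booted machine says little.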

_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/