> From: tech-boun...@lists.lopsa.org [mailto:tech-boun...@lists.lopsa.org]
> On Behalf Of 'Luke S. Crawford'
> 
> On Mon, Sep 19, 2011 at 07:25:42AM -0400, Edward Ned Harvey wrote:
> > Bear in mind, the magnetic surface of a disk platter doesn't do ECC either.
> > But in response to this, they use FEC chips on the circuit board of the hard
> > drive, and encode more bits onto the magnetic surface.  Whenever a checksum
> > error occurs, the disk controller will silently retry (indicates a soft
> > error, a 1-rotation performance hit) but as long as there's no error on the
> > 2nd or 3rd or 4th attempt, the hardware silently hides this condition from
> > the OS.  You might get SMART indicating failure predicted.
> 
> I still don't trust a single drive.   Mirror them.

I don't quite get where you're coming from here.  There are two separate
issues.  One is mirror-vs-not-mirror of the ZIL, which isn't mentioned
above.  The other is that somebody said the lack of ECC in the DRAM-based
SATA devices made them not an option, which is what I'm discussing above.

As for mirroring the ZIL:  Distrust of a single drive has some truth in it.
If you have a disk failure (including a data error) on your unmirrored ZIL
device, and it coincides with an ungraceful system crash, then the data on
that device would be lost.  The assumption, if you don't mirror your ZIL, is
that the probability of these multiple failures coinciding is small enough
to be comparable to the probability of multiple disk failures coinciding.


> So you are suggesting that maybe the device does the sort of error
> correction that hard drives do on their platters on non-ECC ram?

Just suggesting a possibility.  We know this is the case for HDDs and
SSDs.  Why not also DRAM-based drives?


> I suppose that is possible... but I find it fairly unlikely.   This was
> not an 'Enterprise' product, 

I mean all HDDs and SSDs.  Not just enterprise ones.  So this DRAM device
not being enterprise level...  Maybe significant, maybe not.


> I mean, yeah, I suppose you could implement some sort of error correction
> outside of the DIMM?  But why would you?  I think you'd have a difficult
> time doing it both safely and more efficiently than commodity ECC RAM.

Take it for granted:  HDDs and SSDs show that it's definitely possible, and
common, for error detection/correction to happen on a chip outside of the
storage media, but very close to it.  Now you raise an excellent question:
In the DRAM SATA device, which design would be more attractive to the
manufacturer?  Use ECC RAM, or use FEC outside of the RAM, as they do for
other types of devices (HDD/SSD)?

I can say this:  ECC RAM uses 9 bits instead of 8 (typically 72 bits stored
for every 64 bits of data).  The extra bits are not a simple parity bit
(because parity alone is only useful for detecting, not correcting errors).
But the payload is 8/9.  Also, the actual error detection and correction
happens off-chip, in the memory controller, not inside the DIMM.  That's why
your motherboard needs to have support for ECC RAM in order to use it, and
ECC RAM is slightly slower than non-ECC.  Also, the volume of sales for
non-ECC RAM is much higher, so non-ECC RAM is significantly cheaper (not
just a ratio of 8:9).
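
To make the parity-versus-correction distinction concrete, here is a minimal
Python sketch.  It uses the textbook Hamming(7,4) code purely as an
illustration; real ECC DIMMs use a wider code over 64-bit words, and the
function names below are made up for this example, not taken from any
product.

```python
def parity_bit(bits):
    """Even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2


def parity_ok(word):
    """True if a word (data bits plus its parity bit) has even parity."""
    return sum(word) % 2 == 0


def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as
    p1 p2 d1 p3 d2 d3 d4 (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).  error_position is 0 when no
    error was seen, else the 1-based position of the bit that was fixed."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3   # the syndrome names the bad position
    if pos:
        c[pos - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], pos


data = [1, 0, 1, 1]

# Parity alone: the flip is noticed, but nothing says which bit flipped.
word = data + [parity_bit(data)]
word[2] ^= 1
print("parity detects error:", not parity_ok(word))

# Hamming(7,4): the flip is located and repaired.
code = hamming74_encode(data)
code[4] ^= 1                     # corrupt position 5
fixed, pos = hamming74_decode(code)
print("hamming fixed position", pos, "data recovered:", fixed == data)
```

The price of being able to correct is visible in the overhead: 3 check bits
for 4 data bits here, versus 1 bit for parity.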

So take it for granted, the non-ECC ram is significantly cheaper, and even
if you're using ECC, then the error detection is going to happen outside the
DIMM anyway.

In the case of ECC for your system memory, you need to operate on 32 bits or
64 bits at a time, depending on your architecture.  But in the case of your
DRAM SATA device, the unit is a sector: either 512 bytes or 4K bytes (4096
or 32768 bits).  Basically a word 64 to 1024 times larger.  This allows you
to use a standard SATA FEC chip, which has a much better payload than 8/9.
Say, for example, the FEC is using LDPC, which operates at or near the
theoretical limit of the channel.  That means you're (a) operating at
optimal speed, (b) operating at minimal cost, (c) operating at maximum
reliability.
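
A rough way to see why a larger word gives a better payload is to count the
check bits a single-error-correcting, double-error-detecting (SECDED)
Hamming code needs at each word size.  This is only a sketch of the scaling:
a real sector-level FEC such as LDPC spends more bits so that it can correct
many errors per sector, and nothing here says what any particular DRAM SATA
device does.  The function name is made up for the example.

```python
def secded_check_bits(data_bits):
    """Check bits for an extended Hamming (SECDED) code over data_bits.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1;
    one more overall parity bit adds double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


# 64 bits is a memory word; 4096 and 32768 bits are 512-byte and 4K sectors.
for k in (64, 4096, 32768):
    r = secded_check_bits(k)
    print(f"{k:>6} data bits: {r:>2} check bits, payload {k / (k + r):.4f}")
```

For a 64-bit word this gives 8 check bits, which is the familiar 72-bit ECC
DIMM and the 8/9 payload.  At sector sizes the same kind of code needs only
14 or 17 check bits, so the overhead per data bit drops sharply.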

So yes, there is motivation to do the error detection outside of the RAM,
using FEC on non-ECC RAM in the DRAM SATA device.  I cannot say, of course,
whether or not they're doing any of this.  I can only say that yes, it's
reasonable, yes it's common in other products, and yes there is motivation
to do so.  Don't assume it isn't being done at all just because it's
non-ECC RAM.

_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/