On 11/03/2010 15:49, R.G. Keen wrote:
I think ZFS has no specific mechanisms in respect to
RAM integrity. It will just count on a healthy and
robust foundation for any component in the machine.
I'd really like to understand what the OS does with respect to ECC. Anyone who 
does understand the internal operation and can comment would be doing me a real 
favor by 'splaining this to me. 8-)

And yes, it's the OS, not zfs, that would do the memory operations.

- I don't think there is a software mechanism for detecting and/or correcting 
memory errors. I'll go read up on memtest, but I suspect it is just that - a 
memory testing routine that writes to memory, reads it back, and then tries to 
discover whether what it read back is what it sent. This is a good way to 
discover hard, stuck faults in a memory array, but cannot cope well with soft 
and intermittent errors.
- ECC is great for dealing with soft, intermittent errors, because it corrects 
single-bit errors on the fly, so an occasional flipped bit cannot cause "bit 
rot" by polluting memory which is then flushed back to disk (and then protected 
from rot on disk by zfs.)
- ECC can hide a rising soft error rate from a failing memory. This is good in 
that it holds off the day when things crash. The corrected-error data is also 
what you need to do preventive maintenance and replace the failing unit, but 
only if it's bubbled up so the user can see it. It's bad if ECC hides errors 
from a memory testing routine, as has been noted in this thread.
- You need to turn off hardware/chipset ECC to get a real result from a 
software write/read back memory test. Otherwise all you get back is 'yep, 
everything's all right'.
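
To make the points in the list above concrete, here is a toy sketch in Python. It is only an illustration under stated assumptions: it uses a Hamming(7,4) code over 4-bit cells, whereas real ECC memory uses a wider code in the memory controller hardware, and the names ToyMemory, flip_bit and readback_test are made up for this example. It shows a write/read-back test catching a flipped bit when ECC is off, and reporting nothing when ECC is on, while the corrected-error counter goes up.

```python
# Toy model only. Real ECC works on wide words in hardware, not 4-bit cells.

def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list index 0 = position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def data_bits(code):
    """Extract the 4 data bits from a codeword without any checking."""
    return code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)

def hamming74_decode(code):
    """Return (nibble, corrected). Fixes any single flipped bit."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return data_bits(c), syndrome != 0

class ToyMemory:
    """Hypothetical memory array; 'ecc' switches correction on reads on or off."""
    def __init__(self, size, ecc):
        self.ecc = ecc
        self.cells = [hamming74_encode(0) for _ in range(size)]
        self.corrected_errors = 0     # what the hardware would report upward

    def write(self, addr, nibble):
        self.cells[addr] = hamming74_encode(nibble)

    def flip_bit(self, addr, bit):
        """Simulate a soft error in one stored bit."""
        self.cells[addr][bit] ^= 1

    def read(self, addr):
        if not self.ecc:
            return data_bits(self.cells[addr])
        nibble, corrected = hamming74_decode(self.cells[addr])
        self.corrected_errors += corrected
        return nibble

def readback_test(mem, pattern=0b1010, fault=None):
    """Write a pattern everywhere, optionally inject a fault, list failing addresses."""
    for addr in range(len(mem.cells)):
        mem.write(addr, pattern)
    if fault is not None:
        mem.flip_bit(*fault)
    return [a for a in range(len(mem.cells)) if mem.read(a) != pattern]

plain = ToyMemory(8, ecc=False)
protected = ToyMemory(8, ecc=True)
print(readback_test(plain, fault=(3, 4)))       # [3]
print(readback_test(protected, fault=(3, 4)))   # []
print(protected.corrected_errors)               # 1
```

With ECC on, the test says 'everything's all right', and the only trace of the fault is the counter, which is why that counter has to be reported somewhere the user can see it.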

I think I need to get into the OS forum to understand this better.


Solaris *can* detect ECC errors (correctable or not) and it will feed them into FMA. FMA will then take appropriate action: for example, if there are more than N correctable errors in a given memory page within a 24h window, FMA will migrate the data in that page somewhere else and mark the page dead. You will usually lose 4kB or 8kB of memory, but at least you are minimizing risk.
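
The threshold rule just described can be sketched roughly as below. This is not FMA's actual code or interface, only a minimal Python illustration of "more than N correctable errors per page within 24h". The class name PageRetirePolicy, its method, and the threshold value passed in are all assumptions made for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60   # the 24h window from the text

class PageRetirePolicy:
    """Hypothetical per-page counter of correctable errors in a sliding window."""
    def __init__(self, threshold):
        self.threshold = threshold         # the "N" above; the real value is not given here
        self.events = defaultdict(deque)   # page number -> timestamps of recent errors
        self.retired = set()

    def record_correctable_error(self, page, now):
        """Record one error at time 'now' (seconds). Return True if the page gets retired."""
        if page in self.retired:
            return False
        recent = self.events[page]
        recent.append(now)
        # Forget errors that have fallen out of the 24h window.
        while now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        if len(recent) > self.threshold:   # "more than N"
            # A real system would migrate the page contents elsewhere first.
            self.retired.add(page)
            del self.events[page]
            return True
        return False

policy = PageRetirePolicy(threshold=2)
for t in (0, 10, 20):
    retired = policy.record_correctable_error(page=7, now=t)
print(retired)   # True: third error within the window exceeds N=2
```

Errors that are spread out by more than 24h never accumulate, so an isolated soft error does not cost you a page.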

If it was an uncorrectable error, then it depends on what was referring to the page: if only a userland application, then it will get killed (and restarted by SMF or the cluster); if it was referred to by the kernel, then the entire OS will panic.
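
As a rough summary of that decision, here is a small Python sketch. The names are made up for illustration and this is not how the kernel expresses it. The last branch (a page nobody refers to) is not covered above and is purely my assumption.

```python
from enum import Enum

class Owner(Enum):
    USERLAND = "userland"
    KERNEL = "kernel"

def on_uncorrectable_error(owners):
    """owners: set of Owner values describing who refers to the faulty page."""
    if Owner.KERNEL in owners:
        return "panic"          # kernel data cannot be trusted any more
    if Owner.USERLAND in owners:
        return "kill-process"   # SMF or cluster software may then restart it
    return "no-action"          # assumption: case not described in the text

print(on_uncorrectable_error({Owner.USERLAND}))                # kill-process
print(on_uncorrectable_error({Owner.USERLAND, Owner.KERNEL}))  # panic
```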

For more information look at:

http://blogs.sun.com/mws/entry/fma_on_x64_and_at
http://milek.blogspot.com/2006/05/psh-smf-less-downtime.html


--
Robert Milkowski
http://milek.blogspot.com

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss