On 25-May-09, at 11:16 PM, Frank Middleton wrote:
> On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them, no system can run reliably
>> with bad memory, and that includes any system with ZFS. Doing nutty
>> things like calculating the checksum twice does not buy anything of
>> value here.
>
> All memory is "bad" if it doesn't have ECC. There are only varying
> degrees of badness. Calculating the checksum twice on its own would
> be nutty, as you say, but doing so on a separate copy of the data
> might prevent unrecoverable errors

I don't see this at all. The kernel reads the application buffer. How
does reading it twice buy you anything?? It sounds like you are
assuming 1) the buffer includes faulty RAM; and 2) the faulty RAM
reads differently each time. Doesn't that seem statistically unlikely
to you? And even if you really are chasing this improbable scenario,
why make ZFS do the job of a memory tester?
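
The point about reading the buffer twice can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: CRC32 stands in for the block checksum, and `flip_bit` is a hypothetical helper that models a bad RAM cell. It is not how ZFS computes or verifies checksums.

```python
import zlib


def checksum(buf: bytes) -> int:
    # CRC32 is a stand-in chosen for illustration, not the checksum ZFS uses.
    return zlib.crc32(buf)


def flip_bit(buf: bytes, bit_index: int) -> bytes:
    # Hypothetical helper: return a copy of buf with one bit inverted.
    out = bytearray(buf)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


intended = b"application data" * 4

# Case 1: a stuck bit. The buffer in RAM is already wrong, and it reads
# back identically every time.
in_ram = flip_bit(intended, 13)
same_buffer_detects = checksum(in_ram) != checksum(in_ram)

# Case 2: a transient fault. The second read of the same buffer returns
# different bits than the first read did.
second_read = flip_bit(in_ram, 13)
transient_detects = checksum(in_ram) != checksum(second_read)

print("stuck bit caught by checksumming twice:", same_buffer_detects)
print("transient fault caught by checksumming twice:", transient_detects)
```

Under these assumptions, a second checksum pass only helps in the second case, where the faulty RAM reads differently each time.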

> after writes to mirrored drives.
> You can't detect memory errors if you don't have ECC. But you can
> try to mitigate them. Not doing so makes ZFS less reliable than
> the memory it is running on. The problem is that ZFS makes any file
> with a bad checksum inaccessible, even if one really doesn't care
> if the data has been corrupted. A workaround might be a way to allow
> such files to be readable despite the bad checksum...

I am not sure what you are trying to say here.
...

>> How can a machine with bad memory "work fine with ext3"?
>
> It does. It works fine with ZFS too. Just really annoying unrecoverable
> files every now and then on mirrored drives. This shouldn't happen even
> with lousy memory and wouldn't (doesn't) with ECC. If there was a way
> to examine the files and their checksums, I would be surprised if they
> were different (If they were, it would almost certainly be the controller
> or the PCI bus itself causing the problem). But I speculate that it is
> predictable memory hits.

You're making this harder than it really is. Run a memory test. If it
fails, take the machine out of service until it's fixed. There's no
reasonable way to keep running faulty hardware.
--Toby
> -- Frank
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss