>On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them, no system can run reliably
>> with bad memory, and that includes any system with ZFS. Doing nutty
>> things like calculating the checksum twice does not buy anything of
>> value here.
>
>All memory is "bad" if it doesn't have ECC. There are only varying
>degrees of badness. Calculating the checksum twice on its own would
>be nutty, as you say, but doing so on a separate copy of the data
>might prevent unrecoverable errors after writes to mirrored drives.
>You can't detect memory errors if you don't have ECC.

And where exactly do you get the second good copy of the data?

If you copy the data you've just doubled your chance of using bad memory.
The original copy can be good or bad; the second copy cannot be better
than the first.
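A rough back-of-the-envelope illustration (my numbers, assuming each pass
through memory is independently corrupted with some small probability p):
one copy is corrupted with probability p, while with two copies the chance
that at least one is bad is 1 - (1 - p)^2, roughly 2p for small p. And if
the two checksums disagree, that only tells you something is wrong, not
which copy is the good one.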

>But you can try to mitigate them. Not doing so makes ZFS less reliable than
>the memory it is running on. The problem is that ZFS makes any file
>with a bad checksum inaccessible, even if one really doesn't care
>if the data has been corrupted. A workaround might be a way to allow
>such files to be readable despite the bad checksum...

You can disable the checksums if you don't care.
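For example, something along these lines (just a sketch; "tank/data" is a
made-up dataset name):

  zfs set checksum=off tank/data   # stop checksumming newly written blocks
  zfs get checksum tank/data       # confirm the setting

As far as I know this only applies to blocks written after the change;
data already on disk keeps the checksums it was written with.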

>But it isn't. Applications aren't dying, compilers are not segfaulting
>(it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm
>is staying up for weeks at a time... And I wouldn't consider running a
>non-trivial database application on a machine without ECC.

One broken bit may not have caused serious damage; "most things work" doesn't tell you much.

>> Absolutely, memory diags are essential. And you certainly run them if
>> you see unexpected behaviour that has no other obvious cause.
>
>Runs for days, as noted.

Doesn't prove anything.

Casper
