There have been a number of threads here on the reliability of ZFS in the
face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
hardware, but isn't it reasonable to expect it to run well on something
less well engineered? I am a real ZFS fan, and I'd hate to see folks
trash it because it appears to be unreliable.

In an attempt to bolster the proposition that there should at least be
an option to buffer the data before checksumming and writing, we've
been doing a lot of testing on presumed flaky (cheap) hardware, with a
peculiar result - see below.

On 04/21/09 12:16, casper....@sun.com wrote:
> And so what?  You can't write two different checksums; I mean, we're
> mirroring the data so it MUST BE THE SAME.  (A different checksum
> would be wrong: I don't think ZFS will allow different checksums for
> different sides of a mirror)

Unless it does a read after write on each disk, how would it know that
the checksums are the same? If the data is damaged before the checksum
is calculated, then it is no worse than the ufs/ext3 case. But if data +
checksum is damaged whilst the (single) checksum is being calculated,
or after, then the file is already lost before it is even written!
There is a significant probability that this could occur on a machine
with no ECC. Evidently memory concerns /are/ an issue - this thread
http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
including a memory diagnostic with the distribution CD (Fedora already
does so).

Memory diagnostics just test memory, and disk diagnostics just test
disks. ZFS keeps disks pretty busy, so perhaps it loads the power
supply to the point where it heats up and memory glitches become more
likely. That might also explain why errors don't really begin until
~15 minutes after the busy period starts.

You might argue that this problem could only affect systems doing a
lot of disk i/o, and that such systems probably have ECC memory. But
an o/s install is the one time a consumer-grade computer does a *lot*
of disk i/o for quite a long time, and is hence vulnerable. Ironically,
the OpenSolaris installer does not allow for ZFS mirroring at install
time, the one time when it might really matter! Now that sounds like a
more useful RFE, especially since it would be relatively easy to
implement; Anaconda already does it. (Until then, there is a
post-install workaround, sketched below.)
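
For anyone in that position, a hedged sketch of the usual workaround
(the device names are illustrative, not from any particular machine):

    # attach a second disk to the root pool, turning it into a mirror
    zpool attach rpool c0d0s0 c1d0s0
    # on x86, install boot blocks on the new half so either disk can boot
    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1d0s0

Resilvering then copies the existing data onto the new half; zpool
status shows when it has finished. Of course, none of this protects
the install itself, which is the point above.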

A Solaris install writes almost 4*10^10 bits (roughly 5 GB). For
published soft-error rates, see the Cypress data cited in the
Wikipedia article on ECC, e.g. http://www.edn.com/article/CA454636.html.
At those rates, statistically expected random memory glitches could
plausibly explain the error rate that is occurring.
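
As a back-of-the-envelope illustration (the rate is an assumption
picked from the range such articles cite, not a measurement of this
machine): at 1,000 FIT per Mbit, 2 GB (about 1.6*10^4 Mbit) of
non-ECC DRAM accumulates

    1.6*10^4 Mbit * 1,000 failures per 10^9 Mbit-hours
      = 1.6*10^-2 failures/hour,

i.e. about one bit flip every ~60 hours of uptime, or a few-percent
chance during a multi-hour install. Multiply that by thousands of
installs and a visible error rate no longer seems far-fetched.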

> You are assuming that the error is the memory being modified after
> computing the checksums; I would say that that is unlikely; I think
> it's a bit more likely that the data gets corrupted when it's handled
> by the disk controller or the disk itself.  (The data is continuously
> re-written by the DRAM controller)

See below for an example where a checksum error occurs without the
disk subsystem being involved at all. Other than an improbable bug in
x86 ZFS itself, there seems to be no plausible explanation left but
memory.

> It would have been nice if we were able to recover the contents of
> the file; if you also know what was supposed to be there, you can
> diff and then we can find out what was wrong.

"file" on those files resulted in "bus error". Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?
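
The closest I can come up with is something along these lines (the
device names are illustrative, and presumably ZFS will still return
EIO for blocks it knows are bad, so this may only recover the
readable parts):

    # list the files with permanent errors by name
    zpool status -v tank
    # take one half of the mirror offline and read from the other
    zpool offline tank c1d0
    dd if=/tank/path/to/damaged of=/tmp/side-a bs=128k conv=noerror
    zpool online tank c1d0
    # repeat with the other half offline, then diff the two copies

dd's conv=noerror at least keeps it going past the bad spots.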

Maybe this should be a new thread, but I suspect the following proves
that the problem must be memory, which raises the question of how
memory glitches can cause fatal ZFS checksum errors.

Here is the peculiar result (same machine):

After several attempts, we succeeded in doing a zfs send to a file
on an NFS-mounted ZFS file system on another machine (SPARC),
followed by a zfs recv of that file there. But every attempt to
do a zfs recv of the same snapshot (i.e., from NFS) on the local
machine (x86) has failed with a checksum mismatch. Obviously,
the file is good, since it was possible to do a zfs recv from it.
You can't blame the IDE drivers (or the bus, or the disks) for
this. Similarly, piping the snapshot through SSH fails, so you
can't blame NFS either. Something is corrupting the data between
when it is received by the PC and when ZFS computes its checksums.
Surely this is either a highly repeatable memory glitch or (most
unlikely) a bug in x86 ZFS. A zfs recv on another SPARC over SSH
to the same physical disk (accessed via a SATA/PATA adapter) was
also successful.
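
For concreteness, the sequence was essentially this (pool, dataset,
and host names here are placeholders, not the real ones):

    # on the x86 box: dump the snapshot to a file on the SPARC box via NFS
    zfs send tank/fs@snap > /net/sparc/backup/snap.zfs

    # on the SPARC box: receiving from that same file works
    zfs recv pool/restore < /backup/snap.zfs

    # back on the x86 box: this fails with a checksum mismatch, every time
    zfs recv tank/restore < /net/sparc/backup/snap.zfs

    # bypassing NFS entirely doesn't help, so NFS is not the culprit
    ssh sparc cat /backup/snap.zfs | zfs recv tank/restore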

Does this prove that the data+checksum is being corrupted by
memory glitches? Both NFS and SSH run over TCP/IP, which provides
reliable transport (via checksums), so the data is presumably
received correctly. ZFS then verifies the checksums embedded in
the stream, and that verification fails. Oddly, it /always/ fails,
but never at the same point, and always far into the stream, after
both disks have been very busy for a while.
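
One cheap way to separate memory from ZFS here (a sketch using
digest(1), which ships with Solaris): hash the same stream file twice
and compare. If the machine can't produce the same digest for the
same file twice in a row, ZFS is off the hook.

    # read the snapshot file over NFS twice; the outputs should match
    digest -a md5 /net/sparc/backup/snap.zfs
    digest -a md5 /net/sparc/backup/snap.zfs

Differing outputs across runs would point at RAM, CPU, or chipset
rather than at ZFS or the disks.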

It would be interesting to see if the checksumming still failed
if the writes were somehow skipped or sent to /dev/null. If it
still failed, it should be possible to pinpoint the failure. If
not, then it would seem the only recourse is to replace the
machine or stop using ZFS, even though the machine is otherwise
quite reliable (it has been running an XDMCP session for 2 weeks
now with no apparent glitches, and even zpool status shows no
errors at all after a couple of scrubs). It would be even more
interesting to hear speculation as to why another machine can
recv the datastream but not the one that originated it.
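
One way to approximate "writes to /dev/null" without touching the ZFS
code (a sketch; it assumes the received data fits in RAM, which on a
machine of this class it may well not):

    # build a scratch pool on a ramdisk, so no disk controller is involved
    ramdiskadm -a scratch 512m
    zpool create scratch /dev/ramdisk/scratch
    zfs recv scratch/test < /net/sparc/backup/snap.zfs
    zpool destroy scratch
    ramdiskadm -d scratch

If the recv still reports a checksum mismatch with the disks
completely out of the picture, that narrows it to memory (or the CPU).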

If memory that can pass diagnostics for 24 hours at a
stretch can still glitch on huge datastreams, then IMO it
behooves ZFS to defend itself against that. Buffering disk
i/o on machines with no ECC seems like reasonably cheap
insurance against a whole class of errors, and could make
ZFS usable on PCs that, although they work fine with ext3,
fail annoyingly with ZFS. Ironically, this wouldn't fix the
peculiar recv problem, which nonetheless seems to point
to memory glitches as a source of errors.
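
Until something like that exists, the closest userland analogue I can
think of (again just a sketch with digest(1)) is to carry an
application-level checksum alongside important data and verify it end
to end, independently of ZFS:

    # record a digest before the data ever touches the pool
    digest -a sha1 important.tar > important.tar.sha1
    # ... later, after ZFS has written and read it back ...
    digest -a sha1 important.tar | cmp - important.tar.sha1

It doesn't prevent anything, but it gives an end-to-end check that
doesn't depend on ZFS's own checksums.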

-- Frank





