There have been a number of threads here on the reliability of ZFS in the face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC) hardware, but isn't it reasonable to expect it to run well on something less well engineered? I am a real ZFS fan, and I'd hate to see folks trash it because it appears to be unreliable.
In an attempt to bolster the proposition that there should at least be an option to buffer the data before checksumming and writing, we've been doing a lot of testing on presumed-flaky (cheap) hardware, with a peculiar result; see below.

On 04/21/09 12:16, casper....@sun.com wrote:
> And so what? You can't write two different checksums; I mean, we're mirroring the data so it MUST BE THE SAME. (A different checksum would be wrong: I don't think ZFS will allow different checksums for different sides of a mirror.)
Unless it does a read after write on each disk, how would it know that the checksums are the same? If the data is damaged before the checksum is calculated, then it is no worse than the ufs/ext3 case. If data + checksum is damaged whilst the (single) checksum is being calculated, or after, then the file is already lost before it is even written! (The sketch at the end of this reply makes the two windows concrete.) There is a significant probability that this could occur on a machine with no ECC.

Evidently memory concerns /are/ an issue - this thread http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests including a memory diagnostic with the distribution CD (Fedora already does so). But memory diagnostics just test memory, and disk diagnostics just test disks. ZFS keeps disks busy enough that it may load the power supply to the point where it heats up and memory glitches become more likely. That would also explain why errors don't really begin until ~15 minutes after the busy period starts.

You might argue that this problem could only affect systems doing a lot of disk i/o, and that such systems probably have ECC memory. But an OS install is the one time a consumer-grade computer does a *lot* of disk i/o for quite a long time, and it is hence vulnerable. Ironically, the OpenSolaris installer does not allow for ZFS mirroring at install time - the one time it might be really important! Now that sounds like a more useful RFE, especially since it would be relatively easy to implement. Anaconda does it...

A Solaris install writes almost 4*10^10 bits (about 5 GB). For the soft-error rates involved, see the Cypress article on ECC cited by Wikipedia: http://www.edn.com/article/CA454636.html. Statistically expected random memory glitches could actually explain the error rate that is occurring.
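To make the two windows concrete, here is a toy sketch in Python. The checksum is a trivial additive stand-in for ZFS's fletcher2, and the buffer contents and flipped bit are invented for illustration; any real block checksum behaves the same way for this argument:

    # Toy model of the two memory-glitch windows described above.

    def checksum(buf):
        return sum(buf) & 0xFFFFFFFF        # trivial stand-in for fletcher2

    def flip_bit(buf, byte, bit):
        damaged = bytearray(buf)            # simulate a single-bit RAM glitch
        damaged[byte] ^= 1 << bit
        return damaged

    block = bytearray(range(256))           # pretend file data sitting in RAM

    # Window 1: the flip happens BEFORE the checksum is computed.
    # ZFS checksums already-damaged data, so the checksum verifies on
    # every later read and the corruption is silent -- no worse (and no
    # better) than the ufs/ext3 case.
    damaged = flip_bit(block, 10, 3)
    stored_sum = checksum(damaged)          # a "valid" checksum of bad data

    # Window 2: the flip happens AFTER the checksum is computed but
    # before the block is written.  The same bad data goes to BOTH sides
    # of the mirror with a checksum that can never verify, so every read
    # fails forever: the file is lost before it is even written.
    stored_sum = checksum(block)
    written = flip_bit(block, 10, 3)
    print(checksum(written) == stored_sum)  # False: permanent checksum error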
> You are assuming that the error is the memory being modified after computing the checksums; I would say that that is unlikely; I think it's a bit more likely that the data gets corrupted when it's handled by the disk controller or the disk itself. (The data is continuously re-written by the DRAM controller.)
See below for an example where a checksum error occurs without the disk subsystem being involved at all. There seems to be no plausible explanation left but memory, short of an improbable bug in x86 ZFS itself.
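A direct way to separate the two hypotheses, with the disk subsystem completely out of the loop: keep re-checksumming a constant in-RAM buffer while something else (a recv, a scrub) keeps the disks and power supply busy. Any mismatch can then only come from memory or CPU, never from the i/o path. A minimal sketch, with an arbitrary buffer size:

    # Memory-glitch probe that does no disk i/o of its own: repeatedly
    # hash a constant buffer while the disks are kept busy by something
    # else.  A single mismatch implicates RAM/CPU, not the drivers.

    import hashlib
    import os

    BUF_MB = 256                            # arbitrary; spans a good chunk of RAM

    buf = os.urandom(BUF_MB * 1024 * 1024)
    reference = hashlib.sha1(buf).hexdigest()

    passes = 0
    while True:
        passes += 1
        if hashlib.sha1(buf).hexdigest() != reference:
            print("glitch detected on pass %d" % passes)
            break

If that ever trips on the suspect machine during a busy recv, the IDE drivers, bus, and disks are exonerated on the spot.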
> It would have been nice if we were able to recover the contents of the file; if you also know what was supposed to be there, you can diff and then we can find out what was wrong.
"file" on those files resulted in "bus error". Is there a way to actually read a file reported by ZFS as unrecoverable to do just that (and to separately retrieve the copy from each half of the mirror)? Maybe this should be a new thread, but I suspect the following proves that the problem must be memory, and that begs the question as to how memory glitches can cause fatal ZFS checksum errors. Here is the peculiar result (same machine) After several attempts, succeeded in doing a zfs send to a file on a NFS mounted ZFS file system on another machine (SPARC) followed by a ZFS recv of that file there. But every attempt to do a ZFS recv of the same snapshot (i.e., from NFS) on the local machine (X86) has failed with a checksum mismatch. Obviously, the file is good, since it was possible to do a zfs recv from it. You can't blame the IDE drivers (or the bus, or the disks) for this. Similarly, piping the snapshot though SSH fails, so you can't blame NFS either. Something is happening to cause checksum failures between after when the data is received by the PC and when ZFS computes its checksums. Surely this is either a highly repeatable memory glitch, or (most unlikely) a bug in X86 ZFS. ZFS recv to another SPARC over SSH to the same physical disk (accessed via a sata/pata adapter) was also successful. Does this prove that the data+checksum is being corrupted by memory glitches? Both NFS and SSH over TCP/IP provide reliable transport (via checksums), so the data is presumably received correctly. ZFS then calculates its own checksum and it fails. Oddly, it /always/ fails, but not at the same point, and far into the stream when both disks have been very busy for a while. It would be interesting to see if the checksumming still fails if the writes were somehow skipped or sent to /dev/null. If it still fails. it should be possible to pinpoint the failure. If not, then it would seem the the only recourse is to replace the machine or not use ZFS even though it is otherwise quite reliable (it has been running an XDMCP session for 2 weeks now with no apparent glitches; even zpool status shows no errors at all after a couple of scrubs). It would be even more interesting to hear speculation as to why another machine can recv the datastream but not the one that originated it. If a memory that can pass diagnostics for 24 hours at a stretch can cause glitches in huge datastreams, then IMO it behooves ZFS to defend itself against them. Buffering disk i/o on machines with no ECC seems like reasonably cheap insurance against a whole class of errors, and could make ZFS usable on PCs that, although they work fine with ext3, fail annoyingly with ZFS. Ironically this wouldn't fix the peculiar recv problem, which none-the-less seems to point to memory glitches as a source of errors. -- Frank _______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss