<preface>
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
</preface>
I think that before we speculate on a redesign, we should get to
the root cause.
Frank Middleton wrote:
There have been a number of threads here on the reliability of ZFS in the
face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
hardware, but isn't it reasonable to expect it to run well on something
less well engineered? I am a real ZFS fan, and I'd hate to see folks
trash it because it appears to be unreliable.
It depends on what you consider to be flaky. If a CPU has a stuck bit
in the carry lookahead (can't add properly for some pattern of operands),
then it is flaky and will probably create bogus checksums, no?
In an attempt to bolster the proposition that there should at least be
an option to buffer the data before checksumming and writing, we've
been doing a lot of testing on presumed flaky (cheap) hardware, with a
peculiar result - see below.
On 04/21/09 12:16, casper....@sun.com wrote:
And so what? You can't write two different checksums; I mean, we're
mirroring the data so it MUST BE THE SAME. (A different checksum would
be wrong: I don't think ZFS will allow different checksums for different
sides of a mirror)
Unless it does a read after write on each disk, how would it know that
the checksums are the same? If the data is damaged before the checksum
is calculated then it is no worse than the ufs/ext3 case.
Even if you do a read after write, there is no guarantee that you will
read from the medium instead of a cache. There is some concern here,
in general, because some mobo RAID controllers and (I believe) some
disk drives have caches which are not protected. These are generally
not too much of a problem because the data is not resident for a
significant period of time and the probability of a bit flip caused by
radiation, for instance, is a function of time.
If data +
checksum is damaged whilst the (single) checksum is being calculated,
or after, then the file is already lost before it is even written!
The checksum occurs in the pipeline prior to write to disk.
So if the data is damaged prior to checksum, then ZFS will
never know. Nor will UFS. Neither will be able to detect
this. In Solaris, if the damage is greater than the ability
of the memory system and CPU to detect or correct, then
even Solaris won't know. If the memory system or CPU
detects a problem, then Solaris fault management will kick
in and do something, preempting ZFS.
There is a significant probability that this could occur on a machine
with no ECC. Evidently memory concerns /are/ an issue - this thread
http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
including a memory diagnostic with the distribution CD (Fedora already
does so).
SunVTS ships with SXCE and Solaris 2.2-10. SunVTS replaced
SunDiag which, IIRC, started shipping in SunOS 3. I believe SunVTS
is available via the OpenSolaris repository for those with support contracts.
VTS is an acronym for Verification Test Suite and includes many
tests, including memory tests. VTS is used to verify systems in the
factory prior to shipping to customers. Look for /usr/sunvts on your
system or search for the SUNWvts* packages and check out the docs
online.
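For example (a rough sketch; the exact package and launcher names vary
by release):

   # check whether the SunVTS packages are installed (SVR4 packaging)
   pkginfo | grep -i SUNWvts
   # the suite lives under /usr/sunvts; the launcher name differs between
   # releases, so check what is in bin and read the docs for your version
   ls /usr/sunvts/bin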
Memory diagnostics just test memory. Disk diagnostics just test disks.
This is not completely accurate. Disk diagnostics also test the
data path. Memory tests also test the CPU. The difference is the
amount of test coverage for the subsystem.
ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.
In general, for like configurations, ZFS won't keep a disk any more
busy than other file systems. In fact, because ZFS groups transactions,
it may create less activity than other file systems, such as UFS.
It might also explain why errors don't really begin until ~15 minutes
after the busy time starts.
You might argue that this problem could only affect systems doing a
lot of disk i/o and such systems probably have ecc memory. But doing
an o/s install is the one time where a consumer grade computer does
a *lot* of disk i/o for quite a long time and is hence vulnerable.
Ironically, the OpenSolaris installer does not allow for ZFS
mirroring at install time, the one time where it might be really important!
Now that sounds like a more useful RFE, especially since it would be
relatively easy to implement. Anaconda does it...
This is not an accurate statement. The OpenSolaris installer does
support mirrored boot disks via the Automated Installer method.
http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html
You can also install Solaris 10 to mirrored root pools via JumpStart.
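For reference, a minimal JumpStart profile fragment for a mirrored ZFS
root looks something like this (a sketch; the disk names are hypothetical
and the sizes are left for the installer to pick):

   install_type initial_install
   pool rpool auto auto auto mirror c0t0d0s0 c0t1d0s0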
A Solaris install writes almost 4*10^10 bits. On ECC and soft error
rates, see the Cypress figures quoted by Wikipedia:
http://www.edn.com/article/CA454636.html. Statistically expected random
memory glitches could plausibly explain the error rate that is occurring.
You are assuming that the error is the memory being modified after
computing the checksums; I would say that that is unlikely; I think it's
a bit more likely that the data gets corrupted when it's handled by the
disk controller or the disk itself. (The data is continuously re-written
by the DRAM controller)
See below for an example where a checksum error occurs without the
disk subsystem being involved. There seems to be no plausible
explanation other than memory, short of an improbable bug in X86 ZFS itself.
I think a better test would be to md5 the file from all systems
and see if the md5 hashes are the same. If they are, then yes,
the finger would point more in the direction of ZFS. The
send/recv protocol hasn't changed in quite some time, but it
is arguably not as robust as it could be.
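For instance (a rough sketch; the file and dataset names are made up):

   # hash the stream file on the SPARC NFS server and again on the x86 box
   digest -a md5 /export/dumps/home.zfs
   # or hash the stream as it is generated, without the stream file at all
   zfs send tank/home@snap | digest -a md5

If the hashes agree everywhere, the stream itself is fine and suspicion
shifts to whatever consumes it.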
ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
for data (by default) and fletcher4 for metadata. The same fletcher
code is used. So if you believe fletcher4 is broken for send/recv,
how do you explain that it works for the metadata? Or does it?
There may be another failure mode at work here...
(see comment on scrubs at the end of this extended post)
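As a quick sanity check (dataset name hypothetical), you can see which
checksum a dataset is using and switch newly written data to sha256,
taking fletcher out of the on-disk picture; note this does not change
the fletcher4 used for the send stream itself:

   # show the current data checksum algorithm (fletcher2 by default here)
   zfs get checksum tank/home
   # sha256 applies only to blocks written from now on
   zfs set checksum=sha256 tank/home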
It would have been nice if we were able to recover the contents of the
file; if you also know what was supposed to be there, you can diff and
then we can find out what was wrong.
"file" on those files resulted in "bus error". Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?
ZFS corrects automatically, when it can. But if the source data is
bad, then ZFS couldn't possibly detect it.
For files that ZFS can detect are corrupted and cannot automatically
correct, you can get the list from "zpool status -xv". The behaviour
as seen by applications is determined by the zpool failmode property.
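For example (pool name hypothetical):

   # show only pools with errors, plus the list of affected files
   zpool status -xv
   # failmode (wait, continue, or panic) controls behaviour on pool failure
   zpool get failmode tank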
In any event, if file core dumps consistently in the same part of the
code, then please log a bug against file -- it should not core dump,
no matter what input it receives.
Maybe this should be a new thread, but I suspect the following
proves that the problem must be memory, which raises the question
of how memory glitches can cause fatal ZFS checksum errors.
Here is the peculiar result (same machine):
After several attempts, we succeeded in doing a zfs send to a file
on an NFS-mounted ZFS file system on another machine (SPARC)
followed by a ZFS recv of that file there. But every attempt to
do a ZFS recv of the same snapshot (i.e., from NFS) on the local
machine (X86) has failed with a checksum mismatch. Obviously,
the file is good, since it was possible to do a zfs recv from it.
You can't blame the IDE drivers (or the bus, or the disks) for
this. Similarly, piping the snapshot though SSH fails, so you
can't blame NFS either. Something is happening to cause checksum
failures between when the data is received by the PC and when ZFS
computes its checksums. Surely this is either a highly
repeatable memory glitch, or (most unlikely) a bug in X86 ZFS.
ZFS recv to another SPARC over SSH to the same physical disk
(accessed via a sata/pata adapter) was also successful.
Does this prove that the data+checksum is being corrupted by
memory glitches? Both NFS and SSH over TCP/IP provide reliable
transport (via checksums), so the data is presumably received
correctly. ZFS then calculates its own checksum and it fails.
Oddly, it /always/ fails, but not at the same point, and far
into the stream when both disks have been very busy for a while.
Uhmm, if it were a software bug, one would expect it to fail
at exactly the same place, no?
It would be interesting to see if the checksumming still fails
if the writes were somehow skipped or sent to /dev/null. If it
still fails, it should be possible to pinpoint the failure. If
not, then it would seem that the only recourse is to replace
the machine or not use ZFS even though it is otherwise quite
reliable (it has been running an XDMCP session for 2 weeks
now with no apparent glitches; even zpool status shows no
errors at all after a couple of scrubs). It would be even
more interesting to hear speculation as to why another machine
can recv the datastream but not the one that originated it.
Yep, interesting question. But the fact that you say "even zpool status
shows no errors at all after a couple of scrubs" makes me think
that you've had errors in the past?
If memory that can pass diagnostics for 24 hours at a
stretch can cause glitches in huge datastreams, then IMO it
behooves ZFS to defend itself against them. Buffering disk
i/o on machines with no ECC seems like reasonably cheap
insurance against a whole class of errors, and could make
ZFS usable on PCs that, although they work fine with ext3,
fail annoyingly with ZFS. Ironically, this wouldn't fix the
peculiar recv problem, which nonetheless seems to point
to memory glitches as a source of errors.
I'm still a little confused. If ext3 can't detect data errors, what
verification have you used to back your claim that it is unaffected?
Please check the image views with md5 digests and get back to us.
If you get a chance, run SunVTS to verify the memory and CPU,
too. If the CPU is b0rken, the fletcher4 checksum for the recv may
be tickling it.
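One cheap way to take ZFS out of the loop entirely (a sketch; the path
is made up): hash the same stream file a few times on the suspect x86
box and see whether the answer is stable.

   # differing hashes from run to run would implicate memory or the data
   # path, with no ZFS writes involved at all
   digest -a md5 /net/sparc/export/dumps/home.zfs
   digest -a md5 /net/sparc/export/dumps/home.zfs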
<sidebar>
Microsoft got so tired of defending its software against memory
errors that it requires Windows Server platforms to use ECC. But
even Microsoft doesn't have the power to force the vendors to use
ECC for all PCs.
</sidebar>
-- richard