On 05/23/09 10:21, Richard Elling wrote:
<preface>
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
</preface>
I think that before you speculate on a redesign, we should get to
the root cause.
The hardware is clearly misbehaving. No argument. The question is: how
far out of reasonable behavior is it?
Redesign? I'm not sure I can conceive of an architecture that would make
double buffering difficult to do. And it is unclear how faulty hardware
or firmware could be responsible for such a low error rate (< 1 in 4*10^10).
I'm just asking whether an option for machines with no ECC, and their
inevitable memory errors, is a reasonable thing to suggest in an RFE.
The checksum occurs in the pipeline prior to write to disk.
So if the data is damaged prior to checksum, then ZFS will
never know. Nor will UFS. Neither will be able to detect
this. In Solaris, if the damage is greater than the ability
of the memory system and CPU to detect or correct, then
even Solaris won't know. If the memory system or CPU
detects a problem, then Solaris fault management will kick
in and do something, preempting ZFS.
Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...
Memory diagnostics just test memory. Disk diagnostics just test disks.
This is not completely accurate. Disk diagnostics also test the
data path. Memory tests also test the CPU. The difference is the
amount of test coverage for the subsystem.
Quite. But the disk diagnostic doesn't really test memory beyond what
it uses to run itself. Likewise, it may not test the FPU, for example.
ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.
In general, for like configurations, ZFS won't keep a disk any more
busy than other file systems. In fact, because ZFS groups transactions,
it may create less activity than other file systems, such as UFS.
That's a point in its favor, although not really relevant. If the disks
are really busy they will load the PSU more, which could drag the supply
down and in turn trigger errors that otherwise wouldn't occur.
Ironically, the Open Solaris installer does not allow for ZFS
mirroring at install time, one time where it might be really important!
Now that sounds like a more useful RFE, especially since it would be
relatively easy to implement. Anaconda does it...
This is not an accurate statement. The OpenSolaris installer does
support mirrored boot disks via the Automated Installer method.
http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html
You can also install Solaris 10 to mirrored root pools via JumpStart.
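For reference, the JumpStart profile syntax for a mirrored ZFS root pool
is short, if memory serves (disk names are placeholders):

    pool rpool auto auto auto mirror c0t0d0s0 c0t1d0s0
    bootenv installbe bename zfsroot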
Talking about the live CD here. I prefer to install via JumpStart, but
AFAIK Open Solaris (indiana) isn't available as an installable DVD. But
most consumers are going to be installing from the live CD, and they
are the ones with the low-end hardware without ECC. There was recently
a suggestion on another thread about an RFE to add mirroring as an
install option.
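In the meantime, the usual workaround after a live CD install seems to
be to attach a second disk to the root pool by hand; device names here
are just examples:

    # mirror the root pool onto a second disk of at least the same size
    pfexec zpool attach rpool c7d0s0 c7d1s0
    # wait for the resilver to complete
    zpool status rpool
    # put boot blocks on the new half of the mirror (x86)
    pfexec installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c7d1s0

Easy enough, but not something the consumers in question will ever
discover on their own.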
I think a better test would be to md5 the file from all systems
and see if the md5 hashes are the same. If they are, then yes,
the finger would point more in the direction of ZFS. The
send/recv protocol hasn't changed in quite some time, but it
is arguably not as robust as it could be.
Thanks! An md5 hash is exactly the kind of test I was looking for.
md5sum on SPARC: 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86:   9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)
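Both sums came from plain md5sum; the path below stands in for the
actual 4GB image:

    sparc$ md5sum /tank/dump/image.bin
    9ec4f7da41741b469fcd7cb8c5040564  /tank/dump/image.bin
    x86$ md5sum /net/sparc/tank/dump/image.bin
    9ec4f7da41741b469fcd7cb8c5040564  /net/sparc/tank/dump/image.bin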
ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
for data (by default) and fletcher4 for metadata. The same fletcher
code is used. So if you believe fletcher4 is broken for send/recv,
how do you explain that it works for the metadata? Or does it?
There may be another failure mode at work here...
(see comment on scrubs at the end of this extended post)
[Did you forget the scrubs comment?]
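For reference, the fletcher4 loop Richard is referring to is tiny.
Here is a from-memory sketch of what's in the OpenSolaris sources
(usr/src/common/zfs/zfs_fletcher.c), not the verbatim code:

    /*
     * From-memory sketch of the fletcher4 inner loop; the real code
     * lives in usr/src/common/zfs/zfs_fletcher.c and differs in detail.
     * Four cascaded 64-bit sums over the buffer viewed as 32-bit words.
     */
    #include <stdint.h>
    #include <stddef.h>

    void
    fletcher4(const void *buf, size_t size, uint64_t cksum[4])
    {
        const uint32_t *ip = buf;
        const uint32_t *end = ip + size / sizeof (uint32_t);
        uint64_t a = 0, b = 0, c = 0, d = 0;

        while (ip < end) {
            a += *ip++;  /* plain sum catches flipped bits            */
            b += a;      /* the cascaded, position-weighted sums also */
            c += b;      /* catch swapped or reordered words that a   */
            d += c;      /* simple sum would miss                     */
        }
        cksum[0] = a; cksum[1] = b; cksum[2] = c; cksum[3] = d;
    }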
Never said it was broken. I assume the same code is used for both SPARC
and X86, and it works fine on SPARC. It would seem that this machine
gets memory errors so often (even though it passes the Linux memory
diagnostic) that it can never get to the end of a 4GB recv stream. Odd
that it can do the md5sum, but as mentioned, perhaps doing the i/o
puts more strain on the machine and stresses it to where more memory
faults occur. I can't quite picture a software bug that would cause
random failures on specific hardware and I am happy to give ZFS the
benefit of the doubt.
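For the record, the operation that keeps dying partway through looks
like this, with host and dataset names changed:

    sparc$ zfs send tank/data@20090523 | \
               ssh x86box pfexec /usr/sbin/zfs recv -F rpool/data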
It would have been nice if we were able to recover the contents of the
file; if you also knew what was supposed to be there, you could diff
the two and we could find out what went wrong.
"file" on those files resulted in "bus error". Is there a way to actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?
ZFS corrects automatically, when it can. But if the source data is
bad, then ZFS can't possibly detect it.
For files that ZFS detects as corrupted but cannot automatically
correct, you can get the list from "zpool status -xv". The behaviour
as seen by applications is determined by the zpool failmode property.
Exactly. And "file" on such a file will repeatably segfault. So will
pkg fix (there is a bug reported for this). Fortunately rm doesn't
segfault or there would be no way to repair such files. Is there
a way to actually get copies of with bad checksums so they may be
examined to see where the fault actually lies?
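The bluntest approach I can think of would be to offline one half of
the mirror at a time and grab whatever dd can read, though I don't know
whether ZFS will hand back blocks that fail the checksum; pool, file,
and device names are placeholders:

    pfexec zpool offline tank c1t1d0
    dd if=/tank/home/frank/badfile of=/var/tmp/copy.A bs=128k conv=noerror,sync
    pfexec zpool online tank c1t1d0       # resilvers the offlined half
    pfexec zpool offline tank c1t0d0
    dd if=/tank/home/frank/badfile of=/var/tmp/copy.B bs=128k conv=noerror,sync
    pfexec zpool online tank c1t0d0
    cmp /var/tmp/copy.A /var/tmp/copy.B   # where do the halves differ?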
Quoting the ZFS admin guide: "The failmode property ... provides the
failmode property for determining the behavior of a catastrophic pool
failure due to a loss of device connectivity or the failure of all
devices in the pool." Has this changed since the ZFS admin guide
was last updated? If not, it doesn't seem relevant.
In any event, if file core dumps consistently in the same part of the
code, then please log a bug against file -- it should not core dump,
no matter what input it receives.
Ironically all such files have long since been scrubbed away. I suppose
one could deliberately damage a file to reproduce this. It could also be
that a library required to /run/ file was the one that was damaged...
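If anyone wants to reproduce the segfault, the brute-force approach
would be something like the following; device names and offsets are
made up, and you would have to clobber the same region on *both* halves
to make the damage unrecoverable:

    pfexec zpool export tank
    # scribble on one half, well past the vdev labels at the front
    pfexec dd if=/dev/urandom of=/dev/rdsk/c1t0d0s0 \
        oseek=100000 bs=512 count=64 conv=notrunc
    pfexec zpool import tank
    pfexec zpool scrub tank     # damage on only one half just gets repaired
    zpool status -v tank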
Uhmm, if it were a software bug, one would expect it to fail
at exactly the same place, no?
Exactly. Not a bug. If it were, it would have been fixed a long time
ago on such a critical path. How about an RFE along the lines of
"Improved support for machines without ECC memory"? How about one
to recover files with bad checksums (a bit like getting fragments
out of lost+found in the bad old days)?
Yep, interesting question. But the fact that you say "even zpool status
shows no error at all after a couple of scrubs" makes me think
that you've had errors in the past?
You bet! 5 unrecoverable errors, and maybe 10 or so recoverable
ones. About once a month, zpool status shows an error (note this
machine is being used as an X-terminal, so it hardly does any i/o)
and a scrub gets rid of it.
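The monthly ritual, with "tank" standing in for the real pool name:

    pfexec zpool scrub tank
    zpool status -v tank        # wait for the scrub to finish
    pfexec zpool clear tank     # then reset the error counters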
I'm still a little confused. If ext3 can't detect data errors, what
verification have you used to back your claim that it is unaffected?
None at all. But in a read-mostly environment this isn't an issue.
Other, known, bugs (in Fedora) account for almost every crash, and
Solaris hasn't failed once since it was (finally) installed a few
weeks ago with the screensaver disabled :-).
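If I ever want a real check on the ext3 side, an md5 manifest would do
it; paths are examples:

    $ find /data -type f -print0 | xargs -0 md5sum > /var/tmp/manifest.md5
    ... months later ...
    $ md5sum -c /var/tmp/manifest.md5 | grep -v ': OK$'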
Please check the image views with md5 digests and get back to us.
If you get a chance, run SunVTS to verify the memory and CPU,
too. If the CPU is b0rken, the fletcher4 checksum for the recv may
be tickling it.
If the CPU were broken, wouldn't it always fail at the same point in
the stream? It definitely doesn't. Could you expand a little on what
it means to do md5sums on the image views? I'm not sure what an image
view is in this context. AFAIK SUNWvts is available only in SXCE, not
in Open Solaris. Oddly, you can load SUNWvts via pkg, but evidently
not smcwebserver - please correct me if I am wrong. FWIW we are running
SXCE on SPARC (installed via JumpStart) and indiana on X86 (installed
via live CD and updated to snv111a via pkg).
<sidebar>
Microsoft got so tired of defending its software against memory
errors, that it requires Windows Server platforms to use ECC. But
even Microsoft doesn't have the power to force the vendors to use
ECC for all PCs.
</sidebar>
Quite. My point exactly! My only issue is that I have experienced
what is IMO an unreasonably large number of unrecoverable errors on
mirrored drives. I was merely speculating on reasons for this and
possible solutions. Ironically, my applications are running beautifully,
and the users are quite happy with the performance and stability. ZFS
is wonderful because updates are so easy to roll back and painless
to install, snapshots are so useful, and all the other reasons that
make every other fs seem so antiquated...
In a sense, the proposal is merely to replicate in software what
ECC does in hardware. There may be much better solutions than double
buffering the data, and doing it at the level of ZFS is not a complete
solution. But doing nothing exposes ZFS users with mirrored drives to
unnecessary unrecoverable failures from statistically probable memory
glitches on machines without ECC.
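To make the idea concrete, something along these lines; a sketch only,
with made-up names, not a claim about where it would hook into the
write pipeline:

    /*
     * Double-buffering sketch: checksum the data early, keep a second
     * copy, and verify both against the checksum just before writing.
     * cksum64() is a toy stand-in for a real checksum such as fletcher4.
     */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static uint64_t
    cksum64(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint64_t a = 0, b = 0;

        while (len--) {
            a += *p++;          /* toy fletcher-style running sums */
            b += a;
        }
        return ((b << 32) | (a & 0xffffffff));
    }

    int
    verify_before_write(const void *data, size_t len)
    {
        uint64_t sum = cksum64(data, len);  /* taken at dirty time */
        void *copy = malloc(len);

        if (copy == NULL)
            return (-1);
        memcpy(copy, data, len);

        /* ... time passes; a bit-flip may hit either buffer ... */

        int ok = cksum64(data, len) == sum &&
            cksum64(copy, len) == sum;
        free(copy);
        return (ok ? 0 : -1);   /* -1: rewrite from the good copy */
    }

If the copies disagree just before the write, a glitch hit one of them
after the checksum was taken, and the data can be rewritten from
whichever copy still matches instead of silently committing bad data
to both halves of the mirror.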
Cheers -- Frank
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss