Frank brings up some interesting ideas, some of which might
need some additional thought...
Frank Middleton wrote:
On 05/23/09 10:21, Richard Elling wrote:
<preface>
This forum is littered with claims of "zfs checksums are broken" where
the root cause turned out to be faulty hardware or firmware in the data
path.
</preface>
I think that before you speculate on a redesign, we should get to
the root cause.
The hardware is clearly misbehaving. No argument. The question is - how
far out of reasonable behavior is it?
Hardware is much less expensive than software, even free software.
Your system has a negative ROI, kinda like trading credit default
swaps. The best thing you can do is junk it :-)
Redesign? I'm not sure I can conceive an architecture that would make
double buffering difficult to do. It is unclear how faulty hardware or
firmware could be responsible for such a low error rate (<1 in 4*10^10).
Just asking if an option for machines with no ECC and their inevitable
memory errors is a reasonable thing to suggest in an RFE.
It is a good RFE, but it isn't an RFE for the software folks.
The checksum occurs in the pipeline prior to write to disk.
So if the data is damaged prior to checksum, then ZFS will
never know. Nor will UFS. Neither will be able to detect
this. In Solaris, if the damage is greater than the ability
of the memory system and CPU to detect or correct, then
even Solaris won't know. If the memory system or CPU
detects a problem, then Solaris fault management will kick
in and do something, preempting ZFS.
Exactly. My whole point. And without ECC there's no way of knowing.
But if the data is damaged /after/ checksum but /before/ write, then
you have a real problem...
To put this in perspective, ECC is a broad category. When we
think of ECC for memory, it is usually Single Error (bit) Correction,
Double Error (bit) Detection (SECDED). A well-designed system
will also do Single Device Data Correction (aka Chipkill or Extended
ECC, since Chipkill is trademarked). What this means is that faults
of more than 2 bits per word may go undetected, unless all of the
faulty bits fall in the same chip, for the SDDC case.
Clearly, this wouldn't scale well to large data streams, which is why
file systems like ZFS use checksums like Fletcher or hash functions
like SHA-256.
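As an aside, the checksum algorithm for data is an ordinary per-dataset
property, so you can trade speed for strength if you like; the dataset
name below is just a placeholder:

   # show the current data checksum algorithm
   zfs get checksum tank/data

   # have new writes use SHA-256 (already-written blocks keep their
   # old checksum)
   zfs set checksum=sha256 tank/data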
ZFS keeps disks pretty busy, so perhaps it loads the power supply
to the point where it heats up and memory glitches are more likely.
In general, for like configurations, ZFS won't keep a disk any more
busy than other file systems. In fact, because ZFS groups transactions,
it may create less activity than other file systems, such as UFS.
That's a point in its favor, although not really relevant. If the disks
are really busy they will load the PSU more, and that could drag the
supply down, which in turn might make errors occur that otherwise
wouldn't.
The dynamic loads of modern disk drives are not very great. I don't
believe your argument is very strong, here. Also, the solution is,
once again, fix the hardware.
I think a better test would be to md5 the file from all systems
and see if the md5 hashes are the same. If they are, then yes,
the finger would point more in the direction of ZFS. The
send/recv protocol hasn't changed in quite some time, but it
is arguably not as robust as it could be.
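Something along these lines would do for the md5 comparison; the paths
are only placeholders for wherever the file lives locally and over NFS:

   # on the SPARC box, against the local ZFS copy
   md5sum /export/data/testfile

   # on the x86 box, against the same file over NFS
   md5sum /net/sparcbox/export/data/testfile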
Thanks! md5 hash is exactly the kind of test I was looking for.
md5sum on SPARC 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86 9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)
Good.
ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
for data (by default) and fletcher4 for metadata. The same fletcher
code is used. So if you believe fletcher4 is broken for send/recv,
how do you explain that it works for the metadata? Or does it?
There may be another failure mode at work here...
(see comment on scrubs at the end of this extended post)
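One way to narrow that down would be to capture the stream to a file and
hash it at each hop, so you can tell whether it gets damaged in flight or
during the recv itself; the dataset, host, and path names below are only
placeholders:

   # capture the send stream and hash it on the sending side
   zfs send tank/fs@snap > /var/tmp/fs.zsend
   md5sum /var/tmp/fs.zsend

   # copy it over, hash it again on the receiver, then feed it to recv
   scp /var/tmp/fs.zsend x86box:/var/tmp/
   ssh x86box 'md5sum /var/tmp/fs.zsend; zfs recv backup/fs < /var/tmp/fs.zsend'

If the hashes match on both ends but the recv still trips over a bad
checksum, the receiving machine's memory or CPU looks that much more
suspect.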
[Did you forget the scrubs comment?]
no, you responded that you had been seeing scrubs fix errors.
Never said it was broken. I assume the same code is used for both SPARC
and X86, and it works fine on SPARC. It would seem that this machine
gets memory errors so often (even though it passes the Linux memory
diagnostic) that it can never get to the end of a 4GB recv stream. Odd
that it can do the md5sum, but as mentioned, perhaps doing the i/o
puts more strain on the machine and stresses it to where more memory
faults occur. I can't quite picture a software bug that would cause
random failures on specific hardware and I am happy to give ZFS the
benefit of the doubt.
Yes, software can trigger memory failures. More below...
It would have been nice if we were able to recover the contents of the
file; if you also know what was supposed to be there, you can diff and
then we can find out what was wrong.
"file" on those files resulted in "bus error". Is there a way to
actually
read a file reported by ZFS as unrecoverable to do just that (and to
separately retrieve the copy from each half of the mirror)?
ZFS corrects automatically, when it can. But if the source data is
bad, then ZFS couldn't possibly detect it.
For files that ZFS can detect are corrupted and cannot automatically
correct, you can get the list from "zpool status -xv". The behaviour
as seen by applications is determined by the zpool failmode property.
Exactly. And "file" on such a file will repeatably segfault. So will
pkg fix (there is a bug reported for this). Fortunately rm doesn't
segfault or there would be no way to repair such files. Is there
a way to actually get copies of files with bad checksums so they may be
examined to see where the fault actually lies?
Yes, to some degree. See a few of the blogs in this collection
http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file
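The gist of the approach there, as a rough sketch, is to copy the file
with dd and tell it to keep going past the blocks that come back as EIO,
so you at least salvage the readable parts (file names are placeholders):

   # conv=noerror keeps reading past the bad blocks,
   # conv=sync pads what was skipped with zeros so offsets line up
   dd if=/tank/data/damaged-file of=/var/tmp/salvaged bs=128k conv=noerror,sync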
Quoting the ZFS admin guide: "The failmode property ... provides the
failmode property for determining the behavior of a catastrophic pool
failure due to a loss of device connectivity or the failure of all
devices in the pool. ". Has this changed since the ZFS admin guide
was last updated? If not, it doesn't seem relevant.
It is relevant in those cases where you want a process to continue
though the hardware has failed. Rather than panic, you can get
an EIO.
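For reference, failmode is an ordinary pool property, so it is easy to
check and change; the pool name is a placeholder:

   # show the current setting: wait (the default), continue, or panic
   zpool get failmode tank

   # hand applications EIO instead of blocking or panicking
   zpool set failmode=continue tank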
Uhmm, if it were a software bug, one would expect it to fail
at exactly the same place, no?
Exactly. Not a bug. If it were, it would have been fixed a long time
ago on such a critical path. How about an RFE along the lines of
"Improved support for machines without ecc memory"? How about one
to recover files with bad checksums (a bit like getting fragments
out oflost+found in the bad old days)?
argv! Why does this keep coming up? UFS fsck does not recover
data! It only recovers metadata, sometimes.
Yep, interesting question. But the fact that you say "even zpool status
shows no error at all after a couple of scrubs" makes me think
that you've had errors in the past?
You bet! 5 unrecoverable errors, and maybe 10 or so recoverable
ones. About once a month, zpool status shows an error (note this
machine is being used as an X-terminal, so it hardly does any i/o)
and a scrub gets rid of it.
heh, if the fault is in memory, then the scrub will be correcting
correct data :-)
Please check the image views with md5 digests and get back to us.
If you get a chance, run SunVTS to verify the memory and CPU,
too. If the CPU is b0rken, the fletcher4 checksum for the recv may
be tickling it.
If the CPU was broken, wouldn't it always fail at the same point in
the stream?
Not necessarily. All failure modes are mechanical. There are a class
of failure modes in semiconductors which are due to changes in the
speed of transistors as a function of temperature. Temperature increases
as a function of the frequency of input changes in a CMOS gate. So,
if your software causes a specific change in the temperature of a portion
of a device, then it could trip on a temperature-induced fault. These
tend to be rare because of the margins, but if the hardware is flaky,
it is already arguably beyond the margins.
These sorts of codes might be humorously classified as halt-and-catch-fire.
But they do exist, and there are some cool thermographs which show
how the heat is distributed for various workloads.
http://en.wikipedia.org/wiki/Halt_and_Catch_Fire
It definitely doesn't. Could you expand a little on what
it means to do md5sums on the image views? I'm not sure what an image
view is in this context. AFAIK SUNWvts is available only in SXCE, not
in OpenSolaris. Oddly, you can load SUNWvts via pkg, but evidently
not smcwebserver - please correct me if I am wrong. FWIW we are running
SXCE on SPARC (installed via jumpstart) and indiana on X86 (installed
via live CD and updated to snv111a via pkg).
<sidebar>
Microsoft got so tired of defending its software against memory
errors that it requires Windows Server platforms to use ECC. But
even Microsoft doesn't have the power to force the vendors to use
ECC for all PCs.
</sidebar>
Quite. My point exactly! My only issue is that I have experienced
what is IMO an unreasonably large number of unrecoverable errors on
mirrored drives. I was merely speculating on reasons for this and
possible solutions. Ironically, my applications are running beautifully,
and the users are quite happy with the performance and stability. ZFS
is wonderful because updates are so easy to roll back and painless
to install, snapshots are so useful, and all the other reasons that
make every other fs seem so antiquated...
There may be an opportunity here. Let's assume that your disks
were fine and the bad checksums were caused by transient
memory faults. In such cases, a re-read of the data would effectively
clear the transient fault. In a sense, this is where mirroring works
against us -- ZFS will attempt to repair. This brings up a host of
much more complex system issues, which makes me glad that
FMA exists ;-)
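If you want to test that theory the next time an error shows up, a rough
recipe (the pool name is a placeholder) is to clear the counters and
scrub again; if the checksum errors don't come back, a transient fault
upstream of the disks looks a lot more likely:

   # wipe the error counters, then re-read and verify everything in the pool
   zpool clear tank
   zpool scrub tank

   # once the scrub finishes, see whether any checksum errors reappeared
   zpool status -v tank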
-- richard
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss