2012-01-22 22:58, Richard Elling wrote:
On Jan 21, 2012, at 6:32 AM, Jim Klimov wrote:

... So "it currently seems to me, that":

1) My on-disk data could get corrupted for whatever reason
ZFS tries to protect it from, at least once probably
from a misdirected write (i.e. the head landed somewhere
other than where it was asked to write). It cannot be
ruled out that the checksums got broken in non-ECC RAM
before the block pointers for some of my data were
written, thus leading to mismatches. One way or another,
ZFS noted the discrepancy during scrubs and "normal" file
accesses. There is no (automatic) way to tell which part
is faulty - the checksum or the data.

Untrue. If a block pointer is corrupted, then on read it will be logged
and ignored. I'm not sure you have grasped the concept of checksums
in the parent object.

If a block pointer is corrupted on disk after the write -
then yes, it will not match the parent's checksum, and
there would be another 1 or 2 ditto copies with possibly
correct data. Is that a correct grasp of the concept? ;)

Now, the (non-zero-probability) scenario I meant was that
the checksum for the block was calculated and then got
corrupted in RAM/CPU before the ditto copies were fanned
out to the disks, and before the parent block checksums
were calculated.

In this case the on-disk data block is correct as compared
to other sources (if it is copies=2, it may even be
identical to its other copy), but it does not match the
BP's checksum, while the BP tree itself seems valid (all
tree checksums match).

I believe that in this case ZFS should flag a data checksum
mismatch, although in reality (with minuscule probability)
it is the bad checksum mismatching the good data. Anyway,
the situation would look the same if the data block itself
were corrupted in RAM before fanning out with copies>1,
and that is more probable given the size of the block
compared to the 256 bits of the checksum.
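
A quick back-of-envelope, assuming a uniform per-bit
corruption probability and a 128KB data block (which is
what the 0x30000 bytes of raw raidz2 data mentioned below
correspond to, with 4 data + 2 parity disks):

  P(data hit) / P(cksum hit) ~= (128 * 1024 * 8) / 256
                              = 1048576 / 256
                              = 4096

so the payload is roughly four thousand times more likely
to take the hit than its 256-bit checksum.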

Just *HOW* probable that is on an ECC or a non-ECC system,
with or without an overclocked, overheated CPU, in an
enthusiast's overpumped workstation or in an unsuspecting
consumer's dusty closet - that is a separate maths
question, with different answers for different models.
A rough guess: on par with the disk UBER errors which ZFS
by design considers serious enough to combat.

2) In the case where on-disk data did get corrupted, the
checksum in the block pointer was correct (matching the
original data), but the raidz2 redundancy did not aid
recovery.

I think your analysis is incomplete.

As I last wrote, I dumped the blocks with ZDB and compared
the bytes with the same block from a good copy. In
particular, that copy had the same SHA256 checksum as was
stored in my problematic pool's block-pointer entry for
the corrupt block.
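
(For anyone wanting to reproduce this kind of check, the
invocations are roughly as follows - the object number is
the file's inode number, and the required verbosity flags
may vary between builds:)

  # find the object number of the damaged file
  ls -i /pool/dataset/damaged-file
  # dump that object's block pointers, including the
  # per-block checksums and DVAs stored in them
  zdb -dddddd pool/dataset <object-number>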

These blocks differed in three runs of 4096 bytes starting
at "round" offsets at even intervals (4KB, 36KB, 68KB);
4KB is my disks' block size. It seems that some disk(s?)
overwrote existing data, or got scratched, or whatever
(no IO errors in dmesg, though).

I am not certain why raidz2 did not suffice to fix the
block, nor what garbage or data now sits on each of the
6 drives - I did not manage to get zdb to dump all
0x30000 bytes of the raw raidz2 data to try the
permutations myself.
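
(The interface I was fighting with is along these lines -
the vdev number, offset and size have to be taken from the
DVAs in the block pointer, and the values below are only
placeholders:)

  # raw dump of one DVA's allocated range; the trailing
  # ":r" asks for raw bytes on stdout instead of a
  # formatted hex dump
  zdb -R tank 0:412340000:30000:r > /tmp/dva0.raw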

Possibly, for whatever reason (such as a cabling error, or
some firmware error, given that the drives are all of the
same model), several drives received the same erroneous
write command at once and ultimately invalidated parts of
the same stripe.

Many of the files now in peril have existed on the pool
for some time, and scrubs completed successfully many times.

> Have you determined the root cause?

Unfortunately, I'm currently in another country, away from
my home-NAS server. So all physical maintenance, including
pushing the reset button, is done by friends living in the
apartment, and there is not much physical examination that
can be done this way.

At one point recently (during a scrub in January), one of
the disks got lost and was not seen by the motherboard
even after reboots, so I had my friends pull out and
replug the SATA cables. This helped, so connector noise
was possibly the root cause. It might also account for an
incorrect address on a certain write that slashed randomly
across the platter.

The PSU is oversized for the box's requirements, with
slack performance to degrade ;) The P4 CPU is not
overclocked. RAM is non-ECC, and that is not changeable
given the Intel CPU, chipset and motherboard. The HDDs
are on the motherboard's controller. The 6 HDDs in the
raidz2 pool are consumer-grade SATA Seagate
ST2000DL003-9VT166, firmware CC32.

Degrading cabling and/or connectors can indeed be one of
the two main suspects, the other being non-ECC RAM.
Or an aging CPU.

3) The file in question was created on a dataset with
deduplication enabled, so at the very least the dedup bit
was set on the corrupted block's pointer and a DDT entry
likely existed. Attempts to rewrite the block with the
original one (with "dedup=on") did in fact fail, probably
because the matching checksum was already in the DDT.

Works as designed.

If this is the case, the design does not account for the
fact that an error was found (ZFS found it first, not I) -
it still trusts the DDT entry although that entry now
points to garbage. The design should be fixed, then ;)



Rewrites of such blocks with "dedup=off" or "dedup=verify"
succeeded.

Failure/success was tested by "sync; md5sum FILE" some
time after the fix attempt. (When done just after the
fix, the test tends to report success even if the on-disk
data is bad, "thanks" to caching.)

No, I think you've missed the root cause. By default, data that does
not match its checksum is not used.

I suspected that reads subsequent to a write (with
dedup=on) came from the ARC cache. At least, it was my
expectation that, as on many other caching systems, the
buffers of recent writes are moved into the read cache
for the addresses in question.

Thus, if the system only incremented the DDT counter, it
has no information that the on-disk data mismatches the
checksum.

Anyhow, the fact is that in this test case, when I read
the "fixed" local files just after the rsync from a good
remote source, I got no IO errors and correct md5sums.
When I re-read them after a few minutes, the IO errors
were there again. With "dedup=off" and "dedup=verify" the
errors did not come back after a while.
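
(For completeness, the sequence was essentially the
following - the remote host and paths are placeholders,
and an explicit export/import, or simply waiting a while,
should ensure that the final read really comes from disk
rather than from the ARC:)

  zfs set dedup=off tank/media
  rsync -av goodhost:/backup/FILE /tank/media/FILE
  sync
  zpool export tank && zpool import tank  # drop cached copies
  md5sum /tank/media/FILE       # compare to the known-good sum
  zpool status -v tank          # any new CKSUM errors / files?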

I explained my interpretation of these facts based on my
current understanding of the system; hopefully you have a
better-informed explanation ;)

My last attempt was to set "dedup=on" and write the block
again and sync; the (remote) computer hung instantly :(

3*) The RFE stands: deduped blocks found to be invalid and
not recovered by redundancy should somehow be evicted from
the DDT (or marked as requiring verification-before-write)
so as not to pollute further writes, including repair
attempts.

Alternatively, "dedup=verify" takes care of the situation
and should be the recommended option.

I have lobbied for this, but so far people prefer performance to
dependability.

At the very least, in the docs outlining deduplication,
this possible situation could be mentioned as an incentive
to use "verify".

And if such errors are found (by a scrub or a read),
something should be done - like forcing "verify"? Or at
least suggesting it?
Logged in the illumos bugtracker [1]

[1] https://www.illumos.org/issues/1981

Regarding performance... it seems that some design
decisions were influenced by certain customers and their
setups. There's nothing inherently wrong with tuning the
system (by default) with reference to real-life
situations, until such cases start dictating the
one-and-only required policy for everybody else.

Take the vdev-prefetch code slated for removal [2] - not
every illumos user has hundreds of disks on memory-starved
head nodes. And those who don't might gain more from the
prefetch than they lose by dedicating a few MB of RAM to
that cache. So I think for that case it would have been
sufficient to zero the cache size by default and leave the
ability to use more if desired. Perhaps there is even room
for improving the vdev prefetching, also logged in the
bugtracker [2],[3] ;)

[2] https://www.illumos.org/issues/175
[3] https://www.illumos.org/issues/2017
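
(As far as I understand, the knob already exists as a
boot-time tunable; something along these lines in
/etc/system should re-enable a modest vdev cache on boxes
that want it - treat the exact name and value as a sketch
to be checked against the illumos source:)

  * /etc/system - re-enable ~10 MB of per-vdev read-ahead cache
  set zfs:zfs_vdev_cache_size=0xA00000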

3**) It was suggested to set "dedupditto" to small values,
like "2". My oi_148a refused to set values smaller than
100. Moreover, it seems reasonable to have two dedupditto
thresholds: for example, make a ditto copy when the DDT
reference counter exceeds some small value (2-5), and add
further ditto copies every "N" references for
frequently-referenced data (every 64-128).

Also logged in bugtracker [4] ;)

[4] https://www.illumos.org/issues/2016
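
(For reference, the current knob and its floor, as seen
on my build - the pool name is a placeholder:)

  zpool set dedupditto=2 tank    # rejected: minimum is 100 here
  zpool set dedupditto=100 tank  # accepted; an extra copy is kept
                                 # once a DDT entry's refcount
                                 # crosses this value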


4) I did not get to check whether "dedup=verify" triggers a
checksum mismatch alarm if the preexisting on-disk data
does not in fact match the checksum.

All checksum mismatches are handled the same way.


I think such an alarm should exist and should do as much
as a scrub, a read or any other means of error detection
and recovery would.

Checksum mismatches are logged, what was your root cause?

As written above, in at least one case it was probably
a random write by a disk over existing sectors, which
invalidated the block.

I have yet to test (to be certain) whether writing, with
dedup=verify, over a block which is invalid on disk and
marked as deduped would increase the CKSUM counter.
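
(The test I have in mind is roughly the following; the
paths are placeholders and the "good" file is a copy known
to match the stored checksum:)

  zpool status -v tank            # note the current CKSUM counters
  zfs set dedup=verify tank/media
  cp /tank/goodcopy/FILE /tank/media/FILE  # overwrite the bad block
  sync
  zpool status -v tank            # did the verify step bump CKSUM?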

Waiting for friends to reboot that computer now so that I
can run the test ;(

Still, according to "Works as designed" above, logging the
mismatch so far has no effect on not-using the old DDT entry
pointing to corrupt data.

Just in case, logged as https://www.illumos.org/issues/2015

5) It seems like a worthy RFE to include a pool-wide option
to "verify-after-write/commit" - to test that recent TXG
sync data has indeed made it to disk on (consumer-grade)
hardware, into the designated sector numbers. Perhaps the
test should be delayed several seconds after the sync
writes.

There are highly-reliable systems that do this in the fault-tolerant
market.

Do you want to sell your own systems (and/or OS) into this
niche, or leave it for others to fill? ;)

At least having the option for such checks (as far as the
disks would obey it) is better than not having it. Whether
to enable it is another matter.

I logged an RFE in the illumos bugtracker yesterday [5],
lest the idea be forgotten, at the very least...

[5] https://www.illumos.org/issues/2008
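
(Until something like that exists, the closest crude
approximation I can think of is simply scrubbing often -
it verifies everything rather than only the recent TXGs,
and it is heavy, but it does catch this class of silent
damage:)

  # root's crontab - weekly scrub as a poor man's write-verify
  0 3 * * 0 /usr/sbin/zpool scrub tank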

And, ummm, however much you dislike tunables, I think this
situation calls for an on-off switch and two codepaths,
because, like dedup=verify, this feature will by design
deal a heavy blow to performance. Still, one size cannot
fit all. Those who like purple likely won't wear white ;)

Some want performance; others just want their photo
archive to survive the decade, and, say, 10MBps would
suffice to copy their new set of pictures from the CF
card ;) If the verification code gets into the OS, it can
be turned on or off by a tunable (switch) depending on the
other components' (hardware) reliability and on the
performance requirements. Perhaps we should trust consumer
drives less than enterprise ones (which boast an order of
magnitude or two of difference in rated UBER, better
materials, more in-factory testing) and not request such
verification for the latter?.. That is not for us to
decide (one size for all), but for end-users or their
integrators.

Sad that nobody ever contradicted that (mis)understanding
of mine.

Perhaps some day you can become a ZFS guru, but the journey is long...

Well, I do look forward to that, and hope I can learn from
the likes of you. There are not many "ZFS under the hood"
textbooks out there, so I go on asking and asking ;)

//Jim Klimov
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
