On 2012-12-02 05:42, Jim Klimov wrote:
> My plan is to dig out the needed sectors of the broken block from each of the 6 disks and try any and all reasonable recombinations of redundancy and data sectors to match the checksum - this should be my definitive answer on whether ZFS (of that oi151.1.3-based build) does all I think it can to save data or not. Either I put the last nail into my itching question's coffin, or I'd nail a bug to yell about ;)
Well, I've come to a number of conclusions, though I have not yet closed this matter for myself. One concerns the definition of "all reasonable recombinations": ZFS does not do *everything* possible to recover corrupt data, and in fact it can't; nobody can. When I took this to an extreme, assuming that the bytes at different offsets within a sector might fail on different disks that comprise a block, reconstructing and testing a single failed sector byte-by-byte becomes computationally infeasible: for 4 data disks and 1 parity I got about 4^4096 combinations to test. The next Big Bang will happen sooner than I'd get a "yes or no", or so they say (yes, I did a rough estimate: about 10^100 seconds even if I used all the computing horsepower on Earth today).

Suppose there are R known-broken rows of data (be it bits, bytes, sectors, whole columns, or whatever quantum of data we take) on D data disks and P parity disks, all readable without HW IO errors, where "known brokenness" means both a parity mismatch in that row and a checksum mismatch for the whole userdata block. We do not know in advance how many errors there are in a row (we can only hope there are no more than there are parity columns), nor where exactly the problem is. Thanks to the checksum mismatch, we do know that at least one error is in the data disks' on-disk data.

We might hope to find the correct "original data" which matches the checksum by determining, for each data disk, the possible alternate byte values (computed from the bytes at the same offsets on the other data and parity disks), and then checksumming the recombined userdata blocks with some of the on-disk bytes replaced by these calculated values. For each row we test 1..P alternate column values, and we must apply the alteration to all of the rows where known errors exist, in order to detect neighboring but non-overlapping errors in different components of the block's allocation. (This was the breakage scenario deemed possible for raidzN, with disk heads hovering over similar locations all the time.)

This can yield a very large field of combinations when the row height is small (i.e. matching 1 byte per disk), or too few combinations when the row height is chosen too big (i.e. one disk's whole portion of the userdata - a quarter of the block in the case of my 4-data-disk set). For single-break-per-row tests based on hypotheses from P parities, D data disks and R broken rows, we need to checksum P*(D^R) userdata recombinations in order to determine that we can't recover the block. To catch the less probable case of several errors per row (up to the amount of parity we have), we would need to retry even more combinations afterwards.

My 5-year-old Pentium D tested 1000 sha256 checksums over 128KB blocks in about 2-3 seconds, so it is reasonable to keep the reconstruction loops - and thus the smallness of a step, and thus the number of steps - within some arbitrarily chosen timeout (30 sec? 1 sec?). With a fixed amount of parity and data disks in a particular TLVDEV, we can determine the "reasonable" row heights. Also, this low-level recovery at higher amounts of cycles might be a job for a separate tool - i.e. "on-line" recovery during ZFS IO and scrubs might be limited to a few sectors, and whatever is not fixed by that can be manually fed to a programmatic number-cruncher and possibly get recovered overnight... Two rough sketches of these estimates follow below.
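First, to make the cost argument concrete: a minimal Python sketch of the P*(D^R) budget check, using my own rough numbers (the Pentium D checksum rate and an assumed 30-second budget) rather than anything taken from the ZFS code:

  def hypotheses(P, D, R):
      """Recombinations to checksum before declaring the block
      unrecoverable, assuming at most one error per broken row."""
      return P * (D ** R)

  def report(P, D, R, rate=400.0, timeout=30.0):
      # rate: ~1000 sha256 over 128KB blocks per 2-3s on my Pentium D;
      # timeout: the arbitrarily chosen online-recovery budget.
      n = hypotheses(P, D, R)
      shown = str(n) if n < 10**12 else "~10^%d" % (len(str(n)) - 1)
      print("P=%d D=%d R=%d: %s combinations, fits budget: %s"
            % (P, D, R, shown, n <= rate * timeout))

  # raidz1 over 4 data disks, varying the number of broken rows:
  for rows in (1, 2, 3, 5, 8, 64, 4096):
      report(P=1, D=4, R=rows)

On these assumed numbers, a handful of broken rows fits the 30-second budget easily, 8 rows already does not, and the per-byte extreme (R=4096) is the ~10^2466 monster mentioned above.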
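Second, the cheap detection part. This is just my illustration with plain XOR parity (raidz1-style; raidz2/3 add further Reed-Solomon-style parities), not the real raidz routines:

  from functools import reduce

  def mismatch_offsets(data, parity):
      """Byte offsets where the stored parity disagrees with the
      XOR of the data disks' bytes at the same offset. 'data' is a
      list of D equal-length byte strings, 'parity' the parity column."""
      bad = []
      for i in range(len(parity)):
          x = reduce(lambda a, b: a ^ b, (col[i] for col in data), 0)
          if x != parity[i]:
              bad.append(i)
      return bad

  def alternate_byte(data, parity, disk, offset):
      """Hypothesising that 'disk' holds the wrong byte at 'offset',
      XOR of the parity byte and the other disks' bytes at that
      offset yields the single candidate replacement (P=1)."""
      x = parity[offset]
      for d, col in enumerate(data):
          if d != disk:
              x ^= col[offset]
      return x

For R mismatching rows this enumerates the D*R suspect bytes directly; the expensive part is only choosing which of them to actually flip before each checksum test.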
I now know that it is cheap and fast to determine parity mismatches for each single-byte column offset in a userdata block (leading to D*R userdata bytes whose contents we are not certain of - the second sketch above shows the idea), so even if the quantum of data for reconstructions is a sector, it is quite reasonable to start with byte-by-byte mismatch detection. The locations of the detected errors can help us determine whether the errors are colocated in a single row of sectors (so likely one or more sectors at the same offset on different disks got broken), or spread over several sectors (we might be lucky and have single errors per disk in neighboring sector numbers). It is, after all, not reasonable to go below 512b, or even the larger HW sector size, as the quantum of data for recovery attempts.

But testing *only* whole columns (*if* this is what is done today) also forfeits some chances of automated recovery - though, certainly, the recovery attempts should start with the most probable combinations, such as all errors being confined to a single disk, and then go down in step size, testing possible errors on several component disks. We can afford several thousand checksum tests, which might give a chance to recover more data than is recoverable today, *if* today's tests are indeed not so exhaustive...

To conclude, I still do not know (I did not read that deep into the code) how ZFS does its raidz recovery attempts today - is the "row height" a whole single disk's portion (i.e. 32KB from a 128KB block over 4 data disks), or some physical or logical sector size (4KB, 512b)?.. Even per-sector reconstruction of a 128KB block over four 512b-sectored disks, with a single alternate variant per byte from parity reconstruction, yields, if I am not mistaken, an impressive and infeasible 4^64 combinations to test with checksums by pure brute force. Heck, just counting from 1 to 2^64 in an "i++" loop takes a lot of CPU time :)

And so far my problems have occurred on compressed blocks whose physical allocation is about a single sector in height, and at this "resolution" it was not possible to find one broken sector and fix the userdata. I'm still waiting for the scrub to complete so that I can get some corrupted files with parity errors in different rows of HW-sector height.

Thanks for listening, I'm out :)

//Jim Klimov
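P.S. For the curious, a toy driver tying the two sketches together - per-row mismatch detection plus single-break-per-row hypothesis testing against the block checksum - might look like the following. This is again my illustration under simplifying assumptions (plain XOR parity as in raidz1, data columns simply concatenated into the userdata block, sha256 as the block checksum), not how the actual raidz code is structured:

  from hashlib import sha256
  from itertools import product

  def xor_row_ok(data, parity, r, row):
      """True if XOR parity holds for every byte of row r."""
      for i in range(r * row, (r + 1) * row):
          x = 0
          for col in data:
              x ^= col[i]
          if x != parity[i]:
              return False
      return True

  def rebuild_row(data, parity, disk, r, row):
      """Candidate column for 'disk' with row r recomputed from
      parity and the other disks (the single-error hypothesis)."""
      out = bytearray(data[disk])
      for i in range(r * row, (r + 1) * row):
          x = parity[i]
          for d, col in enumerate(data):
              if d != disk:
                  x ^= col[i]
          out[i] = x
      return out

  def try_recover(data, parity, want_digest, row=512, budget=12000):
      """Try all single-break-per-row recombinations within a budget
      of checksum tests; return the recovered block or None."""
      nrows = len(parity) // row
      bad = [r for r in range(nrows)
             if not xor_row_ok(data, parity, r, row)]
      D = len(data)
      if D ** len(bad) > budget:
          return None  # hopeless within the chosen time budget
      for guess in product(range(D), repeat=len(bad)):
          cols = [bytearray(c) for c in data]
          for r, disk in zip(bad, guess):
              cols[disk] = rebuild_row(cols, parity, disk, r, row)
          block = b"".join(cols)
          if sha256(block).digest() == want_digest:
              return block
      return None

With a budget of ~12000 tests (the 30-second figure above), this covers up to 6 broken rows on 4 data disks (4^7 already exceeds it); anything beyond that is overnight-number-cruncher territory.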