On 2012-12-02 05:42, Jim Klimov wrote:
> My plan is to dig out the needed sectors of the broken block from each of the 6 disks and try any and all reasonable recombinations of redundancy and data sectors to match the checksum - this should be my definitive answer on whether ZFS (of that oi151.1.3-based build) does all I think it can to save data or not. Either I put the last nail into my itching question's coffin, or I'd nail a bug to yell about ;)
Well, I've come to a number of conclusions, though I have not yet closed this matter for myself. One concerns the definition of "all reasonable recombinations": ZFS does not do *everything* possible to recover corrupt data, and in fact it can't; nobody can. When I took this to an extreme, assuming that the bytes at different offsets within a sector might fail on different disks that comprise a block, reconstructing and testing a single failed sector byte-by-byte becomes computationally infeasible: for 4 data disks and 1 parity I got about 4^4096 combinations to test. The next Big Bang will happen sooner than I'd get a "yes or no", or so they say (yes, I did a rough estimate: about 10^100 seconds even if I used all the computing horsepower on Earth today).

Suppose there are R known-broken rows of data (be it bits, bytes, sectors, whole columns, or whatever quantum of data we take) on D data disks and P parity disks, all readable without HW IO errors, where "known brokenness" means both a parity mismatch in that row and a checksum mismatch for the whole userdata block. We do not know in advance how many errors there are in a row (we can only hope there are no more than there are parity columns), nor where exactly the problem is. Thanks to the checksum mismatch, we do know that at least one error is in the data disks' on-disk data.

We might hope to find the correct "original data" which matches the checksum by determining, for each data disk, the possible alternate byte values (computed from the bytes at the same offsets on the other data and parity disks), and then checksumming the recombined userdata blocks with some of the on-disk bytes replaced by these calculated values. For each row we test 1..P alternate column values, and we must apply the alteration to all of the rows where known errors exist, in order to detect neighboring but non-overlapping errors in different components of the block's allocation. (This was the breakage scenario deemed possible for raidzN, with disk heads hovering over similar locations all the time.)

This can yield a very large field of combinations when the row height is small (i.e. matching 1 byte per disk), or too few combinations when the row height is chosen too big (i.e. one disk's whole portion of the userdata - a quarter of the block in the case of my 4-data-disk set). For single-break-per-row tests based on hypotheses from P parities, D data disks and R broken rows, we need to checksum P*(D^R) userdata recombinations in order to determine that we can't recover the block. To catch the less probable case of several errors per row (up to the amount of parity we have), we would need to retry even more combinations afterwards.

My 5-year-old Pentium D tested 1000 sha256 checksums over 128KB blocks in about 2-3 seconds, so it is reasonable to keep the reconstruction loops - and thus the smallness of a step, and thus the number of steps - within some arbitrarily chosen timeout (30 sec? 1 sec?). With a fixed amount of parity and data disks in a particular TLVDEV, we can determine the "reasonable" row heights. Also, this low-level recovery at higher amounts of cycles might be a job for a separate tool - i.e. "on-line" recovery during ZFS IO and scrubs might be limited to a few sectors, and whatever is not fixed by that can be manually fed to a programmatic number-cruncher and possibly get recovered overnight... Two rough sketches of these estimates follow below.
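First, to make the cost argument concrete: a minimal Python sketch of the P*(D^R) budget check, using my own rough numbers (the Pentium D checksum rate and an assumed 30-second budget) rather than anything taken from the ZFS code:

  def hypotheses(P, D, R):
      """Recombinations to checksum before declaring the block
      unrecoverable, assuming at most one error per broken row."""
      return P * (D ** R)

  def report(P, D, R, rate=400.0, timeout=30.0):
      # rate: ~1000 sha256 over 128KB blocks per 2-3s on my Pentium D;
      # timeout: the arbitrarily chosen online-recovery budget.
      n = hypotheses(P, D, R)
      shown = str(n) if n < 10**12 else "~10^%d" % (len(str(n)) - 1)
      print("P=%d D=%d R=%d: %s combinations, fits budget: %s"
            % (P, D, R, shown, n <= rate * timeout))

  # raidz1 over 4 data disks, varying the number of broken rows:
  for rows in (1, 2, 3, 5, 8, 64, 4096):
      report(P=1, D=4, R=rows)

On these assumed numbers, a handful of broken rows fits the 30-second budget easily, 8 rows already does not, and the per-byte extreme (R=4096) is the ~10^2466 monster mentioned above.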
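Second, the cheap detection part. This is just my illustration with plain XOR parity (raidz1-style; raidz2/3 add further Reed-Solomon-style parities), not the real raidz routines:

  from functools import reduce

  def mismatch_offsets(data, parity):
      """Byte offsets where the stored parity disagrees with the
      XOR of the data disks' bytes at the same offset. 'data' is a
      list of D equal-length byte strings, 'parity' the parity column."""
      bad = []
      for i in range(len(parity)):
          x = reduce(lambda a, b: a ^ b, (col[i] for col in data), 0)
          if x != parity[i]:
              bad.append(i)
      return bad

  def alternate_byte(data, parity, disk, offset):
      """Hypothesising that 'disk' holds the wrong byte at 'offset',
      XOR of the parity byte and the other disks' bytes at that
      offset yields the single candidate replacement (P=1)."""
      x = parity[offset]
      for d, col in enumerate(data):
          if d != disk:
              x ^= col[offset]
      return x

For R mismatching rows this enumerates the D*R suspect bytes directly; the expensive part is only choosing which of them to actually flip before each checksum test.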
I now know that it is cheap and fast to determine parity mismatches for each single-byte column offset in a userdata block (leading to D*R userdata bytes whose contents we are not certain of - the second sketch above shows the idea), so even if the quantum of data for reconstructions is a sector, it is quite reasonable to start with byte-by-byte mismatch detection. The locations of the detected errors can help us determine whether the errors are colocated in a single row of sectors (so likely one or more sectors at the same offset on different disks got broken), or spread over several sectors (we might be lucky and have single errors per disk in neighboring sector numbers). It is, after all, not reasonable to go below 512b, or even the larger HW sector size, as the quantum of data for recovery attempts.

But testing *only* whole columns (*if* this is what is done today) also forfeits some chances of automated recovery - though, certainly, the recovery attempts should start with the most probable combinations, such as all errors being confined to a single disk, and then go down in step size, testing possible errors on several component disks. We can afford several thousand checksum tests, which might give a chance to recover more data than is recoverable today, *if* today's tests are indeed not so exhaustive...

To conclude, I still do not know (I did not read that deep into the code) how ZFS does its raidz recovery attempts today - is the "row height" a whole single disk's portion (i.e. 32KB from a 128KB block over 4 data disks), or some physical or logical sector size (4KB, 512b)?.. Even per-sector reconstruction of a 128KB block over four 512b-sectored disks, with a single alternate variant per byte from parity reconstruction, yields, if I am not mistaken, an impressive and infeasible 4^64 combinations to test with checksums by pure brute force. Heck, just counting from 1 to 2^64 in an "i++" loop takes a lot of CPU time :)

And so far my problems have occurred on compressed blocks whose physical allocation is about a single sector in height, and at this "resolution" it was not possible to find one broken sector and fix the userdata. I'm still waiting for the scrub to complete so that I can get some corrupted files with parity errors in different rows of HW-sector height.

Thanks for listening, I'm out :)

//Jim Klimov
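P.S. For the curious, a toy driver tying the two sketches together - per-row mismatch detection plus single-break-per-row hypothesis testing against the block checksum - might look like the following. This is again my illustration under simplifying assumptions (plain XOR parity as in raidz1, data columns simply concatenated into the userdata block, sha256 as the block checksum), not how the actual raidz code is structured:

  from hashlib import sha256
  from itertools import product

  def xor_row_ok(data, parity, r, row):
      """True if XOR parity holds for every byte of row r."""
      for i in range(r * row, (r + 1) * row):
          x = 0
          for col in data:
              x ^= col[i]
          if x != parity[i]:
              return False
      return True

  def rebuild_row(data, parity, disk, r, row):
      """Candidate column for 'disk' with row r recomputed from
      parity and the other disks (the single-error hypothesis)."""
      out = bytearray(data[disk])
      for i in range(r * row, (r + 1) * row):
          x = parity[i]
          for d, col in enumerate(data):
              if d != disk:
                  x ^= col[i]
          out[i] = x
      return out

  def try_recover(data, parity, want_digest, row=512, budget=12000):
      """Try all single-break-per-row recombinations within a budget
      of checksum tests; return the recovered block or None."""
      nrows = len(parity) // row
      bad = [r for r in range(nrows)
             if not xor_row_ok(data, parity, r, row)]
      D = len(data)
      if D ** len(bad) > budget:
          return None  # hopeless within the chosen time budget
      for guess in product(range(D), repeat=len(bad)):
          cols = [bytearray(c) for c in data]
          for r, disk in zip(bad, guess):
              cols[disk] = rebuild_row(cols, parity, disk, r, row)
          block = b"".join(cols)
          if sha256(block).digest() == want_digest:
              return block
      return None

With a budget of ~12000 tests (the 30-second figure above), this covers up to 6 broken rows on 4 data disks (4^7 already exceeds it); anything beyond that is overnight-number-cruncher territory.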