Richard Elling wrote:
There are many error correcting codes available. RAID2 used Hamming
codes, but that's just one of many options out there. Par2 uses
configurable-strength Reed-Solomon to get multi-bit error correction.
The par2 source is available, although from a ZFS perspective it is
hindered by the CDDL-GPL license incompatibility problem.
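As a toy illustration of what such a code buys you -- single-bit
correction per codeword in the Hamming case -- here is a minimal
Hamming(7,4) encode/decode sketch in Python. It is illustrative only;
it is not how RAID2 hardware or par2's Reed-Solomon work internally.
------ sketch ------
def hamming74_encode(d):          # d: four data bits [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4             # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4             # parity over codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4             # parity over codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]      # positions 1..7

def hamming74_decode(c):          # c: 7-bit codeword, possibly one bit flipped
    p1, p2, d1, p3, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4        # recompute each parity; mismatches form
    s2 = p2 ^ d1 ^ d3 ^ d4        # a syndrome that points at the bad bit
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s1 + (s2 << 1) + (s3 << 2)    # 1-based position of the error
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1      # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]          # recover the four data bits

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                  # inject a single-bit error
assert hamming74_decode(codeword) == [1, 0, 1, 1]
------ end sketch ------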
It is possible to write a FUSE filesystem using Reed-Solomon (like
par2) as the underlying protection. A quick search of the FUSE
website turns up the Reed-Solomon FS (a FUSE-based filesystem):
"Shielding your files with Reed-Solomon codes"
http://ttsiodras.googlepages.com/rsbep.html
While most FUSE work is done on Linux (and there is a ZFS-FUSE project
for it), FUSE work has also been done for OpenSolaris:
http://www.opensolaris.org/os/project/fuse/
BTW, if you do hit a case where unprotected data is not readable, then
I have a little DTrace script that I'd like you to run; it would help
determine the extent of the corruption. This is one of those studies
that doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
Is this intended as a general monitoring script, or only for use after
one has otherwise experienced corruption problems?
It is intended to try to answer the question of whether the errors we
see in real life might be single-bit errors. I do not believe they will
be single-bit errors, but we don't have the data.
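For what it's worth, the underlying question is easy to state in code:
given a known-good copy of a block and a corrupt one, how many bits
actually differ? zcksummon itself is a DTrace script; what follows is
only a host-side Python sketch of that comparison, not the script.
------ sketch ------
def bit_diff(good: bytes, bad: bytes):
    """Return the bit positions at which two equal-sized buffers differ."""
    assert len(good) == len(bad)
    flips = []
    for i, (g, b) in enumerate(zip(good, bad)):
        x = g ^ b
        for bit in range(8):
            if x & (1 << bit):
                flips.append(i * 8 + bit)
    return flips

good = bytes(4096)
bad = bytearray(good)
bad[100] ^= 0x10                      # inject one flipped bit
print(bit_diff(good, bytes(bad)))     # [804] -> a single-bit error
------ end sketch ------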
To be pedantic, wouldn't protected data also be affected if all
copies are damaged at the same time, especially if also damaged in
the same place?
Yep, which is why there is RFE CR 6674679, "complain if all data
copies are identical and corrupt."
-- richard
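That check is simple to state. Here is a rough Python sketch of the
idea; the real RFE is of course about ZFS's own checksums and
mirror/ditto copies, and the comment about where such damage likely
originates is my interpretation, not wording from the CR.
------ sketch ------
def classify_bad_block(copies, checksum_ok):
    """copies: raw data of every copy of one block, all failing the checksum.
    checksum_ok: callable returning True if a buffer matches the checksum."""
    assert not any(checksum_ok(c) for c in copies)
    if all(c == copies[0] for c in copies):
        # Every copy is wrong in exactly the same way; the damage most
        # likely happened before the copies were written, so complain
        # loudly instead of counting it as an ordinary media error.
        return "identical and corrupt"
    return "copies differ"

good = bytes(128)                        # pretend this matches the checksum
bad_a = b'\x01' + bytes(127)             # one corrupt copy
bad_b = b'\x02' + bytes(127)             # a differently corrupt copy
ok = lambda c: c == good
print(classify_bad_block([bad_a, bad_a], ok))   # identical and corrupt
print(classify_bad_block([bad_a, bad_b], ok))   # copies differ
------ end sketch ------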
There is a related but unlikely scenario that is also probably not
covered yet. I'm not sure what kind of common cause would lead to it;
maybe a disk array turning into swiss cheese, with bad sectors suddenly
showing up on multiple drives? Its probability increases with larger
logical block sizes (e.g. 128k blocks are at higher risk than 4k
blocks, simply because a 128k block spans 256 512-byte sectors while a
4k block spans only 8; a block being the smallest piece of storage real
estate used by the filesystem). It is the edge case of multiple damaged
copies where the damage consists of unreadable bad sectors at different
offsets within the corresponding copies of a block. This could be
recovered from by taking the readable sectors from one copy and filling
in the holes with the corresponding sectors from the other copies. The
rebuilt block should then pass the checksum test, assuming there were
no other problems with the still-readable sectors. A rough sketch of
the idea follows.
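A minimal Python sketch of that sector merge, assuming each copy is
read sector by sector with unreadable sectors reported as None, and
using SHA-256 to stand in for the block checksum (ZFS would use its
own checksum; this is not something ZFS does today):
------ sketch ------
import hashlib

SECTOR = 512

def merge_copies(copies, expected_sha256):
    """copies: list of lists of 512-byte sectors (None = unreadable sector)."""
    nsect = len(copies[0])
    rebuilt = bytearray()
    for i in range(nsect):
        sector = next((c[i] for c in copies if c[i] is not None), None)
        if sector is None:
            return None        # this offset is bad in every copy: unrecoverable
        rebuilt += sector
    if hashlib.sha256(rebuilt).hexdigest() != expected_sha256:
        return None            # a "readable" sector was itself silently corrupt
    return bytes(rebuilt)

# Toy example: a 4-sector block, each copy missing a different sector.
block = bytes(range(256)) * 8                      # 2048 bytes = 4 sectors
sectors = [block[i:i + SECTOR] for i in range(0, len(block), SECTOR)]
copy_a = [sectors[0], None, sectors[2], sectors[3]]
copy_b = [sectors[0], sectors[1], None, sectors[3]]
rebuilt = merge_copies([copy_a, copy_b], hashlib.sha256(block).hexdigest())
assert rebuilt == block
------ end sketch ------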
---
A bad-sector-specific recovery technique is to instruct the disk to
return the raw read data rather than trying to correct it. The READ
LONG command can do this (though the specs say it only works with
28-bit LBA). READ LONG corresponds to writes done with WRITE LONG
(28-bit) or WRITE UNCORRECTABLE EXT (48-bit); Linux hdparm uses these
write commands when it is used to create bad sectors with the
--make-bad-sector option. The resulting sectors are logically bad at a
low level, in that the sector's data and its ECC do not match; they are
not physically bad. With multiple read attempts, a statistical
distribution of the likely 'true' contents of the sector can be found.
SpinRite claims to do this. Linux 'hdparm --read-sector' can sometimes
return data from nominally bad sectors too, but it doesn't have a
built-in statistical recovery method (a wrapper script could probably
solve that; see the sketch below). I don't know whether hdparm
--read-sector uses READ LONG or not.
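A Python sketch of what such a wrapper's voting step could look like.
The function only takes the list of raw reads; actually obtaining them
(via hdparm --read-sector, a READ LONG ioctl, or similar) is the
drive- and driver-specific part that is deliberately left out here.
------ sketch ------
def majority_vote(reads):
    """reads: equal-length byte strings from repeated raw reads of one sector."""
    recovered = bytearray(len(reads[0]))
    for i in range(len(reads[0])):
        byte = 0
        for bit in range(8):
            ones = sum((r[i] >> bit) & 1 for r in reads)
            if ones * 2 > len(reads):      # keep the bit value seen most often
                byte |= 1 << bit           # (exact ties fall back to 0)
        recovered[i] = byte
    return bytes(recovered)

# Toy demo: three noisy reads of a one-byte "sector", true value 0b10110010.
reads = [bytes([0b10110010]), bytes([0b10111010]), bytes([0b00110010])]
assert majority_vote(reads) == bytes([0b10110010])
------ end sketch ------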
hdparm man page: http://linuxreviews.org/man/hdparm/
Good description of IDE commands, including READ LONG and WRITE LONG
(the specs say they are 28-bit only):
http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html
SCSI versions of READ LONG and WRITE LONG
http://en.wikipedia.org/wiki/SCSI_Read_Commands#Read_Long
http://en.wikipedia.org/wiki/SCSI_Write_Commands#Write_Long
Here is a report by forum member "qubit", who modified his Linux
taskfile driver to use READ LONG for data recovery purposes, along
with his subsequent analysis:
http://forums.storagereview.net/index.php?showtopic=5910
http://www.tech-report.com/news_reply.x/3035
http://techreport.com/ja.zz?comments=3035&page=5
------ quote ------
318. Posted at 07:00 am on Jun 6th 2002 by qubit
My DTLA-307075 (75GB 75GXP) went bad 6 months ago. But I didn't write
off the data as being unrecoverable. I used WinHex to make a ghost image
of the drive onto my new larger one, zeroing out the bad sectors in the
target while logging each bad sector. (There were bad sectors in the FAT
so I combined the good parts from FATs 1 and 2.) At this point I had a
working mirror of the drive that went bad, with zeroed-out 512 byte
holes in files where the bad sectors were.
Then I set the 75GXP aside, because I knew it was possible to recover
some of the data *on* the bad sectors, but I didn't have the tools to do
it. So I decided to wait until then to RMA it.
I did write a program to parse the bad sector list along with the
partition's FAT, to create a list of files with bad sectors in them, so
at least I knew which files were affected. There are 8516 bad sectors,
and 722 files affected.
But this week, I got Linux working on my new computer (upgraded not too
long after the 75GXP went bad) and modified the IDE taskfile driver to
allow me to use READ LONG on the bad sectors -- thus allowing me to
salvage data from the bad sectors, while avoiding the nasty
click-click-click and delay of retrying (I can now repeat reads of a bad
sector about twice per second) and I can also get the 40 bytes of ECC
data. Each read of one sector turns up different data, and by comparing
them I can try to divine what the original was. That part I'm still
working on (it'd help a lot to know what encoding method the drive uses
- it's not RLL(2,7), which is the only one I've been able to get the
details on).
But today I did a different kind of analysis, with VERY interesting
results. I wrote a program to convert the list of bad sectors into a
graphics file, using the data on zones and sectors per track found in
IBM's specification. After some time and manipulation, I discovered that
all the bad sectors are in a line going from the outer edge 1/3 of the
way to the inner edge, on one platter surface! It's actually a spiral,
because of the platter rotation. But this explains why all the sectors
went bad at once. One of the heads must have executed a write cycle
while seeking! I could even measure the seek speed from my bad sector
data -- it's 4.475 ms/track! (assuming precisely 7200 rpm) And there are
evenly spaced nodes along the line where larger chunks were corrupted --
starting 300 ms apart, gradually fading to where they actually are
*less* corrupted than the line itself, at 750 ms apart.
I don't know if anyone else will find this interesting, but I found it
fascinating, and it explained a lot. If you'd like to talk to me about
the technical aspects of 75GXP failure, please email me at
quSPAMLESSbitATinorNOSPAMbitDOTcom (remove the chunks of spam, change AT
and DOT to their respective symbols).
For completeness, I should say that I had the drive for a year before it
developed the rash of bad sectors. It's made in Hungary, SEP-2000.
I wasn't using it too heavily until I got an HDTV card, then I was
recording HDTV onto the drive; this heavy usage might have helped it
along to failure. (2.4 MB/sec sustained writing -- and it was quite
noisy too.)
I updated the drive's firmware not too long after it developed the bad
sectors; of course this didn't let me read them any better -- I didn't
expect it to. I'm not sure if the firmware update will make the drive
safe to use after a reformat, but I'll surely try it once I've
recovered as much of the bad sectors as I can. Even if I still RMA the
drive, I'd like to know.
------ end quote ------
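As an aside, the bad-sector-map analysis qubit describes (turning a
list of bad LBAs into platter positions via the drive's
sectors-per-track zone table) looks roughly like the Python below. The
zone table here is invented, and real drives also interleave sectors
across several heads/surfaces, which this ignores; qubit used the real
zone data from IBM's specification.
------ sketch ------
import math

# (number_of_tracks_in_zone, sectors_per_track) -- hypothetical values
ZONES = [(10000, 792), (10000, 720), (10000, 630)]

def lba_to_track_angle(lba):
    """Map a logical block address to (track, angular position in radians)."""
    track_base, lba_base = 0, 0
    for tracks, spt in ZONES:
        zone_sectors = tracks * spt
        if lba < lba_base + zone_sectors:
            offset = lba - lba_base
            track = track_base + offset // spt
            angle = 2 * math.pi * (offset % spt) / spt
            return track, angle
        track_base += tracks
        lba_base += zone_sectors
    raise ValueError("LBA beyond last zone in table")

# Plotting (track, angle) for every bad LBA would reveal a spiral if the
# damage was laid down by a head writing while seeking.
for lba in (1000, 8_000_000, 15_000_000):
    track, angle = lba_to_track_angle(lba)
    print(f"LBA {lba}: track {track}, angle {math.degrees(angle):.1f} deg")
------ end sketch ------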