Richard Elling wrote:
There are many error-correcting codes available. RAID2 used Hamming codes, but that's just one of many options out there. Par2 uses configurable-strength Reed-Solomon coding to get multi-bit error correction. The par2 source is available, although from a ZFS perspective it is hindered by the CDDL-GPL license incompatibility problem.
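For anyone who wants to see the par2-style idea in miniature, here is a small Python sketch using the third-party reedsolo package (my choice for illustration only; par2 has its own codec and on-disk format, so none of the names below come from par2 itself). The parity strength is configurable, just as in par2: NPARITY parity bytes per codeword let the decoder correct up to NPARITY/2 corrupted bytes.

------ sketch (Python) ------
# Minimal illustration of configurable-strength Reed-Solomon protection,
# using the third-party 'reedsolo' package (pip install reedsolo).
from reedsolo import RSCodec, ReedSolomonError

NPARITY = 16                       # parity bytes; corrects up to NPARITY // 2 byte errors
rsc = RSCodec(NPARITY)

original = b"block payload that we want to survive multi-byte corruption"
protected = rsc.encode(original)   # payload with NPARITY parity bytes appended

# Corrupt a few bytes (fewer than NPARITY // 2, so recovery should succeed).
damaged = bytearray(protected)
for i in (3, 17, 42):
    damaged[i] ^= 0xFF

try:
    result = rsc.decode(bytes(damaged))
    # Newer reedsolo versions return (message, message+ecc, errata positions);
    # older ones return just the message.
    recovered = result[0] if isinstance(result, tuple) else result
    assert bytes(recovered) == original
    print("corruption corrected")
except ReedSolomonError:
    print("too many errors to correct")
------ end sketch ------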

It is possible to write a FUSE filesystem using Reed-Solomon (like par2) as the underlying protection. A quick search of the FUSE website turns up the Reed-Solomon FS (a FUSE-based filesystem): "Shielding your files with Reed-Solomon codes" http://ttsiodras.googlepages.com/rsbep.html

While most FUSE work is done on Linux (and there is a ZFS-FUSE project for it), FUSE work has also been done for OpenSolaris:
http://www.opensolaris.org/os/project/fuse/

BTW, if you do have the case where unprotected data is not
readable, then I have a little DTrace script that I'd like you to run
which would help determine the extent of the corruption.  This is
one of those studies which doesn't like induced errors ;-)
http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon
Is this intended as a general monitoring script, or only for use after one has otherwise experienced corruption problems?


It is intended to try to answer the question of whether the errors we see
in real life might be single-bit errors. I do not believe they will be
single-bit errors, but we don't have the data.
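For what it's worth, the bit-level comparison itself is trivial once you have a corrupt block and a known-good copy side by side. A toy Python sketch (made-up buffers, not output from the zcksummon DTrace script); a Hamming distance of exactly 1 would support the single-bit-flip theory:

------ sketch (Python) ------
def bit_difference(good, bad):
    """Count differing bits between two equal-length buffers."""
    assert len(good) == len(bad)
    return sum(bin(a ^ b).count("1") for a, b in zip(good, bad))

good_copy = bytes(128 * 1024)               # stand-in 128k block (all zeros)
bad_copy = bytearray(good_copy)
bad_copy[512] ^= 0x01                       # inject exactly one flipped bit

print(bit_difference(good_copy, bad_copy))  # -> 1, consistent with a single-bit error
------ end sketch ------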

To be pedantic, wouldn't protected data also be affected if all copies are damaged at the same time, especially if also damaged in the same place?

Yep.  Which is why there is RFE CR 6674679, complain if all data
copies are identical and corrupt.
-- richard
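A rough sketch of what that complaint could look like in logic, if it helps make the RFE concrete (hypothetical helper of my own, with SHA-256 standing in for the filesystem checksum):

------ sketch (Python) ------
import hashlib

def classify_corruption(copies, expected_checksum):
    """copies: list of byte strings read for the same logical block."""
    bad = [c for c in copies if hashlib.sha256(c).digest() != expected_checksum]
    if not bad:
        return "at least one copy verifies; normal self-healing applies"
    if len(bad) == len(copies) and len(set(copies)) == 1:
        return "all copies identical and corrupt -- likely common-cause damage"
    return "copies corrupt but differ -- likely independent media damage"
------ end sketch ------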

There is a related but unlikely scenario that is probably not covered yet. I'm not sure what kind of common cause would lead to it; maybe a disk array turning into swiss cheese, with bad sectors suddenly showing up on multiple drives. Its probability increases with larger logical block sizes (e.g. 128k blocks are at higher risk than 4k blocks, a block being the smallest piece of storage real estate used by the filesystem). It is the edge case of multiple damaged copies where the damage consists of unreadable bad sectors at different corresponding positions within the block. This could be recovered from by taking the readable sectors from each copy and filling in the holes with the corresponding sectors from the other copies. The rebuilt block should then pass the checksum test, assuming there were no other problems with the still-readable sectors.
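A rough Python sketch of that merge, just to show the idea (the sector and block sizes are illustrative, and SHA-256 stands in for whatever checksum the filesystem actually uses):

------ sketch (Python) ------
# Two copies of the same logical block, each with unreadable sectors in
# *different* places, merged sector-by-sector and validated against the
# block checksum.
import hashlib

SECTOR = 512
BLOCK = 128 * 1024                     # one 128k logical block

original = bytes(range(256)) * (BLOCK // 256)
expected_checksum = hashlib.sha256(original).digest()

def split(block):
    return [block[i:i + SECTOR] for i in range(0, len(block), SECTOR)]

# Simulate two damaged copies: None marks a sector the drive would not return.
copy_a = split(original); copy_a[3] = None; copy_a[100] = None
copy_b = split(original); copy_b[7] = None; copy_b[200] = None

def merge(*copies):
    """Take the first readable version of each sector across all copies."""
    merged = []
    for sectors in zip(*copies):
        readable = next((s for s in sectors if s is not None), None)
        if readable is None:
            raise IOError("same sector unreadable in every copy")
        merged.append(readable)
    return b"".join(merged)

rebuilt = merge(copy_a, copy_b)
assert hashlib.sha256(rebuilt).digest() == expected_checksum
print("block rebuilt and checksum verified")
------ end sketch ------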

---

A bad-sector-specific recovery technique is to instruct the disk to return the raw read data rather than trying to correct it. The READ LONG command can do this (though the specs say it only works with 28-bit LBA). READ LONG corresponds to writes done with WRITE LONG (28-bit) or WRITE UNCORRECTABLE EXT (48-bit); Linux hdparm uses these write commands when it is used to create bad sectors with the --make-bad-sector option. The resulting sectors are logically bad at a low level, in that the sector's data and ECC do not match; they are not physically bad. With multiple read attempts, a statistical distribution of the likely 'true' contents of the sector can be built up. SpinRite claims to do this. Linux 'hdparm --read-sector' can sometimes return data from nominally bad sectors too, but it has no built-in statistical recovery method (a wrapper script could probably add that). I don't know whether hdparm --read-sector uses READ LONG or not.
HDPARM man page: http://linuxreviews.org/man/hdparm/
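If you want to play with the statistical idea without driver hacking, the voting step itself is simple. Below is a rough Python sketch under obvious assumptions: raw_read_sector() is a made-up stand-in for whatever actually fetches uncorrected sector data (a READ LONG path, repeated 'hdparm --read-sector' calls, etc.), and the corruption here is just simulated noise.

------ sketch (Python) ------
# Read the same bad sector many times, then take a per-byte majority vote.
import random
from collections import Counter

SECTOR = 512
true_sector = bytes(random.randrange(256) for _ in range(SECTOR))

def raw_read_sector(flip_probability=0.05):
    """Simulated raw read: each byte is occasionally returned corrupted."""
    return bytes(
        b ^ random.randrange(1, 256) if random.random() < flip_probability else b
        for b in true_sector
    )

def majority_vote(reads):
    """For each byte position, keep the value seen most often across reads."""
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*reads))

reads = [raw_read_sector() for _ in range(15)]
reconstructed = majority_vote(reads)
print("bytes still wrong after voting:",
      sum(1 for a, b in zip(reconstructed, true_sector) if a != b))
------ end sketch ------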

A good description of IDE commands, including READ LONG and WRITE LONG (the specs say they are 28-bit only):
http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html
SCSI versions of READ LONG and WRITE LONG:
http://en.wikipedia.org/wiki/SCSI_Read_Commands#Read_Long
http://en.wikipedia.org/wiki/SCSI_Write_Commands#Write_Long

Here is a report from forum member "qubit", who modified his Linux IDE taskfile driver to use READ LONG for data recovery, along with his subsequent analysis:

http://forums.storagereview.net/index.php?showtopic=5910
http://www.tech-report.com/news_reply.x/3035
http://techreport.com/ja.zz?comments=3035&page=5

------ quote ------
318. Posted at 07:00 am on Jun 6th 2002 by qubit

My DTLA-307075 (75GB 75GXP) went bad 6 months ago. But I didn't write off the data as being unrecoverable. I used WinHex to make a ghost image of the drive onto my new larger one, zeroing out the bad sectors in the target while logging each bad sector. (There were bad sectors in the FAT so I combined the good parts from FATs 1 and 2.) At this point I had a working mirror of the drive that went bad, with zeroed-out 512 byte holes in files where the bad sectors were.

Then I set the 75GXP aside, because I knew it was possible to recover some of the data *on* the bad sectors, but I didn't have the tools to do it. So I decided to wait until then to RMA it.

I did write a program to parse the bad sector list along with the partition's FAT, to create a list of files with bad sectors in them, so at least I knew which files were affected. There are 8516 bad sectors, and 722 files affected.

But this week, I got Linux working on my new computer (upgraded not too long after the 75GXP went bad) and modified the IDE taskfile driver to allow me to use READ LONG on the bad sectors -- thus allowing me to salvage data from the bad sectors, while avoiding the nasty click-click-click and delay of retrying (I can now repeat reads of a bad sector about twice per second) and I can also get the 40 bytes of ECC data. Each read of one sector turns up different data, and by comparing them I can try to divine what the original was. That part I'm still working on (it'd help a lot to know what encoding method the drive uses - it's not RLL(2,7), which is the only one I've been able to get the details on).

But today I did a different kind of analysis, with VERY interesting results. I wrote a program to convert the list of bad sectors into a graphics file, using the data on zones and sectors per track found in IBM's specification. After some time and manipulation, I discovered that all the bad sectors are in a line going from the outer edge 1/3 of the way to the inner edge, on one platter surface! It's actually a spiral, because of the platter rotation. But this explains why all the sectors went bad at once. One of the heads must have executed a write cycle while seeking! I could even measure the seek speed from my bad sector data -- it's 4.475 ms/track! (assuming precisely 7200 rpm) And there are evenly spaced nodes along the line where larger chunks were corrupted -- starting 300 ms apart, gradually fading to where they actually are *less* corrupted than the line itself, at 750 ms apart.

I don't know if anyone else will find this interesting, but I found it fascinating, and it explained a lot. If you'd like to talk to me about the technical aspects of 75GXP failure, please email me at quSPAMLESSbitATinorNOSPAMbitDOTcom (remove the chunks of spam, change AT and DOT to their respective symbols).

For completeness, I should say that I had the drive for a year before it developed the rash of bad sectors. It's made in Hungary, SEP-2000.

I wasn't using it too heavily until I got an HDTV card, then I was recording HDTV onto the drive; this heavy usage might have helped it along to failure. (2.4 MB/sec sustained writing -- and it was quite noisy too.)

I updated the drive's firmware not too long after it developed the bad sectors; of course this didn't let me read them any better -- I didn't expect it to. I'm not sure if the firmware update will make the drive safe to use after a reformat, but I'll surely try it once I've recovered as much of the bad sectors as I can. Even if I still RMA the drive, I'd like to know.
------ end quote ------