fyi

Robert Milkowski wrote:
XXX wrote:
| Have you actually tried to roll-back to previous uberblocks when you
| hit the issue?  I'm asking as I haven't yet heard about any case
| of the issue which was not solved by rolling back to a previous
| uberblock. The problem though was that the way to do it was "hackish".

 Until recently I didn't even know that this was possible or a likely
solution to 'pool panics system on import' and similar pool destruction,
and I don't have any tools to do it. (Since we run Solaris 10, we won't
have official support for it for quite some time.)
I wouldn't be that surprised if this particular feature were actually backported to S10 soon. At least you may raise a CR asking for it - maybe you will get access to an IDR first (I'm not saying there is or isn't already one).

 If there are (public) tools for doing this, I will give them a try
the next time I get a test pool into this situation.

IIRC someone sent one to the zfs-discuss list some time ago.
Then usually you will also need to poke at things with zdb.
A sketchy and unsupported procedure was discussed on the list as well.
Look in the archives.
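For the curious, the inspection side of that poking looks roughly like this - a hedged sketch only, where 'tank' and the device name are placeholders; the actual rollback step (making an older uberblock win) was the hackish, unsupported part and is not shown:

  # dump the four vdev labels, which embed the uberblock arrays
  zdb -l /dev/rdsk/c1t0d0s0

  # print the currently active uberblock (txg, timestamp, guid_sum)
  zdb -u tank

  # for an exported pool: traverse all blocks and verify checksums
  zdb -e -bcsv tank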

| The bugs which prevented importing a pool in some circumstances were
| really "annoying" but lets face it - it was bound to happen and they
| are just bugs which are getting fixed. ZFS is still young after all.
| And when you google for data loss on other filesystems I'm sure you
| will find lots of user testimonies - be it ufs, ext3, reiserfs or your
| favourite one.

 The difference between ZFS and those other filesystems is that with
a few exceptions (XFS, ReiserFS), which sysadmins in the field didn't
like either, those filesystems didn't generally lose *all* your data
when something went wrong. Their official repair tools could usually
put things back together to at least some extent.
Generally they didn't, although I've seen situations where entire ext2 and ufs filesystems were lost and fsck was not able to get them even mountable (the kernel panicked right after mounting them). On another occasion fsck was crashing the box, and in yet another fsck claimed everything was OK but the system then crashed while doing a backup (fsck can't really properly fix filesystem state - it is more a matter of guessing, and sometimes the guess goes terribly wrong).

But I agree that generally, with other filesystems, you can recover most or all data just fine. And generally that is the case with zfs too - there were probably more bugs in ZFS as it is a much younger filesystem, but most of them were fixed very quickly. As for the uberblock one - I 100% agree that when you hit the issue and didn't know about the manual recovery method it was very bad - but it has finally been fixed.

(Just as importantly, when they couldn't put things back together you
could honestly tell management and the users 'we ran the recovery tools
and this is all they could get back'. At the moment, we would have
to tell users and management 'well, there are no (official) recovery
tools...', unless Sun Support came through for once.)

But these tools are built into zfs, run automatically, and give virtually 100% confidence that if something can be fixed it is fixed correctly, and that if something is wrong it will be detected - thanks to end-to-end checksumming of data and metadata. The problem *was* that the one scenario where rolling back to a previous uberblock is required was not implemented, and recovery required a complicated and undocumented procedure. It wasn't a high priority for Sun as it was very rare and didn't affect many enterprise customers, and although the procedure was complicated, it did exist and was used successfully on many occasions, even for non-paying customers, thanks to people like Victor on the zfs mailing list who helped users in such situations.
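(You can watch this machinery from the command line; a quick sketch, with 'tank' as a placeholder pool name:

  # per-vdev read/write/checksum error counters, plus the names
  # of any files with unrecoverable errors
  zpool status -v tank

  # reset the counters once you've reviewed them
  zpool clear tank

  # and if fletcher4 doesn't let you sleep, have new writes use sha256
  zfs set checksum=sha256 tank
)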

But you didn't know about it, and it seems Sun's support service was of no use to you - which is really a shame. In your case I would probably point that out to them and at least get a good deal as compensation or something...

But what is most important is that a fully supported, built-in and easy-to-use procedure is finally available to recover from such situations. As time progresses and more bugs are fixed, ZFS will behave much better in many corner cases, as it already does in Open Solaris - the last 6 months or so have been very productive in fixing many bugs like that.

| However the whole point of the discussion is that zfs really doesn't
| need a fsck tool.
| All the problems encountered so far were bugs and most of them are
| already fixed. One missing feature was a built-in support for
| rolling-back uberblock which just has been integrated. But I'm sure
| there are more bugs to be found..

 I disagree strongly. Fsck tools have multiple purposes; ZFS obsoletes
some of them but not all. One thing fsck is there for is to recover as
much as possible after things happen that are supposed to be impossible,
like operating system bugs or crazy corruption. ZFS's current attitude
is more or less that impossible things won't happen so it doesn't have
to do anything (except, perhaps, panic with assert failures).
This is not true - I will try to explain why.
Generally, to recover data from a filesystem you first need to get it into a state where you can mount it (at least read-only). When most legacy filesystems hit metadata that does not make sense to them, they won't allow you to mount the filesystem and will ask you to run fsck. Now, as there are no checksums in these filesystems, there is generally no accurate way of telling how the bad metadata should be fixed. Fsck looks for obvious problems and in many cases tries to "guess" - sometimes it is right and sometimes it is not. Sometimes it won't even detect that there was corruption at all. Also keep in mind that fsck in most filesystems does not even try to check user data - just metadata. The main reason is that it can't really do so. And because running fsck automatically (for example during system boot) could potentially be disastrous and lead to even more damage, it is started in an interactive mode, and if some less obvious fixes are required it asks a human to confirm its actions. But even then it is still just guessing what it is supposed to do. And sometimes the situation gets even worse.

Then sometimes there were bugs both in filesystems and in fsck, and the user was left with no access to data at all until those bugs were fixed (or the user was skilled enough to fix or work around them on his/her own). I came across such problems on an EMC IP4700, an EMC Celerra and a couple of other systems. For example, one fsck ran for well over 10 hours, consuming more and more memory, until the server ran out of memory and fsck died... and then it all started over again, and failed again... In another case fsck kept crashing during repair at the same location, and the filesystem crashed the OS a couple of minutes after being mounted.

The other problem with fsck is that even when it thinks a filesystem is OK, it actually might not be - not even its metadata. Then all kinds of things can happen: accessing a given file or directory panics the system, or more data gets corrupted... I was in such a situation a couple of times, and it took days to copy files from such a filesystem to another one, with many panics in between when we had to skip the offending files or directories, etc. fsck didn't help and reported that everything was fine.

Now with ZFS it is a completely different world. Thanks to its end-to-end checksumming, ZFS is able in virtually all cases to detect whether its metadata and data on disk are corrupted in any way. If someone is concerned about how strong the default checksum (fletcher4) is, one can currently switch zfs to use sha256 and sleep well. So here is the first big difference compared to most filesystems on the market - when some data is corrupted, ZFS does not have to *guess* whether that is the case, it can actually detect it with almost 100% confidence. Once such a case is detected, ZFS will try to fix the issue automatically if a redundant copy of the corrupted block is available - and if there is one, it all happens transparently to applications, without any need to unmount filesystems or run external tools like fsck. And because ZFS checksums both metadata and user data, it can detect and possibly fix corruption in both (which fsck can't do even when it is lucky).

Moreover, even if you are not doing any redundancy at the pool level, ZFS always keeps its metadata blocks in at least two copies, physically separated on disk if possible. What this means is that even in a single-disk configuration (or a stripe), if some data is corrupted, zfs will be able to detect it - and if it is a metadata block, not only detect it but also automatically and transparently fix it and preserve filesystem consistency.

There is a simple test you may run: create a pool on top of one disk drive, put some files in it, then overwrite, let's say, 20% of the disk with random data or zeros while zfs is running. Then flush caches (export/import the pool) and try to access all metadata by doing a full ls -lRa on the filesystem. You should get a full listing with proper attributes, etc., but if you check zpool status it will probably report many checksum errors, all corrected. (When overwriting, overwrite a portion of the beginning of the disk, as zfs usually starts writing to a disk from the beginning.) Now if you actually try to read file contents, a file is fine if you are lucky enough to pick one which was not overwritten; if you are unlucky, you won't be able to read the blocks which are corrupted (since you don't have any redundancy at the zfs level it can't fix its user data, only detect the damage), but you will still be able to read all the other blocks of the file. Now try to do something like this with any other filesystem - you will probably end up with an OS panic, and in many cases fsck won't be able to recover the filesystem to a point where you can get some data back... and while fixing, it will only be guessing what to do, and it will skip user data entirely...
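If you want to try the experiment above, a sketch of it looks like this. Assumptions: 'testpool' and disk c1t1d0 are placeholders, and the disk is a scratch disk whose contents you can afford to destroy; seeking past the first few MB leaves the front vdev labels intact (ZFS keeps two more labels at the end of the device anyway):

  zpool create testpool c1t1d0
  cp -r /usr/share/man /testpool        # put some files in the pool

  # corrupt part of the disk while the pool is online
  dd if=/dev/urandom of=/dev/rdsk/c1t1d0s0 bs=1024k seek=10 count=100

  zpool export testpool                 # flush caches
  zpool import testpool

  ls -lRa /testpool > /dev/null         # walk all the metadata
  zpool status -v testpool              # corrected/uncorrectable errors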

Now there is a specific variant of the above: when the metadata describing the pool itself, or its root block, is corrupted and can't be fixed because all copies are wrong. ZFS can detect this too, but the extra functionality to actually fall back to the N-1 root block in such a case was not implemented until very recently. This was very unfortunate, but because the case was very rare in the field, and resources are limited as usual, it wasn't implemented - instead there was an undocumented, unsupported and hard-to-follow procedure for doing it manually, and some people did use it successfully (check the zfs-discuss archives). Of course it shouldn't be like that, and the ZFS developers recognized this by accepting a bug report on it. But limited resources... Fortunately, a built-in mechanism to deal with this case has finally been implemented. So now, when it happens, the user has the option of importing the pool with an extra flag that rolls back to a previous txg, so the pool can be imported - and from then on all the mechanisms described above kick in. And again: no guessing here, but a guarantee of detecting corruption and fixing it where possible. And you don't even have to run a checker and wait hours, sometimes days, on large filesystems with millions of files before you can access your data (still not being sure what exactly you're accessing and whether it will cause further issues). Of course it would probably be wise to run zpool scrub at a convenient time, to force reading all data and metadata, verify their checksums and fix what is fixable - but in the meantime you may run your applications, and any corruption will be detected and fixed as the data is accessed.
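In recent Open Solaris builds this surfaced as a recovery mode on import - roughly like the sketch below ('tank' is a placeholder; check your release's zpool(1M) man page, as the exact flags may differ):

  # dry run: report whether discarding the last few transactions
  # would make the pool importable, without modifying anything
  zpool import -F -n tank

  # actually rewind to the last good txg and import
  zpool import -F tank

  # then verify the rest of the pool at your convenience
  zpool scrub tank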

So from a practical point of view you may think of these mechanisms in ZFS as a built-in fsck with the ability to actually detect when corruption happens (instead of just guessing, and not just for metadata) and to fix it when a redundant copy is available (transparently to applications). Having a separate tool doesn't really make sense here. Of course you can always write a script called fsck.zfs which imports a pool and runs zpool scrub if you want - and sometimes people will do exactly that before going back into production. But a genuine extra tool like fsck doesn't really make sense - what exactly should such a tool do, keeping in mind all the above? (A toy version of such a script is sketched below.)
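Just to illustrate how thin such a wrapper would be, here is a toy fsck.zfs along those lines - a sketch, not an official tool, and the "scrub in progress" status text may vary between releases:

  #!/bin/sh
  # usage: fsck.zfs <pool>
  pool="$1"

  # import the pool (zfs rolls forward to a consistent txg by design)
  zpool import "$pool" || exit 1

  # re-read and verify every block in the pool
  zpool scrub "$pool"

  # wait for the scrub to finish, then show the result
  while zpool status "$pool" | grep -q "scrub in progress"; do
      sleep 60
  done
  zpool status -v "$pool"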

Then there were a couple of bugs which prevented ZFS from importing a pool with specific corruptions that were entirely fixable (AFAIK all known ones are fixed in Open Solaris). When you think about it - we are talking about bugs here - if you put all the recovery mechanisms into a separate tool called fsck, with the same bugs, it wouldn't be able to repair such a pool anyway, would it? So you would need to fix those bugs first - but once you fixed them, zfs would be able to import such a pool by itself, and an external tool is still not needed (or, after applying the patch/fix, do alias fsck='zpool import' and then 'fsck pool' will get your pool fixed... :) You might ask: what are you supposed to do until such a bug is fixed? Well, what would you do if you couldn't mount an ext2 filesystem (or any other) and there was a bug in its fsck which prevented it from getting the fs into a mountable state? You would have to wait for a fix, or fix it yourself, or play with the on-disk format with tools like debugfs, fsdb, ... and try to repair the filesystem manually. Well, on zfs you also have zdb...
Or you would probably be forced to recover the data from backup.

The point here is that most filesystems and their tools have had such bugs, and zfs is one of the youngest filesystems on the market, so in a way it is no wonder that such bugs are being fixed now and not 5-7 years ago. A critical mass of users is required before a given filesystem gets deployed in many different environments - different workloads, hardware, drivers, usage cases, ... - so that all these corner cases can surface, users (hopefully) report them, and they get fixed. ZFS has only become widely deployed over the last couple of years, so no wonder most of these bugs were spotted (and fixed) during the same period.

But then, thanks to its fundamentally different architecture, once most (all? :)) of these bugs are fixed, ZFS offers something MUCH better than legacy filesystems + fsck: a guarantee of detecting data corruption and fixing it properly when possible, while reporting what can't be fixed and still providing access to all the other data in your pool.


btw: this email exchange is private, so I don't want to include zfs-discuss without your consent - but if you want to forward this email to zfs-discuss for other users' benefit, feel free to do so.


As the evolution of ZFS has demonstrated, impossible things *do* happen
and you *do* need the ability to recover as much as possible.  ZFS is
busy slapping bandaids over specific problems instead of dealing with
the general issue.
Just a quick "google" and:

1. fsck fails and causes panic of Linux kernel
https://bugzilla.redhat.com/show_bug.cgi?id=126238

2. btrfs - a filesystem got corrupted, and running btrfsck caused even more damage; the entire filesystem was nuked due to a bug. Btrfs is not the best example as it is far from production-ready, but still...

https://bugzilla.redhat.com/show_bug.cgi?id=497821

3. linux gfs2 - fsck has a bug (or lacks a feature) and is not able to fix the filesystem with a specific corruption, and the filesystem is unmountable. The only option is to manually fix data on-disk with help from a support service, on a case-by-case basis...

https://bugzilla.redhat.com/show_bug.cgi?id=457557

4. e2fsck segfaults + dumps core when trying to check a filesystem

https://bugzilla.redhat.com/show_bug.cgi?id=108075

5. an ext3 filesystem crashes - fsck can't repair it and goes into an infinite loop... fixed in a development version of fsck

https://bugzilla.redhat.com/show_bug.cgi?id=467677

6. gfs2 corruption causes a linux kernel panic... fsck says it fixed the issue but it didn't, and the system crashes all over again under load...

https://bugzilla.redhat.com/show_bug.cgi?id=519049

7. an ext3 filesystem can't be mounted and fsck won't finish after 10 days of running (probably some kind of infinite-loop bug again)

http://ubuntuforums.org/archive/index.php/t-394744.html

8. AIX JFS2 filesystem corruption - due to a bug in fsck it can't fix the fs; data had to be recovered from backup

http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/error-518-file-system-corruption-366503


9.

https://bugzilla.redhat.com/show_bug.cgi?id=514511
https://bugzilla.redhat.com/show_bug.cgi?id=477856


And there are many more...

The point, again, is that bugs happen even in fsck, and until they are fixed a common user/sysadmin quite often won't be able to recover on their own. ZFS is no exception here when it comes to bugs. But thanks to its different approach (mostly end-to-end checksumming + COW), its ability to detect data corruption and deal with it exceeds most generally available solutions on the market. The fixes for some of the bugs mentioned above only make it more robust and reliable, even for those previously unlucky users... :)



