fyi

Robert Milkowski wrote:
XXX wrote:
| Have you actually tried to roll-back to previous uberblocks when you
| hit the issue?  I'm asking as I haven't yet heard about any case
| of the issue which was not solved by rolling back to a previous
| uberblock. The problem though was that the way to do it was "hackish".

 Until recently I didn't even know that this was possible or a likely
solution to 'pool panics system on import' and similar pool destruction,
and I don't have any tools to do it. (Since we run Solaris 10, we won't
have official support for it for quite some time.)
I wouldn't be that surprised if this particular feature were actually backported to S10 soon. At least you may raise a CR asking for it - maybe you will get access to an IDR first (I'm not saying there is or isn't already one).

 If there are (public) tools for doing this, I will give them a try
the next time I get a test pool into this situation.

IIRC someone sent one to the zfs-discuss list some time ago.
Then usually you will also need to poke at things with zdb.
A sketchy and unsupported procedure was discussed on the list as well.
Look in the archives.
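For the curious, the inspection side of that poking looks roughly like this - a hedged sketch only, where 'tank' and the device name are placeholders; the actual rollback step (making an older uberblock win) was the hackish, unsupported part and is not shown:

  # dump the four vdev labels, which embed the uberblock arrays
  zdb -l /dev/rdsk/c1t0d0s0

  # print the currently active uberblock (txg, timestamp, guid_sum)
  zdb -u tank

  # for an exported pool: traverse all blocks and verify checksums
  zdb -e -bcsv tank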

| The bugs which prevented importing a pool in some circumstances were
| really "annoying" but lets face it - it was bound to happen and they
| are just bugs which are getting fixed. ZFS is still young after all.
| And when you google for data loss on other filesystems I'm sure you
| will find lots of user testimonies - be it ufs, ext3, reiserfs or your
| favourite one.

 The difference between ZFS and those other filesystems is that with
a few exceptions (XFS, ReiserFS), which sysadmins in the field didn't
like either, those filesystems didn't generally lose *all* your data
when something went wrong. Their official repair tools could usually
put things back together to at least some extent.
Generally they didn't, although I've seen situations where entire ext2 and ufs filesystems were lost and fsck was not able to get them even mountable (the kernel panicked right after mounting them). On another occasion fsck was crashing the box, and in yet another fsck claimed everything was OK but the system then crashed while doing a backup (fsck can't really properly fix filesystem state - it is more a matter of guessing, and sometimes the guess goes terribly wrong).

But I agree that generally, with other filesystems, you can recover most or all data just fine. And generally that is the case with zfs too - there were probably more bugs in ZFS as it is a much younger filesystem, but most of them were fixed very quickly. As for the uberblock one - I 100% agree that when you hit the issue and didn't know about the manual recovery method it was very bad - but it has finally been fixed.

(Just as importantly, when they couldn't put things back together you
could honestly tell management and the users 'we ran the recovery tools
and this is all they could get back'. At the moment, we would have
to tell users and management 'well, there are no (official) recovery
tools...', unless Sun Support came through for once.)

But these tools are built into zfs, run automatically, and give virtually 100% confidence that if something can be fixed it is fixed correctly, and that if something is wrong it will be detected - thanks to end-to-end checksumming of data and metadata. The problem *was* that the one scenario where rolling back to a previous uberblock is required was not implemented, and recovery required a complicated and undocumented procedure. It wasn't a high priority for Sun as it was very rare and didn't affect many enterprise customers, and although the procedure was complicated, it did exist and was used successfully on many occasions, even for non-paying customers, thanks to people like Victor on the zfs mailing list who helped users in such situations.
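(You can watch this machinery from the command line; a quick sketch, with 'tank' as a placeholder pool name:

  # per-vdev read/write/checksum error counters, plus the names
  # of any files with unrecoverable errors
  zpool status -v tank

  # reset the counters once you've reviewed them
  zpool clear tank

  # and if fletcher4 doesn't let you sleep, have new writes use sha256
  zfs set checksum=sha256 tank
)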

But you didn't know about it, and it seems Sun's support service was of no use to you - which is really a shame. In your case I would probably point that out to them and at least get a good deal as compensation or something...

But what is most important is that a fully supported, built-in and easy-to-use procedure is finally available to recover from such situations. As time progresses and more bugs are fixed, ZFS will behave much better in many corner cases, as it already does in Open Solaris - the last 6 months or so have been very productive in fixing many bugs like that.

| However the whole point of the discussion is that zfs really doesn't
| need a fsck tool.
| All the problems encountered so far were bugs and most of them are
| already fixed. One missing feature was a built-in support for
| rolling-back uberblock which just has been integrated. But I'm sure
| there are more bugs to be found..

 I disagree strongly. Fsck tools have multiple purposes; ZFS obsoletes
some of them but not all. One thing fsck is there for is to recover as
much as possible after things happen that are supposed to be impossible,
like operating system bugs or crazy corruption. ZFS's current attitude
is more or less that impossible things won't happen so it doesn't have
to do anything (except, perhaps, panic with assert failures).
This is not true - I will try to explain why.
Generally, to recover data from a filesystem you first need to get it into a state where you can mount it (at least read-only). When most legacy filesystems hit metadata that does not make sense to them, they won't allow you to mount the filesystem and will ask you to run fsck. Now, as there are no checksums in these filesystems, there is generally no accurate way of telling how the bad metadata should be fixed. Fsck looks for obvious problems and in many cases tries to "guess" - sometimes it is right and sometimes it is not. Sometimes it won't even detect that there was corruption at all. Also keep in mind that fsck in most filesystems does not even try to check user data - just metadata. The main reason is that it can't really do so. And because running fsck automatically (for example during system boot) could potentially be disastrous and lead to even more damage, it is started in an interactive mode, and if some less obvious fixes are required it asks a human to confirm its actions. But even then it is still just guessing what it is supposed to do. And sometimes the situation gets even worse.

Then sometimes there were bugs both in filesystems and in fsck, and the user was left with no access to data at all until those bugs were fixed (or the user was skilled enough to fix or work around them on his/her own). I came across such problems on an EMC IP4700, an EMC Celerra and a couple of other systems. For example, one fsck ran for well over 10 hours, consuming more and more memory, until the server ran out of memory and fsck died... and then it all started over again, and failed again... In another case fsck kept crashing during repair at the same location, and the filesystem crashed the OS a couple of minutes after being mounted.

The other problem with fsck is that even when it thinks a filesystem is OK, it actually might not be - not even its metadata. Then all kinds of things can happen: accessing a given file or directory panics the system, or more data gets corrupted... I was in such a situation a couple of times, and it took days to copy files from such a filesystem to another one, with many panics in between when we had to skip the offending files or directories, etc. fsck didn't help and reported that everything was fine.

Now with ZFS it is a completely different world. Thanks to its end-to-end checksumming, ZFS is able in virtually all cases to detect whether its metadata and data on disk are corrupted in any way. If someone is concerned about how strong the default checksum (fletcher4) is, one can currently switch zfs to use sha256 and sleep well. So here is the first big difference compared to most filesystems on the market - when some data is corrupted, ZFS does not have to *guess* whether that is the case, it can actually detect it with almost 100% confidence. Once such a case is detected, ZFS will try to fix the issue automatically if a redundant copy of the corrupted block is available - and if there is one, it all happens transparently to applications, without any need to unmount filesystems or run external tools like fsck. And because ZFS checksums both metadata and user data, it can detect and possibly fix corruption in both (which fsck can't do even when it is lucky).

Moreover, even if you are not doing any redundancy at the pool level, ZFS always keeps its metadata blocks in at least two copies, physically separated on disk if possible. What this means is that even in a single-disk configuration (or a stripe), if some data is corrupted, zfs will be able to detect it - and if it is a metadata block, not only detect it but also automatically and transparently fix it and preserve filesystem consistency.

There is a simple test you may run: create a pool on top of one disk drive, put some files in it, then overwrite, let's say, 20% of the disk with random data or zeros while zfs is running. Then flush caches (export/import the pool) and try to access all metadata by doing a full ls -lRa on the filesystem. You should get a full listing with proper attributes, etc., but if you check zpool status it will probably report many checksum errors, all corrected. (When overwriting, overwrite a portion of the beginning of the disk, as zfs usually starts writing to a disk from the beginning.) Now if you actually try to read file contents, a file is fine if you are lucky enough to pick one which was not overwritten; if you are unlucky, you won't be able to read the blocks which are corrupted (since you don't have any redundancy at the zfs level it can't fix its user data, only detect the damage), but you will still be able to read all the other blocks of the file. Now try to do something like this with any other filesystem - you will probably end up with an OS panic, and in many cases fsck won't be able to recover the filesystem to a point where you can get some data back... and while fixing, it will only be guessing what to do, and it will skip user data entirely...
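If you want to try the experiment above, a sketch of it looks like this. Assumptions: 'testpool' and disk c1t1d0 are placeholders, and the disk is a scratch disk whose contents you can afford to destroy; seeking past the first few MB leaves the front vdev labels intact (ZFS keeps two more labels at the end of the device anyway):

  zpool create testpool c1t1d0
  cp -r /usr/share/man /testpool        # put some files in the pool

  # corrupt part of the disk while the pool is online
  dd if=/dev/urandom of=/dev/rdsk/c1t1d0s0 bs=1024k seek=10 count=100

  zpool export testpool                 # flush caches
  zpool import testpool

  ls -lRa /testpool > /dev/null         # walk all the metadata
  zpool status -v testpool              # corrected/uncorrectable errors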

Now there is a specific variant of the above: when the metadata describing the pool itself, or its root block, is corrupted and can't be fixed because all copies are wrong. ZFS can detect this too, but the extra functionality to actually fall back to the N-1 root block in such a case was not implemented until very recently. This was very unfortunate, but because the case was very rare in the field, and resources are limited as usual, it wasn't implemented - instead there was an undocumented, unsupported and hard-to-follow procedure for doing it manually, and some people did use it successfully (check the zfs-discuss archives). Of course it shouldn't be like that, and the ZFS developers recognized this by accepting a bug report on it. But limited resources... Fortunately, a built-in mechanism to deal with this case has finally been implemented. So now, when it happens, the user has the option of importing the pool with an extra flag that rolls back to a previous txg, so the pool can be imported - and from then on all the mechanisms described above kick in. And again: no guessing here, but a guarantee of detecting corruption and fixing it where possible. And you don't even have to run a checker and wait hours, sometimes days, on large filesystems with millions of files before you can access your data (still not being sure what exactly you're accessing and whether it will cause further issues). Of course it would probably be wise to run zpool scrub at a convenient time, to force reading all data and metadata, verify their checksums and fix what is fixable - but in the meantime you may run your applications, and any corruption will be detected and fixed as the data is accessed.
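In recent Open Solaris builds this surfaced as a recovery mode on import - roughly like the sketch below ('tank' is a placeholder; check your release's zpool(1M) man page, as the exact flags may differ):

  # dry run: report whether discarding the last few transactions
  # would make the pool importable, without modifying anything
  zpool import -F -n tank

  # actually rewind to the last good txg and import
  zpool import -F tank

  # then verify the rest of the pool at your convenience
  zpool scrub tank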

So from a practical point of view you may think of these mechanisms in ZFS as a built-in fsck with the ability to actually detect when corruption happens (instead of just guessing, and not just for metadata) and to fix it when a redundant copy is available (transparently to applications). Having a separate tool doesn't really make sense here. Of course you can always write a script called fsck.zfs which imports a pool and runs zpool scrub if you want - and sometimes people will do exactly that before going back into production. But a genuine extra tool like fsck doesn't really make sense - what exactly should such a tool do, keeping in mind all the above? (A toy version of such a script is sketched below.)
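Just to illustrate how thin such a wrapper would be, here is a toy fsck.zfs along those lines - a sketch, not an official tool, and the "scrub in progress" status text may vary between releases:

  #!/bin/sh
  # usage: fsck.zfs <pool>
  pool="$1"

  # import the pool (zfs rolls forward to a consistent txg by design)
  zpool import "$pool" || exit 1

  # re-read and verify every block in the pool
  zpool scrub "$pool"

  # wait for the scrub to finish, then show the result
  while zpool status "$pool" | grep -q "scrub in progress"; do
      sleep 60
  done
  zpool status -v "$pool"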

Then there were a couple of bugs which prevented ZFS from importing a pool with specific corruptions that were entirely fixable (AFAIK all known ones are fixed in Open Solaris). When you think about it - we are talking about bugs here - if you put all the recovery mechanisms into a separate tool called fsck, with the same bugs, it wouldn't be able to repair such a pool anyway, would it? So you would need to fix those bugs first - but once you fixed them, zfs would be able to import such a pool by itself, and an external tool is still not needed (or, after applying the patch/fix, do alias fsck='zpool import' and then 'fsck pool' will get your pool fixed... :) You might ask: what are you supposed to do until such a bug is fixed? Well, what would you do if you couldn't mount an ext2 filesystem (or any other) and there was a bug in its fsck which prevented it from getting the fs into a mountable state? You would have to wait for a fix, or fix it yourself, or play with the on-disk format with tools like debugfs, fsdb, ... and try to repair the filesystem manually. Well, on zfs you also have zdb...
Or you would probably be forced to recover the data from backup.

The point here is that most filesystems and their tools have had such bugs, and zfs is one of the youngest filesystems on the market, so in a way it is no wonder that such bugs are being fixed now and not 5-7 years ago. A critical mass of users is required before a given filesystem gets deployed in many different environments - different workloads, hardware, drivers, usage cases, ... - so that all these corner cases can surface, users (hopefully) report them, and they get fixed. ZFS has only become widely deployed over the last couple of years, so no wonder most of these bugs were spotted (and fixed) during the same period.

But then, thanks to its fundamentally different architecture, once most (all? :)) of these bugs are fixed, ZFS offers something MUCH better than legacy filesystems + fsck: a guarantee of detecting data corruption and fixing it properly when possible, while reporting what can't be fixed and still providing access to all the other data in your pool.


btw: this email exchange is private, so I don't want to include zfs-discuss without your consent - but if you want to forward this email to zfs-discuss for other users' benefit, feel free to do so.


As the evolution of ZFS has demonstrated, impossible things *do* happen
and you *do* need the ability to recover as much as possible.  ZFS is
busy slapping bandaids over specific problems instead of dealing with
the general issue.
Just a quick "google" and:

1. fsck fails and causes panic of Linux kernel
https://bugzilla.redhat.com/show_bug.cgi?id=126238

2. btrfs - a filesystem got corrupted, and running btrfsck caused even more damage; the entire filesystem was nuked due to a bug. Btrfs is not the best example as it is far from production-ready, but still...

https://bugzilla.redhat.com/show_bug.cgi?id=497821

3. linux gfs2 - fsck has a bug (or lacks a feature) and is not able to fix the filesystem with a specific corruption, and the filesystem is unmountable. The only option is to manually fix data on-disk with help from a support service, on a case-by-case basis...

https://bugzilla.redhat.com/show_bug.cgi?id=457557

4. e2fsck segfaults + dumps core when trying to check a filesystem

https://bugzilla.redhat.com/show_bug.cgi?id=108075

5. an ext3 filesystem crashes - fsck can't repair it and goes into an infinite loop... fixed in a development version of fsck

https://bugzilla.redhat.com/show_bug.cgi?id=467677

6. gfs2 corruption causes a linux kernel panic... fsck says it fixed the issue but it didn't, and the system crashes all over again under load...

https://bugzilla.redhat.com/show_bug.cgi?id=519049

7. an ext3 filesystem can't be mounted and fsck won't finish after 10 days of running (probably some kind of infinite-loop bug again)

http://ubuntuforums.org/archive/index.php/t-394744.html

8. AIX JFS2 filesystem corruption - due to a bug in fsck it can't fix the fs; data had to be recovered from backup

http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/error-518-file-system-corruption-366503


9.

https://bugzilla.redhat.com/show_bug.cgi?id=514511
https://bugzilla.redhat.com/show_bug.cgi?id=477856


And there are many more...

The point, again, is that bugs happen even in fsck, and until they are fixed a common user/sysadmin quite often won't be able to recover on their own. ZFS is no exception here when it comes to bugs. But thanks to its different approach (mostly end-to-end checksumming + COW), its ability to detect data corruption and deal with it exceeds most generally available solutions on the market. The fixes for some of the bugs mentioned above only make it more robust and reliable, even for those previously unlucky users... :)



