XXX wrote:
| Have you actually tried to roll-back to previous uberblocks when you
| hit the issue? I'm asking as I haven't yet heard about any case
| of the issue which was not solved by rolling back to a previous
| uberblock. The problem though was that the way to do it was "hackish".
Until recently I didn't even know that this was possible or a likely
solution to 'pool panics system on import' and similar pool destruction,
and I don't have any tools to do it. (Since we run Solaris 10, we won't
have official support for it for quite some time.)
I wouldn't be that surprised if this particular feature were actually
backported to S10 soon. At least you may raise a CR asking for it -
maybe you will get access to an IDR first (I'm not saying there is or
isn't one already).
If there are (public) tools for doing this, I will give them a try
the next time I get a test pool into this situation.
IIRC someone sent one to the zfs-discuss list some time ago.
Then you will usually also need to poke around with zdb.
A sketchy and unsupported procedure was discussed on the list as well.
Look at the archives.
| The bugs which prevented importing a pool in some circumstances were
| really "annoying" but let's face it - it was bound to happen and they
| are just bugs which are getting fixed. ZFS is still young after all.
| And when you google for data loss on other filesystems I'm sure you
| will find lots of user testimonies - be it ufs, ext3, reiserfs or your
| favourite one.
The difference between ZFS and those other filesystems is that with
a few exceptions (XFS, ReiserFS), which sysadmins in the field didn't
like either, those filesystems didn't generally lose *all* your data
when something went wrong. Their official repair tools could usually
put things back together to at least some extent.
Generally they didn't, although I've seen situations where entire ext2
and ufs filesystems were lost and fsck was not able to get them even
mountable (the kernel panicked right after mounting them). On another
occasion fsck was crashing the box, and in yet another fsck claimed
everything was ok, but then the system crashed during backup (fsck
can't really properly fix filesystem state - it is more guessing, and
sometimes it goes terribly wrong).
But I agree that, generally, with other filesystems you can recover
most or all data just fine.
And generally that is the case with ZFS too - there were probably more
bugs in ZFS, as it is a much younger filesystem, but most of them were
fixed very quickly. And the uberblock one - I 100% agree that when you
hit the issue and didn't know about the manual recovery method it was
very bad - but it has finally been fixed.
(Just as importantly, when they couldn't put things back together you
could honestly tell management and the users 'we ran the recovery tools
and this is all they could get back'. At the moment, we would have
to tell users and management 'well, there are no (official) recovery
tools...', unless Sun Support came through for once.)
But these tools are built into ZFS and run automatically, with
virtually 100% confidence that if something can be fixed it is fixed
correctly, and that if something is wrong it will be detected -
thanks to end-to-end checksumming of data and metadata. The problem
*was* that the one scenario where rolling back to a previous
uberblock is required was not implemented, and a complicated,
undocumented procedure had to be followed instead. It wasn't a high
priority for Sun, as it was very rare and wasn't affecting many
enterprise customers - and complicated as the procedure is, it does
exist and was used successfully on many occasions, even for
non-paying customers, thanks to guys like Victor on the zfs mailing
list who helped some people in such situations.
But you didn't know about it, and it seems Sun's support service was
of no use to you - which is really a shame.
In your case I would probably point that out to them and at least get
some good deal as compensation or something...
But what is most important is that a fully supported, built-in and
easy-to-use procedure is finally available to recover from such
situations. As time goes on and more bugs are fixed, ZFS will behave
much better in many corner cases, as it already does in OpenSolaris -
the last 6 months or so have been really productive in fixing many
bugs like that.
| However the whole point of the discussion is that zfs really doesn't
| need a fsck tool.
| All the problems encountered so far were bugs and most of them are
| already fixed. One missing feature was built-in support for
| rolling back the uberblock, which has just been integrated. But I'm
| sure there are more bugs to be found..
I disagree strongly. Fsck tools have multiple purposes; ZFS obsoletes
some of them but not all. One thing fsck is there for is to recover as
much as possible after things happen that are supposed to be impossible,
like operating system bugs or crazy corruption. ZFS's current attitude
is more or less that impossible things won't happen so it doesn't have
to do anything (except, perhaps, panic with assert failures).
This is not true - I will try to explain why.
Generally, if you want to recover some data from a filesystem, you
need to get it into a state where you can mount it (at least
read-only). When most legacy filesystems hit a problem where the
metadata doesn't make sense to them, they won't let you mount the
filesystem and will ask you to run fsck. Now, as there are no
checksums in these filesystems, there is generally no accurate way of
telling how the bad metadata should be fixed. Fsck looks for obvious
problems and in many cases has to "guess" - sometimes it is right and
sometimes it is not. And sometimes it won't even detect that there
was corruption. Also keep in mind that fsck in most filesystems does
not even try to check user data - just metadata. The main reason is
that it can't really do it.
Now, because running fsck could potentially be disastrous to a
filesystem and lead to even more damage if started automatically (for
example during system boot), it is started in an interactive mode,
and if some less obvious fixes are required it asks a human to
confirm its actions. But even then it is still just guessing what it
is supposed to do. And it happens that the situation gets even worse.
Then, sometimes there were bugs both in filesystems and in fsck, and
the user was left with no access to data at all until those bugs were
fixed (or the user was skilled enough to fix or work around them on
his/her own). I came across such problems on an EMC IP4700, an EMC
Celerra and a couple of other systems. For example, fsck ran for well
over 10 hours, consuming more and more memory, until the server ran
out of memory and fsck died... and then it all started over again,
and failed again... In another case fsck kept crashing during repair
at the same location, and in yet another the filesystem crashed the
OS a couple of minutes after being mounted.
The other problem with fsck is that even if it thinks the filesystem
is ok, it actually might not be - not even its metadata. Then all
sorts of things can happen - the system panics when a given file or
directory is accessed, or more data gets corrupted... I was in such a
situation a couple of times, and it took days to copy files from such
a filesystem to another one, with many panics in between, where we
had to skip the offending files or directories, etc. fsck didn't help
and reported that everything was fine.
Now, with ZFS it is a completely different world. ZFS is able, in
virtually all cases, to detect whether its metadata and data on disk
are corrupted in any way, thanks to its end-to-end checksumming. If
someone is concerned about how strong the default checksum
(fletcher4) is, one can currently switch ZFS to use sha256 and sleep
well. So here is the first big difference compared to most
filesystems on the market - ZFS, when some data is corrupted, does
not have to *guess* whether that is the case, but can actually detect
it with almost 100% confidence.
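The "no guessing" property comes down to comparing a stored checksum
against a freshly computed one on every read. A toy sketch of the
idea, using plain sha256sum on an ordinary file standing in for a
block (nothing here is ZFS-specific):

```shell
# Record a "block" checksum, silently corrupt one byte, then verify.
tmp=$(mktemp -d)
printf 'some important block of data' > "$tmp/block"
before=$(sha256sum "$tmp/block" | awk '{print $1}')

# Flip a single byte in place, as silent on-disk corruption would:
printf 'X' | dd of="$tmp/block" bs=1 seek=5 conv=notrunc 2>/dev/null

after=$(sha256sum "$tmp/block" | awk '{print $1}')
if [ "$before" != "$after" ]; then
    echo "corruption detected"   # no heuristics involved
fi
rm -rf "$tmp"
```

In real ZFS the checksum lives in the parent block pointer rather
than next to the data, which is what makes the check end-to-end; and
switching a dataset to sha256 is just 'zfs set checksum=sha256
tank/fs' (pool/dataset names made up here).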
Once such a case is detected, ZFS will try to automatically fix the
issue if a redundant copy of the corrupted block is available - if
there is, it will all happen transparently to applications, without
any need to unmount filesystems or run external tools like fsck. And
because ZFS checksums both metadata and user data, it is able to
detect and possibly fix corruption in both (which fsck can't, even if
it is lucky). Moreover, even if you are not doing any redundancy at
the pool level, ZFS always keeps its metadata blocks in at least two
copies, physically separated on disk if possible. What this means is
that even in a single-disk configuration (or a stripe), if some data
is corrupted, ZFS will be able to detect it - and if it is a metadata
block, it will be able not only to detect it but also to
automatically and transparently fix it and preserve filesystem
consistency. There is a simple test you may run - create a pool on
top of one disk drive, put some files on it, then overwrite, say, 20%
of the disk with random data or zeros while ZFS is running. (When
overwriting, target a portion near the beginning of the disk, as ZFS
will usually start writing to a disk from the beginning.) Then flush
caches (export/import the pool) and try to access all metadata by
doing a full 'ls -lRa' on the filesystem. You should get a full
listing with proper attributes, etc., but if you check 'zpool status'
it will probably report many checksum errors which were corrected.
Now, if you actually try to read file contents: if you are lucky
enough to hit blocks which were not overwritten, reading will be
fine; if you are unlucky, you won't be able to read the corrupted
blocks (since you have no redundancy at the ZFS level it can't fix
its user data, only detect the damage), but you will still be able to
read all the other blocks of the file. Now try something like this
with any other filesystem - you will probably end up with an OS
panic, and in many cases fsck won't be able to recover the filesystem
to a point where you can get any data back... and when fixing it, it
will only be guessing what to do, and will skip user data entirely.
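The destructive experiment described above can be scripted end to end
against a file-backed pool instead of a real disk. A rough,
illustrative transcript (the pool name 'tank', the vdev path and the
sizes are all made up; it needs root and a system with ZFS, so treat
it as a sketch rather than something to paste in):

```shell
# Build a single-"disk" pool backed by a plain file and fill it:
mkfile 256m /var/tmp/vdev0            # on Linux: truncate -s 256m ...
zpool create tank /var/tmp/vdev0
cp -r /etc /tank/data

# Export to flush caches, then clobber a chunk near the front of the
# "disk" (skipping the vdev labels ZFS keeps at the very start):
zpool export tank
dd if=/dev/urandom of=/var/tmp/vdev0 bs=1024k seek=1 count=32 conv=notrunc

# Re-import and walk every piece of metadata:
zpool import -d /var/tmp tank
ls -lRa /tank > /dev/null
zpool status -v tank                  # expect repaired CKSUM errors
```

The listing should succeed because the damaged metadata blocks have a
second (ditto) copy elsewhere on the vdev; reading file contents that
landed in the overwritten region should fail with checksum errors
rather than return garbage.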
Now, there is a specific scenario of the above: the corrupted
metadata describes the pool itself (its root block), and it can't be
fixed because all copies are bad. ZFS can detect this too, but until
very recently the extra functionality to actually fall back to the
N-1 rootblock in such a case was not implemented. This was very
unfortunate, but because the case was very rare in the field and
resources are limited as usual, it wasn't implemented - instead there
was an undocumented, unsupported and hard-to-follow procedure for
doing it manually, and some people did use it successfully (check the
zfs-discuss archives). Of course it shouldn't be like that, and the
ZFS developers recognized this by accepting a bug report on it. But
limited resources... Fortunately, a built-in mechanism to deal with
such a case has finally been implemented. So now, when it happens,
the user has the choice of importing the pool with an extra option to
roll back to a previous txg so the pool can be imported. From then
on, all the mechanisms described above kick in. And again - no
guessing here, but a guarantee of detecting corruption and fixing it
if possible. And you don't even have to run a check and wait hours,
sometimes days, on large filesystems with millions of files before
you can access your data (while still not being sure what exactly
you're accessing and whether it will cause further issues). Of
course, it would probably be wise to run 'zpool scrub' at a time
convenient for you, to force reading all data and metadata, verify
their checksums and fix what is possible - but in the meantime you
can run your applications, and any corruption will be detected and
fixed as the data is accessed.
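The rollback-on-import flow just described can be sketched in a
couple of commands ('tank' is a placeholder pool name; -F is the
rewind option that the recently integrated mechanism added to 'zpool
import' - check the man page of your release):

```shell
# A normal import fails because the active uberblock is damaged:
zpool import tank                 # reports corrupted pool metadata

# Rewind to the newest importable txg, discarding the last few
# seconds of transactions, then verify everything in the background:
zpool import -F tank
zpool scrub tank                  # re-reads and re-checksums all blocks
zpool status -v tank              # lists anything left unrepairable
```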
So, from a practical point of view, you may think of these mechanisms
in ZFS as a built-in fsck with the ability to actually detect when
corruption happens (instead of just guessing, and not just for
metadata) and to get it fixed, transparently to applications, if a
redundant copy is available. Having a separate tool doesn't really
make sense here. Of course, you can always write a script called
fsck.zfs which imports a pool and runs 'zpool scrub' if you want. And
sometimes people will do exactly that before going back into
production. But a genuine extra tool like fsck doesn't really make
sense - what exactly should such a tool do (keeping in mind all of
the above)?
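For completeness, the fsck.zfs script mentioned above could be as
small as the following sketch (the name and behaviour are assumed
here; it is not an existing tool):

```shell
#!/bin/sh
# fsck.zfs - illustrative wrapper: "checking" a pool is just importing
# it (validation happens on import) and scrubbing it (a full read and
# checksum pass over all data and metadata).
pool="$1"
[ -n "$pool" ] || { echo "usage: fsck.zfs <pool>" >&2; exit 2; }

zpool import "$pool" || exit 1
zpool scrub "$pool"
zpool status -v "$pool"    # anything unrepairable is listed here
```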
Then there were a couple of bugs which prevented ZFS from importing a
pool with some specific corruptions which were entirely fixable
(AFAIK all known ones are fixed in OpenSolaris). When you think about
it - we are talking about bugs here - if you put all the recovery
mechanisms into a separate tool called fsck, with the same bugs in
it, it wouldn't be able to repair such a pool anyway, would it? So
you would need to fix those bugs first - but once you fixed them, ZFS
would be able to mount such a pool and an external tool would still
not be needed (or, after applying the patch/fix, do "alias
fsck='zpool import'" and then 'fsck pool' will get your pool
fixed... :)
You might ask: what are you supposed to do until such a bug is fixed?
Well, what would you do if you couldn't mount an ext2 filesystem (or
any other) and there was a bug in its fsck which prevented it from
getting the fs into a mountable state? You would have to wait for a
fix, or fix it yourself, or play with the on-disk format with tools
like e2fs, fsdb, ... and try to fix the filesystem manually. Well, on
ZFS you also have zdb... Or you would probably be forced to recover
the data from backup.
The point here is that most filesystems and their tools had such
bugs, and ZFS is one of the youngest filesystems on the market, so in
a way it is no wonder that such bugs are getting fixed now and not
5-7 years ago. Then, a critical mass of users is required for a given
filesystem to be deployed in many different environments - different
workloads, hardware, drivers, use cases, ... - so that all these
corner cases can surface, users hopefully report them, and they get
fixed. ZFS has been widely deployed only for the last couple of years
or so, so it is no wonder that most of these bugs were spotted (and
fixed) during the same period.
But then, thanks to the fundamentally different architecture of ZFS,
once most (all? :)) of the bugs like these are fixed, ZFS offers
something MUCH better than legacy filesystems + fsck. It offers a
guarantee of detecting data corruption and fixing it properly when
possible, while reporting what can't be fixed and still providing
access to all the other data in your pool.
btw: this email exchange is private, so I don't want to CC
zfs-discuss without your consent, but if you want to forward this
email to zfs-discuss for other users' benefit, feel free to do so.
As the evolution of ZFS has demonstrated, impossible things *do*
happen, and you *do* need the ability to recover as much as possible.
ZFS is
busy slapping bandaids over specific problems instead of dealing with
the general issue.
Just a quick "google" and:
1. fsck fails and causes panic of Linux kernel
https://bugzilla.redhat.com/show_bug.cgi?id=126238
2. btrfs - filesystem got corrupted, running btrfsck causes even more
damage and the entire filesystem is nuked due to a bug. BTRFS is not
the best example as it is far from production ready, but still...
https://bugzilla.redhat.com/show_bug.cgi?id=497821
3. linux gfs2 - fsck has a bug (or lacks a feature) and is not able
to fix the filesystem after a specific corruption, but the filesystem
is unmountable. The only option is to manually fix the data on disk
with help from a support service, on a case-by-case basis...
https://bugzilla.redhat.com/show_bug.cgi?id=457557
4. e2fsck segfaults + dumps core when trying to check a filesystem
https://bugzilla.redhat.com/show_bug.cgi?id=108075
5. ext3 filesystem crashes - fsck can't repair it and goes into an
infinite loop... fixed in a development version of fsck
https://bugzilla.redhat.com/show_bug.cgi?id=467677
6. gfs2 corruption causes a linux kernel panic... fsck says it fixed
the issue, but it didn't, and the system crashes all over again under
load...
https://bugzilla.redhat.com/show_bug.cgi?id=519049
7. ext3 filesystem can't be mounted and fsck won't finish after 10
days of running (probably some kind of infinite-loop bug again)
http://ubuntuforums.org/archive/index.php/t-394744.html
8. AIX JFS2 filesystem corruption - due to a bug in fsck it can't fix
the fs; data had to be recovered from backup
http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/error-518-file-system-corruption-366503
9.
https://bugzilla.redhat.com/show_bug.cgi?id=514511
https://bugzilla.redhat.com/show_bug.cgi?id=477856
And there are many more...
The point, again, is that bugs happen even in fsck, and until they
are fixed a common user/sysadmin quite often won't be able to recover
on their own. ZFS is no exception here when it comes to bugs. But
thanks to its different approach (mostly end-to-end checksumming +
COW), its ability to detect data corruption and deal with it exceeds
that of most generally available solutions on the market. The fixes
for the bugs mentioned above only make it more robust and reliable,
even for those previously unlucky users... :)