>>>>> "r" == Ross  <[EMAIL PROTECTED]> writes:

     r> Tom wrote "There was a problem with the SAS bus which caused
     r> various errors including the inevitable kernel panic".  It's
     r> the various errors part that catches my eye,

Yeah, possibly, but there are checksums on the SAS bus, and its
confirmation of which CDBs have completed should always be accurate.
If the problem were ``another machine booted up, and I told the other
machine to 'zpool import -f','' then maybe you'd have a point.  But
just tripping over a cable shouldn't qualify as weird, nor should
Erik's problem of the FC array losing power or connectivity.  These
are both within the ``unclean shutdown'' category handled by UFS+log,
FFS+softdep, ext3, reiser, xfs, vxfs, jfs, HFS+, ...
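
The ``other machine'' scenario, for contrast, would look roughly like
this (pool name made up), and that one really is outside anything a
filesystem can promise to survive:

  # pool is imported and busy on host-a, and was never exported
  host-b# zpool import -f tank    # -f overrides the ``may be in use'' check
  # now two hosts are writing the same vdevs, and nothing survives that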

     r> Can fsck always recover a disk?  Or if the corruption is
     r> severe enough, are there times when even that fails?  

This question is obviously silly.  Write zeroes over the disk, and now
the corruption is severe enough.  However, fsck can always recover a
disk from a kernel panic, or from a power failure of the host or of the
disks, because these things don't randomly scribble over the disk.

  (Now, yeah, I know I posted earlier a story from Ted Ts'o about SGI
  hardware and about random disk scribbling as the 5V rail started
  drooping.  Yes, I posted that one.  But it doesn't happen _that
  much_, and it doesn't even apply to Tom's and Erik's case of a loose
  SAS cable or tripping over an FC cord.)

If the kernel panic was caused by a bug in the filesystem, then you'll
say aHA!  aaHAh!  But then, then it might do the scribbling!

Well, yes.  So in that case we agree there's a bug in the filesystem. :)

You'll say, ``but WHAT if the kernel panic was a bug in the DISK
DRIVER, eh?  eh, then maybe ZFS is not at fault!''  Sure, fine, read
on.

     r> I don't see that we have enough information here to really
     r> compare ZFS with UFS

What we certainly have, between Tom and Erik and my own experience
with resilvering-related errors accumulating in the CKSUM column when
iSCSI targets go away, is enough information that ``you should have
had redundant pools'' doesn't settle the issue.  Reports of zpool
corruption on single vdevs mounted over SANs would benefit from
further investigation, or at least a healthily suspicious scientific
attitude that encourages someone to investigate this if it happens
under more favorable conditions, such as inside Sun, or to someone
with a support contract and enough time to work on a case (maybe
Tom?), or to someone who knows ZFS well like Pavel.  Also, there is
enough concern that people designing paranoid systems should approach
them with the view, ``ZFS is not always-consistent-on-disk unless it
has working redundancy.''  Choosing to build a ZFS system the same way
as a UFS system, without ZFS-level redundancy, is, based on our
experience so far, not just forgoing some of ZFS's whizz-bang new
feeechurs.  It's significantly less safe than the UFS system.  For as
long as the argument remains unsettled, conservative people need to
understand that.  Conservative people should also understand point (c)
below.

It sounds to me like Tom's and Erik's problems are more likely ZFS's
fault than not.  The dialog has gone like this:

1. This isn't within the class of errors ZFS should handle.  Get
   redundancy.

2. It sounds to me exactly like the class of error ZFS is supposed to
   handle.

3. You cannot prove 100% that this is necessarily the class of error
   ZFS is supposed to handle.  Something else might have happened.

   BTW, did I tell you how good ZFS (sometimes) is at dealing with
   ``might have happened'' if you give it redundancy?  It's new, and
   exciting, and unprecedented!  Is that a rabbit over there?  Look, a
   redheaded girl juggling frisbees!

What next, you'll drag out screaming Dick Cheney on a chain?

Recapping my view:

  a. it looks like a ZFS problem  (okay, okay, PROBABLY a ZFS problem)

  b. it's a big problem

  c. there's no good reason to believe people with redundant pools are
     immune from it, because they will run into it when they need their
     redundancy to cover a broken disk.

It also deserves more testing by me: I'm going to back up my smaller
'aboveground' pool and try to provoke it.
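
Something like this, on a throwaway single-vdev pool (pool name,
device, and sizes all made up), with the cable pulled by hand partway
through:

  zpool create scratch c2t0d0      # one vdev, no ZFS-level redundancy
  ( while :; do
      dd if=/dev/urandom of=/scratch/junk bs=1024k count=64
      sync
    done ) &
  # ...yank the cable / kill the iSCSI target here, then bring it back...
  kill %1
  zpool export scratch             # may fail or hang; that's part of the test
  zpool import -f scratch          # the step that reportedly blows up
  zpool status -v scratch
  zpool scrub scratch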

     r> although I do agree that some kind of ZFS repair tool
     r> sounds like it would be useful.

I don't want to dictate architecture when I don't know the internals
well.  What's immediately important to me is that ZFS handle unclean
shutdown rigorously, as most other filesystems claim to and eventually
mostly accomplish.  This could mean adding an fsck tool, but more
likely it will just be a matter of fixing a bug.

Old computers had to bring up their swap space before fsck'ing big
filesystems because the fsck process needed so much memory.  The
filesystem implementation was a small text segment of fragile code that would
panic if it read the wrong bits from the disk, but it was fast and
didn't take much memory.  It made sense to split the filesystem into
two pieces, the fsck piece and the main piece, to conserve the
machine's core (and make the programming simpler).

We have plenty of memory for text segments now, so it might make more
sense to build fsck into the filesystem.  The filesystem should be
able to mount any state you would expect a hypothetical fsck tool to
handle, and mount it almost immediately, and correct any ``errors'' it
finds while running.  If you want to proactively correct errors, it
should do this while mounted.

That was the original ZFS pitch, and I think it's not crazy.  It's
basically what we're supposed to have now with the ``always consistent
on disk'' claim and the (O(n)?) online fsck-equivalent of 'zpool scrub'.
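
i.e., the whole ``fsck'' is supposed to be just this, with the pool
mounted and in use the entire time (pool name made up):

  zpool scrub tank        # walk and verify every block in the background
  zpool status -v tank    # progress, plus any errors it found (and fixed,
                          # if there's redundancy to fix them from)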

FFS+softdep sort of works this way, too.  It's designed to safely
mount ``unclean'' filesystems, so in that sense, it's ``always
consistent.''  It does not roll a log, because there isn't one---it
just mounts the filesystem as it was when the cord was pulled, and it
can do this with no risk of kernel panicking or odd behavior to
userland, because of the careful order in which it writes data before
the crash.  However, after an unclean shutdown, the filesystem is
still considered dirty even though it mounts and works.  FreeBSD then
starts the old fsck tool in the background.  That fsck is still O(n^2).
So...FFS+softdep sort of follows the new fsck-less model where the
filesystem is one unified piece that does all its work after mounting,
but follows it clumsily because it's reusing the old FFS code and
on-disk format.
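
On FreeBSD that looks roughly like this, if I remember the rc.conf
knobs right:

  # /etc/rc.conf -- mount softdep filesystems dirty, fsck them later
  background_fsck="YES"
  background_fsck_delay="60"    # seconds to wait before starting it
  # the rc scripts then run something like 'fsck -B' on each dirty
  # filesystem, against a snapshot, while the system is already up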

From my non-developer perspective, there seem to be the equivalent of
mini-FFS+softdep-style fscks inside ZFS already.  Sometimes when a
mirror component goes away, ZFS does (what looks in 'zpool status'
like) a mini-resilver on the remaining component.  There's no
redundancy left in the vdev, so there's nothing to actually resilver.
Maybe this has to do with the quorum rules or the (seemingly broken)
dirty region logging, both of which I still don't understand.

And there is also my old problem of 'zpool offline' reporting ``no
valid replicas'' until I've done a scrub, after which 'zpool offline'
works again, so a scrub is not really a purely proactive thing: buried
inside ZFS there is some notion of dirtiness preventing my 'zpool
offline', and a successful scrub clears the dirty bit (as do,
possibly, other things, like rebooting :( ).  So the architecture
might be fine as-is, since scrub is already a little more than what it
claims to be and is doing some sort of metadata- or RAID-level
fsck-ing.  I wouldn't expect the fix for these corrupt single-vdev
pools to come in some specific form based on prejudices from earlier
filesystems.
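
The dance, with pool and device names from memory (so treat them as
placeholders; the pool is a mirror, so ``no valid replicas'' shouldn't
apply), goes roughly:

  zpool offline tank c3t0d0    # fails with ``no valid replicas''
  zpool scrub tank             # finishes without complaint
  zpool offline tank c3t0d0    # and now the same command works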

Now there is another tool Anton mentioned, a recovery tool or forensic
tool:  one that leaves the filesystem unmounted, treats the disks as
read-only, and tries to copy data out of it onto a new filesystem.  If
there were going to be a separate tool---say, something to handle disks
that have been scribbled on, or fixes for problems that are really
tricky or logically inappropriate to deal with on the mounted
filesystem---I think a forensic/recovery tool makes more sense than an
fsck.  If this odd stuff isn't supposed to happen, and it has happened
anyway, you want a tool you can run more than once.  You want the
chance to improve the tool and run it again, or to try an older
version of the tool if the current one keeps crashing.

I'm just really far from convinced that Tom needs this tool.

     r> To me, it sounds like Sun have designed ZFS to always know if
     r> there is corruption on the disk, and to write data in a way
     r> that corruption of the whole filesystem *should* never happen.

``Sounds like'' depends on what you're listening to.  If you're
listening to Sun's claims, then yes, of course that's exactly what
they claim.  If you're listening to the experience on this list, it
sounds different.  The closest we've come is agreeing that I haven't
completely invalidated the original claims, which is pretty far from
making me believe them again.
