>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:

     c>  If that's really the excuse for this situation, then ZFS is
     c> not ``always consistent on the disk'' for single-VDEV pools.

    re> I disagree with your assessment.  The on-disk format (any
    re> on-disk format) necessarily assumes no faults on the media.

The media never failed, only the connection to the media.  We've every
good reason to believe that every CDB that the storage controller
acknowledged as complete, was completed and is still there---and that
is the only statement which must be true of unfaulty media.  We've no
strong reason to doubt it.

    re> I see no evidence that the data is or is not correct.

the ``evidence'' is that it was on a SAN, and the storage itself never
failed, only the connection between ZFS and the storage.  Remember:

 this device is 48 1T SATA drives presented as a 42T LUN via hardware
 RAID 6 on a SAS bus which had a ZFS on it as a single device.

This sort of SAN-outage happens all the time, so it's not straining my
belief to suggest that probably nothing else happened other than
disruption of the connection between ZFS and the storage.  It's not
like a controller randomly ``acted up'' or something, so that I would
suspect a bad disk.

     c> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048375.html

    re> I have no idea what Eric is referring to, and it does not
    re> match my experience.

unfortunately it's very easy to match the experience of ``nothing
happened'' and hard to match the experience ``exactly the same thing
happened to me.''  Have you been provoking ZFS in exactly the way Eric
described, a single-vdev pool on FC where the FC SAN often has outages
or where the storage is rebooted while ZFS is still running?  If not,
obviously it doesn't match your experience because you have none with
this situation.  OTOH if you've been doing that a lot, your not
running into this problem means something.  Otherwise, it's another
case of the home-user defense: ``I can't tell you how close to zero the number
of problems I've had with it is.  It's so close to zero, it is zero,
so there's virtually 0% chance what you're saying happened to you
really did happen to you.  and also to this other guy.''

When I said ``doesn't match my experience,'' I meant I _do_ see Mac OS
X pinwheels, and for me it's ``usually'' traceable back to VM pressure
or a dead NFS server, not some random application-level user-interface
modal-wait as others claimed: I'm selecting for the same situation you
are, and getting a different result.

that said, yeah, a CR would be nice.  For such a serious problem, I'd
like to think someone's collected an image of the corrupt filesystem
and is trying to figure out wtf happened.

I care about how safe is my data, not how pretty is your baby.  I want
its relative safety accurately represented based on the experience
available to us.

     c> How about the scenario where you lose power suddenly, but only
     c> half of a mirrored VDEV is available when power is restored?
     c> Is ZFS vulnerable to this type of unfixable corruption in that
     c> scenario, too?

    re> No, this works just fine as long as one side works.  But that
    re> is a very different case.  -- richard

Why do you regard this case as very different from a single vdev?  I
don't have confidence that it's clearly different w.r.t. whatever
hypothetical bug Eric and Tom have run into.

    wm> If data is sent, but corruption somewhere (the SAS bus,
    wm> apparently, here) causes bad data to be written, ZFS can
    wm> generally detect but not fix that.

Why would there be bad data written?  The SAS bus has checksums.  The
problem AIUI was that the bus went away, not that it started
scribbling random data all over the place.  Am I wrong?  Remember what
Tom's SAS bus is connected to.

    wm> "verifywrites"

The verification is the storage array returning success to the command
it was issued.  ZFS is supposed to, for example, delay returning from
fsync() until this has happened.  The same mechanism is used to write
batches of things in a well-defined order to supposedly achieve the
``always-consistent'' property.  It depends on the drive/array's
ability to
accurately report when data is committed to stable storage, not on
rereading what was written, and this is the correct dependency because
ZFS leaves write caches on, so the drive could satisfy a read from the
small on-disk cache RAM even though that data would be lost if you
pulled the disk's power cord.
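To sketch the contract described above (names here are mine, not
ZFS's): the ``verification'' an application gets is fsync() returning
success after the device acknowledges the flush, not a read-back of
the data, which could be served from cache RAM and prove nothing.

```python
# Sketch of the durability contract described above.  durable_write is
# a hypothetical helper name for illustration; the point is that the
# durability guarantee comes from fsync() returning, assuming the
# drive/array honors cache-flush commands honestly.
import os

def durable_write(path, payload):
    """Write payload and return only once the OS reports it is flushed
    to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)   # blocks until the device acknowledges the flush;
                       # a read() here could be satisfied from the
                       # drive's small cache RAM and verify nothing
    finally:
        os.close(fd)

durable_write("/tmp/fsync-demo.dat", b"acknowledged record\n")
```

Which is exactly why a ``verifywrites'' read-back adds nothing with
write caches on: the read can hit cache even when the data would be
lost on power pull.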

The system contains all the tools needed to keep the consistency
promises even if you go around yanking SAS cables.

And this is a data-loss issue, not just an availability issue like we
were discussing before w.r.t. pulling drives.

    wm> Every filesystem is vulnerable to corruption, all the time.

Every filesystem in recent history makes rigorous guarantees about
what will survive if you pull the connection to the disk array, or the
host's power, at any time you wish.  The guarantees always include the
integrity of data written before fsync() was called, so long as
power/connectivity is lost after fsync() returns.  They also include
enough metadata consistency that you won't lose a whole friggin' pool
like this scenario with some ``corrupt data, End of Line'' error.

  UFS+logging
  vxfs
  FFS+softdep
  ext3
  xfs
  reiserfs
  HFS+
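That guarantee is what the classic crash-safe replacement pattern
relies on, on every filesystem in the list above: write a temp file,
fsync it, then rename over the original.  A sketch (replace_file is an
illustrative name of mine, and this assumes POSIX rename atomicity):

```python
# After fsync() returns and rename() completes, a crash or pulled
# cable leaves either the old file or the new one -- never a corrupt
# half, and never a dead filesystem.
import os

def replace_file(path, payload):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)      # new contents are on stable storage first
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic swap; journaling/softdep keeps the
                          # directory metadata consistent through a crash

replace_file("/tmp/replace-demo.cfg", b"version=2\n")
```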

Disks that go bad, storage subsystems with a RAID5 write hole, PATA
busses that, given noisy cables, autodegrade to a non-CRC mode and then
corrupt data, disks that silently return bad data, controllers that go
nuts and scribble random data as the 5V rail starts dropping after the
cord is pulled, can, yes, all interfere with these guarantees.  but
NONE OF THOSE THINGS HAPPENED IN THIS CASE.

We absolutely do not live in fear that we will lose whole filesystems
if the cord is pulled at the wrong time.  That has not been true
since, like, the early 90's.  ancient history. :'


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
