>>>>> "hj" == Henrik Johansson <henr...@henkis.net> writes:

    hj> I have been operating quite large deployments of SVM/UFS
    hj> VxFS/VxVM for some years and while you sometimes are forced to
    hj> do a filesystem check and some files might end up in
    hj> lost+found I have never lost a whole filesystem.

I think in the world we want, even with the other filesystems, the SAN
fabric or array controller or disk shelf should be able to reboot
without causing any files to show up in lost+found, or requiring
anything other than the normal log roll-forward.  I bet there are
rampant misimplementations.

Maybe the whole SAN situation is ubiquitously misthought because
filesystem designers build things assuming that whenever anything
``crashes,'' the kernel and their own code will go down too.  They
invent a clever way to handle a non-SAN cord-yanking, test it, and
yes, you can yank the cord and it works fine.  But that isn't the only
way things can fail on a SAN.

In the diagram below the disk loses power, but the host, SAN, and
controller don't.  That exact failure is probably not very common; I
should redo diagrams like this, once I understand the disk command set
and iSCSI tagged commands better, for other parts of the stack
rebooting, like the SAN fabric or the controller.


      filesystem  initiator     SAN      controller     diskbuffer    platter

       [...earlier writes not shown...]

 t     SYNC    ------..
 i                     ---------..
 m                                -----------..
 e                                             -------------            write(A)
 |                                                          .           write(B)
 v                                                          .           write(C)
                                             ..-------------
                                ..-----------
                     ..---------
       success ------
         good.  A-C are
         on the platter.
         commit ueberblock(D).

       write(D) -----..
                       ---------..
       write(E) -----..           -----------..
                       --------..              ------------ [D]
       write(F) -----..          -----------..
                       -------..              ------------- [E]
       write(G) -----..         -----------..        =======POWER FAILURE=======
                       -------..             -------------- poof...[F] gone
                                -----------..
                                             XXXX no
                                           ..XXXX disk
                              ..-----------
                     ..-------
       ERROR(G) <----

       ohno! couldn't write G.
        increment error counter                      =======POWER RESTORED=======
        retry

       write(G) -----..
                       -------..
       SYNC     -----..         -----------..
                       -------..             -------------- [G]
                                -----------..
                                             --------------           write(G)
                                                           .
                                           ..--------------
                             ..------------
                    ..-------
       success -----
         good.  that means D-G are
         on the platter.
         commit ueberblock(H)

       write(H)  <-- DANGER, Will Robinson.


Writes D - F were lost in this ``event,'' and the filesystem has no
idea.  If ===POWER FAILURE=== applied to the filesystem and the disk
at the same time, then this problem would not exist---the way we are
using SYNC here would be enough to stop H from being written---so
power failures for non-SAN setups are safe from this.
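
To make that concrete, here's a toy model of the diagram in Python.
none of it is real ZFS or driver code; the disk class and the retry
rule are made up, and it only shows the bookkeeping mistake: the
retry path resends just the write that returned an error, and the
next successful SYNC is taken to cover everything since the last one.

# toy model of the diagram above, not ZFS code: the disk's volatile
# write cache is emptied by the power failure, but the host keeps
# running and retries only the one write that returned an error.

class ToyDisk:
    def __init__(self):
        self.cache = []        # volatile write cache
        self.platter = []      # what actually survives power loss
        self.powered = True

    def write(self, block):
        if not self.powered:
            raise IOError("no disk")
        self.cache.append(block)

    def sync(self):
        # SYNCHRONIZE CACHE: flush the volatile cache to the platter
        if not self.powered:
            raise IOError("no disk")
        self.platter += self.cache
        self.cache = []

disk = ToyDisk()

for block in "ABC":            # earlier writes
    disk.write(block)
disk.sync()                    # success: A-C really are on the platter
disk.write("D")                # commit ueberblock D

disk.write("E")
disk.write("F")
disk.powered = False           # ===POWER FAILURE===
disk.cache = []                # cache contents gone

try:
    disk.write("G")            # ERROR(G)
except IOError:
    disk.powered = True        # ===POWER RESTORED===
    disk.write("G")            # retry only the write that failed

disk.sync()                    # success: but it only flushed G
disk.write("H")                # commit ueberblock H, believing D-G are safe

print(disk.platter)            # ['A', 'B', 'C', 'G'] -- D, E, F silently lost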

Also, if we treat the disk as bad the moment it says ``write
failure'', and the array controller decides ``this disk is bad,
forever'' the instant it loses power and times out write F,
considering the disk's entire contents lost and not bothering to read
ANYthing from it until it's been resilvered from the other disks in
the RAIDset, then we also do not have this problem.  So power failures
on an SVM mirror with no understanding of the overlying filesystem are
okay.

Using naked UFS or ext3 or whatever over a SAN still has this
problem, I think.  Those filesystems are just better than ZFS at
losing some data without losing the whole filesystem.

I think ZFS attempts to be smarter than SVM, and to aim beyond the
case where everything sits behind one power supply in one box, but it
is probably not smart enough to finish the job.  Rather than just more
UFS/VxFS-style robustness I'd like to see the job finished and this
SAN write hole closed up.

It's important to accept that nothing is broken in this event.  It's
just a yanked power cord.  I won't accept, ``a device failed, and you
didn't have enough redundancy, so all bets are off.  You must feed ZFS
more redundancy.  You expect the impossible.''  No, that argument is
bullshit.  Losing power unexpectedly is not the same as a device
failure---unexpected power loss is part of the overall state diagram
of a normal, working storage system.

    hj> We are currently evaluating if we should begin to implement
    hj> ZFS in our SAN. I can see great opportunities with ZFS but if
    hj> we have a higher risk of loosing entire pools

Optimistically, the ueberblock rollback will make ZFS like the other
filesystems, though maybe faster to recover.  If you are tied to
stable solaris it'll probably take like a year before you get your
hands on it, but so far I think everyone agrees it's promising.

I think it's not enough though.  If the problem is that a batch of
writes were lost, then a trick to recover the pool still won't recover
those lost writes, and you promised applications those writes were on
the disk.  Databases and filesystems inside zvol's could still become 
corrupt.  What this really means, is that using SAN's makes corruption
in general more likely.

I think we sysadmins should start using some tiny 10-line programs to
test the SANs and figure out what's wrong with them (a sketch of one
is at the end of this mail).  I think in the end we will need about
two things to fix the problem:

 * some kind of commit/replay feature in iSCSI and FC initiators.

   or else the same feature implemented in the filesystems right above
   them but cooperating with the initiators pretty intimately.
   Gigabytes of write data could be ``in flight''---we are talking
   about however much data is between the return of a first
   SYNCHRONIZE CACHE command and the next one---so it'd be good to
   arrange that it not be buffered two or three or four times, which
   may require layer-violating cooperation.

   I'm all but certain nobody's doing this now.

    - is it in the initiator?  commit/replay in the initiator would
      mean the initiator issues SYNCHRONIZE CACHE commands for itself,
      ones not demanded by the filesystem above it, whenever its
      replay write cache gets too large.  I've never heard of that,
      and I don't think anyone would put up with an iSCSI/FC initiator
      burning up gigabytes of RAM without an explanation, which means
      I'd have heard about it and be worried about tuning it.  (a toy
      sketch of such a replay cache follows this list.)

    - is it in the filesystem?  Any filesystem designed before SAN's
      will expect to eventually get a successful return from any
      SYNCHRONIZE CACHE command it passes to storage.  a failed SYNC
      will happen in the form of someone yanking the cord, so the
      filesystem code will never see the failure because it won't be
      executing any longer.  UFS and ext3 don't even bother to issue
      SYNCHRONIZE CACHE at all, much less pay attention to its return
      value and buffer writes so they can be replayed if it fails, so
      I doubt they have an exception path for a failed SYNC command.

      Putting replay in the filesystem also means that, if the iSCSI
      initiator notices the target bounce, then it MUST warn the
      layers above that writes were lost, for example by waiting for
      the next SYNCHRONIZE CACHE command to come along and
      deliberately returning it failed without consulting the target,
      even though the LUN would say it succeeded if it were issued.
      I've never heard of anything like this.

 * pay some attention to what happens to ZFS when a SAN controller
   reboots, separately with each 'failmode' setting.  To maintain
   correctness with NFS clients the zpool is serving, or with
   replicated/tiered database applications where the dbms app is
   keeping several nodes in sync, ZFS may need a failmode=umount that
   kills any app with outstanding writes on a failed pool and
   un-NFS-exports all the pool's filesystems.  

   the existing failmode=panic could probably be verified (and likely
   have to be fixed) to provide the same level of correctness, but
   that would not be as good as the umount-and-kill because it'd make
   HA and zones more antagonistic to each other, by putting many zones
   at the mercy of the weakest pool on the system, which could even be
   a USB stick or something.  It's the wrong direction to move.

   I am not sure what failmode=continue and failmode=wait mean now, or
   what they should mean to fix this problem.  It'd be nice if they
   meant what they claim to be: ``wait: use commit/replay schemes so
   that no writes are lost even if the SAN controller reboots.  apps
   should be frozen until they can be allowed to continue as if
   nothing went wrong.  continue: fsync() returns -1 immediately for
   the first data that never made it to disk, and continues returning
   -1 until all writes issued up to now are on the platter, including
   writes that had to be replayed because of the reboot.  Once fsync()
   has been called and has returned -1, all write() to that file must
   also fail because of the barrier.  And once your app calls fsync()
   a second, third, fourth time and finally gets a 0 return from
   fsync(), it can be sure no data was lost.''  Of course all that
   seems optimistic beyond ridiculous, even for UFS and VxFS.  but if
   implemented like that, panic and wait should both be safe for SAN
   outages, and continue, which we already understand to be unsafe,
   would at least make it possible to write a correct cooperating app,
   like a database or a user-mode iSCSI target (see the cooperating-app
   sketch after this list).
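
Since I keep talking about commit/replay in the abstract, here is a
toy Python sketch of a replay write cache sitting above a flaky
target.  nothing in it is a real iSCSI or FC initiator API; a real
one would also have to bound the log, keep LBA ordering, and so on.
the only point is the rule: a write may leave the replay log only
after a SYNCHRONIZE CACHE that covers it has returned success.

class FlakyTarget:
    # stand-in for a LUN behind a controller that can reboot
    def __init__(self):
        self.cache, self.platter, self.up = [], [], True

    def write(self, block):
        if not self.up:
            raise IOError("target bounced")
        self.cache.append(block)

    def sync(self):
        if not self.up:
            raise IOError("target bounced")
        self.platter += self.cache
        self.cache = []

    def bounce(self):
        self.cache, self.up = [], False   # controller reboot: cache gone

    def reconnect(self):
        self.up = True

class ReplayingInitiator:
    # keeps every write in RAM until a flush that covers it succeeds
    def __init__(self, target):
        self.target = target
        self.replay_log = []              # writes not yet known durable

    def write(self, block):
        self.replay_log.append(block)
        try:
            self.target.write(block)
        except IOError:
            pass                          # stays in the log, replayed later

    def sync(self):
        while True:
            try:
                for block in self.replay_log:  # replay anything in doubt
                    self.target.write(block)   # (real writes are LBA-addressed,
                                               #  so replay is an overwrite)
                self.target.sync()
                self.replay_log = []           # only now forget them
                return
            except IOError:
                self.target.reconnect()        # really: wait for the session

t = FlakyTarget()
i = ReplayingInitiator(t)
for block in "DEF":
    i.write(block)
t.bounce()                # controller reboots mid-stream
i.write("G")              # fails quietly, kept in the replay log
i.sync()                  # replays D-G after reconnect, then flushes
print(t.platter)          # ['D', 'E', 'F', 'G'] -- nothing acknowledged is lost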
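
And here is the cooperating app from the failmode=continue wishlist
above, again just a sketch of the hoped-for semantics, not anything a
shipping filesystem promises today: fsync() keeps failing until
everything issued so far is really on the platter, so the app retries
fsync() and acknowledges the transaction upstream only once it
finally succeeds.  the path is a placeholder.

# sketch of a cooperating app under the hoped-for failmode=continue
# semantics: fsync() keeps failing until every write issued so far is
# on the platter, so the app just retries fsync() and acknowledges
# the transaction upstream only once it finally succeeds.
import os, time

def fsync_until_durable(fd, poll_interval=1.0):
    while True:
        try:
            os.fsync(fd)       # fails while any earlier write is not durable
            return
        except OSError:
            time.sleep(poll_interval)   # pool is replaying after the reboot

# placeholder path on a pool served over the SAN
fd = os.open("/tank/db/journal", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
os.write(fd, b"transaction record\n")
fsync_until_durable(fd)        # only now tell the client it is committed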

    hj> So, what is the opinion, is this an existing problem even when
    hj> using enterprise arrays? If I understand this correctly there
    hj> should be no risk of loosing an entire pool if
    hj> DKIOCFLUSHWRITECACHE is honored by the array?

no, the timing diagram I showed explains how I think data might still
be lost during a SAN reboot, even for a SAN which respects cache
flushes.  but all this is pretty speculative for now.
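
P.S. here is roughly what I mean by a tiny test program.  the device
path is a placeholder (point it at a scratch LUN only), and whether
O_DSYNC alone is enough to push the array's cache is part of what the
test probes.  run it, reboot the array controller mid-run, note the
last number printed as acknowledged, then read the LUN back and check
that every block up to that number still holds its sequence number.

# tiny SAN honesty test: write one sequence number per 512-byte block,
# synchronously, and print each number only after the write returns.
# after a controller reboot, every printed number should still be on
# the LUN, or the array lost acknowledged writes.
import os, struct, sys

DEV = "/dev/rdsk/c9t0d0s0"     # placeholder -- a scratch LUN, not real data
BS = 512

fd = os.open(DEV, os.O_WRONLY | os.O_DSYNC)
seq = 0
while True:
    block = struct.pack("<Q", seq) + b"\0" * (BS - 8)
    os.write(fd, block)                      # raises OSError if the write fails
    sys.stdout.write("acknowledged %d\n" % seq)
    sys.stdout.flush()
    seq += 1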
