>>>>> "rsk" == Roy Sigurd Karlsbakk <r...@karlsbakk.net> writes:
>>>>> "dm" == David Magda <dma...@ee.ryerson.ca> writes:
>>>>> "tt" == Travis Tabbal <tra...@tabbal.net> writes:

   rsk> Disabling ZIL is, according to ZFS best practice, NOT
   rsk> recommended.

    dm> As mentioned, you do NOT want to run with this in production,
    dm> but it is a quick way to check.

REPEAT: I disagree.

Once you connect the dire, catastrophizing warnings on the developers'
advice wiki to the specific problems that disabling the ZIL actually
causes for real sysadmins, rather than to abstract notions of ``POSIX''
or ``the application'', a lot more people end up wanting to disable
their ZILs.

In fact, most of the SSDs being sold seem to rely on exactly the trick
that ZIL-disabled ZFS uses for much of their high performance, if not
for their very feasibility at their price point: provide a guarantee of
write ordering without durability, and many applications are just,
poof, happy.

If the SSDs arrange that no writes are reordered across a SYNC CACHE,
but don't bother actually making them durable, end users conclude ``OMG,
Windows is fast and nothing corrupts'' --> SSD sales.
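
To make that concrete, here is a minimal sketch (mine, not anything out
of ZFS or the wiki; the file name and record format are made up) of the
kind of application pattern that needs only ordering from the drive: a
record is written, a flush is issued as a barrier, and only then is a
short commit marker appended.  If the drive preserves ordering across
the flush but quietly drops un-persisted writes on power loss, recovery
sees either a complete record with its marker or nothing at all, never
a torn record.  The only cost is losing the last few seconds of
records, which is exactly the trade the ZIL-disabling crowd is making
on purpose.

    /* Toy append-only log whose correctness needs only write ORDERING,
     * not durability. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int append_record(int fd, const char *payload)
    {
        char hdr[32];
        size_t len = strlen(payload);

        snprintf(hdr, sizeof hdr, "REC %010zu\n", len);
        if (write(fd, hdr, strlen(hdr)) < 0) return -1;
        if (write(fd, payload, len) < 0)     return -1;

        /* Barrier: header and payload must stay ordered ahead of the
         * commit marker below; durability is not actually required. */
        if (fsync(fd) != 0) return -1;

        return (write(fd, "C\n", 2) == 2) ? 0 : -1;   /* commit marker */
    }

    int main(void)
    {
        int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (append_record(fd, "hello world") != 0) perror("append_record");
        close(fd);
        return 0;
    }

A journaling filesystem or database does the same dance at a fancier
scale, which is why ordering alone keeps so many of them looking
consistent after a crash.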

The ``do-not-disable, buy-an-SSD!!!1!'' advice thus translates to ``buy
one of these broken SSDs, and you will be basically happy.  Almost
everyone is.  When you aren't, we can blame the SSD instead of ZFS.''
All that bottlenecked host<->SSD SATA traffic is just CYA, of no real
value (except across kernel panics).


Now, if someone would make a battery FOB that gives a broken SSD 60
seconds of power, then we could use consumer-crap SSDs in servers again
for real value instead of CYA value.  The FOB should work like this:

                        == RUNNING ==
   battery   ,------->     SATA port: pass   -----.
 recharged? /              power to SSD: on        \  input
           /                                        \ power
          (                                          . lost
          |                                          |
          .                       input  ,---\       v
                                  power /     v
                              restored /   =power lost=
    =power restored=                   .   =hold-down =
    =hold down     =                    --     SATA port: block
        power to SSD: off                      power to SSD: on
           ^                                       |
           |                                       |
           .                                      .  60 seconds
   input    \                                    /   elapsed
   power     .          =power off=             ,
   restored    --------     power to SSD: off <-


The device must know when its battery has gone bad and stick itself in
``power restored hold down'' state.  Knowing when the battery is bad
may require more states to test the battery, but this is the general
idea.
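
For anyone who reads code more easily than ASCII art, here is a rough
sketch of the same state machine.  The state names, the struct, and the
once-per-second tick are all mine; a real FOB would run something
equivalent on a tiny microcontroller.

    #include <stdbool.h>

    enum fob_state {
        FOB_RUNNING,                  /* SATA port: pass,  power to SSD: on */
        FOB_POWER_LOST_HOLD_DOWN,     /* SATA port: block, power to SSD: on */
        FOB_POWER_OFF,                /* power to SSD: off                  */
        FOB_POWER_RESTORED_HOLD_DOWN  /* power to SSD: off, wait on battery */
    };

    struct fob {
        enum fob_state state;
        int hold_seconds;             /* countdown while holding down       */
    };

    /* One tick of the controller, called roughly once per second. */
    void fob_tick(struct fob *f, bool input_power,
                  bool battery_recharged, bool battery_ok)
    {
        switch (f->state) {
        case FOB_RUNNING:
            if (!input_power) {                   /* input power lost       */
                f->state = FOB_POWER_LOST_HOLD_DOWN;
                f->hold_seconds = 60;
            }
            break;
        case FOB_POWER_LOST_HOLD_DOWN:
            /* ``input power restored'' is a self-loop here: the SATA port
             * stays blocked and the hold-down runs to completion.          */
            if (--f->hold_seconds <= 0)           /* 60 seconds elapsed     */
                f->state = FOB_POWER_OFF;
            break;
        case FOB_POWER_OFF:
            if (input_power)                      /* input power restored   */
                f->state = FOB_POWER_RESTORED_HOLD_DOWN;
            break;
        case FOB_POWER_RESTORED_HOLD_DOWN:
            /* A failed battery self-test keeps the FOB parked here, as
             * described above.                                             */
            if (battery_recharged && battery_ok)  /* battery recharged      */
                f->state = FOB_RUNNING;
            break;
        }
    }

The two asymmetries are the point: restoring input power during the
hold-down does not jump straight back to RUNNING (presumably because
the SATA port is already blocked and the host has already lost the
disk), and RUNNING is only re-entered once the battery has recharged
and passed its self-test, so the 60-second guarantee holds whenever the
SSD is visible to the host.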

I think it would be much cheaper to build an SSD with a supercap, and
simpler, because you can assume the supercap is good forever instead of
testing it.  However, because of ``market forces'', the FOB approach
might actually sell for less, since the FOB cannot be tied to a
particular SSD and used as a way to segment the market.  If there are
two companies making only FOBs and not making SSDs, only then will
competition work the way people want it to.  Otherwise FOBs will be
$1000 or something, because only ``enterprise'' users are smart/dumb
enough to demand them.

Normally I would have a problem with the FOB and the SSD being
separable, but see, the two can be stuck together with double-sided
tape: the tape only has to hold for 60 seconds after the power-loss
event, and there's no way to separate them by tripping over a cord.
You can safely move the SSD+FOB pair from one chassis to another
without fearing that all is lost if you jiggle the connection.  I think
it's okay overall.

    tt> This risk is mostly mitigated by UPS backup and auto-shutdown
    tt> when the UPS detects power loss, correct?

No, no: it's about cutting off a class of failure cases and
constraining ourselves to relatively sane forms of failure.  We are not
haggling about NO FAILURES EVER yet.  First, for STEP 1, we isolate the
insane kinds of failure, the ones that cost us days or months of data
rather than just a few seconds, the kinds that call for crazy,
unplannable, ad-hoc recovery methods like ``Viktor plz help me'' and
``is anyone here a Postgres data recovery expert?'' and ``is there a
way I can invalidate the batch of billing auth requests I uploaded
yesterday so I can rerun it without double-billing anyone?''  For STEP
1 we make the insane failures almost impossible through clever software
and planning.  A UPS never, never, ever qualifies as ``almost
impossible''.

Then, once that's done, we come back for STEP 2, where we try to
minimize the sane failures too, and for STEP 2 things like a UPS might
be useful.  For STEP 2 it makes sense to talk about percent
availability, probability of failure, and length of time to recover
from Scenario X.  But in STEP 1 all the failures are insane ones, so
you cannot measure any of these things.  A UPS is not about how
``paranoid'' you are or how far you want to take STEP 1.  You take STEP
1 all the way to completion before worrying about STEP 2.

For NFS, the STEP 1 risk on the table is ``server reboots, client does
not.''  It is okay if both reboot at once.  It is okay if neither
reboots.  But if you

    disable ZIL
    OR
    have broken SSD like X25
  AND
    NFS server reboots, client doesn't

then you have a STEP 1 insane failure case that can cause corrupted
database files or virtual disk images on the NFS clients.
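
To see the mechanism, consider what survives on the client.  Below is a
toy sketch (the mount path, the record, and the bookkeeping variable
are all made up for illustration) of an application appending to a file
on an NFS mount.  When fsync() returns, which over NFSv3 means the
server has answered a COMMIT, the application updates its own notion of
what is safely on disk and may promise as much to others.  If the
server acknowledged that COMMIT without durable storage and then
reboots alone, the file comes back shorter or torn while this
still-running process keeps believing otherwise; that divergence is how
database files and virtual disk images get corrupted.  Had the client
rebooted at the same time, it would have rebuilt its state from
whatever is actually in the file and ended up merely behind, not wrong.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* The client's belief about what is durable on the server.  It
     * survives a server reboot because the client never rebooted. */
    static long committed_bytes;

    static int append_committed(int fd, const char *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;
        if (fsync(fd) != 0)          /* over NFSv3 this drives a COMMIT */
            return -1;
        committed_bytes += len;      /* promise we may pass downstream  */
        return 0;
    }

    int main(void)
    {
        /* hypothetical NFS-mounted path */
        int fd = open("/mnt/nfs/journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (append_committed(fd, "txn-1\n", 6) == 0)
            printf("believe %ld bytes are durable\n", committed_bytes);

        /* If the server lied about durability and reboots now, the file
         * may hold fewer than committed_bytes bytes, but this process
         * (still running, caches intact) has no way to notice. */
        close(fd);
        return 0;
    }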

For example, if you fail to complete STEP 1, and then you plug the NFS
clients into a more expensive UPS with proper transfer switches for
maintenance and A/B power, and the server into a rather ordinary UPS,
then you will be at greater risk of this particular NFS problem than if
you used no UPS at all, because you have made ``clients stay up while
the server goes down'' the most likely outcome of a power event.
That's not intuitive!  But it's true!  This comes from putting STEP 2
before STEP 1.  You must do them in order if you want to stay sane.

If you do not care about this NFS problem (or the others), then maybe
you can just disable the ZIL.  It is a matter of working through STEP
1.  Working through STEP 1 might end in ``doesn't affect us.  Disable
the ZIL.''  Or it might end in ``get a slog with a supercap.''  STEP 1
will never end in ``plug in an OCZ Vertex cheapo-slog that ignores
cache flushes'' if you are doing it right.  And STEP 2 has nothing to
do with anything until we have finished STEP 1 and the insane failure
cases.
