>>>>> "ok" == Orvar Korvar <knatte_fnatte_tja...@yahoo.com> writes:

    ok> Nobody really knows for sure.

    ok> I will tell people that ZFS + HW raid is good enough, and I
    ok> will not recommend against HW raid anymore.

Jesus.  OK, fine: if you're going to threaten to let ignorant Windows
morons strut around like arrogant experts, voicing some baseless
feel-good opinion and then burdening you with proving a negative,
then I'll take the bait and do your homework for you.

What people ``don't really know for sure'' is what hardware RAID you
mean.  There are worlds of difference between NetApp RAID and Dell
PERC RAID, differences in their ability to avoid data loss in common
failure scenarios, not just in the bulleted features or admin-tool
GUIs that the Windows admins understand.  For example, among
dirt-cheap RAID5's, whether or not there's a battery-backed write
cache influences how likely they are to corrupt the filesystem on
top, while the Windows admins probably think it's there for
``performance issues'' or something.  And there are other robustness
features like that.  NetApp has most of them, so if you read a few of
the papers they publish bragging about their features you'll get an
idea how widely the robustness of RAIDs can vary:

 http://pages.cs.wisc.edu/~krioukov/Krioukov-ParityLost.pdf


Advantages of ZFS-on-some-other-RAID:

 * probably better performance for RAID6 than for raidz2.  It is not
   because of ``checksum overhead''.  It's because RAID6 can satisfy
   a small random read from a single disk, while raidz2 spreads every
   block across the whole stripe, mostly doing full-stripe writes
   like RAID3, so each small read costs a seek on every disk in the
   vdev.

 * Filesystems other than ZFS also perform well on a hardware RAID,
   which maybe protects your investment somewhat: the box isn't tied
   to one filesystem.

 * better availability, sometimes.  

   It's important to distinguish between availability and data loss.
   Availability means that when a disk goes bad, applications don't
   notice.  Data loss is about what happens AFTER the system panics
   and the sysadmin notices, unplugs bad drives, replaces things, and
   reboots: is the data still there or not?  And is *ALL* the data
   there, satisfying ACID rules for databases and MTAs, or just most
   of it?  Because it is supposed to be ALL there.  Even when you
   have to intervene, that is not a license to break all the promises
   and descend into chaos.

   Some (probably most) hardware RAID has better availability than
   ZFS: it is better at handling failing or disconnected drives.  But
   ZFS on JBOD is probably better at avoiding data loss than ZFS on
   HWRAID.

 * better exception handling, sometimes.  Really fantastically good
   hardware RAID can handle the case of two mirrored disks, both with
   unreadable sectors, so long as none of the unreadable sectors are
   in common.  Partial disk failures are the common case, not the
   exception, yet the lower-quality RAID implementations most of us
   are stuck with treat disks as either good or bad, as a whole unit.

   I expect most hardware RAIDs don't have much grace in this
   department either, though.  It seems really difficult to do well,
   because disks lock up when you touch their bad blocks, so to
   gracefully extract information from a partially-failed disk you
   have to load special firmware into the drive and/or keep a list of
   poison blocks which you must never accidentally read.

   My general procedure for recovering hardware RAIDs is: (0) shut
   down the RAID and use it as JBOD, (1) identify the bad disks and
   buy blank disks to replace them, (2) copy the bad disks onto the
   blank disks using dd_rescue or dd conv=noerror,sync (sketched
   below), and (3) run the RAID recovery tool.  There are a lot of
   bad things about this procedure!  It denies the RAID layer access
   to the disks' reports about which sector is bad, which leads to
   parity pollution and silent corruption that slips through the RAID
   layer, discussed below.  It can also fail outright, if the RAID
   identifies disks the way ZFS sometimes does, by devid/serialno
   rather than by a disk label or port number.

   But RAID layers read a ``bad sector'' report as if it were in fact
   a ``bad disk!'' report, so with the very cheap RAIDs I've used,
   this corruption-prone 0-1-2-3 procedure, which works around the
   bad exception handling, saves more data than the supported
   procedure.
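
   Here is roughly what step (2) looks like; this is only a sketch,
   and the device names are made up, so check which raw device is
   which before copying anything:

      # copy the failing member onto a fresh blank disk, skipping
      # unreadable sectors and padding them with zeros so offsets
      # stay aligned (a small bs keeps the zeroed hole to one sector)
      dd if=/dev/rdsk/c2t3d0s2 of=/dev/rdsk/c2t4d0s2 \
         bs=512 conv=noerror,sync

      # or the basic dd_rescue form, which retries and reports bad
      # spots as it goes
      dd_rescue /dev/rdsk/c2t3d0s2 /dev/rdsk/c2t4d0s2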

 * nonvolatile cache.  Some cheap hardware RAID gives you a
   battery-backed write cache; getting the equivalent the ZFS way,
   with a separate slog device, can be more expensive.
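
   For comparison, adding a slog to a pool is a one-liner; a minimal
   sketch, with made-up pool and device names:

      # dedicate an NVRAM card or supercap-backed SSD as the log
      zpool add tank log c4t0d0

      # better: mirror the slog, since on older ZFS releases losing
      # an unmirrored slog device can cost you the pool
      zpool add tank log mirror c4t0d0 c4t1d0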

Advantages of ZFS-on-JBOD:

 * unified admin tool.  There is one less layer, so you can administer
   everything with zpool.  With hardware RAID you will have to use
   both zpool and the hardware RAID's configurator.
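
   For example, with ZFS-on-JBOD the whole disk lifecycle stays in
   one tool.  A sketch, with hypothetical pool and device names:

      zpool status -x               # is anything unhealthy?
      zpool replace tank c1t2d0     # resilver onto a new disk in that slot
      zpool offline tank c1t3d0     # retire a flaky member for now

   With hardware RAID, the replace/offline half of that moves into
   the vendor's configurator.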

 * most hardware RAID has no way to deal with silent corruption.  It
   can deal with latent sector errors, where the drive explicitly
   reports a failed read, but it has no way to deal with drives that
   silently return incorrect data.  According to NetApp this happens
   in real life in a huge number of patterns, which they have names
   for, like ``torn writes'' and ``misdirected writes'' and so on.
   RAID5's read-modify-write behavior can magnify this type of error
   through ``parity pollution.''

   
   http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

   I think there are some tricks in Solaris for Veritas and SVM
   mirrors, which applications like Oracle can use if they do
   checksumming above the filesystem level.  Oracle can say, ``no, I
   don't like that block.  Can you roll the dice again?  What else
   might be at that lseek offset?''  See DKIODMR under dkio(7I).  I'm
   not sure in exactly what circumstances the tricks work, but I am
   sure they will NOT work on hardware RAID-on-a-card.  Also, the
   tricks are not used by ZFS, so if you insert ZFS between your RAID
   and your application, the DKIODMR-style checksum redundancy is
   blocked off.  The idea is that you should use ZFS checksums and
   ZFS redundancy instead, but you can't do that by feeding ZFS a
   single LUN.
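
   To make that last sentence concrete, here's a sketch with made-up
   LUN names.  In the first form ZFS can detect silent corruption but
   has nothing to repair it from; in the second it holds the
   redundancy itself, so a block that fails its checksum is re-read
   from the other side and rewritten:

      # one big hardware-RAID LUN: checksums detect bad blocks, but
      # there is no second copy to heal them from
      zpool create tank c2t0d0

      # or: two LUNs (better yet, raw disks) mirrored by ZFS
      zpool create tank mirror c2t0d0 c3t0d0

      # either way, this is how silent corruption shows up:
      zpool scrub tank
      zpool status -v tank    # the CKSUM column counts blocks that
                              # failed their checksums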

 * hardware RAID ties your data to a RAID implementation.  This is
   more constraining than tying it to ZFS, because the Nexenta and
   OpenSolaris licenses allow you to archive the ZFS software
   implementation and share it with your friends.  This has bitten
   several of my friends, particularly with ``RAID-on-a-card''
   products.  It is really uncool to lose your entire dataset to a
   bad controller, and then be unable to obtain the same controller
   with the same software revision, because the hardware has gone
   through so many steppings, and the software isn't freely
   redistributable or archivable, and it may not even be clear that a
   copy exists at all.  Sometimes the configurator tool is very
   awkward to run.  There may even be a remedial configurator in the
   firmware and a full-featured configurator which is difficult to
   run just when you need it most, when recovering from a failure.

   Some setups are even worse: it's unclear where they store the RAID
   metadata, on the diskset or in the controller, so you could lose
   the whole dataset because of a few bits in some $5 lithium-battery
   RTC chip.  Being a good sysadmin in this environment requires
   levels of distrust that are really unhealthy.

   The more expensive implementations hold your data hostage to a
   support contract, so while you are paying to buy the RAID you are
   in fact more like renting a place to put your data, and they
   reserve the right to raise your rent.  Without the contract you
   don't just lose their attention---you cannot get basic things like
   older software, or manuals for the software you have now, and they
   threaten to sue you for software license violation if you try to
   sell the hardware to someone else.

 * there are bugs either in ZFS or in the overall system that make ZFS
   much more prone to corruption when it's not managing a layer of
   redundancy.

   In particular, we know ZFS does not work well when the underlying
   storage reports writes as committed to disk when they're not, and
   this problem seems to be rampant:

    * some SATA disks do it, but the people who know which ones aren't
      willing to tell.  They only say ``a Major vendor.''

    * Sun's sun4v hypervisor does it for virtualized disk access on
      the T2000.

    * Sun's iSCSI target does it (iscsitadm; not sure yet about
      COMSTAR).

    * We think many PeeCee virtualization platforms like VirtualBox
      and VMware might do it.

    * Linux does it if you are using disks through LVM2.

   The problem is rampant, seems to be more dangerous to ZFS than to
   other filesystems, and progress in tracking it down and fixing it
   is glacial.  Giving ZFS a mirror or a raidz seems to improve its
   survival.

   To work around this with hardware RAID, you need to make a zpool
   that's a mirror of two hardware RAIDsets.  This wastes a lot of
   disk.  If you had that much disk with ZFS JBOD, you could make
   much better use of it as a filesystem-level backup, like backing
   up with rsync to a non-ZFS filesystem, or to a separate zpool with
   'zfs send | zfs recv' (sketched below).  You really need this type
   of backup with ZFS because of the lack of fsck and the huge number
   of panics and assertions.
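
   A rough sketch of both options, with made-up pool, LUN, and
   dataset names; the first spends half the space on the mirror, the
   second spends it on an independent copy:

      # option 1: ZFS mirror across two hardware RAIDsets, each
      # exported as a single LUN
      zpool create tank mirror c2t0d0 c3t0d0

      # option 2: keep the space, but take filesystem-level backups
      # to a second, independent pool
      zfs snapshot tank/data@nightly
      zfs send tank/data@nightly | zfs recv backup/data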

HTH.

By my reckoning the consensus best practice is usually JBOD right now.
When I defend ZFS-over-RAID5 it is mostly because I think the poor
availability during failures and the corruption bugs discussed in the
last point need to be tracked down and squashed.


Here's my list of papers that have been mentioned on this list, so you
can catch up.  You can also dump all the papers on the annoying
Windows admins, and when they say ``I really think hardware RAID has
fewer `issues' because I just think so.  It's your job to prove why
not,'' then you can answer ``well have you read the papers?  No?  Then
take my word for it.''  If they doubt the papers' authority, then cite
the price the people writing the papers charge for their
services---the Windows admins should at least understand that.

 
 http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
 http://www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.ps
 http://labs.google.com/papers/disk_failures.html
 http://pages.cs.wisc.edu/~krioukov/Krioukov-ParityLost.pdf
 http://www.nber.org/sys-admin/linux-nas-raid.html
