>>>>> "vf" == Vincent Fox <[EMAIL PROTECTED]> writes:

    vf> Because arrays & drives can suffer silent errors in the data
    vf> that are not found until too late.  My zpool scrubs
    vf> occasionally find & FIX errors that none of the array or
    vf> RAID-5 stuff caught.

well, just to make it clear again:

 * some people on the list believe an incrementing count in the CKSUM
   column means ZFS is protecting you from other parts of the storage
   stack mysteriously failing.

 * others (me) believe CKSUM counts are often but not always the
   latent symptom of corruption bugs in ZFS.

They make guesses about what other parts of the stack might fail,
sometimes desperate ones like ``failure on the bus between the ECC RAM
controller and the CPU,'' and I make guesses about corruption bugs in
ZFS.  I call their guesses implausible, and they say mine amount to
``I don't believe it happened unless it happened in a way that's
convenient to debug.''

anyway, as an example that it does happen: I can make CKSUM errors by
saying 'iscsiadm remove discovery-address 1.2.3.4' to take down the
target on one half of a mirror vdev.  When the target comes back, it
onlines itself, I scrub the pool, and that target accumulates CKSUM
errors.  But what happened isn't ``silent corruption''.  It's plain
old resilvering.  And ZFS resilvers without requiring a manual scrub
and without counting latent CKSUM errors if I take down half the
mirror in some other way, such as 'zpool offline'.  There are probably
other scenarios that produce latent CKSUM errors (e.g., almost the
same thing in bug 6675685: fault a device, shut down, fix the device,
boot, scrub), but my intuition is that a whole class of ZFS bugs will
manifest itself with this symptom.  At least the one I just described
should be reproducible in Sol10u5 if you want to test it.
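
In case it helps anyone reproduce it, the sequence looks roughly like
this (the pool name and discovery address are made up, and your exact
device names and output will differ):

    zpool status tank       # two-way mirror, one half on an iSCSI target, CKSUM all 0
    iscsiadm remove discovery-address 1.2.3.4   # drop the target behind one half
    # ...keep writing to the pool while it's degraded...
    iscsiadm add discovery-address 1.2.3.4      # bring the target back; the vdev onlines itself
    zpool scrub tank
    zpool status tank       # CKSUM count climbs on the iSCSI half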

Maybe this is too much detail for Bill and his snarky ``buffer
overflow reading your message'' comments, and too much speculation for
some others, but the point is:

  ZFS indicating an error doesn't automatically mean there's no
  problem with ZFS.

  -and-

  You should use zpool-level redundancy with ZFS on SAN, meaning a
  mirror across different LUNs rather than just copies=2, because
  experience here shows that you're less likely to lose an entire pool
  to metadata corruption if you have this kind of redundancy.  There's
  some dispute about the ``why'', but whether or not you do it (and
  especially if you don't), be sure to have some kind of real backup:
  not just snapshots and mirrors, and not 'zfs send' blobs either.
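
  For what it's worth, the difference looks like this (the pool,
  dataset, and device names here are invented; substitute your own
  LUNs):

      # zpool-level redundancy: a mirror vdev across two separate LUNs
      zpool create tank mirror c4t0d0 c5t0d0

      # NOT a substitute: copies=2 stores both copies on the same LUN,
      # so losing that LUN still loses both copies
      zfs set copies=2 tank/data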
