>>>>> "ic" == Ian Collins <[EMAIL PROTECTED]> writes:
ic> I'd use mirrors rather than raidz2. You should see better
ic> performance.

The problem is that it's common for a very large drive to have
unreadable sectors.  This can happen simply because the drive is so
big that its bit-error rate matters, but usually it happens because
the drive is starting to go bad and you don't realize it, because you
haven't been scrubbing it weekly.  Then, when some other drive
actually does fail hard, you notice and replace the hard-failed
drive, and you're forced to do an implicit scrub, and THEN you
discover the second failed drive, too late for mirrors or raidz to
help.

http://www.opensolaris.org/jive/message.jspa?messageID=255647&tstart=0#255647

If you don't scrub, in my limited experience this situation is the
rule rather than the exception, especially with digital video from
security cameras and backups of large DVD movie collections---where
most blocks don't get read for years unless you scrub.
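Scheduling the weekly scrub is a one-liner, by the way.  A minimal
sketch (the pool name 'tank' and the Sunday 03:00 slot are
placeholders; substitute your own):

    # root's crontab: scrub every Sunday at 03:00,
    # then eyeball 'zpool status -v' for errors afterward
    0 3 * * 0 /usr/sbin/zpool scrub tank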
ic> you really can grab two of the disks and still leave behind a
ic> working file server!

This really works with a 4-disk raidz2, too.  I don't fully
understand ZFS's quorum rules, but I have tried a 4-disk raidz2 pool
running on only 2 disks.  You're right, though, that it doesn't work
quite as simply as two 2-disk mirrors.  Since I have half my disks in
one tower and half in another, each tower connected to the ZFS host
with iSCSI, I often want to shut down one whole tower without
rebooting the ZFS host.  I find I can do that with mirrors, but not
with 4-disk raidz2.  I'll elaborate.

The only shitty thing is, zpool will only let you offline one of the
four disks.  When you try to offline the second, it says ``no valid
replicas.''  A pair of mirrors doesn't have that problem.  But if you
forcibly take two disks away from a 4-disk raidz2, the pool does keep
working as promised.  The next problems come after you give the two
disks back:

1. zpool shows all four disks ONLINE, and then resilvers.  There's
   no indication of which disks are being resilvered and which are
   already ``current,'' though---it just shows all four as ONLINE.
   So you don't know which two disks absolutely cannot be
   removed---which are the target of the resilver and which are the
   source.  SVM used to tell you this.  What happens when a disk
   fails during the resilver?  Does something different happen
   depending on whether it's an up-to-date disk or a resilveree?
   Probably worth testing, but I haven't.

   Secondly, if you have many 4-disk raidz2 vdevs, there's no
   indication of which vdev is being resilvered.  If I have 20
   vdevs, I may very well want to proceed to another vdev, offline
   one disk (or two, damnit!), and maintain it before the resilver
   finishes.  There's not enough information in 'zpool status' to do
   this.  Is it even possible to 'zpool offline' a disk in another
   raidz2 vdev during the resilver, or will it say ``no valid
   replicas''?  I haven't tested; I probably should, but I only have
   two towers so far.

   So, (a) disks which will produce ``no valid replicas'' when you
   attempt to offline them should not be listed as ONLINE in 'zpool
   status'---they're different and should be singled out; and (b)
   the set of such disks should be kept as small as arrangeably
   possible.

2. After the resilver says it's complete, 0 errors everywhere,
   zpool STILL will not let you offline ANY of the four disks, not
   even one: ``no valid replicas.''

3. 'zpool scrub'

4. Now you can offline any one of the four disks.  You can also
   online that disk and offline a different one, as much as you
   like, so long as only one disk is offline at a time (but you're
   supposed to get two!).  You do not need to scrub in between.  If
   you take a disk away forcibly instead of offlining it, you go
   back to step 1 and cannot offline anything without another
   scrub.  (See the condensed transcript after this list.)

5. Insert a 'step 1.5, reboot' or 'step 2.5, reboot', and although
   I didn't test it, I fear checksum errors.  I used to have that
   problem, and 6675685 talks about it.  SVM could handle rebooting
   during a resilver somewhat well.  I fear at least unwarranted
   generosity: I bet a 'step 2.5 reboot' can substitute for the
   'step 3 scrub', letting me use 'zpool offline' again even though
   whatever failsafe was stopping me from using it before can't
   possibly have resolved itself.

   So, (c) the set of disks which produce ``no valid replicas''
   when you attempt to offline them seems to have no valid excuse
   for changing across a reboot, yet I'm pretty sure it does.  Kind
   of annoying and confusing.
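To make the dance in steps 1 through 4 concrete, here's roughly what
the sequence looks like.  The pool and device names are hypothetical;
the error text is the one zpool actually prints:

    zpool offline tank c2t1d0    # first disk: fine
    zpool offline tank c2t2d0    # second disk:
    cannot offline c2t2d0: no valid replicas

    # ...forcibly pull two disks anyway, reattach them, wait...
    zpool status tank            # all four ONLINE, resilvering;
                                 # no hint of source vs. target
    zpool offline tank c2t1d0    # even after the resilver completes:
    cannot offline c2t1d0: no valid replicas

    zpool scrub tank             # step 3
    zpool offline tank c2t1d0    # now it works, but only one at a time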
But if your plan is to stuff two disks in your bag and catch the
next flight to Tel Aviv, my experience says raidz2 should work ok
for that.

 c> 3. burn in the raidset for at least one month before trusting
 c> the disks to not all fail simultaneously.

ic> Has anyone ever seen this happen for real?

Yeah.  Among 20 drives I've bought over five years, I think at least
two have been DoA.  But what's more disturbing: of five drives I
bought in the last three months, two have failed within the first
two weeks.  It's disturbing because the drives which are not DoA
still fail, and now it's after you've had a chance to load data onto
them, AND it's failure with a lot of temporal locality.  Those odds
aren't good enough for a mirror.

Let's blame it on my crappy supply chain, for the sake of argument.
This means the old RAID assumption that drive failures are
independent events doesn't hold, and while raidz2 is better than the
mirror, it doesn't really address this scary two-week window enough.
You need to take steps to make disk failure events more independent,
like buying drives from different manufacturers and different
retailers, shipped in different boxes, and aging them on your own
shelf to get drives from different factory ``batches.''

Anyway, you can find more anecdotes in the archives of this list.
IIRC someone else corroborated that, among non-DoA drives, failures
are more likely in the first month than in the second, but I
couldn't find the post.  I did find Richard Elling's posting of this
paper:

http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

but it does not support my claim about first-month failures.  Maybe
my experience is related to something NetApp didn't have, maybe
related to the latest batch of consumer drives released after that
study, or to the consumer supply chain.  The paper does say that
disk failures aren't independent events, but it blames that on
controllers.  And maybe NetApp is already doing the
aging-in-multiple-warehouses, shipping-diversity scheme I talked
about.

In this paper, note that ``latent sector errors'' and ``checksum
mismatches'' are different things.  ``Latent sector errors'' are
much more common---that's when the disk returns UNC, an
unrecoverable read error.  The disk doesn't return bad data---it
returns an explicit error, possibly after causing other problems by
retrying for 30 seconds, but not silently scrambled data.  AIUI
``checksum mismatches'' are the errors detected by ZFS, and also by
NetApp/EMC/Hitachi gear that uses custom disk firmware with larger
sector sizes, but NOT detected by Areca and other ``hardware RAID''
cards, SVM, LVM2, and so on.

Home users often say things like, ``I don't believe that happened to
you, because I've done exactly that and had absolutely zero problems
with it.  I can't tell you how close to zero the number of problems
I've had with it is, all three times I tried it.  It's so close to
zero, it IS zero, so I think the odds of that happening are
astonishingly low.''  This way of thinking is astonishingly
nonsensical when I phrase it that way, and it's still nonsensical
when you set it next to my claim, ``I've only bought five drives,
not 500,000, but two of them did fail in the first month.''  I keep
running into the ``ABSOLUTELY *NO* problems'' defense, so maybe
there is some oddity about the way I describe the problems I've had
that people find either threatening or hard to believe.
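P.S.  To put a rough number on that two-week window: if you naively
assume independent failures at my observed 2-in-5 early-failure rate
(a big assumption, and correlated ``batches'' only make it worse),
then over four brand-new disks:

    pair of 2-disk mirrors:   1 - (1 - 0.4^2)^2         ~= 29% pool loss
    4-disk raidz2 (3 kills):  C(4,3)*0.4^3*0.6 + 0.4^4  ~= 18% pool loss

which is why I say raidz2 helps but doesn't make the burn-in window
safe.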