>>>>> "dc" == Daniel Carosone <d...@geek.com.au> writes:

    dc> single-disk laptops are a pretty common use-case.

It does not help this case. 

It helps the case where a single laptop disk fails and you recover it
with dd conv=noerror,sync.  This case is uncommon because few people
know how to do it, or bother.  Even if you do know how, it should
never be part of your plan, because it will only help maybe half the
time: it's silly to invest in this case.
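(For concreteness, the recovery technique being dismissed here is the
classic error-skipping disk image.  A minimal sketch, demonstrated on
ordinary files so it can run anywhere; on a real failing disk the
if=/of= arguments would be device paths, which are assumed here:)

```shell
# noerror: keep reading past unreadable blocks instead of aborting.
# sync: pad every short/failed input block with NULs to the full block
# size, so later data stays at the correct offsets in the image.
printf 'hello' > /tmp/src.img
dd if=/tmp/src.img of=/tmp/dst.img bs=512 conv=noerror,sync 2>/dev/null
# The 5-byte input is zero-padded to one full 512-byte block.
wc -c < /tmp/dst.img
```

On a real disk you would use bs=512 (or the native sector size) and
accept that every unreadable sector becomes a block of zeros in the copy.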

    dc> As an aside, there can be non-device causes of this,
    dc> especially when sharing disks with other operating systems,
    dc> booting livecd's and etc.

solution in search of a problem, as opposed to operational experience.

The copies= feature is not so new that we need to speculate this
optimistically.  In practice, ``it's generally not useful'' is the
best advice you can give, because right now the feature is
misunderstood more often than it's used in a realistically helpful
way.
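(For readers unfamiliar with the feature under discussion: copies= is
a per-dataset ZFS property.  A minimal sketch, with a hypothetical
pool/dataset name; note it only affects blocks written after it is
set, which is part of why it disappoints people:)

```shell
# Hypothetical pool/dataset; copies=2 stores two copies of each new
# block, on different regions of the same vdev when only one exists.
zfs set copies=2 tank/laptop-home
zfs get copies tank/laptop-home
```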

    >> * drives do not immediately turn red and start brrk-brrking
    >> when they go bad.  In the real world, they develop latent
    >> sector errors,

    dc> Yes, exactly - at this point, with copies=1, you get a signal
    dc> that your drive is about to go bad, and that data has been
    dc> lost.  With copies=2, you get a signal that your drive is
    dc> about to go bad, but less disruption and data loss to go with
    dc> it.

No, to repeat myself: with copies=2 you get a system that freezes and
crashes oddly, sometimes runs for a while, but can never complete a
'zfs send' of the filesystems.  With copies=1 you get exactly the same
thing.  Imagination does not match experience.

This is what you get even on an x4500: many posters here report that
when a disk starts going bad, you need to find it and remove it
entirely before you can attempt any kind of recovery.
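(On a multi-disk pool like the x4500's, "remove it entirely" means
administratively detaching the suspect device before touching anything
else.  A sketch, with hypothetical pool and device names:)

```shell
# Take the suspect disk out of service so the pool stops issuing I/O
# to it; the pool runs degraded but stops stalling on the bad device.
zpool offline tank c1t3d0
zpool status tank
# After physically swapping in a replacement disk:
zpool replace tank c1t3d0
```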

    dc> I dunno about BER spec, but I have seen sectors go unreadable
    dc> many times.

yes.  obviously.

    dc> Regardless of what you do in response, and how soon you
    dc> replace the drive, copies >1 can cover that interval.

no, you are caught in taxonomic obsession again.  The exposure is not
that parts of the disk gradually go bad in a predictable/controllable
way, with gradually rising probability and a bit of clumpiness you can
avoid by spraying your copies randomly LBA-wise.  It's that the disk
slowly accumulates software landmines that prevent it from responding
to commands in a reasonable way (the response time of each individual
command goes from 30ms to 30 seconds), and that confuse the storage
stack above it into seemingly-arbitrary, highly controller-dependent
odd behavior (crashes, or multiplying the 30 seconds into somewhere
between 180 seconds and a couple of hours).

Once the disk starts going bad, anything you can recover from it is
luck.  Aside from disks with maybe one bad sector, where you can note
which file you were reading when the system froze, reboot, and never
read that file again, I just don't think it matches experience to
believe you will get a chance to read the second copy your copies=2
wrote.  Remember: if the machine is still functioning but its
performance is reduced 1000-fold, it is functionally equivalent to
frozen for all but the most pedantic purposes.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss