>>>>> "ic" == Ian Collins <[EMAIL PROTECTED]> writes:
ic> I'd use mirrors rather than raidz2. You should see better
ic> performance.

The problem is that it's common for a very large drive to have
unreadable sectors.  This can happen simply because the drive is so
big that its bit-error rate matters, but usually it happens because
the drive is starting to go bad and you don't realize it, because you
haven't been scrubbing it weekly.  Then, when some other drive
actually does fail hard, you notice and replace the hard-failed
drive, and you're forced to do an implicit scrub, and THEN you
discover the second failed drive, too late for mirrors or raidz to
help.

http://www.opensolaris.org/jive/message.jspa?messageID=255647&tstart=0#255647

If you don't scrub, in my limited experience this situation is the
rule rather than the exception, especially with digital video from
security cameras and backups of large DVD movie collections---where
most blocks don't get read for years unless you scrub.
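Scheduling the weekly scrub is a one-liner, by the way.  A minimal
sketch (the pool name 'tank' and the Sunday 03:00 slot are
placeholders; substitute your own):

    # root's crontab: scrub every Sunday at 03:00,
    # then eyeball 'zpool status -v' for errors afterward
    0 3 * * 0 /usr/sbin/zpool scrub tank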
ic> you really can grab two of the disks and still leave behind a
ic> working file server!

This really works with a 4-disk raidz2, too.  I don't fully
understand ZFS's quorum rules, but I have tried a 4-disk raidz2 pool
running on only 2 disks.  You're right, though, that it doesn't work
quite as simply as two 2-disk mirrors.  Since I have half my disks in
one tower and half in another, each tower connected to the ZFS host
with iSCSI, I often want to shut down one whole tower without
rebooting the ZFS host.  I find I can do that with mirrors, but not
with 4-disk raidz2.  I'll elaborate.

The only shitty thing is, zpool will only let you offline one of the
four disks.  When you try to offline the second, it says ``no valid
replicas.''  A pair of mirrors doesn't have that problem.  But if you
forcibly take two disks away from a 4-disk raidz2, the pool does keep
working as promised.  The next problems come after you give the two
disks back:

1. zpool shows all four disks ONLINE, and then resilvers.  There's
   no indication of which disks are being resilvered and which are
   already ``current,'' though---it just shows all four as ONLINE.
   So you don't know which two disks absolutely cannot be
   removed---which are the target of the resilver and which are the
   source.  SVM used to tell you this.  What happens when a disk
   fails during the resilver?  Does something different happen
   depending on whether it's an up-to-date disk or a resilveree?
   Probably worth testing, but I haven't.

   Secondly, if you have many 4-disk raidz2 vdevs, there's no
   indication of which vdev is being resilvered.  If I have 20
   vdevs, I may very well want to proceed to another vdev, offline
   one disk (or two, damnit!), and maintain it before the resilver
   finishes.  There's not enough information in 'zpool status' to do
   this.  Is it even possible to 'zpool offline' a disk in another
   raidz2 vdev during the resilver, or will it say ``no valid
   replicas''?  I haven't tested; I probably should, but I only have
   two towers so far.

   So, (a) disks which will produce ``no valid replicas'' when you
   attempt to offline them should not be listed as ONLINE in 'zpool
   status'---they're different and should be singled out; and (b)
   the set of such disks should be kept as small as arrangeably
   possible.

2. After the resilver says it's complete, 0 errors everywhere,
   zpool STILL will not let you offline ANY of the four disks, not
   even one: ``no valid replicas.''

3. 'zpool scrub'

4. Now you can offline any one of the four disks.  You can also
   online that disk and offline a different one, as much as you
   like, so long as only one disk is offline at a time (but you're
   supposed to get two!).  You do not need to scrub in between.  If
   you take a disk away forcibly instead of offlining it, you go
   back to step 1 and cannot offline anything without another
   scrub.  (See the condensed transcript after this list.)

5. Insert a 'step 1.5, reboot' or 'step 2.5, reboot', and although
   I didn't test it, I fear checksum errors.  I used to have that
   problem, and 6675685 talks about it.  SVM could handle rebooting
   during a resilver somewhat well.  I fear at least unwarranted
   generosity: I bet a 'step 2.5 reboot' can substitute for the
   'step 3 scrub', letting me use 'zpool offline' again even though
   whatever failsafe was stopping me from using it before can't
   possibly have resolved itself.

   So, (c) the set of disks which produce ``no valid replicas''
   when you attempt to offline them seems to have no valid excuse
   for changing across a reboot, yet I'm pretty sure it does.  Kind
   of annoying and confusing.
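To make the dance in steps 1 through 4 concrete, here's roughly what
the sequence looks like.  The pool and device names are hypothetical;
the error text is the one zpool actually prints:

    zpool offline tank c2t1d0    # first disk: fine
    zpool offline tank c2t2d0    # second disk:
    cannot offline c2t2d0: no valid replicas

    # ...forcibly pull two disks anyway, reattach them, wait...
    zpool status tank            # all four ONLINE, resilvering;
                                 # no hint of source vs. target
    zpool offline tank c2t1d0    # even after the resilver completes:
    cannot offline c2t1d0: no valid replicas

    zpool scrub tank             # step 3
    zpool offline tank c2t1d0    # now it works, but only one at a time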
But if your plan is to stuff two disks in your bag and catch the
next flight to Tel Aviv, my experience says raidz2 should work ok
for that.

 c> 3. burn in the raidset for at least one month before trusting
 c> the disks to not all fail simultaneously.

ic> Has anyone ever seen this happen for real?

Yeah.  Among 20 drives I've bought over five years, I think at least
two have been DoA.  But what's more disturbing: of five drives I
bought in the last three months, two have failed within the first
two weeks.  It's disturbing because the drives which are not DoA
still fail, and now it's after you've had a chance to load data onto
them, AND it's failure with a lot of temporal locality.  Those odds
aren't good enough for a mirror.

Let's blame it on my crappy supply chain, for the sake of argument.
This means the old RAID assumption that drive failures are
independent events doesn't hold, and while raidz2 is better than the
mirror, it doesn't really address this scary two-week window enough.
You need to take steps to make disk failure events more independent,
like buying drives from different manufacturers and different
retailers, shipped in different boxes, and aging them on your own
shelf to get drives from different factory ``batches.''

Anyway, you can find more anecdotes in the archives of this list.
IIRC someone else corroborated that, among non-DoA drives, failures
are more likely in the first month than in the second, but I
couldn't find the post.  I did find Richard Elling's posting of this
paper:

http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

but it does not support my claim about first-month failures.  Maybe
my experience is related to something NetApp didn't have, maybe
related to the latest batch of consumer drives released after that
study, or to the consumer supply chain.  The paper does say that
disk failures aren't independent events, but it blames that on
controllers.  And maybe NetApp is already doing the
aging-in-multiple-warehouses, shipping-diversity scheme I talked
about.

In this paper, note that ``latent sector errors'' and ``checksum
mismatches'' are different things.  ``Latent sector errors'' are
much more common---that's when the disk returns UNC, an
unrecoverable read error.  The disk doesn't return bad data---it
returns an explicit error, possibly after causing other problems by
retrying for 30 seconds, but not silently scrambled data.  AIUI
``checksum mismatches'' are the errors detected by ZFS, and also by
NetApp/EMC/Hitachi gear that uses custom disk firmware with larger
sector sizes, but NOT detected by Areca and other ``hardware RAID''
cards, SVM, LVM2, and so on.

Home users often say things like, ``I don't believe that happened to
you, because I've done exactly that and had absolutely zero problems
with it.  I can't tell you how close to zero the number of problems
I've had with it is, all three times I tried it.  It's so close to
zero, it IS zero, so I think the odds of that happening are
astonishingly low.''  This way of thinking is astonishingly
nonsensical when I phrase it that way, and it's still nonsensical
when you set it next to my claim, ``I've only bought five drives,
not 500,000, but two of them did fail in the first month.''  I keep
running into the ``ABSOLUTELY *NO* problems'' defense, so maybe
there is some oddity about the way I describe the problems I've had
that people find either threatening or hard to believe.
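P.S.  To put a rough number on that two-week window: if you naively
assume independent failures at my observed 2-in-5 early-failure rate
(a big assumption, and correlated ``batches'' only make it worse),
then over four brand-new disks:

    pair of 2-disk mirrors:   1 - (1 - 0.4^2)^2         ~= 29% pool loss
    4-disk raidz2 (3 kills):  C(4,3)*0.4^3*0.6 + 0.4^4  ~= 18% pool loss

which is why I say raidz2 helps but doesn't make the burn-in window
safe.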