>>>>> "vf" == Vincent Fox <[EMAIL PROTECTED]> writes:
vf> Because arrays & drives can suffer silent errors in the data
vf> that are not found until too late.  My zpool scrubs
vf> occasionally find & FIX errors that none of the array or
vf> RAID-5 stuff caught.

well, just to make it clear again:

 * some people on the list believe an incrementing count in the CKSUM
   column means ZFS is protecting you from other parts of the storage
   stack mysteriously failing.

 * others (me) believe CKSUM counts are often, but not always, the
   latent symptom of corruption bugs in ZFS.

They make guesses about what other parts of the stack might fail,
sometimes desperate ones like ``failure on the bus between the ECC
RAM controller and the CPU,'' and I make guesses about corruption
bugs in ZFS.  I call their guesses implausible, and they call mine
``i don't believe it happened unless it happened in a way that's
convenient to debug.''

anyway, as an example that it does happen: I can make CKSUM errors by
saying 'iscsiadm remove discovery-address 1.2.3.4' to take down the
target on one half of a mirror vdev.  When the target comes back, it
onlines itself, I scrub the pool, and that target accumulates CKSUM
errors (a step-by-step sketch is at the end of this message).  But
what happened isn't ``silent corruption''.  It's plain old
resilvering.  And ZFS resilvers without requiring a manual scrub and
without counting latent CKSUM errors if I take down half the mirror
in some other way, such as 'zpool offline'.

There are probably other scenarios that make latent CKSUM
errors---e.g., almost the same thing: fault a device, shut down, fix
the device, boot, scrub, as in bug 6675685---but my intuition is that
a whole class of ZFS bugs will manifest themselves with this symptom.
At least the one I just described should be reproducible on Sol10u5
if you want to test it.

Maybe this is too much detail for Bill and his snarky ``buffer
overflow reading your message'' comments, and too much speculation
for some others, but the point is:

  ZFS indicating an error doesn't automatically mean the problem lies
  somewhere other than ZFS itself.

-and-

  You should use zpool-level redundancy (as in different LUNs, not
  just copies=2) with ZFS on a SAN, because experience here shows
  you're less likely to lose an entire pool to metadata corruption if
  you have this kind of redundancy.

There's some dispute about the ``why'', but if you don't do it (and
also if you do, but definitely if you don't), be sure to have some
kind of real backup---not just snapshots and mirrors, and not
'zfs send' blobs either.
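For anyone who wants to reproduce the iscsi scenario above, the
sequence looks roughly like this.  It is only a sketch: the pool name
'tank' and device name 'c2t1d0' are placeholders for your own, and
1.2.3.4 stands in for the target's discovery address.

  # take down the target backing one half of the mirror
  iscsiadm remove discovery-address 1.2.3.4

  # write some data into the pool while that half is gone, then
  # bring the target back; the device onlines itself
  iscsiadm add discovery-address 1.2.3.4

  # scrub and watch the CKSUM column climb for that device
  zpool scrub tank
  zpool status -v tank

Compare with taking the same half down via 'zpool offline tank
c2t1d0' and later 'zpool online tank c2t1d0', which resilvers on its
own and leaves the CKSUM column alone.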