>>>>> "r" == Ross <[EMAIL PROTECTED]> writes:
r> Tom wrote "There was a problem with the SAS bus which caused r> various errors including the inevitable kernel panic". It's r> the various errors part that catches my eye, yeah, possibly, but there are checksums on the SAS bus, and its confirmation of what CDB's have completed should always be accurate. If the problem was ``another machine booted up, and I told the other machine to 'zpool import -f' '' then maybe you have some point. but just tripping over a cable shouldn't qualify as weird, nor should Erik's problem of the FC array losing power or connectivity. These are both within the ``unclean shutdown'' category handled by UFS+log, FFS+softdep, ext3, reiser, xfs, vxfs, jfs, HFS+, ... r> Can fsck always recover a disk? Or if the corruption is r> severe enough, are there times when even that fails? This question is obviously silly. write zeroes over the disk, and now the corruption is severe enough. However fsck can always recover a disk from a kernel panic, or a power failure of the host or of the disks, because these things don't randomly scribble over the disk. (now, yeah, I know I posted earlier a story from Ted Ts'o about SGI hardware and about random disk scribbling as the 5V rail started drooping. yes, I posted that one. but it doesn't happen _that much_. and it doesn't even apply to Tom and Erik's case of a loose SAS cable or tripping over an FC cord.) If the kernel panic was caused by a bug in the filesystem, then you'll say aHA! aaHAh! but then, then it might do the scribbling! Well, yes. so in that case we agree there's a bug in the filesystem. :) You'll say ``but WHAT if the kernel panic was a bug in the DISK DRIVER, eh? eh, then maybe ZFS is not at fault!'' sure, fine, read on. r> I don't see that we have enough information here to really r> compare ZFS with UFS what we certainly have, between Tom and Erik and my own experience with resilvering-related errors accumulating in the CKSUM column when iSCSI targets go away, is enough information that ``you should have had redundant pools'' doesn't settle the issue. Reports of zpool corruption on single vdev's mounted over SAN's would benefit from further investigation, or at least a healthily-suspicious scientific attitude that encourages someone to investigate this if it happens in more favorable conditions, such as inside Sun, or to someone with a support contract and enough time to work on a case (maybe Tom?), or someone who knows ZFS well like Pavel. Also, there is enough concern for people designing paranoid systems to approach them with the view, ``ZFS is not always-consistent-on-disk unless it has working redundancy''---choosing to build a ZFS system the same way as a UFS system without ZFS-level redundancy, based on our experience so far, is not just foregoing some of ZFS's whizz-bang new feeechurs. It's significantly less safe than the UFS system. For as long as the argument remains unsettled, conservative people need to understand that. Conservative people should also understand point (c) below. It sounds to me like Tom's and Erik's problems are more likely ZFS's fault than not. The dialog has gone like this: 1. This isn't within the class of errors ZFS should handle. get redundancy. 2. It sounds to me exactly like the class of error ZFS is supposed to handle. 3. You cannot prove 100% that this is necessarily the class of error ZFS is supposed to handle. Somethinig else might have happened. BTW, did I tell you how good ZFS (sometimes) is at dealing with ``might have happened'' if you give it redundancy? 
It's new, and exciting, and unprecedented! Is that a rabbit over there? Look, a redheaded girl juggling frisbees! What next, you'll drag out screaming Dick Cheney on a chain?

Recapping my view:

 a. it looks like a ZFS problem (okay, okay, PROBABLY a ZFS problem)

 b. it's a big problem

 c. there's no good reason to believe people with redundant pools are immune from it, because they will run into it when they need their redundancy to cover a broken disk.

It also deserves more testing by me: I'm going to back up my smaller 'aboveground' pool and try to provoke it.

r> although I do agree that some kind of ZFS repair tool
r> sounds like it would be useful.

I don't want to dictate architecture when I don't know the internals well. What's immediately important to me is that ZFS handle unclean shutdown rigorously, as most other filesystems claim to and eventually mostly accomplish. This could mean adding an fsck tool, but more likely it will mean simply fixing a bug.

Old computers had to bring up their swap space before fsck'ing big filesystems because the fsck process needed so much memory. The filesystem implementation was a small text segment of fragile code that would panic if it read the wrong bits from the disk, but it was fast and didn't take much memory. It made sense to split the filesystem into two pieces, the fsck piece and the main piece, to conserve the machine's core (and to make the programming simpler). We have plenty of memory for text segments now, so it might make more sense to build fsck into the filesystem: the filesystem should be able to mount any state you would expect a hypothetical fsck tool to handle, mount it almost immediately, and correct any ``errors'' it finds while running. If you want to proactively correct errors, it should do this while mounted. That was the original ZFS pitch, and I think it's not crazy. It's basically what we're supposed to have now, with the ``always consistent on disk'' claim and 'zpool scrub' as the (O(n)?) online fsck-equivalent.

FFS+softdep sort of works this way, too. It's designed to safely mount ``unclean'' filesystems, so in that sense it's ``always consistent.'' It does not roll a log, because there isn't one---it just mounts the filesystem as it was when the cord was pulled, and it can do this with no risk of kernel panicking or odd behavior to userland because of the careful order in which it wrote data before the crash. However, after an unclean shutdown, the filesystem is still considered dirty even though it mounts and works, so FreeBSD then starts the old fsck tool in the background, and that fsck is still O(n^2). so... FFS+softdep sort of follows the new fsck-less model, where the filesystem is one unified piece that does all its work after mounting, but follows it clumsily because it's reusing the old FFS code and on-disk format.

From my non-developer perspective, there seem to be the equivalent of mini-FFS+softdep-style fscks inside ZFS already. Sometimes when a mirror component goes away, ZFS does (what looks in 'zpool status' like) a mini-resilver on the remaining component. There's no redundancy left in the vdev, so there's nothing to actually resilver from. Maybe this has to do with the quorum rules or the (seemingly broken) dirty region logging, both of which I still don't understand.
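Here is roughly the kind of throwaway experiment I have in mind before touching the real pool: a scratch pool on file-backed vdevs, purely to watch what scrub and the mini-resilver do. The pool name and paths are made up, and nothing about it is authoritative; it's a sketch, not a recipe.

    # scratch mirror on file vdevs (256m each is plenty for ZFS)
    mkfile 256m /var/tmp/vdev0 /var/tmp/vdev1
    zpool create scratchpool mirror /var/tmp/vdev0 /var/tmp/vdev1

    # make one side ``go away'' and see what 'zpool status' says
    # about the remaining component
    zpool offline scratchpool /var/tmp/vdev0
    zpool status -v scratchpool

    # bring it back, then run the online fsck-equivalent
    zpool online scratchpool /var/tmp/vdev0
    zpool scrub scratchpool
    zpool status -v scratchpool     # wait for the scrub to finish

    # clean up
    zpool destroy scratchpool
    rm /var/tmp/vdev0 /var/tmp/vdev1

whether that reproduces the mini-resilver or the CKSUM-column weirdness I saw with iSCSI, I don't know yet; that's the point of trying it.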
And there is also my old problem of 'zpool offline' reporting ``no valid replicas'' until I've done a scrub, after which 'zpool offline' works again (a rough sketch of the sequence is at the bottom of this message). so a scrub is not really a purely proactive thing: buried inside ZFS there is some notion of dirtiness preventing my 'zpool offline', and a successful scrub clears the dirty bit (as do, possibly, other things, like rebooting :( ).

so, the architecture might be fine as-is, since scrub is already a little more than what it claims to be and is doing some sort of metadata- or RAID-level fsck-ing. I wouldn't expect the fix for these corrupt single-vdev pools to come in some specific form based on prejudices from earlier filesystems.

Now, there is another tool Anton mentioned, a recovery or forensic tool: one that leaves the filesystem unmounted, treats the disks as read-only, and tries to copy data out of it onto a new filesystem. If there were going to be a separate tool---say, something to handle disks that have been scribbled on, or fixes for problems that are really tricky or logically inappropriate to deal with on the mounted filesystem---I think a forensic/recovery tool makes more sense than an fsck. If this odd stuff isn't supposed to happen, and it has happened anyway, you want a tool you can run more than once. You want the chance to improve the tool and run it again, or to try an older version of the tool if the current one keeps crashing. I'm just really far from convinced that Tom needs this tool.

r> To me, it sounds like Sun have designed ZFS to always know if
r> there is corruption on the disk, and to write data in a way
r> that corruption of the whole filesystem *should* never happen.

``sounds like'' depends on what you're listening to. If you're listening to Sun's claims, then yes, of course that's exactly what they claim. If you're listening to experience on this list, it sounds different. The closest we've come is, we agree I haven't completely invalidated the original claims, which is pretty far from making me believe them again.
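For the record, the 'zpool offline' dance I mentioned above goes roughly like this. Pool and disk names are placeholders and I'm reconstructing it from memory, so treat it as a sketch of the symptom, not a transcript:

    # on a mirrored pool that has seen some trouble:
    zpool offline tank c1t3d0
    # ...fails, complaining ``no valid replicas'', even though the
    # other half of the mirror is sitting right there

    zpool scrub tank
    zpool status tank       # wait until the scrub completes

    zpool offline tank c1t3d0
    # ...now it succeeds.  whatever dirtiness was blocking the
    # offline, the scrub (or, possibly, a reboot) cleared it.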