>>>>> "rs" == Ross Smith <[EMAIL PROTECTED]> writes: >>>>> "nw" == Nicolas Williams <[EMAIL PROTECTED]> writes:
rs> I disagree Bob, I think this is a very different function to
rs> that which FMA provides.

I see two problems.

(1) FMA doesn't seem to work very well, and was used as an excuse to
keep proper exception handling out of ZFS for a couple of years, so I'm
sort of... skeptical whenever it's brought up as a panacea.

(2) The FMA model of collecting telemetry, taking it into user-space,
chin-strokingly contemplating it for a while, then decreeing a
diagnosis, is actually a rather limited one. I can think of two kinds
of limit:

(a) you're diagnosing the pool FMA is running on. FMA is on the root
pool, but the root pool won't unfreeze until FMA diagnoses it. In
practice it's much worse, because problems in one pool's devices can
freeze all of ZFS, even other pools. Or if the system is NFS-rooted and
also exporting ZFS filesystems over NFS, maybe all of NFS freezes?
Problems like that knock out FMA. Diagnosis in the kernel is harder to
knock out.

(b) calls are sleeping uninterruptibly in the path that returns events
to FMA. ``Call down into the controller driver, wait for it to return
success or failure, then count the event and call back to FMA as
appropriate. If something's borked, FMA will eventually return a
diagnosis.'' This plan is useless if the controller just freezes: FMA
never sees anything. You are analyzing faults, yes, but you can only do
it with hindsight. When do you do the FMA callback? To implement this
timeout, you'd have to do a callback before and after each I/O, which
is obviously too expensive. Likewise, when FMA returns the diagnosis,
are you prepared to act on it? Or are you busy right now, and you're
going to act on it just as soon as that controller returns success or
failure? You can't abstract the notion of time out of your diagnosis.
Trying to compartmentalize it interferes with working it into low-level
event loops in a way that's sometimes needed. It's not a matter of
where things taxonomically belong, of where it feels clean to put some
functionality in your compartmentalized layered tower. Certain things
just aren't achievable from certain places.

nw> If we're talking isolated, or even clumped-but-relatively-few
nw> bad sectors, then having a short timeout for writes and
nw> remapping should be possible

I'm not sure I understand the state machine for the remapping plan,
but I think your idea is: try to write to some spot on the disk. If it
takes too long, cancel the write and try writing somewhere else
instead. Then do bad-block remapping: fix up all the pointers for the
new location, mark the spot that took too long as poisonous, all that.
(I've sketched this below.)

I don't think it'll work. First, you can't cancel the write. Once you
dispatch a write that hangs, you've locked up, at a minimum, the drive
trying to write. You don't get the option of remapping and writing
elsewhere, because the drive has stopped listening to you. Likely
you've also locked up the bus (if the drive's on PATA or SCSI), or
maybe the whole controller. (This is IMHO the best reason for laying
out a RAID to survive a controller failure---interaction with a bad
drive could freeze a whole controller.)

Even if you could cancel the write, when do you cancel it? If you can
learn your drive and controller so well that you can convince them to
ignore you for 10 seconds instead of two minutes when they hit a block
they can't write, you've still got approximately the same problem,
because you don't know where the poison sectors are. You'll probably
hit another one.
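To make it concrete which step I think breaks, here's a minimal sketch
of the plan as I read it. All of these names---try_write(),
cancel_write(), remap_block(), and so on---are made up for
illustration; none of this is real driver or vdev code, and the stubs
only exist so it compiles:

/*
 * Hypothetical sketch only.  try_write(), cancel_write(), remap_block(),
 * pick_spare_location(), and mark_poisonous() are made-up stand-ins for
 * whatever driver/vdev interfaces would be involved.  The point is step
 * (2): cancel_write() has no real equivalent once the drive goes silent.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WRITE_TIMEOUT_SECS 10          /* the proposed "short" write timeout */

typedef uint64_t lba_t;

/* Stubs so the sketch compiles; they pretend the drive never answers. */
static bool try_write(lba_t lba, const void *buf, unsigned timeout_secs)
{ (void)lba; (void)buf; (void)timeout_secs; return (false); }
static bool cancel_write(lba_t lba)    /* the step that can't really exist */
{ (void)lba; return (false); }
static lba_t pick_spare_location(void) { return (1234567); }
static void remap_block(lba_t bad, lba_t good) { (void)bad; (void)good; }
static void mark_poisonous(lba_t bad) { (void)bad; }

static bool
write_with_remap(lba_t lba, const void *buf)
{
	/* (1) try the write with a short deadline */
	if (try_write(lba, buf, WRITE_TIMEOUT_SECS))
		return (true);

	/*
	 * (2) "cancel" it.  In practice the drive---and likely the bus or
	 * the whole controller---has stopped listening, so there is
	 * nothing left to cancel against.
	 */
	if (!cancel_write(lba))
		return (false);

	/* (3) write somewhere else, fix up pointers, poison the old spot */
	lba_t spare = pick_spare_location();
	if (!try_write(spare, buf, WRITE_TIMEOUT_SECS))
		return (false);
	remap_block(lba, spare);
	mark_poisonous(lba);
	return (true);
}

int
main(void)
{
	char buf[512] = { 0 };

	printf("write_with_remap: %s\n",
	    write_with_remap(42, buf) ? "ok" : "stuck/failed");
	return (0);
}

And that's before considering what the timeout does to performance: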
Even a ten-second write means the drive's performance is shot by almost
three orders of magnitude---it's not workable.

Finally, this approach interferes with diagnosis. The drives have their
own retry state machine. If you start muddling all this ad-hoc stuff on
top of it, you can't tell the difference between drive failures, cabling
problems, and controller failures. You end up with normal thermal
recalibration events being treated as some kind of ``spurious late
read'' and inventing all these strange unexplained failure terms, which
makes it impossible to write a paper like the NetApp or Google papers on
UNCs we used to cite in here all the time, because your failure
statistics no longer correspond to a single layer of the storage stack
and can't be compared with others' statistics.

Also, remember that we suspect, and wish to tolerate, drives that
operate many standard deviations outside their specification, even when
they're not broken or suspect or about to break. There are two reasons.
First, we think they might do it. Second, otherwise you can't collect
performance statistics you can compare with others'. That's why the
added failure handling I suggested is only to ignore drives---either for
a little while, or permanently. Merely ignoring a drive, without telling
the drive you're ignoring it, doesn't interfere with collecting
statistics from it.

The two queues inside the drive (retryable and deadline) would let you
do this bad-block remapping, but no drive implements it, and it's
probably impossible to implement because of the sorts of things drives
do while ``retrying''. I described the drive-QoS idea to explain why
this B_FAILFAST-ish plan of supervising the drive's recovery behavior,
or any plan involving ``cancelling'' CDBs, is never going to work.

Here is one variant of this remapping plan I think could work, which
somewhat preserves the existing storage stack layering (a rough sketch
of the gating is at the end of this message):

 * add a timeout to B_FAILFAST CDBs above the controller driver, a
   short one like a couple of seconds.

 * when a drive is busy on a non-B_FAILFAST transaction for longer than
   the B_FAILFAST timeout, walk through the CDB queue and instantly
   fail all the B_FAILFAST transactions, without even sending them to
   the drive.

 * when a drive blows a B_FAILFAST timeout, admit no more B_FAILFAST
   transactions until it successfully completes a non-B_FAILFAST
   transaction. If the drive is marked timeout-blown, and no
   transactions are queued for it, wait 60 seconds and then make up a
   fake transaction for it, like ``read one sector in the middle of the
   disk.''

I like the vdev-layer ideas better than the block-layer ideas, though.

nw> What should be the failure mode of a jbod disappearing due to
nw> a pulled cable (or power supply failure)? A pause in
nw> operation (hangs)? Or faulting of all affected vdevs, and if
nw> you're mirrored across different jbods, incurring the need to
nw> re-silver later, with degraded operation for hours on end?

The resilvering should only include things written during the outage,
so the degraded operation will last some time proportional to the
outage. Resilvering is already supposed to work this way.

The argument, I think, will be over the idea of auto-onlining things.
My opinion: if you are dealing with failure by deciding to return
success to fsync() with fewer copies of the data written, then this
should require either a spare rebuild or manually issuing 'zpool clear'
to get back to normal.
Certain kinds of rocking behavior, like changes to the mirror
round-robin or delaying writes of non-fsync() data, are okay, but
rocking back and forth between redundancy states automatically during
normal operation is probably unacceptable. The counter-opinion, I
suppose, might be that we get a better MTTDL by writing as quickly as
possible to as many places as possible, so automatic onlining is good.
But I don't think so.
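PS: since I pointed at a sketch of the B_FAILFAST gating above, here it
is. The types and names (ff_drive_t, ff_io_t, ff_admit(), and so on)
are invented for illustration---this isn't sd or ZFS code, just the
three rules from the list above written out so the per-drive state is
explicit:

/*
 * Hypothetical sketch of the B_FAILFAST gating rules.  ff_drive_t,
 * ff_io_t, ff_admit(), ff_complete(), and ff_needs_probe() are made up;
 * nothing here is a real sd/ZFS interface.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define FAILFAST_TIMEOUT_SECS   2    /* short timeout for B_FAILFAST CDBs */
#define PROBE_IDLE_SECS        60    /* idle time before the fake probe read */

typedef struct ff_drive {
	bool	timeout_blown;   /* drive blew a B_FAILFAST timeout */
	time_t	busy_since;      /* start of current non-failfast CDB, 0 if idle */
	time_t	last_activity;   /* last queue/completion activity */
} ff_drive_t;

typedef struct ff_io {
	bool	failfast;        /* B_FAILFAST set on this CDB */
} ff_io_t;

/*
 * Rules 1 and 2: fail B_FAILFAST CDBs instantly, without sending them to
 * the drive, if the drive is marked timeout-blown or has been stuck on a
 * non-failfast CDB longer than the short timeout.
 */
static bool
ff_admit(const ff_drive_t *d, const ff_io_t *io, time_t now)
{
	if (!io->failfast)
		return (true);
	if (d->timeout_blown)
		return (false);
	if (d->busy_since != 0 && now - d->busy_since > FAILFAST_TIMEOUT_SECS)
		return (false);
	return (true);
}

/* Rule 3: only a successful non-failfast completion clears the mark. */
static void
ff_complete(ff_drive_t *d, const ff_io_t *io, bool ok, time_t now)
{
	d->busy_since = 0;
	d->last_activity = now;
	if (ok && !io->failfast)
		d->timeout_blown = false;
}

/*
 * If the drive is marked and has sat idle for a while, it's time to make
 * up a fake transaction (``read one sector in the middle of the disk'')
 * to give it a chance to clear itself.
 */
static bool
ff_needs_probe(const ff_drive_t *d, time_t now)
{
	return (d->timeout_blown && d->busy_since == 0 &&
	    now - d->last_activity >= PROBE_IDLE_SECS);
}

int
main(void)
{
	time_t now = time(NULL);
	ff_drive_t d = { .timeout_blown = true, .busy_since = 0,
	    .last_activity = now - PROBE_IDLE_SECS };
	ff_io_t io = { .failfast = true };

	printf("admit B_FAILFAST while marked? %d\n", ff_admit(&d, &io, now));
	printf("needs probe read? %d\n", ff_needs_probe(&d, now));

	/* a successful non-failfast completion clears the mark */
	ff_io_t sync_io = { .failfast = false };
	ff_complete(&d, &sync_io, true, now);
	printf("admit B_FAILFAST after recovery? %d\n", ff_admit(&d, &io, now));
	return (0);
}

The only state this needs is per-drive: the timeout-blown mark, when the
current non-failfast CDB started, and when the drive last did anything.
Nothing has to be cancelled, and nothing is sent to a drive we've
decided to ignore.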