>>>>> "tt" == Toby Thain <[EMAIL PROTECTED]> writes:
tt> Why would it be assumed to be a bug in Solaris?  Seems more
tt> likely on balance to be a problem in the error reporting path
tt> or a controller/firmware weakness.

It's not really an assumption.  It's been discussed in here a lot, and we know why it's happening.  It's just a case of ``it's a feature not a bug'' combined with ``somebody else's problem.''  The error-reporting path you mention is inside Solaris, so I have a little trouble decoding your statement.

I wish drives had failure-aware QoS with a split queue for aggressive-retry cdb's and deadline cdb's (I sketch this idea below).  This would make the B_FAILFAST primitive the Solaris developers seem to believe in actually mean something.  Solaris is supposed to have a B_FAILFAST option for block I/O that ZFS could start using to capture vdev-level knowledge like ``don't try too hard to read this block from one device, because we can get it faster by asking another device.''

In the real world B_FAILFAST is IMO quite silly: not exactly useless, but at best deceptive to the higher-layer developer.  Even IF the drive could be told to fail faster than 30 seconds by some future, fancier sd driver, there would still be some fail-slow cdb's hitting the drive, and the two can't be parallelized.  Sending a fail-slow cdb to a drive freezes the drive for up to 30 seconds * <n>, where <n> is the multiplier of some cargo-cult state machine built into the host adapter driver involving ``bus resets'' and other such stuff.  All the B_FAILFAST cdb's queued behind the fail-slow one may as well forget the flag, because the drive's busy with the slow cdb.  If you have even a few of these retryable cdb's peppered into your transaction stream, each expected to take 10-100ms but actually taking one or two MINUTES, the drive will be so slow it'd be more expressive to mark it dead.

What will probably happen in $REALITY is, the sysadmin will declare his machine ``frozen without a panic message'' and reboot it, losing any write-cached data which, if not for this idiocy, could have been committed to other drives in a redundant vdev, and rebooting the rest of the system unrelated to this stuck zpool.  However, it's inappropriate for a driver to actually report ``drive dead'' in this scenario, because the drive is NOT dead.

The drive-failure-statistic papers posted in here say that drives usually fail with a bunch of contiguous or clumped-together unreadable sectors.  You can still get most of the data off them with dd_rescue or 'dd if=baddrive of=gooddrive bs=512 conv=noerror,sync', if you wait about a week.  About four hours of that week is spent copying data and the rest is spent aggressively ``retrying''.  An instantaneous manual command, ``I insist this drive is failed.  Mark it failed instantly, without leaving me stuck in bogus state machines for two minutes or two hours,'' would be a huge improvement, but I think graceful automatic behavior is not too much to wish for, because this isn't some strange circumstance.  This is *the way drives usually fail*.

SCSI drives have all kinds of retry tuning in the ``mode pages,'' in a standardized format.  Even 5.25" 10MB SCSI drives had these pages.  One of NetApp's papers said they don't even let their SCSI/FC drives do their own bad-block reallocation: they do all that in host software.  So there are a lot of secret tuning knobs, and they're AIUI largely standardized across manufacturers and through the years.  ATA drives, AIUI, don't have the pages, but some WD gamer drives have a goofy DOS RAID-tuner tool.
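To make the split-queue wish concrete, here's a toy userland model of it in C, with every name invented for illustration (no real firmware or sd driver exposes anything like this): one fail-slow cdb grinds through its 30-tick retry cycle in the aggressive-retry queue, while the deadline queue keeps completing B_FAILFAST cdb's instead of everything freezing behind the slow one the way it does today.

    /*
     * Toy model of failure-aware QoS in drive firmware: two queues, an
     * aggressive-retry queue and a deadline queue.  Every tick, the
     * retry queue burns one tick of its retry cycle AND the deadline
     * queue completes work.  Today's drives have one queue, so the
     * failfast reads would wait out all 30 ticks first.  Hypothetical
     * sketch only; compile with: cc -std=c99 splitq.c
     */
    #include <stdio.h>
    #include <string.h>

    #define QLEN 8

    struct cdb   { const char *name; int ticks_left; };
    struct queue { struct cdb q[QLEN]; int n; };

    static void push(struct queue *q, const char *name, int ticks)
    {
        if (q->n < QLEN)
            q->q[q->n++] = (struct cdb){ name, ticks };
    }

    static void pop(struct queue *q)
    {
        memmove(q->q, q->q + 1, --q->n * sizeof q->q[0]);
    }

    int main(void)
    {
        struct queue retry = { .n = 0 }, deadline = { .n = 0 };

        push(&retry, "fail-slow read (stuck in retry cycle)", 30);
        push(&deadline, "B_FAILFAST read 1", 1);
        push(&deadline, "B_FAILFAST read 2", 1);

        for (int t = 0; retry.n || deadline.n; t++) {
            if (retry.n && --retry.q[0].ticks_left == 0) {
                printf("t=%2d  %s finally done\n", t, retry.q[0].name);
                pop(&retry);
            }
            if (deadline.n && --deadline.q[0].ticks_left == 0) {
                printf("t=%2d  %s serviced on time\n", t, deadline.q[0].name);
                pop(&deadline);
            }
        }
        return 0;
    }

The only point of the model is the scheduling property: the deadline cdb's complete at t=0 and t=1 while the retry queue is still grinding toward t=29.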
But even what SCSI drives offer isn't enough to provide the architecture ZFS seems to dream of.  What's really needed to provide the ZFS developers' expectations of B_FAILFAST is QoS inside the drive firmware.  Drives need to have split queues, with an aggressive-retry queue and a deadline-service queue.  While retrying a stuck cdb in the aggressive queue, they must keep servicing the deadline queue.  I've never heard of anything like this existing in a real drive.  I think it's beyond the programming skill of an electrical engineer, and it may be too constraining for them, because drives seem to do spastic head-seeks and sometimes partway spin themselves down and back up during a retry cycle.

ZFS still seems to have this taxonomic-arcania view of drives: that they are ``failing operations'' or the drive itself is ``failed''.  It belongs to the driver's realm to decide whether it's the whole drive or just the ``operation'' which is failing, because that's how the square peg fits snugly into its square hole.  One of the NetApp papers mentions they have proprietary statistical heuristics for when to ignore a drive for a little while and use redundant drives instead, and when to fail a drive and call autosupport.  And they log drive behavior really explicitly, unambiguously separate from ``controller'' failure, which is why they're able to write the paper at all.  I'm in favour of heuristics, but most of the ZFS developers seem to think the issue lies with every driver in Solaris being not up to its promised standards.  I still think the ZFS approach is wrong and the NetApp approach right.

 * I think if SATA is to be supported, then the fantasy that drives can be configured to return failure early should be cast off forever.

 * I don't think ZFS will match the availability behavior of NetApp or even of Areca/PERC/RAID-on-a-card until it includes vdev-level handling of slow devices.  This means vdev-level timers inside ZFS, above the block driver level, driving error-recovery decisions.

 * I think a pool read/write that takes longer than other drives in a redundant vdev, or longer than other cdb's took on the same drive, should be re-dispatched to fetch redundant data (there's a sketch of this below).  I think this should happen with really tight tolerance and should be stateful, such that a mirror could have a remote iSCSI component and a local component, and only the local component would be used for reads.

 * If a drive is taking 30 seconds to perform every cdb, but is still present and the driver refuses to mark it bad, ZFS needs to be able to mark it bad on its own, so that it no longer blocks synchronous writes, and so hot-spare replacement can start to get the pool back up to the policy's redundancy expectation.

If we're designing systems with multiple controllers to avoid a ``single point of failure,'' then it's not okay to punt and say, well, this isn't our problem because we're waiting patiently on the controller to do something sane.  The short-term decisions require vdev-level knowledge which doesn't exist inside the driver, but arguably marking drives failed does not require vdev-level knowledge and could be done in the driver rather than ZFS.  I still think this is wrong.  Based on our experience so far with controller drivers, they aren't very good, and controller chips are rather short-lived, so their drivers are never going to be very good; and the drivers are often proprietary, so the work has to be redone inside ZFS just to have a bit of software freedom again.
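Here's a minimal sketch of the vdev-level timer idea from the re-dispatch bullet above, with made-up numbers and function names (this is not how ZFS's vdev layer is actually structured): each mirror side carries its own recent-latency state, a read goes to the historically faster side, and if that side blows a deadline derived from its own average, the read is re-dispatched to the redundant side instead of waiting out the driver's retry cycle.

    /*
     * Sketch of vdev-level re-dispatch above the block driver: prefer
     * the historically fastest mirror side, and if it exceeds a tight,
     * stateful deadline (a multiple of its own average service time),
     * fetch the redundant copy instead.  All numbers and names are
     * invented for illustration.
     */
    #include <stdio.h>

    struct side { const char *name; int avg_ms; int svc_ms; };

    static const char *mirror_read(struct side *a, struct side *b)
    {
        struct side *first = (a->avg_ms <= b->avg_ms) ? a : b;
        struct side *other = (first == a) ? b : a;
        int deadline_ms = 4 * first->avg_ms;  /* tight, per-device, stateful */

        if (first->svc_ms <= deadline_ms)
            return first->name;
        printf("%s took %d ms against a %d ms deadline; re-dispatching\n",
               first->name, first->svc_ms, deadline_ms);
        return other->name;
    }

    int main(void)
    {
        /* local component ~10 ms, remote iSCSI component ~40 ms:
           reads normally go only to the local side */
        struct side local = { "local disk",   10, 10 };
        struct side iscsi = { "iSCSI mirror", 40, 40 };

        printf("healthy:  read from %s\n", mirror_read(&local, &iscsi));

        local.svc_ms = 30000;  /* the drive is ``retrying'' for 30 s */
        printf("degraded: read from %s\n", mirror_read(&local, &iscsi));
        return 0;
    }

The same per-device latency state is what would let ZFS mark a still-present drive bad on its own when every cdb starts taking 30 seconds.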
A practical modern storage design is robust against bugs in the controller driver, bugs exercised by combinations of drive firmware and controllers, or by doing strange things with cables.  If this won't go inside ZFS, then people will reasonably want some pseudodevice, like an SVM soft partition or a multipath layer, to protect them from failing controller drivers.  They might want a way to manually, and instantly, without waiting on stupid state machines, mark the device failed, and to crack ZFS and the controller driver apart so they're not locked in some deadly embrace of failure that requires rebooting.  If we agree there is a need for multipath to a single device, why can we not agree that we expect protection from failures of a controller or its driver even when we don't have multipath, but have laid out our vdev's with enough redundancy to tolerate controller failure?  In practice, I think the real problem is drives that become really slow instead of failing outright, but bringing in multipath and controller redundancy shows what is to my view the taxonomic hypocrisy of wanting to keep this out of ZFS.

 * Management commands like export, status, detach, offline, replace must either (a) never block waiting for I/O: use kernel state only, do disk writes asynchronously, and report failure through inspection commands the user polls, like 'zpool status'.  This world is possible: we don't expect the mirror to be in sync before 'zpool attach' returns, though we could.  Or (b) sleep interruptibly, and include a more drastic version that doesn't block, so normally you type 'zpool offline', and when the prompt returns without error you know that all your labels are updated; but if you don't get a prompt back, you can ^C and 'zpool offline -f' (there's a sketch of this below).  Not being able to get rid of a drive without access to the drive you want to get rid of is as ridiculous as ``keyboard not found.  Press F1 to continue.''  Even square-peg square-hole taxonomists ought to agree on this one.

And I don't like getting ``no valid replicas'' errors in situations that ZFS will tolerate if you force it by rebooting or by hot-unplugging the device.  There should be a clear delineation of which pool states are acceptable and which are not, and I should be able to explore all the acceptable states by moving the pool through them manually.  If I can't 'zpool offline' a device, and _instantly_ if I insist on it, then the pool should not mount at boot without that device.  I shouldn't have to involve rebooting in my testing, or else it feels like Fisher-Price wallpapered crap.  I sometimes run my dishwasher with the door open for half a second when I become suspicious of it.  The sky doesn't fall.  But these days it seems like people believe any interlock anywhere, even a preposterous invented one, is as sacred as the one on a microwave or a UV oven.

Oh, and when possible ZFS should not forget its knowledge of inconsistencies across a reboot, and should, for example, continue interrupted resilvers like SVM did.
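And a rough sketch of option (b) from the management-command bullet, using invented placeholder functions (these are not real libzfs calls): the normal path returns only once labels are written, and the -f path flips in-core kernel state only, so it can never get stuck waiting on the disk you're trying to get rid of.

    /*
     * Option (b) sketched with hypothetical placeholder functions, not
     * real libzfs API.  'zpool offline' sleeps interruptibly until the
     * labels are updated; 'zpool offline -f' updates kernel state only
     * and never touches the wedged disk.
     */
    #include <stdbool.h>
    #include <stdio.h>

    /* placeholder: push updated labels to the *other* vdevs, giving up
       after timeout_ms instead of sleeping uninterruptibly forever */
    static bool write_labels_interruptibly(int timeout_ms)
    {
        (void)timeout_ms;
        return false;              /* pretend the pool is wedged */
    }

    /* placeholder: mark the vdev offline in kernel state only;
       labels sync asynchronously once the pool is healthy again */
    static void mark_offline_in_core(const char *dev)
    {
        printf("%s offlined in core; labels will catch up later\n", dev);
    }

    static int zpool_offline(const char *dev, bool force)
    {
        if (!force) {
            if (write_labels_interruptibly(30000))
                return 0;          /* prompt returns: labels are updated */
            fprintf(stderr, "%s: stuck updating labels; ^C and use -f\n", dev);
            return 1;
        }
        mark_offline_in_core(dev); /* -f: kernel state only, never blocks */
        return 0;
    }

    int main(void)
    {
        if (zpool_offline("c1t0d0", false) != 0)  /* normal path wedges... */
            return zpool_offline("c1t0d0", true); /* ...so the user forces it */
        return 0;
    }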