>>>>> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes:
bf> If the system or device is simply overwhelmed with work, then
bf> you would not want the system to go haywire and make the
bf> problems much worse.

None of the decisions I described it making based on performance
statistics are ``haywire''---I said it should funnel reads to the
faster side of the mirror, and do this really quickly and
unconservatively.  What's your issue with that?

bf> You are saying that I can't split my mirrors between a local
bf> disk in Dallas and a remote disk in New York accessed via
bf> iSCSI?

Nope, you've misread.  I'm saying reads should go to the local disk
only, and writes should go to both.  See SVM's 'metaparam -r'.  I
suggested that, unlike the SVM feature, it should be automatic,
because only then does it become useful as an availability tool
rather than just a performance optimisation.

The performance-statistic logic should influence read scheduling
immediately, and generate events which are fed to FMA; FMA can then
mark devices faulty.  There's no need for both to make the same
decision at the same time.  If the events aren't useful for
diagnosis, ZFS could simply not generate them, or fmd could ignore
them in its diagnosis.  I suspect they *would* be useful, though.

I'm imagining the read rescheduling would happen very quickly, much
quicker than a round trip through FMA would allow, in much less than
a second.  That's why it would have to compare devices to others in
the same vdev, and to themselves over time, rather than use fixed
timeouts or punt to haphazard driver and firmware logic.  (There's a
sketch of what I mean at the end of this message.)

bf> o System waits substantial time for devices to (possibly)
bf> recover in order to ensure that subsequently written data has
bf> the least chance of being lost.

There's no need for the filesystem to *wait* for data to be written,
unless you are calling fsync, and maybe not even then if there's a
slog.  I said clearly that you read from only one half of the mirror
but write to both.  But you're right that the trick probably won't
work perfectly---eventually dead devices need to be faulted.  The
idea is that normal write caching will buy you orders of magnitude
more time in which to make a better decision before anyone notices.

Experience here is that ``waits a substantial time'' usually means
``freezes for hours and gets rebooted''.  There's no need to be
abstract: we know what happens when a drive starts taking 1000x -
2000x longer than usual to respond to commands, and we know that this
is THE common online failure mode for drives.  That's what started
the thread.  So, think about this: hanging for an hour trying to
write to a broken device may block other writes to devices which are
still working, until the patiently-waiting data is eventually lost in
the reboot.

bf> o System immediately ignores slow devices and switches to
bf> non-redundant non-fail-safe non-fault-tolerant
bf> may-lose-your-data mode. When system is under intense load,
bf> it automatically switches to the may-lose-your-data mode.

Nobody's proposing a system which silently rocks back and forth
between faulted and online.  That's not what we have now, and no such
system would naturally arise from what I described.  If FMA marked a
drive faulty based on performance statistics, that drive would be
retired permanently and hot-spare-replaced.  Obviously false
positives are bad, just as obviously as freezes and reboots are bad.

Using FMA in this way isn't my idea: it's how FMA was pitched, and
it's been the excuse for leaving good exception handling out of ZFS
for two years.  So, where's the beef?
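To be concrete about what ``compare devices to others in the same
vdev'' means, here's a rough sketch in plain C.  It's not ZFS code;
every name and threshold in it (mirror_read_done, EWMA_WEIGHT,
DEGRADE_FACTOR, and so on) is invented for illustration.  The idea is
just: keep a smoothed latency estimate per mirror child, send reads
to whichever child currently looks fastest, and raise an event when
one child falls far behind its siblings.

/*
 * Sketch only, not ZFS code.  All names and thresholds are
 * hypothetical, chosen to illustrate the heuristic described above.
 */
#include <stdio.h>
#include <stdint.h>

#define MAX_CHILDREN   4
#define EWMA_WEIGHT    8       /* each new sample contributes 1/8 */
#define DEGRADE_FACTOR 10      /* "much slower than its siblings" */

typedef struct mirror_child {
        const char *mc_name;
        uint64_t    mc_ewma_us;    /* smoothed read latency, microseconds */
        int         mc_flagged;    /* event already generated */
} mirror_child_t;

typedef struct mirror_vdev {
        mirror_child_t mv_child[MAX_CHILDREN];
        int            mv_nchildren;
} mirror_vdev_t;

/* Update one child's latency estimate after a read completes on it. */
static void
mirror_read_done(mirror_vdev_t *mv, int child, uint64_t latency_us)
{
        mirror_child_t *mc = &mv->mv_child[child];

        if (mc->mc_ewma_us == 0)
                mc->mc_ewma_us = latency_us;
        else
                mc->mc_ewma_us += ((int64_t)latency_us -
                    (int64_t)mc->mc_ewma_us) / EWMA_WEIGHT;

        /* Compare against the fastest sibling; maybe generate an event. */
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < mv->mv_nchildren; i++)
                if (mv->mv_child[i].mc_ewma_us < best)
                        best = mv->mv_child[i].mc_ewma_us;

        if (!mc->mc_flagged && best > 0 &&
            mc->mc_ewma_us > best * DEGRADE_FACTOR) {
                mc->mc_flagged = 1;
                /* In the real system this would be an ereport for fmd,
                 * not a printf. */
                printf("event: %s is %lluX slower than best sibling\n",
                    mc->mc_name,
                    (unsigned long long)(mc->mc_ewma_us / best));
        }
}

/* Pick the child to read from: the lowest smoothed latency right now. */
static int
mirror_pick_read_child(const mirror_vdev_t *mv)
{
        int best = 0;
        for (int i = 1; i < mv->mv_nchildren; i++)
                if (mv->mv_child[i].mc_ewma_us <
                    mv->mv_child[best].mc_ewma_us)
                        best = i;
        return (best);
}

int
main(void)
{
        mirror_vdev_t mv = {
                .mv_child = { { "local-disk", 0, 0 },
                              { "iscsi-newyork", 0, 0 } },
                .mv_nchildren = 2,
        };

        /* The local disk answers in ~5ms, the remote one in ~80ms. */
        for (int i = 0; i < 20; i++) {
                mirror_read_done(&mv, 0, 5000);
                mirror_read_done(&mv, 1, 80000);
        }

        printf("reads go to: %s\n",
            mv.mv_child[mirror_pick_read_child(&mv)].mc_name);
        return (0);
}

The point of structuring it this way is that the rescheduling
decision and the fault decision are decoupled: the read path reacts
within a request or two, while the event can sit in fmd's queue and
be diagnosed at leisure, and writes still go to every child until FMA
actually retires one.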