>>>>> "es" == Eric Schrock <[EMAIL PROTECTED]> writes:
es> The main problem with exposing tunables like this is that they
es> have a direct correlation to service actions, and
es> mis-diagnosing failures costs everybody (admin, companies,
es> Sun, etc) lots of time and money. Once you expose such a
es> tunable, it will be impossible to trust any FMA diagnosis,

Yeah, I tend to agree that the constants shouldn't be tunable,
because I hoped Sun would become a disciplined collection-point for
experience used to set the constants---discipline meaning the
constants are adjusted only in response to bad diagnosis, not
``preference,'' and in a direction that improves diagnosis for
everyone, not for ``the site.''

I'm not yet won over to the idea that statistical FMA diagnosis
constants shouldn't exist.  I think drives can't diagnose themselves
for shit, and I think drivers these days are diagnosees, not
diagnosers.  But clearly a confusingly-bad diagnosis is much worse
than a diagnosis that's bad in a simple way.

es> If I issue a write to both halves of a mirror, should
es> I return when the first one completes, or when both complete?

Well, if it's not a synchronous write, you return before you've
written either half of the mirror, so it's only an issue for
O_SYNC/ZIL writes, true?  BTW, what does ZFS do right now for
synchronous writes to mirrors: wait for all, wait for two, or wait
for one?

es> any such "best effort RAS" is a little dicey because you have
es> very little visibility into the state of the pool in this
es> scenario - "is my data protected?" becomes a very difficult
es> question to answer.

I think it's already difficult.  For example, a pool will say ONLINE
while it's resilvering, won't it?  I might be wrong.  Take a pool
that can only tolerate one failure: is the difference between
replacing an ONLINE device (still redundant) and replacing an OFFLINE
device (not redundant until resilvered) captured?  Likewise, should a
pool with a spare in use really be marked DEGRADED both before the
spare resilvers and after?  The answers to the questions aren't
important so much as that you have to think about the answers---what
should they be, what are they now---which means ``is my data
protected?'' is already a difficult question to answer.

Also, there were recently fixed bugs with the DTL.  The status of
each device's DTL, even the existence and purpose of the DTL, isn't
well exposed to the admin, and is relevant to answering the ``is my
data protected?'' question---indirect means of inspecting it, like
tracking the status of resilvering, seem too wallpapered given that
the bug escaped notice for so long.

I agree with the problem 100% and don't wish to worsen it; I just
disagree that it's a new one.

re> 3 orders of magnitude range for magnetic disk I/Os, 4 orders
re> of magnitude for power managed disks.

For power management I would argue for a fixed timeout.  The time to
spin up doesn't have anything to do with the io/s you got before the
disk spun down.  There's no reason to disguise the constant we
secretly wish for inside some fancy math for deriving it just because
writing down constants feels bad, unless you _know_ the disk is
spinning up through some in-band means and want to compare its spinup
time to recorded measurements of past spinups.
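To make the distinction concrete, here's a minimal sketch in C.  None
of it is real ZFS code: the structure, the function, and the
30-second number are invented for illustration.  The point is only
that a spin-up gets judged against a written-down constant, while
everything else gets judged against the disk's own history.

    /*
     * Hypothetical sketch only -- not taken from the ZFS source; the
     * names and the 30-second number are invented.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define NSEC            1000000000LL
    #define SPINUP_TIMEOUT  (30LL * NSEC)       /* invented constant */

    typedef struct disk_state {
        int64_t io_issued;      /* issue time of the pending I/O, ns */
        int64_t usual_latency;  /* this disk's recent average, ns */
        bool    spinning_up;    /* learned through some in-band means */
    } disk_state_t;

    bool
    io_is_overdue(const disk_state_t *d, int64_t now)
    {
        int64_t elapsed = now - d->io_issued;

        /* spin-up time has nothing to do with past io/s: use a constant */
        if (d->spinning_up)
            return (elapsed > SPINUP_TIMEOUT);

        /* otherwise compare the disk to itself, e.g. 1000x its average */
        return (elapsed > 1000 * d->usual_latency);
    }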
This is a good case for pointing out that there are two sets of
rules:

 * 'metaparam -r' rules

   + not invoked at all if there's no redundancy.

   + very complicated:

     - they involve sets of disks, not one disk: comparison of a
       statistic among disks within a vdev (definitely), and
       comparison of individual disks to themselves over time
       (possibly).

     - complicated output: the rules return a set of disks per vdev,
       not a yay-or-nay diagnosis per disk.  And there are two kinds
       of output decision:

       o for n-way mirrors, select anywhere from 1 to n disks.  For
         example, a three-way mirror with two fast local disks and
         one slow remote iSCSI disk should split reads among the two
         local disks.  For raidz and raidz2 the rules can eliminate
         0, 1 (or, for raidz2, 2) disks from the set of disks to
         read.  It's possible to issue all the reads and take the
         first sufficient set to return, as Anton suggested, but I
         imagine 4-device raidz2 vdevs will be common, and those
         could some day perform as well as a 2-device mirror.

       o also, decide when to stop waiting on an existing read and
         re-issue it.  So the decision is not only about future
         reads; it has to cancel already-issued reads, possibly
         replacing the B_FAILFAST mechanism, so there will be a
         second, uncancellable round of reads once the first round
         exhausts all redundancy.

       o that second decision needs to be made thousands of times
         per second without a lot of CPU overhead.

   + small consequence if the rules deliver false positives: just
     reduced performance (the same as with the TCP fast-retransmit
     rules Bill mentioned).

   + large consequence for false negatives (system freeze), so one
     can't really say ``we won't bother doing it for raidz2 because
     it's too complicated.''  The rules are NOT just about optimizing
     performance.

   + at least partly in kernel.

 * diagnosis rules

   + should it be invoked for single-device vdevs?  Does ZFS
     diagnosis already consider that a device in an unredundant vdev
     should be FAULTED less aggressively (ex., never for CKSUM
     errors)?  This is arguable.

   + diagnosis is strictly per-disk and should compare disks only to
     themselves, or to cultural memory of The Typical Disk in the
     form of untunable constants, never to other disks in the same
     vdev.

   + three possible verdicts per disk:

     - all's good

     - warn the sysadmin about this disk but keep writing to it

     - fault this disk in ZFS: no further I/O, not even writes, and
       start rebuilding it onto a spare

     Eric points out that false positives are expensive in BOTH
     cases, not just the second, because even the warning can
     initiate expensive repair procedures and reduce trust in FMA
     diagnoses.  So there should probably be only two verdicts, good
     and fault.  If the statistics are extractable, more aggressive
     sysadmins can devise their own warning rules and competitively
     try to predict the future.  The owners of large clusters might
     be better at crafting warning rules than Sun, but their results
     won't be general.

   + potentially complicated, but might be really simple, like ``an
     I/O takes more than three minutes to complete.''

   + a more complicated but still somewhat simple hypothetical rule
     (sketched in code after this list): ``one I/O hasn't returned
     completion or failure after 10 minutes, OR at least one I/O
     originally issued to the driver within each of three separate
     four-minute-long buckets within the last 40 minutes takes 1000
     times longer than usual or more than 120 seconds, whichever is
     larger (three slow I/O's in the recent past).''

     These might be really bad rules.  My point is that variance, or
     some statistic more complicated than addition and buckets, might
     be good for diagnosing bad disks but isn't necessarily required,
     while for the 'metaparam -r' rules it IS required.  For
     diagnosing bad disks, a big bag of traditional-AI rules might be
     better than statistical/machine-learning rules, and will be
     easier for less-sophisticated developers to modify according to
     experience and futuristic hardware.  Ex., a power-managed disk
     spinning up takes less than x seconds and should not be spinning
     down more often than every y minutes; a SAN fabric disconnection
     should reconnect within z seconds, and unannounced outages don't
     need to be tolerated silently without intervention more than
     once per day; and so on.  It may even be possible to generate
     negative fault events, like ``disk IS replying, not silent, and
     it says Not-ready-coming-ready, so don't fault it for 1
     minute.''  The option of creating this kind of hairy mess of
     special-case, layer-violating, codified-tradition rules is the
     advantage I perceived in tolerating the otherwise disgusting
     hairy bolt-on shared-lib-spaghetti mess that is FMA.  But for
     the 'metaparam -r' rules, OTOH, variance/machine-learning is
     probably the only approach.

   + rules are in userland, can be more expensive CPU-wise, and
     return feedback to the kernel only a couple of times a minute,
     not per-I/O like the 'metaparam -r' reissue rules.
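Here is the bucket rule from above as a sketch, mostly to show how
little machinery it needs: no variance, just a threshold and ten
buckets.  Nothing in it is real ZFS or FMA code; the io_record_t
structure, the function name, and the bookkeeping are invented, and
the constants are just the ones quoted in the rule.

    /*
     * Hypothetical sketch of the two-clause rule -- invented for
     * illustration, not taken from ZFS or FMA.
     * Clause 1: any I/O still outstanding after 10 minutes.
     * Clause 2: slow I/O's (latency > max(1000 * usual, 120 s))
     *           landing in at least 3 distinct 4-minute buckets of
     *           the last 40 minutes.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define NSEC            1000000000LL
    #define OUTSTANDING_MAX (10LL * 60 * NSEC)  /* 10 minutes */
    #define SLOW_FLOOR      (120LL * NSEC)      /* 120 seconds */
    #define BUCKET_LEN      (4LL * 60 * NSEC)   /* one 4-minute bucket */
    #define NBUCKETS        10                  /* last 40 minutes */

    typedef struct io_record {
        int64_t issued;     /* issue time, ns */
        int64_t latency;    /* completion latency, ns; -1 if outstanding */
    } io_record_t;

    bool
    disk_should_fault(const io_record_t *ios, int nios,
        int64_t usual_latency, int64_t now)
    {
        bool    slow_bucket[NBUCKETS] = { false };
        int64_t slow_thresh = 1000 * usual_latency;
        int     nslow = 0;

        if (slow_thresh < SLOW_FLOOR)
            slow_thresh = SLOW_FLOOR;   /* ``whichever is larger'' */

        for (int i = 0; i < nios; i++) {
            int64_t age = now - ios[i].issued;

            /* clause 1: one I/O outstanding for more than 10 minutes */
            if (ios[i].latency < 0 && age > OUTSTANDING_MAX)
                return (true);

            /* clause 2: slow completions, bucketed by issue time */
            if (ios[i].latency >= slow_thresh &&
                age < NBUCKETS * BUCKET_LEN) {
                int b = (int)(age / BUCKET_LEN);

                if (!slow_bucket[b]) {
                    slow_bucket[b] = true;
                    nslow++;
                }
            }
        }

        /* fault if slow I/O's landed in three separate buckets */
        return (nslow >= 3);
    }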
I guess I'm changing my story slightly.  I *would* want ZFS to
collect drive performance statistics and report them to FMA, but I
wouldn't suggest reporting the _decision_ outputs of the
'metaparam -r'-replacement engine to FMA, only the raw stats.

And, of course, ``reporting'' is tricky for the diagnosis case
because of the bolted-on separation of FMA.  You can't usefully
report ``the I/O took 3 hours to complete,'' because by then you've
waited three hours to get the report, and the completed I/O has a
normal driver error attached to it, so no fancy statistical decisions
are needed any longer.  Instead, you have to make polled reports to
userland a couple of times a minute, containing the list of
incomplete outstanding I/O's, along with averages and variances and
whatever else.
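Something like the following is the shape I have in mind for those
polled reports.  It is not an existing interface: the struct layout,
the field names, and diskstat_poll() are all invented, just to
illustrate what the kernel could hand a userland diagnosis engine a
couple of times a minute.

    /*
     * Hypothetical shape of a polled report -- not an existing ZFS
     * or FMA interface; every name here is made up.
     */
    #include <stdint.h>

    typedef struct pending_io {
        uint64_t io_id;     /* identifies the outstanding I/O */
        int64_t  issued;    /* when it went to the driver, ns */
    } pending_io_t;

    typedef struct disk_report {
        char         path[256];     /* device path */
        uint64_t     ios_completed; /* completed since the last poll */
        int64_t      latency_avg;   /* ns, over that window */
        int64_t      latency_var;   /* ns^2, over that window */
        uint32_t     npending;      /* I/O's still outstanding */
        pending_io_t pending[64];   /* their issue times (truncated) */
    } disk_report_t;

    /*
     * A userland diagnosis engine would call something like this
     * every 20-30 seconds and feed the results to its rules.
     */
    extern int diskstat_poll(disk_report_t *reports, int max_reports);

That keeps the fancy statistics in userland, where they're cheap to
change, while the kernel only has to keep counters and the list of
outstanding I/O's.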