On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
> (2) The FMA model of collecting telemmetry, taking it into
> user-space, chin-strokingly contemplating it for a while, then
> decreeing a diagnosis, is actually a rather limited one. I can
> think of two kinds of limit:

As mentioned previously, this is not an accurate description of what's going on. FMA allows diagnosis to happen at the detector when the telemetry is conclusive and cross-domain or predictive analysis isn't required. This is exactly what ZFS does on recent nevada builds. If a drive is pathologically broken (i.e., a reopen fails, or reads and writes to the label fail), ZFS will *immediately* fail the drive and not wait for any further diagnosis from FMA.

For drives that randomly fail I/Os or take a long time, but otherwise respond to basic requests, ZFS in the kernel is often in no better position than FMA to perform a diagnosis. And as of build 101, ZFS behaves much better in these circumstances by not aggressively retrying commands before exhausting all other options.

Are you running your experiments on build 101 or later? And what experiments are you running? Drawing conclusions from previous experience or reports is basically pointless given the amount of change that has occurred recently (Jeff's putback wasn't nicknamed "SPA 3.0" for nothing). While there are no doubt more rough edges, we have incorporated much of the previous feedback into new behavior that should provide a much improved experience.

- Eric

P.S. I'm also not sure that B_FAILFAST behaves in the way you think it does. My reading of sd.c seems to imply that much of what you suggest is actually how it currently behaves, but you should probably bring up the issue on storage-discuss, where you will find more experts in this area.

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss