On 25-Nov-08, at 5:10 AM, Ross Smith wrote:

> Hey Jeff,
>
> Good to hear there's work going on to address this.
>
> What did you guys think to my idea of ZFS supporting a "waiting for a
> response" status for disks as an interim solution that allows the pool
> to continue operation while it's waiting for FMA or the driver to
> fault the drive?
> ...
>
> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay
>
> The first sets whether to use this feature, and configures the maximum
> time ZFS will wait for a response from a device before putting it in a
> "waiting" status.

The shortcomings of timeouts have been discussed on this list before.
How do you tell the difference between a drive that is dead and a path
that is just highly loaded? I seem to recall the argument strongly made
in the past that making decisions based on a timeout alone can provoke
various undesirable cascade effects.

> The second would be optional and is the maximum
> time ZFS will wait before faulting a device (at which point it's
> replaced by a hot spare).
>
> The reason I think this will work well with the FMA work is that you
> can implement this now and have a real improvement in ZFS
> availability. Then, as the other work starts bringing better modeling
> for drive timeouts, the parameters can be either removed, or set
> automatically by ZFS.
> ... it should be possible for ZFS to read or
> write from other devices while it's waiting for an 'official' result
> from any one faulty component.

Sounds good - devil, meet details, etc.

--Toby

>
> Ross
>
>
> On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick
> <[EMAIL PROTECTED]> wrote:
>> I think we (the ZFS team) all generally agree with you. ...
>>
>> The reason this is all so much harder than it sounds is that we're
>> trying to provide increasingly optimal behavior given a collection of
>> devices whose failure modes are largely ill-defined. (Is the disk
>> dead or just slow? Gone or just temporarily disconnected? ...
>>
>> Jeff
>>
>> On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
>>> But that's exactly the problem Richard: AFAIK.
>>>
>>> Can you state that absolutely, categorically, there is no failure
>>> mode out there (caused by hardware faults, or bad drivers) that
>>> won't lock a drive up for hours? You can't, obviously, which is
>>> why we keep saying that ZFS should have this kind of timeout
>>> feature.
>>> ...
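
For anyone trying to picture the two-parameter proposal quoted above, it can be sketched as a small state classifier. This is only an illustration in Python, not ZFS code: the names DeviceState, TimeoutPolicy, classify and readable_devices are hypothetical, and the sketch assumes both timeouts are given in seconds and measured from the moment the oldest outstanding I/O was issued, which the thread does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class DeviceState(Enum):
    ONLINE = "online"
    WAITING = "waiting"   # the proposed interim "waiting for a response" status
    FAULTED = "faulted"   # at this point a hot spare would take over


@dataclass
class TimeoutPolicy:
    # Hypothetical stand-ins for the two proposed parameters, in seconds.
    device_timeout: float                 # zfs-auto-device-timeout
    fail_delay: Optional[float] = None    # zfs-auto-device-timeout-fail-delay (optional)


def classify(pending_for: float, policy: TimeoutPolicy) -> DeviceState:
    """Map the age of a device's oldest outstanding I/O to a state.

    Assumption: both limits are measured from when the I/O was issued.
    """
    if policy.fail_delay is not None and pending_for >= policy.fail_delay:
        return DeviceState.FAULTED
    if pending_for >= policy.device_timeout:
        return DeviceState.WAITING
    return DeviceState.ONLINE


def readable_devices(pending: Dict[str, float], policy: TimeoutPolicy) -> List[str]:
    """Return the devices the pool could keep using while others are stalled."""
    return [name for name, age in pending.items()
            if classify(age, policy) is DeviceState.ONLINE]


if __name__ == "__main__":
    policy = TimeoutPolicy(device_timeout=5.0, fail_delay=60.0)
    # Age in seconds of the oldest outstanding I/O per (made-up) device name.
    pending = {"disk0": 0.2, "disk1": 12.0, "disk2": 300.0}
    for name, age in pending.items():
        print(name, classify(age, policy).value)
    print("usable:", readable_devices(pending, policy))
```

Note that the sketch decides purely on elapsed time, so it has no way to tell a dead drive from a path that is just highly loaded, which is the objection raised in the reply above.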