Mattias Pantzare wrote:
> 2008/8/27 Richard Elling <[EMAIL PROTECTED]>:
>>>>> Either the drives should be loaded with special firmware that
>>>>> returns errors earlier, or the software LVM should read redundant data
>>>>> and collect the statistic if the drive is well outside its usual
>>>>> response latency.
>>>>>
>>>> ZFS will handle this case as well.
>>>>
>>> How is ZFS handling this? Is there a timeout in ZFS?
>>>
>> Not for this case, but if configured to manage redundancy, ZFS will
>> "read redundant data" from alternate devices.
>>
> No, ZFS will not, ZFS waits for the device driver to report an error,
> after that it will read from alternate devices.
>
Yes, ZFS will: it waits for the device driver to report an error, and
after that it will read from alternate devices.

> ZFS could detect that there is probably a problem with the device and
> read from an alternate device much faster while it waits for the
> device to answer.

Rather than complicating ZFS with error-handling code that is difficult
to port or maintain over time, ZFS leverages the Solaris Fault
Management Architecture (FMA; see the sketch at the end of this
message). There is opportunity to expand these features using the
flexible FMA framework. Feel free to propose additional RFEs.

> You can't do this at any other level than ZFS.
>
>>>>> One thing other LVM's seem like they may do better
>>>>> than ZFS, based on not-quite-the-same-scenario tests, is not freeze
>>>>> filesystems unrelated to the failing drive during the 30 seconds it's
>>>>> waiting for the I/O request to return an error.
>>>>>
>>>> This is not operating in ZFS code.
>>>>
>>> In what way is freezing a ZFS filesystem not operating in ZFS code?
>>>
>>> Notice that he wrote filesystems unrelated to the failing drive.
>>>
>> At the ZFS level, this is dictated by the failmode property.
>>
> But that is used after ZFS has detected an error?
>

I don't understand this question. Could you rephrase to clarify?

>> I find comparing unprotected ZFS configurations with LVMs
>> using protected configurations to be disingenuous.
>>
> I don't think anyone is doing that.
>

harrumph

>>> What is your definition of unrecoverable reads?
>>>
>> I wrote data, but when I try to read, I don't get back what I wrote.
>>
> There is only one case where ZFS is better, that is when wrong data is
> returned. All other cases are managed by layers below ZFS. Wrong data
> returned is not normally called unrecoverable reads.
>

It depends on your perspective. T10 has provided a standard error code
for a device to tell a host that it experienced an unrecoverable read
error. However, we still find instances where what we wrote is not what
we read, whether the problem is detected at the media level or higher
in the software stack. In my pile of borken parts, I have devices which
fail to indicate an unrecoverable read, yet do indeed suffer from
forgetful media. Carrying that discussion very far quickly descends
into the ability of the device's media checksums to detect bad data --
even ZFS's checksums. But here is another case where enterprise-class
devices tend to perform better than consumer-grade devices.
 -- richard
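
A rough sketch of the failmode property and FMA commands referenced
above. The pool name "tank" is hypothetical and the output shown is
approximate; consult zpool(1M), fmdump(1M), and fmadm(1M) on your
release for the authoritative syntax.

    # zpool get failmode tank
    NAME  PROPERTY  VALUE     SOURCE
    tank  failmode  wait      default

    # Return EIO on I/O to a faulted pool instead of blocking:
    # zpool set failmode=continue tank

    # Summarize the error telemetry (ereports) FMA has received:
    # fmdump -e

    # List resources FMA has diagnosed as faulted:
    # fmadm faulty

Note that failmode (wait, continue, or panic) only governs behavior
once the pool has already lost the redundancy it needs to continue; it
does not shorten the driver or firmware timeouts being discussed here.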