Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better

Ross Smith Thu, 27 Nov 2008 23:03:44 -0800

On Fri, Nov 28, 2008 at 5:05 AM, Richard Elling <[EMAIL PROTECTED]> wrote:
> Ross wrote:
>>
>> Well, you're not alone in wanting to use ZFS and iSCSI like that, and in
>> fact my change request suggested that this is exactly one of the things that
>> could be addressed:
>>
>> "The idea is really a two-stage RFE, since just the first part would have
>> benefits.  The key is to improve ZFS availability, without affecting its
>> flexibility, bringing it on par with traditional RAID controllers.
>>
>> A.  Track response times, allowing for lopsided mirrors and better
>> failure detection.
>
> I've never seen a study which shows, categorically, that disk or network
> failures are preceded by significant latency changes.  How do we get
> "better failure detection" from such measurements?


Not preceded by, as such, but a disk or network failure will certainly
cause significant latency changes.  If the hardware is down, there's
going to be a sudden and very large change in latency.  Sure, FMA
will catch most cases, but we've already shown that there are some
cases where it doesn't work too well (and I would argue that's always
going to be possible when you are relying on so many different types
of driver).  Tracking response times is there to ensure that ZFS can
handle *all* cases.
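
To make the idea concrete, here is a rough sketch of the kind of
response-time tracking I have in mind.  It is not ZFS code: the class
name, the smoothing factor, the multiplier and the floor are all
made-up values for illustration.  It keeps a running average of each
device's latency and flags a request as suspect once it has been
outstanding far longer than that device normally takes.

```python
from dataclasses import dataclass


@dataclass
class LatencyTracker:
    """Hypothetical per-device tracker; all defaults are assumptions."""
    alpha: float = 0.2     # weight given to the newest sample
    factor: float = 10.0   # "this many times slower than usual" is suspect
    floor_s: float = 0.5   # never flag a request younger than this
    mean_s: float = 0.0    # exponentially weighted mean latency, in seconds
    samples: int = 0

    def record(self, latency_s: float) -> None:
        """Fold one completed request's latency into the running mean."""
        if self.samples == 0:
            self.mean_s = latency_s
        else:
            self.mean_s = (self.alpha * latency_s
                           + (1 - self.alpha) * self.mean_s)
        self.samples += 1

    def threshold(self) -> float:
        """How long a request may be outstanding before it looks wrong."""
        return max(self.floor_s, self.factor * self.mean_s)

    def is_suspect(self, outstanding_s: float) -> bool:
        """True if a request has been pending abnormally long."""
        if self.samples == 0:
            return False  # no history yet, so nothing to compare against
        return outstanding_s > self.threshold()


tracker = LatencyTracker()
for _ in range(5):
    tracker.record(0.01)           # a healthy device answering in ~10 ms

slow_but_alive = tracker.is_suspect(0.2)
looks_dead = tracker.is_suspect(30.0)
```

The floor is there so that an ordinary slow request on a normally fast
device does not count as a failure; only the sudden, very large jump
does.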


>>  Many people have requested this since it would facilitate remote live
>> mirrors.
>>
>
> At a minimum, something like VxVM's preferred plex should be reasonably
> easy to implement.
>
>> B.  Use response times to timeout devices, dropping them to an interim
>> failure mode while waiting for the official result from the driver.  This
>> would prevent redundant pools hanging when waiting for a single device."
>>
>
> I don't see how this could work except for mirrored pools.  Would that
> carry enough market to be worthwhile?
> -- richard

I have to admit, I've not tested this with a raid-z pool, but since
all ZFS commands hung when my iSCSI device went offline, I assumed
that you would get the same effect of the pool hanging if a raid-z2
pool is waiting for a response from a device.  Mirrored pools do work
particularly well with this, since it gives you the potential to have
remote mirrors of your data, but if you had a raid-z2 pool, you still
wouldn't want it hanging because a single device failed.
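
As a rough illustration of part B for the mirrored case, the sketch
below reads from both sides of a mirror and gives up waiting on any
side that has not answered within a timeout.  Everything in it is
hypothetical (the Device class, the state names, the timeout value);
it models the behaviour I am asking for, not how ZFS or any driver
works today, and it does not cover raid-z reconstruction.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

ONLINE = "ONLINE"
SUSPECT = "SUSPECT"   # interim state while the driver has not yet decided


class Device:
    """Hypothetical mirror side; delay_s stands in for a slow or hung driver."""

    def __init__(self, name, delay_s, data=b"block"):
        self.name = name
        self.delay_s = delay_s
        self.data = data
        self.state = ONLINE

    def read(self):
        time.sleep(self.delay_s)
        return self.data


def mirror_read(devices, timeout_s):
    """Return data from any side that answers in time.

    Sides that have not answered after timeout_s are marked SUSPECT so
    that later reads skip them until the driver reports a final result.
    """
    pool = ThreadPoolExecutor(max_workers=len(devices))
    futures = {pool.submit(d.read): d for d in devices if d.state == ONLINE}
    done, pending = wait(list(futures), timeout=timeout_s)
    for fut in pending:
        futures[fut].state = SUSPECT
    pool.shutdown(wait=False)  # do not block on the slow side
    for fut in done:
        if fut.exception() is None:
            return fut.result()
    raise IOError("no mirror side answered within the timeout")


local = Device("local-disk", delay_s=0.01)
remote = Device("iscsi-target", delay_s=1.0)   # simulated unresponsive side
data = mirror_read([local, remote], timeout_s=0.2)
```

Here the read completes from the local side after the timeout expires,
and the remote side sits in the interim state instead of hanging the
whole pool.  When the driver finally reports, the device would either
go back to ONLINE and resync, or be faulted properly.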

I will go and test the raid scenario though on a current build, just to be sure.
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
