Ross Smith wrote:
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
> - returns data ok
> - doesn't return data ok
> 
> And for the state where it's not returning data, you can again split
> that in two:
> - returns wrong data
> - doesn't return data
The state under discussion in this thread is "the I/O requested by ZFS hasn't
finished after 60, 120, 180, 3600, etc. seconds".

The pool is waiting (for device timeouts) to distinguish between the first two 
states.

More accurate state descriptions are:
- The I/O has returned data
- The I/O hasn't yet returned data and the user (admin) is justifiably 
impatient.

For the first state, the data is either correct (verified by the ZFS checksums, 
or ESUCCESS on write) or incorrect and retried.

> 
> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
> 
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
> 
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:
zpool-device-timeout
zpool-device-timeout-fail-delay

with specific per-VDEV overrides possible:
vdev-device-timeout and vdev-device-fail-delay

This would allow, but not require, slower VDEVs to be tuned specifically for that
case without hindering the default pool behavior on the local fast disks.
Specifically, consider the case where I'm using mirrored VDEVs with one half over
iSCSI, and want the iSCSI retry logic to still apply.  Writes that failed while
the iSCSI link is down would have to be resilvered, but at least reads would
switch to the local devices faster.

Set them to the default magic "0" value and the system uses the current
behavior of relying on the device drivers to report failures.
Set them to a number (probably in ms) and the pool would treat an I/O that takes
longer than that as "returned invalid data".
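
For example (hypothetical syntax, property names and device names, purely to
illustrate the idea):

  # zpool set zpool-device-timeout=5000 tank           pool-wide default: 5 seconds
  # zpool set vdev-device-timeout=30000 tank c4t0d0    the iSCSI half of the mirror: 30 s
  # zpool set zpool-device-timeout=0 tank              back to today's driver-driven behavior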

When the FMA work discussed below is done, these could be augmented by the pool's
"best heuristic guess" as to what the proper timeouts should be, which could be
saved in (kstat?) vdev-device-autotimeout.

If you set the timeout to the magic "-1" value, the pool would use 
vdev-device-autotimeout.

All that would be required is to give the I/O that caused the disk to take a long
time a deadline of (now + (vdev-device-timeout ?: (zpool-device-timeout ?:
forever)))*, and to consider the I/O complete with whatever data has returned
once that deadline passes: whether that's a bunch of 0's on a read, which would
fail the checksum, or a partially-completed write that would have to be
committed somewhere else.
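
If it helps to picture it, the deadline selection might look roughly like this
in C (a sketch only, not real ZFS code; the names are the hypothetical tunables
from above, and times are assumed to be in milliseconds):

  /*
   * Sketch only.  vdev-device-timeout, zpool-device-timeout and
   * vdev-device-autotimeout are the hypothetical tunables from this
   * thread.  0 = current behavior (trust the driver), -1 = autotimeout.
   */
  #define TIMEOUT_DEFAULT   0
  #define TIMEOUT_AUTO     -1
  #define DEADLINE_NONE    -1LL          /* never time this I/O out */

  long long
  io_deadline_ms(long long now_ms, long vdev_timeout, long pool_timeout,
      long autotimeout)
  {
          /* the per-vdev override wins; otherwise fall back to the pool default */
          long t = (vdev_timeout != TIMEOUT_DEFAULT) ? vdev_timeout : pool_timeout;

          /* the -1 magic: take the FMA-derived heuristic guess instead */
          if (t == TIMEOUT_AUTO)
                  t = autotimeout;

          /* both left at 0 (or no autotimeout yet): today's wait-forever behavior */
          if (t <= 0)
                  return (DEADLINE_NONE);

          /* past this point, treat the I/O as "returned invalid data" */
          return (now_ms + t);
  }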

Unfortunately, I'm not enough of a programmer to implement this.

--Joe
* With the -1 magic, the calculation would be a little more complicated.