Ross Smith wrote:
> My justification for this is that it seems to me that you can split
> disk behavior into two states:
>  - returns data ok
>  - doesn't return data ok
>
> And for the state where it's not returning data, you can again split
> that in two:
>  - returns wrong data
>  - doesn't return data
The state in discussion in this thread is "the I/O requested by ZFS
hasn't finished after 60, 120, 180, 3600, etc. seconds". The pool is
waiting (for device timeouts) to distinguish between the first two
states. More accurate state descriptions are:
 - The I/O has returned data
 - The I/O hasn't yet returned data, and the user (admin) is
   justifiably impatient.

For the first state, the data is either correct (verified by the ZFS
checksums, or ESUCCESS on write) or incorrect and retried.

> The first of these is already covered by ZFS with its checksums (with
> FMA doing the extra work to fault drives), so it's just the second
> that needs immediate attention, and for the life of me I can't think
> of any situation that a simple timeout wouldn't catch.
>
> Personally I'd love to see two parameters, allowing this behavior to
> be turned on if desired, and allowing timeouts to be configured:
>
> zfs-auto-device-timeout
> zfs-auto-device-timeout-fail-delay

I'd prefer these be set at the (default) pool level:

  zpool-device-timeout
  zpool-device-timeout-fail-delay

with specific per-VDEV overrides possible:

  vdev-device-timeout
  vdev-device-fail-delay

This would allow, but not require, slower VDEVs to be tuned
specifically for that case without hindering the default pool behavior
on the local fast disks. Specifically, consider the case where I'm
using mirrored VDEVs with one half over iSCSI, and want the iSCSI
retry logic to still apply. Writes that failed while the iSCSI link
was down would have to be resilvered, but at least reads would switch
to the local devices faster.

Set them to the default magic "0" value to have the system use the
current behavior of relying on the device drivers to report failures.
Set to a number (probably in ms) and the pool would consider an I/O
that takes longer than that as "returns invalid data".

When the FMA work discussed below is done, these could be augmented by
the pool's "best heuristic guess" as to what the proper timeouts
should be, which could be saved in (kstat?) vdev-device-autotimeout.
If you set the timeout to the magic "-1" value, the pool would use
vdev-device-autotimeout.

All that would be required is for the I/O that caused the disk to take
a long time to be given a deadline

  (now + (vdev-device-timeout ?: (zpool-device-timeout ?: forever)))*

and for the I/O to be considered complete with whatever data has
returned after that deadline: if that's a bunch of 0's on a read, it
would fail the checksum; a partially-completed write would have to be
committed somewhere else.

Unfortunately, I'm not enough of a programmer to implement this.

--Joe

* with the -1 magic, it would be a slightly more complicated
  calculation.
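
A minimal sketch of the deadline selection described above, assuming a
hypothetical vdev_timeouts_t holding the three knobs (the struct, field
names, and helper functions are illustrative only, not actual ZFS
interfaces):

  #include <stdint.h>

  #define TIMEOUT_DEFAULT    0          /* magic "0": rely on the device driver */
  #define TIMEOUT_AUTO      -1          /* magic "-1": use the FMA-derived autotimeout */
  #define DEADLINE_FOREVER  INT64_MAX

  /*
   * Hypothetical container for the knobs discussed above; a real
   * implementation would keep the pool-wide value on the pool and the
   * per-vdev values on the vdev.
   */
  typedef struct vdev_timeouts {
          int64_t vt_vdev_timeout;      /* vdev-device-timeout, in ms */
          int64_t vt_pool_timeout;      /* zpool-device-timeout, in ms */
          int64_t vt_autotimeout;       /* vdev-device-autotimeout, in ms */
  } vdev_timeouts_t;

  /*
   * Effective timeout:
   *   vdev-device-timeout ?: (zpool-device-timeout ?: forever)
   * with -1 replaced by the autotimeout (the footnote's "more
   * complicated calculation").  Returns -1 for "no deadline".
   */
  static int64_t
  effective_timeout_ms(const vdev_timeouts_t *vt)
  {
          int64_t ms = (vt->vt_vdev_timeout != TIMEOUT_DEFAULT) ?
              vt->vt_vdev_timeout : vt->vt_pool_timeout;

          if (ms == TIMEOUT_AUTO)
                  ms = vt->vt_autotimeout;
          return (ms > 0 ? ms : -1);
  }

  /* Deadline for a single I/O: now + effective timeout, or "forever". */
  static int64_t
  io_deadline_ns(const vdev_timeouts_t *vt, int64_t now_ns)
  {
          int64_t ms = effective_timeout_ms(vt);

          return (ms < 0 ? DEADLINE_FOREVER : now_ns + ms * 1000000LL);
  }

With both knobs left at the magic "0" this degenerates to "forever",
i.e. today's behavior of waiting on the device driver; a per-vdev
setting overrides the pool-wide default, e.g. for the iSCSI half of a
mirror.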