Alexander Motin wrote on 23.02.2010 18:35 (localtime):
...
One question for my understanding: if the drive doesn't complete a command,
regardless of whether it's due to a firmware bug, a disk surface error or
whatever, is there no way for the driver to terminate the request and
take the drive offline after some time? This would be very important
behaviour for me. It doesn't make sense to build RAIDz storage when a
failing drive hangs the complete machine, even if the system partitions
are on a completely different SSD.

That's what timeouts are for. When a timeout is detected, the driver resets
the device and reports an error to the upper layer. After receiving the
error, CAM reinitializes the device. If the device is completely dead,
reinitialization will fail and the device will be dropped immediately. If
the device is still alive, reinitialization succeeds and CAM retries the
command. If all retries fail, the error is reported to the GEOM layer and
then possibly to the file system. I have no idea how RAIDZ behaves in such
a case. Maybe after a few such errors it should drop that device from the
array.
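
The sequence described above can be sketched as a small toy model. This is only an illustration of the described flow, not the actual CAM or driver code: the function name, the outcome names and the retry budget (max_retries) are all hypothetical and chosen for this example.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"          # command finished on some attempt
    DEVICE_DROPPED = "dropped"       # reinitialization failed, device removed
    ERROR_TO_GEOM = "error_to_geom"  # retries exhausted, error passed upward


def run_command(attempt_results, reinit_results, max_retries=4):
    """Toy model of the timeout handling described above.

    attempt_results: iterable of bools, True if the command completes in time
    reinit_results:  iterable of bools, True if reinitialization succeeds
    max_retries:     hypothetical retry budget (not a real CAM default)
    """
    attempts = iter(attempt_results)
    reinits = iter(reinit_results)
    for _ in range(1 + max_retries):
        if next(attempts, False):
            return Outcome.COMPLETED
        # Timeout: the driver resets the device and reports an error;
        # the upper layer then tries to reinitialize the device.
        if not next(reinits, False):
            # Completely dead device: it is dropped immediately.
            return Outcome.DEVICE_DROPPED
        # Device is still alive: fall through and retry the command.
    # Every attempt timed out but the device kept coming back.
    return Outcome.ERROR_TO_GEOM
```

In this model, a drive that always times out but always reinitializes successfully ends in ERROR_TO_GEOM, and it only gets there after every retry has waited out its own timeout.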

A timeout is the worst possible case for any device, as it takes too much
time and doesn't give any recovery information. The half-dead case is the
worst possible kind of timeout. It is difficult to say which way is
better: drop the last drive from a degraded array and lose all info, or
retry forever. There is probably no right answer.

I see. Thanks a lot for the clarification.
Before getting the machine on site I did some ZFS tests, like removing one disk while a cvs checkout was running. As far as I remember, ZFS didn't show the removed drive as offline, but there was no hang. The pool was degraded, and after reinserting the drive and rebooting I could resilver the pool. I couldn't manage to get it consistent without rebooting, but I accepted that since I would have to go on site to change the drive anyway. I'll restore the default vfs.zfs.txg.timeout=30, so the hang can be easily reproduced, and see if I can 'camcontrol stop' the drive. Do you think I can get useful information from that test?

Thanks,

-Harry
