On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
>
> I actually started with DKIOCGSTATE as my first approach, modifying
> sd.c. But I had problems because what I found is that nothing was
> issuing this ioctl properly except for removable/hotpluggable media (and
> the SAS/SATA controllers/frameworks are not indicating this). I tried
> overriding that in sd.c, but I still found that there was another bug
> where the HAL module that does the monitoring does not monitor devices
> that are present and in use (mounted filesystems) during boot. I think
> HAL was designed for removable media that would not be automatically
> mounted by ZFS during boot. I didn't analyze this further.
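[Editor's note, for reference: the removal state under discussion is reported by
the DKIOCSTATE ioctl (see dkio(7I) and <sys/dkio.h>), which returns
DKIO_DEV_GONE once the underlying device has disappeared. The sketch below
is illustrative only: the device path, polling loop, and userland framing
are assumptions, and ZFS issues the equivalent ioctl in-kernel from
vdev_disk.c rather than from a program like this.]

/*
 * Illustrative sketch: poll a disk's insert/removal state with the
 * DKIOCSTATE ioctl (dkio(7I)).  The default device path is a placeholder.
 */
#include <sys/types.h>
#include <sys/dkio.h>
#include <stropts.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/rdsk/c0t0d0s0";
    enum dkio_state state = DKIO_NONE;
    int fd;

    /* O_NDELAY lets the open succeed even if no medium is present. */
    if ((fd = open(dev, O_RDONLY | O_NDELAY)) < 0) {
        perror(dev);
        return (1);
    }

    /*
     * Passing the last-seen state asks the driver to block until the
     * state changes; starting from DKIO_NONE typically returns the
     * current state immediately.
     */
    for (;;) {
        if (ioctl(fd, DKIOCSTATE, &state) < 0) {
            perror("DKIOCSTATE");
            break;
        }
        if (state == DKIO_DEV_GONE) {
            (void) printf("%s: DKIO_DEV_GONE (removed)\n", dev);
            break;
        }
        (void) printf("%s: state %d\n", dev, (int)state);
        (void) sleep(1);    /* avoid spinning if the ioctl is non-blocking */
    }

    (void) close(fd);
    return (0);
}

[Whether this ever reports DKIO_DEV_GONE depends on the HBA driver
propagating the removal correctly, which is the crux of the reply below.]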
ZFS issues the ioctl() from vdev_disk.c. It is up to the HBA drivers to
correctly represent the DEV_GONE state (and this is known to work with a
variety of SATA drivers).

> Is "sd.c" considered a legacy driver? It's what is responsible for the
> vast majority of disks. That said, perhaps the problem is the HBA
> drivers?

It's the HBA drivers.

> So how do we distinguish "removed on purpose" as opposed to "removed by
> accident, faulted cable, or other non-administrative issue?" I presume
> that a removal initiated via cfgadm or some other tool could put the ZFS
> vdev into an offline state, and this would prevent the logic from
> accidentally marking the device FAULTED. (Ideally it would also mark
> the device "REMOVED".)

If there is no physical connection (detected to the best of the driver's
ability), then the device is removed (REMOVED is different from OFFLINE).
Surprise device removal is not a fault; Solaris is designed to support
removal of disks at any time without administrative intervention. A fault
is defined as broken hardware, which is not the case for a removed device.
There are projects underway to a) generate faults for devices that are
physically present but unable to attach, and b) do topology-based diagnosis
to detect bad cables, expanders, etc. This is a complicated problem and not
always tractable, but it can be solved reasonably well for modern systems
and transports.

A completely orthogonal feature is the ability to represent extended
periods of device removal as a defect. While removing a disk is not itself
a defect, leaving your pool running minus one disk for hours/days/weeks is
clearly broken.

If you have a solution that correctly detects devices as REMOVED for a new
class of HBAs/drivers, that'd be more than welcome. If you choose to
represent missing devices as faulted in your own third-party system, that's
your prerogative, but it's not the current Solaris FMA model.

Hope that helps,

- Eric

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock