On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
> 
> I actually started with DKIOCGSTATE as my first approach, modifying
> sd.c.  But I found that nothing was issuing this ioctl properly except
> for removable/hotpluggable media (and the SAS/SATA
> controllers/frameworks are not indicating this).  I tried overriding
> that in sd.c, but I then hit another bug: the HAL module that does the
> monitoring does not monitor devices that are present and in use
> (mounted filesystems) during boot.  I think HAL was designed for
> removable media that would not be automatically mounted by ZFS during
> boot.  I didn't analyze this further.

ZFS issues the ioctl() from vdev_disk.c.  It is up to the HBA drivers to
correctly represent the DEV_GONE state (this is known to work with a variety
of SATA drivers).
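
For reference, the check in vdev_disk.c looks roughly like this (paraphrased
from memory, so treat it as a sketch rather than a verbatim quote; note that
the ioctl is spelled DKIOCSTATE in <sys/dkio.h>):

    /*
     * Sketch of vdev_disk.c:vdev_disk_io_done().  On an EIO, ZFS asks
     * the disk driver for its state; anything other than DKIO_INSERTED
     * triggers an asynchronous removal of the vdev.
     */
    static void
    vdev_disk_io_done(zio_t *zio)
    {
            vdev_t *vd = zio->io_vd;

            if (zio->io_error == EIO && !vd->vdev_remove_wanted) {
                    vdev_disk_t *dvd = vd->vdev_tsd;
                    int state = DKIO_NONE;

                    if (ldi_ioctl(dvd->vd_lh, DKIOCSTATE, (intptr_t)&state,
                        FKIOCTL, kcred, NULL) == 0 &&
                        state != DKIO_INSERTED) {
                            vd->vdev_remove_wanted = B_TRUE;
                            spa_async_request(zio->io_spa, SPA_ASYNC_REMOVE);
                    }
            }
    }

If the driver stack never reports anything other than DKIO_INSERTED, that
check never fires, which would match what you're seeing.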

> Is "sd.c" considered a legacy driver?  Its what is responsible for the
> vast majority of disks.  That said, perhaps the problem is the HBA
> drivers?

It's the HBA drivers.
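
Concretely, the state that has to make it up through sd is the dkio_state
value from <sys/dkio.h> (comments mine, and approximate):

    enum dkio_state {
            DKIO_NONE,          /* return the device's current state */
            DKIO_EJECTED,       /* media has been ejected */
            DKIO_INSERTED,      /* media present and ready */
            DKIO_DEV_GONE       /* the device itself is gone */
    };

Unless the HBA driver/framework lets sd report DKIO_DEV_GONE for a yanked
disk, ZFS has no way to tell a removal from an ordinary I/O error.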

> So how do we distinguish "removed on purpose" from "removed by
> accident, faulted cable, or other non-administrative issue"?  I presume
> that a removal initiated via cfgadm or some other tool could put the
> ZFS vdev into an offline state, and this would prevent the logic from
> accidentally marking the device FAULTED.  (Ideally it would also mark
> the device "REMOVED".)

If there is no physical connection (detected to the best of the driver's
ability), then the device is REMOVED, which is different from OFFLINE.
Surprise device removal is not a fault: Solaris is designed to support removal
of disks at any time without administrative intervention.  A fault is defined
as broken hardware, which is not the case for a removed device.
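
For reference, the distinction is encoded in vdev_state_t in <sys/fs/zfs.h>
(comments paraphrased from memory):

    typedef enum vdev_state {
            VDEV_STATE_UNKNOWN = 0, /* uninitialized vdev */
            VDEV_STATE_CLOSED,      /* not currently open */
            VDEV_STATE_OFFLINE,     /* administratively not allowed to open */
            VDEV_STATE_REMOVED,     /* physically removed from the system */
            VDEV_STATE_CANT_OPEN,   /* tried to open, but failed */
            VDEV_STATE_FAULTED,     /* external request to fault the device */
            VDEV_STATE_DEGRADED,    /* replicated vdev, unhealthy children */
            VDEV_STATE_HEALTHY      /* presumed good */
    } vdev_state_t;

OFFLINE is the administrative state you get from 'zpool offline'; REMOVED is
what the kernel sets when the removal path in vdev_disk.c fires.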

There are projects underway to a) generate faults for devices that are
physically present but unable to attach, and b) perform topology-based
diagnosis to detect bad cables, expanders, etc.  This is a complicated problem
and not always tractable, but it can be solved reasonably well for modern
systems and transports.

A completely orthogonal feature is the ability to represent extended periods of 
device removal as a defect.  While removing a disk is not itself a defect, 
leaving your pool running minus one disk for hours/days/weeks is clearly broken.

If you have a solution that correctly detects devices as REMOVED for a new
class of HBAs/drivers, that'd be more than welcome.  If you choose to
represent missing devices as faulted in your own third-party system, that's
your prerogative, but it's not the current Solaris FMA model.

Hope that helps,

- Eric

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock
