On Thu, 2010-06-17 at 17:53 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
> >
> > I actually started with DKIOCSTATE as my first approach, modifying
> > sd.c.  But I ran into problems: nothing was issuing this ioctl
> > properly except for removable/hotpluggable media (and the SAS/SATA
> > controllers/frameworks are not indicating this).  I tried overriding
> > that in sd.c, but I found another bug: the HAL module that does the
> > monitoring does not monitor devices that are present and in use
> > (mounted filesystems) during boot.  I think HAL was designed for
> > removable media that would not be automatically mounted by zfs
> > during boot.  I didn't analyze this further.
>
> ZFS issues the ioctl() from vdev_disk.c.  It is up to the HBA drivers
> to correctly represent the DEV_GONE state (and this is known to work
> with a variety of SATA drivers).
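For reference, the check Eric describes boils down to the disk driver's
insert/remove state ioctl.  A minimal userland sketch that exercises the
same path is below -- the device path is only an example, and a driver
that doesn't implement the state ioctl for fixed disks will simply fail
it, which is more or less the gap I was describing above:

/*
 * Sketch only: query a disk's insert/remove state via DKIOCSTATE, the
 * ioctl vdev_disk.c issues (via ldi_ioctl) when an I/O fails.  Passing
 * DKIO_NONE returns the current state right away; passing the last
 * known state instead would block until the state changes.
 */
#include <sys/types.h>
#include <sys/dkio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stropts.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        const char *dev = (argc > 1) ? argv[1] : "/dev/rdsk/c0t0d0s0";
        enum dkio_state state = DKIO_NONE;
        int fd;

        if ((fd = open(dev, O_RDONLY | O_NDELAY)) < 0) {
                perror("open");
                return (1);
        }
        if (ioctl(fd, DKIOCSTATE, &state) != 0) {
                perror("DKIOCSTATE");   /* not supported by this driver? */
                (void) close(fd);
                return (1);
        }
        (void) printf("%s: %s\n", dev,
            state == DKIO_DEV_GONE ? "device gone" :
            state == DKIO_INSERTED ? "inserted" : "no media / unknown");
        (void) close(fd);
        return (0);
}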
So maybe the problem is the SAS adapters I'm dealing with (LSI).  This
is not an imaginary problem -- if there is a better (more correct)
solution, then I'd like to use it.  Right now it probably is not
reasonable for me to fix every HBA driver (I can't, as I don't have
source code to a number of them).  Actually, the problem *might* be the
MPxIO vHCI layer....

> > Is "sd.c" considered a legacy driver?  It's what is responsible for
> > the vast majority of disks.  That said, perhaps the problem is the
> > HBA drivers?
>
> It's the HBA drivers.

Ah, so you need to see CMD_DEV_GONE on the transport layer.
Interesting.  I don't think the SAS drivers are doing this.  (A rough
sketch of what that would look like from the HBA side is further down.)

> > So how do we distinguish "removed on purpose" from "removed by
> > accident, faulted cable, or other non-administrative issue"?  I
> > presume that a removal initiated via cfgadm or some other tool could
> > put the ZFS vdev into an offline state, and this would prevent the
> > logic from accidentally marking the device FAULTED.  (Ideally it
> > would also mark the device "REMOVED".)
>
> If there is no physical connection (detected to the best of the
> driver's ability), then it is removed (REMOVED is different from
> OFFLINE).  Surprise device removal is not a fault -- Solaris is
> designed to support removal of disks at any time without
> administrative intervention.  A fault is defined as broken hardware,
> which is not the case for a removed device.

So how do you diagnose the situation where someone trips over a cable,
or where the drive was bumped and detached from the cable?  I guess I'm
OK with the idea that these are in a REMOVED state, but I'd like the
messaging to say something besides "the administrator has removed the
device" or somesuch (which is what it says now).  Clearly that's not
what happened.

It gets more interesting with other kinds of transports.  For example,
with iSCSI or some other transport (I worked with ATA-over-Ethernet at
one point), if the remote node goes belly up, or the network is lost,
it's clearly not the case that the device was "removed".  The situation
here is a device that you can't talk to.  I'd argue it's a FAULT.

For busses like 1394 or USB, where the typical use is at a desktop
where folks just plug in/out all the time, I don't see this as such a
problem.  But for enterprise-grade storage, I have higher expectations.

> There are projects underway to a) generate faults for devices that
> are physically present but unable to attach, and b) do topology-based
> diagnosis to detect bad cables, expanders, etc.  This is a complicated
> problem and not always tractable, but it can be solved reasonably well
> for modern systems and transports.

I think there are a significant number of cases where you can't tell
the difference between a unit dying, a bad cable, and a disconnected
cable.  With some special magnetics, you might be able to use
time-domain reflectometry to diagnose things, but this requires unusual
hardware and is clearly outside the normal scope of things we're
dealing with.  (Interestingly, some ethernet PHYs have this
capability.)

> A completely orthogonal feature is the ability to represent extended
> periods of device removal as a defect.  While removing a disk is not
> itself a defect, leaving your pool running minus one disk for
> hours/days/weeks is clearly broken.

Agreed that this is orthogonal.  I'd say that this is best handled via
stronger handling of the DEGRADED state.
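Coming back to the CMD_DEV_GONE point above: what I'd need from the SAS
HBA drivers is, roughly, for their command-completion path to flag
packets for a departed target with that pkt_reason.  A minimal sketch
of the shape of that code -- the function and flag names are invented
for illustration, this is not taken from mpt or any other real driver:

/*
 * Sketch only: how an HBA driver reports "device gone" per packet.
 * xhba_cmd_done() and target_gone are invented names; a real driver
 * would decide target_gone from its own topology/firmware events.  The
 * target driver (sd, ses, ...) then sees pkt_reason == CMD_DEV_GONE on
 * completion and can propagate the removal upward.
 */
#include <sys/scsi/scsi.h>

static void
xhba_cmd_done(struct scsi_pkt *pkt, boolean_t target_gone)
{
        if (target_gone) {
                /* Target is no longer present on the transport. */
                pkt->pkt_reason = CMD_DEV_GONE;
                pkt->pkt_state = 0;
        } else {
                pkt->pkt_reason = CMD_CMPLT;
                pkt->pkt_state |= (STATE_GOT_BUS | STATE_GOT_TARGET |
                    STATE_SENT_CMD | STATE_GOT_STATUS);
        }

        /* Hand the packet back to the target driver's completion routine. */
        if (pkt->pkt_comp != NULL)
                (*pkt->pkt_comp)(pkt);
}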
> If you have a solution that correctly detects devices as REMOVED for
> a new class of HBAs/drivers, that'd be more than welcome.  If you
> choose to represent missing devices as faulted in your own
> third-party system, that's your own prerogative, but it's not the
> current Solaris FMA model.

I can certainly flag the device as REMOVED rather than FAULTED,
although I think that will require some extra changes to libzfs.  (A
new zpool_vdev_removed or vdev_unreachable function or somesuch -- a
rough sketch of what that might look like is appended at the end of
this message.)  My point here is that I'm willing to refine the work so
that it helps folks.  What's important to my mind is two things:

a) when a unit is removed, a spare is recruited to replace it if one is
   available (i.e. zfs-retire needs to work);

b) ideally, this should be logged/handled in some manner
   asynchronously, so that if such an event has occurred, it does not
   come as a surprise to the administrator 2 weeks after the fact when
   the *2nd* unit dies or is removed.

It's that last point "b" that makes me feel less good about "REMOVED".
The current code seems to assume that removal is always intentional,
and therefore no further notification is needed.  But when a disk stops
answering SCSI commands, it may indicate an unplanned device failure.

One other thought -- I think ZFS should handle this in a manner such
that the behavior appears to the administrator to be the same,
regardless of whether I/O was occurring on the unit or not.  An
interesting question is what happens if I yank a drive while there are
outstanding commands pending?  Those commands should time out at the
HBA, but will it report them as CMD_DEV_GONE, or will it report an
error that causes a fault to be flagged?

        - Garrett

> - Eric
>
> --
> Eric Schrock, Fishworks            http://blogs.sun.com/eschrock
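As promised above, a very rough sketch of what a libzfs entry point for
this could look like.  To be clear, zpool_vdev_removed() is just the
placeholder name used above and does not exist today; the sketch is
modeled on the existing vdev state-setting calls in libzfs_pool.c
(which go through ZFS_IOC_VDEV_SET_STATE), and it assumes a matching
kernel-side case for VDEV_STATE_REMOVED in zfs_ioc_vdev_set_state(),
which would also have to be added:

/*
 * Hypothetical only: zpool_vdev_removed() does not exist in libzfs.
 * Modeled on the existing vdev state calls; intended to live inside
 * libzfs_pool.c, so it uses the library's private handle and ioctl
 * helpers.  Assumes the kernel's zfs_ioc_vdev_set_state() is taught to
 * accept VDEV_STATE_REMOVED as a cookie.
 */
int
zpool_vdev_removed(zpool_handle_t *zhp, uint64_t guid)
{
        zfs_cmd_t zc = { 0 };
        char msg[1024];
        libzfs_handle_t *hdl = zhp->zpool_hdl;

        (void) snprintf(msg, sizeof (msg), dgettext(TEXT_DOMAIN,
            "cannot mark vdev %llu removed"), (u_longlong_t)guid);

        (void) strlcpy(zc.zc_name, zhp->zpool_name, sizeof (zc.zc_name));
        zc.zc_guid = guid;
        zc.zc_cookie = VDEV_STATE_REMOVED;      /* not FAULTED */

        if (zfs_ioctl(hdl, ZFS_IOC_VDEV_SET_STATE, &zc) == 0)
                return (0);

        return (zpool_standard_error(hdl, errno, msg));
}

The consumer (zfs-retire or the diagnosis engine) would then decide
whether to call this or the existing fault path, depending on how the
removal was detected.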