On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:
> 
> So how do you diagnose the situation where someone trips over a cable,
> or where the drive was bumped and detached from the cable?  I guess I'm
> OK with the idea that these are in a REMOVED state, but I'd like the
> messaging to say something besides "the administrator has removed the
> device" or somesuch (which is what it says now).  Clearly that's not
> what happened.

Are you requesting that we diagnose the difference between tripping over a 
cable and intentionally unplugging it?  That's clearly beyond any software's 
ability to diagnose.

On the SS7000 series, you get an alert that the enclosure has been detached 
from the system.  The fru-monitor code (a generalization of the disk-monitor) 
that generates this sysevent has not yet been pushed to ON.

> a) when a unit is removed, a spare is recruited to replace it if one is
> available.  (I.e. zfs-retire needs to work.)

This is handled by the REMOVED state, as zfs-retire subscribes to 
resource.removed.
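
To make the subscription model concrete, here is a minimal sketch, not the 
actual zfs-retire code.  EventBus, Pool and retire() are hypothetical names 
made up for illustration; only the "resource.removed" event class comes from 
the description above.

```python
from collections import defaultdict


class EventBus:
    """Toy publish/subscribe bus (hypothetical, stands in for the event transport)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_class, handler):
        self._subscribers[event_class].append(handler)

    def publish(self, event_class, payload):
        # Deliver the payload to every handler registered for this class.
        for handler in self._subscribers[event_class]:
            handler(payload)


class Pool:
    """Toy pool holding active devices and a list of hot spares."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)

    def retire(self, device):
        # Swap in the first available spare; return None if nothing was done.
        if device not in self.active or not self.spares:
            return None
        spare = self.spares.pop(0)
        self.active[self.active.index(device)] = spare
        return spare


bus = EventBus()
pool = Pool(active=["disk0", "disk1"], spares=["spare0"])
# The retire agent reacts to removal events, not only to faults.
bus.subscribe("resource.removed", lambda event: pool.retire(event["device"]))
bus.publish("resource.removed", {"device": "disk1"})
```

The point of the sketch is only that spare recruitment hangs off the removal 
event itself, so no fault has to be diagnosed first.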

> b) ideally, this should be logged/handled in some manner asynchronously,
> so that if such an event has occurred, it does not come as a surprise to
> the administrator 2 weeks after the fact when the *2nd* unit dies or is
> removed.

These are logged as alerts in the SS7000.  The first-class notion of a Solaris 
alert is not new, and has been proposed in the past as part of FMA work.  The 
FMA team is currently working on a project that will introduce some of the 
underlying infrastructure to formalize alerts in Solaris.  These events (the 
primitives are not called alerts) represent formalized things of interest that 
are not directly related to a fault or defect.  That, along with the ability to 
diagnose a defect over extended periods of removal, is the correct way to 
represent this situation.

> It's that last point "b" that makes me feel less good about "REMOVED".
> The current code seems to assume that removal is always intentional, and
> therefore no further notification is needed.  But when a disk stops
> answering SCSI commands, it may indicate an unplanned device failure.

There are many, many failure modes that can be distinguished just fine from 
physical device removal.  For example, a PHY can be up while the attached 
device is completely unresponsive, so you still know there is a device there.  
Or you can look at the SES data to determine physical presence.  Converting 
all hotplug events into faults is too broad a brush here.
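
As a rough illustration of that distinction (a sketch under assumptions, not 
how any real diagnosis engine is written), the two presence signals can be 
combined like this.  The function name, its parameters and the returned 
labels are all hypothetical.

```python
def classify_slot(phy_up, ses_present, responds):
    """Coarse verdict for one disk slot.

    phy_up      -- link to the device is up
    ses_present -- SES reports a device in the slot (None if no SES data)
    responds    -- device answers commands
    """
    if responds:
        return "healthy"
    # Either signal is enough to say a device is physically there.
    present = phy_up or bool(ses_present)
    if present:
        # Present but silent: a candidate fault, not a removal.
        return "unresponsive"
    # No link and no physical presence: treat as removal.
    return "removed"
```

Only the last case is a hotplug removal; the middle case is the one that 
deserves a fault.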

> One other thought -- I think ZFS should handle this in a manner such
> that the behavior appears to the administrator to be the same,
> regardless of whether I/O was occurring on the unit or not.
> 
> An interesting question is what happens if I yank a drive while there
> are outstanding commands pending?  Those commands should time out at the
> HBA, but will it report them as CMD_DEV_GONE, or will it report an error
> causing a fault to be flagged?

This is detected as device removal.  There is a timeout associated with I/O 
errors in zfs-diagnosis that gives some grace period to detect removal before 
declaring a disk faulted.
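
The grace period can be pictured roughly as follows.  This is a minimal 
sketch, not the zfs-diagnosis implementation: the class, its method names and 
the 10-second value are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

GRACE_SECONDS = 10.0  # hypothetical; the real timeout is not given here


@dataclass
class PendingDiagnosis:
    """Defers the fault verdict so a removal event can arrive first."""

    grace: float = GRACE_SECONDS
    first_error: dict = field(default_factory=dict)  # device -> time of first I/O error
    removed: set = field(default_factory=set)

    def io_error(self, device, now):
        # Only the first error starts the grace period.
        self.first_error.setdefault(device, now)

    def removal(self, device):
        # A removal event explains the errors, so drop the pending fault.
        self.removed.add(device)
        self.first_error.pop(device, None)

    def verdict(self, device, now):
        if device in self.removed:
            return "REMOVED"
        started = self.first_error.get(device)
        if started is None:
            return "OK"
        if now - started < self.grace:
            return "PENDING"
        return "FAULTED"
```

With this shape, I/O errors from a yanked drive sit in PENDING until either 
the removal is seen (REMOVED) or the grace period runs out (FAULTED).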

- Eric

--
Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
