On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:

>
> So how do you diagnose the situation where someone trips over a cable,
> or where the drive was bumped and detached from the cable? I guess I'm
> OK with the idea that these are in a REMOVED state, but I'd like the
> messaging to say something besides "the administrator has removed the
> device" or somesuch (which is what it says now). Clearly that's not
> what happened.

Are you requesting that we diagnose the difference between tripping over a cable and intentionally unplugging it? That's clearly beyond any software's ability to diagnose. On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.

> a) when a unit is removed, a spare is recruited to replace it if one is
> available. (I.e. zfs-retire needs to work.)

This is handled by the REMOVED state, as zfs-retire subscribes to resource.removed.

> b) ideally, this should be logged/handled in some manner asynchronously,
> so that if such an event has occurred, it does not come as a surprise to
> the administrator 2 weeks after the fact when the *2nd* unit dies or is
> removed.

These are logged as alerts in the SS7000. The first-class notion of a Solaris alert is not new, and has been proposed in the past as part of FMA work. The FMA team is currently working on a project that will introduce some of the underlying infrastructure to formalize alerts in Solaris. These events (the primitives are not called alerts) represent formalized things of interest that are not directly related to a fault or defect. That, along with the ability to diagnose a defect over extended periods of removal, is the correct way to represent this situation.

> It's that last point "b" that makes me feel less good about "REMOVED".
> The current code seems to assume that removal is always intentional, and
> therefore no further notification is needed. But when a disk stops
> answering SCSI commands, it may indicate an unplanned device failure.

There are many, many failure modes that can be distinguished just fine from physical device removal. For example, you can have a PHY up but the attached device completely unresponsive, but you know there is a device there. Or you can look at the SES data to determine physical presence.
Converting all hotplug events into faults is too broad a brush here.

> One other thought -- I think ZFS should handle this in a manner such
> that the behavior appears to the administrator to be the same,
> regardless of whether I/O was occurring on the unit or not.
>
> An interesting question is what happens if I yank a drive while there
> are outstanding commands pending? Those commands should time out at the
> HBA, but will it report them as CMD_DEV_GONE, or will it report an error
> causing a fault to be flagged?

This is detected as device removal. There is a timeout associated with I/O errors in zfs-diagnosis that gives some grace period to detect removal before declaring a disk faulted.

- Eric

--
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss