Folks - I'm preparing to submit the attached PSARC case to provide better support for device removal and insertion within ZFS. Since this is a rather complex issue, with a fair share of corner cases, I thought I'd send the proposal out to the ZFS community at large for further comment before submitting it.
The prototype is functional except for the offline device insertion and hot spares functionality. I hope to have this integrated within the next month, along with the next phase of FMA integration. Please respond with any comments, concerns, or suggestions.

Thanks,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

1. INTRODUCTION

Currently, ZFS supports what is affectionately known as "poor man's hotplug". If a device is removed from the system, then it is assumed that upon I/O failure, an attempt to reopen the same device will fail. This will trigger an FMA fault, substituting a hot spare if available. This is undesirable for two reasons:

    - There is no distinction between device removal and arbitrary failure. If a device is removed from the system, it should be treated as a deliberate action different from normal failure.

    - There is no support for automatic response to device insertion. For a server configured with a ZFS pool, the administrator should be able to walk up, remove any drive (preferably a faulted one), insert a new drive, and not have to issue any ZFS commands to reconfigure the pool. This is particularly true for the appliance space, where hardware reconfiguration should "just work".

This case enhances ZFS to respond to device removal and provides a mechanism to automatically deal with device insertion. While the framework is generic, the primary target is devices supported by the SATA framework. The only device-specific portion of this proposal concerns determining if a device is in the same "physical location" as a previously known device, which involves correlating a transport's enumeration of the device with the device's physical location within the chassis.

2. DEVICE REMOVAL

There are two types of device removal within Solaris. Coordinated device removal involves stopping all consumers of the device, using the appropriate cfgadm(1M) command (PSARC 1996/285), and then physically removing the device. Uncoordinated removal (also known as "surprise removal") is when a device is physically removed while still in active use by the system. The latter is increasingly common as more I/O protocols support hotplug and higher level software (ZFS) becomes more capable.

There are several ways to detect device removal within Solaris.
Fibre channel drivers generate the NDI events FCAL_INSERT_EVENT and FCAL_REMOVE_EVENT. USB and 1394 drivers generate the NDI events DDI_DEVI_INSERT_EVENT and DDI_DEVI_REMOVE_EVENT. In addition to these event channels, there is also the DKIOCSTATE ioctl() which returns (on capable drivers) DKIO_DEV_GONE if the device has been removed. Of these, the ioctl() is the most widely supported, and is the mechanism used as part of this case. Since this is an implementation detail of the current architecture, it does not preclude using alternate mechanisms in the future.

When an I/O to a disk fails, ZFS will query the media state via the DKIOCSTATE ioctl. If the device is in any state other than DKIO_INSERTED, ZFS will transition the device to a new REMOVED state. No FMA fault will be triggered, and a hot spare will be substituted if one is available. Note that DKIO_DEV_GONE can be returned for a variety of reasons (pulling cables, external chassis being powered off, etc.). In the absence of additional FMA information, it is assumed that this is an intentional administrative action.

As part of this work, lofiadm(1M) will be expanded to include a new force (-f) flag when removing devices. Combined with the upcoming lofi devfs events (PSARC 2006/709), this will provide a much simpler testing framework without the need for physical hardware interaction. When this flag is used, the underlying file will be closed, any further I/O or attempts to open the device will fail, and DKIOCSTATE will return DKIO_DEV_GONE. This flag will remain private for testing only, and will not be documented.
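The removal decision described above can be summarized in a few lines. The following is only an illustrative Python model of the policy, not the kernel implementation: the names MediaState, Outcome, and classify_io_failure are hypothetical, the enum values are placeholders rather than the real DKIO_* constants, and a media state of None stands in for a driver that does not support DKIOCSTATE (in which case the failure is assumed to go through normal FMA diagnosis).

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class MediaState(Enum):
    """Media states named after the DKIO_* values; numeric values are placeholders."""
    DKIO_NONE = auto()
    DKIO_EJECTED = auto()
    DKIO_INSERTED = auto()
    DKIO_DEV_GONE = auto()


@dataclass(frozen=True)
class Outcome:
    vdev_state: str       # hypothetical label for the resulting vdev state
    fma_fault: bool       # True if the failure is handed to FMA for diagnosis
    activate_spare: bool  # True if a hot spare is substituted immediately


def classify_io_failure(media_state: Optional[MediaState],
                        spare_available: bool) -> Outcome:
    """Model the policy applied after an I/O to a disk has failed."""
    # Media still present, or the driver cannot report media state:
    # this is not a removal, so leave the diagnosis (and any spare
    # activation) to the standard FMA path.
    if media_state is None or media_state is MediaState.DKIO_INSERTED:
        return Outcome("UNCHANGED", fma_fault=True, activate_spare=False)
    # Any other state is treated as a removal: no FMA fault is raised,
    # and a hot spare is substituted if one is available.
    return Outcome("REMOVED", fma_fault=False, activate_spare=spare_available)
```

For instance, classify_io_failure(MediaState.DKIO_DEV_GONE, spare_available=True) yields a REMOVED vdev with a spare activated and no fault, while a failure with the media still inserted is left entirely to FMA.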
An example of this in action:

    # lofiadm -a /disk/a
    /dev/lofi/1
    # lofiadm -a /disk/b
    /dev/lofi/2
    # lofiadm -a /disk/c
    /dev/lofi/3
    # zpool create -f test mirror /dev/lofi/1 /dev/lofi/2 spare /dev/lofi/3
    # while :; do touch /test/foo; sync; sleep 1; done &
    [1] 100662
    # zpool status
      pool: test
     state: ONLINE
     scrub: none requested
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors
    # lofiadm -d /disk/a -f
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors

This behavior is universal for all pools, and cannot be disabled. If a device doesn't support DKIOCSTATE, then it will be diagnosed as faulty through the standard FMA mechanisms. The 'REMOVED' state is not persistent, so if a machine is rebooted with a device in the REMOVED state, it will appear as FAULTED when the machine comes up.

3. DEVICE INSERTION

When a device is inserted, there are two possible outcomes of interest to ZFS:

    - If a previously known device is inserted, then we want to online the device.

    - If a new device is inserted into a physical location that previously contained a ZFS device, then we want to format the device and replace the original device.

The former is applicable to any pool, and is always enabled. The latter is potentially damaging, as it will automatically overwrite any data present on newly inserted devices. To protect against this, a new pool property (PSARC 2006/577), 'autoreplace', will be defined. This boolean property will be off by default to minimize the impact on existing systems or unknown hardware.
If unset, the current behavior remains the same, and any replacement operation must be initiated by the administrator via zpool(1M). When set, it indicates that any new device found in the same physical location as a device previously belonging to the pool will be automatically formatted and replaced. To ensure consistent behavior, ZFS must behave in the same manner when the device is replaced (via hotplug) while the system is running, as well as when the device is replaced while the system is powered off.

4. ONLINE DEVICE INSERTION

A new syseventd module will be introduced that listens for EC_DEV_ADD events of subclass ESC_DISK or ESC_LOFI. This event is triggered when the device node for the disk or lofi device is created, not necessarily when a disk is inserted. Currently, the USB framework auto-configures drives on insertion, while the SATA framework does not. Modifying the SATA framework behavior will be pursued under a separate case and is outside the scope of this case. In the meantime, these SATA events will be triggered only by an explicit 'cfgadm -c configure' by the user.

When one of these events is received, the corresponding device path is derived from the sysevent payload. For disks, this will be the device node, while for lofi it will be a particular minor node. If the device has a devid, then we first search all pools for a vdev with a matching devid. If none is found, or the device does not have a devid, then we search all pools for vdevs with the specified device path.

As part of this work, the ZFS configuration will be expanded to store the physical device path as part of the vdev label. This will also have the benefit of allowing ZFS to boot from devices which don't support devids. Currently, ZFS only identifies by devid or /dev path, neither of which may be available when mounting the root filesystem.
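The two-stage lookup (devid first, then physical device path) can be sketched as follows. This is a minimal Python model under stated assumptions, not the syseventd module itself: the Vdev record, its field names, and find_vdev are hypothetical, and the devid and path strings used in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Vdev:
    pool: str
    devid: Optional[str]  # None for devices that do not support devids
    phys_path: str        # physical device path stored in the vdev label


def find_vdev(vdevs: Iterable[Vdev], devid: Optional[str],
              phys_path: str) -> Optional[Vdev]:
    """Return the vdev matching an inserted device, or None if there is none."""
    candidates = list(vdevs)
    # Stage 1: if the inserted device has a devid, prefer an exact devid match.
    if devid is not None:
        for vdev in candidates:
            if vdev.devid == devid:
                return vdev
    # Stage 2: no devid, or no devid match; fall back to the physical path.
    for vdev in candidates:
        if vdev.phys_path == phys_path:
            return vdev
    return None


# Hypothetical pool contents for illustration only.
known = [
    Vdev("test", "id1,sd@example-a", "/pci@0/sata@1/disk@0"),
    Vdev("test", None, "/pci@0/sata@1/disk@1"),
]
```

With this data, a known disk that reappears in a different slot is still found through its devid, while an unfamiliar disk placed in a known slot is found through the path fallback, which is the case that later leads to an automatic replace.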
This simplistic mechanism will only work for devices whose device path identifies a physical location, which may not be true for FC or iSCSI devices, or for devices plumbed under MPxIO. This logic can be expanded in the future if there are protocols or drivers which do not adhere to this behavior.

If no matching vdevs are found, then the event is ignored and nothing is done. Otherwise, the device is onlined to determine if it is a known ZFS device. This online operation will automatically remove any attached spare when the resilver is complete. To continue the above example:

    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors
    # lofiadm -a /disk/a
    /dev/lofi/1
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Mon Mar 12 10:58:22 2007
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors

If the online attempt fails, then we are dealing with a new device inserted into the same physical slot. If the 'autoreplace' property is unset, then the event is ignored. If the original event was ESC_DISK and the vdev is not a whole disk, then the event is also ignored. Otherwise, the disk is labeled with an EFI label in the same manner as when the pool is initially created. If that succeeds, then the corresponding 'zpool replace' command is automatically invoked.
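The decision sequence for an insertion event can be written down compactly. The sketch below is a Python model of the ordering of the checks only: handle_insertion and its parameters are hypothetical, the online and labeling steps are passed in as callables so the flow can be exercised without hardware, and the "label-failed" outcome is an assumption, since the proposal does not say what happens when labeling fails.

```python
from typing import Callable


def handle_insertion(vdev_found: bool,
                     event_subclass: str,
                     vdev_is_whole_disk: bool,
                     autoreplace: bool,
                     try_online: Callable[[], bool],
                     label_disk: Callable[[], bool]) -> str:
    """Return the outcome of processing one EC_DEV_ADD event."""
    # No vdev matches the inserted device: nothing to do.
    if not vdev_found:
        return "ignored"
    # A successful online means this is a previously known ZFS device.
    if try_online():
        return "onlined"
    # Otherwise it is a new device in a known slot; only act if the
    # pool has opted in through the 'autoreplace' property.
    if not autoreplace:
        return "ignored"
    # Disk events are only acted on for vdevs that use the whole disk.
    if event_subclass == "ESC_DISK" and not vdev_is_whole_disk:
        return "ignored"
    # Label the disk as at pool creation, then replace the old device.
    if not label_disk():
        return "label-failed"  # assumed outcome, not specified in the case
    return "replaced"
```

Note that the labeling callable is never invoked unless 'autoreplace' is set, which reflects the point of the property: a newly inserted disk is not overwritten without an explicit opt-in.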
To continue the above example:

    # lofiadm -d /disk/a -f
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors
    # lofiadm -a /disk/d
    /dev/lofi/1
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
    config:

            NAME                     STATE     READ WRITE CKSUM
            test                     DEGRADED     0     0     0
              mirror                 DEGRADED     0     0     0
                spare                DEGRADED     0     0     0
                  replacing          DEGRADED     0     0     0
                    /dev/lofi/1/old  FAULTED      0     0     0  corrupted data
                    /dev/lofi/1      ONLINE       0     0     0
                  /dev/lofi/3        ONLINE       0     0     0
                /dev/lofi/2          ONLINE       0     0     0
            spares
              /dev/lofi/3            INUSE     currently in use

    errors: No known data errors
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors

In this case, the device was automatically replaced with the new contents.

5. OFFLINE DEVICE INSERTION

If a device is replaced while the system is powered off, then ZFS should behave in a similar manner. If devices change attachment points (i.e., are swapped) while the system is powered off, ZFS already handles this case for devices which support devids. If a device can be opened but the devid doesn't match, then ZFS will treat this as a disk insertion event. If the 'autoreplace' property is set, then ZFS will label the disk and perform the appropriate 'zpool replace' operation to resilver the device.

6. HOT SPARES

Currently, ZFS does not do any I/O to inactive hot spares, so it is incapable of detecting when a hot spare is removed from the system.
This case will modify ZFS to periodically attempt to read from all hot spares and make sure they are online and available. If a hot spare is removed, then when this I/O fails it will trigger the normal remove path. This case will also allow offline hot spares to be replaced.

With these changes, hot spares will be treated as normal devices with respect to hotplug. If an active hot spare is removed, then the hot spare will be detached and marked removed. If another hot spare is available, then it will be substituted in its place. If a hot spare is inserted, and there is a faulted device with no current hot spare, then the inserted spare will automatically be activated for that device.

7. MANPAGE DIFFS

XXX

8. REFERENCES

    PSARC 1996/285 Dynamic Attach/Detach of CPU/Memory Boards
    PSARC 2002/240 ZFS
    PSARC 2006/223 ZFS Hot Spares
    PSARC 2006/577 zpool property to disable delegation
    PSARC 2006/709 lofi devfs events

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss