Folks -

I'm preparing to submit the attached PSARC case to provide better
support for device removal and insertion within ZFS.  Since this is a
rather complex issue, with a fair share of corner issues, I thought I'd
send the proposal out to the ZFS community at large for further comment
before submitting it.

The prototype is functional except for the offline device insertion and
hot spares functionality.  I hope to have this integrated within the
next month, along with the next phase of FMA integration.  Please
respond with any comments, concerns, or suggestions.

Thanks,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
1. INTRODUCTION

Currently, ZFS supports what is affectionately known as "poor man's
hotplug".  If a device is removed from the system, then it is assumed
that upon I/O failure, an attempt to reopen the same device will fail.
This will trigger an FMA fault, substituting a hot spare if available.
This is undesirable for two reasons:

- There is no distinction between device removal and arbitrary failure.
  If a device is removed from the system, it should be treated as a
  deliberate action different from normal failure.

- There is no support for automatic response to device insertion.  For a
  server configured with a ZFS pool, the administrator should be able to
  walk up, remove any drive (preferably a faulted one), insert a new
  drive, and not have to issue any ZFS commands to reconfigure the pool.
  This is particularly true for the appliance space, where hardware
  reconfiguration should "just work".

This case enhances ZFS to respond to device removal and provides a
mechanism to automatically deal with device insertion.  While the
framework is generic, the primary target is devices supported by
the SATA framework.  The only device-specific portion of this proposal
concerns determining whether a device is in the same "physical location"
as a previously known device, which involves correlating a transport's
enumeration of the device with the device's physical location within the
chassis.


2. DEVICE REMOVAL

There are two types of device removal within Solaris.  Coordinated
device removal involves stopping all consumers of the device, using the
appropriate cfgadm(1M) command (PSARC 1996/285), and then physically
removing the device.  Uncoordinated removal (also known as "surprise
removal") is when a device is physically removed while still in active
use by the system.  The latter is increasingly common as more I/O
protocols support hotplug and higher-level software (such as ZFS)
becomes more capable.

There are several ways to detect device removal within Solaris.  Fibre
channel drivers generate the NDI events FCAL_INSERT_EVENT and
FCAL_REMOVE_EVENT.  USB and 1394 drivers generate the NDI events
DDI_DEVI_INSERT_EVENT and DDI_DEVI_REMOVE_EVENT.  In addition to these
event channels, there is also the DKIOCSTATE ioctl() which returns (on
capable drivers) DKIO_DEV_GONE if the device has been removed.

Of these, the ioctl() is the most widely supported, and is the mechanism
used as part of this case.  Since this is an implementation detail of
the current architecture, it does not preclude using alternate
mechanisms in the future.  When an I/O to a disk fails, ZFS will query
the media state via the DKIOCSTATE ioctl.  If the device is in any state
other than DKIO_INSERTED, ZFS will transition the device to a new
REMOVED state.  No FMA fault will be triggered, and a hot spare will be
substituted if one is available.  Note that DKIO_DEV_GONE can be
returned for a variety of reasons (pulled cables, an external chassis
being powered off, etc.).  In the absence of additional FMA information,
it is assumed that the removal was intentional administrative action.
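
To make the decision concrete, here is a minimal sketch of the removal
logic described above.  This is illustrative Python with hypothetical
names (on_io_failure, query_media_state); the real logic lives in the
ZFS kernel module and the FMA diagnosis engine.

```python
# Sketch of the removal-detection decision described above.  The DKIO_*
# values stand in for the dkio media-state constants; the returned
# strings stand in for the vdev state transitions.

DKIO_INSERTED = "DKIO_INSERTED"
DKIO_DEV_GONE = "DKIO_DEV_GONE"

def on_io_failure(vdev, query_media_state):
    """Called when an I/O to a leaf vdev fails."""
    try:
        state = query_media_state(vdev)      # the DKIOCSTATE ioctl
    except NotImplementedError:
        # Device doesn't support DKIOCSTATE: fall back to the standard
        # FMA mechanisms, which may diagnose the device as faulty.
        return "FAULTED"
    if state != DKIO_INSERTED:
        # Any non-inserted state is treated as deliberate removal:
        # no FMA fault, but a hot spare is still substituted.
        return "REMOVED"
    # Device is still present; leave the diagnosis to FMA.
    return "FAULTED"
```

This behavior cannot be disabled, so the only branch point is whether
the driver supports the ioctl at all.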

As part of this work, lofiadm(1M) will be expanded to include a new
force (-f) flag when removing devices.  Combined with the upcoming lofi
devfs events (PSARC 2006/709), this will provide a much simpler testing
framework without the need for physical hardware interaction.  When this
flag is used, the underlying file will be closed, any further I/O or
attempts to open the device will fail, and DKIOCSTATE will return
DKIO_DEV_GONE.  This flag will remain private for testing only, and will
not be documented.

An example of this in action:

# lofiadm -a /disk/a
/dev/lofi/1
# lofiadm -a /disk/b
/dev/lofi/2
# lofiadm -a /disk/c
/dev/lofi/3
# zpool create -f test mirror /dev/lofi/1 /dev/lofi/2 spare /dev/lofi/3
# while :; do touch /test/foo; sync; sleep 1; done &
[1] 100662
# zpool status
  pool: test
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            /dev/lofi/1  ONLINE       0     0     0
            /dev/lofi/2  ONLINE       0     0     0
        spares
          /dev/lofi/3    AVAIL

errors: No known data errors
# lofiadm -d /disk/a -f
# zpool status
  pool: test
 state: DEGRADED
 scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
config:

        NAME               STATE     READ WRITE CKSUM
        test               DEGRADED     0     0     0
          mirror           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              /dev/lofi/1  REMOVED      0     0     0
              /dev/lofi/3  ONLINE       0     0     0
            /dev/lofi/2    ONLINE       0     0     0
        spares
          /dev/lofi/3      INUSE     currently in use

errors: No known data errors

This behavior is universal for all pools, and cannot be disabled. If a
device doesn't support DKIOCSTATE, then it will be diagnosed as faulty
through the standard FMA mechanisms.

The 'REMOVED' state is not persistent, so if a machine is rebooted with
a device in the REMOVED state, it will appear as FAULTED when the
machine comes up.


3. DEVICE INSERTION

When a device is inserted, there are two possible outcomes of interest
to ZFS:

- If a previously known device is inserted, then we want to online the
  device.

- If a new device is inserted into a physical location that previously
  contained a ZFS device, then we want to format the device and replace
  the original device.

The former is applicable to any pool, and is always enabled.  The latter
is potentially damaging, as it will automatically overwrite any data
present on newly inserted devices.  To protect against this, a new pool
property (PSARC 2006/577), 'autoreplace', will be defined.  This boolean
property will be off by default to minimize the impact on existing
systems or unknown hardware.  If unset, the current behavior remains the
same, and any replacement operation must be initiated by the
administrator via zpool(1M).  When set, it indicates that any new device
found in the same physical location as a device previously belonging to
the pool will be automatically formatted and replaced.

To ensure consistent behavior, ZFS must behave in the same manner when
the device is replaced (via hotplug) while the system is running, as
well as when the device is replaced while the system is powered off.


4. ONLINE DEVICE INSERTION

A new syseventd module will be introduced that listens for EC_DEV_ADD
events of subclass ESC_DISK or ESC_LOFI.  This event is triggered when
the device node for the disk or lofi device is created, not necessarily
when a disk is inserted.  Currently, the USB framework auto-configures
drives on insertion, while the SATA framework does not.  Modifying the
SATA framework's behavior will be pursued under a separate case and is
outside the scope of this one.  In the meantime, these SATA events will
be triggered only by an explicit 'cfgadm -c configure' by the user.
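
The module's event filtering can be sketched as follows.  This is
illustrative Python; the constant values and the dict-based event are
stand-ins for the real sysevent class/subclass constants and the
libsysevent handle, and handle_sysevent is a hypothetical name.

```python
# Sketch of the syseventd module's event filter: accept only EC_DEV_ADD
# events of subclass ESC_DISK or ESC_LOFI, and extract the device path
# from the payload.

EC_DEV_ADD = "EC_DEV_ADD"      # symbolic stand-ins for the real constants
ESC_DISK = "ESC_DISK"
ESC_LOFI = "ESC_LOFI"

def handle_sysevent(event):
    """Return the device path of interest, or None to ignore the event."""
    if event["class"] != EC_DEV_ADD:
        return None
    if event["subclass"] not in (ESC_DISK, ESC_LOFI):
        return None
    # For disks this names the device node; for lofi, a particular
    # minor node.
    return event["payload"].get("phys_path")
```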

When one of these events is received, the corresponding device path is
derived from the sysevent payload.  For disks, this will be the device
node, while for lofi it will be a particular minor node.  If the device
has a devid, then we first search all pools for a vdev with a matching
devid.  If none is found, or the device does not have a devid, then we
search all pools for vdevs with the specified device path.  As part of
this work, the ZFS configuration will be expanded to store the physical
device path as part of the vdev label.  This will also have the benefit
of allowing ZFS to boot from devices which don't support devids.
Currently, ZFS only identifies devices by devid or /dev path, neither of
which may be available when mounting the root filesystem.  This simple
mechanism only works for devices whose device path identifies a physical
location, which may not be true for FC or iSCSI devices, or for devices
plumbed under MPxIO.  This logic can be expanded in the future for
protocols or drivers which do not adhere to this behavior.
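
The search order described above (devid first, then the stored physical
device path) can be sketched as follows.  This is illustrative Python
with hypothetical names (find_vdev and the attribute names); the real
search walks the in-kernel pool configurations.

```python
# Sketch of the vdev lookup performed when a device-add event arrives:
# match by devid if the device has one, otherwise (or on no match) fall
# back to the physical device path stored in the vdev label.

def find_vdev(pools, devid, phys_path):
    """pools: iterable of (pool_name, vdev_list); each vdev carries the
    .devid and .phys_path recorded in its label."""
    if devid is not None:
        for pool, vdevs in pools:
            for v in vdevs:
                if v.devid == devid:
                    return pool, v
    for pool, vdevs in pools:
        for v in vdevs:
            if v.phys_path == phys_path:
                return pool, v
    return None                 # no match: the event is ignored
```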

If no matching vdevs are found, then the event is ignored.  Otherwise,
the device is onlined to determine whether it is a known
ZFS device.  This online operation will automatically remove any
attached spare when the resilver is complete.  To continue the above
example:

# zpool status
  pool: test
 state: DEGRADED
 scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
config:

        NAME               STATE     READ WRITE CKSUM
        test               DEGRADED     0     0     0
          mirror           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              /dev/lofi/1  REMOVED      0     0     0
              /dev/lofi/3  ONLINE       0     0     0
            /dev/lofi/2    ONLINE       0     0     0
        spares
          /dev/lofi/3      INUSE     currently in use

errors: No known data errors
# lofiadm -a /disk/a
/dev/lofi/1
# zpool status
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Mar 12 10:58:22 2007
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            /dev/lofi/1  ONLINE       0     0     0
            /dev/lofi/2  ONLINE       0     0     0
        spares
          /dev/lofi/3    AVAIL

errors: No known data errors

If the online attempt fails, then we are dealing with a new device
inserted into the same physical slot.  If the 'autoreplace' property is
unset, then the event is ignored.  If the original event was ESC_DISK
and the vdev is not a whole disk, then the event is also ignored.
Otherwise, the disk is labeled with an EFI label in the same manner as
when the pool is initially created.  If that succeeds, then the
corresponding 'zpool replace' command is automatically invoked.  To
continue the above example:

# lofiadm -d /disk/a -f
# zpool status
  pool: test
 state: DEGRADED
 scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
config:

        NAME               STATE     READ WRITE CKSUM
        test               DEGRADED     0     0     0
          mirror           DEGRADED     0     0     0
            spare          DEGRADED     0     0     0
              /dev/lofi/1  REMOVED      0     0     0
              /dev/lofi/3  ONLINE       0     0     0
            /dev/lofi/2    ONLINE       0     0     0
        spares
          /dev/lofi/3      INUSE     currently in use

errors: No known data errors
# lofiadm -a /disk/d
/dev/lofi/1
# zpool status
  pool: test
 state: DEGRADED
 scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
config:

        NAME                     STATE     READ WRITE CKSUM
        test                     DEGRADED     0     0     0
          mirror                 DEGRADED     0     0     0
            spare                DEGRADED     0     0     0
              replacing          DEGRADED     0     0     0
                /dev/lofi/1/old  FAULTED      0     0     0  corrupted data
                /dev/lofi/1      ONLINE       0     0     0
              /dev/lofi/3        ONLINE       0     0     0
            /dev/lofi/2          ONLINE       0     0     0
        spares
          /dev/lofi/3            INUSE     currently in use

errors: No known data errors
# zpool status
  pool: test
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
config:

        NAME             STATE     READ WRITE CKSUM
        test             ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            /dev/lofi/1  ONLINE       0     0     0
            /dev/lofi/2  ONLINE       0     0     0
        spares
          /dev/lofi/3    AVAIL

errors: No known data errors

In this case, the device was automatically replaced, and the new device
was resilvered with the pool's contents.
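
The decision path exercised in this example can be sketched as follows.
This is illustrative Python with hypothetical names; the real work is
done by the syseventd module invoking the equivalents of the labeling
code and 'zpool replace'.

```python
# Sketch of the autoreplace decision when the online attempt fails:
# honor the 'autoreplace' property, require a whole-disk vdev for
# ESC_DISK events, then write an EFI label and kick off the replace.

def handle_failed_online(pool, vdev, subclass, label_disk, zpool_replace):
    if not pool.autoreplace:
        return "ignored"          # property unset: admin must intervene
    if subclass == "ESC_DISK" and not vdev.whole_disk:
        return "ignored"          # only whole disks are auto-labeled
    if not label_disk(vdev):      # EFI label, as at pool creation
        return "label-failed"
    zpool_replace(pool, vdev)     # starts the resilver
    return "replacing"
```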


5. OFFLINE DEVICE INSERTION

If a device is replaced while the system is powered off, then ZFS should
behave in a similar manner.  If devices change attachment points (i.e.
are swapped) while the system is powered off, ZFS already handles this
case
for devices which support devids.  If a device can be opened but the
devid doesn't match, then ZFS will treat this as a disk insertion event.
If the 'autoreplace' property is set, then ZFS will label the disk and
perform the appropriate 'zpool replace' operation to resilver the
device.
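
The offline-replacement check described above can be sketched as
follows.  This is illustrative Python with hypothetical names
(check_offline_replacement and the replace callback); the real check
happens when the pool opens its devices at import or boot.

```python
# Sketch of the offline insertion check: the device at this attachment
# point opens successfully, but its devid no longer matches the one
# recorded in the vdev label.

def check_offline_replacement(vdev, current_devid, autoreplace, replace):
    if current_devid == vdev.devid:
        return "same-device"     # normal open; nothing to do
    # Devid mismatch: a new disk occupies this physical location.
    if not autoreplace:
        return "needs-admin"     # admin must run 'zpool replace'
    replace(vdev)                # label the disk and resilver
    return "replacing"
```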


6. HOT SPARES

Currently, ZFS does not do any I/O to inactive hot spares, so it is
incapable of detecting when a hot spare is removed from the system.
This case will modify ZFS to periodically attempt to read from all hot
spares and make sure they are online and available.  If a hot spare is
removed, then when this I/O fails it will trigger the normal remove
path.  This case will also allow offline hot spares to be replaced.
With these changes, hot spares will be treated as normal devices with
respect to hotplug.

If an active hot spare is removed, then the hot spare will be detached
and marked removed.  If another hot spare is available, then it will be
substituted in its place.  If a hot spare is inserted while there is a
faulted device with no spare currently attached, then the newly inserted
spare will automatically be substituted for the faulted device.
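
The spare handling described in this section can be sketched as
follows.  This is illustrative Python with hypothetical names; the
periodic read and the substitution are performed by ZFS internally.

```python
# Sketch of hot-spare hotplug handling: periodically read from each
# inactive spare to detect removal, and substitute another spare when
# an active one is removed.

def check_spares(spares, try_read):
    """Periodic health check; try_read returns False on I/O failure."""
    for spare in spares:
        if not spare.active and spare.state == "AVAIL" and not try_read(spare):
            spare.state = "REMOVED"   # drives the normal removal path

def on_active_spare_removed(spares, removed, substitute):
    """Detach a removed active spare; substitute another if available."""
    removed.active = False
    removed.state = "REMOVED"
    for other in spares:
        if not other.active and other.state == "AVAIL":
            other.active = True
            substitute(other)
            return other
    return None
```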


7. MANPAGE DIFFS

XXX


8. REFERENCES

PSARC 1996/285 Dynamic Attach/Detach of CPU/Memory Boards
PSARC 2002/240 ZFS
PSARC 2006/223 ZFS Hot Spares
PSARC 2006/577 zpool property to disable delegation
PSARC 2006/709 lofi devfs events
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
