I think a better question would be: what kind of tests would be most
promising for turning some subclass of these lost pools reported on
the mailing list into an actionable bug?
My first bet would be to write tools that test for ignored sync cache
commands leading to lost writes, and to apply them to the case where the
iSCSI targets are rebooted but the initiator isn't.
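Very roughly, I'm imagining something along these lines (an untested
sketch; the device path, block size and log file name are placeholders I
made up): write tagged blocks to the raw device, fsync() after each one
so a SYNCHRONIZE CACHE should go out to the target, log a block only
after its flush returned, reboot the target, then verify. Any logged
block that comes back without its tag means an acknowledged flush was
ignored and the write was lost.

#!/usr/bin/env python3
# Untested sketch of a sync-cache test.  DEV, BLOCK and LOG are placeholders.
# Phase 1 writes tagged blocks and flushes after each one, logging a block
# only once its flush has returned.  Reboot the iSCSI target (leave the
# initiator up), then re-run with --verify.
import os, struct, sys, time

DEV = "/dev/dsk/c2t0d0s0"        # placeholder: raw iSCSI-backed device
BLOCK = 512
LOG = "synctest.log"             # keep this on local storage, not on DEV

def write_phase(count=1000):
    fd = os.open(DEV, os.O_WRONLY)
    with open(LOG, "w") as log:
        for seq in range(count):
            buf = struct.pack("<QQ", seq, int(time.time())).ljust(BLOCK, b"\0")
            os.pwrite(fd, buf, seq * BLOCK)
            os.fsync(fd)                 # should push SYNCHRONIZE CACHE to the target
            log.write("%d\n" % seq)      # recorded only after the flush returned
            log.flush()
    os.close(fd)

def verify_phase():
    fd = os.open(DEV, os.O_RDONLY)
    lost = 0
    for line in open(LOG):
        seq = int(line)
        if struct.unpack("<Q", os.pread(fd, BLOCK, seq * BLOCK)[:8])[0] != seq:
            print("acknowledged write lost at block %d" % seq)
            lost += 1
    os.close(fd)
    print("%d acknowledged writes lost" % lost)

if __name__ == "__main__":
    verify_phase() if "--verify" in sys.argv else write_phase()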
I think that in the process of writing such a tool you'll immediately
bump into a defect, because you'll realize there is no iSCSI equivalent
of an NFS 'hard' mount. And there cannot be a strict equivalent of
'hard' mounts for iSCSI, because we want zpool redundancy to preserve
availability when an iSCSI target goes away. I think the whole model is
wrong somehow.
I'd certainly hope that a ZFS pool with redundancy built on iSCSI targets
could survive the loss of some targets, whether due to actual failures or
to necessary upgrades of the iSCSI targets (think OS upgrades and reboots
on the systems that are offering iSCSI devices to the network).
My suggestion is to use multi-way redundancy with iSCSI (e.g. 3-way
mirrors or RAIDZ2), so that you can safely offline one of the iSCSI
targets while still leaving the pool with some redundancy. Sure, there
is increased risk while that device is offline, but the window for a
failure of the second level of redundancy is small, and even then
nothing is lost until a third device faults. Failure handling should
also distinguish between complete failure (e.g. the device no longer
responds to commands at all) and intermittent failure (e.g. a "sticky"
patch of sectors, or a drive that stops responding for a minute because
it has a non-changeable TLER value that can cause trouble in a RAID
configuration). Drives cover a gradation from failed to flaky to
flawless; if the software using them recognizes this, better decisions
can be made when an error is encountered than with the simplistic
good/failed model that RAID has used for years.
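As a small maintenance aid (sketch only; the pool name "tank" and device
name "c3t6d0" are made up), a wrapper could refuse to offline an
iSCSI-backed vdev unless the pool currently reports healthy, so a target
reboot never takes away the last good copy:

#!/usr/bin/env python3
# Sketch: only offline an iSCSI-backed vdev for maintenance if the pool
# is currently healthy.  "tank" and "c3t6d0" are placeholder names.
import subprocess, sys

POOL, DEV = "tank", "c3t6d0"

def pool_is_healthy(pool):
    # "zpool status -x" prints "... is healthy" only when no vdev is
    # degraded, faulted or resilvering.
    out = subprocess.check_output(["zpool", "status", "-x", pool],
                                  universal_newlines=True)
    return "is healthy" in out

if not pool_is_healthy(POOL):
    sys.exit("%s is not fully redundant right now; not offlining %s" % (POOL, DEV))

# Temporary offline for the target's OS upgrade/reboot; "zpool online"
# brings the device back and resilvers whatever it missed.
subprocess.check_call(["zpool", "offline", "-t", POOL, DEV])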
My preference for storage behavior is that it should never cause a system
panic. Graceful error recovery techniques are important. File system
error messages should be passed up the line when possible, so the user
can figure out that something is amiss with some files (even if not all)
even when the sysadmin is not around or email notification of problems is
not working. If it is possible to return a CRC error to a network share
client, that would seem to be a close match for an uncorrectable checksum
failure. (Windows throws these errors when it cannot read a CD/DVD.)
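For what it's worth, as far as I know an uncorrectable block already
shows up in userland as an I/O error (EIO) on the read rather than as
silently bad data, and a CIFS/NFS client sees the server's mapping of
that same error. A trivial check (sketch only) would look like this:

#!/usr/bin/env python3
# Sketch: read a file and report whether it fails with EIO, which is
# roughly what an uncorrectable block looks like from an application.
import errno, sys

path = sys.argv[1]
try:
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    print("%s: read cleanly" % path)
except OSError as e:
    if e.errno == errno.EIO:
        print("%s: I/O error, probably an uncorrectable block" % path)
    else:
        raise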
A good damage mitigation feature would be some mechanism that allows a
user to ignore the checksum failure, since for much user data partial
recovery is preferable to no recovery. To ensure that damaged files are
not accidentally confused with good ones, ignoring checksum failures
might only be allowed through a special "recovery filesystem" that lists
only the damaged files the authenticated user has access to.
From the network client's perspective, this would be another shared
folder/subfolder that is only present when uncorrectable, damaged files
have been found. ZFS would set up the appropriate links to replicate
the directory structure of the original as needed to include the damaged
file.
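To make the idea concrete (this is only a sketch of the proposal, not an
existing ZFS feature, and the recovery root is a placeholder path),
something would have to walk the list of permanently damaged files that
"zpool status -v" reports and mirror their directory structure under the
recovery share, e.g.:

#!/usr/bin/env python3
# Illustration of the proposed "recovery" share.  It takes the paths that
# "zpool status -v" lists under "Permanent errors have been detected in
# the following files:" and mirrors their directory structure under a
# separate tree of symlinks, so damaged files stay segregated from good
# ones.  POOL and RECOVERY_ROOT are placeholders.
import os, subprocess

POOL = "tank"
RECOVERY_ROOT = "/tank/.recovery"

def damaged_files(pool):
    out = subprocess.check_output(["zpool", "status", "-v", pool],
                                  universal_newlines=True)
    files, in_list = [], False
    for line in out.splitlines():
        if "following files:" in line:
            in_list = True
        elif in_list and line.strip().startswith("/"):
            files.append(line.strip())
    return files

for path in damaged_files(POOL):
    link = os.path.join(RECOVERY_ROOT, path.lstrip("/"))
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if not os.path.lexists(link):
        os.symlink(path, link)      # expose the damaged file under the recovery tree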