I think a better question would be: what kind of tests would be most
promising for turning some subclass of these lost pools reported on
the mailing list into an actionable bug?

My first bet would be writing tools that test for ignored cache-flush
(SYNCHRONIZE CACHE) commands leading to lost writes, and applying them
to the case where iSCSI targets are rebooted but the initiator isn't.
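
Something along these lines (a minimal, untested sketch; the paths are
hypothetical, and orchestrating the actual target reboot is left out) is
the kind of tool I have in mind.  Phase one writes records and fsync()s
each one, logging acknowledged sequence numbers to local storage that is
not behind the iSCSI target; after the target has been rebooted, phase
two checks that every acknowledged record is still there:

#!/usr/bin/env python3
"""Sketch of a lost-write detector for a file on an iSCSI-backed pool.
Phase 1 writes records and fsync()s each one, logging the highest
acknowledged sequence number to local stable storage.  After the iSCSI
target is rebooted (but not the initiator), phase 2 verifies that every
acknowledged record survived; a shortfall suggests a cache flush was
acknowledged but not honored."""

import os
import struct
import sys

TEST_FILE = "/tank/cacheflush-test.dat"      # file on the iSCSI-backed pool (placeholder)
ACK_LOG   = "/var/tmp/cacheflush-acked.log"  # local disk, NOT on the pool under test
RECORD    = struct.Struct("<Q")              # 8-byte little-endian sequence number

def write_phase(count: int) -> None:
    with open(TEST_FILE, "wb", buffering=0) as data, open(ACK_LOG, "w") as log:
        for seq in range(count):
            data.write(RECORD.pack(seq))
            os.fsync(data.fileno())          # should push a cache flush down to the target
            log.write(f"{seq}\n")            # only log seq numbers whose fsync returned
            log.flush()
            os.fsync(log.fileno())

def verify_phase() -> None:
    with open(ACK_LOG) as log:
        acked = int(log.readlines()[-1])     # last sequence number fsync acknowledged
    with open(TEST_FILE, "rb") as data:
        stored = len(data.read()) // RECORD.size
    if stored <= acked:
        print(f"LOST WRITES: fsync acknowledged seq {acked}, "
              f"but only {stored} records survived the target reboot")
        sys.exit(1)
    print("no acknowledged writes were lost")

if __name__ == "__main__":
    if sys.argv[1:] == ["write"]:
        write_phase(100000)
    else:
        verify_phase()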

I think in the process of writing the tool you'll immediately bump
into a defect, because you'll realize there is no equivalent of a
'hard' iSCSI mount like there is in NFS.  And there cannot be a strict
equivalent of 'hard' mounts in iSCSI, because we want zpool redundancy
to preserve availability when an iSCSI target goes away.  I think the
whole model is wrong somehow.
I'd surely hope that a ZFS pool with redundancy built on iSCSI targets could survive the loss of some targets, whether due to actual failures or to necessary upgrades of the iSCSI targets (think OS upgrades and reboots on the systems that are offering iSCSI devices to the network).

My suggestion is to use multi-way redundancy with iSCSI (e.g. 3-way mirrors or RAIDZ2), so that you can safely offline one of the iSCSI targets while still leaving the pool with some redundancy; a rough sketch of that offline/online cycle follows below. Sure, there is an increased risk while that device is offline, but the window of opportunity for a failure of the second level of redundancy is small; and even then nothing is lost until a third device has a fault.

Failure handling should also distinguish between complete failure (e.g. the device no longer responds to commands whatsoever) and intermittent failure (e.g. a "sticky" patch of sectors, or a drive that stops responding for a minute because it has a non-changeable TLER value that may otherwise cause trouble in a RAID configuration). Drives run a gradation from complete failure through flaky to flawless; if the software running above them recognizes this, better decisions can be made about what to do when an error is encountered than with the simplistic good/failed model that RAIDs have used for years.
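
As a sketch of the maintenance procedure (a Python wrapper around the
standard zpool offline/online/status commands; the pool and device
names are placeholders, so check your platform's man pages before
relying on the exact status strings):

#!/usr/bin/env python3
"""Rough sketch of a maintenance helper: take one iSCSI-backed member
of a redundant vdev offline while its target host is upgraded and
rebooted, then bring it back and wait for the resilver to finish.
Pool and device names are placeholders."""

import subprocess
import time

POOL   = "tank"                      # placeholder pool name
DEVICE = "c3t600A0B80001234d0"       # placeholder iSCSI LUN device name

def zpool(*args: str) -> str:
    return subprocess.run(["zpool", *args], check=True,
                          capture_output=True, text=True).stdout

def main() -> None:
    # Take the device offline so the target can be rebooted without
    # the pool treating it as a surprise failure.
    zpool("offline", POOL, DEVICE)
    input(f"{DEVICE} is offline; upgrade/reboot the target, then press Enter... ")

    # Bring it back and let ZFS resilver whatever changed while it was away.
    zpool("online", POOL, DEVICE)
    while "resilver in progress" in zpool("status", POOL):
        time.sleep(30)
    print("resilver complete; pool redundancy restored:")
    print(zpool("status", POOL))

if __name__ == "__main__":
    main()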

My preference for storage behavior is that it should never cause a system panic. Graceful error recovery techniques are important. File system error messages should be passed up the line when possible, so the user can figure out that something is amiss with some files (even if not all), even when the sysadmin is not around or email notification of problems is not working. If it is possible to return a CRC error to a network share client, that would seem to be a close match for an uncorrectable checksum failure. (Windows throws these errors when it cannot read a CD/DVD.)

A good damage-mitigation feature would be some mechanism that allows a user to ignore a checksum failure, since in many user-data cases partial recovery is preferable to no recovery. To ensure that damaged files are not accidentally confused with good files, ignoring the checksum failures might only be allowed through a special "recovery filesystem" that lists only the damaged files the authenticated user has access to. From the network client's perspective, this would be another shared folder/subfolder that is only present when uncorrectable, damaged files have been found. ZFS would set up the appropriate links to replicate the directory structure of the original as needed to include the damaged files.
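
To illustrate just the directory-replication part from the outside (the
real feature would have to live inside ZFS and honor per-user access
control), something like the following could mirror the files that
`zpool status -v` reports as permanently damaged into a separate
recovery tree; the pool name and recovery path are placeholders:

#!/usr/bin/env python3
"""Back-of-the-envelope sketch of the 'recovery filesystem' idea:
collect the files that `zpool status -v` lists under its permanent
errors section and replicate their directory structure as symlinks
under a separate recovery tree, so damaged files can be salvaged
without being confused with good ones."""

import os
import subprocess

POOL          = "tank"           # placeholder pool name
RECOVERY_ROOT = "/recovery"      # placeholder location for the recovery tree

def damaged_files(pool: str) -> list[str]:
    out = subprocess.run(["zpool", "status", "-v", pool],
                         capture_output=True, text=True, check=True).stdout
    paths, collecting = [], False
    for line in out.splitlines():
        if "Permanent errors have been detected" in line:
            collecting = True
            continue
        if collecting:
            candidate = line.strip()
            if candidate.startswith("/"):      # skip object IDs that have no path
                paths.append(candidate)
    return paths

def build_recovery_tree(paths: list[str]) -> None:
    for path in paths:
        link = os.path.join(RECOVERY_ROOT, path.lstrip("/"))
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.islink(link):
            os.symlink(path, link)             # replicate the original layout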
