>>>>> "gm" == Gary Mills <mi...@cc.umanitoba.ca> writes:
gm> Is there any more that I've missed?

 1. The filesystem/RAID layer dispatches writes 'aaaaaaaa' to the iSCSI
    initiator.  The initiator accepts them, buffers them, and returns
    success to the RAID layer.

 2. The iSCSI initiator sends them to the iSCSI target.  The target
    writes 'aaaaaaaa'.

 3. Network connectivity is interrupted, the target is rebooted,
    something like that.

 4. The filesystem/RAID layer dispatches writes 'bbbbbbbb' to the iSCSI
    initiator.  The initiator accepts, buffers, returns success.

 5. The iSCSI initiator can't write 'bbbbbbbb'.

 6. The iSCSI initiator goes through some cargo-cult error-recovery
    scheme: retry this 3 times, timeout, disconnect, reconnect, retry
    really-hard 5 times, timeout, return various errors to the RAID
    layer, maybe.

 7. OH!  The target's back!  Good.

 8. The filesystem/RAID layer writes 'cccccccc' to the iSCSI initiator.
    Maybe it gets an error.  Maybe it flags the 'cccccccc' destination
    blocks bad, increments RAID-layer counters, tries to ``rewrite''
    the 'cccccccc', and eventually gets success back from the
    initiator.

 9. The filesystem/RAID layer issues SYNCHRONIZE CACHE to the iSCSI
    initiator.

10. The initiator flushes 'cccccccc' to the target, and waits for the
    target to confirm that 'cccccccc' and all previous writes are on
    physical media.

11. The initiator returns success for the SYNCHRONIZE CACHE command.

12. The filesystem/RAID layer writes the 'd' commit sector, updating
    pointers, aiming various important things at 'bbbbbbbb'.

Now the RAID layer thinks 'aaaaaaaa', 'bbbbbbbb', 'cccccccc', and 'd'
are all written, but in fact only 'aaaaaaaa', 'cccccccc', and 'd' are
written, and 'd' points at garbage.

NFS has a state machine designed to handle server reboots without
breaking any consistency promises.  Substitute ``the userland app'' for
filesystem/RAID, and ``NFSv3 client'' for iSCSI initiator.  The NFSv3
client keeps track of which writes are actually committed to disk and
batches them into commit blocks of which the userland app is entirely
unaware.  The NFS client won't free a commit block from its RAM write
cache until it's on disk.  If the server reboots, it replays the open
commit blocks.  If the server AND the client reboot, the commit block
is lost from RAM, but then 'd' is not written either, so the datastore
is not corrupt.

The iSCSI initiator probably needs to do something similar to NFSv3 to
enforce that success from SYNCHRONIZE CACHE really means what ZFS
thinks it means.

It's a little trickier to do this with ZFS/iSCSI, because the NFS
cop-out was to use 'hard' mounts---you _never_ propagate write failures
up the stack.  You just freeze the application until you can finally
complete the write, and if you can't write, you evade the consistency
guarantees by killing the app.  Then it's a solvable problem to design
apps that won't corrupt their datastores when they're killed, so the
overall system works.  This world order won't work analogously for
ZFS-on-iSCSI, which needs to see failures to handle redundancy.  We may
even need some new kind of failure code to solve the problem, but maybe
something clever can be crammed into the old API.
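Something like this toy sketch is the kind of bookkeeping I have in
mind for the initiator side.  It's plain Python, not anything
resembling the real sd/iscsi code, and the Target/Initiator classes
are invented purely for illustration; the only point is that the
initiator never forgets a write it has already acked until a flush to
stable storage has actually succeeded, and replays the open batch
across a session drop:

class Target:
    """Stand-in for the iSCSI target: writes land on stable storage,
    but the session can drop out from under the initiator."""
    def __init__(self):
        self.media = {}              # lba -> data on physical media
        self.up = True

    def write(self, lba, data):
        if not self.up:
            raise ConnectionError("session dropped")
        self.media[lba] = data

    def synchronize_cache(self):
        if not self.up:
            raise ConnectionError("session dropped")


class Initiator:
    """Buffers writes and acks them early (as today), but never forgets
    a write until a SYNCHRONIZE CACHE covering it has succeeded."""
    def __init__(self, target):
        self.target = target
        self.open_batch = {}         # writes not yet known-stable

    def write(self, lba, data):
        self.open_batch[lba] = data  # remember it ourselves first
        try:
            self.target.write(lba, data)
        except ConnectionError:
            pass                     # still held in open_batch
        return "success"             # early ack, as today

    def synchronize_cache(self):
        try:
            # Replay everything not yet known-stable, then flush.
            for lba, data in self.open_batch.items():
                self.target.write(lba, data)
            self.target.synchronize_cache()
        except ConnectionError:
            return "failure"         # batch kept for a later retry
        self.open_batch.clear()      # only now is it safe to forget
        return "success"


# The scenario from the numbered list above:
t = Target()
i = Initiator(t)
i.write(0, "aaaaaaaa")          # steps 1-2
t.up = False                    # step 3: connectivity interrupted
i.write(1, "bbbbbbbb")          # step 4: still acked, but held in RAM
t.up = True                     # step 7: target is back
i.write(2, "cccccccc")          # step 8
assert i.synchronize_cache() == "success"   # steps 9-11, after replay
assert t.media[1] == "bbbbbbbb"  # 'bbbbbbbb' made it, so a 'd' pointing
                                 # at it would be safe

The replay inside synchronize_cache() is the analogue of NFSv3's
commit-block replay; where that bookkeeping should live, and what error
code it should speak when it finally gives up, is the open question.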
Imagine the stream of writes to a disk as a bucket-brigade separated by
SYNCHRONIZE CACHE commands.  The writes within each bucket can be
sloshed around (reordered) arbitrarily.  And if the machine crashes, we
might pour _part_ of the water in the last bucket on the fire, but then
we stop and drop all the other buckets.  So far, we can handle it.  But
we've no way to handle the situation where someone in the _middle_ of
the brigade spills the water in his bucket.  There's no way to cleanly
restart the brigade after this happens.

ZFS needs to gracefully handle a SYNCHRONIZE CACHE command that returns
_failure_, and needs to interpret such a failure really aggressively,
as in:

  Any writes you issued since the last SYNCHRONIZE CACHE, *even if you
  got a Success return to your block-layer write() command*, may or may
  not be committed to disk, and waiting will NOT change the
  situation---they're just gone.  But the disk is still here, and is
  working, meh, ~fine.  This failure is not ``retryable''.  If you
  issue a second SYNCHRONIZE CACHE command and it Succeeds, that does
  NOT change what I've just told you.  That Success only refers to
  writes issued between this failing SYNCHRONIZE CACHE command and the
  next one.

Once the iSCSI initiator is fixed, we probably need to go back and add
NFS-style commit batches even to SATA disk drivers, which can suffer
the same problem if you hot-swap them, or maybe even if you don't
hot-swap but the disk reports some error which invokes some convoluted
sd/ssd exception handling involving ``resets''.  The assumption that
write, write, write, SYNCHRONIZE CACHE guarantees all those writes are
on-disk once the SYNCHRONIZE CACHE returns simply doesn't hold.  The
only way to make it hold would be to promise to panic the kernel
whenever any disk, controller, bus, or iSCSI session is ``reset''---the
simple, obvious ``SYNCHRONIZE CACHE is the final word of God''
assumption ought to handle cord-yanking just fine, but not smaller
failures.
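And here is the consumer side of the same rule, as a toy model.  Again
it's just Python with made-up names (FlakyDisk, write_bucket); it isn't
ZFS or sd code.  It only shows what ``interpret the failure
aggressively'' means in practice: when SYNCHRONIZE CACHE fails, the
whole bucket since the last successful sync is treated as lost, even
though every individual write was acked, and the caller re-issues the
entire bucket from its own copy before anything (like the 'd' commit
sector) is allowed to depend on it:

import random

class FlakyDisk:
    """Acks every write, but a ``reset'' may silently eat the cached
    bucket, in which case SYNCHRONIZE CACHE reports failure."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.media = {}              # on stable storage
        self.cache = {}              # buffered, not yet stable

    def write(self, lba, data):
        self.cache[lba] = data       # always "succeeds"

    def synchronize_cache(self):
        if self.rng.random() < 0.3:  # a reset ate the cache
            self.cache.clear()
            return False             # not retryable: that data is gone
        self.media.update(self.cache)
        self.cache.clear()
        return True

def write_bucket(disk, bucket):
    """Issue a bucket of writes and, on a failed sync, re-issue the
    whole bucket; only a sync covering the re-issued writes counts."""
    while True:
        for lba, data in bucket.items():
            disk.write(lba, data)
        if disk.synchronize_cache():
            return                   # now, and only now, durable

disk = FlakyDisk(seed=7)
write_bucket(disk, {0: "aaaaaaaa", 1: "bbbbbbbb"})   # data bucket
write_bucket(disk, {9: "d -> block 1"})              # commit bucket
assert disk.media[1] == "bbbbbbbb"   # 'd' never points at garbage

Retrying just the SYNCHRONIZE CACHE without re-issuing the bucket would
be exactly the trap described above: the second Success would only
cover writes issued after the failure.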