Steve,

> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv, and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?
>
> 2) Assuming that the replication maintains write ordering, the secondary site will always have valid and self-consistent data, although it may be out of date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
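On question 1: a pool never depends on a single überblock copy. Each vdev label carries an array of überblock copies, each protected by its own checksum, and at import the newest copy that passes verification is the one activated, so a copy torn by sub-überblock replication is simply skipped. The minimal Python sketch below shows only that selection rule; the structure, the toy SHA-256 checksum and all names are assumptions made for illustration, not the actual ZFS on-disk format or code.

# Illustrative sketch only -- not actual ZFS source. It assumes an array of
# überblock copies, each stored with its own checksum, where the newest copy
# that verifies is the one a pool import would activate.
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Uberblock:
    txg: int          # transaction group this copy commits
    payload: bytes    # stands in for the root block pointer and friends
    checksum: bytes   # checksum stored alongside the copy

def make_uberblock(txg: int, payload: bytes) -> Uberblock:
    digest = hashlib.sha256(txg.to_bytes(8, "little") + payload).digest()
    return Uberblock(txg, payload, digest)

def checksum_ok(ub: Uberblock) -> bool:
    """A torn or partially replicated copy fails verification here."""
    expect = hashlib.sha256(ub.txg.to_bytes(8, "little") + ub.payload).digest()
    return expect == ub.checksum

def select_active_uberblock(copies: Sequence[Uberblock]) -> Optional[Uberblock]:
    """Skip damaged copies; of the survivors, take the highest txg."""
    valid = [ub for ub in copies if checksum_ok(ub)]
    return max(valid, key=lambda ub: ub.txg, default=None)

# Example: the newest copy arrives torn at the secondary, so selection falls
# back to the previous, still self-consistent überblock.
ring = [make_uberblock(100, b"root@100"), make_uberblock(101, b"root@101")]
ring[1].payload = b"root@1"            # simulate a partial replication
active = select_active_uberblock(ring)
assert active is not None and active.txg == 100

Whether the damaged copy is also rewritten transparently is a separate question, but a single torn copy by itself does not leave the pool without a usable überblock.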
For most implementations of resynchronization, not only are changes resilvered on a block-order basis, the resynchronization is also done in a single pass over the volume(s). To address the fact that resynchronization happens while additional changes are also being replicated, the concept of a resynchronization point is kept. As this resynchronization point traverses the volume from beginning to end, I/Os occurring at or before this point are replicated inline, whereas I/Os occurring after this point are marked so that they will be replicated later in block order. You are quite correct that the data is not consistent.

> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than you might expect. During active resynchronization, the remote volume contains a mixture of prior write-order-consistent data, resilvered block-order data, plus newly replicated data. Essentially, the partially-resynchronized LUNs are totally inconsistent until such time as the single pass over all the data is 100% complete.

For some, but not all, replication software, if the 'catch-up' resynchronization fails, read access to the LUNs should be prevented, or at least prevented while the LUNs are configured as remote mirrors. Availability Suite's Remote Mirror software (SNDR) marks such volumes as "need synchronization" and fails all application read and write I/Os.

> Obviously all filesystems can suffer with this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.
>
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Since Availability Suite is both Remote Mirror and Point-in-Time Copy software, it can be configured to automatically take a snapshot prior to resynchronization, and to automatically delete the snapshot once the resynchronization completes successfully. The use of I/O consistency groups assures not only that the replicas are write-order consistent during replication, but also that snapshots taken prior to resynchronization are consistent too.

> Thanks
>
> Steve

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
wk: 781.442.4042
http://blogs.sun.com/avs

http://www.opensolaris.org/os/project/avs/
http://www.opensolaris.org/os/project/iscsitgt/
http://www.opensolaris.org/os/community/storage/
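To make the resynchronization-point behaviour described above concrete, here is a minimal simulation sketch. All of the names (ResyncingMirror, application_write, resync_step) are assumptions made for illustration; this is not the actual SNDR, TrueCopy or SRDF implementation, only a model of a single-pass sweep in which writes behind the sweep point are replicated inline and writes ahead of it are marked and sent later in block order.

# Simulation sketch of single-pass 'catch-up' resynchronization driven by a
# resynchronization point. Illustrative only -- not real replication code.

class ResyncingMirror:
    """Primary-side view of a remote mirror doing catch-up resynchronization."""

    def __init__(self, nblocks, dirty_blocks):
        self.nblocks = nblocks
        self.dirty = set(dirty_blocks)   # blocks changed while the link was down
        self.resync_point = 0            # blocks below this point are back in sync

    def application_write(self, block, replicate):
        """Handle a new application write arriving while the sweep is running."""
        if block < self.resync_point:
            # Already-swept region: replicate inline, preserving write order.
            replicate(block)
        else:
            # Not yet swept: just mark it; the sweep sends it later in block
            # order, so write ordering is lost for this block.
            self.dirty.add(block)

    def resync_step(self, replicate):
        """Advance the sweep one block; returns False once the pass completes."""
        if self.resync_point >= self.nblocks:
            return False                 # 100% complete: secondary consistent again
        if self.resync_point in self.dirty:
            replicate(self.resync_point) # resilvered in block order, not write order
            self.dirty.discard(self.resync_point)
        self.resync_point += 1
        return True

# Until resync_step() returns False, the secondary holds a mixture of old
# write-order-consistent data, block-ordered resilvered data, and newly
# replicated data -- not something that should be imported as a zpool.
mirror = ResyncingMirror(nblocks=8, dirty_blocks={1, 5})
sent = []
mirror.application_write(6, sent.append)     # ahead of the sweep: only marked dirty
while mirror.resync_step(sent.append):
    pass
assert sent == [1, 5, 6]                     # dirty blocks went out in block order

It also shows why the snapshot discussed above has to be taken before the sweep starts: once the pass has begun, there is no write-order-consistent image left on the secondary to fall back to.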