Steve,

> I have a couple of questions and concerns about using ZFS in an environment where the underlying LUNs are replicated at a block level using products like HDS TrueCopy or EMC SRDF. Apologies in advance for the length, but I wanted the explanation to be clear.
>
> (I do realise that there are other possibilities such as zfs send/recv, and there are technical and business pros and cons for the various options. I don't want to start a 'which is best' argument :) )
>
> The CoW design of ZFS means that it goes to great lengths to always maintain on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. not needing fsck) based on that. This is the basis of my questions.
>
> 1) The first issue relates to the überblock. Updates to it are assumed to be atomic, but if the replication block size is smaller than the überblock then we can't guarantee that the whole überblock is replicated as an entity. That could in theory result in a corrupt überblock at the secondary.
>
> Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS just use an alternate überblock and rewrite the damaged one transparently?
>
> 2) Assuming that the replication maintains write ordering, the secondary site will always have valid and self-consistent data, although it may be out of date compared to the primary if the replication is asynchronous, depending on link latency, buffering, etc.
>
> Normally most replication systems do maintain write ordering, *except* for one specific scenario. If the replication is interrupted, for example because the secondary site is down or unreachable due to a comms problem, the primary site will keep a list of changed blocks. When contact between the sites is re-established there will be a period of 'catch-up' resynchronization. In most, if not all, cases this is done on a simple block-order basis. Write ordering is lost until the two sites are once again in sync and routine replication restarts.
>
> I can see this as having major ZFS impact. It would be possible for intermediate blocks to be replicated before the data blocks they point to, and in the worst case an updated überblock could be replicated before the block chains that it references have been copied. This breaks the assumption that the on-disk format is always self-consistent.
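On question 1: a pool never depends on a single überblock copy. Each vdev label carries an array of überblock copies, each protected by its own checksum, and at import the newest copy that passes verification is the one activated, so a copy torn by sub-überblock replication is simply skipped. The minimal Python sketch below shows only that selection rule; the structure, the toy SHA-256 checksum and all names are assumptions made for illustration, not the actual ZFS on-disk format or code.

# Illustrative sketch only -- not actual ZFS source. It assumes an array of
# überblock copies, each stored with its own checksum, where the newest copy
# that verifies is the one a pool import would activate.
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Uberblock:
    txg: int          # transaction group this copy commits
    payload: bytes    # stands in for the root block pointer and friends
    checksum: bytes   # checksum stored alongside the copy

def make_uberblock(txg: int, payload: bytes) -> Uberblock:
    digest = hashlib.sha256(txg.to_bytes(8, "little") + payload).digest()
    return Uberblock(txg, payload, digest)

def checksum_ok(ub: Uberblock) -> bool:
    """A torn or partially replicated copy fails verification here."""
    expect = hashlib.sha256(ub.txg.to_bytes(8, "little") + ub.payload).digest()
    return expect == ub.checksum

def select_active_uberblock(copies: Sequence[Uberblock]) -> Optional[Uberblock]:
    """Skip damaged copies; of the survivors, take the highest txg."""
    valid = [ub for ub in copies if checksum_ok(ub)]
    return max(valid, key=lambda ub: ub.txg, default=None)

# Example: the newest copy arrives torn at the secondary, so selection falls
# back to the previous, still self-consistent überblock.
ring = [make_uberblock(100, b"root@100"), make_uberblock(101, b"root@101")]
ring[1].payload = b"root@1"            # simulate a partial replication
active = select_active_uberblock(ring)
assert active is not None and active.txg == 100

Whether the damaged copy is also rewritten transparently is a separate question, but a single torn copy by itself does not leave the pool without a usable überblock.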
For most implementations of resynchronization, not only are changes resilvered on a block-order basis, the resynchronization is also done in a single pass over the volume(s). To address the fact that resynchronization happens while additional changes are also being replicated, the concept of a resynchronization point is kept. As this resynchronization point traverses the volume from beginning to end, I/Os occurring at or before this point are replicated inline, whereas I/Os occurring after this point are marked so that they will be replicated later in block order. You are quite correct that the data is not consistent.

> If a disaster happened during the 'catch-up', and the partially-resynchronized LUNs were imported into a zpool at the secondary site, what would/could happen? Refusal to accept the whole zpool? Rejection just of the files affected? System panic? How could recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than you might expect. During active resynchronization, the remote volume contains a mixture of prior write-order-consistent data, resilvered block-order data, plus newly replicated data. Essentially, the partially-resynchronized LUNs are totally inconsistent until such time as the single pass over all the data is 100% complete.

For some, but not all, replication software, if the 'catch-up' resynchronization fails, read access to the LUNs should be prevented, or at least prevented while the LUNs are configured as remote mirrors. Availability Suite's Remote Mirror software (SNDR) marks such volumes as "need synchronization" and fails all application read and write I/Os.

> Obviously all filesystems can suffer with this scenario, but ones that expect less from their underlying storage (like UFS) can be fscked, and although data that was being updated is potentially corrupt, existing data should still be OK and usable. My concern is that ZFS will handle this scenario less well.
>
> There are ways to mitigate this, of course, the most obvious being to take a snapshot of the (valid) secondary before starting resync, as a fallback. This isn't always easy to do, especially since the resync is usually automatic; there is no clear trigger to use for the snapshot. It may also be difficult to synchronize the snapshot of all LUNs in a pool. I'd like to better understand the risks/behaviour of ZFS before starting to work on mitigation strategies.

Since Availability Suite is both Remote Mirror and Point-in-Time Copy software, it can be configured to automatically take a snapshot prior to resynchronization, and to automatically delete the snapshot once the resynchronization completes successfully. The use of I/O consistency groups assures not only that the replicas are write-order consistent during replication, but also that snapshots taken prior to resynchronization are consistent too.

> Thanks
>
> Steve

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
wk: 781.442.4042
http://blogs.sun.com/avs

http://www.opensolaris.org/os/project/avs/
http://www.opensolaris.org/os/project/iscsitgt/
http://www.opensolaris.org/os/community/storage/
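To make the resynchronization-point behaviour described above concrete, here is a minimal simulation sketch. All of the names (ResyncingMirror, application_write, resync_step) are assumptions made for illustration; this is not the actual SNDR, TrueCopy or SRDF implementation, only a model of a single-pass sweep in which writes behind the sweep point are replicated inline and writes ahead of it are marked and sent later in block order.

# Simulation sketch of single-pass 'catch-up' resynchronization driven by a
# resynchronization point. Illustrative only -- not real replication code.

class ResyncingMirror:
    """Primary-side view of a remote mirror doing catch-up resynchronization."""

    def __init__(self, nblocks, dirty_blocks):
        self.nblocks = nblocks
        self.dirty = set(dirty_blocks)   # blocks changed while the link was down
        self.resync_point = 0            # blocks below this point are back in sync

    def application_write(self, block, replicate):
        """Handle a new application write arriving while the sweep is running."""
        if block < self.resync_point:
            # Already-swept region: replicate inline, preserving write order.
            replicate(block)
        else:
            # Not yet swept: just mark it; the sweep sends it later in block
            # order, so write ordering is lost for this block.
            self.dirty.add(block)

    def resync_step(self, replicate):
        """Advance the sweep one block; returns False once the pass completes."""
        if self.resync_point >= self.nblocks:
            return False                 # 100% complete: secondary consistent again
        if self.resync_point in self.dirty:
            replicate(self.resync_point) # resilvered in block order, not write order
            self.dirty.discard(self.resync_point)
        self.resync_point += 1
        return True

# Until resync_step() returns False, the secondary holds a mixture of old
# write-order-consistent data, block-ordered resilvered data, and newly
# replicated data -- not something that should be imported as a zpool.
mirror = ResyncingMirror(nblocks=8, dirty_blocks={1, 5})
sent = []
mirror.application_write(6, sent.append)     # ahead of the sweep: only marked dirty
while mirror.resync_step(sent.append):
    pass
assert sent == [1, 5, 6]                     # dirty blocks went out in block order

It also shows why the snapshot discussed above has to be taken before the sweep starts: once the pass has begun, there is no write-order-consistent image left on the secondary to fall back to.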