I have a couple of questions and concerns about using ZFS in an environment 
where the underlying LUNs are replicated at a block level using products like 
HDS TrueCopy or EMC SRDF.  Apologies in advance for the length, but I wanted 
the explanation to be clear.

(I do realise that there are other possibilities such as zfs send/recv and 
there are technical and business pros and cons for the various options. I don't 
want to start a 'which is best' argument :) )

The CoW design of ZFS means that it goes to great lengths to always maintain 
on-disk self-consistency, and ZFS can make certain assumptions about state (e.g. 
not needing fsck) based on that.  This is the basis of my questions. 
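As a toy illustration of why that holds (this is purely my own sketch of the CoW idea, not actual ZFS internals): in a CoW commit, new data and indirect blocks land in free space first, and the root/überblock update that makes them live comes last, so a crash at any point leaves the old tree intact:

```python
# Toy CoW commit: new blocks go to fresh locations, and only the final
# uberblock update switches the live tree. Cutting the write sequence
# at any prefix leaves a consistent (old or new) tree on disk.

disk = {"uberblock": "root_v1", "root_v1": "data_v1"}

def cow_commit(txg):
    """Yield the on-disk writes for one transaction group, in order."""
    new_root = f"root_v{txg}"
    yield (new_root, f"data_v{txg}")   # 1. write new data/indirect blocks
    yield ("uberblock", new_root)      # 2. repoint the root, atomically, last

writes = list(cow_commit(2))
# Apply only the first write (simulated crash before the root update):
disk[writes[0][0]] = writes[0][1]
print(disk["uberblock"])  # -> "root_v1": the old tree is still intact
```

If the final root update also lands, the pool flips to the new tree in one step; there is never a window where a partially-written tree is the live one.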

1) The first issue relates to the überblock.  Updates to it are assumed to be 
atomic, but if the replication block size is smaller than the überblock then we 
can't guarantee that the whole überblock is replicated as a single entity.  That 
could in theory result in a corrupt überblock at the secondary. 

Will this be caught and handled by the normal ZFS checksumming? If so, does ZFS 
just use an alternate überblock and rewrite the damaged one transparently?
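For what it's worth, my understanding (which may well be wrong) is that each vdev label holds a ring of recent überblocks, and on import ZFS activates the valid one with the highest transaction group number, so a torn überblock should simply lose out to the previous TXG. A rough Python sketch of that selection logic (the names and structure are my own illustration, not the real code):

```python
# Toy model of uberblock selection: each label holds a ring of
# (txg, checksum_ok) entries; import picks the newest valid one.

def pick_active_uberblock(ring):
    """Return the highest-TXG entry whose checksum verifies, or None."""
    valid = [ub for ub in ring if ub["checksum_ok"]]
    return max(valid, key=lambda ub: ub["txg"]) if valid else None

# The newest uberblock (txg 103) was torn by partial replication;
# selection silently falls back to txg 102.
ring = [
    {"txg": 101, "checksum_ok": True},
    {"txg": 102, "checksum_ok": True},
    {"txg": 103, "checksum_ok": False},  # torn / partially replicated
]
print(pick_active_uberblock(ring)["txg"])  # -> 102
```

If that model is right, a torn überblock at the secondary costs at most one transaction group of updates rather than the pool, but I'd appreciate confirmation.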

2) Assuming that the replication maintains write-ordering, the secondary site 
will always have valid and self-consistent data, although it may be out-of-date 
compared to the primary if the replication is asynchronous, depending on link 
latency, buffering, etc. 

Normally most replication systems do maintain write ordering, *except* for 
one specific scenario.  If the replication is interrupted, for example 
secondary site down or unreachable due to a comms problem, the primary site 
will keep a list of changed blocks.  When contact between the sites is 
re-established there will be a period of 'catch-up' resynchronization.  In 
most, if not all, cases this is done on a simple block-order basis.  
Write-ordering is lost until the two sites are once again in sync and routine 
replication restarts. 
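To make the failure mode concrete, here's a toy simulation (my own sketch; the block addresses and writes are invented) of how block-address-order resync can deliver a later write before an earlier one:

```python
# Toy resync: the primary logged these writes in time order while the
# link was down; each entry is (block_address, payload, time).
pending = [
    (50, "new data block", 1),   # t=1: data written at address 50
    (10, "new root block", 2),   # t=2: root/uberblock pointing at it
]

# Routine replication preserves time order; catch-up resync instead
# walks the dirty-block list in simple block-address order.
resync_order = sorted(pending, key=lambda w: w[0])

print([addr for addr, _, _ in resync_order])  # -> [10, 50]
# The root at address 10 is copied FIRST. If the link (or the primary
# site) is lost after that first copy, the secondary holds a new root
# referencing a data block that never arrived.
```

In time-ordered replication the cut can only ever leave the secondary slightly stale; in address-ordered resync it can leave it internally inconsistent, which is exactly the case below.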

I can see this as having a major ZFS impact.  It would be possible for 
intermediate blocks to be replicated before the data blocks they point to, and 
in the worst case an updated überblock could be replicated before the block 
chains that it references have been copied.  This breaks the assumption that 
the on-disk format is always self-consistent. 

If a disaster happened during the 'catch-up', and the partially-resynchronized 
LUNs were imported into a zpool at the secondary site, what would/could happen? 
Refusal to accept the whole zpool? Rejection of just the affected files? A 
system panic? How could recovery from this situation be achieved?
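One recovery approach I can imagine (speculation on my part; I believe later ZFS builds expose something along these lines as `zpool import -F`, though I'm not certain of its availability) is to walk back through older überblocks until one references a fully-intact tree. A toy sketch of that idea, with invented names:

```python
# Toy recovery: try uberblocks newest-first, accepting the first one
# whose entire referenced block tree is present on disk.

def tree_ok(disk, block):
    """Check that a block and everything it references exists."""
    if block not in disk:
        return False
    return all(tree_ok(disk, child) for child in disk[block])

def recover(disk, uberblock_txgs):
    """Return the newest txg whose tree is complete, else None."""
    for txg in sorted(uberblock_txgs, reverse=True):
        if tree_ok(disk, f"root@{txg}"):
            return txg
    return None

# txg 7's root arrived during catch-up but its data block "d2" did not:
disk = {"root@6": ["d1"], "d1": [], "root@7": ["d2"]}
print(recover(disk, [6, 7]))  # -> 6: fall back to the older, intact txg
```

Whether real ZFS would do this automatically on import, or refuse the pool outright, is precisely what I'm asking.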

Obviously any filesystem can suffer in this scenario, but ones that expect 
less from their underlying storage (like UFS) can be fscked, and although data 
that was being updated is potentially corrupt, existing data should still be OK 
and usable.  My concern is that ZFS will handle this scenario less well. 

There are ways to mitigate this, of course, the most obvious being to take a 
snapshot of the (valid) secondary before starting resync, as a fallback.  This 
isn't always easy to do, especially since the resync is usually automatic; 
there is no clear trigger to use for the snapshot. It may also be difficult to 
synchronize the snapshot of all LUNs in a pool. I'd like to better understand 
the risks/behaviour of ZFS before starting to work on mitigation strategies. 

Thanks

Steve
 
 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss