Miles,

>>>>>> "mb" == Matt Beebe <[EMAIL PROTECTED]> writes:
>
>    mb> When using AVS's "Async replication with memory queue", am I
>    mb> guaranteed a consistent ZFS on the distant end?  The assumed
>    mb> failure case is that the replication broke, and now I'm trying
>    mb> to promote the secondary replicate with what might be stale
>    mb> data.  Recognizing in advance that some of the data would be
>    mb> (obviously) stale,
>
>    mb> my concern is whether or not ZFS stayed consistent, or does
>    mb> AVS know how to "bundle" ZFS's atomic writes properly?
>
> Assuming the ZFS claims of ``always consistent on disk'' are true (or
> are fixed to be true), all that's required is to write the updates in
> time order.
>
> simoncr was saying in the thread that Maurice quoted:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=68881&tstart=30
>
> that during a partial-resync after a loss of connectivity AVS writes
> in LBA order while DRBD writes in time order.  The thread was about
> resyncing and restoring replication, not about broken async
> replication.
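
To make the difference concrete, here is a toy sketch (Python, not AVS
or DRBD code; the names are made up) of the two resync orders being
contrasted:

    # Illustration only.  'backlog' is the set of writes the source
    # accumulated while replication was broken.
    from collections import namedtuple

    Write = namedtuple("Write", ["seq", "lba", "data"])   # seq = time order

    def resync_lba_order(target, backlog):
        # AVS-style partial resync: walk the dirty blocks by address.
        # If this loop is interrupted, 'target' is neither old nor new.
        for w in sorted(backlog, key=lambda w: w.lba):
            target[w.lba] = w.data

    def resync_time_order(target, backlog):
        # DRBD-style, as described in the thread: replay in original order,
        # so an interrupted pass leaves 'target' at some past source state.
        for w in sorted(backlog, key=lambda w: w.seq):
            target[w.lba] = w.data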
>
> The DRBD virtue here shows up when you start a resync and then want to
> abandon it---because the resync is taking too long, or the network
> failed permanently halfway through, something like that.  With DRBD
> it's possible to give up, discard the unsync'd data, and bring up the
> cluster on the partially-updated sync target.
>
> With AVS and LBA-order resync, you have the ``give up'' option only
> before you begin the resync: the proposed sync target doesn't have the
> latest data on it, but it's mountable.  You lose some protection by
> agreeing to start a sync: after you begin, the sync target is totally
> inconsistent and unmountable until the sync completes successfully.
> So, if the sync source node were destroyed, or a crappy network
> connection went down for good during the resync, you'd lose everything!

To address this issue there is a feature called ndr_ii. This is an
automatic snapshot taken before resynchronization starts, so that on
the remote node there is always a write-order consistent volume
available. If replication stops, takes too long, etc., the snapshot
can be restored, so that one does not lose everything.
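
In rough pseudocode (Python, not the actual AVS internals;
take_snapshot() and restore_snapshot() below are stand-ins for the
point-in-time copy operations, not real calls), the pattern ndr_ii
automates looks like this:

    # Sketch of snapshot-before-resync: the target is write-order
    # consistent when we start, so an aborted resync can be rolled back.
    def resync_with_fallback(target, backlog, take_snapshot, restore_snapshot):
        snap = take_snapshot(target)            # point-in-time copy
        try:
            for lba, data in sorted(backlog):   # LBA-order update pass
                target[lba] = data
        except Exception:                       # resync aborted, link died, ...
            restore_snapshot(target, snap)      # fall back to consistent image
            raise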

> DRBD's way sounds like a clear and very simple win at first, but makes
> me ask:
>
> 1. DRBD cannot _really_ write in time order because (a) it would mean
>    making a write barrier between each sector and (b) there isn't a
>    fixed time order to begin with because block layers and even some
>    disks allow multiple outstanding commands.
>
>    Does he mean DRBD stores the write barriers in its dirty-log and
>    implements them during resync?  In this case, the target will NOT
>    be a point-in-time copy of a past source volume, it'll just be
>    ``correct'' w.r.t. the barrier rules.  I could imagine this
>    working in a perfect world... or, at least, in a well-tested,
>    well-integrated world.
>
>    In our world, that strategy could make for an interesting test of
>    filesystem bugs w.r.t. write barriers---are they truly issuing all
>    the barriers needed for formal correctness, or are they
>    unknowingly dependent on the 95th-percentile habits of real-world
>    disks?  What if you have some mistake that is blocking write
>    barriers entirely (like LVM2/devicemapper)---on real disks it
>    might just cause some database corruption, but DRBD implementing
>    this rule precisely could imaginably degrade to the AVS case, and
>    write two days of stale data in LBA order because it hasn't seen a
>    write barrier in two days!
>
> 2. On DRBD's desired-feature list is the ability to replicate sets of
>    disks rather than individual disks, keeping them all in sync.  ZFS
>    probably tends to:
>
>    (a) write Green blocks
>    (b) issue barriers to all disks in a vdev
>    (c) write Orange blocks
>    (d) wait until the last disk has acknowledged its barrier
>    (e) write Red blocks
>
>    After this pattern it's true pool-wide (not disk-wide) that no Red
>    blocks will be written on any disk unless all Green blocks have
>    been written to all disks.
>
>    AIUI, DRBD can't preserve this right now.  It resynchronizes disks
>    independently, not in sets.
>
> Getting back to your question, I'd guess that running in async mode is
> like constantly resynchronizing, and an ordinary cluster failover in
> async mode is equivalent to an interrupted resync.
>
> So, AVS doesn't implement (1) during a regular resync.  But maybe for
> a cluster that's online in async mode it DOES implement (1)?
>
> HOWEVER, even if AVS implemented a (1)-like DRBD policy when it's in
> ``async'' mode (I don't know that it does), I can't imagine that it
> would manage (2) correctly.  Does AVS have any concept of ``async disk
> sets'', where write barriers have a meaning across disks?

AVS has the concept of I/O consistency groups, where all disks of a  
multi-volume filesystem (ZFS, QFS) or database (Oracle, Sybase) are  
kept write-order consistent when using either sync or async replication.
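
To illustrate the property a consistency group is meant to preserve
(the pool-wide Green/Orange/Red pattern in (2) above), here is a toy
model in Python; it shows the semantics only, not how AVS implements
it:

    # Writes are tagged with a barrier generation; within a consistency
    # group nothing from generation N+1 reaches any disk until every
    # generation-N write has reached every disk.  Purely illustrative.
    from collections import defaultdict

    def apply_group(disks, writes):
        # writes: iterable of (generation, disk_id, lba, data) tuples
        by_gen = defaultdict(list)
        for gen, disk_id, lba, data in writes:
            by_gen[gen].append((disk_id, lba, data))
        for gen in sorted(by_gen):              # Green, then Orange, then Red
            for disk_id, lba, data in by_gen[gen]:
                disks[disk_id][lba] = data
            # barrier: generation 'gen' is complete on every disk before
            # anything from the next generation is written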

> I can't imagine it existing without a configuration knob for it.
> And ZFS needs (2).
>
> I would expect AVS ``sync'' mode to provide (1) and (2), so the
> question is only about ``async'' mode failovers.
>
> So, based on my reasoning, it's UNSAFE to use AVS in async mode for
> ZFS replication on any pool which needs more than one device to have
> ``sufficient replicas''.  A single-device pool stays within that
> limit, and so does a pool containing a single mirror vdev with two
> devices.
>
> I've no particular knowledge of AVS at all though, besides what we've
> all read here.

I can surely help with this: http://docs.sun.com/app/docs?p=coll%2FAVS4.0


Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
