>>>>> "mb" == Matt Beebe <[EMAIL PROTECTED]> writes:

    mb> When using AVS's "Async replication with memory queue", am I
    mb> guaranteed a consistent ZFS on the distant end?  The assumed
    mb> failure case is that the replication broke, and now I'm trying
    mb> to promote the secondary replicate with what might be stale
    mb> data.  Recognizing in advance that some of the data would be
    mb> (obviously) stale, 

    mb> my concern is whether or not ZFS stayed consistent, or does
    mb> AVS know how to "bundle" ZFS's atomic writes properly?

Assuming the ZFS claims of ``always consistent on disk'' are true (or
are fixed to be true), all that's required for a consistent replica is
to apply the updates to the remote copy in time order.

simoncr was saying in the thread that Maurice quoted:

 http://www.opensolaris.org/jive/thread.jspa?threadID=68881&tstart=30

that during a partial resync after a loss of connectivity, AVS writes
in LBA order while DRBD writes in time order.  That thread was about
resyncing and restoring replication, though, not about broken async
replication.
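
To make that concrete, here's a toy sketch in Python (illustrative
only: none of this is AVS or DRBD code, and it glosses over ZFS
details like its ring of redundant uberblocks) of why replaying a
backlog of writes in time order keeps the replica importable at every
step, while replaying the same backlog in LBA order does not:

  # Toy model of catching a replica up from a backlog of queued writes.
  # Each entry is (time_order, lba, payload).
  old_replica = {10: "old uberblock -> 500/600",
                 500: "old data", 600: "old indirect"}

  backlog = [
      (1, 501, "new data block"),            # CoW: new blocks go to fresh LBAs
      (2, 601, "new indirect block"),
      (3,  10, "new uberblock -> 501/601"),  # ZFS commits this last
  ]

  def replay(disk, backlog, key, stop_after=None):
      """Apply the backlog in the given order, optionally dying part way."""
      disk = dict(disk)
      for i, (_, lba, payload) in enumerate(sorted(backlog, key=key), 1):
          disk[lba] = payload
          if stop_after is not None and i >= stop_after:
              break   # simulate the link or the source node going away
      return disk

  # Time order, interrupted after 2 of 3 writes: the old uberblock still
  # points at untouched old blocks, so the replica matches the exact
  # state the source was in just before it committed the new uberblock.
  print(replay(old_replica, backlog, key=lambda w: w[0], stop_after=2))

  # LBA order, interrupted after 2 of 3 writes: the new uberblock (lba
  # 10) landed first and points at an indirect block (lba 601) that
  # never arrived, a state the source was never in.
  print(replay(old_replica, backlog, key=lambda w: w[1], stop_after=2))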

The DRBD virtue here shows up when you start a resync and then want to
abandon it---say the resync is taking too long, or the network fails
permanently halfway through.  With DRBD it's possible to give up,
discard the unsynced data, and bring up the cluster on the
partially-updated sync target.

With AVS and LBA-order resync, you have the ``give up'' option only
before you begin the resync: the proposed sync target doesn't have the
latest data on it, but it's mountable.  You give up some protection by
agreeing to start a sync: once you begin, the sync target is totally
inconsistent and unmountable until the sync completes successfully.
So, if the sync source node were destroyed, or a crappy network
connection went down for good during the resync, you've lost both
copies!

DRBD's way sounds like a clear and very simple win at first, but it
makes me ask:

 1. DRBD cannot _really_ write in time order, because (a) that would
    mean issuing a write barrier between every pair of consecutive
    writes, and (b) there isn't a fixed time order to begin with:
    block layers and even some disks allow multiple outstanding
    commands.

    Does he mean DRBD stores the write barriers in its dirty log and
    replays them during resync?  In that case the target will NOT be a
    point-in-time copy of some past source volume; it'll just be
    ``correct'' w.r.t. the barrier rules (the epoch-replay sketch
    after this list shows the idea).  I could imagine this working in
    a perfect world...or, at least, a well-tested, well-integrated
    world.

    In our world, that strategy could make for an interesting test of
    filesystem bugs w.r.t. write barriers---are filesystems truly
    issuing all the barriers needed for formal correctness, or are
    they unknowingly dependent on the 95th-percentile habits of
    real-world disks?  And what if something in the stack is
    swallowing write barriers entirely (like LVM2/devicemapper)?  On
    real disks that might just cause some database corruption, but
    DRBD implementing this rule precisely could imaginably degrade to
    the AVS case and write two days of stale data in LBA order,
    because it hasn't seen a write barrier in two days!

 2. On DRBD's desired-feature list is the ability to replicate sets of
    disks rather than individual disks, keeping them all in sync.  ZFS
    probably tends to:

    (a) write Green blocks
    (b) issue barriers to all disks in a vdev
    (c) write Orange blocks
    (d) wait until the last disk has acknowledged its barrier
    (e) write Red blocks

    After this pattern it's true pool-wide (not just disk-wide) that
    no Red block is written to any disk unless all Green blocks have
    been written to all disks.

    AIUI, DRBD can't preserve this right now: it resynchronizes disks
    independently, not in sets (the disk-set sketch after this list
    shows what goes wrong).
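
To illustrate point 1, here's a toy Python sketch of the ``store the
barriers in the dirty log and honour them during resync'' idea.  The
data structures are hypothetical, not DRBD's actual ones: writes
between two barrier markers form an ``epoch'', any order is allowed
inside an epoch, and epoch N must be completely applied before
anything from epoch N+1.

  BARRIER = object()

  # Dirty log as recorded on the source: writes plus barrier markers.
  log = [
      (501, "green data block"), (601, "green indirect block"),
      BARRIER,
      (10, "uberblock -> 501/601"),
      BARRIER,
      (700, "red block of the next transaction group"),
  ]

  def resync(disk, log):
      disk = dict(disk)
      epoch = []

      def flush():
          # No ordering was promised inside an epoch, so any order (here
          # plain LBA order) is fine; the only rule is that this epoch is
          # completely on disk before anything from the next one starts.
          for lba, payload in sorted(epoch):
              disk[lba] = payload
          epoch.clear()

      for entry in log:
          if entry is BARRIER:
              flush()
          else:
              epoch.append(entry)
      flush()
      return disk

  print(resync({}, log))

  # The target is only ``correct w.r.t. the barrier rules'', not a
  # point-in-time copy of the source.  And if the source (or a layer
  # like LVM2 that swallows barriers) never issues a barrier at all,
  # the whole log is one epoch and this degrades to plain LBA order:
  # the AVS case.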
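
And to illustrate point 2, a disk-set sketch (again purely
hypothetical, not a real AVS or DRBD interface): if each disk's stream
is replayed independently, the replica can land in a pool-wide state
the source never passed through, even though each disk on its own
looks fine.

  # The source follows the Green / barrier-to-all-disks / Red pattern,
  # so pool-wide it is never true that a Red block exists anywhere
  # before every Green block is on every disk.
  source_streams = {
      "diskA": ["green", "red"],
      "diskB": ["green", "red"],
  }

  def resync_per_disk(streams, progress):
      """Replay each disk's stream independently, as if each disk had
      its own replication session; `progress` says how far each session
      got before the link died."""
      return {d: blocks[:progress[d]] for d, blocks in streams.items()}

  # diskA's session raced ahead while diskB's stalled: the replica now
  # has a Red block on diskA but no Green block on diskB, a pool-wide
  # state the source never passed through.
  replica = resync_per_disk(source_streams, {"diskA": 2, "diskB": 0})
  print(replica)   # {'diskA': ['green', 'red'], 'diskB': []}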

Getting back to your question, I'd guess that running in async mode is
like constantly resynchronizing, so an ordinary cluster failover in
async mode is equivalent to an interrupted resync.

So, AVS doesn't implement (1) during a regular resync.  But maybe for
a cluster that's online in async mode it DOES implement (1)?

HOWEVER, even if AVS implemented a (1)-like, DRBD-style policy when
it's in ``async'' mode (I don't know that it does), I can't imagine
that it would manage (2) correctly.  Does AVS have any concept of
``async disk sets'', where write barriers have meaning across disks?
I can't imagine such a feature existing without a configuration knob
for it.  And ZFS needs (2).

I would expect AVS ``sync'' mode to provide (1) and (2), so the
question is only about ``async'' mode failovers.

So, based on my reasoning, it's UNSAFE to use AVS in async mode for
ZFS replication on any pool which needs more than one device to have
``sufficient replicas''.  A pool on a single device would be safe by
that test, and so would a pool containing a single mirror vdev of two
devices, since either half of the mirror alone is sufficient.

I've no particular knowledge of AVS at all though, besides what we've
all read here.
