>>>>> "mb" == Matt Beebe <[EMAIL PROTECTED]> writes:
mb> When using AVS's "Async replication with memory queue", am I
mb> guaranteed a consistent ZFS on the distant end?  The assumed
mb> failure case is that the replication broke, and now I'm trying
mb> to promote the secondary replicate with what might be stale
mb> data.  Recognizing in advance that some of the data would be
mb> (obviously) stale, my concern is whether or not ZFS stayed
mb> consistent, or does AVS know how to "bundle" ZFS's atomic
mb> writes properly?

Assuming the ZFS claim of ``always consistent on disk'' is true (or
gets fixed to be true), all that's required is to write the updates
in time order.

simoncr was saying in the thread that Maurice quoted:

  http://www.opensolaris.org/jive/thread.jspa?threadID=68881&tstart=30

that during a partial resync after a loss of connectivity AVS writes
in LBA order while DRBD writes in time order.  That thread was about
resyncing to restore replication, not about broken async replication.
(A toy replay sketch at the end of this mail illustrates the
difference between the two orders.)

The DRBD virtue here shows up when you start a resync and then want
to abandon it: the resync is taking a long time, or the network
failed permanently half way through, something like that.  With DRBD
it's possible to give up, discard the unsync'd data, and bring up the
cluster on the partially-updated sync target.

With AVS and LBA-order resync, you have the ``give up'' option only
before you begin the resync: the proposed sync target doesn't have
the latest data on it, but it's mountable.  You lose some protection
by agreeing to start a sync: once you begin, the sync target is
totally inconsistent and unmountable until the sync completes
successfully.  So, if the sync source node were destroyed, or a
flaky network connection went down for good during the resync, you
lose everything!

DRBD's way sounds like a clear and very simple win at first, but it
makes me ask:

1. DRBD cannot _really_ write in time order, because (a) that would
   mean a write barrier between each sector, and (b) there isn't a
   fixed time order to begin with, since block layers and even some
   disks allow multiple outstanding commands.

   Does he mean DRBD stores the write barriers in its dirty log and
   honours them during resync?  In that case the target will NOT be
   a point-in-time copy of a past source volume; it'll just be
   ``correct'' w.r.t. the barrier rules.  I could imagine this
   working in a perfect world, or at least in a well-tested,
   well-integrated world.  In our world, that strategy could make
   for an interesting test of filesystem bugs w.r.t. write barriers:
   are filesystems truly issuing all the barriers needed for formal
   correctness, or are they unknowingly dependent on the
   95th-percentile habits of real-world disks?  And what if some
   mistake is blocking write barriers entirely (like
   LVM2/devicemapper)?  On real disks that might just cause some
   database corruption, but a DRBD that implements this rule
   precisely could imaginably degrade to the AVS case and write two
   days of stale data in LBA order, because it hasn't seen a write
   barrier in two days!

2. On DRBD's desired-feature list is replicating sets of disks
   rather than individual disks, keeping them all in sync.  ZFS
   probably tends to:

     (a) write Green blocks
     (b) issue barriers to all disks in a vdev
     (c) write Orange blocks
     (d) wait until the last disk has acknowledged its barrier
     (e) write Red blocks

   After this pattern it's true pool-wide (not just disk-wide) that
   no Red block will be written to any disk unless all Green blocks
   have been written to all disks.  AIUI, DRBD can't preserve this
   right now.
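Here's a toy model of that (a)-(e) pattern, purely as an illustration
of the ordering I'm guessing at: it is not ZFS code, the device names
are placeholders, and the "disks" are just Python lists.

  class ToyDisk:
      def __init__(self, name):
          self.name = name
          self.committed = []        # writes that landed, in order
          self.barrier_outstanding = False

      def write(self, label):
          self.committed.append(label)

      def issue_barrier(self):
          # (b): ask this disk to flush; don't wait for the answer yet
          self.barrier_outstanding = True

      def wait_for_ack(self):
          # (d): in real life this blocks until the disk acknowledges
          self.barrier_outstanding = False

  def write_group(disks, colour, count):
      # scatter a group of same-coloured blocks across the whole pool
      for i in range(count):
          disks[i % len(disks)].write("%s-%d" % (colour, i))

  disks = [ToyDisk("c%dt0d0" % i) for i in range(3)]

  write_group(disks, "green", 6)     # (a)
  for d in disks:
      d.issue_barrier()              # (b)
  write_group(disks, "orange", 6)    # (c)
  for d in disks:
      d.wait_for_ack()               # (d): the *last* ack is the gate
  write_group(disks, "red", 6)       # (e)

  for d in disks:
      print(d.name, d.committed)

  # The property this buys is pool-wide, not per-disk: no "red" block
  # is issued to *any* disk until every disk has acknowledged the
  # barrier that follows the "green" blocks.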
DRBD resynchronizes disks independently, not in sets, so it has no
way to honour that cross-disk ordering.

Getting back to your question: I'd guess that running in async mode
is like constantly resynchronizing, and that an ordinary cluster
failover in async mode is equivalent to an interrupted resync.  So,
AVS doesn't implement (1) during a regular resync, but maybe for a
cluster that's online in async mode it DOES implement (1)?

HOWEVER, even if AVS applied a (1)-like DRBD policy while in
``async'' mode (I don't know that it does), I can't imagine it
managing (2) correctly.  Does AVS have any concept of ``async disk
sets'', where write barriers have meaning across disks?  I can't
imagine such a feature existing without a configuration knob for it.
And ZFS needs (2).  I would expect AVS ``sync'' mode to provide both
(1) and (2), so the question is only about ``async'' mode failovers.

So, based on my reasoning, it's UNSAFE to use AVS in async mode for
ZFS replication on any pool which needs more than one device to have
``sufficient replicas''.  The safe cases are the ones where a single
device already gives ``sufficient replicas'': a pool on one device,
or a pool containing a single mirror vdev of two devices.  I've no
particular knowledge of AVS at all, though, beyond what we've all
read here.
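Finally, here's the toy replay model I promised above, to make the
LBA-order vs. time-order distinction concrete.  Nothing in it is AVS
or DRBD code: the block addresses and contents are made up, and it
pretends a strict time order of the dirty writes exists, which point
(1) above argues real systems don't really have.

  # Secondary as it stood when the link broke: stale but consistent.
  target = {0: "old", 1: "old", 2: "old", 3: "old", 4: "old"}

  # Writes the primary made after the break, in the order they
  # happened (high LBAs first on purpose, so the two orders differ).
  dirty_log = [(4, "new-t1"), (1, "new-t2"), (3, "new-t3"), (0, "new-t4")]

  def resync(writes, give_up_after):
      # copy dirty blocks to the target, abandoning part way through
      disk = dict(target)
      for n, (lba, data) in enumerate(writes):
          if n == give_up_after:
              break              # the link died, or we gave up
          disk[lba] = data
      return disk

  # Time order: at any abandonment point the target matches some past
  # state of the source.  Stale, but mountable.
  print(resync(dirty_log, give_up_after=2))

  # LBA order: dirty blocks land sorted by address, so an abandoned
  # target mixes old and new data into a state the source never had.
  print(resync(sorted(dirty_log), give_up_after=2))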