Miles,

>>>>>> "mb" == Matt Beebe <[EMAIL PROTECTED]> writes:
>
> mb> When using AVS's "Async replication with memory queue", am I
> mb> guaranteed a consistent ZFS on the distant end?  The assumed
> mb> failure case is that the replication broke, and now I'm trying
> mb> to promote the secondary replicate with what might be stale
> mb> data.  Recognizing in advance that some of the data would be
> mb> (obviously) stale,
>
> mb> my concern is whether or not ZFS stayed consistent, or does
> mb> AVS know how to "bundle" ZFS's atomic writes properly?
>
> Assuming the ZFS claims of ``always consistent on disk'' are true (or
> are fixed to be true), all that's required is to write the updates in
> time order.
>
> simoncr was saying in the thread that Maurice quoted:
>
> http://www.opensolaris.org/jive/thread.jspa?threadID=68881&tstart=30
>
> that during a partial resync after a loss of connectivity AVS writes
> in LBA order while DRBD writes in time order.  The thread was about
> resyncing and restoring replication, not about broken async
> replication.
>
> The DRBD virtue here is if you start a resync and want to abandon
> it---if the resync took a long time, or the network failed permanently
> half way through the resync---something like that.  With DRBD it's
> possible to give up, discard the unsync'd data, and bring up the
> cluster on the partially-updated sync target.
>
> With AVS and LBA-order resync, you have the ``give up'' option only
> before you begin the resync: the proposed sync target doesn't have the
> latest data on it, but it's mountable.  You lose some protection by
> agreeing to start a sync: after you begin, the sync target is totally
> inconsistent and unmountable until the sync completes successfully.
> So, if the sync source node were destroyed or a crappy network
> connection went fully down during the resync, you lose everything!
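A toy way to picture the hazard you describe (a sketch of the concept
only, not AVS or DRBD internals; the block numbers and contents are
made up):

  # A toy model (not AVS or DRBD internals) of why an interrupted
  # LBA-order resync leaves the sync target unmountable, while a
  # time-ordered replay always leaves it at some valid past state.

  # Latest source state, and a stale-but-consistent sync target.
  source = {5: "uberblock v2", 9: "block tree v2"}
  target = {5: "uberblock v1", 9: "block tree v1"}
  dirty  = {5, 9}                     # blocks that differ between them

  def resync_lba_order(target, source, dirty, stop_after):
      """Copy dirty blocks lowest LBA first; stop early to simulate the
      source node or the network dying mid-resync."""
      for n, block in enumerate(sorted(dirty)):
          if n == stop_after:
              break
          target[block] = source[block]
      return target

  # Interrupted after one block: the target now holds "uberblock v2"
  # pointing at "block tree v1", a combination the source never had,
  # so the always-consistent-on-disk property is gone.
  print(resync_lba_order(dict(target), source, dirty, stop_after=1))

  # A time-ordered replay would copy "block tree v2" before
  # "uberblock v2", the order the source wrote them, so stopping at
  # any point leaves either the old state or the new one, both
  # mountable.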
To address this issue there is a feature called ndr_ii.  This is an
automatic snapshot taken before resynchronization starts, so that on
the remote node there is always a write-order consistent volume
available.  If replication stops, is taking too long, etc., the
snapshot can be restored, so that one does not lose everything.

> DRBD's way sounds like a clear and very simple win at first, but makes
> me ask:
>
> 1. DRBD cannot _really_ write in time order because (a) it would mean
>    making a write barrier between each sector and (b) there isn't a
>    fixed time order to begin with because block layers and even some
>    disks allow multiple outstanding commands.
>
>    Does he mean DRBD stores the write barriers in its dirty-log and
>    implements them during resync?  In this case, the target will NOT
>    be a point-in-time copy of a past source volume, it'll just be
>    ``correct'' w.r.t. the barrier rules.  I could imagine this
>    working in a perfect world, ...or, at least, a well-tested,
>    well-integrated world.
>
>    In our world, that strategy could make for an interesting test of
>    filesystem bugs w.r.t. write barriers---are they truly issuing all
>    the barriers needed for formal correctness, or are they
>    unknowingly dependent on the 95th-percentile habits of real-world
>    disks?  What if you have some mistake that is blocking write
>    barriers entirely (like LVM2/devicemapper)---on real disks it
>    might just cause some database corruption, but DRBD implementing
>    this rule precisely could imaginably degrade to the AVS case, and
>    write two days of stale data in LBA order because it hasn't seen a
>    write barrier in two days!
>
> 2. On DRBD's desired feature list is: to replicate sets of disks
>    rather than individual disks, keeping them all in sync.  ZFS
>    probably tends to:
>
>    (a) write Green blocks
>    (b) issue barriers to all disks in a vdev
>    (c) write Orange blocks
>    (d) wait until the last disk has acknowledged its barrier
>    (e) write Red blocks
>
>    After this pattern it's true pool-wide (not disk-wide) that no Red
>    blocks will be written on any disk unless all Green blocks have
>    been written to all disks.
>
>    AIUI, DRBD can't preserve this right now.  It resynchronizes disks
>    independently, not in sets.
>
> Getting back to your question, I'd guess that running in async mode is
> like you are constantly resynchronizing, and an ordinary cluster
> failover in async mode is equivalent to an interrupted resync.
>
> So, AVS doesn't implement (1) during a regular resync.  But maybe for
> a cluster that's online in async mode it DOES implement (1)?
>
> HOWEVER, even if AVS implemented a (1)-like DRBD policy when it's in
> ``async'' mode (I don't know that it does), I can't imagine that it
> would manage (2) correctly.  Does AVS have any concept of ``async disk
> sets'', where write barriers have a meaning across disks?

AVS has the concept of I/O consistency groups, where all disks of a
multi-volume filesystem (ZFS, QFS) or database (Oracle, Sybase) are
kept write-order consistent when using either sync or async
replication.
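As a rough sketch of the concept (not AVS code; the disk names, block
numbers and contents are made up), compare replicating the members of
a pool as one write-ordered group with replicating each member on its
own:

  # Concept sketch only, not AVS code.  Two member disks of one pool
  # receive the Green/Orange/Red pattern from point (2) above; the
  # pool-wide rule is that no Red block may land anywhere before every
  # Green block has landed everywhere.

  log = [                     # (seq, disk, block, data) in time order
      (1, "d0", 100, "green"),
      (2, "d1", 200, "green"),
      # barrier: all green blocks acknowledged on both disks
      (3, "d0", 101, "orange"),
      (4, "d1", 201, "orange"),
      # barrier
      (5, "d1", 202, "red"),
  ]

  def replicate_as_group(log, cutoff):
      """One time-ordered stream for the whole set (consistency group).
      Stopping at any cutoff leaves a pool state the source really had."""
      replica = {"d0": {}, "d1": {}}
      for seq, disk, block, data in log:
          if seq > cutoff:
              break
          replica[disk][block] = data
      return replica

  def replicate_per_disk(log, progress):
      """Each disk replicated or resynced independently, with its own
      position in the log."""
      replica = {"d0": {}, "d1": {}}
      for seq, disk, block, data in log:
          if seq <= progress[disk]:
              replica[disk][block] = data
      return replica

  # d1 fully caught up while d0 stalled at the start: the replica holds
  # a red block on d1 but d0 never received its green block, exactly
  # the cross-disk ordering the (a)-(e) pattern is supposed to rule out.
  print(replicate_per_disk(log, {"d0": 0, "d1": 5}))

  # The grouped stream cut off anywhere (here after seq 3) is still a
  # legal pool-wide state.
  print(replicate_as_group(log, cutoff=3))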
> I can't imagine it existing without a configuration knob for it.  And
> ZFS needs (2).
>
> I would expect AVS ``sync'' mode to provide (1) and (2), so the
> question is only about ``async'' mode failovers.
>
> So... based on my reasoning, it's UNSAFE to use AVS in async mode for
> ZFS replication on any pool which needs more than 1 device to have
> ``sufficient replicas''.  A single device would meet that requirement,
> and so would a pool containing a single mirror vdev with two devices.
>
> I've no particular knowledge of AVS at all though, besides what we've
> all read here.

I can surely help with this: http://docs.sun.com/app/docs?p=coll%2FAVS4.0

Jim Dunham
Engineering Manager
Storage Platform Software Group
Sun Microsystems, Inc.