Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

Paolo Bonzini Wed, 30 May 2012 08:07:06 -0700

Il 30/05/2012 14:34, Geert Jansen ha scritto:
> 
> On 05/29/2012 02:52 PM, Paolo Bonzini wrote:
> 
>>> Does the drive-mirror coroutine send the writes to the target in the
>>> same order as they are sent to the source? I assume so.
>>
>> No, it doesn't.  It's asynchronous; for continuous replication, the
>> target knows that it has a consistent view whenever it sees a flush on
>> the NBD stream.  Flushing the target is needed anyway before writing the
>> dirty bitmap, so the target might as well exploit them to get
>> information about the state of the source.
>>
>> The target _must_ flush to disk when it receives a flush commands, not
>> matter how close they are.  It _may_ choose to snapshot the disk, for
>> example establishing one new snapshot every 5 seconds.
> 
> Interesting. So it works quite differently than i had assumed. Some
> follow-up questions hope you don't mind...
> 
>  * I assume a flush roughly corresponds to an fsync() in the guest OS?


Yes (or a metadata flush from the guest OS filesystem, since our guest
models do not support attaching the FUA bit to single writes).

>  * Writes will not be re-ordered over a flush boundary, right?

More or less.  This for example is a valid ordering:

    write sector 0
                             write 0 returns
    flush
    write sector 1
                             write 1 returns
                             flush returns

However, writes that have already returned will not be re-ordered over a
flush boundary.

>> A synchronous implementation is not forbidden by the spec (by design),
>> but at the moment it's a bit more complex to implement because, as you
>> mention, it requires buffering the I/O data on the host.
> 
> So if i understand correctly, you'd only be keeping a list of
> (len, offset) tuples without any data, and drive-mirror then reads the
> data from the disk image? If that is the case how do you handle a flush?
> Does a flush need to wait for drive-mirror to drain the entire outgoing
> queue to the target before it can complete? If not how do prevent writes
> that happen after a flush from overwriting the data that will be sent to
> the target in case that hasn't reached the flush point yet.

The key is that:

1) you only flush the target when you have a consistent image of the
source on the destination, and the replication server only creates a
snapshot when it receives a flush.  Thus, the server does not create a
consistent snapshot unless the client was able to keep pace with the guest.

2) target flushes do not have to coincide with a source flush.  Writes
after the last source flush _can_ be inconsistent between the source and
the destination!  What matters is that all writes up to the last source
flush are consistent.

Say the guest starts with (4 characters = 1 sectors) "AAAA BBBB CCCC" on
disk

and then the following happens

    guest           disk           dirty count        mirroring
 -------------------------------------------------------------------
                                       0
    write 1 = XXXX                     1
    FLUSH
                    write 1 = XXXX
                    dirty bitmap: sector 1 dirty
    write 2 = YYYY                     2
                                       1           copy sector 1 = XXXX
                                       0           copy sector 2 = YYYY
                                                   FLUSH
                    dirty bitmap: all clean
    write 0 = ZZZZ
                    write 0 = ZZZZ

and then a power loss happens on the source.

The guest now has the dirty bitmap saying "all clean" even though the
source now is "ZZZZ XXXX CCCC" and the destination "AAAA XXXX YYYY".
However, this is not a problem because both are consistent with the last
flush.

I attach a Promela model of the algorithm I'm going to implement.  It's
not exactly the one I posted upthread; I successfully ran this one
through a model checker, so this one works. :)  (I tested 3 sectors / 1
write, i.e. the case above, and 2 sectors / 3 writes. It should be
enough given that it goes exhaustively through the entire state space).

I have another model with two concurrent writers, but it is quite messy
and I don't think it adds much.

It shouldn't be hard to follow, the only tricky thing is that multiple
branches in an "if" or "do" can be true, and if so all paths will be
explored by the model checker.  An "else" is only executed if all the
other paths are false.

>> Yes, this is already all correct.
> 
> OK, i think i was confused by your description of "drive-mirror" in the
> wiki. It says that starts mirroring, but what it also does is that it
> copies the source to the target before it does that. It is clear from
> the description of the "sync" option though.

Yes, the "sync" option simply fills in the dirty bitmap before starting
the actual loop.

Paolo

/*
 * Formal model for disk synchronization.
 *
 * Copyright (C) 2012 Red Hat, Inc.
 * Author: Paolo Bonzini <pbonz...@redhat.com>
 *
 * State space explodes real fast:
 *
 *   SEC \ MAX       1           2              3
 *    2           170K          5M            48M
 *    3            17M          ??             ??
 *
 * and so does the amount of memory required to run verification,
 * so I didn't allow customizing the number of sectors beyond 3.
 */

/* Number of writes tested (max 255, good luck ;).   */
#define MAX 3

/* Number of sectors tested (max 3).  */
#define SEC 2

#define EXTRA_ASSERT

byte cache_src[SEC];        /* Value written to the source (on cache) */
byte disk_src[SEC];         /* Value written to the source (on medium) */
byte cache_dest[SEC];       /* Value written to the destination (on cache) */
byte disk_dest[SEC];        /* Value written to the destination (on medium) */
bit cache_dirty[SEC];       /* Value of the persistent dirty bitmap (on cache) 
*/
bit disk_dirty[SEC];        /* Value of the persistent dirty bitmap (on medium) 
*/
bit volatile_dirty[SEC];    /* Value of the volatile dirty bitmap */
byte count;                 /* Number of bits set in the volatile dirty bitmap 
*/
bit sim_done;

#ifdef EXTRA_ASSERT
/* Only used for extra assertion checks, not part of the algorithm.
 * Does not make verification substantially slower.
 */
bit in_flight[SEC];
#endif

/* This process repeatedly tests assertions and checks for the end
 * of the run.
 */
active proctype assertions() {
    byte sector;

    do
    :: true -> atomic {
        sector = (sector + 1) % SEC;

        /* The cache must not have older data than the disk.  */
        assert(disk_src[sector] <= cache_src[sector]);
        assert(disk_dest[sector] <= cache_dest[sector]);

        /* Data cannot come from the future.  :) */
        assert(cache_dest[sector] <= cache_src[sector]);

        /* The volatile bitmap can say anything, but the persistent
         * bitmap must be clear only if the destination disk can
         * be recovered.  Flushing the destination before the source
         * is fine.
         */
        assert(cache_dirty[sector] || disk_dest[sector] >= disk_src[sector]);

        /* The persistent bitmap conservatively approximates the volatile
         * bitmap.  */
        assert(cache_dirty[sector] >= volatile_dirty[sector]);

        /* The count must be accurate.  */
        assert(count == (volatile_dirty[0]
#if SEC > 1
                        + volatile_dirty[1]
#endif
#if SEC > 2
                        + volatile_dirty[2]
#endif
                ));

#if 0
        if
        :: volatile_dirty[0] || cache_dirty[0] || disk_dirty[0] || disk_src[0] 
> disk_dest[0] || disk_src[0] < MAX
#if SEC > 1
        || volatile_dirty[1] || cache_dirty[1] || disk_dirty[1] || disk_src[1] 
> disk_dest[1] || disk_src[1] < MAX
#endif
#if SEC > 2
        || volatile_dirty[2] || cache_dirty[2] || disk_dirty[2] || disk_src[2] 
> disk_dest[2] || disk_src[2] < MAX
#endif
        :: else                                                     -> sim_done 
= 1;
        fi;
#endif
    }
    od;
}

/* The OS can flush data to disk at arbitrary times.  But data can be assumed
 * to never be on the source platters until an explicit flush.  */

active proctype os() {
    do
    :: disk_dest[0] != cache_dest[0]   -> disk_dest[0] = cache_dest[0];
    :: disk_dirty[0] != cache_dirty[0] -> disk_dirty[0] = cache_dirty[0];
#if SEC > 1
    :: disk_dest[1] != cache_dest[1]   -> disk_dest[1] = cache_dest[1];
    :: disk_dirty[1] != cache_dirty[1] -> disk_dirty[1] = cache_dirty[1];
#endif
#if SEC > 2
    :: disk_dest[2] != cache_dest[2]   -> disk_dest[2] = cache_dest[2];
    :: disk_dirty[2] != cache_dirty[2] -> disk_dirty[2] = cache_dirty[2];
#endif
    od;
}

active proctype writes() {
    byte sector;

loop:
    if
    :: cache_src[0] < MAX -> atomic { sector = 0; goto write; }
#if SEC > 1
    :: cache_src[1] < MAX -> atomic { sector = 1; goto write; }
#endif
#if SEC > 2
    :: cache_src[2] < MAX -> atomic { sector = 2; goto write; }
#endif

    :: true -> goto flush;
    fi;

write:
#ifdef EXTRA_ASSERT
    in_flight[sector] = 1;
#endif

    /* First write the data.  */
    cache_src[sector]++;

    /* Then update the persistent dirty bitmap.  It's cheaper than copying
     * the volatile bitmap to the persistent bitmap on every flush.
     *
     * Updates to the bitmap have to be done under a lock, otherwise
     * the mirroring process can copy volatile_dirty to cache_dirty
     * at the point marked XXX.  That would set cache_dirty[sector]
     * back to 0 even though the write hasn't reached disk_dest yet.
     *
     * When implementing this with coroutines the lock will be implicit,
     * because context switches can only happen in precise places.
     */
    atomic {
        cache_dirty[sector] = 1;

        /* XXX */

        count = count + !volatile_dirty[sector];
        volatile_dirty[sector] = 1;
#ifdef EXTRA_ASSERT
        in_flight[sector] = 0;
#endif
    }
    goto loop;

flush:
    /* Flush the data and the bitmap, any order will do.  */
    do
    :: disk_src[0] != cache_src[0]     -> disk_src[0] = cache_src[0];
    :: disk_dirty[0] != cache_dirty[0] -> disk_dirty[0] = cache_dirty[0];
#if SEC > 1
    :: disk_src[1] != cache_src[1]     -> disk_src[1] = cache_src[1];
    :: disk_dirty[1] != cache_dirty[1] -> disk_dirty[1] = cache_dirty[1];
#endif
#if SEC > 2
    :: disk_src[2] != cache_src[2]     -> disk_src[2] = cache_src[2];
    :: disk_dirty[2] != cache_dirty[2] -> disk_dirty[2] = cache_dirty[2];
#endif
    :: else -> break;
    od;
    goto loop;
}

active proctype mirror() {
    byte sector;

loop:
    /* Pick a sector, block until one is available.  */
    if
    :: volatile_dirty[0] -> sector = 0;
#if SEC > 1
    :: volatile_dirty[1] -> sector = 1;
#endif
#if SEC > 2
    :: volatile_dirty[2] -> sector = 2;
#endif
    fi;

    /* Update the volatile dirty bitmap.  */
    atomic {
        count = count - volatile_dirty[sector];
        volatile_dirty[sector] = 0;
    }

    /* Copy the data */
    cache_dest[sector] = cache_src[sector];

    /* Check whether the data should be flushed.  */
    if
    :: count == 0 ->
        /* Flush the destination to disk.  Writes can happen in the
         * meanwhile, and they will set count > 0.
         */
        do
        :: disk_dest[0] != cache_dest[0] -> disk_dest[0] = cache_dest[0];
#if SEC > 1
        :: disk_dest[1] != cache_dest[1] -> disk_dest[1] = cache_dest[1];
#endif
#if SEC > 2
        :: disk_dest[2] != cache_dest[2] -> disk_dest[2] = cache_dest[2];
#endif
        :: else -> break;
        od;

        /* Reset the persistent dirty bitmap under a lock.  */
        atomic {
            cache_dirty[0] = volatile_dirty[0];
#if SEC > 1
            cache_dirty[1] = volatile_dirty[1];
#endif
#if SEC > 2
            cache_dirty[2] = volatile_dirty[2];
#endif

#ifdef EXTRA_ASSERT
            assert(cache_dest[0] == cache_src[0] || volatile_dirty[0] || 
in_flight[0]);
#if SEC > 1
            assert(cache_dest[1] == cache_src[1] || volatile_dirty[1] || 
in_flight[1]);
#endif
#if SEC > 2
            assert(cache_dest[2] == cache_src[2] || volatile_dirty[2] || 
in_flight[2]);
#endif
#endif
        }

        /* No need to flush the dirty bitmap to disk here.  */

    :: else -> skip;
    fi;
    goto loop;
}

Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

Reply via email to