On 24.07.2012 13:04, Paolo Bonzini wrote:
> This patch adds the implementation of a new job that mirrors a disk to
> a new image while letting the guest continue using the old image.
> The target is treated as a "black box" and data is copied from the
> source to the target in the background.  This can be used for several
> purposes, including storage migration, continuous replication, and
> observation of the guest I/O in an external program.  It is also a
> first step in replacing the inefficient block migration code that is
> part of QEMU.
>
> The job is possibly never-ending, but it is logically structured into
> two phases: 1) copy all data as fast as possible until the target
> first gets in sync with the source; 2) keep target in sync and
> ensure that reopening to the target gets a correct (full) copy
> of the source data.
>
> The second phase is indicated by the progress in "info block-jobs"
> reporting the current offset to be equal to the length of the file.
> When the job is cancelled in the second phase, QEMU will run the
> job until the source is clean and quiescent, then it will report
> successful completion of the job.
>
> In other words, the BLOCK_JOB_CANCELLED event means that the target
> may _not_ be consistent with a past state of the source; the
> BLOCK_JOB_COMPLETED event means that the target is consistent with
> a past state of the source.  (Note that it could already happen
> that management lost the race against QEMU and got a completion
> event instead of cancellation).
>
> It is not yet possible to complete the job and switch over to the target
> disk.  The next patches will fix this and add many refinements to the
> basic idea introduced here.  These include improved error management,
> some tunable knobs and performance optimizations.
>
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> ---
>  block/Makefile.objs |    2 +-
>  block/mirror.c      |  232 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  block_int.h         |   20 +++++
>  qapi-schema.json    |   17 ++++
>  trace-events        |    7 ++
>  5 files changed, 277 insertions(+), 1 deletion(-)
>  create mode 100644 block/mirror.c
>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index c45affc..f1a394a 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -9,4 +9,4 @@ block-obj-$(CONFIG_LIBISCSI) += iscsi.o
>  block-obj-$(CONFIG_CURL) += curl.o
>  block-obj-$(CONFIG_RBD) += rbd.o
>
> -common-obj-y += stream.o
> +common-obj-y += stream.o mirror.o
> diff --git a/block/mirror.c b/block/mirror.c
> new file mode 100644
> index 0000000..f7d36f9
> --- /dev/null
> +++ b/block/mirror.c
> @@ -0,0 +1,232 @@
> +/*
> + * Image mirroring
> + *
> + * Copyright Red Hat, Inc. 2012
> + *
> + * Authors:
> + *  Paolo Bonzini  <pbonz...@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + *
> + */
> +
> +#include "trace.h"
> +#include "blockjob.h"
> +#include "block_int.h"
> +#include "qemu/ratelimit.h"
> +
> +enum {
> +    /*
> +     * Size of data buffer for populating the image file.  This should be large
> +     * enough to process multiple clusters in a single call, so that populating
> +     * contiguous regions of the image is efficient.
> +     */
> +    BLOCK_SIZE = 512 * BDRV_SECTORS_PER_DIRTY_CHUNK, /* in bytes */
> +};
> +
> +#define SLICE_TIME 100000000ULL /* ns */
> +
> +typedef struct MirrorBlockJob {
> +    BlockJob common;
> +    RateLimit limit;
> +    BlockDriverState *target;
> +    MirrorSyncMode mode;
> +    int64_t sector_num;
> +    uint8_t *buf;
> +} MirrorBlockJob;
> +
> +static int coroutine_fn mirror_iteration(MirrorBlockJob *s)
> +{
> +    BlockDriverState *source = s->common.bs;
> +    BlockDriverState *target = s->target;
> +    QEMUIOVector qiov;
> +    int ret, nb_sectors;
> +    int64_t end;
> +    struct iovec iov;
> +
> +    end = s->common.len >> BDRV_SECTOR_BITS;
> +    s->sector_num = bdrv_get_next_dirty(source, s->sector_num);
> +    nb_sectors = MIN(BDRV_SECTORS_PER_DIRTY_CHUNK, end - s->sector_num);
> +    bdrv_reset_dirty(source, s->sector_num, nb_sectors);
> +
> +    /* Copy the dirty cluster.  */
> +    iov.iov_base = s->buf;
> +    iov.iov_len  = nb_sectors * 512;
> +    qemu_iovec_init_external(&qiov, &iov, 1);
> +
> +    trace_mirror_one_iteration(s, s->sector_num, nb_sectors);
> +    ret = bdrv_co_readv(source, s->sector_num, nb_sectors, &qiov);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    return bdrv_co_writev(target, s->sector_num, nb_sectors, &qiov);
> +}
> +
> +static void coroutine_fn mirror_run(void *opaque)
> +{
> +    MirrorBlockJob *s = opaque;
> +    BlockDriverState *bs = s->common.bs;
> +    int64_t sector_num, end;
> +    int ret = 0;
> +    int n;
> +    bool synced = false;
> +
> +    if (block_job_is_cancelled(&s->common)) {
> +        goto immediate_exit;
> +    }
> +
> +    s->common.len = bdrv_getlength(bs);
> +    if (s->common.len < 0) {
> +        block_job_completed(&s->common, s->common.len);
> +        return;
> +    }
> +
> +    end = s->common.len >> BDRV_SECTOR_BITS;
> +    s->buf = qemu_blockalign(bs, BLOCK_SIZE);
> +
> +    if (s->mode == MIRROR_SYNC_MODE_FULL || s->mode == MIRROR_SYNC_MODE_TOP) {
I think this is the common case, so s->mode != MIRROR_SYNC_MODE_NONE
might describe it better?

> +        /* First part, loop on the sectors and initialize the dirty bitmap.  */
> +        BlockDriverState *base;
> +        base = s->mode == MIRROR_SYNC_MODE_FULL ? NULL : bs->backing_hd;
> +        for (sector_num = 0; sector_num < end; ) {
> +            int64_t next = (sector_num | (BDRV_SECTORS_PER_DIRTY_CHUNK - 1)) + 1;
> +            ret = bdrv_co_is_allocated_above(bs, base,
> +                                             sector_num, next - sector_num, &n);
> +
> +            if (ret < 0) {
> +                break;
> +            } else if (ret == 1) {
> +                bdrv_set_dirty(bs, sector_num, n);
> +                sector_num = next;
> +            } else {
> +                sector_num += n;
> +            }

Maybe it would be worth checking for n == 0 and returning an error in
that case. One example where this happens is when asking for the
allocation status after EOF. It shouldn't happen as long as
bdrv_truncate() is forbidden while the job runs, but an extra check
rarely hurts.

> +        }
> +    }
> +
> +    if (ret < 0) {
> +        goto immediate_exit;
> +    }

Why not do that directly instead of having a break; first just to get
here?

> +
> +    s->sector_num = -1;
> +    for (;;) {
> +        uint64_t delay_ns;
> +        int64_t cnt;
> +        bool should_complete;
> +
> +        cnt = bdrv_get_dirty_count(bs);
> +        if (cnt != 0) {
> +            ret = mirror_iteration(s);
> +            if (ret < 0) {
> +                break;

goto immediate_exit? It's the same now, but code after the loop may be
added in the future.

> +            }
> +            cnt = bdrv_get_dirty_count(bs);
> +        }
> +
> +        if (cnt != 0) {
> +            should_complete = false;
> +        } else {
> +            trace_mirror_before_flush(s);
> +            bdrv_flush(s->target);

No error handling?

Kevin
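To illustrate the n == 0 remark on the bitmap initialization loop, here is a minimal standalone sketch. It is not QEMU code: SECTORS_PER_CHUNK, alloc_query_fn, scan_allocated() and query_all_allocated() are hypothetical stand-ins that only model the shape of the loop, and returning -EIO for the "no progress" case is an assumption, not something the patch specifies. The callback loosely mimics the return convention of bdrv_co_is_allocated_above() as used in the quoted hunk (1 = allocated, 0 = not allocated, negative = error, run length in *n).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical chunk size in sectors; assumed to be a power of two. */
#define SECTORS_PER_CHUNK 2048

/* Hypothetical allocation query: returns 1 (allocated), 0 (not allocated)
 * or a negative errno, and stores the length of the run in *n. */
typedef int (*alloc_query_fn)(int64_t sector_num, int nb_sectors, int *n,
                              void *opaque);

/* Walks [0, end) chunk by chunk, as the quoted loop does.  Returns the
 * number of sectors that would have been marked dirty, or a negative
 * errno.  Errors leave the loop directly instead of via break. */
static int64_t scan_allocated(int64_t end, alloc_query_fn query, void *opaque)
{
    int64_t sector_num = 0;
    int64_t dirty = 0;

    assert((SECTORS_PER_CHUNK & (SECTORS_PER_CHUNK - 1)) == 0);

    while (sector_num < end) {
        int64_t next = (sector_num | (SECTORS_PER_CHUNK - 1)) + 1;
        int n = 0;
        int ret = query(sector_num, (int)(next - sector_num), &n, opaque);

        if (ret < 0) {
            return ret;
        }
        if (n == 0) {
            /* No progress is possible (e.g. query past EOF); without this
             * check the "sector_num += n" branch would spin forever. */
            return -EIO;
        }
        if (ret == 1) {
            dirty += n;          /* stand-in for bdrv_set_dirty() */
            sector_num = next;
        } else {
            sector_num += n;
        }
    }
    return dirty;
}

/* Sample query: every sector below *(int64_t *)opaque is allocated;
 * at or past that length it reports an empty, unallocated run. */
static int query_all_allocated(int64_t sector_num, int nb_sectors, int *n,
                               void *opaque)
{
    int64_t left = *(int64_t *)opaque - sector_num;

    if (left <= 0) {
        *n = 0;
        return 0;
    }
    *n = left < nb_sectors ? (int)left : nb_sectors;
    return 1;
}
```

With a backing length equal to end, the scan visits every chunk and terminates normally. If the backing length is shorter than end (the truncation case mentioned above), the query reports n == 0 and the scan returns an error instead of looping.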