On Mon, 07/09 19:37, Vladimir Sementsov-Ogievskiy wrote:
> The fleecing scheme works as follows: we want a kind of temporary snapshot
> of the active drive A. We create a temporary image B, with B->backing = A.
> Then we start backup(sync=none) from A to B. From this point, B reads
> as a point-in-time snapshot of A (A continues to be the active drive,
> accepting guest IO).
>
> This scheme needs some additional synchronization between reads from B
> and backup COW operations, otherwise the following situation is
> theoretically possible:
>
> (assume B is qcow2 and the client is an NBD client reading from B)
>
> 1. The client starts reading, takes the qcow2 mutex in qcow2_co_preadv,
>    and gets as far as loading the L2 table (assume a cache miss).
>
> 2. Guest write => backup COW => qcow2 write =>
>    tries to take the qcow2 mutex => waits.
>
> 3. The L2 table is loaded, we see that the cluster is UNALLOCATED, go to
>    "case QCOW2_CLUSTER_UNALLOCATED" and unlock the mutex before
>    bdrv_co_preadv(bs->backing, ...).
>
> 4. Aha, the mutex is unlocked, so the backup COW continues, and we finally
>    finish the guest write and change the cluster in our active disk A.
>
> 5. Only now do we actually do bdrv_co_preadv(bs->backing, ...) and read
>    the _new, updated_ data.
>
> To avoid this, let's make backup writes serializing, so that they do not
> intersect with reads from B.
>
> Note: we expand the range of handled cases from (sync=none and
> B->backing = A) to just (A in the backing chain of B), to finally allow
> safe reading from B during backup in all cases where A is in the backing
> chain of B, i.e. B formally looks like a point-in-time snapshot of A.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
> ---
>  block/backup.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/block/backup.c b/block/backup.c
> index f3e4e814b6..319fc922e8 100644
> --- a/block/backup.c
> +++ b/block/backup.c
> @@ -47,6 +47,8 @@ typedef struct BackupBlockJob {
>      HBitmap *copy_bitmap;
>      bool use_copy_range;
>      int64_t copy_range_size;
> +
> +    bool serialize_target_writes;
>  } BackupBlockJob;
>
>  static const BlockJobDriver backup_job_driver;
> @@ -102,6 +104,8 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>      QEMUIOVector qiov;
>      BlockBackend *blk = job->common.blk;
>      int nbytes;
> +    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
> +    int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>
>      hbitmap_reset(job->copy_bitmap, start / job->cluster_size, 1);
>      nbytes = MIN(job->cluster_size, job->len - start);
> @@ -112,8 +116,7 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>      iov.iov_len = nbytes;
>      qemu_iovec_init_external(&qiov, &iov, 1);
>
> -    ret = blk_co_preadv(blk, start, qiov.size, &qiov,
> -                        is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0);
> +    ret = blk_co_preadv(blk, start, qiov.size, &qiov, read_flags);
>      if (ret < 0) {
>          trace_backup_do_cow_read_fail(job, start, ret);
>          if (error_is_read) {
> @@ -124,11 +127,11 @@ static int coroutine_fn backup_cow_with_bounce_buffer(BackupBlockJob *job,
>
>      if (qemu_iovec_is_zero(&qiov)) {
>          ret = blk_co_pwrite_zeroes(job->target, start,
> -                                   qiov.size, BDRV_REQ_MAY_UNMAP);
> +                                   qiov.size, write_flags | BDRV_REQ_MAY_UNMAP);
>      } else {
>          ret = blk_co_pwritev(job->target, start,
> -                             qiov.size, &qiov,
> -                             job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0);
> +                             qiov.size, &qiov, write_flags |
> +                             (job->compress ? BDRV_REQ_WRITE_COMPRESSED : 0));
>      }
>      if (ret < 0) {
>          trace_backup_do_cow_write_fail(job, start, ret);
> @@ -156,6 +159,8 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
>      int nr_clusters;
>      BlockBackend *blk = job->common.blk;
>      int nbytes;
> +    int read_flags = is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0;
> +    int write_flags = job->serialize_target_writes ? BDRV_REQ_SERIALISING : 0;
>
>      assert(QEMU_IS_ALIGNED(job->copy_range_size, job->cluster_size));
>      nbytes = MIN(job->copy_range_size, end - start);
> @@ -163,7 +168,7 @@ static int coroutine_fn backup_cow_with_offload(BackupBlockJob *job,
>      hbitmap_reset(job->copy_bitmap, start / job->cluster_size,
>                    nr_clusters);
>      ret = blk_co_copy_range(blk, start, job->target, start, nbytes,
> -                            is_write_notifier ? BDRV_REQ_NO_SERIALISING : 0, 0);
> +                            read_flags, write_flags);
>      if (ret < 0) {
>          trace_backup_do_cow_copy_range_fail(job, start, ret);
>          hbitmap_set(job->copy_bitmap, start / job->cluster_size,
> @@ -701,6 +706,9 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
>                         sync_bitmap : NULL;
>      job->compress = compress;
>
> +    /* Detect image-fleecing (and similar) schemes */
> +    job->serialize_target_writes = bdrv_chain_contains(target, bs);
> +
>      /* If there is no backing file on the target, we cannot rely on COW if our
>       * backup cluster size is smaller than the target cluster size. Even for
>       * targets with a backing file, try to avoid COW if possible. */
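
The detection in the last hunk hinges on one property: the scheme is in effect exactly when the source node is reachable from the backup target by following backing links. For readers less familiar with the block layer, here is a minimal standalone sketch of such a containment check; the Node type is invented for the sketch (QEMU's real type is BlockDriverState and the real helper is bdrv_chain_contains()):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block node with a backing link. */
typedef struct Node {
    struct Node *backing;
} Node;

/* Return true if base is top itself or appears somewhere in top's
 * backing chain -- the condition under which the target formally
 * reads as a point-in-time snapshot of the source, so target writes
 * must be made serializing. */
static bool chain_contains(const Node *top, const Node *base)
{
    while (top && top != base) {
        top = top->backing;
    }
    return top != NULL;
}

With the fleecing setup from the commit message (B->backing = A, backup target B), such a check on (B, A) is true, so serialize_target_writes gets set.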
I always thought mirror was by far the most complicated job, but now backup is catching up quickly!

Reviewed-by: Fam Zheng <f...@redhat.com>
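
For anyone new to the serializing-request machinery the patch relies on: roughly, a write flagged BDRV_REQ_SERIALISING is not allowed to overlap in time with other in-flight requests on the same byte range of the node. Below is a toy, thread-based model of that one idea, not QEMU's implementation (the real code is coroutine-based, lives in block/io.c, and also makes later overlapping requests wait for an in-flight serializing one); the Request and NodeState types are invented for the sketch:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Request {
    int64_t offset;
    int64_t bytes;
    struct Request *next;
} Request;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    Request *in_flight;     /* requests currently running on this node */
} NodeState;

static bool overlaps(const Request *r, int64_t offset, int64_t bytes)
{
    return offset < r->offset + r->bytes && r->offset < offset + bytes;
}

/* Non-serializing request (e.g. the NBD client's read): just track it. */
static void request_begin(NodeState *ns, Request *req)
{
    pthread_mutex_lock(&ns->lock);
    req->next = ns->in_flight;
    ns->in_flight = req;
    pthread_mutex_unlock(&ns->lock);
}

/* Serializing request (the backup COW write to B): wait until nothing
 * overlapping is in flight before it is allowed to start. */
static void serializing_request_begin(NodeState *ns, Request *req)
{
    pthread_mutex_lock(&ns->lock);
    bool again = true;
    while (again) {
        again = false;
        for (Request *r = ns->in_flight; r; r = r->next) {
            if (overlaps(r, req->offset, req->bytes)) {
                again = true;
                pthread_cond_wait(&ns->cond, &ns->lock);
                break;
            }
        }
    }
    req->next = ns->in_flight;
    ns->in_flight = req;
    pthread_mutex_unlock(&ns->lock);
}

static void request_end(NodeState *ns, Request *req)
{
    pthread_mutex_lock(&ns->lock);
    for (Request **p = &ns->in_flight; *p; p = &(*p)->next) {
        if (*p == req) {
            *p = req->next;
            break;
        }
    }
    pthread_cond_broadcast(&ns->cond);
    pthread_mutex_unlock(&ns->lock);
}

In the race from the commit message, the COW write to B plays the serializing role: it cannot complete (and therefore the guest write to A cannot complete) while the NBD client's overlapping read on B is still in flight, so the read can no longer observe the new data through the backing link.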