10.04.2019 23:20, Max Reitz wrote: > What bs->file and bs->backing mean depends on the node. For filter > nodes, both signify a node that will eventually receive all R/W > accesses. For format nodes, bs->file contains metadata and data, and > bs->backing will not receive writes -- instead, writes are COWed to > bs->file. Usually. > > In any case, it is not trivial to guess what a child means exactly with > our currently limited form of expression. It is better to introduce > some functions that actually guarantee a meaning: > > - bdrv_filtered_cow_child() will return the child that receives requests > filtered through COW. That is, reads may or may not be forwarded > (depending on the overlay's allocation status), but writes never go to > this child. > > - bdrv_filtered_rw_child() will return the child that receives requests > filtered through some very plain process. Reads and writes issued to > the parent will go to the child as well (although timing, etc. may be > modified). > > - All drivers but quorum (but quorum is pretty opaque to the general > block layer anyway) always only have one of these children: All read > requests must be served from the filtered_rw_child (if it exists), so > if there was a filtered_cow_child in addition, it would not receive > any requests at all. > (The closest here is mirror, where all requests are passed on to the > source, but with write-blocking, write requests are "COWed" to the > target. But that just means that the target is a special child that > cannot be introspected by the generic block layer functions, and that > source is a filtered_rw_child.) > Therefore, we can also add bdrv_filtered_child() which returns that > one child (or NULL, if there is no filtered child). > > Also, many places in the current block layer should be skipping filters > (all filters or just the ones added implicitly, it depends) when going > through a block node chain. They do not do that currently, but this > patch makes them. > > One example for this is qemu-img map, which should skip filters and only > look at the COW elements in the graph. The change to iotest 204's > reference output shows how using blkdebug on top of a COW node used to > make qemu-img map disregard the rest of the backing chain, but with this > patch, the allocation in the base image is reported correctly. > > Furthermore, a note should be made that sometimes we do want to access > bs->backing directly. This is whenever the operation in question is not > about accessing the COW child, but the "backing" child, be it COW or > not. This is the case in functions such as bdrv_open_backing_file() or > whenever we have to deal with the special behavior of @backing as a > blockdev option, which is that it does not default to null like all > other child references do. > > Finally, the query functions (query-block and query-named-block-nodes) > are modified to return any filtered child under "backing", not just > bs->backing or COW children. This is so that filters do not interrupt > the reported backing chain. This changes the output of iotest 184, as > the throttled node now appears as a backing child. > > Signed-off-by: Max Reitz <mre...@redhat.com> > ---
[..] > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -660,8 +660,9 @@ static int mirror_exit_common(Job *job) > &error_abort); > if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { > BlockDriverState *backing = s->is_none_mode ? src : s->base; > - if (backing_bs(target_bs) != backing) { > - bdrv_set_backing_hd(target_bs, backing, &local_err); > + if (bdrv_backing_chain_next(target_bs) != backing) { > + bdrv_set_backing_hd(bdrv_skip_rw_filters(target_bs), backing, here you support filters above target_bs ... > + &local_err); > if (local_err) { > error_report_err(local_err); > ret = -EPERM; > @@ -711,7 +712,7 @@ static int mirror_exit_common(Job *job) > block_job_remove_all_bdrv(bjob); > bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL, > &error_abort); > - bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), > &error_abort); > + bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, > &error_abort); > > /* We just changed the BDS the job BB refers to (with either or both of > the > * bdrv_replace_node() calls), so switch the BB back so the cleanup does > @@ -903,7 +904,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) > } else { > s->target_cluster_size = BDRV_SECTOR_SIZE; > } > - if (backing_filename[0] && !target_bs->backing && > + if (backing_filename[0] && !bdrv_filtered_cow_child(target_bs) && And here we need it too [continuing from here] > s->granularity < s->target_cluster_size) { > s->buf_size = MAX(s->buf_size, s->target_cluster_size); > s->cow_bitmap = bitmap_new(length); > @@ -1083,7 +1084,7 @@ static void mirror_complete(Job *job, Error **errp) > if (s->backing_mode == MIRROR_OPEN_BACKING_CHAIN) { > int ret; > > - assert(!target->backing); > + assert(!bdrv_filtered_cow_child(target)); hmm and here... Possibly, we should add a kind of s->filtered_target to use in all such cases. > ret = bdrv_open_backing_file(target, NULL, "backing", errp); > if (ret < 0) { > return; > @@ -1650,7 +1651,9 @@ static void mirror_start_job(const char *job_id, > BlockDriverState *bs, > * any jobs in them must be blocked */ > if (target_is_backing) { > BlockDriverState *iter; > - for (iter = backing_bs(bs); iter != target; iter = backing_bs(iter)) > { > + for (iter = bdrv_filtered_bs(bs); iter != target; should it be filtered_target too? > + iter = bdrv_filtered_bs(iter)) > + { > /* XXX BLK_PERM_WRITE needs to be allowed so we don't block > * ourselves at s->base (if writes are blocked for a node, they > are > * also blocked for its backing file). The other options would > be a > @@ -1691,7 +1694,7 @@ fail: > > bdrv_child_try_set_perm(mirror_top_bs->backing, 0, BLK_PERM_ALL, > &error_abort); > - bdrv_replace_node(mirror_top_bs, backing_bs(mirror_top_bs), > &error_abort); > + bdrv_replace_node(mirror_top_bs, mirror_top_bs->backing->bs, > &error_abort); > > bdrv_unref(mirror_top_bs); > } > @@ -1707,14 +1710,14 @@ void mirror_start(const char *job_id, > BlockDriverState *bs, > MirrorCopyMode copy_mode, Error **errp) > { > bool is_none_mode; > - BlockDriverState *base; > + BlockDriverState *base = NULL; dead assignment > > if (mode == MIRROR_SYNC_MODE_INCREMENTAL) { > error_setg(errp, "Sync mode 'incremental' not supported"); > return; > } > is_none_mode = mode == MIRROR_SYNC_MODE_NONE; > - base = mode == MIRROR_SYNC_MODE_TOP ? backing_bs(bs) : NULL; > + base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; > mirror_start_job(job_id, bs, creation_flags, target, replaces, > speed, granularity, buf_size, backing_mode, > on_source_error, on_target_error, unmap, NULL, NULL, > diff --git a/block/qapi.c b/block/qapi.c > index 110d05dc57..478c6f5e0d 100644 > --- a/block/qapi.c > +++ b/block/qapi.c > @@ -149,9 +149,13 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend > *blk, > return NULL; > } > > - if (bs0->drv && bs0->backing) { > + if (bs0->drv && bdrv_filtered_child(bs0)) { > + /* > + * Put any filtered child here (for backwards compatibility to > when > + * we put bs0->backing here, which might be any filtered child). > + */ > info->backing_file_depth++; > - bs0 = bs0->backing->bs; > + bs0 = bdrv_filtered_bs(bs0); > (*p_image_info)->has_backing_image = true; > p_image_info = &((*p_image_info)->backing_image); > } else { > @@ -160,9 +164,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk, > > /* Skip automatically inserted nodes that the user isn't aware of > for > * query-block (blk != NULL), but not for query-named-block-nodes */ > - while (blk && bs0->drv && bs0->implicit) { > - bs0 = backing_bs(bs0); > - assert(bs0); > + if (blk) { > + bs0 = bdrv_skip_implicit_filters(bs0); > } > } > > @@ -347,9 +350,9 @@ static void bdrv_query_info(BlockBackend *blk, BlockInfo > **p_info, > BlockDriverState *bs = blk_bs(blk); > char *qdev; > > - /* Skip automatically inserted nodes that the user isn't aware of */ > - while (bs && bs->drv && bs->implicit) { > - bs = backing_bs(bs); > + if (bs) { > + /* Skip automatically inserted nodes that the user isn't aware of */ > + bs = bdrv_skip_implicit_filters(bs); > } > > info->device = g_strdup(blk_name(blk)); > @@ -506,6 +509,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, > BlockBackend *blk) > static BlockStats *bdrv_query_bds_stats(BlockDriverState *bs, > bool blk_level) > { > + BlockDriverState *cow_bs; > BlockStats *s = NULL; > > s = g_malloc0(sizeof(*s)); > @@ -518,9 +522,8 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState > *bs, > /* Skip automatically inserted nodes that the user isn't aware of in > * a BlockBackend-level command. Stay at the exact node for a node-level > * command. */ > - while (blk_level && bs->drv && bs->implicit) { > - bs = backing_bs(bs); > - assert(bs); > + if (blk_level) { > + bs = bdrv_skip_implicit_filters(bs); > } > > if (bdrv_get_node_name(bs)[0]) { > @@ -535,9 +538,10 @@ static BlockStats *bdrv_query_bds_stats(BlockDriverState > *bs, > s->parent = bdrv_query_bds_stats(bs->file->bs, blk_level); > } > > - if (blk_level && bs->backing) { > + cow_bs = bdrv_filtered_cow_bs(bs); So, if we at blk_level and top bs is explicit filter, you don't want to show it's child? Hmm, at least, we can't show it if it is file-child, as qapi filed already called backing. So, if we can't show for file-child-based filters, it may be better to not show filter children here at all. > + if (blk_level && cow_bs) { > s->has_backing = true; > - s->backing = bdrv_query_bds_stats(bs->backing->bs, blk_level); > + s->backing = bdrv_query_bds_stats(cow_bs, blk_level); > } > > return s; > diff --git a/block/stream.c b/block/stream.c > index bfaebb861a..23d5c890e0 100644 > --- a/block/stream.c > +++ b/block/stream.c > @@ -65,6 +65,7 @@ static int stream_prepare(Job *job) > StreamBlockJob *s = container_of(job, StreamBlockJob, common.job); > BlockJob *bjob = &s->common; > BlockDriverState *bs = blk_bs(bjob->blk); > + BlockDriverState *unfiltered = bdrv_skip_rw_filters(bs); Aha, I'd call it filtered, but unfiltered is correct too, it's amazing > BlockDriverState *base = s->base; > Error *local_err = NULL; > int ret = 0; > @@ -72,7 +73,7 @@ static int stream_prepare(Job *job) > bdrv_unfreeze_backing_chain(bs, base); > s->chain_frozen = false; > > - if (bs->backing) { > + if (bdrv_filtered_cow_child(unfiltered)) { > const char *base_id = NULL, *base_fmt = NULL; > if (base) { > base_id = s->backing_file_str; > @@ -80,7 +81,7 @@ static int stream_prepare(Job *job) > base_fmt = base->drv->format_name; > } > } > - ret = bdrv_change_backing_file(bs, base_id, base_fmt); > + ret = bdrv_change_backing_file(unfiltered, base_id, base_fmt); > bdrv_set_backing_hd(bs, base, &local_err); > if (local_err) { > error_report_err(local_err); > @@ -121,7 +122,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp) > int64_t n = 0; /* bytes */ > void *buf; > > - if (!bs->backing) { > + if (!bdrv_filtered_child(bs)) { > goto out; > } this condition checks that there is nothing to stream, so, I thing it's better to check if (!bdrv_backing_chain_next(bs)) { goto out; } > > @@ -162,7 +163,7 @@ static int coroutine_fn stream_run(Job *job, Error **errp) > } else if (ret >= 0) { > /* Copy if allocated in the intermediate images. Limit to the > * known-unallocated area [offset, offset+n*BDRV_SECTOR_SIZE). > */ > - ret = bdrv_is_allocated_above(backing_bs(bs), base, > + ret = bdrv_is_allocated_above(bdrv_filtered_bs(bs), base, > offset, n, &n); Hmm, if we trying to support bs to be filter, and actually operate on first-non-filter, as you write in qapi spec, this is wrong. Again it should be bdrv_filtered_cow_bs(bdrv_skip_rw_filters(bs)).. Or, may be better, we at stream start should calculate reald top bs to operate on, and forget about all filters above.. i.e., do bs = bdrv_skip_rw_filters(bs) at the very beginning, when creating a job. > > /* Finish early if end of backing file has been reached */ > @@ -268,7 +269,9 @@ void stream_start(const char *job_id, BlockDriverState > *bs, > * disappear from the chain after this operation. The streaming job > reads > * every block only once, assuming that it doesn't change, so block > writes > * and resizes. */ > - for (iter = backing_bs(bs); iter && iter != base; iter = > backing_bs(iter)) { > + for (iter = bdrv_filtered_bs(bs); iter && iter != base; > + iter = bdrv_filtered_bs(iter)) > + { > block_job_add_bdrv(&s->common, "intermediate node", iter, 0, > BLK_PERM_CONSISTENT_READ | > BLK_PERM_WRITE_UNCHANGED, > &error_abort); > diff --git a/blockdev.c b/blockdev.c > index 4775a07d93..bb71b8368d 100644 > --- a/blockdev.c > +++ b/blockdev.c > @@ -1094,7 +1094,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict) > return; > } > > - bs = blk_bs(blk); > + bs = bdrv_skip_implicit_filters(blk_bs(blk)); > aio_context = bdrv_get_aio_context(bs); > aio_context_acquire(aio_context); > > @@ -1663,7 +1663,7 @@ static void external_snapshot_prepare(BlkActionState > *common, > goto out; > } > > - if (state->new_bs->backing != NULL) { > + if (bdrv_filtered_cow_child(state->new_bs)) { Do we allow to create filter snapshot? We should either restrict it explicitly or check bdrv_filtered_child here.. And we can't allow file-based-filters anyway.. [skipped up to the end of blockdev.c, I'm tired o_O] [..] > diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c > index d1bb863cb6..f99f753fba 100644 > --- a/migration/block-dirty-bitmap.c > +++ b/migration/block-dirty-bitmap.c > @@ -285,9 +285,7 @@ static int init_dirty_bitmap_migration(void) > const char *drive_name = bdrv_get_device_or_node_name(bs); > > /* skip automatically inserted nodes */ > - while (bs && bs->drv && bs->implicit) { > - bs = backing_bs(bs); > - } > + bs = bdrv_skip_implicit_filters(bs); this intersects with Jonh's patch [PATCH v2] migration/dirty-bitmaps: change bitmap enumeration method https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03340.html > > for (bitmap = bdrv_dirty_bitmap_next(bs, NULL); bitmap; > bitmap = bdrv_dirty_bitmap_next(bs, bitmap)) > diff --git a/nbd/server.c b/nbd/server.c > index e21bd501dc..e41ae89dbe 100644 > --- a/nbd/server.c > +++ b/nbd/server.c > @@ -1506,13 +1506,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, > uint64_t dev_offset, > if (bitmap) { > BdrvDirtyBitmap *bm = NULL; > > - while (true) { > + while (bs) { > bm = bdrv_find_dirty_bitmap(bs, bitmap); > - if (bm != NULL || bs->backing == NULL) { > + if (bm != NULL) { > break; > } > > - bs = bs->backing->bs; > + bs = bdrv_filtered_bs(bs); > } Check in documentation: "@bitmap: Also export the dirty bitmap reachable from @device". "Reachable" is not bad, but we may want to clarify that extended backing chain is meant > > if (bm == NULL) { > diff --git a/qemu-img.c b/qemu-img.c > index aa6f81f1ea..bcfbb743fc 100644 > --- a/qemu-img.c > +++ b/qemu-img.c > @@ -982,7 +982,7 @@ static int img_commit(int argc, char **argv) > if (!blk) { > return 1; > } > - bs = blk_bs(blk); > + bs = bdrv_skip_implicit_filters(blk_bs(blk)); hope there should not be any, but for consistancy we may skip them > > qemu_progress_init(progress, 1.f); > qemu_progress_print(0.f, 100); > @@ -999,7 +999,7 @@ static int img_commit(int argc, char **argv) > /* This is different from QMP, which by default uses the deepest > file in > * the backing chain (i.e., the very base); however, the traditional > * behavior of qemu-img commit is using the immediate backing file. > */ > - base_bs = backing_bs(bs); > + base_bs = bdrv_filtered_cow_bs(bs); > if (!base_bs) { > error_setg(&local_err, "Image does not have a backing file"); > goto done; > @@ -1616,19 +1616,18 @@ static int convert_iteration_sectors(ImgConvertState > *s, int64_t sector_num) > > if (s->sector_next_status <= sector_num) { > int64_t count = n * BDRV_SECTOR_SIZE; > + BlockDriverState *src_bs = blk_bs(s->src[src_cur]); > + BlockDriverState *base; > > if (s->target_has_backing) { > - > - ret = bdrv_block_status(blk_bs(s->src[src_cur]), > - (sector_num - src_cur_offset) * > - BDRV_SECTOR_SIZE, > - count, &count, NULL, NULL); > + base = bdrv_backing_chain_next(src_bs); > } else { > - ret = bdrv_block_status_above(blk_bs(s->src[src_cur]), NULL, > - (sector_num - src_cur_offset) * > - BDRV_SECTOR_SIZE, > - count, &count, NULL, NULL); > + base = NULL; > } > + ret = bdrv_block_status_above(src_bs, base, > + (sector_num - src_cur_offset) * > + BDRV_SECTOR_SIZE, > + count, &count, NULL, NULL); > if (ret < 0) { > error_report("error while reading block status of sector %" > PRId64 > ": %s", sector_num, strerror(-ret)); > @@ -2434,7 +2433,8 @@ static int img_convert(int argc, char **argv) > * s.target_backing_sectors has to be negative, which it will > * be automatically). The backing file length is used only > * for optimizations, so such a case is not fatal. */ > - s.target_backing_sectors = bdrv_nb_sectors(out_bs->backing->bs); > + s.target_backing_sectors = > + bdrv_nb_sectors(bdrv_filtered_cow_bs(out_bs)); can't out_bs be filter itself? > } else { > s.target_backing_sectors = -1; > } > @@ -2797,6 +2797,7 @@ static int get_block_status(BlockDriverState *bs, > int64_t offset, > > depth = 0; > for (;;) { > + bs = bdrv_skip_rw_filters(bs); Why? Filters may have own implementation of block_status, why to skip it? Or, thay cannot? Really, may be disallow filters have block_status, we may solve inefficient block_status_above we talked about before. > ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file); > if (ret < 0) { > return ret; > @@ -2805,7 +2806,7 @@ static int get_block_status(BlockDriverState *bs, > int64_t offset, > if (ret & (BDRV_BLOCK_ZERO|BDRV_BLOCK_DATA)) { > break; > } > - bs = backing_bs(bs); > + bs = bdrv_filtered_cow_bs(bs); > if (bs == NULL) { > ret = 0; > break; > @@ -2944,7 +2945,7 @@ static int img_map(int argc, char **argv) > if (!blk) { > return 1; > } > - bs = blk_bs(blk); > + bs = bdrv_skip_implicit_filters(blk_bs(blk)); > > if (output_format == OFORMAT_HUMAN) { > printf("%-16s%-16s%-16s%s\n", "Offset", "Length", "Mapped to", > "File"); > diff --git a/tests/qemu-iotests/184.out b/tests/qemu-iotests/184.out > index 3deb3cfb94..1d61f7e224 100644 > --- a/tests/qemu-iotests/184.out > +++ b/tests/qemu-iotests/184.out > @@ -27,6 +27,11 @@ Testing: > "iops_rd": 0, > "detect_zeroes": "off", > "image": { > + "backing-image": { > + "virtual-size": 1073741824, > + "filename": "null-co://", > + "format": "null-co" > + }, > "virtual-size": 1073741824, > "filename": "json:{\"throttle-group\": \"group0\", > \"driver\": \"throttle\", \"file\": {\"driver\": \"null-co\"}}", > "format": "throttle" > @@ -34,7 +39,7 @@ Testing: > "iops_wr": 0, > "ro": false, > "node-name": "throttle0", > - "backing_file_depth": 0, > + "backing_file_depth": 1, > "drv": "throttle", > "iops": 0, > "bps_wr": 0, > diff --git a/tests/qemu-iotests/204.out b/tests/qemu-iotests/204.out > index f3a10fbe90..684774d763 100644 > --- a/tests/qemu-iotests/204.out > +++ b/tests/qemu-iotests/204.out > @@ -59,5 +59,6 @@ Offset Length File > 0x900000 0x2400000 TEST_DIR/t.IMGFMT > 0x3c00000 0x1100000 TEST_DIR/t.IMGFMT > 0x6a00000 0x400000 TEST_DIR/t.IMGFMT > +0x6e00000 0x1200000 TEST_DIR/t.IMGFMT.base > No errors were found on the image. > *** done > -- Best regards, Vladimir