On 11/10/2014 06:45 AM, Max Reitz wrote: > Add a function qcow2_change_refcount_order() which allows changing the > refcount order of a qcow2 image. > > Signed-off-by: Max Reitz <mre...@redhat.com> > --- > block/qcow2-refcount.c | 424 > +++++++++++++++++++++++++++++++++++++++++++++++++ > block/qcow2.h | 4 + > 2 files changed, 428 insertions(+)
This is fairly big; you may want to get a second review from a maintainer rather than blindly trusting me. My review was not linear, but I left the email in linear order. Feel free to ask for clarification if my presentation is too hard to follow. > + > +/** > + * This "operation" for walk_over_reftable() allocates the refblock on disk > (if > + * it is not empty) and inserts its offset into the new reftable. The size of > + * this new reftable is increased as required. > + */ > +static int alloc_refblock(BlockDriverState *bs, uint64_t **reftable, > + uint64_t reftable_index, uint64_t *reftable_size, > + void *refblock, bool refblock_empty, Error **errp) > +{ > + BDRVQcowState *s = bs->opaque; > + int64_t offset; > + > + if (!refblock_empty && reftable_index >= *reftable_size) { > + uint64_t *new_reftable; > + uint64_t new_reftable_size; > + > + new_reftable_size = ROUND_UP(reftable_index + 1, > + s->cluster_size / sizeof(uint64_t)); > + if (new_reftable_size > QCOW_MAX_REFTABLE_SIZE / sizeof(uint64_t)) { > + error_setg(errp, > + "This operation would make the refcount table grow " > + "beyond the maximum size supported by QEMU, > aborting"); > + return -ENOTSUP; > + } > + > + new_reftable = g_try_realloc(*reftable, new_reftable_size * > + sizeof(uint64_t)); Safe from overflow based on checks a few lines earlier. Good. > + if (!new_reftable) { > + error_setg(errp, "Failed to increase reftable buffer size"); > + return -ENOMEM; > + } > + > + memset(new_reftable + *reftable_size, 0, > + (new_reftable_size - *reftable_size) * sizeof(uint64_t)); > + > + *reftable = new_reftable; > + *reftable_size = new_reftable_size; Just to check my math here: Suppose we have an image with 512-byte clusters, and are changing from 16-bit refcount (order 4) to 64-bit refcount. Also, suppose the existing image has exactly filled a one-cluster refcount table (that is, there are 64 refcount blocks, each with all 256 refcount entries full, for a total image size of exactly 8M). The original image occupies a header (1 cluster), L1 and L2 tables, and data; but 65 of the 16k clusters tied up in the image are dedicated to the refcount structures. Meanwhile, the new refcount table will have to point to 256 refcount blocks, each holding only 64 entries, which in turn implies that the refcount table now has to be at least 4 clusters long. But as this requires at least 260 clusters to represent, even if we were able to reuse the 65 clusters of the original table, we'd still be allocating at least 195 clusters; in reality, your code doesn't free any old clusters until after allocating the new, because it is easier to keep the old table live until the new table is populated. The process of allocating the new clusters means we actually end up with a new refcount table 5 clusters long, where not all 320 refblocks will be populated. But as long as we are keeping the old table up-to-date for the refblock allocations, it ALSO means that we caused a rollover of the old table from 1 cluster into 2, which itself consumes several clusters (the larger table must be contiguous, and we must also set up a refblock to describe the larger table, so we've added at least three clusters associated with the original table during the course of preparing the new table). Hmm - that means I found a bug in your implementation. See [3] below.
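Spelling out my arithmetic, in case I slipped up somewhere (all numbers assume the 512-byte-cluster, exactly-8M image above):

    old, order 4: 512 / 2 = 256 refcounts per refblock
                  16384 clusters / 256 = 64 refblocks
                  64 entries * 8 bytes = 512 bytes = 1 reftable cluster
    new, order 6: 512 / 8 = 64 refcounts per refblock
                  16384 clusters / 64 = 256 refblocks
                  256 entries * 8 bytes = 2048 bytes = 4 reftable clusters

    during conversion: 16384 + 260 (or more) clusters = 16644+, so
    ceil(16644 / 64) = 261 new refblocks, 261 * 8 = 2088 bytes = 5 new
    reftable clusters (room for 320 entries), while the old table now
    needs ceil(16644 / 256) = 66 refblocks, 66 * 8 = 528 bytes = 2
    clusters - the rollover described above.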
> + } > + > + if (refblock_empty) { > + if (reftable_index < *reftable_size) { > + (*reftable)[reftable_index] = 0; Necessary since you used g_try_realloc which leaves the new reftable uninitialized. Reasonable (rather than a memset) since the caller will be visiting every single refblock in the table anyway. > + } > + } else { > + offset = qcow2_alloc_clusters(bs, s->cluster_size); As mentioned above, this action will potentially change s->refcount_table_size of the original table, which in turn makes the caller execute its loops more often to cover the increased allocation. Does qcow2_alloc_clusters() guarantee that the just-allocated cluster is zero-initialized (and/or should we add a flag to the function to allow the caller to choose whether to force zero allocation instead of leaving it uninitialized)? See [4] below for why I ask. > + if (offset < 0) { > + error_setg_errno(errp, -offset, "Failed to allocate refblock"); > + return offset; > + } > + (*reftable)[reftable_index++] = offset; > + } > + > + return 0; > +} > + > +/** > + * This "operation" for walk_over_reftable() writes the refblock to disk at > the > + * offset specified by the new reftable's entry. It does not modify the new > + * reftable or change any refcounts. > + */ > +static int flush_refblock(BlockDriverState *bs, uint64_t **reftable, > + uint64_t reftable_index, uint64_t *reftable_size, > + void *refblock, bool refblock_empty, Error **errp) > +{ > + BDRVQcowState *s = bs->opaque; > + int64_t offset; > + int ret; > + > + if (refblock_empty) { > + if (reftable_index < *reftable_size) { > + assert((*reftable)[reftable_index] == 0); > + } > + } else { > + /* The first pass with alloc_refblock() made the reftable large > enough > + */ > + assert(reftable_index < *reftable_size); Okay, I see why you couldn't hoist this assert outside of the if - the caller may call this with refblock_empty for any refblocks at the tail of the final partial reftable cluster. > + offset = (*reftable)[reftable_index]; > + assert(offset != 0); > + > + ret = qcow2_pre_write_overlap_check(bs, 0, offset, s->cluster_size); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Overlap check failed"); > + return ret; > + } > + > + ret = bdrv_pwrite(bs->file, offset, refblock, s->cluster_size); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Failed to write refblock"); > + return ret; If we fail here, do we leak all clusters written so far? At least the image is still consistent. After reading further, I think I answered myself at point [5]. > + } > + } > + > + return 0; > +} > + > +/** > + * This function walks over the existing reftable and every referenced > refblock; > + * if @new_set_refcount is non-NULL, it is called for every refcount entry to > + * create an equal new entry in the passed @new_refblock. Once that > + * @new_refblock is completely filled, @operation will be called. > + * > + * @operation is expected to combine the @new_refblock and its entry in the > new > + * reftable (which is described by the parameters starting with "reftable"). > + * @refblock_empty is set if all entries in the refblock are zero. > + * > + * @status_cb and @cb_opaque are used for the amend operation's status > callback. > + * @index is the index of the walk_over_reftable() calls and @total is the > total > + * number of walk_over_reftable() calls per amend operation. Both are used > for > + * calculating the parameters for the status callback. Nice writeup; I was referring to it frequently during review.
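(Back to the zero-allocation question at [4] for a moment: this is the sort of thing I mean - an untested sketch, the wrapper name is my invention, and a real patch might instead thread a bool through qcow2_alloc_clusters() itself.)

    /* Untested sketch: allocate clusters and explicitly zero them, in
     * case qcow2_alloc_clusters() can hand back stale data. */
    static int64_t alloc_clusters_zeroed(BlockDriverState *bs, uint64_t size)
    {
        int64_t offset = qcow2_alloc_clusters(bs, size);
        if (offset >= 0) {
            void *zeroes = g_malloc0(size);
            int ret = bdrv_pwrite(bs->file, offset, zeroes, size);
            g_free(zeroes);
            if (ret < 0) {
                qcow2_free_clusters(bs, offset, size, QCOW2_DISCARD_NEVER);
                return ret;
            }
        }
        return offset;
    }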
> + */ > +static int walk_over_reftable(BlockDriverState *bs, uint64_t **new_reftable, > + uint64_t *new_reftable_index, > + uint64_t *new_reftable_size, > + void *new_refblock, int new_refblock_size, > + int new_refcount_bits, > + int (*operation)(BlockDriverState *bs, > + uint64_t **reftable, > + uint64_t reftable_index, > + uint64_t *reftable_size, > + void *refblock, > + bool refblock_empty, > + Error **errp), Worth a typedef? Maybe not; I managed. > + Qcow2SetRefcountFunc *new_set_refcount, > + BlockDriverAmendStatusCB *status_cb, > + void *cb_opaque, int index, int total, > + Error **errp) After several reads of the patch, I see that this walk function gets called twice - first with a NULL new_set_refcount (merely to figure out how big the new reftable should be, as well as allocating all necessary non-zero refcount blocks, but not committing the top-level reftable to any particular file location); the second walk then commits the new refcounts to disk (updating each non-zero entry in all the new refcount blocks to match their original counterparts, but no allocation required). Pretty slick to ensure that the new table is feasible before actually swapping over to it, while still allowing a fairly clean rollback on early failure. > +{ > + BDRVQcowState *s = bs->opaque; > + uint64_t reftable_index; > + bool new_refblock_empty = true; > + int refblock_index; > + int new_refblock_index = 0; > + int ret; > + > + for (reftable_index = 0; reftable_index < s->refcount_table_size; > + reftable_index++) Outer loop - for each entry of the top-level reference table, visit its child table and update the status callback. On the first walk, s->refcount_table_size may increase during calls to operation(). > + { > + uint64_t refblock_offset = s->refcount_table[reftable_index] > + & REFT_OFFSET_MASK; > + > + status_cb(bs, (uint64_t)index * s->refcount_table_size + > reftable_index, > + (uint64_t)total * s->refcount_table_size, cb_opaque); > + This never quite reaches 100%, and the caller also never reaches 100%. I think you want one more call to status_cb() at the end of the loop (at either site [1] or [2]) that passes an equal index and total to make it obvious that this (portion of the) long-running conversion is complete. Since s->refcount_table_size may grow during the loop, the callback does not necessarily have a constant total size; good thing we already documented that progress bars need not have a constant total. > + if (refblock_offset) { > + void *refblock; > + > + if (offset_into_cluster(s, refblock_offset)) { > + qcow2_signal_corruption(bs, true, -1, -1, "Refblock offset > %#" > + PRIx64 " unaligned (reftable index: > %#" > + PRIx64 ")", refblock_offset, > + reftable_index); > + error_setg(errp, > + "Image is corrupt (unaligned refblock offset)"); > + return -EIO; > + } > + > + ret = qcow2_cache_get(bs, s->refcount_block_cache, > refblock_offset, > + &refblock); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Failed to retrieve refblock"); > + return ret; > + } > + > + for (refblock_index = 0; refblock_index < s->refcount_block_size; > + refblock_index++) > + { If a child table (refcount block) exists, visit each refcount entry within the table (at least one refcount in that visit should be non-zero, otherwise we could garbage-collect the refblock and put a 0 entry in the outer loop).
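(And back to the typedef question above: if you do decide one is worthwhile, something like this - name of my own invention, in the style of Qcow2SetRefcountFunc - would shrink that prototype nicely:)

    typedef int RefblockFinishOp(BlockDriverState *bs, uint64_t **reftable,
                                 uint64_t reftable_index,
                                 uint64_t *reftable_size, void *refblock,
                                 bool refblock_empty, Error **errp);

so that the parameter list would merely say 'RefblockFinishOp *operation'.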
> + uint64_t refcount; > + > + if (new_refblock_index >= new_refblock_size) { > + /* new_refblock is now complete */ > + ret = operation(bs, new_reftable, *new_reftable_index, > + new_reftable_size, new_refblock, > + new_refblock_empty, errp); The new refcount table will either be filled faster than the original (when going from small to large refcount - calling operation() multiple times per inner loop) or will be filled slower than the original (when going from large to small; operation() will only be called after several outer loops). > + if (ret < 0) { > + qcow2_cache_put(bs, s->refcount_block_cache, > &refblock); > + return ret; > + } > + > + (*new_reftable_index)++; > + new_refblock_index = 0; > + new_refblock_empty = true; > + } > + > + refcount = s->get_refcount(refblock, refblock_index); > + if (new_refcount_bits < 64 && refcount >> new_refcount_bits) > { Technically, this range check is dead code on the second walk, since the first walk already validated things, so you could push all of this code... > + uint64_t offset; > + > + qcow2_cache_put(bs, s->refcount_block_cache, &refblock); > + > + offset = ((reftable_index << s->refcount_block_bits) > + + refblock_index) << s->cluster_bits; > + > + error_setg(errp, "Cannot decrease refcount entry width > to " > + "%i bits: Cluster at offset %#" PRIx64 " has > a " > + "refcount of %" PRIu64, new_refcount_bits, > + offset, refcount); > + return -EINVAL; > + } > + > + if (new_set_refcount) { > + new_set_refcount(new_refblock, new_refblock_index++, > refcount); > + } else { ...here, in the branch only run on the first walk. > + new_refblock_index++; > + } > + new_refblock_empty = new_refblock_empty && refcount == 0; Worth condensing to 'new_refblock_empty &= !refcount'? Maybe not. > + } > + > + ret = qcow2_cache_put(bs, s->refcount_block_cache, &refblock); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Failed to put refblock back > into " > + "the cache"); > + return ret; > + } > + } else { > + /* No refblock means every refcount is 0 */ > + for (refblock_index = 0; refblock_index < s->refcount_block_size; > + refblock_index++) Again, visiting each (implied) entry for the given refcount block of the outer loop. When enlarging the width, each of the new blocks will also be all zero; but when shrinking the width, even though all entries on this pass are zero, we may be combining this pass with another outer loop with non-zero data for a non-zero block in the resulting new table. > + { > + if (new_refblock_index >= new_refblock_size) { > + /* new_refblock is now complete */ > + ret = operation(bs, new_reftable, *new_reftable_index, > + new_reftable_size, new_refblock, > + new_refblock_empty, errp); > + if (ret < 0) { > + return ret; > + } > + > + (*new_reftable_index)++; > + new_refblock_index = 0; > + new_refblock_empty = true; > + } > + > + if (new_set_refcount) { > + new_set_refcount(new_refblock, new_refblock_index++, 0); Would it be worth guaranteeing that every new refblock is 0-initialized when allocated, so that you can skip setting a refcount to 0? This question depends on the answer about block allocation asked at [4] above.
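Concretely, the code motion I'm suggesting for that range check would look roughly like this (untested, error-path details elided):

    refcount = s->get_refcount(refblock, refblock_index);
    if (new_set_refcount) {
        /* second walk: the first walk already validated the width */
        new_set_refcount(new_refblock, new_refblock_index++, refcount);
    } else {
        /* first walk: reject refcounts that do not fit the new width */
        if (new_refcount_bits < 64 && refcount >> new_refcount_bits) {
            /* ... qcow2_cache_put() and error_setg() as in your patch ... */
            return -EINVAL;
        }
        new_refblock_index++;
    }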
> + } else { > + new_refblock_index++; > + } > + } > + } > + } > + > + if (new_refblock_index > 0) { > + /* Complete the potentially existing partially filled final refblock > */ > + if (new_set_refcount) { > + for (; new_refblock_index < new_refblock_size; > + new_refblock_index++) > + { > + new_set_refcount(new_refblock, new_refblock_index, 0); Again, if you 0-initialize refblocks when allocated, you could skip this (another instance of [4] above). > + } > + } > + > + ret = operation(bs, new_reftable, *new_reftable_index, > + new_reftable_size, new_refblock, new_refblock_empty, > + errp); > + if (ret < 0) { > + return ret; > + } > + > + (*new_reftable_index)++; > + } site [1] mentioned above, as a good place to make a final status callback at 100%. But if you do it here, it means that we call the status callback twice with the same values (the 100% value of the first loop is the 0% value of the second loop) - not the end of the world, but may impact any testsuite that tracks progress reports. > + > + return 0; > +} > + > +int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order, > + BlockDriverAmendStatusCB *status_cb, > + void *cb_opaque, Error **errp) > +{ > + BDRVQcowState *s = bs->opaque; > + Qcow2GetRefcountFunc *new_get_refcount; > + Qcow2SetRefcountFunc *new_set_refcount; > + void *new_refblock = qemu_blockalign(bs->file, s->cluster_size); > + uint64_t *new_reftable = NULL, new_reftable_size = 0; > + uint64_t *old_reftable, old_reftable_size, old_reftable_offset; > + uint64_t new_reftable_index = 0; > + uint64_t i; > + int64_t new_reftable_offset; > + int new_refblock_size, new_refcount_bits = 1 << refcount_order; > + int old_refcount_order; > + int ret; > + > + assert(s->qcow_version >= 3); > + assert(refcount_order >= 0 && refcount_order <= 6); > + > + /* see qcow2_open() */ > + new_refblock_size = 1 << (s->cluster_bits - (refcount_order - 3)); Safe (cluster_bits is always at least 9, and at most 21 in our current implementation, so we are shifting anywhere from 6 to 24 positions). > + > + get_refcount_functions(refcount_order, > + &new_get_refcount, &new_set_refcount); > + > + > + /* First, allocate the structures so they are present in the refcount > + * structures */ > + ret = walk_over_reftable(bs, &new_reftable, &new_reftable_index, > + &new_reftable_size, NULL, new_refblock_size, > + new_refcount_bits, &alloc_refblock, NULL, > + status_cb, cb_opaque, 0, 2, errp); > + if (ret < 0) { > + goto done; > + } > + > + /* The new_reftable_size is now valid and will not be changed anymore, > + * so we can now allocate the reftable */ > + new_reftable_offset = qcow2_alloc_clusters(bs, new_reftable_size * > + sizeof(uint64_t)); And here is your bug, the one I hinted at with the mention of [3] above. This allocation can potentially cause the existing reftable to overflow into one more cluster. Remember my thought experiment above, how an 8 megabyte image rolls from 1 to 2 clusters during the course of allocating refblocks for the new table? But what if the original image wasn't completely full, yet things were perfectly sized with just enough free clusters? Then all of the refblock allocations done during the first walk will still fit, and it is only this final allocation of the new reftable that causes the rollover, at which point we've failed to account for the new refblock size. That is, I think I could craft an image that would trigger either an assertion failure or an out-of-bounds array access during the second walk.
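By the way, the site [1] status callback I'm asking for could be as simple as the following, placed after that final operation() call (at that point reftable_index has reached s->refcount_table_size, so this reports exactly (index + 1) out of total):

    status_cb(bs, (uint64_t)(index + 1) * s->refcount_table_size,
              (uint64_t)total * s->refcount_table_size, cb_opaque);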
> + if (new_reftable_offset < 0) { > + error_setg_errno(errp, -new_reftable_offset, > + "Failed to allocate the new reftable"); > + ret = new_reftable_offset; > + goto done; If we fail here, do we leak allocations of the refblocks? I guess not, based on another forward reference to point [5]. > + } > + > + new_reftable_index = 0; > + > + /* Second, write the new refblocks */ > + ret = walk_over_reftable(bs, &new_reftable, &new_reftable_index, > + &new_reftable_size, new_refblock, > + new_refblock_size, new_refcount_bits, > + &flush_refblock, new_set_refcount, > + status_cb, cb_opaque, 1, 2, errp); > + if (ret < 0) { > + goto done; > + } If we fail here, it looks like we DO leak the clusters allocated for the new reftable (again, point [5]). > + > + > + /* Write the new reftable */ > + ret = qcow2_pre_write_overlap_check(bs, 0, new_reftable_offset, > + new_reftable_size * > sizeof(uint64_t)); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Overlap check failed"); > + goto done; > + } > + > + for (i = 0; i < new_reftable_size; i++) { > + cpu_to_be64s(&new_reftable[i]); > + } > + > + ret = bdrv_pwrite(bs->file, new_reftable_offset, new_reftable, > + new_reftable_size * sizeof(uint64_t)); > + > + for (i = 0; i < new_reftable_size; i++) { > + be64_to_cpus(&new_reftable[i]); > + } > + > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Failed to write the new reftable"); > + goto done; > + } Looks like you correctly maintain the in-memory copy in preferred cpu byte order, while writing to disk in big-endian order. > + > + > + /* Empty the refcount cache */ > + ret = qcow2_cache_flush(bs, s->refcount_block_cache); > + if (ret < 0) { > + error_setg_errno(errp, -ret, "Failed to flush the refblock cache"); > + goto done; > + } > + > + /* Update the image header to point to the new reftable; this only > updates > + * the fields which are relevant to qcow2_update_header(); other fields > + * such as s->refcount_table or s->refcount_bits stay stale for now > + * (because we have to restore everything if qcow2_update_header() > fails) */ > + old_refcount_order = s->refcount_order; > + old_reftable_size = s->refcount_table_size; > + old_reftable_offset = s->refcount_table_offset; > + > + s->refcount_order = refcount_order; > + s->refcount_table_size = new_reftable_size; > + s->refcount_table_offset = new_reftable_offset; > + > + ret = qcow2_update_header(bs); > + if (ret < 0) { > + s->refcount_order = old_refcount_order; > + s->refcount_table_size = old_reftable_size; > + s->refcount_table_offset = old_reftable_offset; > + error_setg_errno(errp, -ret, "Failed to update the qcow2 header"); > + goto done; > + } Failures up to here still leak the new reftable allocation (point [5]). > + > + /* Now update the rest of the in-memory information */ > + old_reftable = s->refcount_table; > + s->refcount_table = new_reftable; > + > + /* For cleaning up all old refblocks */ > + new_reftable = old_reftable; > + new_reftable_size = old_reftable_size; > + > + s->refcount_bits = 1 << refcount_order; > + if (refcount_order < 6) { > + s->refcount_max = (UINT64_C(1) << s->refcount_bits) - 1; > + } else { > + s->refcount_max = INT64_MAX; > + } Is it worth factoring this computation into a common helper, since it appeared in an earlier patch as well?
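Something like this is what I had in mind (the helper name is my own):

    /* Possible shared helper for this patch and the earlier one: */
    static uint64_t qcow2_refcount_max(int refcount_order)
    {
        int refcount_bits = 1 << refcount_order;
        return refcount_bits == 64 ? INT64_MAX
                                   : (UINT64_C(1) << refcount_bits) - 1;
    }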
> + > + s->refcount_block_bits = s->cluster_bits - (refcount_order - 3); > + s->refcount_block_size = 1 << s->refcount_block_bits; > + > + s->get_refcount = new_get_refcount; > + s->set_refcount = new_set_refcount; > + > + /* And free the old reftable (the old refblocks are freed below the > "done" > + * label) */ > + qcow2_free_clusters(bs, old_reftable_offset, > + old_reftable_size * sizeof(uint64_t), > + QCOW2_DISCARD_NEVER); site [2] mentioned above, as a possible point where you might want to ensure the callback is called with equal progress and total values so that the caller knows the job is done. Except this site doesn't have quite as much information as site [1] about what total size all the other status callbacks were using. > + > +done: > + if (new_reftable) { > + /* On success, new_reftable actually points to the old reftable (and > + * new_reftable_size is the old reftable's size); but that is just > + * fine */ > + for (i = 0; i < new_reftable_size; i++) { > + uint64_t offset = new_reftable[i] & REFT_OFFSET_MASK; > + if (offset) { > + qcow2_free_clusters(bs, offset, s->cluster_size, > + QCOW2_DISCARD_NEVER); > + } > + } > + g_free(new_reftable); So here is point [5] - if we failed early, this tries to clean up all allocated refblocks associated with the new table. It does NOT clean up any refblocks allocated due to resizing the old table to be slightly larger, but that should be fine (not a leak, so much as an image that is now a couple clusters larger than the minimum required size). However, while you clean up the clusters associated with refblocks (layer 2), the cleanup of old clusters associated with the reftable (layer 1) happened before the done: label on success, but that means that on failure, you are NOT cleaning up the clusters associated with the new reftable. I think initializing new_reftable_offset to -1 up front, and freeing those clusters here on the failure path when the offset is valid, would plug that leak. > + } > + > + qemu_vfree(new_refblock); > + return ret; > +} > diff --git a/block/qcow2.h b/block/qcow2.h > index fe12c54..5b96519 100644 > --- a/block/qcow2.h > +++ b/block/qcow2.h > @@ -526,6 +526,10 @@ int qcow2_check_metadata_overlap(BlockDriverState *bs, > int ign, int64_t offset, > int qcow2_pre_write_overlap_check(BlockDriverState *bs, int ign, int64_t > offset, > int64_t size); > > +int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order, > + BlockDriverAmendStatusCB *status_cb, > + void *cb_opaque, Error **errp); > + > /* qcow2-cluster.c functions */ > int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size, > bool exact_size); > Interesting patch. Hope my review helps you prepare a better v2. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org