Previous patches mentioned how the blkdebug filter driver demonstrates a bug present in our NBD server (for example, commit b0245d64); the bug is also present with the raw format driver when probing occurs. Basically, if we specify a particular alignment > 1, but defer the actual block status to the underlying file, and the underlying file has a smaller alignment, then the use of BDRV_BLOCK_RAW to defer to the underlying file can end up with status split at an alignment unacceptable to our layer. Many callers don't care, but NBD does - it is a violation of the NBD protocol to send status or read results split on an unaligned boundary (in 737d3f5244, we taught our 4.0 client to be tolerant of such violations because the problem was even more pronounced with qemu 3.1 as server; but we still need to fix our server for the sake of other stricter clients).
This patch lays the groundwork - it adjusts bdrv_block_status to ensure that any time one layer defers to another via BDRV_BLOCK_RAW, the deferral is either truncated down to an aligned boundary, or multiple sub-aligned requests are coalesced into a single representative answer (using an implicit hole beyond EOF as needed). Iotest 241 exposes the effect (when format probing occurred, we don't want block status to subdivide the initial sector, and thus not any other sector either). Similarly, iotest 221 is a good candidate to amend to specifically track the effects; a change to a hole at EOF is not visible if an upper layer does not want alignment smaller than 512. However, this patch alone is not a complete fix - it is still possible to have an active layer with large alignment constraints backed by another layer with smaller constraints; so the next patch will complete the task. In particular, the next patch will introduce some mutual recursion (bdrv_co_common_block_status_above will call this new function rather than directly into bdrv_co_block_status), so some conditions added here (such as a NULL pointer for map or handling a length-0 request) are not reachable until that point. There is one interesting corner case: prior to this patch, ALLOCATED was never returned without either DATA or ZERO also set. But now, if we have two subregions where the first reports status 0 (not allocated), and the second reports ZERO|ALLOCATED but not DATA (preallocated, read sees zero but underlying file has indeterminate contents), then we can end up with a result of just ALLOCATED. However, looking at callers of bdrv_is_allocated does not find any problem with this new return value. What's more, this situation doesn't really happen until the next patch adds support for getting aligned status from backing files (as the use of aligned status in this patch tends to be limited to just the protocol child of a format driver, yet protocol drivers tend to report fully allocated, and only format drivers have unallocated clusters that defer to a backing file in the first place). Signed-off-by: Eric Blake <ebl...@redhat.com> --- block/io.c | 142 +++++++++++++++++++++++++++++++++++-- tests/qemu-iotests/221 | 13 ++++ tests/qemu-iotests/221.out | 9 +++ tests/qemu-iotests/241.out | 3 +- 4 files changed, 161 insertions(+), 6 deletions(-) diff --git a/block/io.c b/block/io.c index ca2dca30070e..4bca775c96b4 100644 --- a/block/io.c +++ b/block/io.c @@ -2325,6 +2325,132 @@ int bdrv_flush_all(void) return result; } +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, + bool want_zero, + int64_t offset, int64_t bytes, + int64_t *pnum, int64_t *map, + BlockDriverState **file); + +/* + * Returns an aligned allocation status of the specified disk region. + * + * Wrapper around bdrv_co_block_status() which requires the initial + * @offset and @count to be aligned to @align (must be power of 2), + * and guarantees the resulting @pnum will also be aligned; this may + * require piecing together multiple sub-aligned queries into an + * appropriate coalesced answer, as follows: + * + * - BDRV_BLOCK_DATA is set if the flag is set for at least one subregion + * - BDRV_BLOCK_ZERO is set only if the flag is set for all subregions + * - BDRV_BLOCK_OFFSET_VALID is set only if all subregions are contiguous + * from the same file (@map and @file are then from the first subregion) + * - BDRV_BLOCK_ALLOCATED is set if the flag is set for at least one subregion + * - BDRV_BLOCK_EOF is set if the last subregion queried set it (any + * remaining bytes to reach alignment are treated as an implicit hole) + * - BDRV_BLOCK_RAW is never set + */ +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs, + uint32_t align, + bool want_zero, + int64_t offset, + int64_t bytes, + int64_t *pnum, + int64_t *map, + BlockDriverState **file) +{ + int ret; + + assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align)); + ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, file); + if (ret < 0) { + return ret; + } + /* 0-length return only possible for 0-length query or beyond EOF */ + if (!*pnum) { + assert(!bytes || ret & BDRV_BLOCK_EOF); + return ret; + } + assert(!(ret & BDRV_BLOCK_RAW)); + + /* + * If initial query ended at EOF, round up to align: the post-EOF + * tail is an implicit hole, but our rules say we can treat that + * like the initial subregion. + */ + if (ret & BDRV_BLOCK_EOF) { + *pnum = QEMU_ALIGN_UP(*pnum, align); + assert(*pnum <= bytes); + return ret; + } + + /* + * If result is unaligned but not at EOF, it's easier to return + * the aligned subset and then compute the coalesced version over + * just align bytes. + */ + if (*pnum >= align) { + *pnum = QEMU_ALIGN_DOWN(*pnum, align); + return ret; + } + + /* + * If we got here, we have to merge status for multiple + * subregions. We can't detect if offsets are contiguous unless + * map and file are non-NULL. + */ + if (!map || !file) { + ret &= ~BDRV_BLOCK_OFFSET_VALID; + } + while (*pnum < align) { + int ret2; + int64_t pnum2; + int64_t map2; + BlockDriverState *file2; + + ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum, + align - *pnum, &pnum2, &map2, &file2); + if (ret2 < 0) { + return ret2; + } + assert(!(ret2 & BDRV_BLOCK_RAW)); + /* + * A 0-length answer here is a bug - we should not be querying + * beyond EOF. Our rules allow any further bytes in implicit + * hole past EOF to have same treatment as the subregion just + * before EOF. + */ + assert(pnum2 && pnum2 <= align - *pnum); + if (ret2 & BDRV_BLOCK_EOF) { + ret |= BDRV_BLOCK_EOF; + pnum2 = align - *pnum; + } + + /* Now merge the status */ + if (ret2 & BDRV_BLOCK_DATA) { + ret |= BDRV_BLOCK_DATA; + } + if (!(ret2 & BDRV_BLOCK_ZERO)) { + ret &= ~BDRV_BLOCK_ZERO; + } + if ((ret & BDRV_BLOCK_OFFSET_VALID) && + (!(ret2 & BDRV_BLOCK_OFFSET_VALID) || + *map + *pnum != map2 || *file != file2)) { + ret &= ~BDRV_BLOCK_OFFSET_VALID; + if (map) { + *map = 0; + } + if (file) { + *file = NULL; + } + } + if (ret2 & BDRV_BLOCK_ALLOCATED) { + ret |= BDRV_BLOCK_ALLOCATED; + } + *pnum += pnum2; + } + return ret; +} + /* * Returns the allocation status of the specified sectors. * Drivers not implementing the functionality are assumed to not support @@ -2438,7 +2564,17 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, */ assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset); - if (ret & BDRV_BLOCK_RECURSE) { + if (ret & BDRV_BLOCK_RAW) { + assert(local_file); + ret = bdrv_co_block_status_aligned(local_file, align, want_zero, + local_map, *pnum, pnum, &local_map, + &local_file); + if (ret < 0) { + goto out; + } + assert(!(ret & BDRV_BLOCK_RAW)); + ret |= BDRV_BLOCK_RAW; + } else if (ret & BDRV_BLOCK_RECURSE) { assert(ret & BDRV_BLOCK_DATA); assert(ret & BDRV_BLOCK_OFFSET_VALID); assert(!(ret & BDRV_BLOCK_ZERO)); @@ -2453,9 +2589,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs, } if (ret & BDRV_BLOCK_RAW) { - assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file); - ret = bdrv_co_block_status(local_file, want_zero, local_map, - *pnum, pnum, &local_map, &local_file); + ret &= ~BDRV_BLOCK_RAW; goto out; } diff --git a/tests/qemu-iotests/221 b/tests/qemu-iotests/221 index c463fd4b113e..6a15e0160b24 100755 --- a/tests/qemu-iotests/221 +++ b/tests/qemu-iotests/221 @@ -46,6 +46,12 @@ echo echo "=== Check mapping of unaligned raw image ===" echo +# Note that when we enable format probing by omitting -f, the raw +# layer forces 512-byte alignment and the bytes past EOF take on the +# same status as the rest of the sector; otherwise, we can see the +# implicit hole visible past EOF thanks to the block layer rounding +# sizes up. + _make_test_img 65537 # qemu-img create rounds size up # file-posix allocates the first block of any images when it is created; @@ -55,15 +61,22 @@ _make_test_img 65537 # qemu-img create rounds size up $QEMU_IO -c 'discard 0 65537' "$TEST_IMG" | _filter_qemu_io $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map +$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map +echo truncate --size=65537 "$TEST_IMG" # so we resize it and check again $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map +$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map +echo $QEMU_IO -c 'w 65536 1' "$TEST_IMG" | _filter_qemu_io # writing also rounds up $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map +$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map +echo truncate --size=65537 "$TEST_IMG" # so we resize it and check again $QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map +$QEMU_IMG map -f $IMGFMT --output=json "$TEST_IMG" | _filter_qemu_img_map # success, all done echo '*** done' diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out index 93846c7dabb6..d22b5e00d4f8 100644 --- a/tests/qemu-iotests/221.out +++ b/tests/qemu-iotests/221.out @@ -7,11 +7,20 @@ discard 65537/65537 bytes at offset 0 64.001 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) [{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] [{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] + +[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] +[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] + wrote 1/1 bytes at offset 65536 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET}, +{ "start": 65536, "length": 512, "depth": 0, "zero": false, "data": true, "offset": OFFSET}] +[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET}, { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET}, { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] + +[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET}, +{ "start": 65536, "length": 512, "depth": 0, "zero": false, "data": true, "offset": OFFSET}] [{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET}, { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET}, { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}] diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out index 67aaeed34f50..56d3796cf3ac 100644 --- a/tests/qemu-iotests/241.out +++ b/tests/qemu-iotests/241.out @@ -22,8 +22,7 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed size: 1024 min block: 1 -[{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true, "offset": OFFSET}, -{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true, "offset": OFFSET}] +[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true, "offset": OFFSET}] 1 KiB (0x400) bytes allocated at offset 0 bytes (0x0) === Encrypted qcow2 file backed by unaligned raw image === -- 2.30.1