Re: [Qemu-devel] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment

Kevin Wolf Wed, 03 Apr 2019 06:04:53 -0700

Am 03.04.2019 um 05:05 hat Eric Blake geschrieben:
> Previous patches mentioned how the blkdebug filter driver demonstrates
> a bug present in our NBD server; the bug is also present with the raw
> format driver when probing occurs. Basically, if we specify a
> particular alignment > 1, but defer the actual block status to the
> underlying file, and the underlying file has a smaller alignment, then
> the use of BDRV_BLOCK_RAW to defer to the underlying file can end up
> with status split at an alignment unacceptable to our layer.  Many
> callers don't care, but NBD does - it is a violation of the NBD
> protocol to send status or read results split on an unaligned boundary
> (we've taught our client to be tolerant of such violations because of
> qemu 3.1; but we still need to fix our server for the sake of other
> stricter clients).
> 
> This patch lays the groundwork - it adjusts bdrv_block_status to
> ensure that any time one layer defers to another via BDRV_BLOCK_RAW,
> the deferral is either truncated down to an aligned boundary, or
> multiple sub-aligned requests are coalesced into a single
> representative answer (using an implicit hole beyond EOF as
> needed). Iotest 241 exposes the effect (when format probing occurred,
> we don't want block status to subdivide the initial sector, and thus
> not any other sector either). Similarly, iotest 221 is a good
> candidate to amend to specifically track the effects; a change to a
> hole at EOF is not visible if an upper layer does not want alignment
> smaller than 512. However, this patch alone is not a complete fix - it
> is still possible to have an active layer with large alignment
> constraints backed by another layer with smaller constraints; so the
> next patch will complete the task.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> ---
>  block/io.c                 | 108 +++++++++++++++++++++++++++++++++++--
>  tests/qemu-iotests/221     |  10 ++++
>  tests/qemu-iotests/221.out |   6 +++
>  tests/qemu-iotests/241.out |   3 +-
>  4 files changed, 122 insertions(+), 5 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index dfc153b8d82..936877d3745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2021,6 +2021,97 @@ int coroutine_fn 
> bdrv_co_block_status_from_backing(BlockDriverState *bs,
>      return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
>  }
> 
> +static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
> +                                             bool want_zero,
> +                                             int64_t offset, int64_t bytes,
> +                                             int64_t *pnum, int64_t *map,
> +                                             BlockDriverState **file);
> +
> +/*
> + * Returns an aligned allocation status of the specified sectors.
> + * Wrapper around bdrv_co_block_status() which requires the initial
> + * offset and count to be aligned, and guarantees the result will also
> + * be aligned (even if it requires piecing together multiple
> + * sub-aligned queries into an appropriate coalesced answer).
> + */


I think this comment should specify the result of the operation when the
aligned region consists of multiple subregions with different status.
Probably something like this:

- BDRV_BLOCK_DATA is set if the flag is set for at least one subregion
- BDRV_BLOCK_ZERO is set if the flag is set for all subregions
- BDRV_BLOCK_OFFSET_VALID is set if the flag is set for all subregions,
  the provided offsets are contiguous and file is the same for all
  subregions.
- BDRV_BLOCK_ALLOCATED is never set here (the caller sets it)
- BDRV_BLOCK_EOF is set if the last subregion sets it; assert that it's
  not set for any other subregion
- BDRV_BLOCK_RAW is never set

- *map is only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
  the offset of the first subregion then.

- *file is also only set if BDRV_BLOCK_OFFSET_VALID is set. It contains
  the *file of the subregions, which must be the same for all of them
  (otherwise, we wouldn't have set BDRV_BLOCK_OFFSET_VALID).

- *pnum: The sum of all subregions

> +static int coroutine_fn bdrv_co_block_status_aligned(BlockDriverState *bs,
> +                                                     uint32_t align,
> +                                                     bool want_zero,
> +                                                     int64_t offset,
> +                                                     int64_t bytes,
> +                                                     int64_t *pnum,
> +                                                     int64_t *map,
> +                                                     BlockDriverState **file)
> +{
> +    int ret;
> +
> +    assert(is_power_of_2(align) && QEMU_IS_ALIGNED(offset | bytes, align));
> +    ret = bdrv_co_block_status(bs, want_zero, offset, bytes, pnum, map, 
> file);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    if (!*pnum) {
> +        assert(!bytes || ret & BDRV_BLOCK_EOF);
> +        return ret;
> +    }
> +    if (*pnum >= align) {
> +        if (!QEMU_IS_ALIGNED(*pnum, align)) {
> +            ret &= ~BDRV_BLOCK_EOF;
> +            *pnum = QEMU_ALIGN_DOWN(*pnum, align);
> +        }
> +        return ret;
> +    }

So we decide to just shorten the function if we can return at least some
*pnum > 0. This means that is most cases, this function will have to be
called twice, once for the aligned part, and again for the misaligned
rest.

We do save a little if the caller isn't actually interested in mapping
the whole image, but just in a specific range before the misaligned
part.

So it depends on the use case whether this is an optimisation or a
pessimisation.

> +    /* Merge in status for the rest of align */
> +    while (*pnum < align) {

Okay, in any case, I can see it's a simplification. :-)

> +        int ret2;
> +        int64_t pnum2;
> +        int64_t map2;
> +        BlockDriverState *file2;
> +
> +        ret2 = bdrv_co_block_status(bs, want_zero, offset + *pnum,
> +                                    align - *pnum, &pnum2,
> +                                    map ? &map2 : NULL, file ? &file2 : 
> NULL);
> +        if (ret2 < 0) {
> +            return ret2;
> +        }
> +        if (ret2 & BDRV_BLOCK_EOF) {
> +            ret |= BDRV_BLOCK_EOF;
> +            if (!pnum2) {
> +                pnum2 = align - *pnum;
> +                ret2 |= BDRV_BLOCK_ZERO;
> +            }
> +        } else {
> +            assert(pnum2);
> +        }
> +        if (ret2 & BDRV_BLOCK_DATA) {
> +            ret |= BDRV_BLOCK_DATA;
> +        }
> +        if (!(ret2 & BDRV_BLOCK_ZERO)) {
> +            ret &= ~BDRV_BLOCK_ZERO;
> +        }
> +        if ((ret & BDRV_BLOCK_OFFSET_VALID) &&
> +            (!(ret2 & BDRV_BLOCK_OFFSET_VALID) ||
> +             (map && *map + *pnum != map2) || (file && *file != file2))) {
> +            ret &= ~BDRV_BLOCK_OFFSET_VALID;
> +            if (map) {
> +                *map = 0;
> +            }
> +            if (file) {
> +                *file = NULL;
> +            }
> +        }
> +        if (ret2 & BDRV_BLOCK_ALLOCATED) {
> +            ret |= BDRV_BLOCK_ALLOCATED;
> +        }

Hm, so if we have a region that consists of two subregion, one is
unallocated and the other one is a zero cluster. The result will be
DATA == false, ZERO == false, ALLOCATED == true. This looks a bit odd.

Did you check this case and come to the conclusion that it's okay? If
so, I think a comment would be good.

> +        if (ret2 & BDRV_BLOCK_RAW) {
> +            assert(ret & BDRV_BLOCK_RAW);
> +            assert(ret & BDRV_BLOCK_OFFSET_VALID);
> +        }

Doesn't RAW mean that we need to recurse rather than returning the flag?
Or actually, bdrv_co_block_status() shouldn't ever return the flag, so
can't we assert that it's clear?

> +        *pnum += pnum2;
> +    }
> +    return ret;
> +}

Kevin

Re: [Qemu-devel] [PATCH 5/7] block: Fix BDRV_BLOCK_RAW status to honor alignment

Reply via email to