On Thu, Apr 17, 2025 at 01:39:09PM -0500, Eric Blake wrote:
> There are some optimizations that require knowing if an image starts
> out as reading all zeroes, such as making blockdev-mirror faster by
> skipping the copying of source zeroes to the destination.  The
> existing bdrv_co_is_zero_fast() is a good building block for answering
> this question, but it tends to give an answer of 0 for a file we just
> created via QMP 'blockdev-create' or similar (such as 'qemu-img create
> -f raw').  Why?  Because file-posix.c insists on allocating a tiny
> header to any file rather than leaving it 100% sparse, due to some
> filesystems that are unable to answer alignment probes on a hole.  But
> teaching file-posix.c to read the tiny header doesn't scale - the
> problem of a small header is also visible when libvirt sets up an NBD
> client to a just-created file on a migration destination host.
> 
> So, we need a wrapper function that handles a bit more complexity in a
> common manner for all block devices - when the BDS is mostly a hole,
> but has a small non-hole header, it is still worth the time to read
> that header and check if it reads as all zeroes before giving up and
> returning a pessimistic answer.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> ---
>  include/block/block-io.h |  2 ++
>  block/io.c               | 58 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 60 insertions(+)
> 
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index b49e0537dd4..b99cc98d265 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -161,6 +161,8 @@ bdrv_is_allocated_above(BlockDriverState *bs, 
> BlockDriverState *base,
> 
>  int coroutine_fn GRAPH_RDLOCK
>  bdrv_co_is_zero_fast(BlockDriverState *bs, int64_t offset, int64_t bytes);
> +int coroutine_fn GRAPH_RDLOCK
> +bdrv_co_is_all_zeroes(BlockDriverState *bs);
> 
>  int GRAPH_RDLOCK
>  bdrv_apply_auto_read_only(BlockDriverState *bs, const char *errmsg,
> diff --git a/block/io.c b/block/io.c
> index 6ef78070915..dc1341e4029 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2778,6 +2778,64 @@ int coroutine_fn bdrv_co_is_zero_fast(BlockDriverState 
> *bs, int64_t offset,
>      return 1;
>  }
> 
> +/*
> + * Check @bs (and its backing chain) to see if the entire image is known
> + * to read as zeroes.
> + * Return 1 if that is the case, 0 otherwise and -errno on error.
> + * This test is meant to be fast rather than accurate so returning 0
> + * does not guarantee non-zero data; however, it can report 1 in more

False negatives are possible, let's also document that false positives
are not possible:

  This test is mean to be fast rather than accurate so returning 0 does
  not guarantee non-zero data, but returning 1 does guarantee all zero
  data; ...

> + * cases than bdrv_co_is_zero_fast.
> + */
> +int coroutine_fn bdrv_co_is_all_zeroes(BlockDriverState *bs)
> +{
> +    int ret;
> +    int64_t pnum, bytes;
> +    char *buf;
> +    QEMUIOVector local_qiov;
> +    IO_CODE();
> +
> +    bytes = bdrv_co_getlength(bs);
> +    if (bytes < 0) {
> +        return bytes;
> +    }
> +
> +    /* First probe - see if the entire image reads as zero */
> +    ret = bdrv_co_common_block_status_above(bs, NULL, false, BDRV_BSTAT_ZERO,
> +                                            0, bytes, &pnum, NULL, NULL,
> +                                            NULL);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    if (ret & BDRV_BLOCK_ZERO) {
> +        return bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
> +    }
> +
> +    /*
> +     * Because of the way 'blockdev-create' works, raw files tend to
> +     * be created with a non-sparse region at the front to make
> +     * alignment probing easier.  If the block starts with only a
> +     * small allocated region, it is still worth the effort to see if
> +     * the rest of the image is still sparse, coupled with manually
> +     * reading the first region to see if it reads zero after all.
> +     */
> +    if (pnum > qemu_real_host_page_size()) {

Probably not worth it for the corner case, but replacing
qemu_real_host_page_size() with 128 KiB would allow this to work on
images created on different CPU architectures (4 KiB vs 64 KiB page
sizes).

> +        return 0;
> +    }
> +    ret = bdrv_co_is_zero_fast(bs, pnum, bytes - pnum);
> +    if (ret <= 0) {
> +        return ret;
> +    }
> +    /* Only the head of the image is unknown, and it's small.  Read it.  */
> +    buf = qemu_blockalign(bs, pnum);
> +    qemu_iovec_init_buf(&local_qiov, buf, pnum);
> +    ret = bdrv_driver_preadv(bs, 0, pnum, &local_qiov, 0, 0);
> +    if (ret >= 0) {
> +        ret = buffer_is_zero(buf, pnum);
> +    }
> +    qemu_vfree(buf);
> +    return ret;
> +}
> +
>  int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs, int64_t offset,
>                                        int64_t bytes, int64_t *pnum)
>  {
> -- 
> 2.49.0
> 
> 

Attachment: signature.asc
Description: PGP signature

Reply via email to