Am 14.05.2014 13:41, schrieb Kevin Wolf: > Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben: >> this patch tries to optimize zero write requests >> by automatically using bdrv_write_zeroes if it is >> supported by the format. >> >> This significantly speeds up file system initialization and >> should speed zero write test used to test backend storage >> performance. >> >> I ran the following 2 tests on my internal SSD with a >> 50G QCOW2 container and on an attached iSCSI storage. >> >> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX >> >> QCOW2 [off] [on] [unmap] >> ----- >> runtime: 14secs 1.1secs 1.1secs >> filesize: 937M 18M 18M >> >> iSCSI [off] [on] [unmap] >> ---- >> runtime: 9.3s 0.9s 0.9s >> >> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct >> >> QCOW2 [off] [on] [unmap] >> ----- >> runtime: 246secs 18secs 18secs >> filesize: 51G 192K 192K >> throughput: 203M/s 2.3G/s 2.3G/s >> >> iSCSI* [off] [on] [unmap] >> ---- >> runtime: 8mins 45secs 33secs >> throughput: 106M/s 1.2G/s 1.6G/s >> allocated: 100% 100% 0% >> >> * The storage was connected via an 1Gbit interface. >> It seems to internally handle writing zeroes >> via WRITESAME16 very fast. >> >> Signed-off-by: Peter Lieven <p...@kamp.de> >> --- >> v3->v4: - use QAPI generated enum and lookup table [Kevin] >> - added more details about the options in the comments >> of the qapi-schema [Eric] >> - changed the type of detect_zeroes from str to >> BlockdevDetectZeroesOptions. I left the name >> as is because it is consistent with e.g. >> BlockdevDiscardOptions or BlockdevAioOptions [Eric] >> - changed the parse function in blockdev_init to >> be generic usable for other enum parameters >> >> v2->v3: - moved parameter parsing to blockdev_init >> - added per device detect_zeroes status to >> hmp (info block -v) and qmp (query-block) [Eric] >> - added support to enable detect-zeroes also >> for hot added devices [Eric] >> - added missing entry to qemu_common_drive_opts >> - fixed description of qemu_iovec_is_zero [Fam] >> >> v1->v2: - added tests to commit message (Markus) >> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric) >> - fixed typo (choosen->chosen) (Eric) >> - added second example to commit msg >> >> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline >> parameter >> - call zero detection only for format (bs->file != NULL) >> >> block.c | 11 ++++++++++ >> block/qapi.c | 1 + >> blockdev.c | 34 +++++++++++++++++++++++++++++ >> hmp.c | 5 +++++ >> include/block/block_int.h | 1 + >> include/qemu-common.h | 1 + >> qapi-schema.json | 52 >> ++++++++++++++++++++++++++++++++------------- >> qemu-options.hx | 6 ++++++ >> qmp-commands.hx | 3 +++ >> util/iov.c | 21 ++++++++++++++++++ >> 10 files changed, 120 insertions(+), 15 deletions(-) >> >> diff --git a/block.c b/block.c >> index b749d31..aea4c77 100644 >> --- a/block.c >> +++ b/block.c >> @@ -3244,6 +3244,17 @@ static int coroutine_fn >> bdrv_aligned_pwritev(BlockDriverState *bs, >> >> ret = notifier_with_return_list_notify(&bs->before_write_notifiers, >> req); >> >> + if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF && >> + !(flags & BDRV_REQ_ZERO_WRITE) && bs->file && >> + drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) { > Pretty long condition. :-) > > Looks like most is obviously needed, but I wonder what the bs->file part > is good for? That looks rather arbitrary.
What I wanted to achieve is that this condition is only true if we handle the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a zero write this should always end on the disk and should not be optimizable. > >> + flags |= BDRV_REQ_ZERO_WRITE; >> + /* if the device was not opened with discard=on the below flag >> + * is immediately cleared again in bdrv_co_do_write_zeroes */ > Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's > not a function that seems to be called from here. You are right. Question, do we want to support detect_zeroes = unmap if discard = ignore? If yes, I have to update the docs. Otherwise I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP. > >> + if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) { >> + flags |= BDRV_REQ_MAY_UNMAP; >> + } >> + } >> + >> if (ret < 0) { >> /* Do nothing, write notifier decided to fail this request */ >> } else if (flags & BDRV_REQ_ZERO_WRITE) { >> diff --git a/block/qapi.c b/block/qapi.c >> index af11445..75f44f1 100644 >> --- a/block/qapi.c >> +++ b/block/qapi.c >> @@ -50,6 +50,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState >> *bs) >> } >> >> info->backing_file_depth = bdrv_get_backing_file_depth(bs); >> + info->detect_zeroes = bs->detect_zeroes; >> >> if (bs->io_limits_enabled) { >> ThrottleConfig cfg; >> diff --git a/blockdev.c b/blockdev.c >> index 7810e9f..955bd49 100644 >> --- a/blockdev.c >> +++ b/blockdev.c >> @@ -288,6 +288,23 @@ static int parse_block_error_action(const char *buf, >> bool is_read, Error **errp) >> } >> } >> >> +static int parse_enum_option(const char *lookup[], const char *buf, int max, >> + int def, Error **errp) >> +{ >> + int i; >> + if (!buf) { >> + return def; >> + } >> + for (i = 0; i < max; i++) { >> + if (!strcmp(buf, lookup[i])) { >> + return i; >> + } >> + } >> + error_setg(errp, "invalid parameter value: %s", >> + buf); >> + return def; >> +} >> + >> static bool check_throttle_config(ThrottleConfig *cfg, Error **errp) >> { >> if (throttle_conflicting(cfg)) { > This hunk could have been a patch of its own (not a reason for a respin, > but if you need to respin to address one of the other comments, please > split the patch). Yes, will do. Erik also already suggested this. > >> @@ -324,6 +341,7 @@ static DriveInfo *blockdev_init(const char *file, QDict >> *bs_opts, >> QemuOpts *opts; >> const char *id; >> bool has_driver_specific_opts; >> + BlockdevDetectZeroesOptions detect_zeroes; >> BlockDriver *drv = NULL; >> >> /* Check common options by copying from bs_opts to opts, all other >> options >> @@ -452,6 +470,17 @@ static DriveInfo *blockdev_init(const char *file, QDict >> *bs_opts, >> } >> } >> >> + detect_zeroes = >> + parse_enum_option(BlockdevDetectZeroesOptions_lookup, >> + qemu_opt_get(opts, "detect-zeroes"), >> + BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX, >> + BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF, >> + &error); >> + if (error) { >> + error_propagate(errp, error); >> + goto early_err; >> + } >> + >> /* init */ >> dinfo = g_malloc0(sizeof(*dinfo)); >> dinfo->id = g_strdup(qemu_opts_id(opts)); >> @@ -462,6 +491,7 @@ static DriveInfo *blockdev_init(const char *file, QDict >> *bs_opts, >> } >> dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0; >> dinfo->bdrv->read_only = ro; >> + dinfo->bdrv->detect_zeroes = detect_zeroes; >> dinfo->refcount = 1; >> if (serial != NULL) { >> dinfo->serial = g_strdup(serial); >> @@ -2455,6 +2485,10 @@ QemuOptsList qemu_common_drive_opts = { >> .name = "copy-on-read", >> .type = QEMU_OPT_BOOL, >> .help = "copy read data from backing file into image file", >> + },{ >> + .name = "detect-zeroes", >> + .type = QEMU_OPT_STRING, >> + .help = "try to optimize zero writes", >> }, >> { /* end of list */ } >> }, >> diff --git a/hmp.c b/hmp.c >> index ca869ba..a541e7d 100644 >> --- a/hmp.c >> +++ b/hmp.c >> @@ -336,6 +336,11 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) >> info->value->inserted->backing_file_depth); >> } >> >> + if (verbose) { >> + monitor_printf(mon, " Detect zeroes: %s\n", >> + >> BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]); >> + } > Perhaps we should always display the information if zero detection is > enabled. This would be similar to things like I/O throttling limits, > which are only printed if throttling is enabled. Ok > >> if (info->value->inserted->bps >> || info->value->inserted->bps_rd >> || info->value->inserted->bps_wr >> diff --git a/include/block/block_int.h b/include/block/block_int.h >> index 9ffcb69..b8cc926 100644 >> --- a/include/block/block_int.h >> +++ b/include/block/block_int.h >> @@ -364,6 +364,7 @@ struct BlockDriverState { >> BlockJob *job; >> >> QDict *options; >> + BlockdevDetectZeroesOptions detect_zeroes; >> }; >> >> int get_tmp_filename(char *filename, int size); >> diff --git a/include/qemu-common.h b/include/qemu-common.h >> index a998e8d..8e3d6eb 100644 >> --- a/include/qemu-common.h >> +++ b/include/qemu-common.h >> @@ -330,6 +330,7 @@ void qemu_iovec_concat(QEMUIOVector *dst, >> void qemu_iovec_concat_iov(QEMUIOVector *dst, >> struct iovec *src_iov, unsigned int src_cnt, >> size_t soffset, size_t sbytes); >> +bool qemu_iovec_is_zero(QEMUIOVector *qiov); >> void qemu_iovec_destroy(QEMUIOVector *qiov); >> void qemu_iovec_reset(QEMUIOVector *qiov); >> size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset, >> diff --git a/qapi-schema.json b/qapi-schema.json >> index 0b00427..d1620b1 100644 >> --- a/qapi-schema.json >> +++ b/qapi-schema.json >> @@ -937,6 +937,8 @@ >> # @encryption_key_missing: true if the backing device is encrypted but an >> # valid encryption key is missing >> # >> +# @detect_zeroes: detect and optimize zero writes (Since 2.1) >> +# >> # @bps: total throughput limit in bytes per second is specified >> # >> # @bps_rd: read throughput limit in bytes per second is specified >> @@ -972,6 +974,7 @@ >> 'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str', >> '*backing_file': 'str', 'backing_file_depth': 'int', >> 'encrypted': 'bool', 'encryption_key_missing': 'bool', >> + 'detect_zeroes': 'BlockdevDetectZeroesOptions', >> 'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int', >> 'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int', >> 'image': 'ImageInfo', >> @@ -4250,6 +4253,22 @@ >> 'data': [ 'ignore', 'unmap' ] } >> >> ## >> +# @BlockdevDetectZeroesOptions >> +# >> +# Describes the operation mode for the automatic conversion of plain >> +# zero writes by the OS to driver specific optimized zero write commands. >> +# >> +# @off: Disabled (default) >> +# @on: Enabled >> +# @unmap: Enabled and even try to unmap blocks if possible. This requires >> +# also that @BlockdevDiscardOptions is set to unmap for this >> device. >> +# >> +# Since: 2.1 >> +## >> +{ 'enum': 'BlockdevDetectZeroesOptions', >> + 'data': [ 'off', 'on', 'unmap' ] } >> + >> +## >> # @BlockdevAioOptions >> # >> # Selects the AIO backend to handle I/O requests >> @@ -4301,20 +4320,22 @@ >> # Options that are available for all block devices, independent of the block >> # driver. >> # >> -# @driver: block driver name >> -# @id: #optional id by which the new block device can be referred >> to. >> -# This is a required option on the top level of blockdev-add, >> and >> -# currently not allowed on any other level. >> -# @node-name: #optional the name of a block driver state node (Since 2.0) >> -# @discard: #optional discard-related options (default: ignore) >> -# @cache: #optional cache-related options >> -# @aio: #optional AIO backend (default: threads) >> -# @rerror: #optional how to handle read errors on the device >> -# (default: report) >> -# @werror: #optional how to handle write errors on the device >> -# (default: enospc) >> -# @read-only: #optional whether the block device should be read-only >> -# (default: false) >> +# @driver: block driver name >> +# @id: #optional id by which the new block device can be >> referred to. >> +# This is a required option on the top level of >> blockdev-add, and >> +# currently not allowed on any other level. >> +# @node-name: #optional the name of a block driver state node (Since >> 2.0) >> +# @discard: #optional discard-related options (default: ignore) >> +# @cache: #optional cache-related options >> +# @aio: #optional AIO backend (default: threads) >> +# @rerror: #optional how to handle read errors on the device >> +# (default: report) >> +# @werror: #optional how to handle write errors on the device >> +# (default: enospc) >> +# @read-only: #optional whether the block device should be read-only >> +# (default: false) >> +# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1) >> +# (default: off) >> # >> # Since: 1.7 >> ## >> @@ -4327,7 +4348,8 @@ >> '*aio': 'BlockdevAioOptions', >> '*rerror': 'BlockdevOnError', >> '*werror': 'BlockdevOnError', >> - '*read-only': 'bool' } } >> + '*read-only': 'bool', >> + '*detect-zeroes': 'BlockdevDetectZeroesOptions' } } >> >> ## >> # @BlockdevOptionsFile >> diff --git a/qemu-options.hx b/qemu-options.hx >> index 781af14..5ee94ea 100644 >> --- a/qemu-options.hx >> +++ b/qemu-options.hx >> @@ -414,6 +414,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive, >> " [,serial=s][,addr=A][,rerror=ignore|stop|report]\n" >> " >> [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n" >> " [,readonly=on|off][,copy-on-read=on|off]\n" >> + " [,detect-zeroes=on|off|unmap]\n" >> " [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n" >> " [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n" >> " [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n" >> @@ -475,6 +476,11 @@ Open drive @option{file} as read-only. Guest write >> attempts will fail. >> @item copy-on-read=@var{copy-on-read} >> @var{copy-on-read} is "on" or "off" and enables whether to copy read backing >> file sectors into the image file. >> +@item detect-zeroes=@var{detect-zeroes} >> +@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic >> +conversion of plain zero writes by the OS to driver specific optimized >> +zero write commands. If "unmap" is chosen and @var{discard} is "on" >> +a zero write may even be converted to an UNMAP operation. >> @end table >> >> By default, the @option{cache=writeback} mode is used. It will report data >> diff --git a/qmp-commands.hx b/qmp-commands.hx >> index ed3ab92..a535955 100644 >> --- a/qmp-commands.hx >> +++ b/qmp-commands.hx >> @@ -2032,6 +2032,8 @@ Each json-object contain the following: >> - "iops_rd_max": read I/O operations max (json-int) >> - "iops_wr_max": write I/O operations max (json-int) >> - "iops_size": I/O size when limiting by iops (json-int) >> + - "detect_zeroes": detect and optimize zero writing (json-string) >> + - Possible values: "off", "on", "unmap" >> - "image": the detail of the image, it is a json-object containing >> the following: >> - "filename": image file name (json-string) >> @@ -2108,6 +2110,7 @@ Example: >> "iops_rd_max": 0, >> "iops_wr_max": 0, >> "iops_size": 0, >> + "detect_zeroes": "on", >> "image":{ >> "filename":"disks/test.qcow2", >> "format":"qcow2", >> diff --git a/util/iov.c b/util/iov.c >> index 6569b5a..f8c49a1 100644 >> --- a/util/iov.c >> +++ b/util/iov.c >> @@ -335,6 +335,27 @@ void qemu_iovec_concat(QEMUIOVector *dst, >> qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes); >> } >> >> +/* >> + * check if the contents of the iovecs are all zero >> + */ >> +bool qemu_iovec_is_zero(QEMUIOVector *qiov) >> +{ >> + int i; >> + for (i = 0; i < qiov->niov; i++) { >> + size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1); > I think it gets a bit more readable like this: > > size_t offs = QEMU_ALIGN_DOWN(qiov->iov[i].iov_len, 4 * sizeof(long)); Yes, indeed. > >> + uint8_t *ptr = qiov->iov[i].iov_base; >> + if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) { >> + return false; >> + } >> + for (; offs < qiov->iov[i].iov_len; offs++) { >> + if (ptr[offs]) { >> + return false; >> + } >> + } >> + } >> + return true; >> +} >> + > This function could be a separate patch as well. Ok, will split as well. Thank you for your and Erik for his comments, Peter