On Mon, Jun 20, 2022 at 11:36:11AM +0800, Sam Li wrote: Hi Sam, Is this version 2 of "[RFC v1] Add support for zoned device"? Please keep the email subject line the same (except for "v2", "v3", etc) so that it's clear which patch series this new version replaces.
> Fix some mistakes before. It can report a range of zones now. This looks like the description of what changed compared to v1. Please put the changelog below "---" in the future. When patch emails are merged by git-am(1) it keeps the text above "---" and discards the text below "---". The changelog is usually no longer useful once the patches are merged, so it should be located below the "---" line. The text above the "---" is the commit description (an explanation of why this commit is necessary). In this case the commit description should explain that this patch adds .bdrv_co_zone_report() and .bdrv_co_zone_mgmt() to BlockDriver so that zoned block devices can be supported. > > Signed-off-by: Sam Li <faithilike...@gmail.com> > --- > block/block-backend.c | 22 ++++ > block/coroutines.h | 5 + > block/file-posix.c | 182 ++++++++++++++++++++++++++++++ > block/io.c | 23 ++++ > include/block/block-common.h | 43 ++++++- > include/block/block-io.h | 13 +++ > include/block/block_int-common.h | 20 ++++ > qemu-io-cmds.c | 118 +++++++++++++++++++ > tests/qemu-iotests/tests/zoned.sh | 52 +++++++++ > 9 files changed, 477 insertions(+), 1 deletion(-) > create mode 100644 tests/qemu-iotests/tests/zoned.sh > > diff --git a/block/block-backend.c b/block/block-backend.c > index e0e1aff4b1..20248e4a35 100644 > --- a/block/block-backend.c > +++ b/block/block-backend.c > @@ -104,6 +104,8 @@ typedef struct BlockBackendAIOCB { > int ret; > } BlockBackendAIOCB; > > + > + Please avoid whitespace changes in code that is otherwise untouched by your patch. Code changes can cause merge conflicts and they make it harder to use git-annotate(1), so only changes that are necessary should be included in a patch. > static const AIOCBInfo block_backend_aiocb_info = { > .get_aio_context = blk_aiocb_get_aio_context, > .aiocb_size = sizeof(BlockBackendAIOCB), > @@ -1810,6 +1812,25 @@ int blk_flush(BlockBackend *blk) > return ret; > } > Please add a documentation comment for blk_co_zone_report() that explains how to use the functions and the purpose of the arguments. For example, does offset have to be the first byte in a zone or can it be any byte offset? What are the alignment requirements of offset and len? Why is nr_zones a pointer? > +int blk_co_zone_report(BlockBackend *blk, int64_t offset, int64_t len, Functions that run in coroutine context must be labeled with coroutine_fn: int coroutine_fn blk_co_zone_report(...) This tells humans and tools that the function can only be called from a coroutine. There is a blog post about coroutines in QEMU here: https://blog.vmsplice.net/2014/01/coroutines-in-qemu-basics.html > + int64_t *nr_zones, > + struct BlockZoneDescriptor *zones) QEMU coding style uses typedefs when defining structs, so "struct BlockZoneDescriptor *zones" should be written as "BlockZoneDescriptor *zones". > +{ > + int ret; This function is called from the I/O code path, please mark it with: IO_CODE(); From include/block/block-io.h: * I/O API functions. These functions are thread-safe, and therefore * can run in any thread as long as the thread has called * aio_context_acquire/release(). * * These functions can only call functions from I/O and Common categories, * but can be invoked by GS, "I/O or GS" and I/O APIs. * * All functions in this category must use the macro * IO_CODE(); * to catch when they are accidentally called by the wrong API. > + ret = bdrv_co_zone_report(blk->root->bs, offset, len, nr_zones, zones); Please add blk_inc_in_flight(blk) and blk_dec_in_flight(blk) around this function call to ensure that zone report requests finish before I/O is drained (see bdrv_drained_begin()). This is necessary so that it's possible to wait for I/O requests, including zone report, to complete. Similar to blk_co_do_preadv() we need blk_wait_while_drained(blk), blk_check_byte_request(), and bdrv_inc_in_flight(bs) before calling bdrv_co_zone_report(). bdrv_dec_in_flight(bs) needs to be called after bdrv_co_zone_report() returns. > + return ret; > +} > + > +int blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op, > + int64_t offset, int64_t len) > +{ > + int ret; > + ret = bdrv_co_zone_mgmt(blk->root->bs, op, offset, len); > + > + return ret; > +} The same applies to this function. > + > + > void blk_drain(BlockBackend *blk) > { > BlockDriverState *bs = blk_bs(blk); > @@ -2634,3 +2655,4 @@ int blk_make_empty(BlockBackend *blk, Error **errp) > > return bdrv_make_empty(blk->root, errp); > } > + Unrelated whitespace change. > diff --git a/block/coroutines.h b/block/coroutines.h > index 830ecaa733..247326255f 100644 > --- a/block/coroutines.h > +++ b/block/coroutines.h > @@ -80,6 +80,11 @@ int coroutine_fn > blk_co_do_pdiscard(BlockBackend *blk, int64_t offset, int64_t bytes); > > int coroutine_fn blk_co_do_flush(BlockBackend *blk); > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset, > + int64_t len, int64_t *nr_zones, > + struct BlockZoneDescriptor *zones); > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, enum zone_op op, > + int64_t offset, int64_t len); > > > /* > diff --git a/block/file-posix.c b/block/file-posix.c > index 48cd096624..71fd21f8ba 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -178,6 +178,137 @@ typedef struct BDRVRawReopenState { > bool check_cache_dropped; > } BDRVRawReopenState; > > +/* > + * parse_zone - Fill a zone descriptor > + */ > +static inline void parse_zone(struct BlockZoneDescriptor *zone, > + struct blk_zone *blkz) { > + zone->start = blkz->start << BDRV_SECTOR_BITS; > + zone->length = blkz->len << BDRV_SECTOR_BITS; > + zone->cap = blkz->capacity << BDRV_SECTOR_BITS; > + zone->wp = blkz->wp << BDRV_SECTOR_BITS; > + zone->type = blkz->type; > + zone->cond = blkz->type; Should this be "zone->cond = blkz->cond"? > +} > + > +/** > + * zone report - Get a zone block device's information in the > + * form of an array of zone descriptors. > + * > + * @param bs: passing zone block device file descriptor > + * @param zones: Space to hold zone information on reply > + * @param offset: the location in the zone block device > + * @return 0 on success, -1 on failure > + */ > +static int raw_co_zone_report(BlockDriverState *bs, int64_t offset, int64_t > len, coroutine_fn > + int64_t *nr_zones, > + struct BlockZoneDescriptor *zones) { > + BDRVRawState *s = bs->opaque; > + struct blk_zone_report *rep; > + struct BlockZoneDescriptor bzd; > + struct blk_zone *blkz; > + int64_t zone_size_mask, end, rep_size, nrz; > + int ret, n = 0, i = 0; > + > + printf("%s\n", "start to report zone"); This looks like debug output. QEMU has a tracing system that you can use. See docs/devel/tracing.rst. > + nrz = *nr_zones; > + rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct > blk_zone); > + rep = (struct blk_zone_report *)malloc(rep_size); Please use g_autofree and g_new(). QEMU avoids direct use of malloc(3)/free(3). > + if (!rep) { > + return -1; > + } > + > + zone_size_mask = zone_start_sector - 1; > + /* align up */ > + end = ((offset + len + zone_size_mask) & (~zone_size_mask)) > + >> BDRV_SECTOR_BITS; More readable: end = DIV_ROUND_UP(offset + len, BDRV_SECTOR_SIZE); > + > + blkz = (struct blk_zone *)(rep + 1); > + while (offset < end) { > + memset(rep, 0, rep_size); > + rep->sector = offset; > + rep->nr_zones = nrz; > + > + ret = ioctl(s->fd, BLKREPORTZONE, rep); Can this ioctl() block? It seems likely. If yes, then the call needs to be made from the thread pool to avoid blocking the current thread. See raw_thread_pool_submit(). > + if (ret != 0) { > + ret = -errno; > + error_report("%d: ioctl BLKREPORTZONE at %ld failed %d", > + s->fd, offset, errno); > + free(rep); > + return ret; > + } > + > + if (!rep->nr_zones) { > + break; > + } > + > + for (i = 0; i < rep->nr_zones; i++) { > + parse_zone(&bzd, &blkz[i]); > + if (zones) { > + memcpy(&zones[n], &bzd, sizeof(bzd)); n is never incremented so this always overwrites the first element. > + } > + } > + > + offset = blkz[i].start + blkz[i].len; > + } > + Does this function need to update *nr_zones = n before returning? How does the caller know how many zones were returned? > + return ret; > +} > + > +/** > + * zone management operations - Execute an operation on a zone > + */ > +static int raw_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, > + int64_t offset, int64_t len) { > + BDRVRawState *s = bs->opaque; > + struct blk_zone_range range; > + const char *ioctl_name; > + uint64_t ioctl_op; ioctl()'s second argument is unsigned long request. Please use unsigned long to keep the types consistent. > + int64_t zone_size_mask, end; > + int ret; > + > + zone_size_mask = zone_start_sector - 1; > + /* align up */ > + end = ((offset + len + zone_size_mask) & (~zone_size_mask)) > + >> BDRV_SECTOR_BITS; > + offset = (offset & (~zone_size_mask)) >> BDRV_SECTOR_BITS; /* align down > */ > + > + switch (op) { > + case zone_open: > + ioctl_name = "BLKOPENZONE"; > + ioctl_op = BLKOPENZONE; > + break; > + case zone_close: > + ioctl_name = "BLKCLOSEZONE"; > + ioctl_op = BLKCLOSEZONE; > + break; > + case zone_finish: > + ioctl_name = "BLKFINISHZONE"; > + ioctl_op = BLKFINISHZONE; > + break; > + case zone_reset: > + ioctl_name = "BLKRESETZONE"; > + ioctl_op = BLKRESETZONE; > + break; > + default: > + error_report("Invalid zone operation 0x%x", op); > + errno = -EINVAL; > + return -1; > + } > + > + /* Execute the operation */ > + range.sector = offset; > + range.nr_sectors = end - offset; > + ret = ioctl(s->fd, ioctl_op, &range); > + if (ret != 0) { > + error_report("ioctl %s failed %d", > + ioctl_name, errno); > + return -1; > + } > + > + return 0; > +} > + > static int fd_open(BlockDriverState *bs) > { > BDRVRawState *s = bs->opaque; > @@ -3324,6 +3455,9 @@ BlockDriver bdrv_file = { > .bdrv_abort_perm_update = raw_abort_perm_update, > .create_opts = &raw_create_opts, > .mutable_opts = mutable_opts, > + > + .bdrv_co_zone_report = raw_co_zone_report, > + .bdrv_co_zone_mgmt = raw_co_zone_mgmt, > }; > > /***********************************************/ > @@ -3703,6 +3837,53 @@ static BlockDriver bdrv_host_device = { > #endif > }; > > +static BlockDriver bdrv_zoned_host_device = { > + .format_name = "zoned_host_device", > + .protocol_name = "zoned_host_device", > + .instance_size = sizeof(BDRVRawState), > + .bdrv_needs_filename = true, > + .bdrv_probe_device = hdev_probe_device, > + .bdrv_parse_filename = hdev_parse_filename, > + .bdrv_file_open = hdev_open, > + .bdrv_close = raw_close, > + .bdrv_reopen_prepare = raw_reopen_prepare, > + .bdrv_reopen_commit = raw_reopen_commit, > + .bdrv_reopen_abort = raw_reopen_abort, > + .bdrv_co_create_opts = bdrv_co_create_opts_simple, > + .create_opts = &bdrv_create_opts_simple, > + .mutable_opts = mutable_opts, > + .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > + .bdrv_co_pwrite_zeroes = hdev_co_pwrite_zeroes, > + > + .bdrv_co_preadv = raw_co_preadv, > + .bdrv_co_pwritev = raw_co_pwritev, > + .bdrv_co_flush_to_disk = raw_co_flush_to_disk, > + .bdrv_co_pdiscard = hdev_co_pdiscard, > + .bdrv_co_copy_range_from = raw_co_copy_range_from, > + .bdrv_co_copy_range_to = raw_co_copy_range_to, > + .bdrv_refresh_limits = raw_refresh_limits, > + .bdrv_io_plug = raw_aio_plug, > + .bdrv_io_unplug = raw_aio_unplug, > + .bdrv_attach_aio_context = raw_aio_attach_aio_context, > + > + .bdrv_co_truncate = raw_co_truncate, > + .bdrv_getlength = raw_getlength, > + .bdrv_get_info = raw_get_info, > + .bdrv_get_allocated_file_size > + = raw_get_allocated_file_size, > + .bdrv_get_specific_stats = hdev_get_specific_stats, > + .bdrv_check_perm = raw_check_perm, > + .bdrv_set_perm = raw_set_perm, > + .bdrv_abort_perm_update = raw_abort_perm_update, > + .bdrv_probe_blocksizes = hdev_probe_blocksizes, > + .bdrv_probe_geometry = hdev_probe_geometry, > + .bdrv_co_ioctl = hdev_co_ioctl, > + > + /* zone management operations */ > + .bdrv_co_zone_report = raw_co_zone_report, > + .bdrv_co_zone_mgmt = raw_co_zone_mgmt, > +}; > + > #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__) > static void cdrom_parse_filename(const char *filename, QDict *options, > Error **errp) > @@ -3964,6 +4145,7 @@ static void bdrv_file_init(void) > #if defined(HAVE_HOST_BLOCK_DEVICE) > bdrv_register(&bdrv_host_device); > #ifdef __linux__ > + bdrv_register(&bdrv_zoned_host_device); > bdrv_register(&bdrv_host_cdrom); > #endif > #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) > diff --git a/block/io.c b/block/io.c > index 789e6373d5..3e8bb83cc3 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -3258,6 +3258,29 @@ out: > return co.ret; > } > > +int bdrv_co_zone_report(BlockDriverState *bs, > + int64_t offset, int64_t len, int64_t *nr_zones, > + struct BlockZoneDescriptor *zones) > +{ > + int ret = 0; > + if (!bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones)) { > + return -ENOTSUP; > + } This returns -ENOTSUP any time ->bdrv_co_zone_report() returns 0. Should this be: if (!bs->drv->bdrv_co_zone_report) { return -ENOTSUP; } return bs->drv->bdrv_co_zone_report(bs, offset, len, nr_zones, zones); ? > + > + return ret; > +} > + > +int bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, > + int64_t offset, int64_t len) > +{ > + int ret = 0; > + if (!bs->drv->bdrv_co_zone_mgmt(bs, op, offset, len)) { > + return -ENOTSUP; > + } > + > + return ret; > +} > + > void *qemu_blockalign(BlockDriverState *bs, size_t size) > { > IO_CODE(); > diff --git a/include/block/block-common.h b/include/block/block-common.h > index fdb7306e78..24c468d8ad 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -23,7 +23,7 @@ > */ > #ifndef BLOCK_COMMON_H > #define BLOCK_COMMON_H > - > +#include <linux/blkzoned.h> Linux-specific code must be #ifdef __linux__ to avoid compilation errors on non-Linux hosts. > #include "block/aio.h" > #include "block/aio-wait.h" > #include "qemu/iov.h" > @@ -48,6 +48,47 @@ > typedef struct BlockDriver BlockDriver; > typedef struct BdrvChild BdrvChild; > typedef struct BdrvChildClass BdrvChildClass; > +enum zone_type { Please use "typedef enum { ... } BdrvZoneType" so that enums follow QEMU's coding style. > + BLK_ZT_CONV = BLK_ZONE_TYPE_CONVENTIONAL, > + BLK_ZT_SWR = BLK_ZONE_TYPE_SEQWRITE_REQ, > + BLK_ZT_SWP = BLK_ZONE_TYPE_SEQWRITE_PREF, > +}; > + > +enum zone_cond { > + BLK_ZS_NOT_WP = BLK_ZONE_COND_NOT_WP, > + BLK_ZS_EMPTY = BLK_ZONE_COND_EMPTY, > + BLK_ZS_IOPEN = BLK_ZONE_COND_IMP_OPEN, > + BLK_ZS_EOPEN = BLK_ZONE_COND_EXP_OPEN, > + BLK_ZS_CLOSED = BLK_ZONE_COND_CLOSED, > + BLK_ZS_RDONLY = BLK_ZONE_COND_READONLY, > + BLK_ZS_FULL = BLK_ZONE_COND_FULL, > + BLK_ZS_OFFLINE = BLK_ZONE_COND_OFFLINE, > +}; This 1:1 correspondence with Linux constants could make the code a little harder to port. Maybe QEMU's block layer should define its own numeric constants so the code doesn't rely on operating system-specific headers. block/file-posix.c #ifdef __linux__ code can be responsible for converting Linux-specific constants to QEMU constants (and the 1:1 mapping can be used there). > + > +enum zone_op { > + zone_open, > + zone_close, > + zone_finish, > + zone_reset, > +}; > + > +/* > + * Zone descriptor data structure. > + * Provide information on a zone with all position and size values in bytes. > + */ > +typedef struct BlockZoneDescriptor { > + uint64_t start; > + uint64_t length; > + uint64_t cap; > + uint64_t wp; > + enum zone_type type; > + enum zone_cond cond; > +} BlockZoneDescriptor; > + > +enum zone_model { > + BLK_Z_HM, > + BLK_Z_HA, > +}; > > typedef struct BlockDriverInfo { > /* in bytes, 0 if irrelevant */ > diff --git a/include/block/block-io.h b/include/block/block-io.h > index 62c84f0519..deb8902684 100644 > --- a/include/block/block-io.h > +++ b/include/block/block-io.h > @@ -80,6 +80,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void > *buf); > /* Ensure contents are flushed to disk. */ > int coroutine_fn bdrv_co_flush(BlockDriverState *bs); > > +/* Report zone information of zone block device. */ > +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset, > + int64_t len, int64_t *nr_zones, > + struct BlockZoneDescriptor *zones); > +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, > + int64_t offset, int64_t len); > + > int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes); > bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs); > int bdrv_block_status(BlockDriverState *bs, int64_t offset, > @@ -290,6 +297,12 @@ bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector > *qiov, int64_t pos); > int generated_co_wrapper > bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); > > +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset, > + int64_t len, int64_t *nr_zones, > + struct BlockZoneDescriptor *zones); > +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, enum zone_op op, > + int64_t offset, int64_t len); > + > /** > * bdrv_parent_drained_begin_single: > * > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 8947abab76..4d0180a7da 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -63,6 +63,7 @@ > #define BLOCK_OPT_EXTL2 "extended_l2" > > #define BLOCK_PROBE_BUF_SIZE 512 > +#define zone_start_sector 512 > > enum BdrvTrackedRequestType { > BDRV_TRACKED_READ, > @@ -94,6 +95,19 @@ typedef struct BdrvTrackedRequest { > struct BdrvTrackedRequest *waiting_for; > } BdrvTrackedRequest; > > +/** > + * Zone device information data structure. > + * Provide information on a device. > + */ > +typedef struct zbd_dev { > + enum zone_model model; > + uint32_t block_size; How does this relate to QEMU's BlockLimits? > + uint32_t write_granularity; Same. Maybe this belongs in BlockLimits? > + uint32_t nr_zones; Should this really be limited to 32-bit? For example, take 256 MB zones, then the max nr_zones * 256 MB is much smaller than a uint64_t capacity value. It seems safer to make this 64-bit, but maybe Dmitry, Hannes, or Damien can tell us what to do here. > + struct BlockZoneDescriptor *zones; /* array of zones */ When is this used?
signature.asc
Description: PGP signature