Re: [RFC v4 2/9] qemu-io: add zoned block device operations.
; +g_autofree BlockZoneDescriptor *zones = NULL; >> +zones = g_new(BlockZoneDescriptor, nr_zones); >> +ret = blk_zone_report(blk, offset, &nr_zones, zones); >> +if (ret < 0) { >> +printf("zone report failed: %s\n", strerror(-ret)); >> +} else { >> +for (int i = 0; i < nr_zones; ++i) { >> +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " >> + "cap"" 0x%" PRIx64 ",wptr 0x%" PRIx64 ", " >> + "zcond:%u, [type: %u]\n", >> + zones[i].start, zones[i].length, zones[i].cap, >> zones[i].wp, >> + zones[i].cond, zones[i].type); >> +} >> +} >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_report_cmd = { >> +.name = "zone_report", >> +.altname = "zp", >> +.cfunc = zone_report_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset number", >> +.oneline = "report zone information", >> +}; >> + >> +static int zone_open_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset, len; >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +ret = blk_zone_mgmt(blk, zone_open, offset, len); > > Please name enum constants in all caps: BDRV_ZONE_OPEN or BDRV_ZO_OPEN. -- Damien Le Moal Western Digital Research
Re: [PATCH v7 2/4] qcow2: add configurations for zoned format extension
On 2/20/24 06:15, Markus Armbruster wrote: >>>>> Making this member @size mandatory means we must specify it when >>>>> BlockdevCreateOptionsQcow2 member @zone is present and @zone's member >>>>> @mode is "host-managed". Feels right to me. Am I missing anything? >>>> >>>> That's right. And the checks when creating such an img can help do >>>> that. It's not specified in the .json file directly. >>> >>> What would break if we did specify it in the QAPI schema directly? >> >> Nothing I think. We can keep the current schema and add a default zone >> size like 131072. > > I believe making the member mandatory makes a lot more sense. > > I guess we can keep @capacity and @max-append-bytes keep optional *if* > we can come up with sensible defaults. Yes, @capacity can be optional and default to the zone size. @max-append-bytes can also be optional and default to the regular read/write max size. However, defining a "sensible" default for the zone size is rather tricky. The reason is that zones are generally sized according to the device speed. Zones are 256MB on HDDs (which takes about 1s to fully write sequentially) while NVMe ZNS devices will more likely have a zone size of 1-2 GB (because the device is much faster). If we stick with such approach, a sensible zone size would depend on how fast the qcow2 image backing is. I.e. 256 MB would be more appropriate for a qcow2 image on a file stored on HDD while larger sizes may be better for SSD backed images. But that is also not accounting for the host page caching which can "show" the qcow2 image as being much faster than its backing storage really is. So I would be tempted to say that defaulting to 0 to force the user to specify a zone size would be safer. But if you really want a non-0 default, then maybe 256MB or so may be OK as that is the most commonly used value out there for zoned storage (there are literally millions of SMR drives running in production systems out there and NVMe ZNS devices are not widely used yet). -- Damien Le Moal Western Digital Research
Re: [RFC v2] Adding block layer APIs resembling Linux ZoneBlockDevice ioctls.
;> * I/O API functions. These functions are thread-safe, and therefore >>> * can run in any thread as long as the thread has called >>> * aio_context_acquire/release(). >>> * >>> * These functions can only call functions from I/O and Common categories, >>> * but can be invoked by GS, "I/O or GS" and I/O APIs. >>> * >>> * All functions in this category must use the macro >>> * IO_CODE(); >>> * to catch when they are accidentally called by the wrong API. >>> >>>> +ret = bdrv_co_zone_report(blk->root->bs, offset, len, nr_zones, >>>> zones); >>> >>> Please add blk_inc_in_flight(blk) and blk_dec_in_flight(blk) around this >>> function call to ensure that zone report requests finish before I/O is >>> drained (see bdrv_drained_begin()). This is necessary so that it's >>> possible to wait for I/O requests, including zone report, to complete. >>> >>> Similar to blk_co_do_preadv() we need blk_wait_while_drained(blk), >>> blk_check_byte_request(), and bdrv_inc_in_flight(bs) before calling >>> bdrv_co_zone_report(). bdrv_dec_in_flight(bs) needs to be called after >>> bdrv_co_zone_report() returns. >>> >> After adding similar structure to blk_co_do_preadv(), zone operation >> command will always fail at blk_wait_while_drained(blk) because >> blk->inflight <= 0. Would it be ok to just remove >> blk_wait_while_drained? > > Are you hitting the assertion in > block/block-backend.c:blk_wait_while_drained()? > > assert(blk->in_flight > 0); > > If yes, then there is a bug in the code. You need to make sure that > blk_inc_in_flight() is called before blk_wait_while_drained(). > >>>> +BLK_ZT_CONV = BLK_ZONE_TYPE_CONVENTIONAL, >>>> +BLK_ZT_SWR = BLK_ZONE_TYPE_SEQWRITE_REQ, >>>> +BLK_ZT_SWP = BLK_ZONE_TYPE_SEQWRITE_PREF, >>>> +}; >>>> + >>>> +enum zone_cond { >>>> +BLK_ZS_NOT_WP = BLK_ZONE_COND_NOT_WP, >>>> +BLK_ZS_EMPTY = BLK_ZONE_COND_EMPTY, >>>> +BLK_ZS_IOPEN = BLK_ZONE_COND_IMP_OPEN, >>>> +BLK_ZS_EOPEN = BLK_ZONE_COND_EXP_OPEN, >>>> +BLK_ZS_CLOSED = BLK_ZONE_COND_CLOSED, >>>> +BLK_ZS_RDONLY = BLK_ZONE_COND_READONLY, >>>> +BLK_ZS_FULL = BLK_ZONE_COND_FULL, >>>> +BLK_ZS_OFFLINE = BLK_ZONE_COND_OFFLINE, >>>> +}; >>> >>> This 1:1 correspondence with Linux constants could make the code a >>> little harder to port. >>> >>> Maybe QEMU's block layer should define its own numeric constants so the >>> code doesn't rely on operating system-specific headers. >>> block/file-posix.c #ifdef __linux__ code can be responsible for >>> converting Linux-specific constants to QEMU constants (and the 1:1 >>> mapping can be used there). >>> >> Can we define those constants in block-common.h? Because >> BlockZoneDescriptor requires zone_condition, zone_type defined and >> BlockZoneDesicriptor are used in header files and qemu-io >> sub-commands. If we use #ifdef __linux__ in block-common.h, it can be >> responsible for converting Linux constants instead. >> >> Thanks for reviewing! If there is any problem, please let me know. > > I suggest defining the constants in block-common.h. #ifdef __linux__ is > not necessary in block-common.h because the constants should just be an > enum with BLK_ZS_NOT_WP = 0 and so on (no need for Linux headers). > > In block/file-posix.c you can define a helper function inside #ifdef > __linux__ that does something like: > > BlkZoneCond zone_cond_from_linux(enum blk_zone_cond val) > { > switch (val) { > case BLK_ZONE_COND_NOT_WP: > return BLK_ZS_NOT_WP; > ... > } > > The code in block/file-posix.c should call this helper to convert from > Linux values to QEMU values. Given that the entire zone API is Linux specific (as far as I know), we do not need to have these helpers: the entire code for zones needs to be under a #ifdef __linux__. But the conversion from Linux struct blk_zone to qemu zone descriptor still needs to be done. And the perfect place to do this is the parse_zone() function. There, we can add: switch (blkz->cond) { case BLK_ZONE_COND_NOT_WP: zone->cond = BLK_ZS_NOT_WP; break; ... } And same for zone type. That will also allow checking the values returned by Linux. ZBC-2 defines more zone types and zone conditions than currently defined in /usr/include/linux/blkzoned.h. If these new zone types/conditions ever get supported by Linux, qemu can catch the values it does not support and reject the drive. > > This way the QEMU block layer does not use Linux constants and compiles > on non-Linux machines. > > Stefan -- Damien Le Moal Western Digital Research
Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.
r >>> size is 512 bytes. */ >>> +printf("start to report zone with offset: 0x%lx\n", offset); >>> + >>> +blkz = (struct blk_zone *)(rep + 1); >>> +while (n < nrz) { >>> +memset(rep, 0, rep_size); >>> +rep->sector = offset; >>> +rep->nr_zones = nrz; >> >> What prevents zones[] overflowing? nrz isn't being reduced in the loop, >> so maybe the rep->nr_zones return value will eventually exceed the >> number of elements still available in zones[n..]? > > I suppose the number of zones[] is restricted in the subsequent for > loop where zones[] copy one zone at a time as n increases. Even if > rep->zones exceeds the available room in zones[], the extra zone will > not be copied. > >> >>> +static int handle_aiocb_zone_mgmt(void *opaque) { >>> +RawPosixAIOData *aiocb = opaque; >>> +int fd = aiocb->aio_fildes; >>> +int64_t offset = aiocb->aio_offset; >>> +int64_t len = aiocb->aio_nbytes; >>> +zone_op op = aiocb->op; >>> + >>> +struct blk_zone_range range; >>> +const char *ioctl_name; >>> +unsigned long ioctl_op; >>> +int64_t zone_size; >>> +int64_t zone_size_mask; >>> +int ret; >>> + >>> +ret = ioctl(fd, BLKGETZONESZ, &zone_size); >> >> Can this value be stored in bs (maybe in BlockLimits) to avoid calling >> ioctl(BLKGETZONESZ) each time? > > Yes, zone_size is in the zbd_dev field. I'll update BlockLimits after > I think through the configurations. In addition, it's a temporary > approach. It is substituted by get_sysfs_long_val now. > >> >>> +if (ret) { >>> +return -1; >> >> The return value should be a negative errno. -1 is -EPERM but it's >> probably not that error code you wanted. This should be: >> >> return -errno; >> >>> +} >>> + >>> +zone_size_mask = zone_size - 1; >>> +if (offset & zone_size_mask) { >>> +error_report("offset is not the start of a zone"); >>> +return -1; >> >> return -EINVAL; >> >>> +} >>> + >>> +if (len & zone_size_mask) { >>> +error_report("len is not aligned to zones"); >>> +return -1; >> >> return -EINVAL; >> >>> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t >>> offset, >>> +int64_t len, int64_t *nr_zones, >>> +BlockZoneDescriptor *zones) { >>> +BDRVRawState *s = bs->opaque; >>> +RawPosixAIOData acb; >>> + >>> +acb = (RawPosixAIOData) { >>> +.bs = bs, >>> +.aio_fildes = s->fd, >>> +.aio_type = QEMU_AIO_IOCTL, >>> +.aio_offset = offset, >>> +.aio_nbytes = len, >>> +.zone_report= { >>> +.nr_zones = nr_zones, >>> +.zones = zones, >>> +}, >> >> Indentation is off here. Please use 4-space indentation. > Noted! > > Thanks for reviewing! > > Sam -- Damien Le Moal Western Digital Research
Re: [RFC v3 2/5] qemu-io: add zoned block device operations.
On 6/28/22 16:56, Stefan Hajnoczi wrote: > On Mon, Jun 27, 2022 at 08:19:14AM +0800, Sam Li wrote: >> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c >> index 2f0d8ac25a..3f2592b9f5 100644 >> --- a/qemu-io-cmds.c >> +++ b/qemu-io-cmds.c >> @@ -1706,6 +1706,122 @@ static const cmdinfo_t flush_cmd = { >> .oneline= "flush all in-core file state to disk", >> }; >> >> +static int zone_report_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset, len, nr_zones; >> +int i = 0; >> + >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +++optind; >> +nr_zones = cvtnum(argv[optind]); >> + >> +g_autofree BlockZoneDescriptor *zones = g_new(BlockZoneDescriptor, >> nr_zones); >> +ret = blk_zone_report(blk, offset, len, &nr_zones, zones); >> +while (i < nr_zones) { > > Does blk_zone_report() set nr_zones to 0 on failure or do we need to > check if (ret < 0) here? ret = 0 means "no zone reported" which happen only if nr_zones is 0 or the start offset is past the end of the disk capacity. ret < 0 would mean that a report zone operation was actually attempted and failed (EIO, ENOMEM etc). > >> +fprintf(stdout, "start: 0x%lx, len 0x%lx, cap 0x%lx, wptr 0x%lx, " > > The rest of the source file uses printf() instead of fprintf(stdout, > ...). That's usually preferred because it's shorter. > >> +"zcond:%u, [type: %u]\n", > > Please use PRIx64 instead of lx format specifiers for portability. On > 32-bit hosts lx is 32-bit, not 64-bit. You can grep QEMU's code for > examples of PRIx64. > >> +zones[i].start, zones[i].length, zones[i].cap, zones[i].wp, >> +zones[i].cond, zones[i].type); >> +++i; >> +} > > A for loop is more idiomatic: > > for (int i = 0; i < nr_zones; i++) { > ... > } > >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_report_cmd = { >> +.name = "zone_report", >> +.altname = "f", >> +.cfunc = zone_report_f, >> +.argmin = 3, >> +.argmax = 3, >> +.args = "offset [offset..] len [len..] number [num..]", > > The arguments are "offset len number". This command does not accept > optional offset/len/num arguments. The arguments should be offset + len OR offset + number of zones. Having the 3 of them does not make sense to me. The interface would then be: (1) offset + len -> report all zones in the block range [offset .. offset + len - 1] (2) offset + number of zones -> report at most "number of zones" from the zone containing the block at "offset". (2) matches the semantic used at the device command level. So I prefer to approach (1). > >> +.oneline = "report a number of zones", > > Maybe "report zone information". > >> +}; >> + >> +static int zone_open_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int64_t offset, len; >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +return blk_zone_mgmt(blk, zone_open, offset, len); > > Where is the error reported? When I look at read_f() I see: > > if (ret < 0) { > printf("read failed: %s\n", strerror(-ret)); > > I think something similar is needed because qemu-io.c does not print an > error message for us. The same is true for the other commands defined in > this patch. > >> +} >> + >> +static const cmdinfo_t zone_open_cmd = { >> +.name = "zone_open", >> +.altname = "f", >> +.cfunc = zone_open_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset [offset..] len [len..]", > > There are no optional offset/len args. The same is true for the other > commands below. -- Damien Le Moal Western Digital Research
Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.
On 6/28/22 19:23, Sam Li wrote: > Damien Le Moal 于2022年6月28日周二 17:05写道: >> >> On 6/28/22 17:05, Sam Li wrote: >>> Stefan Hajnoczi 于2022年6月28日周二 14:52写道: >>>> >>>> On Mon, Jun 27, 2022 at 08:19:13AM +0800, Sam Li wrote: >>>>> diff --git a/block/block-backend.c b/block/block-backend.c >>>>> index e0e1aff4b1..786f964d02 100644 >>>>> --- a/block/block-backend.c >>>>> +++ b/block/block-backend.c >>>>> @@ -1810,6 +1810,62 @@ int blk_flush(BlockBackend *blk) >>>>> return ret; >>>>> } >>>>> >>>>> +/* >>>>> + * Return zone_report from BlockDriver. Offset can be any number within >>>>> + * the zone size. No alignment for offset and len. >>>> >>>> What is the purpose of len? Is it the maximum number of zones to return >>>> in nr_zones[]? >>> >>> len is actually not used in bdrv_co_zone_report. It is needed by >>> blk_check_byte_request. >> >> There is no IO buffer being passed. Only an array of zone descriptors and >> an in-out number of zones. len is definitely not needed. Not sure what >> blk_check_byte_request() is intended to check though. > > Can I just remove len argument and blk_check_byte_request() if it's not used? If it is unused, yes, you must remove it. > >> >>> >>>> How does the caller know how many zones were returned? >>> >>> nr_zones represents IN maximum and OUT actual. The caller will know by >>> nr_zones which is changed in bdrv_co_zone_report. I'll add it in the >>> comments. >>> >>>> >>>>> + */ >>>>> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset, >>>>> + int64_t len, int64_t *nr_zones, >>>>> + BlockZoneDescriptor *zones) >>>>> +{ >>>>> +int ret; >>>>> +BlockDriverState *bs; >>>>> +IO_CODE(); >>>>> + >>>>> +blk_inc_in_flight(blk); /* increase before waiting */ >>>>> +blk_wait_while_drained(blk); >>>>> +bs = blk_bs(blk); >>>>> + >>>>> +ret = blk_check_byte_request(blk, offset, len); >>>>> +if (ret < 0) { >>>>> +return ret; >>>>> +} >>>>> + >>>>> +bdrv_inc_in_flight(bs); >>>> >>>> The bdrv_inc/dec_in_flight() call should be inside >>>> bdrv_co_zone_report(). See bdrv_co_ioctl() for an example. >>>> >>>>> +ret = bdrv_co_zone_report(blk->root->bs, offset, len, >>>>> + nr_zones, zones); >>>>> +bdrv_dec_in_flight(bs); >>>>> +blk_dec_in_flight(blk); >>>>> +return ret; >>>>> +} >>>>> + >>>>> +/* >>>>> + * Return zone_mgmt from BlockDriver. >>>> >>>> Maybe this should be: >>>> >>>> Send a zone management command. >>> >>> Yes, it's more accurate. >>> >>>> >>>>> @@ -216,6 +217,11 @@ typedef struct RawPosixAIOData { >>>>> PreallocMode prealloc; >>>>> Error **errp; >>>>> } truncate; >>>>> +struct { >>>>> +int64_t *nr_zones; >>>>> +BlockZoneDescriptor *zones; >>>>> +} zone_report; >>>>> +zone_op op; >>>> >>>> It's cleaner to put op inside a struct zone_mgmt so its purpose is >>>> self-explanatory: >>>> >>>> struct { >>>> zone_op op; >>>> } zone_mgmt; >>>> >>>>> +static int handle_aiocb_zone_report(void *opaque) { >>>>> +RawPosixAIOData *aiocb = opaque; >>>>> +int fd = aiocb->aio_fildes; >>>>> +int64_t *nr_zones = aiocb->zone_report.nr_zones; >>>>> +BlockZoneDescriptor *zones = aiocb->zone_report.zones; >>>>> +int64_t offset = aiocb->aio_offset; >>>>> +int64_t len = aiocb->aio_nbytes; >> >> Since you have the zone array and number of zones (size of that array) I >> really do not see why you need len. >> >>>>> + >>>>> +struct blk_zone *blkz
Re: [RFC v3 1/5] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.
On 6/29/22 10:50, Sam Li wrote: >>>>>>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct >>>>>>> blk_zone); >>>>>>> +g_autofree struct blk_zone_report *rep = g_new(struct >>>>>>> blk_zone_report, nrz); >>>>>> >>>>>> g_new() looks incorrect. There should be 1 struct blk_zone_report >>>>>> followed by nrz struct blk_zone structs. Please use g_malloc(rep_size) >>>>>> instead. >>>>> >>>>> Yes! However, it still has a memory leak error when using g_autofree >>>>> && g_malloc. >>>> >>>> That may be because you are changing the value of the rep pointer while >>>> parsing the report ? >>> >>> I am not sure it is the case. Can you show me some way to find the problem? >> >> Not sure. I never used this g_malloc()/g_autofree() before so not sure how >> it works. It may be that g_autofree() work only with g_new() ? >> Could you try separating the declaration and allocation ? e.g. >> >> Declare at the beginning of the function: >> g_autofree struct blk_zone_report *rep = NULL; >> >> And then when needed do: >> >> rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone); >> rep = g_malloc(rep_size); > > Actually, the memory leak occurs in that way. When using zone_mgmt, > memory leak still occurs. Asan gives the error information not much so > I haven't tracked down the problem yet. See this: https://blog.fishsoup.net/2015/11/05/attributecleanup-mixed-declarations-and-code-and-goto/ Maybe you can find some hints. -- Damien Le Moal Western Digital Research
Re: [PATCH v6 1/4] file-posix: add tracking of the zone write pointers
On 3/14/23 11:23, Dmitry Fomichev wrote: >> @@ -3339,10 +3473,27 @@ static int coroutine_fn >> raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, >> len >> BDRV_SECTOR_BITS); >> ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb); >> if (ret != 0) { >> + update_zones_wp(s->fd, wps, offset, index); >> ret = -errno; >> error_report("ioctl %s failed %d", op_name, ret); >> + goto out; >> } >> >> + if (zo == BLKRESETZONE && len == capacity) { >> + for (int i = 0; i < bs->bl.nr_zones; ++i) { >> + if (!BDRV_ZT_IS_CONV(wps->wp[i])) { >> + wps->wp[i] = i * bs->bl.zone_size; > > This will reset write pointers of all read-only zones that may exist on the > device and make the data stored in those zones unreadable. R/O zones need to > be > skipped in this loop. And offline zones need to be skipped as well. -- Damien Le Moal Western Digital Research
Re: [PATCH v6 1/4] file-posix: add tracking of the zone write pointers
On 3/15/23 21:59, Sam Li wrote: > Damien Le Moal 于2023年3月14日周二 11:49写道: >> >> On 3/14/23 11:23, Dmitry Fomichev wrote: >>>> @@ -3339,10 +3473,27 @@ static int coroutine_fn >>>> raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, >>>> len >> BDRV_SECTOR_BITS); >>>> ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb); >>>> if (ret != 0) { >>>> +update_zones_wp(s->fd, wps, offset, index); >>>> ret = -errno; >>>> error_report("ioctl %s failed %d", op_name, ret); >>>> +goto out; >>>> } >>>> >>>> +if (zo == BLKRESETZONE && len == capacity) { >>>> +for (int i = 0; i < bs->bl.nr_zones; ++i) { >>>> +if (!BDRV_ZT_IS_CONV(wps->wp[i])) { >>>> +wps->wp[i] = i * bs->bl.zone_size; >>> >>> This will reset write pointers of all read-only zones that may exist on the >>> device and make the data stored in those zones unreadable. R/O zones need >>> to be >>> skipped in this loop. >> >> And offline zones need to be skipped as well. > > I see. That can be done thanks to get_zones_wp() which can show the > state of the zone at specific position. I do not think so: a zone wp is invalid for read-only and offline zones. So you cannot rely on the wp value to detect these states. Even a valid wp value would not tell you if the zone is read only or offline anyway. You need to track these states with flags set when doing the first report zone on startup and when doing a report zone after an IO error. > > Sam -- Damien Le Moal Western Digital Research
Re: [PATCH v7 1/8] include: add zoned device structs
On 2022/08/15 23:25, Sam Li wrote: > Signed-off-by: Sam Li > Reviewed-by: Stefan Hajnoczi Looks good. Reviewed-by: Damien Le Moal > --- > include/block/block-common.h | 43 > 1 file changed, 43 insertions(+) > > diff --git a/include/block/block-common.h b/include/block/block-common.h > index fdb7306e78..36bd0e480e 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver; > typedef struct BdrvChild BdrvChild; > typedef struct BdrvChildClass BdrvChildClass; > > +typedef enum BlockZoneOp { > +BLK_ZO_OPEN, > +BLK_ZO_CLOSE, > +BLK_ZO_FINISH, > +BLK_ZO_RESET, > +} BlockZoneOp; > + > +typedef enum BlockZoneModel { > +BLK_Z_NONE = 0x0, /* Regular block device */ > +BLK_Z_HM = 0x1, /* Host-managed zoned block device */ > +BLK_Z_HA = 0x2, /* Host-aware zoned block device */ > +} BlockZoneModel; > + > +typedef enum BlockZoneCondition { > +BLK_ZS_NOT_WP = 0x0, > +BLK_ZS_EMPTY = 0x1, > +BLK_ZS_IOPEN = 0x2, > +BLK_ZS_EOPEN = 0x3, > +BLK_ZS_CLOSED = 0x4, > +BLK_ZS_RDONLY = 0xD, > +BLK_ZS_FULL = 0xE, > +BLK_ZS_OFFLINE = 0xF, > +} BlockZoneCondition; > + > +typedef enum BlockZoneType { > +BLK_ZT_CONV = 0x1, /* Conventional random writes supported */ > +BLK_ZT_SWR = 0x2, /* Sequential writes required */ > +BLK_ZT_SWP = 0x3, /* Sequential writes preferred */ > +} BlockZoneType; > + > +/* > + * Zone descriptor data structure. > + * Provides information on a zone with all position and size values in bytes. > + */ > +typedef struct BlockZoneDescriptor { > +uint64_t start; > +uint64_t length; > +uint64_t cap; > +uint64_t wp; > +BlockZoneType type; > +BlockZoneCondition cond; > +} BlockZoneDescriptor; > + > typedef struct BlockDriverInfo { > /* in bytes, 0 if irrelevant */ > int cluster_size; -- Damien Le Moal Western Digital Research
Re: [PATCH v7 2/8] file-posix: introduce get_sysfs_str_val for device zoned model
if (ret < 0) { > +zoned = BLK_Z_NONE; > +} > +bs->bl.zoned = zoned; > } > > static int check_for_dasd(int fd) > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 8947abab76..7f7863cc9e 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -825,6 +825,9 @@ typedef struct BlockLimits { > > /* maximum number of iovec elements */ > int max_iov; > + > +/* device zone model */ > +BlockZoneModel zoned; > } BlockLimits; > > typedef struct BdrvOpBlocker BdrvOpBlocker; -- Damien Le Moal Western Digital Research
Re: [PATCH v7 3/8] file-posix: introduce get_sysfs_long_val for the long sysfs attribute
On 2022/08/16 10:53, Sam Li wrote: > Damien Le Moal 于2022年8月17日周三 01:35写道: >> >> On 2022/08/15 23:25, Sam Li wrote: >>> Use sysfs attribute files to get the long value of zoned device >>> information. >>> >>> Signed-off-by: Sam Li >>> Reviewed-by: Hannes Reinecke >>> Reviewed-by: Stefan Hajnoczi >>> --- >>> block/file-posix.c | 27 +++ >>> 1 file changed, 27 insertions(+) >>> >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index c07ac4c697..727389488c 100644 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -1258,6 +1258,33 @@ static int get_sysfs_zoned_model(struct stat *st, >>> BlockZoneModel *zoned) { >>> return 0; >>> } >>> >>> +/* >>> + * Get zoned device information (chunk_sectors, zoned_append_max_bytes, >>> + * max_open_zones, max_active_zones) through sysfs attribute files. >>> + */ >> >> The comment here needs to be more generic since this helper is used in patch >> 2 >> in hdev_get_max_segments(). So simply something like: >> >> /* >> * Get a sysfs attribute value as a long integer. >> */ >> >> And since this helper is used in patch 2, this patch needs to go before >> patch 2 >> (reverse patch 2 and 3 order). > > Can I merge patch2 and patch 3 into one patch? Because in patch 2 > hdev_get_max_segments -> get_sysfs_long_val(-> get_sysfs_str_val) > while in patch 3 get_sysfs_long_val-> get_sysfs_str_val, > hdev_get_max_segments is required for qemu setting up I guess so the > dependency is intertwined here. If we use separate patches, then the > last patch will modify the first patch's code, which I think is messy. Indeed. So merge the 2 patches to solve this. Rework the commit message too to mention the introduction of the get_sysfs_long_val() helper. > >> >>> +static long get_sysfs_long_val(struct stat *st, const char *attribute) { >>> +#ifdef CONFIG_LINUX >>> +g_autofree char *str = NULL; >>> +const char *end; >>> +long val; >>> +int ret; >>> + >>> +ret = get_sysfs_str_val(st, attribute, &str); >>> +if (ret < 0) { >>> +return ret; >>> + } >>> + >>> + /* The file is ended with '\n', pass 'end' to accept that. */ >>> +ret = qemu_strtol(str, &end, 10, &val); >>> +if (ret == 0 && end && *end == '\n') { >>> +ret = val; >>> +} >>> +return ret; >>> +#else >>> +return -ENOTSUP; >>> +#endif >>> +} >>> + >>> static int hdev_get_max_segments(int fd, struct stat *st) { >>> int ret; >>> if (S_ISCHR(st->st_mode)) { >> >> >> -- >> Damien Le Moal >> Western Digital Research -- Damien Le Moal Western Digital Research
Re: [PATCH v7 3/8] file-posix: introduce get_sysfs_long_val for the long sysfs attribute
On 2022/08/15 23:25, Sam Li wrote: > Use sysfs attribute files to get the long value of zoned device > information. > > Signed-off-by: Sam Li > Reviewed-by: Hannes Reinecke > Reviewed-by: Stefan Hajnoczi > --- > block/file-posix.c | 27 +++ > 1 file changed, 27 insertions(+) > > diff --git a/block/file-posix.c b/block/file-posix.c > index c07ac4c697..727389488c 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1258,6 +1258,33 @@ static int get_sysfs_zoned_model(struct stat *st, > BlockZoneModel *zoned) { > return 0; > } > > +/* > + * Get zoned device information (chunk_sectors, zoned_append_max_bytes, > + * max_open_zones, max_active_zones) through sysfs attribute files. > + */ The comment here needs to be more generic since this helper is used in patch 2 in hdev_get_max_segments(). So simply something like: /* * Get a sysfs attribute value as a long integer. */ And since this helper is used in patch 2, this patch needs to go before patch 2 (reverse patch 2 and 3 order). > +static long get_sysfs_long_val(struct stat *st, const char *attribute) { > +#ifdef CONFIG_LINUX > +g_autofree char *str = NULL; > +const char *end; > +long val; > +int ret; > + > +ret = get_sysfs_str_val(st, attribute, &str); > +if (ret < 0) { > +return ret; > +} > + > +/* The file is ended with '\n', pass 'end' to accept that. */ > +ret = qemu_strtol(str, &end, 10, &val); > +if (ret == 0 && end && *end == '\n') { > +ret = val; > +} > +return ret; > +#else > +return -ENOTSUP; > +#endif > +} > + > static int hdev_get_max_segments(int fd, struct stat *st) { > int ret; > if (S_ISCHR(st->st_mode)) { -- Damien Le Moal Western Digital Research
Re: [PATCH v7 4/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
@@ static const cmdinfo_t flush_cmd = { > .oneline= "flush all in-core file state to disk", > }; > > +static int zone_report_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset; > +unsigned int nr_zones; > + > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +nr_zones = cvtnum(argv[optind]); > + > +g_autofree BlockZoneDescriptor *zones = NULL; > +zones = g_new(BlockZoneDescriptor, nr_zones); > +ret = blk_zone_report(blk, offset, &nr_zones, zones); > +if (ret < 0) { > +printf("zone report failed: %s\n", strerror(-ret)); > +} else { > +for (int i = 0; i < nr_zones; ++i) { > +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " > + "cap"" 0x%" PRIx64 ",wptr 0x%" PRIx64 ", " > + "zcond:%u, [type: %u]\n", > + zones[i].start, zones[i].length, zones[i].cap, > zones[i].wp, > + zones[i].cond, zones[i].type); > +} > +} > +return ret; > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "zrp", > +.cfunc = zone_report_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset number", > +.oneline = "report zone information", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); > +if (ret < 0) { > +printf("zone open failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "zo", > +.cfunc = zone_open_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); > +if (ret < 0) { > +printf("zone close failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "zc", > +.cfunc = zone_close_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); > +if (ret < 0) { > +printf("zone finish failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "zf", > +.cfunc = zone_finish_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); > +if (ret < 0) { > +printf("zone reset failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "zrs", > +.cfunc = zone_reset_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); -- Damien Le Moal Western Digital Research
Re: [PATCH v7 4/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
int64_t offset, unsigned int *nr_zones, >>> +BlockZoneDescriptor *zones); >>> +int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, >>> BlockZoneOp op, >>> +int64_t offset, int64_t len); >>> + >>> /* removable device specific */ >>> bool (*bdrv_is_inserted)(BlockDriverState *bs); >>> void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag); >>> @@ -828,6 +833,21 @@ typedef struct BlockLimits { >>> >>> /* device zone model */ >>> BlockZoneModel zoned; >>> + >>> +/* zone size expressed in 512-byte sectors */ >>> +uint32_t zone_sectors; >>> + >>> +/* total number of zones */ >>> +unsigned int nr_zones; >>> + >>> +/* maximum size in bytes of a zone append write operation */ >>> +int64_t zone_append_max_bytes; >>> + >>> +/* maximum number of open zones */ >>> +int64_t max_open_zones; >>> + >>> +/* maximum number of active zones */ >>> +int64_t max_active_zones; >>> } BlockLimits; >>> >>> typedef struct BdrvOpBlocker BdrvOpBlocker; >>> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h >>> index 21fc10c4c9..3d26929cdd 100644 >>> --- a/include/block/raw-aio.h >>> +++ b/include/block/raw-aio.h >>> @@ -29,6 +29,8 @@ >>> #define QEMU_AIO_WRITE_ZEROES 0x0020 >>> #define QEMU_AIO_COPY_RANGE 0x0040 >>> #define QEMU_AIO_TRUNCATE 0x0080 >>> +#define QEMU_AIO_ZONE_REPORT 0x0100 >>> +#define QEMU_AIO_ZONE_MGMT0x0200 >>> #define QEMU_AIO_TYPE_MASK \ >>> (QEMU_AIO_READ | \ >>> QEMU_AIO_WRITE | \ >>> @@ -37,7 +39,9 @@ >>> QEMU_AIO_DISCARD | \ >>> QEMU_AIO_WRITE_ZEROES | \ >>> QEMU_AIO_COPY_RANGE | \ >>> - QEMU_AIO_TRUNCATE) >>> + QEMU_AIO_TRUNCATE | \ >>> + QEMU_AIO_ZONE_REPORT | \ >>> + QEMU_AIO_ZONE_MGMT) >>> >>> /* AIO flags */ >>> #define QEMU_AIO_MISALIGNED 0x1000 >>> diff --git a/include/sysemu/block-backend-io.h >>> b/include/sysemu/block-backend-io.h >>> index 50f5aa2e07..6e7df1d93b 100644 >>> --- a/include/sysemu/block-backend-io.h >>> +++ b/include/sysemu/block-backend-io.h >>> @@ -156,6 +156,12 @@ int generated_co_wrapper >>> blk_pwrite_zeroes(BlockBackend *blk, int64_t offset, >>> int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset, >>>int64_t bytes, BdrvRequestFlags >>> flags); >>> >>> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset, >>> +unsigned int *nr_zones, >>> +BlockZoneDescriptor *zones); >>> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >>> + int64_t offset, int64_t len); >>> + >>> int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset, >>>int64_t bytes); >>> int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset, >>> diff --git a/meson.build b/meson.build >>> index 294e9a8f32..c3219b0e87 100644 >>> --- a/meson.build >>> +++ b/meson.build >>> @@ -1883,6 +1883,7 @@ config_host_data.set('CONFIG_REPLICATION', >>> get_option('live_block_migration').al >>> # has_header >>> config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h')) >>> config_host_data.set('CONFIG_LINUX_MAGIC_H', >>> cc.has_header('linux/magic.h')) >>> +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h')) >>> config_host_data.set('CONFIG_VALGRIND_H', >>> cc.has_header('valgrind/valgrind.h')) >>> config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h')) >>> config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h')) >>> diff --git a/qapi/block-core.json b/qapi/block-core.json >>> index 2173e7734a..c6bbb7a037 100644 >>> --- a/qapi/block-core.json >>> +++ b/qapi/block-core.json >>> @@ -2942,6 +2942,7 @@ >>> # @compress: Since 5.0 >>> # @copy-before-write: Since 6.2 >>> # @snapshot-access: Since 7.0 >>> +# @zoned_host_device: Since 7.2 >>> # >>> # Since: 2.9 >>> ## >>> @@ -2955,7 +2956,8 @@ >>> 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', >>> 'parallels', >>> 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', >>> { 'name': 'replication', 'if': 'CONFIG_REPLICATION' }, >>> -'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } >>> +'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', >>> +{ 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] } >>> >>> ## >>> # @BlockdevOptionsFile: >>> @@ -4329,7 +4331,9 @@ >>>'vhdx': 'BlockdevOptionsGenericFormat', >>>'vmdk': 'BlockdevOptionsGenericCOWFormat', >>>'vpc':'BlockdevOptionsGenericFormat', >>> - 'vvfat': 'BlockdevOptionsVVFAT' >>> + 'vvfat': 'BlockdevOptionsVVFAT', >>> + 'zoned_host_device': { 'type': 'BlockdevOptionsFile', >>> + 'if': 'CONFIG_BLKZONED' } >>>} } >>> >>> ## >>> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c >>> index 952dc940f1..687c3a624c 100644 >>> --- a/qemu-io-cmds.c >>> +++ b/qemu-io-cmds.c >>> @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = { >>> .oneline= "flush all in-core file state to disk", >>> }; >>> >>> +static int zone_report_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +int ret; >>> +int64_t offset; >>> +unsigned int nr_zones; >>> + >>> +++optind; >>> +offset = cvtnum(argv[optind]); >>> +++optind; >>> +nr_zones = cvtnum(argv[optind]); >>> + >>> +g_autofree BlockZoneDescriptor *zones = NULL; >>> +zones = g_new(BlockZoneDescriptor, nr_zones); >>> +ret = blk_zone_report(blk, offset, &nr_zones, zones); >>> +if (ret < 0) { >>> +printf("zone report failed: %s\n", strerror(-ret)); >>> +} else { >>> +for (int i = 0; i < nr_zones; ++i) { >>> +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " >>> + "cap"" 0x%" PRIx64 ",wptr 0x%" PRIx64 ", " >>> + "zcond:%u, [type: %u]\n", >>> + zones[i].start, zones[i].length, zones[i].cap, >>> zones[i].wp, >>> + zones[i].cond, zones[i].type); >>> +} >>> +} >>> +return ret; >>> +} >>> + >>> +static const cmdinfo_t zone_report_cmd = { >>> +.name = "zone_report", >>> +.altname = "zrp", >>> +.cfunc = zone_report_f, >>> +.argmin = 2, >>> +.argmax = 2, >>> +.args = "offset number", >>> +.oneline = "report zone information", >>> +}; >>> + >>> +static int zone_open_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +int ret; >>> +int64_t offset, len; >>> +++optind; >>> +offset = cvtnum(argv[optind]); >>> +++optind; >>> +len = cvtnum(argv[optind]); >>> +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); >>> +if (ret < 0) { >>> +printf("zone open failed: %s\n", strerror(-ret)); >>> +} >>> +return ret; >>> +} >>> + >>> +static const cmdinfo_t zone_open_cmd = { >>> +.name = "zone_open", >>> +.altname = "zo", >>> +.cfunc = zone_open_f, >>> +.argmin = 2, >>> +.argmax = 2, >>> +.args = "offset len", >>> +.oneline = "explicit open a range of zones in zone block device", >>> +}; >>> + >>> +static int zone_close_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +int ret; >>> +int64_t offset, len; >>> +++optind; >>> +offset = cvtnum(argv[optind]); >>> +++optind; >>> +len = cvtnum(argv[optind]); >>> +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); >>> +if (ret < 0) { >>> +printf("zone close failed: %s\n", strerror(-ret)); >>> +} >>> +return ret; >>> +} >>> + >>> +static const cmdinfo_t zone_close_cmd = { >>> +.name = "zone_close", >>> +.altname = "zc", >>> +.cfunc = zone_close_f, >>> +.argmin = 2, >>> +.argmax = 2, >>> +.args = "offset len", >>> +.oneline = "close a range of zones in zone block device", >>> +}; >>> + >>> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +int ret; >>> +int64_t offset, len; >>> +++optind; >>> +offset = cvtnum(argv[optind]); >>> +++optind; >>> +len = cvtnum(argv[optind]); >>> +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); >>> +if (ret < 0) { >>> +printf("zone finish failed: %s\n", strerror(-ret)); >>> +} >>> +return ret; >>> +} >>> + >>> +static const cmdinfo_t zone_finish_cmd = { >>> +.name = "zone_finish", >>> +.altname = "zf", >>> +.cfunc = zone_finish_f, >>> +.argmin = 2, >>> +.argmax = 2, >>> +.args = "offset len", >>> +.oneline = "finish a range of zones in zone block device", >>> +}; >>> + >>> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +int ret; >>> +int64_t offset, len; >>> +++optind; >>> +offset = cvtnum(argv[optind]); >>> +++optind; >>> +len = cvtnum(argv[optind]); >>> +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); >>> +if (ret < 0) { >>> +printf("zone reset failed: %s\n", strerror(-ret)); >>> +} >>> +return ret; >>> +} >>> + >>> +static const cmdinfo_t zone_reset_cmd = { >>> +.name = "zone_reset", >>> +.altname = "zrs", >>> +.cfunc = zone_reset_f, >>> +.argmin = 2, >>> +.argmax = 2, >>> +.args = "offset len", >>> +.oneline = "reset a zone write pointer in zone block device", >>> +}; >>> + >>> static int truncate_f(BlockBackend *blk, int argc, char **argv); >>> static const cmdinfo_t truncate_cmd = { >>> .name = "truncate", >>> @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) >>> init_qemuio_commands(void) >>> qemuio_add_command(&aio_write_cmd); >>> qemuio_add_command(&aio_flush_cmd); >>> qemuio_add_command(&flush_cmd); >>> +qemuio_add_command(&zone_report_cmd); >>> +qemuio_add_command(&zone_open_cmd); >>> +qemuio_add_command(&zone_close_cmd); >>> +qemuio_add_command(&zone_finish_cmd); >>> +qemuio_add_command(&zone_reset_cmd); >>> qemuio_add_command(&truncate_cmd); >>> qemuio_add_command(&length_cmd); >>> qemuio_add_command(&info_cmd); >>> -- >>> 2.37.1 >>> -- Damien Le Moal Western Digital Research
Re: [PATCH v7 4/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
On 2022/08/24 16:46, Damien Le Moal wrote: > On 2022/08/22 21:12, Sam Li wrote: >> Stefan Hajnoczi 于2022年8月23日周二 08:49写道: >>> >>> On Tue, Aug 16, 2022 at 02:25:18PM +0800, Sam Li wrote: [...]>>>> +blkz = (struct blk_zone *)(rep + 1); >>>> +while (n < nrz) { >>>> +memset(rep, 0, rep_size); >>>> +rep->sector = sector; >>>> +rep->nr_zones = nrz - n; >>>> + >>>> +ret = ioctl(fd, BLKREPORTZONE, rep); >>> >>> Does this ioctl() need "do { ... } while (ret == -1 && errno == EINTR)"? >> >> No? We discussed this before. I guess even EINTR should be propagated >> back to the guest. Maybe Damien can talk more about why. > > In the kernel, completion of zone management IO requests are waited for using > wait_for_completion_io() which uses TASK_UNINTERRUPTIBLE. So a signal will not > abort anything. So I do not think that the do { } while() loop is necessary. Note: I do not think the loop to be necessary, but it will not hurt :) -- Damien Le Moal Western Digital Research
Re: [PATCH v8 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
t; +.bdrv_co_zone_report = raw_co_zone_report, >>>>>> +.bdrv_co_zone_mgmt = raw_co_zone_mgmt, >>>>>> +}; >>>>> >>>>> Differences to bdrv_host_device: >>>>> >>>>> * .bdrv_parse_filename is not set >>>>> >>>>> * .bdrv_co_ioctl is not set >>>>> >>>>> * .bdrv_co_zone_report and .bdrv_co_zone_mgmt are set >>>> >>>> As Stefan mentioned, zoned_host_device is a new driver that doesn't >>>> work with string filenames. .bdrv_parse_filename() helps legacy >>>> drivers strip the optional protocol prefix off the filename and no use >>>> here. Therefore it can be dropped. >>> >>> Makes sense. >>> >>>> .bdrv_co_ioctl is set actually. >>> >>> You're right; I diffed the two and misread the result. >>> >>>> Zoned_host_device is basically host_device + zone operations. It >>>> serves for a simple purpose: if the host device is zoned, register >>>> zoned_host_device driver; else, register host_device. >>> >>> Why would I ever want to use host_device instead of zoned_host_device? >>> >>> To answer this question, we need to understand how their behavior >>> differs. >>> >>> We can ignore the legacy protocol prefix / string filename part. >>> >>> All that's left seems to be "if the host device is zoned, then using the >>> zoned_host_device driver gets you the zoned features, whereas using the >>> host_device driver doesn't". What am I missing? >> >> I think that's basically what users need to know about. > > Now answer my previous question, please: why would I ever want to use > host_device instead of zoned_host_device? > > Or in other words, why would I ever want to present a zoned host device > to a guest as non-zoned device? > >>>>> Notably common is .bdrv_file_open = hdev_open. What happens when you >>>>> try to create a zoned_host_device where the @filename argument is not in >>>>> fact a zoned device? >>>> >>>> If the device is a regular block device, QEMU will still open the >>>> device. For instance, I use a loopback device to test zone_report in >>>> qemu-io. It returns ENOTTY which indicates Inappropriate ioctl for the >>>> device. Meanwhile, if using a regular block device when emulation a >>>> zoned device on a guest os, the best case is that the guest can boot >>>> but has no emulated block device. In some cases, QEMU just terminates >>>> because the block device has not met the alignment requirements. >>> >>> I'm not sure I understand all of this. I'm also not sure I have to :) >> >> Maybe I didn't explain it very well. Which part would you like to know >> more about? > > Let's try more specific questions. Say I configure a zoned_host_device > backed by a host device that isn't zoned. > > 1. Is this configuration accepted? If we assume we have the special zoned_host_device driver, with the associated command line zoned_host_device option explicitly calling for it, then no, I do not think this should be allowed at all and an error should be returned on startup. That would be consistent with the fact that the options zoned_host_device and host_device are different to make sure we can check that the user knows what he/she/them is doing. If we have only host_device as a setup option and driver, the driver methods can be trivially adjusted to do the right thing based on the device type (i.e. zoned vs regular/not zoned). However, that would prevent an interesting future extension of this work to implement a full zone emulation on top of a regular (not zoned) host block device. With this in mind, we currently have the following: 1) host_device option -> accept only regular non-zoned host block devices 2) zoned_host_device option -> accept only zoned host block devices And in the future, we can have: 1) host_device option -> accept only regular non-zoned host block devices 2) zoned_host_device option -> accept any host block device type a) Use native zone kernel API for zoned host block devices b) Use full zone emulation for regular host block devices But sure, internally, we could have a single driver structure with methods adjusted to do the correct thing based on the device type and option specified. Having a 1:1 mapping between the driver name and driver structure does clarify things I think (even though there are indeed a lot of methods that are identical). > > 2. Would a guest work as long as it doesn't touch this device? > > 3. Would a guest using this device work as long as it uses no zoned >features? > > 4. What happens when a guest tries to use zoned features? > > [...] > -- Damien Le Moal Western Digital Research
Re: [PATCH 2/2] virtio-blk: add zoned storage emulation for zoned devices
LK_Z_HM: >>> +blkcfg.zoned.model = VIRTIO_BLK_Z_HM; >>> +virtio_stl_p(vdev, &blkcfg.zoned.zone_sectors, > >>> + state->bl.zone_sectors); >>> +virtio_stl_p(vdev, &blkcfg.zoned.max_active_zones, >>> + state->bl.max_active_zones); >>> +virtio_stl_p(vdev, &blkcfg.zoned.max_open_zones, >>> + state->bl.max_open_zones); >>> +virtio_stl_p(vdev, &blkcfg.zoned.write_granularity, blk_size); >>> +virtio_stl_p(vdev, &blkcfg.zoned.max_append_sectors, >>> + state->bl.max_append_sectors); >>> +break; >>> +case BLK_Z_HA: >>> +blkcfg.zoned.model = VIRTIO_BLK_Z_HA; >> >> The block limits aren't relevant to host-aware zoned devices? > > Yes, the HA devices are seen as non-zoned device and the driver just > ignore all other fields in zoned. This is incorrect. HA devices may be used as regular block devices (not zoned) by the guest, but that is the guest decision to make. qemu handling of the device should still have the proper zone information for the drive and be ready to accept any zone operation for it. > > +\item if the driver that can not support zoned devices reads > +VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the driver > +MAY handle the device as a non-zoned device. In this case, the > +driver SHOULD ignore all other fields in \field{zoned}. > +\end{itemize} The driver here refers to the guest virtio driver. In the spec, qemu zone block emulation/handling is called "the device". Confusing :) > >> >>> +break; >>> +default: >>> +blkcfg.zoned.model = VIRTIO_BLK_Z_NONE; >>> +virtio_error(vdev, "Invalid zoned model %x \n", >>> (int)state->bl.zoned); >>> +break; >>> +} >>> +} >>> +#endif >>> memcpy(config, &blkcfg, s->config_size); >>> } >>> >>> @@ -995,6 +1311,7 @@ static uint64_t virtio_blk_get_features(VirtIODevice >>> *vdev, uint64_t features, >>> Error **errp) >>> { >>> VirtIOBlock *s = VIRTIO_BLK(vdev); >>> +BlockDriverState *state = blk_bs(s->blk); >>> >>> /* Firstly sync all virtio-blk possible supported features */ >>> features |= s->host_features; >>> @@ -1003,6 +1320,12 @@ static uint64_t virtio_blk_get_features(VirtIODevice >>> *vdev, uint64_t features, >>> virtio_add_feature(&features, VIRTIO_BLK_F_GEOMETRY); >>> virtio_add_feature(&features, VIRTIO_BLK_F_TOPOLOGY); >>> virtio_add_feature(&features, VIRTIO_BLK_F_BLK_SIZE); >>> +if (state->bl.zoned != BLK_Z_NONE) { >>> +virtio_add_feature(&s->host_features, VIRTIO_BLK_F_ZONED); >>> +if (state->bl.zoned == BLK_Z_HM) { >>> +virtio_clear_feature(&features, VIRTIO_BLK_F_DISCARD); >> >> Why is features modified here but s->host_features is modified in the >> line above? > > Because F_DISCARD should not be offered by the driver when the device > offers F_ZONED with the HM model. > >> >>> +} >>> +} >> >> This detects VIRTIO_BLK_F_ZONED based on the BlockDriverState... > (This part can be removed.) > >> >>> if (virtio_has_feature(features, VIRTIO_F_VERSION_1)) { >>> if (virtio_has_feature(s->host_features, VIRTIO_BLK_F_SCSI)) { >>> error_setg(errp, "Please set scsi=off for virtio-blk devices >>> in order to use virtio 1.0"); >>> @@ -1286,6 +1609,9 @@ static Property virtio_blk_properties[] = { >>> #ifdef __linux__ >>> DEFINE_PROP_BIT64("scsi", VirtIOBlock, host_features, >>>VIRTIO_BLK_F_SCSI, false), >>> +#endif >>> +#ifdef CONFIG_BLKZONED >>> +DEFINE_PROP_BIT64("zoned", VirtIOBlock, host_features, >>> VIRTIO_BLK_F_ZONED, true), >>> #endif >> >> ...but this allows users to enable/disable the flag from the >> command-line. >> >> I think DEFINE_PROP_BIT64() can be removed, but please check that the >> config space size is still correct. It may be necessary to move the >> bs->bl.zoned check into virtio_blk_device_realize() and set >> the VIRTIO_BLK_F_ZONED s->host_features bit there instead. That will >> allow the s->config_size calculation to work correctly. > > Ok, it works. Thanks! > > Additionally, the DEFINE_PROP_BIT here is to declare that the supports > zones. Then the virtio-blk driver will not need to look at that > feature. So the former part detecting the F_ZONED feature based on > BlockDriverState can be removed. If removing this macro, then the > virio-blk driver must set the F_ZONED feature by itself. -- Damien Le Moal Western Digital Research
Re: [PATCH v10 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
BlockCompletionFunc *cb, void *opaque); >> +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >> + int64_t offset, int64_t len, >> + BlockCompletionFunc *cb, void *opaque); >> BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t >> bytes, >> BlockCompletionFunc *cb, void *opaque); >> void blk_aio_cancel_async(BlockAIOCB *acb); >> @@ -156,6 +162,17 @@ int generated_co_wrapper blk_pwrite_zeroes(BlockBackend >> *blk, int64_t offset, >> int coroutine_fn blk_co_pwrite_zeroes(BlockBackend *blk, int64_t offset, >>int64_t bytes, BdrvRequestFlags >> flags); >> >> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset, >> +unsigned int *nr_zones, >> +BlockZoneDescriptor *zones); >> +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset, >> + unsigned int *nr_zones, >> + BlockZoneDescriptor *zones); >> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >> + int64_t offset, int64_t len); >> +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >> + int64_t offset, int64_t len); >> + >> int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset, >>int64_t bytes); >> int coroutine_fn blk_co_pdiscard(BlockBackend *blk, int64_t offset, >> diff --git a/meson.build b/meson.build >> index 63cfb844cf..9a797388ad 100644 >> --- a/meson.build >> +++ b/meson.build >> @@ -1882,6 +1882,7 @@ config_host_data.set('CONFIG_REPLICATION', >> get_option('replication').allowed()) >> # has_header >> config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h')) >> config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h')) >> +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h')) >> config_host_data.set('CONFIG_VALGRIND_H', >> cc.has_header('valgrind/valgrind.h')) >> config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h')) >> config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h')) >> @@ -1975,6 +1976,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID', >> config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM', >> cc.has_member('struct stat', 'st_atim', >> prefix: '#include ')) >> +config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY', >> + cc.has_member('struct blk_zone', 'capacity', >> + prefix: '#include ')) >> >> # has_type >> config_host_data.set('CONFIG_IOVEC', >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index f21fa235f2..ee87c1df8a 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -2942,6 +2942,7 @@ >> # @compress: Since 5.0 >> # @copy-before-write: Since 6.2 >> # @snapshot-access: Since 7.0 >> +# @zoned_host_device: Since 7.2 >> # >> # Since: 2.9 >> ## >> @@ -2955,7 +2956,8 @@ >> 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', >> 'parallels', >> 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', >> { 'name': 'replication', 'if': 'CONFIG_REPLICATION' }, >> -'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } >> +'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', >> +{ 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] } >> >> ## >> # @BlockdevOptionsFile: >> @@ -4329,7 +4331,9 @@ >>'vhdx': 'BlockdevOptionsGenericFormat', >>'vmdk': 'BlockdevOptionsGenericCOWFormat', >>'vpc':'BlockdevOptionsGenericFormat', >> - 'vvfat': 'BlockdevOptionsVVFAT' >> + 'vvfat': 'BlockdevOptionsVVFAT', >> + 'zoned_host_device': { 'type': 'BlockdevOptionsFile', >> + 'if': 'CONFIG_BLKZONED' } >>} } >> >> ## >> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c >> index 952dc940f1..e56c8d1c30 100644 >> --- a/qemu-io-cmds.c >> +++ b/qemu-io-cmds.c >> @@ -1712,6 +1712,149 @@ static const cmdinfo_t flush_cmd = { >> .oneline= "flush all in-core file state to disk", >> }; >> >> +static inline int64_t tosector(int64_t bytes) { >> +return bytes >> BDRV_SECTOR_BITS; >> +} >> + >> +static int zone_report_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset; >> +unsigned int nr_zones; >> + >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +nr_zones = cvtnum(argv[optind]); >> + >> +g_autofree BlockZoneDescriptor *zones = NULL; >> +zones = g_new(BlockZoneDescriptor, nr_zones); >> +ret = blk_zone_report(blk, offset, &nr_zones, zones); >> +if (ret < 0) { >> +printf("zone report failed: %s\n", strerror(-ret)); >> +} else { >> +for (int i = 0; i < nr_zones; ++i) { >> +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " >> + "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", " >> + "zcond:%u, [type: %u]\n", >> +tosector(zones[i].start), tosector(zones[i].length), >> +tosector(zones[i].cap), tosector(zones[i].wp), >> +zones[i].cond, zones[i].type); >> +} >> +} >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_report_cmd = { >> +.name = "zone_report", >> +.altname = "zrp", >> +.cfunc = zone_report_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset number", >> +.oneline = "report zone information", >> +}; >> + >> +static int zone_open_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset, len; >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); >> +if (ret < 0) { >> +printf("zone open failed: %s\n", strerror(-ret)); >> +} >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_open_cmd = { >> +.name = "zone_open", >> +.altname = "zo", >> +.cfunc = zone_open_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset len", >> +.oneline = "explicit open a range of zones in zone block device", >> +}; >> + >> +static int zone_close_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset, len; >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); >> +if (ret < 0) { >> +printf("zone close failed: %s\n", strerror(-ret)); >> +} >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_close_cmd = { >> +.name = "zone_close", >> +.altname = "zc", >> +.cfunc = zone_close_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset len", >> +.oneline = "close a range of zones in zone block device", >> +}; >> + >> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset, len; >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); >> +if (ret < 0) { >> +printf("zone finish failed: %s\n", strerror(-ret)); >> +} >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_finish_cmd = { >> +.name = "zone_finish", >> +.altname = "zf", >> +.cfunc = zone_finish_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset len", >> +.oneline = "finish a range of zones in zone block device", >> +}; >> + >> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) >> +{ >> +int ret; >> +int64_t offset, len; >> +++optind; >> +offset = cvtnum(argv[optind]); >> +++optind; >> +len = cvtnum(argv[optind]); >> +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); >> +if (ret < 0) { >> +printf("zone reset failed: %s\n", strerror(-ret)); >> +} >> +return ret; >> +} >> + >> +static const cmdinfo_t zone_reset_cmd = { >> +.name = "zone_reset", >> +.altname = "zrs", >> +.cfunc = zone_reset_f, >> +.argmin = 2, >> +.argmax = 2, >> +.args = "offset len", >> +.oneline = "reset a zone write pointer in zone block device", >> +}; >> + >> static int truncate_f(BlockBackend *blk, int argc, char **argv); >> static const cmdinfo_t truncate_cmd = { >> .name = "truncate", >> @@ -2504,6 +2647,11 @@ static void __attribute((constructor)) >> init_qemuio_commands(void) >> qemuio_add_command(&aio_write_cmd); >> qemuio_add_command(&aio_flush_cmd); >> qemuio_add_command(&flush_cmd); >> +qemuio_add_command(&zone_report_cmd); >> +qemuio_add_command(&zone_open_cmd); >> +qemuio_add_command(&zone_close_cmd); >> +qemuio_add_command(&zone_finish_cmd); >> +qemuio_add_command(&zone_reset_cmd); >> qemuio_add_command(&truncate_cmd); >> qemuio_add_command(&length_cmd); >> qemuio_add_command(&info_cmd); >> -- >> 2.37.3 >> -- Damien Le Moal Western Digital Research
Re: [PATCH v2 1/2] file-posix: add the tracking of the zones wp
> zone_op_name = "BLKRESETZONE"; > zone_op = BLKRESETZONE; > break; > +case BLK_ZO_RESET_ALL: > +zone_op_name = "BLKRESETZONE"; > +zone_op = BLKRESETZONE; > +is_all = true; > +break; This change seems unrelated to the wp tracking. Different patch ? > default: > g_assert_not_reached(); > } > diff --git a/include/block/block-common.h b/include/block/block-common.h > index 8efb6b0c43..43bfc484eb 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -58,6 +58,7 @@ typedef enum BlockZoneOp { > BLK_ZO_CLOSE, > BLK_ZO_FINISH, > BLK_ZO_RESET, > +BLK_ZO_RESET_ALL, same here. Adding reset all support should be a different patch. > } BlockZoneOp; > > typedef enum BlockZoneModel { > @@ -96,6 +97,14 @@ typedef struct BlockZoneDescriptor { > BlockZoneCondition cond; > } BlockZoneDescriptor; > > +/* > + * Track write pointers of a zone in bytes. > + */ > +typedef struct BlockZoneWps { > +QemuMutex lock; > +uint64_t wp[]; > +} BlockZoneWps; > + > typedef struct BlockDriverInfo { > /* in bytes, 0 if irrelevant */ > int cluster_size; > @@ -209,6 +218,13 @@ typedef enum { > #define BDRV_SECTOR_BITS 9 > #define BDRV_SECTOR_SIZE (1ULL << BDRV_SECTOR_BITS) > > +/* > + * Get the first most significant bit of WP. If it is zero, then > + * the zone type is SWR. > + */ > +#define BDRV_ZT_IS_SWR(WP)((WP & 0x8000) == 0) ? (true) : \ > + (false) Simplify: #define BDRV_ZT_IS_SWR(wp) (!((wp) & (1ULL << 63)) But since this must be used for both SWR and SWP zones, I would reverse this into: #define BDRV_ZONE_IS_CONV(wp) ((wp) & (1ULL << 63)) Which is a lot simpler. > + > #define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \ > INT_MAX >> BDRV_SECTOR_BITS) > #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS) > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 37dddc603c..59c2d1316d 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -857,6 +857,11 @@ typedef struct BlockLimits { > > /* device capacity expressed in bytes */ > int64_t capacity; > + > +/* array of write pointers' location of each zone in the zoned device. */ > +BlockZoneWps *wps; > + > +int64_t write_granularity; > } BlockLimits; > > typedef struct BdrvOpBlocker BdrvOpBlocker; > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h > index 3d26929cdd..f13cc1887b 100644 > --- a/include/block/raw-aio.h > +++ b/include/block/raw-aio.h > @@ -31,6 +31,7 @@ > #define QEMU_AIO_TRUNCATE 0x0080 > #define QEMU_AIO_ZONE_REPORT 0x0100 > #define QEMU_AIO_ZONE_MGMT0x0200 > +#define QEMU_AIO_ZONE_APPEND 0x0400 > #define QEMU_AIO_TYPE_MASK \ > (QEMU_AIO_READ | \ > QEMU_AIO_WRITE | \ > @@ -41,7 +42,8 @@ > QEMU_AIO_COPY_RANGE | \ > QEMU_AIO_TRUNCATE | \ > QEMU_AIO_ZONE_REPORT | \ > - QEMU_AIO_ZONE_MGMT) > + QEMU_AIO_ZONE_MGMT | \ > + QEMU_AIO_ZONE_APPEND) This should be introduced in patch 2. This patch should be only about zone wp tracking with regular writes and zone management ops. The second patch can implement zone append emulation thanks to this patch. So separate. > > /* AIO flags */ > #define QEMU_AIO_MISALIGNED 0x1000 -- Damien Le Moal Western Digital Research
Re: [PATCH v2 2/2] block: introduce zone append write for zoned devices
> +} > + > +out: > +qemu_iovec_destroy(&qiov); > +qemu_io_free(buf); > +return ret; > +} > + > +static const cmdinfo_t zone_append_cmd = { > +.name = "zone_append", > +.altname = "zap", > +.cfunc = zone_append_f, > +.argmin = 3, > +.argmax = 3, > +.args = "offset len [len..]", > +.oneline = "append write a number of bytes at a specified offset", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2652,6 +2713,7 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&zone_close_cmd); > qemuio_add_command(&zone_finish_cmd); > qemuio_add_command(&zone_reset_cmd); > +qemuio_add_command(&zone_append_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); > diff --git a/tests/qemu-iotests/tests/zoned.out > b/tests/qemu-iotests/tests/zoned.out > index 0c8f96deb9..b3b139b4ec 100644 > --- a/tests/qemu-iotests/tests/zoned.out > +++ b/tests/qemu-iotests/tests/zoned.out > @@ -50,4 +50,11 @@ start: 0x8, len 0x8, cap 0x8, wptr 0x10, > zcond:14, [type: 2] > (5) resetting the second zone > After resetting a zone: > start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type: 2] > + > + > +(6) append write > +After appending the first zone: > +start: 0x0, len 0x8, cap 0x8, wptr 0x18, zcond:2, [type: 2] > +After appending the second zone: > +start: 0x8, len 0x8, cap 0x8, wptr 0x80018, zcond:2, [type: 2] > *** done > diff --git a/tests/qemu-iotests/tests/zoned.sh > b/tests/qemu-iotests/tests/zoned.sh > index fced0194c5..888711eef2 100755 > --- a/tests/qemu-iotests/tests/zoned.sh > +++ b/tests/qemu-iotests/tests/zoned.sh > @@ -79,6 +79,15 @@ echo "(5) resetting the second zone" > sudo $QEMU_IO $IMG -c "zrs 268435456 268435456" > echo "After resetting a zone:" > sudo $QEMU_IO $IMG -c "zrp 268435456 1" > +echo > +echo > +echo "(6) append write" # physical block size of the device is 4096 > +sudo $QEMU_IO $IMG -c "zap 0 0x1000 0x2000" > +echo "After appending the first zone:" > +sudo $QEMU_IO $IMG -c "zrp 0 1" > +sudo $QEMU_IO $IMG -c "zap 268435456 0x1000 0x2000" > +echo "After appending the second zone:" > +sudo $QEMU_IO $IMG -c "zrp 268435456 1" > > # success, all done > echo "*** done" -- Damien Le Moal Western Digital Research
Re: [PATCH v2 1/2] file-posix: add the tracking of the zones wp
On 10/5/22 10:44, Damien Le Moal wrote: > On 9/29/22 18:31, Sam Li wrote: >> Since Linux doesn't have a user API to issue zone append operations to >> zoned devices from user space, the file-posix driver is modified to add >> zone append emulation using regular writes. To do this, the file-posix >> driver tracks the wp location of all zones of the device. It uses an >> array of uint64_t. The most significant bit of each wp location indicates >> if the zone type is sequential write required. >> >> The zones wp can be changed due to the following operations issued: >> - zone reset: change the wp to the start offset of that zone >> - zone finish: change to the end location of that zone >> - write to a zone >> - zone append >> >> Signed-off-by: Sam Li >> --- >> block/file-posix.c | 138 ++- >> include/block/block-common.h | 16 >> include/block/block_int-common.h | 5 ++ >> include/block/raw-aio.h | 4 +- >> 4 files changed, 159 insertions(+), 4 deletions(-) >> >> diff --git a/block/file-posix.c b/block/file-posix.c >> index 73656d87f2..33e81ac112 100755 >> --- a/block/file-posix.c >> +++ b/block/file-posix.c >> @@ -206,6 +206,8 @@ typedef struct RawPosixAIOData { >> struct { >> struct iovec *iov; >> int niov; >> +int64_t *append_sector; >> +BlockZoneWps *wps; >> } io; >> struct { >> uint64_t cmd; >> @@ -1332,6 +1334,59 @@ static int hdev_get_max_segments(int fd, struct stat >> *st) { >> #endif >> } >> >> +#if defined(CONFIG_BLKZONED) >> +static int report_zone_wp(int64_t offset, int fd, BlockZoneWps *wps, >> + unsigned int nrz) { > > Maybe rename this to get_zones_wp() ? > >> +struct blk_zone *blkz; >> +int64_t rep_size; >> +int64_t sector = offset >> BDRV_SECTOR_BITS; >> +int ret, n = 0, i = 0; >> + >> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct >> blk_zone); >> +g_autofree struct blk_zone_report *rep = NULL; > > To be cleaner, move this declaration above with the others ? > >> +rep = g_malloc(rep_size); >> + >> +blkz = (struct blk_zone *)(rep + 1); >> +while (n < nrz) { >> +memset(rep, 0, rep_size); >> +rep->sector = sector; >> +rep->nr_zones = nrz - n; >> + >> +do { >> +ret = ioctl(fd, BLKREPORTZONE, rep); >> +} while (ret != 0 && errno == EINTR); >> +if (ret != 0) { >> +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d", >> +fd, offset, errno); >> +return -errno; >> +} >> + >> +if (!rep->nr_zones) { >> +break; >> +} >> + >> +for (i = 0; i < rep->nr_zones; i++, n++) { >> +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS; >> +sector = blkz[i].start + blkz[i].len; >> + >> +/* >> + * In the wp tracking, it only cares if the zone type is >> sequential >> + * writes required so that the wp can advance to the right >> location. > > Or sequential write preferred (host aware case) > >> + * Instead of the type of zone_type which is an 8-bit unsigned >> + * integer, use the first most significant bits of the wp >> location >> + * to indicate the zone type: 0 for SWR zones and 1 for the >> + * others. >> + */ >> +if (!(blkz[i].type & BLK_ZONE_TYPE_SEQWRITE_REQ)) { > > This should be: > > if (blkz[i].type != BLK_ZONE_TYPE_CONVENTIONAL) { > > Note that the type field is not a bit-field. So you must compare values > instead of doing bit operations. > >> +wps->wp[i] += (uint64_t)1 << 63; > > You can simplify this: > > wps->wp[i] |= 1ULL << 63; > > Overall, I would rewrite this like this: > > for (i = 0; i < rep->nr_zones; i++, n++) { > /* > * The wp tracking cares only about sequential write required > * and sequential write preferred zones so that the wp can > * advance to the right location. > * Use the most significant bit of the wp location > * to indicate the zone type: 0 for SWR zone
Re: [PATCH v2 1/2] file-posix: add the tracking of the zones wp
On 10/5/22 10:44, Damien Le Moal wrote: > On 9/29/22 18:31, Sam Li wrote: >> Since Linux doesn't have a user API to issue zone append operations to >> zoned devices from user space, the file-posix driver is modified to add >> zone append emulation using regular writes. To do this, the file-posix >> driver tracks the wp location of all zones of the device. It uses an >> array of uint64_t. The most significant bit of each wp location indicates >> if the zone type is sequential write required. >> >> The zones wp can be changed due to the following operations issued: >> - zone reset: change the wp to the start offset of that zone >> - zone finish: change to the end location of that zone >> - write to a zone >> - zone append >> >> Signed-off-by: Sam Li Replying to myself to add some comments that I forgot. >> --- >> block/file-posix.c | 138 ++- >> include/block/block-common.h | 16 >> include/block/block_int-common.h | 5 ++ >> include/block/raw-aio.h | 4 +- >> 4 files changed, 159 insertions(+), 4 deletions(-) >> >> diff --git a/block/file-posix.c b/block/file-posix.c >> index 73656d87f2..33e81ac112 100755 >> --- a/block/file-posix.c >> +++ b/block/file-posix.c >> @@ -206,6 +206,8 @@ typedef struct RawPosixAIOData { >> struct { >> struct iovec *iov; >> int niov; >> +int64_t *append_sector; >> +BlockZoneWps *wps; >> } io; >> struct { >> uint64_t cmd; >> @@ -1332,6 +1334,59 @@ static int hdev_get_max_segments(int fd, struct stat >> *st) { >> #endif >> } >> >> +#if defined(CONFIG_BLKZONED) >> +static int report_zone_wp(int64_t offset, int fd, BlockZoneWps *wps, >> + unsigned int nrz) { > > Maybe rename this to get_zones_wp() ? > >> +struct blk_zone *blkz; >> +int64_t rep_size; >> +int64_t sector = offset >> BDRV_SECTOR_BITS; >> +int ret, n = 0, i = 0; >> + >> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct >> blk_zone); >> +g_autofree struct blk_zone_report *rep = NULL; > > To be cleaner, move this declaration above with the others ? > >> +rep = g_malloc(rep_size); >> + >> +blkz = (struct blk_zone *)(rep + 1); >> +while (n < nrz) { >> +memset(rep, 0, rep_size); >> +rep->sector = sector; >> +rep->nr_zones = nrz - n; >> + >> +do { >> +ret = ioctl(fd, BLKREPORTZONE, rep); >> +} while (ret != 0 && errno == EINTR); >> +if (ret != 0) { >> +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d", >> +fd, offset, errno); >> +return -errno; >> +} >> + >> +if (!rep->nr_zones) { >> +break; >> +} >> + >> +for (i = 0; i < rep->nr_zones; i++, n++) { >> +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS; >> +sector = blkz[i].start + blkz[i].len; >> + >> +/* >> + * In the wp tracking, it only cares if the zone type is >> sequential >> + * writes required so that the wp can advance to the right >> location. > > Or sequential write preferred (host aware case) > >> + * Instead of the type of zone_type which is an 8-bit unsigned >> + * integer, use the first most significant bits of the wp >> location >> + * to indicate the zone type: 0 for SWR zones and 1 for the >> + * others. >> + */ >> +if (!(blkz[i].type & BLK_ZONE_TYPE_SEQWRITE_REQ)) { > > This should be: > > if (blkz[i].type != BLK_ZONE_TYPE_CONVENTIONAL) { > > Note that the type field is not a bit-field. So you must compare values > instead of doing bit operations. > >> +wps->wp[i] += (uint64_t)1 << 63; > > You can simplify this: > > wps->wp[i] |= 1ULL << 63; > > Overall, I would rewrite this like this: > > for (i = 0; i < rep->nr_zones; i++, n++) { > /* > * The wp tracking cares only about sequential write required > * and sequential write preferred zones so that the wp can > * advance to the right location. > * Use the most significant bit of the wp location >
Re: [PATCH v2 1/2] file-posix: add the tracking of the zones wp
On 10/5/22 17:30, Sam Li wrote: > Damien Le Moal 于2022年10月5日周三 15:34写道: >> >> On 10/5/22 10:44, Damien Le Moal wrote: >>> On 9/29/22 18:31, Sam Li wrote: >>>> Since Linux doesn't have a user API to issue zone append operations to >>>> zoned devices from user space, the file-posix driver is modified to add >>>> zone append emulation using regular writes. To do this, the file-posix >>>> driver tracks the wp location of all zones of the device. It uses an >>>> array of uint64_t. The most significant bit of each wp location indicates >>>> if the zone type is sequential write required. >>>> >>>> The zones wp can be changed due to the following operations issued: >>>> - zone reset: change the wp to the start offset of that zone >>>> - zone finish: change to the end location of that zone >>>> - write to a zone >>>> - zone append >>>> >>>> Signed-off-by: Sam Li >> >> Replying to myself to add some comments that I forgot. >> >>>> --- >>>> block/file-posix.c | 138 ++- >>>> include/block/block-common.h | 16 >>>> include/block/block_int-common.h | 5 ++ >>>> include/block/raw-aio.h | 4 +- >>>> 4 files changed, 159 insertions(+), 4 deletions(-) >>>> >>>> diff --git a/block/file-posix.c b/block/file-posix.c >>>> index 73656d87f2..33e81ac112 100755 >>>> --- a/block/file-posix.c >>>> +++ b/block/file-posix.c >>>> @@ -206,6 +206,8 @@ typedef struct RawPosixAIOData { >>>> struct { >>>> struct iovec *iov; >>>> int niov; >>>> +int64_t *append_sector; >>>> +BlockZoneWps *wps; >>>> } io; >>>> struct { >>>> uint64_t cmd; >>>> @@ -1332,6 +1334,59 @@ static int hdev_get_max_segments(int fd, struct >>>> stat *st) { >>>> #endif >>>> } >>>> >>>> +#if defined(CONFIG_BLKZONED) >>>> +static int report_zone_wp(int64_t offset, int fd, BlockZoneWps *wps, >>>> + unsigned int nrz) { >>> >>> Maybe rename this to get_zones_wp() ? >>> >>>> +struct blk_zone *blkz; >>>> +int64_t rep_size; >>>> +int64_t sector = offset >> BDRV_SECTOR_BITS; >>>> +int ret, n = 0, i = 0; >>>> + >>>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct >>>> blk_zone); >>>> +g_autofree struct blk_zone_report *rep = NULL; >>> >>> To be cleaner, move this declaration above with the others ? >>> >>>> +rep = g_malloc(rep_size); >>>> + >>>> +blkz = (struct blk_zone *)(rep + 1); >>>> +while (n < nrz) { >>>> +memset(rep, 0, rep_size); >>>> +rep->sector = sector; >>>> +rep->nr_zones = nrz - n; >>>> + >>>> +do { >>>> +ret = ioctl(fd, BLKREPORTZONE, rep); >>>> +} while (ret != 0 && errno == EINTR); >>>> +if (ret != 0) { >>>> +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed >>>> %d", >>>> +fd, offset, errno); >>>> +return -errno; >>>> +} >>>> + >>>> +if (!rep->nr_zones) { >>>> +break; >>>> +} >>>> + >>>> +for (i = 0; i < rep->nr_zones; i++, n++) { >>>> +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS; >>>> +sector = blkz[i].start + blkz[i].len; >>>> + >>>> +/* >>>> + * In the wp tracking, it only cares if the zone type is >>>> sequential >>>> + * writes required so that the wp can advance to the right >>>> location. >>> >>> Or sequential write preferred (host aware case) >>> >>>> + * Instead of the type of zone_type which is an 8-bit unsigned >>>> + * integer, use the first most significant bits of the wp >>>> location >>>> + * to indicate the zone type: 0 for SWR z
Re: [RFC v1] Adding block layer APIs resembling Linux ZoneBlockDevice ioctls.
int64_t pos); > > +int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs, > +struct BlockZoneDescriptor *zones, uint32_t *nr_zones, > +int64_t offset, int64_t len, uint8_t partial); > +int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, enum zone_op > op, > +int64_t offset, int64_t len); > + > /* removable device specific */ > bool (*bdrv_is_inserted)(BlockDriverState *bs); > void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag); > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c > index 2f0d8ac25a..ca02b4047e 100644 > --- a/qemu-io-cmds.c > +++ b/qemu-io-cmds.c > @@ -1706,6 +1706,67 @@ static const cmdinfo_t flush_cmd = { > .oneline= "flush all in-core file state to disk", > }; > > +static int zone_report_f(BlockBackend *blk, int argc, char **argv) > +{ > +return blk_zone_report(blk); > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "f", > +.cfunc = zone_report_f, > +.oneline = "report zone information in zone block device", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +return blk_zone_mgmt(blk, zone_open); > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "f", > +.cfunc = zone_open_f, > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +return blk_zone_mgmt(blk, zone_close); > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "f", > +.cfunc = zone_close_f, > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +return blk_zone_mgmt(blk, zone_finish); > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "f", > +.cfunc = zone_finish_f, > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +return blk_zone_mgmt(blk, zone_reset); > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "f", > +.cfunc = zone_reset_f, > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2498,6 +2559,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); > diff --git a/tests/qemu-iotests/tests/zoned.sh > b/tests/qemu-iotests/tests/zoned.sh > new file mode 100644 > index 00..a1596e8340 > --- /dev/null > +++ b/tests/qemu-iotests/tests/zoned.sh > @@ -0,0 +1,47 @@ > +#!/usr/bin/env bash > +# > +# Test zone management operations. > +# > + > + > + > +seq=`basename $0` > +echo "QA output created by $seq" > + > +status=1 # failure is the default! > + > +_cleanup() > +{ > + _cleanup_test_img > +} > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +# get standard environment, filters and checks > +. ./common.rc > +. ./common.filter > +. ./common.pattern > + > +# much of this could be generic for any format supporting compression. > +_supported_fmt qcow qcow2 > +_supported_proto file > +_supported_os Linux > + > +TEST_OFFSETS="0 1000 4000" > +TEST_LENS="1000" > +TEST_OPS="zone_report zone_open zone_close zone_finish zone_reset > zone_append" > + > + > +echo "Testing a null_blk device" > +echo > + > +for offset in $TEST_OFFSETS; do > +echo "At offset $offset:" > +io_test "write -b" $offset $CLUSTER_SIZE 8 > +io_test "read -b" $offset $CLUSTER_SIZE 8 > +_check_test_img > +done > + > +# success, all done > +echo "*** done" > +rm -f $seq.full > +status=0 -- Damien Le Moal Western Digital Research
Re: [RFC v1] Adding block layer APIs resembling Linux ZoneBlockDevice ioctls.
gt; +++ b/qemu-io-cmds.c >>> @@ -1706,6 +1706,67 @@ static const cmdinfo_t flush_cmd = { >>> .oneline= "flush all in-core file state to disk", >>> }; >>> >>> +static int zone_report_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +return blk_zone_report(blk); >>> +} >>> + >>> +static const cmdinfo_t zone_report_cmd = { >>> + .name = "zone_report", >>> + .altname = "f", >>> +.cfunc = zone_report_f, >>> +.oneline = "report zone information in zone block device", >>> +}; >>> + >>> +static int zone_open_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +return blk_zone_mgmt(blk, zone_open); >>> +} >>> + >>> +static const cmdinfo_t zone_open_cmd = { >>> +.name = "zone_open", >>> +.altname = "f", >>> +.cfunc = zone_open_f, >>> +.oneline = "explicit open a range of zones in zone block device", >>> +}; >>> + >>> +static int zone_close_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +return blk_zone_mgmt(blk, zone_close); >>> +} >>> + >>> +static const cmdinfo_t zone_close_cmd = { >>> +.name = "zone_close", >>> +.altname = "f", >>> +.cfunc = zone_close_f, >>> +.oneline = "close a range of zones in zone block device", >>> +}; >>> + >>> +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +return blk_zone_mgmt(blk, zone_finish); >>> +} >>> + >>> +static const cmdinfo_t zone_finish_cmd = { >>> +.name = "zone_finish", >>> +.altname = "f", >>> +.cfunc = zone_finish_f, >>> +.oneline = "finish a range of zones in zone block device", >>> +}; >>> + >>> +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) >>> +{ >>> +return blk_zone_mgmt(blk, zone_reset); >>> +} >>> + >>> +static const cmdinfo_t zone_reset_cmd = { >>> +.name = "zone_reset", >>> +.altname = "f", >>> +.cfunc = zone_reset_f, >>> +.oneline = "reset a zone write pointer in zone block device", >>> +}; >>> + >>> + >>> static int truncate_f(BlockBackend *blk, int argc, char **argv); >>> static const cmdinfo_t truncate_cmd = { >>> .name = "truncate", >>> @@ -2498,6 +2559,11 @@ static void __attribute((constructor)) >>> init_qemuio_commands(void) >>> qemuio_add_command(&aio_write_cmd); >>> qemuio_add_command(&aio_flush_cmd); >>> qemuio_add_command(&flush_cmd); >>> +qemuio_add_command(&zone_report_cmd); >>> +qemuio_add_command(&zone_open_cmd); >>> +qemuio_add_command(&zone_close_cmd); >>> +qemuio_add_command(&zone_finish_cmd); >>> +qemuio_add_command(&zone_reset_cmd); >>> qemuio_add_command(&truncate_cmd); >>> qemuio_add_command(&length_cmd); >>> qemuio_add_command(&info_cmd); >>> diff --git a/tests/qemu-iotests/tests/zoned.sh >>> b/tests/qemu-iotests/tests/zoned.sh >>> new file mode 100644 >>> index 00..a1596e8340 >>> --- /dev/null >>> +++ b/tests/qemu-iotests/tests/zoned.sh >>> @@ -0,0 +1,47 @@ >>> +#!/usr/bin/env bash >>> +# >>> +# Test zone management operations. >>> +# >>> + >>> + >>> + >>> +seq=`basename $0` >>> +echo "QA output created by $seq" >>> + >>> +status=1 # failure is the default! >>> + >>> +_cleanup() >>> +{ >>> + _cleanup_test_img >>> +} >>> +trap "_cleanup; exit \$status" 0 1 2 3 15 >>> + >>> +# get standard environment, filters and checks >>> +. ./common.rc >>> +. ./common.filter >>> +. ./common.pattern >>> + >>> +# much of this could be generic for any format supporting compression. >>> +_supported_fmt qcow qcow2 >>> +_supported_proto file >>> +_supported_os Linux >>> + >>> +TEST_OFFSETS="0 1000 4000" >>> +TEST_LENS="1000" >>> +TEST_OPS="zone_report zone_open zone_close zone_finish zone_reset >>> zone_append" >>> + >>> + >>> +echo "Testing a null_blk device" >>> +echo >>> + >>> +for offset in $TEST_OFFSETS; do >>> +echo "At offset $offset:" >>> +io_test "write -b" $offset $CLUSTER_SIZE 8 >>> +io_test "read -b" $offset $CLUSTER_SIZE 8 >>> +_check_test_img >>> +done >>> + >>> +# success, all done >>> +echo "*** done" >>> +rm -f $seq.full >>> +status=0 >> >> >> -- >> Damien Le Moal >> Western Digital Research > -- Damien Le Moal Western Digital Research
Re: [RFC v2] Adding block layer APIs resembling Linux ZoneBlockDevice ioctls.
linux kernel API. >> +}; >> + >> +enum zone_cond { >> +BLK_ZS_NOT_WP = BLK_ZONE_COND_NOT_WP, >> +BLK_ZS_EMPTY = BLK_ZONE_COND_EMPTY, >> +BLK_ZS_IOPEN = BLK_ZONE_COND_IMP_OPEN, >> +BLK_ZS_EOPEN = BLK_ZONE_COND_EXP_OPEN, >> +BLK_ZS_CLOSED = BLK_ZONE_COND_CLOSED, >> +BLK_ZS_RDONLY = BLK_ZONE_COND_READONLY, >> +BLK_ZS_FULL = BLK_ZONE_COND_FULL, >> +BLK_ZS_OFFLINE = BLK_ZONE_COND_OFFLINE, >> +}; > > This 1:1 correspondence with Linux constants could make the code a > little harder to port. > > Maybe QEMU's block layer should define its own numeric constants so the > code doesn't rely on operating system-specific headers. > block/file-posix.c #ifdef __linux__ code can be responsible for > converting Linux-specific constants to QEMU constants (and the 1:1 > mapping can be used there). Yes ! > >> + >> +enum zone_op { >> +zone_open, >> +zone_close, >> +zone_finish, >> +zone_reset, >> +}; >> + >> +/* >> + * Zone descriptor data structure. >> + * Provide information on a zone with all position and size values in bytes. >> + */ >> +typedef struct BlockZoneDescriptor { >> +uint64_t start; >> +uint64_t length; >> +uint64_t cap; >> +uint64_t wp; >> +enum zone_type type; >> +enum zone_cond cond; >> +} BlockZoneDescriptor; >> + >> +enum zone_model { >> +BLK_Z_HM, >> +BLK_Z_HA, >> +}; >> >> typedef struct BlockDriverInfo { >> /* in bytes, 0 if irrelevant */ >> diff --git a/include/block/block-io.h b/include/block/block-io.h >> index 62c84f0519..deb8902684 100644 >> --- a/include/block/block-io.h >> +++ b/include/block/block-io.h >> @@ -80,6 +80,13 @@ int bdrv_co_ioctl(BlockDriverState *bs, int req, void >> *buf); >> /* Ensure contents are flushed to disk. */ >> int coroutine_fn bdrv_co_flush(BlockDriverState *bs); >> >> +/* Report zone information of zone block device. */ >> +int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, int64_t offset, >> + int64_t len, int64_t *nr_zones, >> + struct BlockZoneDescriptor *zones); >> +int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, enum zone_op op, >> +int64_t offset, int64_t len); >> + >> int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes); >> bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs); >> int bdrv_block_status(BlockDriverState *bs, int64_t offset, >> @@ -290,6 +297,12 @@ bdrv_readv_vmstate(BlockDriverState *bs, QEMUIOVector >> *qiov, int64_t pos); >> int generated_co_wrapper >> bdrv_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos); >> >> +int generated_co_wrapper blk_zone_report(BlockBackend *blk, int64_t offset, >> + int64_t len, int64_t *nr_zones, >> + struct BlockZoneDescriptor *zones); >> +int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, enum zone_op op, >> +int64_t offset, int64_t len); >> + >> /** >> * bdrv_parent_drained_begin_single: >> * >> diff --git a/include/block/block_int-common.h >> b/include/block/block_int-common.h >> index 8947abab76..4d0180a7da 100644 >> --- a/include/block/block_int-common.h >> +++ b/include/block/block_int-common.h >> @@ -63,6 +63,7 @@ >> #define BLOCK_OPT_EXTL2 "extended_l2" >> >> #define BLOCK_PROBE_BUF_SIZE512 >> +#define zone_start_sector 512 >> >> enum BdrvTrackedRequestType { >> BDRV_TRACKED_READ, >> @@ -94,6 +95,19 @@ typedef struct BdrvTrackedRequest { >> struct BdrvTrackedRequest *waiting_for; >> } BdrvTrackedRequest; >> >> +/** >> + * Zone device information data structure. >> + * Provide information on a device. >> + */ >> +typedef struct zbd_dev { >> +enum zone_model model; >> +uint32_t block_size; > > How does this relate to QEMU's BlockLimits? > >> +uint32_t write_granularity; > > Same. Maybe this belongs in BlockLimits? > >> +uint32_t nr_zones; > > Should this really be limited to 32-bit? For example, take 256 MB zones, > then the max nr_zones * 256 MB is much smaller than a uint64_t capacity > value. It seems safer to make this 64-bit, but maybe Dmitry, Hannes, or > Damien can tell us what to do here. u32 is fine. We are nowhere near 4G zones :) The max out there today is 20TB SMR drive with 128MB zones. About 150,000 zones. Nowhere near 4G limit. Linux kernel also uses unsigned int for number of zones everywhere. > >> +struct BlockZoneDescriptor *zones; /* array of zones */ > > When is this used? This will be needed when zone append implementation comes. But not there yet :) A simpler form of the zone descriptor will likely be better anyway to save memory. -- Damien Le Moal Western Digital Research
Re: [RFC v4 4/9] file-posix: introduce get_sysfs_str_val for device zoned model.
On 7/12/22 15:17, Hannes Reinecke wrote: > On 7/12/22 04:13, Sam Li wrote: >> Signed-off-by: Sam Li >> --- >> block/file-posix.c | 60 >> include/block/block-common.h | 4 +-- >> 2 files changed, 62 insertions(+), 2 deletions(-) >> >> diff --git a/block/file-posix.c b/block/file-posix.c >> index 3161d39ea4..42708012ff 100644 >> --- a/block/file-posix.c >> +++ b/block/file-posix.c >> @@ -1279,6 +1279,65 @@ out: >> #endif >> } >> >> +/* >> + * Convert the zoned attribute file in sysfs to internal value. >> + */ >> +static zone_model get_sysfs_str_val(int fd, struct stat *st) { >> +#ifdef CONFIG_LINUX >> +char buf[32]; >> +char *sysfspath = NULL; >> +int ret, offset; >> +int sysfd = -1; >> + >> +if (S_ISCHR(st->st_mode)) { >> +if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { >> +return ret; >> +} >> +return -ENOTSUP; >> +} >> + >> +if (!S_ISBLK(st->st_mode)) { >> +return -ENOTSUP; >> +} >> + >> +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/zoned", >> +major(st->st_rdev), minor(st->st_rdev)); >> +sysfd = open(sysfspath, O_RDONLY); >> +if (sysfd == -1) { >> +ret = -errno; >> +goto out; >> +} >> +offset = 0; >> +do { >> +ret = read(sysfd, buf + offset, sizeof(buf) - 1 + offset); >> +if (ret > 0) { >> +offset += ret; >> +} >> +} while (ret == -1); >> +/* The file is ended with '\n' */ >> +if (buf[ret - 1] == '\n') { >> +buf[ret - 1] = '\0'; >> +} >> + >> +if (strcmp(buf, "host-managed") == 0) { >> +return BLK_Z_HM; >> +} else if (strcmp(buf, "host-aware") == 0) { >> +return BLK_Z_HA; >> +} else { >> +return -ENOTSUP; >> +} >> + >> +out: >> +if (sysfd != -1) { >> +close(sysfd); >> +} >> +g_free(sysfspath); >> +return ret; >> +#else >> +return -ENOTSUP; >> +#endif >> +} >> + >> static int hdev_get_max_segments(int fd, struct stat *st) { >> return get_sysfs_long_val(fd, st, "max_segments"); >> } >> @@ -1885,6 +1944,7 @@ static int handle_aiocb_zone_mgmt(void *opaque) { >> int64_t len = aiocb->aio_nbytes; >> zone_op op = aiocb->zone_mgmt.op; >> >> +zone_model mod; >> struct blk_zone_range range; >> const char *ioctl_name; >> unsigned long ioctl_op; >> diff --git a/include/block/block-common.h b/include/block/block-common.h >> index 78cddeeda5..35e00afe8e 100644 >> --- a/include/block/block-common.h >> +++ b/include/block/block-common.h >> @@ -56,8 +56,8 @@ typedef enum zone_op { >> } zone_op; >> >> typedef enum zone_model { >> -BLK_Z_HM, >> -BLK_Z_HA, >> +BLK_Z_HM = 0x1, >> +BLK_Z_HA = 0x2, >> } zone_model; >> >> typedef enum BlkZoneCondition { > > This hunk is unrelated, please move it into a different patch. This actually belong to patch 1 where this enum is introduced. No need for a different patch. > > Cheers, > > Hannes -- Damien Le Moal Western Digital Research
Re: [RFC v4 1/9] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.
gmt, > +}; > + > #if defined(__linux__) || defined(__FreeBSD__) || defined(__FreeBSD_kernel__) > static void cdrom_parse_filename(const char *filename, QDict *options, > Error **errp) > @@ -3964,6 +4199,7 @@ static void bdrv_file_init(void) > #if defined(HAVE_HOST_BLOCK_DEVICE) > bdrv_register(&bdrv_host_device); > #ifdef __linux__ > +bdrv_register(&bdrv_zoned_host_device); > bdrv_register(&bdrv_host_cdrom); > #endif > #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__) > diff --git a/include/block/block-common.h b/include/block/block-common.h > index fdb7306e78..78cddeeda5 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -23,7 +23,6 @@ > */ > #ifndef BLOCK_COMMON_H > #define BLOCK_COMMON_H > - > #include "block/aio.h" > #include "block/aio-wait.h" > #include "qemu/iov.h" > @@ -49,6 +48,48 @@ typedef struct BlockDriver BlockDriver; > typedef struct BdrvChild BdrvChild; > typedef struct BdrvChildClass BdrvChildClass; > > +typedef enum zone_op { > +zone_open, > +zone_close, > +zone_finish, > +zone_reset, > +} zone_op; > + > +typedef enum zone_model { > +BLK_Z_HM, > +BLK_Z_HA, > +} zone_model; > + > +typedef enum BlkZoneCondition { > +BLK_ZS_NOT_WP = 0x0, > +BLK_ZS_EMPTY = 0x1, > +BLK_ZS_IOPEN = 0x2, > +BLK_ZS_EOPEN = 0x3, > +BLK_ZS_CLOSED = 0x4, > +BLK_ZS_RDONLY = 0xD, > +BLK_ZS_FULL = 0xE, > +BLK_ZS_OFFLINE = 0xF, > +} BlkZoneCondition; > + > +typedef enum BlkZoneType { > +BLK_ZT_CONV = 0x1, > +BLK_ZT_SWR = 0x2, > +BLK_ZT_SWP = 0x3, > +} BlkZoneType; > + > +/* > + * Zone descriptor data structure. > + * Provide information on a zone with all position and size values in bytes. > + */ > +typedef struct BlockZoneDescriptor { > + uint64_t start; > +uint64_t length; > +uint64_t cap; > +uint64_t wp; > +BlkZoneType type; > +BlkZoneCondition cond; > +} BlockZoneDescriptor; > + > typedef struct BlockDriverInfo { > /* in bytes, 0 if irrelevant */ > int cluster_size; > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 8947abab76..6037871089 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -94,6 +94,20 @@ typedef struct BdrvTrackedRequest { > struct BdrvTrackedRequest *waiting_for; > } BdrvTrackedRequest; > > +/** > + * Zone device information data structure. > + * Provide information on a device. > + */ > +typedef struct zbd_dev { > +uint32_t zone_size; > +zone_model model; > +uint32_t block_size; > +uint32_t write_granularity; > +uint32_t nr_zones; > +struct BlockZoneDescriptor *zones; /* array of zones */ > +uint32_t max_nr_open_zones; /* maximum number of explicitly open zones */ > +uint32_t max_nr_active_zones; > +} zbd_dev; > > struct BlockDriver { > /* > @@ -691,6 +705,12 @@ struct BlockDriver { >QEMUIOVector *qiov, >int64_t pos); > > +int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs, > +int64_t offset, int64_t *nr_zones, > +BlockZoneDescriptor *zones); > +int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, enum zone_op > op, > +int64_t offset, int64_t len); > + > /* removable device specific */ > bool (*bdrv_is_inserted)(BlockDriverState *bs); > void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag); -- Damien Le Moal Western Digital Research
Re: [RFC v4 3/9] file-posix: introduce get_sysfs_long_val for a block queue of sysfs attribute
On 7/12/22 11:13, Sam Li wrote: > Use sysfs attribute files to get the zoned device information in case > that ioctl() commands of zone management interface won't work. > > Signed-off-by: Sam Li > --- > block/file-posix.c | 38 +++--- > 1 file changed, 27 insertions(+), 11 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index e7523ae2ed..3161d39ea4 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1218,15 +1218,19 @@ static int hdev_get_max_hw_transfer(int fd, struct > stat *st) > #endif > } > > -static int hdev_get_max_segments(int fd, struct stat *st) > -{ > +/* > + * Get zoned device information (chunk_sectors, zoned_append_max_bytes, > + * max_open_zones, max_active_zones) through sysfs attribute files. > + */ > +static long get_sysfs_long_val(int fd, struct stat *st, > + const char *attribute) { > #ifdef CONFIG_LINUX > char buf[32]; > const char *end; > char *sysfspath = NULL; > int ret; > int sysfd = -1; > -long max_segments; > +long val; > > if (S_ISCHR(st->st_mode)) { > if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { > @@ -1239,8 +1243,9 @@ static int hdev_get_max_segments(int fd, struct stat > *st) > return -ENOTSUP; > } > > -sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments", > -major(st->st_rdev), minor(st->st_rdev)); > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s", > +major(st->st_rdev), minor(st->st_rdev), > +attribute); > sysfd = open(sysfspath, O_RDONLY); > if (sysfd == -1) { > ret = -errno; > @@ -1258,9 +1263,9 @@ static int hdev_get_max_segments(int fd, struct stat > *st) > } > buf[ret] = 0; > /* The file is ended with '\n', pass 'end' to accept that. */ > -ret = qemu_strtol(buf, &end, 10, &max_segments); > +ret = qemu_strtol(buf, &end, 10, &val); > if (ret == 0 && end && *end == '\n') { > -ret = max_segments; > +ret = val; > } > > out: > @@ -1274,6 +1279,10 @@ out: > #endif > } > > +static int hdev_get_max_segments(int fd, struct stat *st) { > +return get_sysfs_long_val(fd, st, "max_segments"); > +} > + > static void raw_refresh_limits(BlockDriverState *bs, Error **errp) > { > BDRVRawState *s = bs->opaque; > @@ -1883,10 +1892,17 @@ static int handle_aiocb_zone_mgmt(void *opaque) { > int64_t zone_size_mask; > int ret; > > -g_autofree struct stat *file = NULL; > -file = g_new(struct stat, 1); > -stat(s->filename, file); > -zone_size = get_sysfs_long_val(fd, file, "chunk_sectors"); > +struct stat file; > +if (fstat(fd, &file) < 0) { > +return -errno; > +} > +mod = get_sysfs_str_val(fd, &file); > +if (mod != BLK_Z_HM) { > +ret = -ENOTSUP; > +return ret; > +} > + > +zone_size = get_sysfs_long_val(fd, &file, "chunk_sectors"); > zone_size_mask = zone_size - 1; > if (offset & zone_size_mask) { > error_report("offset is not the start of a zone"); This hunk that modifies raw_refresh_limits() should go into another patch. Likely, you want that as part of patch 1, so this patch to introduce the get_sysfs_long_val() helper function should come before patch 1. -- Damien Le Moal Western Digital Research
Re: [RFC v4 2/9] qemu-io: add zoned block device operations.
zone_report(blk, offset, &nr_zones, zones); > +if (ret < 0) { > +printf("zone report failed: %s\n", strerror(-ret)); > +} else { > +for (int i = 0; i < nr_zones; ++i) { > +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " > + "cap"" 0x%" PRIx64 ",wptr 0x%" PRIx64 ", " > + "zcond:%u, [type: %u]\n", > + zones[i].start, zones[i].length, zones[i].cap, > zones[i].wp, > + zones[i].cond, zones[i].type); > +} > +} > +return ret; > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "zp", > +.cfunc = zone_report_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset number", > +.oneline = "report zone information", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, zone_open, offset, len); > +if (ret < 0) { > +printf("zone open failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "zo", > +.cfunc = zone_open_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, zone_close, offset, len); > +if (ret < 0) { > +printf("zone close failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "zc", > +.cfunc = zone_close_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, zone_finish, offset, len); > +if (ret < 0) { > +printf("zone finish failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "zf", > +.cfunc = zone_finish_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, zone_reset, offset, len); > +if (ret < 0) { > +printf("zone reset failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "zrs", > +.cfunc = zone_reset_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2498,6 +2636,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); -- Damien Le Moal Western Digital Research
Re: [RFC v4 9/9] qapi: add support for zoned host device
On 7/12/22 11:13, Sam Li wrote: > --- > block/file-posix.c | 8 +++- > qapi/block-core.json | 7 +-- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index e9ad1d8e1e..4e0aa02acf 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -3737,6 +3737,12 @@ static void hdev_parse_filename(const char *filename, > QDict *options, > bdrv_parse_filename_strip_prefix(filename, "host_device:", options); > } > > +static void zoned_host_device_parse_filename(const char *filename, QDict > *options, > +Error **errp) > +{ > +bdrv_parse_filename_strip_prefix(filename, "zoned_host_device:", > options); > +} > + > static bool hdev_is_sg(BlockDriverState *bs) > { > > @@ -3975,7 +3981,7 @@ static BlockDriver bdrv_zoned_host_device = { > .is_zoned = true, > .bdrv_needs_filename = true, > .bdrv_probe_device = hdev_probe_device, > -.bdrv_parse_filename = hdev_parse_filename, > +.bdrv_parse_filename = zoned_host_device_parse_filename, > .bdrv_file_open = hdev_open, > .bdrv_close = raw_close, > .bdrv_reopen_prepare = raw_reopen_prepare, > diff --git a/qapi/block-core.json b/qapi/block-core.json > index 2173e7734a..ab05c2ef99 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -2955,7 +2955,8 @@ > 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', > 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', > { 'name': 'replication', 'if': 'CONFIG_REPLICATION' }, > -'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } > +'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', > +{ 'name': 'zoned_host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' } > ] } This needs to be something like: { 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] } And we need to make sure CONFIG_BLKZONED is defined if and only if we also have HAVE_HOST_BLOCK_DEVICE. > > ## > # @BlockdevOptionsFile: > @@ -4329,7 +4330,9 @@ >'vhdx': 'BlockdevOptionsGenericFormat', >'vmdk': 'BlockdevOptionsGenericCOWFormat', >'vpc':'BlockdevOptionsGenericFormat', > - 'vvfat': 'BlockdevOptionsVVFAT' > + 'vvfat': 'BlockdevOptionsVVFAT', > + 'zoned_host_device': { 'type': 'BlockdevOptionsFile', > + 'if': 'HAVE_HOST_BLOCK_DEVICE' } Same here I think. >} } > > ## -- Damien Le Moal Western Digital Research
Re: [RFC v4 4/9] file-posix: introduce get_sysfs_str_val for device zoned model.
On 7/12/22 11:13, Sam Li wrote: > Signed-off-by: Sam Li > --- > block/file-posix.c | 60 > include/block/block-common.h | 4 +-- > 2 files changed, 62 insertions(+), 2 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 3161d39ea4..42708012ff 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1279,6 +1279,65 @@ out: > #endif > } > > +/* > + * Convert the zoned attribute file in sysfs to internal value. > + */ > +static zone_model get_sysfs_str_val(int fd, struct stat *st) { This is not a generic definition of a helper function: as-is, this function can only get the zoned attribute string. It would be better to do the same as what you did for get_sysfs_long_val() and pass the attribute name you want to look at and add another argument to return the string, e.g.: static void get_sysfs_str_val(int fd, struct stat *st, const char *attribute, char **val) And if the attribute does not exist, or this is not linux, always return "none" in str. > +#ifdef CONFIG_LINUX > +char buf[32]; > +char *sysfspath = NULL; > +int ret, offset; > +int sysfd = -1; > + > +if (S_ISCHR(st->st_mode)) { > +if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { > +return ret; > +} > +return -ENOTSUP; > +} > + > +if (!S_ISBLK(st->st_mode)) { > +return -ENOTSUP; > +} > + > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/zoned", > +major(st->st_rdev), minor(st->st_rdev)); > +sysfd = open(sysfspath, O_RDONLY); > +if (sysfd == -1) { > +ret = -errno; > +goto out; > +} > +offset = 0; > +do { > +ret = read(sysfd, buf + offset, sizeof(buf) - 1 + offset); > +if (ret > 0) { > +offset += ret; > +} > +} while (ret == -1); > +/* The file is ended with '\n' */ > +if (buf[ret - 1] == '\n') { > +buf[ret - 1] = '\0'; > +} > + > +if (strcmp(buf, "host-managed") == 0) { > +return BLK_Z_HM; > +} else if (strcmp(buf, "host-aware") == 0) { > +return BLK_Z_HA; > +} else { > +return -ENOTSUP; > +} Then all this code actually looking at the string value can go into a different patch, or in patch 1 (same comment as for patch 3). > + > +out: > +if (sysfd != -1) { > +close(sysfd); > +} > +g_free(sysfspath); > +return ret; > +#else > +return -ENOTSUP; > +#endif > +} > + > static int hdev_get_max_segments(int fd, struct stat *st) { > return get_sysfs_long_val(fd, st, "max_segments"); > } > @@ -1885,6 +1944,7 @@ static int handle_aiocb_zone_mgmt(void *opaque) { > int64_t len = aiocb->aio_nbytes; > zone_op op = aiocb->zone_mgmt.op; > > +zone_model mod; > struct blk_zone_range range; > const char *ioctl_name; > unsigned long ioctl_op; > diff --git a/include/block/block-common.h b/include/block/block-common.h > index 78cddeeda5..35e00afe8e 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -56,8 +56,8 @@ typedef enum zone_op { > } zone_op; > > typedef enum zone_model { > -BLK_Z_HM, > -BLK_Z_HA, > +BLK_Z_HM = 0x1, > +BLK_Z_HA = 0x2, > } zone_model; > > typedef enum BlkZoneCondition { -- Damien Le Moal Western Digital Research
Re: [RFC v4 1/9] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls.
pedef enum zone_op { >> +zone_open, >> +zone_close, >> +zone_finish, >> +zone_reset, >> +} zone_op; > > QEMU coding style: > > typedef enum { > BLK_ZO_OPEN, > BLK_ZO_CLOSE, > BLK_ZO_FINISH, > BLK_ZO_RESET, > } BlockZoneOp; > > Please also reformat the enums below. > >> + >> +typedef enum zone_model { >> +BLK_Z_HM, >> +BLK_Z_HA, >> +} zone_model; >> + >> +typedef enum BlkZoneCondition { >> +BLK_ZS_NOT_WP = 0x0, >> +BLK_ZS_EMPTY = 0x1, >> +BLK_ZS_IOPEN = 0x2, >> +BLK_ZS_EOPEN = 0x3, >> +BLK_ZS_CLOSED = 0x4, >> +BLK_ZS_RDONLY = 0xD, >> +BLK_ZS_FULL = 0xE, >> +BLK_ZS_OFFLINE = 0xF, >> +} BlkZoneCondition; >> + >> +typedef enum BlkZoneType { >> +BLK_ZT_CONV = 0x1, >> +BLK_ZT_SWR = 0x2, >> +BLK_ZT_SWP = 0x3, >> +} BlkZoneType; >> + >> +/* >> + * Zone descriptor data structure. >> + * Provide information on a zone with all position and size values in bytes. >> + */ >> +typedef struct BlockZoneDescriptor { >> +uint64_t start; >> +uint64_t length; >> +uint64_t cap; >> +uint64_t wp; >> +BlkZoneType type; >> +BlkZoneCondition cond; >> +} BlockZoneDescriptor; >> + >> typedef struct BlockDriverInfo { >> /* in bytes, 0 if irrelevant */ >> int cluster_size; >> diff --git a/include/block/block_int-common.h >> b/include/block/block_int-common.h >> index 8947abab76..6037871089 100644 >> --- a/include/block/block_int-common.h >> +++ b/include/block/block_int-common.h >> @@ -94,6 +94,20 @@ typedef struct BdrvTrackedRequest { >> struct BdrvTrackedRequest *waiting_for; >> } BdrvTrackedRequest; >> >> +/** >> + * Zone device information data structure. >> + * Provide information on a device. >> + */ >> +typedef struct zbd_dev { >> +uint32_t zone_size; >> +zone_model model; >> +uint32_t block_size; >> +uint32_t write_granularity; >> +uint32_t nr_zones; >> +struct BlockZoneDescriptor *zones; /* array of zones */ >> +uint32_t max_nr_open_zones; /* maximum number of explicitly open zones >> */ >> +uint32_t max_nr_active_zones; >> +} zbd_dev; > > This struct isn't use by this patch. Please move this change into the > patch that uses struct zbd_dev. > > QEMU coding style naming would be BlockZoneDev instead of zbd_dev. > > I think this struct contains the block limits fields that could be added > to QEMU's BlockLimits? A new struct may not be necessary. > >> >> struct BlockDriver { >> /* >> @@ -691,6 +705,12 @@ struct BlockDriver { >>QEMUIOVector *qiov, >>int64_t pos); >> >> +int coroutine_fn (*bdrv_co_zone_report)(BlockDriverState *bs, >> +int64_t offset, int64_t *nr_zones, >> +BlockZoneDescriptor *zones); >> +int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, enum >> zone_op op, >> +int64_t offset, int64_t len); >> + >> /* removable device specific */ >> bool (*bdrv_is_inserted)(BlockDriverState *bs); >> void (*bdrv_eject)(BlockDriverState *bs, bool eject_flag); >> -- >> 2.36.1 >> -- Damien Le Moal Western Digital Research
Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
On 2022/05/30 14:09, Sam Li wrote: > Hi everyone, Hi Sam, > I'm Sam Li, working on the Outreachy project which is to add zoned > device support to QEMU's virtio-blk emulation. > > For the first goal, adding QEMU block layer APIs resembling Linux ZBD > ioctls, I think the naive approach would be to introduce a new stable > struct zbd_zone descriptor for the library function interface. More > specifically, what I'd like to add to the BlockDriver struct are: > 1. zbd_info as zone block device information: includes numbers of > zones, size of logical blocks, and physical blocks. Virtio block devices only advertise a "block size" (blk_size field of struct virtio_blk_config). So I do not think that you need to distinguish between logical and physical blocks. However, for zoned devices, we need to add a "write granularity" field which indicates the minimum write size and alignment. This is to be able to handle 512e SMR disk drives as these have a 512 B logical block size and 4096 B physical block size. And SMR only allows writing in units of physical block size, regardless of the LBA size. For NVMe ZNS devices, there is no logical/physical block size difference, so the write granularity will always be equal to the block size. > 2. zbd_zone_type and zbd_zone_state As a first step, I would recommend to only have the zone type. That will allow you to not issue a zone ioctl that you know will fail, e.g. if the user tries to reset a conventional zone, we know this will fail, so no point in executing the BLKRESETZONE ioctl for that zone. With the zone type cached, you can easily catch such cases. But even that is actually optional as a first step. You can rely on the host device failing any invalid operation and return the errors back to the guest. Once you have an API working and the ability to execute all zone operations from a guest, you can then work on adding the more difficult bits: supporting the zone append operation will require tracking the write pointer position and state of the device sequential zones. For that, the zone information will need the zone capacity and write pointer position of all zones. The zone state may not be necessary as you can infer the empty and full states from the zone capacity and zone write pointer position. States such as explicit/implicit open, closed, read-only and offline do not need to be tracked. If an operation cannot be executed, the device will fail the io on the host side and you can simply propagate that error to the guest. See the Linux kernel sd_zbc driver and its emulation of zone append operations for inspiration: drivers/scsi/sd_zbc.c. Looking at that code (e.g. sd_zbc_prepare_zone_append()), you will see that the only thing being tracked is the write pointer position of zones (relative to the zone start sector). The state is inferred from that value, indicating if the zone can be written (it is not full) or not (the zone is full). > 3. zbd_dev_model: host-managed zbd, host-aware zbd Yes. The current virtio specs draft adding zoned block device support adds struct virtio_blk_zoned_characteristics. Most, if not all, of the fields in that structure can be kept as part fot the device zone information. > With those basic structs, we can start to implement new functions as > bdrv*() APIs for BLOCK*ZONE ioctls. BLK*ZONE :) > > I'll start to finish this task based on the above description. If > there is any problem or something I may miss in the design, please let > me know. Supporting all operations will be difficult to do in one go. But as explained above, if you initially exclude zone append support, you will not need to dynamically track zone state and wp. This will simplify the code to give you a solid working base to build upon the remaining support. > > Best regards, > Sam > -- Damien Le Moal Western Digital Research
Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
On 2022/05/30 20:22, Stefan Hajnoczi wrote: > On Mon, 30 May 2022 at 06:38, Damien Le Moal wrote: >> On 2022/05/30 14:09, Sam Li wrote: >> Once you have an API working and the ability to execute all zone operations >> from >> a guest, you can then work on adding the more difficult bits: supporting the >> zone append operation will require tracking the write pointer position and >> state >> of the device sequential zones. For that, the zone information will need the >> zone capacity and write pointer position of all zones. The zone state may >> not be >> necessary as you can infer the empty and full states from the zone capacity >> and >> zone write pointer position. States such as explicit/implicit open, closed, >> read-only and offline do not need to be tracked. If an operation cannot be >> executed, the device will fail the io on the host side and you can simply >> propagate that error to the guest. >> >> See the Linux kernel sd_zbc driver and its emulation of zone append >> operations >> for inspiration: drivers/scsi/sd_zbc.c. Looking at that code (e.g. >> sd_zbc_prepare_zone_append()), you will see that the only thing being >> tracked is >> the write pointer position of zones (relative to the zone start sector). The >> state is inferred from that value, indicating if the zone can be written (it >> is >> not full) or not (the zone is full). > > It sounds like QEMU BlockDrivers need to maintain state? I haven't > thought this through but the guest probably has some state and the > device below QEMU also has state. Can QEMU just invoke the BLK*ZONE > ioctls on behalf of the guest without maintaining it's own state? FOr all operations except zone append, yes, it can and that the way it should be done. However, for a linux host, since there is no user interface for the zone append operation, QEMU will need to emulate that command using regular writes. And for that to work, zones state (at least the zone write pointer position) need to be tracked. Linux mandates zone append working for zoned devices... This kind of emulation is thus implemented in the scsi disk driver since ZBC & ZAC do not have zone append. It is not too hard to do, but care must be applied to zone wp tracking and locking (mutual exclusion for zone wp test + change depending on the operation). This tracking can be fully initialized on startup and does not need any persistent "metadata". > > Stefan > -- Damien Le Moal Western Digital Research
Re: Outreachy project task: Adding QEMU block layer APIs resembling Linux ZBD ioctls.
e the exact same signature, modulo the function name. So we should simplify here to reduce the number of functions. Something like: int zbd_zone_mgmt(int fd, enum zone_op op, off_t offset, off_t len); where op can be BDRV_ZONE_RESET, BDRV_ZONE_OPEN, etc can aggregate all 4 functions into one. In any case, if you look at how block drivers are defined (an example is the one for raw files in qemu/block/file-posix.c) using the BlockDriver data type (defined as a struct in qemu/include/block/block_int-common.h), you will see that the driver callback functions do not use a file descriptor (fd) but a pointer to some data structure (most of the time a BlockDriverState pointer). Another thing: you will need one more callback to get the device information related to zones: maximum number of open and active zones at least (the number of zones can be discovered with a report zones call). Cheers. -- Damien Le Moal Western Digital Research
Re: [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes
On 2022/09/10 14:27, Sam Li wrote: > Use get_sysfs_str_val() to get the string value of device > zoned model. Then get_sysfs_zoned_model() can convert it to > BlockZoneModel type of QEMU. > > Use get_sysfs_long_val() to get the long value of zoned device > information. > > Signed-off-by: Sam Li > Reviewed-by: Hannes Reinecke > Reviewed-by: Stefan Hajnoczi Looks good to me. Reviewed-by: Damien Le Moal > --- > block/file-posix.c | 121 ++- > include/block/block_int-common.h | 3 + > 2 files changed, 88 insertions(+), 36 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index 48cd096624..0a8b4b426e 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -1210,66 +1210,109 @@ static int hdev_get_max_hw_transfer(int fd, struct > stat *st) > #endif > } > > -static int hdev_get_max_segments(int fd, struct stat *st) > -{ > +/* > + * Get a sysfs attribute value as character string. > + */ > +static int get_sysfs_str_val(struct stat *st, const char *attribute, > + char **val) { > #ifdef CONFIG_LINUX > -char buf[32]; > -const char *end; > -char *sysfspath = NULL; > +g_autofree char *sysfspath = NULL; > int ret; > -int sysfd = -1; > -long max_segments; > +size_t len; > > -if (S_ISCHR(st->st_mode)) { > -if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) { > -return ret; > -} > +if (!S_ISBLK(st->st_mode)) { > return -ENOTSUP; > } > > -if (!S_ISBLK(st->st_mode)) { > -return -ENOTSUP; > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s", > +major(st->st_rdev), minor(st->st_rdev), > +attribute); > +ret = g_file_get_contents(sysfspath, val, &len, NULL); > +if (ret == -1) { > +return -ENOENT; > } > > -sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments", > -major(st->st_rdev), minor(st->st_rdev)); > -sysfd = open(sysfspath, O_RDONLY); > -if (sysfd == -1) { > -ret = -errno; > -goto out; > +/* The file is ended with '\n' */ > +char *p; > +p = *val; > +if (*(p + len - 1) == '\n') { > +*(p + len - 1) = '\0'; > } > -do { > -ret = read(sysfd, buf, sizeof(buf) - 1); > -} while (ret == -1 && errno == EINTR); > +return ret; > +#else > +return -ENOTSUP; > +#endif > +} > + > +static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned) { > +g_autofree char *val = NULL; > +int ret; > + > +ret = get_sysfs_str_val(st, "zoned", &val); > if (ret < 0) { > -ret = -errno; > -goto out; > -} else if (ret == 0) { > -ret = -EIO; > -goto out; > +return ret; > } > -buf[ret] = 0; > -/* The file is ended with '\n', pass 'end' to accept that. */ > -ret = qemu_strtol(buf, &end, 10, &max_segments); > -if (ret == 0 && end && *end == '\n') { > -ret = max_segments; > + > +if (strcmp(val, "host-managed") == 0) { > +*zoned = BLK_Z_HM; > +} else if (strcmp(val, "host-aware") == 0) { > +*zoned = BLK_Z_HA; > +} else if (strcmp(val, "none") == 0) { > +*zoned = BLK_Z_NONE; > +} else { > +return -ENOTSUP; > } > +return 0; > +} > > -out: > -if (sysfd != -1) { > -close(sysfd); > +/* > + * Get a sysfs attribute value as a long integer. > + */ > +static long get_sysfs_long_val(struct stat *st, const char *attribute) { > +#ifdef CONFIG_LINUX > +g_autofree char *str = NULL; > +const char *end; > +long val; > +int ret; > + > +ret = get_sysfs_str_val(st, attribute, &str); > +if (ret < 0) { > +return ret; > +} > + > +/* The file is ended with '\n', pass 'end' to accept that. */ > +ret = qemu_strtol(str, &end, 10, &val); > +if (ret == 0 && end && *end == '\0') { > +ret = val; > } > -g_free(sysfspath); > return ret; > #else > return -ENOTSUP; > #endif > } > > +static int hdev_get_max_segments(int fd, struct stat *st) { > +#ifdef CONFIG_LINUX > +int ret; > + > +if (S_ISCHR(st->st_mode)) { > +
Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
eader('linux/blkzoned.h')) > config_host_data.set('CONFIG_VALGRIND_H', > cc.has_header('valgrind/valgrind.h')) > config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h')) > config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h')) > diff --git a/qapi/block-core.json b/qapi/block-core.json > index 2173e7734a..c6bbb7a037 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -2942,6 +2942,7 @@ > # @compress: Since 5.0 > # @copy-before-write: Since 6.2 > # @snapshot-access: Since 7.0 > +# @zoned_host_device: Since 7.2 > # > # Since: 2.9 > ## > @@ -2955,7 +2956,8 @@ > 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', > 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', > { 'name': 'replication', 'if': 'CONFIG_REPLICATION' }, > -'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } > +'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', > +{ 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] } > > ## > # @BlockdevOptionsFile: > @@ -4329,7 +4331,9 @@ >'vhdx': 'BlockdevOptionsGenericFormat', >'vmdk': 'BlockdevOptionsGenericCOWFormat', >'vpc':'BlockdevOptionsGenericFormat', > - 'vvfat': 'BlockdevOptionsVVFAT' > + 'vvfat': 'BlockdevOptionsVVFAT', > + 'zoned_host_device': { 'type': 'BlockdevOptionsFile', > + 'if': 'CONFIG_BLKZONED' } >} } > > ## > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c > index 952dc940f1..446a059603 100644 > --- a/qemu-io-cmds.c > +++ b/qemu-io-cmds.c > @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = { > .oneline= "flush all in-core file state to disk", > }; > > +static int zone_report_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset; > +unsigned int nr_zones; > + > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +nr_zones = cvtnum(argv[optind]); > + > +g_autofree BlockZoneDescriptor *zones = NULL; > +zones = g_new(BlockZoneDescriptor, nr_zones); > +ret = blk_zone_report(blk, offset, &nr_zones, zones); > +if (ret < 0) { > +printf("zone report failed: %s\n", strerror(-ret)); > +} else { > +for (int i = 0; i < nr_zones; ++i) { > +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " > + "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", " > + "zcond:%u, [type: %u]\n", > + zones[i].start, zones[i].length, zones[i].cap, > zones[i].wp, > + zones[i].cond, zones[i].type); > +} > +} > +return ret; > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "zrp", > +.cfunc = zone_report_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset number", > +.oneline = "report zone information", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); > +if (ret < 0) { > +printf("zone open failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "zo", > +.cfunc = zone_open_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); > +if (ret < 0) { > +printf("zone close failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "zc", > +.cfunc = zone_close_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); > +if (ret < 0) { > +printf("zone finish failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "zf", > +.cfunc = zone_finish_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); > +if (ret < 0) { > +printf("zone reset failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "zrs", > +.cfunc = zone_reset_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); -- Damien Le Moal Western Digital Research
Re: [PATCH v9 4/7] raw-format: add zone operations to pass through requests
On 2022/09/10 14:27, Sam Li wrote: > raw-format driver usually sits on top of file-posix driver. It needs to > pass through requests of zone commands. > > Signed-off-by: Sam Li > Reviewed-by: Stefan Hajnoczi Reviewed-by: Damien Le Moal > --- > block/raw-format.c | 13 + > 1 file changed, 13 insertions(+) > > diff --git a/block/raw-format.c b/block/raw-format.c > index 69fd650eaf..6b20bd22ef 100644 > --- a/block/raw-format.c > +++ b/block/raw-format.c > @@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState > *bs, > return bdrv_co_pdiscard(bs->file, offset, bytes); > } > > +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t > offset, > + unsigned int *nr_zones, > + BlockZoneDescriptor *zones) { > +return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones); > +} > + > +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp > op, > + int64_t offset, int64_t len) { > +return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len); > +} > + > static int64_t raw_getlength(BlockDriverState *bs) > { > int64_t len; > @@ -614,6 +625,8 @@ BlockDriver bdrv_raw = { > .bdrv_co_pwritev = &raw_co_pwritev, > .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes, > .bdrv_co_pdiscard = &raw_co_pdiscard, > +.bdrv_co_zone_report = &raw_co_zone_report, > +.bdrv_co_zone_mgmt = &raw_co_zone_mgmt, > .bdrv_co_block_status = &raw_co_block_status, > .bdrv_co_copy_range_from = &raw_co_copy_range_from, > .bdrv_co_copy_range_to = &raw_co_copy_range_to, -- Damien Le Moal Western Digital Research
Re: [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation
On 2022/09/10 14:27, Sam Li wrote: > Add the documentation about the zoned device support to virtio-blk > emulation. > > Signed-off-by: Sam Li > Reviewed-by: Stefan Hajnoczi > --- > docs/devel/zoned-storage.rst | 41 ++ > docs/system/qemu-block-drivers.rst.inc | 6 > 2 files changed, 47 insertions(+) > create mode 100644 docs/devel/zoned-storage.rst > > diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst > new file mode 100644 > index 00..ead2d149cc > --- /dev/null > +++ b/docs/devel/zoned-storage.rst > @@ -0,0 +1,41 @@ > += > +zoned-storage > += > + > +Zoned Block Devices (ZBDs) devide the LBA space into block regions called > zones > +that are larger than the LBA size. It can only allow sequential writes, which s/It/They > +reduces write amplification in SSDs, leading to higher throughput and > increased > +capacity. More details about ZBDs can be found at: I would rephrase this like this, to be less assertive about the potential benefits (as they depend on the vendor implementation): ..., which can reduce write amplification in SSDs, and potentially lead to higher throughput and increased device capacity. > + > +https://zonedstorage.io/docs/introduction/zoned-storage > + > +1. Block layer APIs for zoned storage > +- > +QEMU block layer has three zoned storage model: > +- BLK_Z_HM: This model only allows sequential writes access. It supports a > set > +of ZBD-specific I/O request that used by the host to manage device zones. > +- BLK_Z_HA: It deals with both sequential writes and random writes access. > +- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as > +non-zoned devices. > + > +The block device information resides inside BlockDriverState. QEMU uses > +BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the > +block layer while processing I/O requests. A BlockBackend has a root pointer > to > +a BlockDriverState graph(for example, raw format on top of file-posix). The > +zoned storage information can be propagated from the leaf BlockDriverState > all > +the way up to the BlockBackend. If the zoned storage model in file-posix is > +set to BLK_Z_HM, then block drivers will declare support for zoned host > device. > + > +The block layer APIs support commands needed for zoned storage devices, > +including report zones, four zone operations, and zone append. > + > +2. Emulating zoned storage controllers > +-- > +When the BlockBackend's BlockLimits model reports a zoned storage device, > users > +like the virtio-blk emulation or the qemu-io-cmds.c utility can use block > layer > +APIs for zoned storage emulation or testing. > + > +For example, the command line for zone report testing a null_blk device of > +qemu-io-cmds.c is: > +$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 > -c > +"zrp offset nr_zones" > diff --git a/docs/system/qemu-block-drivers.rst.inc > b/docs/system/qemu-block-drivers.rst.inc > index dfe5d2293d..0b97227fd9 100644 > --- a/docs/system/qemu-block-drivers.rst.inc > +++ b/docs/system/qemu-block-drivers.rst.inc > @@ -430,6 +430,12 @@ Hard disks >you may corrupt your host data (use the ``-snapshot`` command >line option or modify the device permissions accordingly). > > +Zoned block devices > + Zoned block devices can be passed through to the guest if the emulated > storage > + controller supports zoned storage. Use ``--blockdev zoned_host_device, > + node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0`` > + as ``drive0``. > + > Windows > ^^^ > -- Damien Le Moal Western Digital Research
Re: [PATCH v9 5/7] config: add check to block layer
On 2022/09/10 14:27, Sam Li wrote: > Putting zoned/non-zoned BlockDrivers on top of each other is not > allowed. > > Signed-off-by: Sam Li > Reviewed-by: Stefan Hajnoczi > --- > block.c | 14 ++ > block/file-posix.c | 14 ++ > block/raw-format.c | 1 + > include/block/block_int-common.h | 5 + > 4 files changed, 34 insertions(+) > > diff --git a/block.c b/block.c > index bc85f46eed..dad2ed3959 100644 > --- a/block.c > +++ b/block.c > @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, > BlockDriverState *child_bs, > return; > } > > +/* > + * Non-zoned block drivers do not follow zoned storage constraints > + * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned > + * drivers in a graph. > + */ > +if (!parent_bs->drv->supports_zoned_children && > +child_bs->bl.zoned == BLK_Z_HM) { Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ? > +error_setg(errp, "Cannot add a %s child to a %s parent", > + child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned", > + parent_bs->drv->supports_zoned_children ? > + "support zoned children" : "not support zoned children"); > +return; > +} > + > if (!QLIST_EMPTY(&child_bs->parents)) { > error_setg(errp, "The node %s already has a parent", > child_bs->node_name); > diff --git a/block/file-posix.c b/block/file-posix.c > index 4edfa25d04..354de22860 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict > *options, > goto fail; > } > } > +#ifdef CONFIG_BLKZONED > +/* > + * The kernel page chache does not reliably work for writes to SWR zones > + * of zoned block device because it can not guarantee the order of > writes. > + */ > +if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) { > +if (!(s->open_flags & O_DIRECT)) { > +error_setg(errp, "driver=zoned_host_device was specified, but it > " > + "requires cache.direct=on, which was not > specified."); > +ret = -EINVAL; This line is not needed. Simply "return -EINVAL;". > +return ret; /* No host kernel page cache */ > +} > +} > +#endif > > if (S_ISBLK(st.st_mode)) { > #ifdef BLKDISCARDZEROES > diff --git a/block/raw-format.c b/block/raw-format.c > index 6b20bd22ef..9441536819 100644 > --- a/block/raw-format.c > +++ b/block/raw-format.c > @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, > BdrvChild *c, > BlockDriver bdrv_raw = { > .format_name = "raw", > .instance_size= sizeof(BDRVRawState), > +.supports_zoned_children = true, > .bdrv_probe = &raw_probe, > .bdrv_reopen_prepare = &raw_reopen_prepare, > .bdrv_reopen_commit = &raw_reopen_commit, > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 078ddd7e67..043aa161a0 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -127,6 +127,11 @@ struct BlockDriver { > */ > bool is_format; > > +/* > + * Set to true if the BlockDriver supports zoned children. > + */ > +bool supports_zoned_children; > + > /* > * Drivers not implementing bdrv_parse_filename nor bdrv_open should have > * this field set to true, except ones that are defined only by their -- Damien Le Moal Western Digital Research
Re: [PATCH] block: introduce zone append write for zoned devices
MU_AIO_ZONE_REPORT 0x0100 > #define QEMU_AIO_ZONE_MGMT0x0200 > +#define QEMU_AIO_ZONE_APPEND 0x0400 > #define QEMU_AIO_TYPE_MASK \ > (QEMU_AIO_READ | \ > QEMU_AIO_WRITE | \ > @@ -41,7 +42,8 @@ > QEMU_AIO_COPY_RANGE | \ > QEMU_AIO_TRUNCATE | \ > QEMU_AIO_ZONE_REPORT | \ > - QEMU_AIO_ZONE_MGMT) > + QEMU_AIO_ZONE_MGMT | \ > + QEMU_AIO_ZONE_APPEND) > > /* AIO flags */ > #define QEMU_AIO_MISALIGNED 0x1000 > diff --git a/include/sysemu/block-backend-io.h > b/include/sysemu/block-backend-io.h > index 6835525582..33e35ae5d7 100644 > --- a/include/sysemu/block-backend-io.h > +++ b/include/sysemu/block-backend-io.h > @@ -51,6 +51,9 @@ BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t > offset, > BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >int64_t offset, int64_t len, >BlockCompletionFunc *cb, void *opaque); > +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset, > +QEMUIOVector *qiov, BdrvRequestFlags flags, > +BlockCompletionFunc *cb, void *opaque); > BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t > bytes, > BlockCompletionFunc *cb, void *opaque); > void blk_aio_cancel_async(BlockAIOCB *acb); > @@ -172,6 +175,12 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, > BlockZoneOp op, >int64_t offset, int64_t len); > int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op, > int64_t offset, int64_t len); > +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset, > +QEMUIOVector *qiov, > +BdrvRequestFlags flags); > +int generated_co_wrapper blk_zone_append(BlockBackend *blk, int64_t *offset, > + QEMUIOVector *qiov, > + BdrvRequestFlags flags); > > int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset, >int64_t bytes); > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c > index 446a059603..8868c88290 100644 > --- a/qemu-io-cmds.c > +++ b/qemu-io-cmds.c > @@ -1850,6 +1850,67 @@ static const cmdinfo_t zone_reset_cmd = { > .oneline = "reset a zone write pointer in zone block device", > }; > > +static int do_aio_zone_append(BlockBackend *blk, QEMUIOVector *qiov, > + int64_t *offset, int flags, int *total) > +{ > +int async_ret = NOT_DONE; > + > +blk_aio_zone_append(blk, offset, qiov, flags, aio_rw_done, &async_ret); > +while (async_ret == NOT_DONE) { > +main_loop_wait(false); > +} > + > +*total = qiov->size; > +return async_ret < 0 ? async_ret : 1; > +} > +static int zone_append_f(BlockBackend *blk, int argc, char **argv) { > +int ret; > +//struct timespec t1, t2; > +int flags = 0; > +int total = 0; > +int64_t offset; > +char *buf; > +int nr_iov; > +int pattern = 0xcd; > +QEMUIOVector qiov; > + > +if (optind > argc - 2) { > +return -EINVAL; > +} > +optind++; > +offset = cvtnum(argv[optind]); > +if (offset < 0) { > +print_cvtnum_err(offset, argv[optind]); > +return offset; > +} > +optind++; > +nr_iov = argc - optind; > +buf = create_iovec(blk, &qiov, &argv[optind], nr_iov, pattern); > +if (buf == NULL) { > +return -EINVAL; > +} > +ret = do_aio_zone_append(blk, &qiov, &offset, flags, &total); > +if (ret < 0) { > +printf("zone append failed: %s\n", strerror(-ret)); > +goto out; > +} > + > +out: > +qemu_iovec_destroy(&qiov); > +qemu_io_free(buf); > +return ret; > +} > + > +static const cmdinfo_t zone_append_cmd = { > +.name = "zone_append", > +.altname = "zap", > +.cfunc = zone_append_f, > +.argmin = 3, > +.argmax = 3, > +.args = "offset len [len..]", > +.oneline = "append write a number of bytes at a specified offset", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2647,6 +2708,7 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&zone_close_cmd); > qemuio_add_command(&zone_finish_cmd); > qemuio_add_command(&zone_reset_cmd); > +qemuio_add_command(&zone_append_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); > diff --git a/tests/qemu-iotests/tests/zoned.out > b/tests/qemu-iotests/tests/zoned.out > index 0c8f96deb9..b3b139b4ec 100644 > --- a/tests/qemu-iotests/tests/zoned.out > +++ b/tests/qemu-iotests/tests/zoned.out > @@ -50,4 +50,11 @@ start: 0x8, len 0x8, cap 0x8, wptr 0x10, > zcond:14, [type: 2] > (5) resetting the second zone > After resetting a zone: > start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type: 2] > + > + > +(6) append write > +After appending the first zone: > +start: 0x0, len 0x8, cap 0x8, wptr 0x18, zcond:2, [type: 2] > +After appending the second zone: > +start: 0x8, len 0x8, cap 0x8, wptr 0x80018, zcond:2, [type: 2] > *** done > diff --git a/tests/qemu-iotests/tests/zoned.sh > b/tests/qemu-iotests/tests/zoned.sh > index 871f47efed..b4dc89aaac 100755 > --- a/tests/qemu-iotests/tests/zoned.sh > +++ b/tests/qemu-iotests/tests/zoned.sh > @@ -79,6 +79,15 @@ echo "(5) resetting the second zone" > sudo $QEMU_IO $IMG -c "zrs 268435456 268435456" > echo "After resetting a zone:" > sudo $QEMU_IO $IMG -c "zrp 268435456 1" > +echo > +echo > +echo "(6) append write" # assuming block size of device is 4096 > +sudo $QEMU_IO $IMG -c "zap 0 0x1000 0x2000" > +echo "After appending the first zone:" > +sudo $QEMU_IO $IMG -c "zrp 0 1" > +sudo $QEMU_IO $IMG -c "zap 268435456 0x1000 0x2000" > +echo "After appending the second zone:" > +sudo $QEMU_IO $IMG -c "zrp 268435456 1" > # success, all done > echo "*** done" > rm -f $seq.full -- Damien Le Moal Western Digital Research
Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
On 2022/09/11 15:33, Sam Li wrote: > Damien Le Moal 于2022年9月11日周日 13:31写道: [...] >>> +/* >>> + * zone management operations - Execute an operation on a zone >>> + */ >>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp >>> op, >>> +int64_t offset, int64_t len) { >>> +#if defined(CONFIG_BLKZONED) >>> +BDRVRawState *s = bs->opaque; >>> +RawPosixAIOData acb; >>> +int64_t zone_sector, zone_sector_mask; >>> +const char *zone_op_name; >>> +unsigned long zone_op; >>> +bool is_all = false; >>> + >>> +zone_sector = bs->bl.zone_sectors; >>> +zone_sector_mask = zone_sector - 1; >>> +if (offset & zone_sector_mask) { >>> +error_report("sector offset %" PRId64 " is not aligned to zone >>> size " >>> + "%" PRId64 "", offset, zone_sector); >>> +return -EINVAL; >>> +} >>> + >>> +if (len & zone_sector_mask) { >> >> Linux allows SMR drives to have a smaller last zone. So this needs to be >> accounted for here. Otherwise, a zone operation that includes the last >> smaller >> zone would always fail. Something like this would work: >> >> if (((offset + len) < capacity && >> len & zone_sector_mask) || >> offset + len > capacity) { >> > > I see. I think the offset can be removed, like: > if (((len < capacity && len & zone_sector_mask) || len > capacity) { > Then if we use the previous zone's len for the last smaller zone, it > will be greater than its capacity. Nope, you cannot remove the offset since the zone operation may be for that last zone only, that is, offset == last zone start and len == last zone smaller size. In that case, len is alwats smaller than capacity. > > I will also include "opening the last zone" as a test case later. Note that you can create such smaller last zone on the host with null_blk by specifying a device capacity that is *not* a multiple of the zone size. > >>> +error_report("number of sectors %" PRId64 " is not aligned to zone >>> size" >>> + " %" PRId64 "", len, zone_sector); >>> +return -EINVAL; >>> +} >>> + >>> +switch (op) { >>> +case BLK_ZO_OPEN: >>> +zone_op_name = "BLKOPENZONE"; >>> +zone_op = BLKOPENZONE; >>> +break; >>> +case BLK_ZO_CLOSE: >>> +zone_op_name = "BLKCLOSEZONE"; >>> +zone_op = BLKCLOSEZONE; >>> +break; >>> +case BLK_ZO_FINISH: >>> +zone_op_name = "BLKFINISHZONE"; >>> +zone_op = BLKFINISHZONE; >>> +break; >>> +case BLK_ZO_RESET: >>> +zone_op_name = "BLKRESETZONE"; >>> + zone_op = BLKRESETZONE; >>> +break; >>> +default: >>> +g_assert_not_reached(); >>> +} >>> + >>> +acb = (RawPosixAIOData) { >>> +.bs = bs, >>> +.aio_fildes = s->fd, >>> +.aio_type = QEMU_AIO_ZONE_MGMT, >>> +.aio_offset = offset, >>> +.aio_nbytes = len, >>> +.zone_mgmt = { >>> +.zone_op = zone_op, >>> +.zone_op_name = zone_op_name, >>> +.all = is_all, >>> +}, >>> +}; >>> + >>> +return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb); >>> +#else >>> +return -ENOTSUP; >>> +#endif >>> +} -- Damien Le Moal Western Digital Research
Re: [PATCH] block: introduce zone append write for zoned devices
On 2022/09/11 15:06, Damien Le Moal wrote: > On 2022/09/10 15:38, Sam Li wrote: >> A zone append command is a write operation that specifies the first >> logical block of a zone as the write position. When writing to a zoned >> block device using zone append, the byte offset of the write is pointing >> to the write pointer of that zone. Upon completion the device will >> respond with the position the data has been placed in the zone. > > s/placed/written > > You need to explain more about what this patch does: > > Since Linux does not provide a user API to issue zone append operations to > zoned > devices from user space, the file-posix driver is modified to add zone append > emulation using regular write operations. To do this, the file-posix driver > tracks the wp location of all zones of the device Blah. Thinking more about this, I think you should split this patch in 2: 1) first patch adding the tracking of the zones wp. 2) second patch adding zone append emulation That will make the review far easier. > >> >> Signed-off-by: Sam Li >> --- >> block/block-backend.c | 65 +++ >> block/file-posix.c | 169 - >> block/io.c | 21 >> block/raw-format.c | 7 ++ >> include/block/block-common.h | 2 + >> include/block/block-io.h | 3 + >> include/block/block_int-common.h | 9 ++ >> include/block/raw-aio.h| 4 +- >> include/sysemu/block-backend-io.h | 9 ++ >> qemu-io-cmds.c | 62 +++ >> tests/qemu-iotests/tests/zoned.out | 7 ++ >> tests/qemu-iotests/tests/zoned.sh | 9 ++ >> 12 files changed, 360 insertions(+), 7 deletions(-) >> >> diff --git a/block/block-backend.c b/block/block-backend.c >> index ebe8d7bdf3..b77a1cb24b 100644 >> --- a/block/block-backend.c >> +++ b/block/block-backend.c >> @@ -1439,6 +1439,9 @@ typedef struct BlkRwCo { >> struct { >> BlockZoneOp op; >> } zone_mgmt; >> +struct { >> +int64_t *append_sector; >> +} zone_append; >> }; >> } BlkRwCo; >> >> @@ -1869,6 +1872,47 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, >> BlockZoneOp op, >> return &acb->common; >> } >> >> +static void blk_aio_zone_append_entry(void *opaque) { >> +BlkAioEmAIOCB *acb = opaque; >> +BlkRwCo *rwco = &acb->rwco; >> + >> +rwco->ret = blk_co_zone_append(rwco->blk, >> rwco->zone_append.append_sector, >> + rwco->iobuf, rwco->flags); >> +blk_aio_complete(acb); >> +} >> + >> +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset, >> +QEMUIOVector *qiov, BdrvRequestFlags flags, >> +BlockCompletionFunc *cb, void *opaque) { >> +BlkAioEmAIOCB *acb; >> +Coroutine *co; >> +IO_CODE(); >> + >> +blk_inc_in_flight(blk); >> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque); >> +acb->rwco = (BlkRwCo) { >> +.blk= blk, >> +.ret= NOT_DONE, >> +.flags = flags, >> +.iobuf = qiov, >> +.zone_append = { >> +.append_sector = offset, >> +}, >> +}; >> +acb->has_returned = false; >> + >> +co = qemu_coroutine_create(blk_aio_zone_append_entry, acb); >> +bdrv_coroutine_enter(blk_bs(blk), co); >> + >> +acb->has_returned = true; >> +if (acb->rwco.ret != NOT_DONE) { >> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk), >> + blk_aio_complete_bh, acb); >> +} >> + >> +return &acb->common; >> +} >> + >> /* >> * Send a zone_report command. >> * offset is a byte offset from the start of the device. No alignment >> @@ -1920,6 +1964,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, >> BlockZoneOp op, >> return ret; >> } >> >> +/* >> + * Send a zone_append command. >> + */ >> +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset, >> +QEMUIOVector *qiov, BdrvRequestFlags flags) >> +{ >> +int ret; >> +IO_CODE(); >> + >> +blk_inc_in_flight(blk); >> +blk_wait_while_drained(blk); >&g
Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
', cc.has_header('linux/blkzoned.h')) > config_host_data.set('CONFIG_VALGRIND_H', > cc.has_header('valgrind/valgrind.h')) > config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h')) > config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h')) > diff --git a/qapi/block-core.json b/qapi/block-core.json > index 2173e7734a..c6bbb7a037 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -2942,6 +2942,7 @@ > # @compress: Since 5.0 > # @copy-before-write: Since 6.2 > # @snapshot-access: Since 7.0 > +# @zoned_host_device: Since 7.2 > # > # Since: 2.9 > ## > @@ -2955,7 +2956,8 @@ > 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', > 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', > { 'name': 'replication', 'if': 'CONFIG_REPLICATION' }, > -'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } > +'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', > +{ 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] } > > ## > # @BlockdevOptionsFile: > @@ -4329,7 +4331,9 @@ >'vhdx': 'BlockdevOptionsGenericFormat', >'vmdk': 'BlockdevOptionsGenericCOWFormat', >'vpc':'BlockdevOptionsGenericFormat', > - 'vvfat': 'BlockdevOptionsVVFAT' > + 'vvfat': 'BlockdevOptionsVVFAT', > + 'zoned_host_device': { 'type': 'BlockdevOptionsFile', > + 'if': 'CONFIG_BLKZONED' } >} } > > ## > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c > index 952dc940f1..446a059603 100644 > --- a/qemu-io-cmds.c > +++ b/qemu-io-cmds.c > @@ -1712,6 +1712,144 @@ static const cmdinfo_t flush_cmd = { > .oneline= "flush all in-core file state to disk", > }; > > +static int zone_report_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset; > +unsigned int nr_zones; > + > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +nr_zones = cvtnum(argv[optind]); > + > +g_autofree BlockZoneDescriptor *zones = NULL; > +zones = g_new(BlockZoneDescriptor, nr_zones); > +ret = blk_zone_report(blk, offset, &nr_zones, zones); > +if (ret < 0) { > +printf("zone report failed: %s\n", strerror(-ret)); > +} else { > +for (int i = 0; i < nr_zones; ++i) { > +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " > + "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", " > + "zcond:%u, [type: %u]\n", > + zones[i].start, zones[i].length, zones[i].cap, > zones[i].wp, > + zones[i].cond, zones[i].type); > + } > +} > +return ret; > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "zrp", > +.cfunc = zone_report_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset number", > +.oneline = "report zone information", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); > +if (ret < 0) { > +printf("zone open failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "zo", > +.cfunc = zone_open_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); > +if (ret < 0) { > +printf("zone close failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "zc", > +.cfunc = zone_close_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); > +if (ret < 0) { > +printf("zone finish failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "zf", > +.cfunc = zone_finish_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); > +if (ret < 0) { > +printf("zone reset failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "zrs", > +.cfunc = zone_reset_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2504,6 +2642,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); -- Damien Le Moal Western Digital Research
Re: [PATCH v9 5/7] config: add check to block layer
On 2022/09/11 15:54, Sam Li wrote: > Damien Le Moal 于2022年9月11日周日 13:34写道: >> >> On 2022/09/10 14:27, Sam Li wrote: >>> Putting zoned/non-zoned BlockDrivers on top of each other is not >>> allowed. >>> >>> Signed-off-by: Sam Li >>> Reviewed-by: Stefan Hajnoczi >>> --- >>> block.c | 14 ++ >>> block/file-posix.c | 14 ++ >>> block/raw-format.c | 1 + >>> include/block/block_int-common.h | 5 + >>> 4 files changed, 34 insertions(+) >>> >>> diff --git a/block.c b/block.c >>> index bc85f46eed..dad2ed3959 100644 >>> --- a/block.c >>> +++ b/block.c >>> @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, >>> BlockDriverState *child_bs, >>> return; >>> } >>> >>> +/* >>> + * Non-zoned block drivers do not follow zoned storage constraints >>> + * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned >>> + * drivers in a graph. >>> + */ >>> +if (!parent_bs->drv->supports_zoned_children && >>> +child_bs->bl.zoned == BLK_Z_HM) { >> >> Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ? > > The host-aware model allows zoned storage constraints(sequentially > write) and random write. Is mixing HA and non-zoned drivers allowed? > What's the difference? Yes, HA devices can be used as regular devices too. If you are allowing this here, then add a comment explaining it. It may also be good to add a message like "Using host-aware device as a regular device" here for the HA case. > >> >>> +error_setg(errp, "Cannot add a %s child to a %s parent", >>> + child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned", >>> + parent_bs->drv->supports_zoned_children ? >>> + "support zoned children" : "not support zoned >>> children"); >>> +return; >>> +} >>> + >>> if (!QLIST_EMPTY(&child_bs->parents)) { >>> error_setg(errp, "The node %s already has a parent", >>> child_bs->node_name); >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index 4edfa25d04..354de22860 100644 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict >>> *options, >>> goto fail; >>> } >>> } >>> +#ifdef CONFIG_BLKZONED >>> +/* >>> + * The kernel page chache does not reliably work for writes to SWR >>> zones >>> + * of zoned block device because it can not guarantee the order of >>> writes. >>> + */ >>> +if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) { >>> +if (!(s->open_flags & O_DIRECT)) { >>> +error_setg(errp, "driver=zoned_host_device was specified, but >>> it " >>> + "requires cache.direct=on, which was not >>> specified."); >>> +ret = -EINVAL; >> >> This line is not needed. Simply "return -EINVAL;". >> >>> +return ret; /* No host kernel page cache */ >>> +} >>> +} >>> +#endif >>> >>> if (S_ISBLK(st.st_mode)) { >>> #ifdef BLKDISCARDZEROES >>> diff --git a/block/raw-format.c b/block/raw-format.c >>> index 6b20bd22ef..9441536819 100644 >>> --- a/block/raw-format.c >>> +++ b/block/raw-format.c >>> @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, >>> BdrvChild *c, >>> BlockDriver bdrv_raw = { >>> .format_name = "raw", >>> .instance_size= sizeof(BDRVRawState), >>> +.supports_zoned_children = true, >>> .bdrv_probe = &raw_probe, >>> .bdrv_reopen_prepare = &raw_reopen_prepare, >>> .bdrv_reopen_commit = &raw_reopen_commit, >>> diff --git a/include/block/block_int-common.h >>> b/include/block/block_int-common.h >>> index 078ddd7e67..043aa161a0 100644 >>> --- a/include/block/block_int-common.h >>> +++ b/include/block/block_int-common.h >>> @@ -127,6 +127,11 @@ struct BlockDriver { >>> */ >>> bool is_format; >>> >>> +/* >>> + * Set to true if the BlockDriver supports zoned children. >>> + */ >>> +bool supports_zoned_children; >>> + >>> /* >>> * Drivers not implementing bdrv_parse_filename nor bdrv_open should >>> have >>> * this field set to true, except ones that are defined only by their >> >> -- >> Damien Le Moal >> Western Digital Research >> -- Damien Le Moal Western Digital Research
Re: [PATCH] block: introduce zone append write for zoned devices
is not aligned to >>> block " >>> +"size %" PRId64 "", i, iov_len, lbsz); >>> + return -EINVAL; >>> + } >> >> This alignment check should be against the device write granularity, not the >> logical block size. The write granularity will always be equal to the device >> physical block size, which may or may not be equal to the device logical >> block >> size. E.g. a 512e SMR disk has a 512B logical block size but a 4096B physical >> block size. And the ZBC & ZAC specifications mandate that all write be >> aligned >> to the physical block size. > > I see. I'll change it to physical block size. I would use a filed called "write_granularity" since the virtio specs will introduce that anyway. This zone granularity is going to be indeed equal to the physical block size of the host device for now. [...] >>> /* removable device specific */ >>> bool (*bdrv_is_inserted)(BlockDriverState *bs); >>> @@ -854,6 +857,12 @@ typedef struct BlockLimits { >>> >>> /* maximum number of active zones */ >>> int64_t max_active_zones; >>> + >>> +/* array of zones in the zoned block device. Only tracks write >>> pointer's >>> + * location of each zone as a helper for zone_append API */ >>> +BlockZoneDescriptor *zones; >> >> This is a lot of memory for only tracking the wp... Why not reduce this to an >> array of int64 values, for the wp only ? As you may need the zone type too >> (conventional vs sequential), you can use the most significant bit (a zone wp >> value will never use all 64 bits !). >> >> Or device another zone structure with zone type, zone wp and mutex only, so >> smaller than the regular zone report structure. > > I was just trying to reuse do_zone_report. It is better to only track > the wp only. Then a new struct and smaller zone_report should be > introduced for it. Yes, this will use less memory, which is always good. -- Damien Le Moal Western Digital Research
Re: [PATCH v9 1/7] include: add zoned device structs
On 9/19/22 09:50, Sam Li wrote: > Stefan Hajnoczi 于2022年9月18日周日 04:17写道: >> >> On Thu, Sep 15, 2022 at 06:06:38PM +0800, Sam Li wrote: >>> Eric Blake 于2022年9月15日周四 16:05写道: >>>> >>>> On Sat, Sep 10, 2022 at 01:27:53PM +0800, Sam Li wrote: >>>>> Signed-off-by: Sam Li >>>>> Reviewed-by: Stefan Hajnoczi >>>>> Reviewed-by: Damien Le Moal >>>>> --- >>>>> include/block/block-common.h | 43 >>>>> 1 file changed, 43 insertions(+) >>>>> >>>>> diff --git a/include/block/block-common.h b/include/block/block-common.h >>>>> index fdb7306e78..36bd0e480e 100644 >>>>> --- a/include/block/block-common.h >>>>> +++ b/include/block/block-common.h >>>>> @@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver; >>>>> typedef struct BdrvChild BdrvChild; >>>>> typedef struct BdrvChildClass BdrvChildClass; >>>>> >>>>> +typedef enum BlockZoneOp { >>>>> +BLK_ZO_OPEN, >>>>> +BLK_ZO_CLOSE, >>>>> +BLK_ZO_FINISH, >>>>> +BLK_ZO_RESET, >>>>> +} BlockZoneOp; >>>>> + >>>>> +typedef enum BlockZoneModel { >>>>> +BLK_Z_NONE = 0x0, /* Regular block device */ >>>>> +BLK_Z_HM = 0x1, /* Host-managed zoned block device */ >>>>> +BLK_Z_HA = 0x2, /* Host-aware zoned block device */ >>>>> +} BlockZoneModel; >>>>> + >>>>> +typedef enum BlockZoneCondition { >>>>> +BLK_ZS_NOT_WP = 0x0, >>>>> +BLK_ZS_EMPTY = 0x1, >>>>> +BLK_ZS_IOPEN = 0x2, >>>>> +BLK_ZS_EOPEN = 0x3, >>>>> +BLK_ZS_CLOSED = 0x4, >>>>> +BLK_ZS_RDONLY = 0xD, >>>>> +BLK_ZS_FULL = 0xE, >>>>> +BLK_ZS_OFFLINE = 0xF, >>>>> +} BlockZoneCondition; >>>>> + >>>>> +typedef enum BlockZoneType { >>>>> +BLK_ZT_CONV = 0x1, /* Conventional random writes supported */ >>>>> +BLK_ZT_SWR = 0x2, /* Sequential writes required */ >>>>> +BLK_ZT_SWP = 0x3, /* Sequential writes preferred */ >>>>> +} BlockZoneType; >>>>> + >>>>> +/* >>>>> + * Zone descriptor data structure. >>>>> + * Provides information on a zone with all position and size values in >>>>> bytes. >>>> >>>> I'm glad that you chose bytes here for use in qemu. But since the >>>> kernel struct blk_zone uses sectors instead of bytes, is it worth >>>> adding a sentence that we intentionally use bytes here, different from >>>> Linux, to make it easier for reviewers to realize that scaling when >>>> translating between qemu and kernel is necessary? >>> >>> Sorry about the unit mistake. The zone information is in sectors which >>> is the same as kernel struct blk_zone. I think adding a sentence to >>> inform the sector unit makes it clear what the zone descriptor is. >> >> I'd make the units bytes for consistency with the rest of the QEMU block >> layer. For example, the MapEntry structure that "qemu-img map" reports >> has names with similar fields and they are in bytes: >> >> struct MapEntry { >> int64_t start; >> int64_t length; >> > > I think the zone descriptor uses sector units because ioctl() will > report zones in sector units. Making blk_zone.offset = > zone_descriptor.offset is more convenient than using byte units where > it needs make conversions twice(sector -> byte -> sector in zone > descriptors and offset argument in bdrv_co_zone_report). The MapEntry > uses byte units because lseek() in bdrv_co_block_status suggests the > file offset is set to bytes and I think it may be why the rest of the > block layer uses bytes(not sure). > > I do not object to using bytes here but it would require some > compromises. If I was wrong about anything, please let me know. The conversion can be done using 9-bits left and right shifts, which are cheap to do. I think it is important to be consistent with qemu block API, so using for the API bytes is preferred. That will avoid confusions. > > > Sam -- Damien Le Moal Western Digital Research
Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
On 9/20/22 17:51, Klaus Jensen wrote: On Sep 10 13:27, Sam Li wrote: Add a new zoned_host_device BlockDriver. The zoned_host_device option accepts only zoned host block devices. By adding zone management operations in this new BlockDriver, users can use the new block layer APIs including Report Zone and four zone management operations (open, close, finish, reset). Qemu-io uses the new APIs to perform zoned storage commands of the device: zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs), zone_finish(zf). For example, to test zone_report, use following command: $ ./build/qemu-io --image-opts -n driver=zoned_host_device, filename=/dev/nullb0 -c "zrp offset nr_zones" Signed-off-by: Sam Li Reviewed-by: Hannes Reinecke --- block/block-backend.c | 145 ++ block/file-posix.c| 323 +- block/io.c| 41 include/block/block-io.h | 7 + include/block/block_int-common.h | 21 ++ include/block/raw-aio.h | 6 +- include/sysemu/block-backend-io.h | 17 ++ meson.build | 1 + qapi/block-core.json | 8 +- qemu-io-cmds.c| 143 + 10 files changed, 708 insertions(+), 4 deletions(-) +/* + * zone management operations - Execute an operation on a zone + */ +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, +int64_t offset, int64_t len) { +#if defined(CONFIG_BLKZONED) +BDRVRawState *s = bs->opaque; +RawPosixAIOData acb; +int64_t zone_sector, zone_sector_mask; +const char *zone_op_name; +unsigned long zone_op; +bool is_all = false; + +zone_sector = bs->bl.zone_sectors; +zone_sector_mask = zone_sector - 1; +if (offset & zone_sector_mask) { +error_report("sector offset %" PRId64 " is not aligned to zone size " + "%" PRId64 "", offset, zone_sector); +return -EINVAL; +} + +if (len & zone_sector_mask) { +error_report("number of sectors %" PRId64 " is not aligned to zone size" + " %" PRId64 "", len, zone_sector); +return -EINVAL; +} These checks impose a power-of-two constraint on the zone size. Can they be changed to divisions to lift that constraint? I don't see anything in this patch set that relies on power of two zone sizes. Given that Linux will only expose zoned devices that have a zone size that is a power of 2 number of LBAs, this will work as is and avoid divisions in the IO path. But given that zone management operations are not performance critical, generalizing the code should be fine. However, once we start adding the code for full zone emulation on top of a regular file or qcow image, sector-to-zone conversions requiring divisions will hurt. So I really would prefer the code be left as-is for now. -- Damien Le Moal Western Digital Research
Re: [PATCH v10 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
On 10/7/22 00:17, Stefan Hajnoczi wrote: > On Tue, Oct 04, 2022 at 08:23:15AM +0900, Damien Le Moal wrote: >> On 2022/10/04 2:47, Stefan Hajnoczi wrote: >>> On Thu, Sep 29, 2022 at 04:36:27PM +0800, Sam Li wrote: >>>> Add a new zoned_host_device BlockDriver. The zoned_host_device option >>>> accepts only zoned host block devices. By adding zone management >>>> operations in this new BlockDriver, users can use the new block >>>> layer APIs including Report Zone and four zone management operations >>>> (open, close, finish, reset). >>>> >>>> Qemu-io uses the new APIs to perform zoned storage commands of the device: >>>> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs), >>>> zone_finish(zf). >>>> >>>> For example, to test zone_report, use following command: >>>> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, >>>> filename=/dev/nullb0 >>>> -c "zrp offset nr_zones" >>>> >>>> Signed-off-by: Sam Li >>>> Reviewed-by: Hannes Reinecke >>>> --- >>>> block/block-backend.c | 146 + >>>> block/file-posix.c| 340 +- >>>> block/io.c| 41 >>>> include/block/block-common.h | 4 + >>>> include/block/block-io.h | 7 + >>>> include/block/block_int-common.h | 24 +++ >>>> include/block/raw-aio.h | 6 +- >>>> include/sysemu/block-backend-io.h | 17 ++ >>>> meson.build | 4 + >>>> qapi/block-core.json | 8 +- >>>> qemu-io-cmds.c| 148 + >>>> 11 files changed, 741 insertions(+), 4 deletions(-) >>>> >>>> diff --git a/block/block-backend.c b/block/block-backend.c >>>> index d4a5df2ac2..f7f7acd6f4 100644 >>>> --- a/block/block-backend.c >>>> +++ b/block/block-backend.c >>>> @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo { >>>> void *iobuf; >>>> int ret; >>>> BdrvRequestFlags flags; >>>> +union { >>>> +struct { >>>> +unsigned int *nr_zones; >>>> +BlockZoneDescriptor *zones; >>>> +} zone_report; >>>> +struct { >>>> +BlockZoneOp op; >>>> +} zone_mgmt; >>>> +}; >>>> } BlkRwCo; >>>> >>>> int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags) >>>> @@ -1775,6 +1784,143 @@ int coroutine_fn blk_co_flush(BlockBackend *blk) >>>> return ret; >>>> } >>>> >>>> +static void blk_aio_zone_report_entry(void *opaque) { >>> >>> >>> The coroutine_fn annotation is missing: >>> >>> static void coroutine_fn blk_aio_zone_report_entry(void *opaque) { >>> >>>> +BlkAioEmAIOCB *acb = opaque; >>>> +BlkRwCo *rwco = &acb->rwco; >>>> + >>>> +rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset, >>>> + rwco->zone_report.nr_zones, >>>> + rwco->zone_report.zones); >>>> +blk_aio_complete(acb); >>>> +} >>>> + >>>> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset, >>>> +unsigned int *nr_zones, >>>> +BlockZoneDescriptor *zones, >>>> +BlockCompletionFunc *cb, void *opaque) >>>> +{ >>>> +BlkAioEmAIOCB *acb; >>>> +Coroutine *co; >>>> +IO_CODE(); >>>> + >>>> +blk_inc_in_flight(blk); >>>> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque); >>>> +acb->rwco = (BlkRwCo) { >>>> +.blk= blk, >>>> +.offset = offset, >>>> +.ret= NOT_DONE, >>>> +.zone_report = { >>>> +.zones = zones, >>>> +.nr_zones = nr_zones, >>>> +}, >>>> +}; >>>> +acb->has_returned = false; >>>> + >>>> +co = qemu_coroutine_create(blk_aio_zone_report_entry, acb); >>>> +bdrv_coroutine_e
Re: [PATCH v11 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
escriptor, nr_zones); > +ret = blk_zone_report(blk, offset, &nr_zones, zones); > +if (ret < 0) { > +printf("zone report failed: %s\n", strerror(-ret)); > +} else { > +for (int i = 0; i < nr_zones; ++i) { > +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " > + "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", " > + "zcond:%u, [type: %u]\n", > +tosector(zones[i].start), tosector(zones[i].length), > +tosector(zones[i].cap), tosector(zones[i].wp), > +zones[i].cond, zones[i].type); > +} > +} > +return ret; > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "zrp", > +.cfunc = zone_report_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset number", > +.oneline = "report zone information", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); > +if (ret < 0) { > +printf("zone open failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "zo", > +.cfunc = zone_open_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); > +if (ret < 0) { > +printf("zone close failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "zc", > +.cfunc = zone_close_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); > +if (ret < 0) { > +printf("zone finish failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "zf", > +.cfunc = zone_finish_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); > +if (ret < 0) { > +printf("zone reset failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "zrs", > +.cfunc = zone_reset_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2504,6 +2647,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); -- Damien Le Moal Western Digital Research
Re: [PATCH v11 7/7] docs/zoned-storage: add zoned device documentation
On 10/10/22 11:21, Sam Li wrote: > Add the documentation about the zoned device support to virtio-blk > emulation. > > Signed-off-by: Sam Li > Reviewed-by: Stefan Hajnoczi > --- > docs/devel/zoned-storage.rst | 40 ++ > docs/system/qemu-block-drivers.rst.inc | 6 > 2 files changed, 46 insertions(+) > create mode 100644 docs/devel/zoned-storage.rst > > diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst > new file mode 100644 > index 00..deaa4ce99b > --- /dev/null > +++ b/docs/devel/zoned-storage.rst > @@ -0,0 +1,40 @@ > += > +zoned-storage > += > + > +Zoned Block Devices (ZBDs) devide the LBA space into block regions called > zones > +that are larger than the LBA size. They can only allow sequential writes, > which > +can reduce write amplification in SSDs, and potentially lead to higher > +throughput and increased capacity. More details about ZBDs can be found at: > + > +https://zonedstorage.io/docs/introduction/zoned-storage > + > +1. Block layer APIs for zoned storage > +- > +QEMU block layer has three zoned storage model: > +- BLK_Z_HM: This model only allows sequential writes access. It supports a > set - BLK_Z_HM: The host-managed zoned model... > +of ZBD-specific I/O request that used by the host to manage device zones. ...of ZBD-specific commands that can be used by a host to manage the zones of a device. > +- BLK_Z_HA: It deals with both sequential writes and random writes access. - BLK_Z_HA: The host-aware zoned model allows random write operations in zones, making it backward compatible with regular block devices. > +- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as > +non-zoned devices. > + > +The block device information resides inside BlockDriverState. QEMU uses > +BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the > +block layer while processing I/O requests. A BlockBackend has a root pointer > to > +a BlockDriverState graph(for example, raw format on top of file-posix). The > +zoned storage information can be propagated from the leaf BlockDriverState > all > +the way up to the BlockBackend. If the zoned storage model in file-posix is > +set to BLK_Z_HM, then block drivers will declare support for zoned host > device. > + > +The block layer APIs support commands needed for zoned storage devices, > +including report zones, four zone operations, and zone append. > + > +2. Emulating zoned storage controllers > +-- > +When the BlockBackend's BlockLimits model reports a zoned storage device, > users > +like the virtio-blk emulation or the qemu-io-cmds.c utility can use block > layer > +APIs for zoned storage emulation or testing. > + > +For example, to test zone_report on a null_blk device using qemu-io is: > +$ path/to/qemu-io --image-opts -n > driver=zoned_host_device,filename=/dev/nullb0 > +-c "zrp offset nr_zones" > diff --git a/docs/system/qemu-block-drivers.rst.inc > b/docs/system/qemu-block-drivers.rst.inc > index dfe5d2293d..0b97227fd9 100644 > --- a/docs/system/qemu-block-drivers.rst.inc > +++ b/docs/system/qemu-block-drivers.rst.inc > @@ -430,6 +430,12 @@ Hard disks >you may corrupt your host data (use the ``-snapshot`` command >line option or modify the device permissions accordingly). > > +Zoned block devices > + Zoned block devices can be passed through to the guest if the emulated > storage > + controller supports zoned storage. Use ``--blockdev zoned_host_device, > + node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0`` > + as ``drive0``. > + > Windows > ^^^ > With the above nits fixed, feel free to add: Reviewed-by: Damien Le Moal -- Damien Le Moal Western Digital Research
Re: [PATCH v3 1/3] file-posix:add the tracking of the zones write pointers
ctors = nr_sectors; > @@ -2035,11 +2154,41 @@ static int handle_aiocb_zone_mgmt(void *opaque) { > ret = ioctl(fd, aiocb->zone_mgmt.zone_op, &range); > } while (ret != 0 && errno == EINTR); > if (ret != 0) { > +update_zones_wp(0, aiocb->aio_fildes, aiocb->bs->bl.wps, > +aiocb->bs->bl.nr_zones); You need only to update the range of zones that was specified for the management option, not all zones. So you must specify the zone management aio offset and size/zone_size here. > ret = -errno; > error_report("ioctl %s failed %d", aiocb->zone_mgmt.zone_op_name, > ret); > return ret; > } > + > +qemu_mutex_lock(&wps->lock); > +if (!BDRV_ZT_IS_CONV(wps->wp[index])) { > + /* > + * The zoned device allows the last zone smaller that the zone size. > + */ comment indentation is off. > +if (aiocb->aio_nbytes < bs->bl.zone_size) { > +wend_offset = aiocb->aio_offset + aiocb->aio_nbytes; > +} else { > +wend_offset = aiocb->aio_offset + bs->bl.zone_size; > +} > + > +if (aiocb->aio_offset != wps->wp[index] && > +aiocb->zone_mgmt.zone_op == BLKRESETZONE) { I do not understand the condition here. Why do you have "aiocb->aio_offset != wps->wp[index]" ? > +if (aiocb->zone_mgmt.all) { This is the only place where you use this all boolean field. For simplicity, I would drop this field completely and test that aiocb->aio_offset == 0 && aiocb->aio_nbytes == bs->bl.capacity to detect a reset all zones operation. > +for (int i = 0; i < bs->bl.nr_zones; ++i) { > +wps->wp[i] = i * bs->bl.zone_size; You are not handling conventional zones here. For conventional zones, you should not change the value. Otherwise, BDRV_ZT_IS_CONV() will always return false after this. > +} > +} else { > +wps->wp[index] = aiocb->aio_offset; > +} > +} else if (aiocb->aio_offset != wps->wp[index] && > +aiocb->zone_mgmt.zone_op == BLKFINISHZONE) { Same comment here. Why do you have "aiocb->aio_offset != wps->wp[index]" ? > +wps->wp[index] = wend_offset; > +} > +} > +qemu_mutex_unlock(&wps->lock); > + > return ret; > } > #endif > @@ -2480,6 +2629,12 @@ static void raw_close(BlockDriverState *bs) > BDRVRawState *s = bs->opaque; > > if (s->fd >= 0) { > +#if defined(CONFIG_BLKZONED) > +if (bs->bl.wps) { > +qemu_mutex_destroy(&bs->bl.wps->lock); > +g_free(bs->bl.wps); > +} > +#endif > qemu_close(s->fd); > s->fd = -1; > } > @@ -3278,6 +3433,7 @@ static int coroutine_fn > raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, > int64_t zone_size, zone_size_mask; > const char *zone_op_name; > unsigned long zone_op; > +bool is_all = false; > > zone_size = bs->bl.zone_size; > zone_size_mask = zone_size - 1; > @@ -3314,6 +3470,7 @@ static int coroutine_fn > raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, > case BLK_ZO_RESET_ALL: > zone_op_name = "BLKRESETZONE"; > zone_op = BLKRESETZONE; > +is_all = true; > break; > default: > g_assert_not_reached(); > @@ -3328,6 +3485,7 @@ static int coroutine_fn > raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, > .zone_mgmt = { > .zone_op = zone_op, > .zone_op_name = zone_op_name, > +.all = is_all, > }, > }; > > diff --git a/include/block/block-common.h b/include/block/block-common.h > index 882de6825e..b8b2dba64a 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -93,6 +93,14 @@ typedef struct BlockZoneDescriptor { > BlockZoneCondition cond; > } BlockZoneDescriptor; > > +/* > + * Track write pointers of a zone in bytes. > + */ > +typedef struct BlockZoneWps { > +QemuMutex lock; > +uint64_t wp[]; > +} BlockZoneWps; > + > typedef struct BlockDriverInfo { > /* in bytes, 0 if irrelevant */ > int cluster_size; > @@ -206,6 +214,12 @@ typedef enum { > #define BDRV_SECTOR_BITS 9 > #define BDRV_SECTOR_SIZE (1ULL << BDRV_SECTOR_BITS) > > +/* > + * Get the first most significant bit of wp. If it is zero, then > + * the zone type is SWR. > + */ > +#define BDRV_ZT_IS_CONV(wp)(wp & (1ULL << 63)) > + > #define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \ > INT_MAX >> BDRV_SECTOR_BITS) > #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS) > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 37dddc603c..59c2d1316d 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -857,6 +857,11 @@ typedef struct BlockLimits { > > /* device capacity expressed in bytes */ > int64_t capacity; > + > +/* array of write pointers' location of each zone in the zoned device. */ > +BlockZoneWps *wps; > + > +int64_t write_granularity; > } BlockLimits; > > typedef struct BdrvOpBlocker BdrvOpBlocker; -- Damien Le Moal Western Digital Research
Re: [PATCH v11 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
On 10/13/22 14:33, Sam Li wrote: > Damien Le Moal 于2022年10月13日周四 12:41写道: >> >> On 10/10/22 11:21, Sam Li wrote: >>> Add a new zoned_host_device BlockDriver. The zoned_host_device option >>> accepts only zoned host block devices. By adding zone management >>> operations in this new BlockDriver, users can use the new block >>> layer APIs including Report Zone and four zone management operations >>> (open, close, finish, reset, reset_all). >>> >>> Qemu-io uses the new APIs to perform zoned storage commands of the device: >>> zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs), >>> zone_finish(zf). >>> >>> For example, to test zone_report, use following command: >>> $ ./build/qemu-io --image-opts -n driver=zoned_host_device, >>> filename=/dev/nullb0 >>> -c "zrp offset nr_zones" >>> >>> Signed-off-by: Sam Li >>> Reviewed-by: Hannes Reinecke >>> --- >>> block/block-backend.c | 146 + >>> block/file-posix.c| 329 ++ >>> block/io.c| 41 >>> include/block/block-common.h | 1 + >>> include/block/block-io.h | 7 + >>> include/block/block_int-common.h | 24 +++ >>> include/block/raw-aio.h | 6 +- >>> include/sysemu/block-backend-io.h | 17 ++ >>> meson.build | 4 + >>> qapi/block-core.json | 8 +- >>> qemu-io-cmds.c| 148 ++ >>> 11 files changed, 728 insertions(+), 3 deletions(-) >>> >>> diff --git a/block/block-backend.c b/block/block-backend.c >>> index d4a5df2ac2..ddc569e3ac 100644 >>> --- a/block/block-backend.c >>> +++ b/block/block-backend.c >>> @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo { >>> void *iobuf; >>> int ret; >>> BdrvRequestFlags flags; >>> +union { >>> +struct { >>> +unsigned int *nr_zones; >>> +BlockZoneDescriptor *zones; >>> +} zone_report; >>> +struct { >>> +BlockZoneOp op; >>> +} zone_mgmt; >>> +}; >>> } BlkRwCo; >>> >>> int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags) >>> @@ -1775,6 +1784,143 @@ int coroutine_fn blk_co_flush(BlockBackend *blk) >>> return ret; >>> } >>> >>> +static void coroutine_fn blk_aio_zone_report_entry(void *opaque) { >>> +BlkAioEmAIOCB *acb = opaque; >>> +BlkRwCo *rwco = &acb->rwco; >>> + >>> +rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset, >>> + rwco->zone_report.nr_zones, >>> + rwco->zone_report.zones); >>> +blk_aio_complete(acb); >>> +} >>> + >>> +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset, >>> +unsigned int *nr_zones, >>> +BlockZoneDescriptor *zones, >>> +BlockCompletionFunc *cb, void *opaque) >>> +{ >>> +BlkAioEmAIOCB *acb; >>> +Coroutine *co; >>> +IO_CODE(); >>> + >>> +blk_inc_in_flight(blk); >>> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque); >>> +acb->rwco = (BlkRwCo) { >>> +.blk= blk, >>> +.offset = offset, >>> +.ret= NOT_DONE, >>> +.zone_report = { >>> +.zones = zones, >>> +.nr_zones = nr_zones, >>> +}, >>> +}; >>> +acb->has_returned = false; >>> + >>> +co = qemu_coroutine_create(blk_aio_zone_report_entry, acb); >>> +bdrv_coroutine_enter(blk_bs(blk), co); >>> + >>> +acb->has_returned = true; >>> +if (acb->rwco.ret != NOT_DONE) { >>> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk), >>> + blk_aio_complete_bh, acb); >>> +} >>> + >>> +return &acb->common; >>> +} >>> + >>> +static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque) { >>> +BlkAioEmAIOCB *acb = opaque; >>> +BlkRwCo *rwco = &acb->rwco; >>> + >>> +rwco-&g
Re: [PATCH v3 2/3] block: introduce zone append write for zoned devices
> --- a/include/block/raw-aio.h > +++ b/include/block/raw-aio.h > @@ -31,6 +31,7 @@ > #define QEMU_AIO_TRUNCATE 0x0080 > #define QEMU_AIO_ZONE_REPORT 0x0100 > #define QEMU_AIO_ZONE_MGMT0x0200 > +#define QEMU_AIO_ZONE_APPEND 0x0400 > #define QEMU_AIO_TYPE_MASK \ > (QEMU_AIO_READ | \ > QEMU_AIO_WRITE | \ > @@ -41,7 +42,8 @@ > QEMU_AIO_COPY_RANGE | \ > QEMU_AIO_TRUNCATE | \ > QEMU_AIO_ZONE_REPORT | \ > - QEMU_AIO_ZONE_MGMT) > + QEMU_AIO_ZONE_MGMT | \ > + QEMU_AIO_ZONE_APPEND) > > /* AIO flags */ > #define QEMU_AIO_MISALIGNED 0x1000 > diff --git a/include/sysemu/block-backend-io.h > b/include/sysemu/block-backend-io.h > index 6835525582..33e35ae5d7 100644 > --- a/include/sysemu/block-backend-io.h > +++ b/include/sysemu/block-backend-io.h > @@ -51,6 +51,9 @@ BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t > offset, > BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >int64_t offset, int64_t len, >BlockCompletionFunc *cb, void *opaque); > +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset, > +QEMUIOVector *qiov, BdrvRequestFlags flags, > +BlockCompletionFunc *cb, void *opaque); > BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t > bytes, > BlockCompletionFunc *cb, void *opaque); > void blk_aio_cancel_async(BlockAIOCB *acb); > @@ -172,6 +175,12 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, > BlockZoneOp op, >int64_t offset, int64_t len); > int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op, > int64_t offset, int64_t len); > +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset, > +QEMUIOVector *qiov, > +BdrvRequestFlags flags); > +int generated_co_wrapper blk_zone_append(BlockBackend *blk, int64_t *offset, > + QEMUIOVector *qiov, > + BdrvRequestFlags flags); > > int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset, >int64_t bytes); -- Damien Le Moal Western Digital Research
Re: [PATCH v3 1/3] file-posix:add the tracking of the zones write pointers
On 2022/10/13 16:08, Sam Li wrote: > Damien Le Moal 于2022年10月13日周四 13:13写道: >> >> On 10/10/22 11:33, Sam Li wrote: >>> Since Linux doesn't have a user API to issue zone append operations to >>> zoned devices from user space, the file-posix driver is modified to add >>> zone append emulation using regular writes. To do this, the file-posix >>> driver tracks the wp location of all zones of the device. It uses an >>> array of uint64_t. The most significant bit of each wp location indicates >>> if the zone type is conventional zones. >>> >>> The zones wp can be changed due to the following operations issued: >>> - zone reset: change the wp to the start offset of that zone >>> - zone finish: change to the end location of that zone >>> - write to a zone >>> - zone append >>> >>> Signed-off-by: Sam Li >>> --- >>> block/file-posix.c | 158 +++ >>> include/block/block-common.h | 14 +++ >>> include/block/block_int-common.h | 5 + >>> 3 files changed, 177 insertions(+) >>> >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index a9d347292e..17c0b58158 100755 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -206,6 +206,7 @@ typedef struct RawPosixAIOData { >>> struct { >>> struct iovec *iov; >>> int niov; >>> +int64_t *append_sector; >> >> This should be added as part of patch 2. You do not need this to track >> the wp of zones in this patch. >> >>> } io; >>> struct { >>> uint64_t cmd; >>> @@ -226,6 +227,7 @@ typedef struct RawPosixAIOData { >>> struct { >>> unsigned long zone_op; >>> const char *zone_op_name; >>> +bool all; >>> } zone_mgmt; >>> }; >>> } RawPosixAIOData; >>> @@ -1331,6 +1333,67 @@ static int hdev_get_max_segments(int fd, struct stat >>> *st) { >>> #endif >>> } >>> >>> +#if defined(CONFIG_BLKZONED) >>> +static int get_zones_wp(int64_t offset, int fd, BlockZoneWps *wps, >> >> Nit: It would seem more natural to have the fd argument first... >> >>> +unsigned int nrz) { >>> +struct blk_zone *blkz; >>> +int64_t rep_size; >>> +int64_t sector = offset >> BDRV_SECTOR_BITS; >>> +int ret, n = 0, i = 0; >>> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct >>> blk_zone); >>> +g_autofree struct blk_zone_report *rep = NULL; >>> + >>> +rep = g_malloc(rep_size); >>> +blkz = (struct blk_zone *)(rep + 1); >>> +while (n < nrz) { >>> +memset(rep, 0, rep_size); >>> +rep->sector = sector; >>> +rep->nr_zones = nrz - n; >>> + >>> +do { >>> +ret = ioctl(fd, BLKREPORTZONE, rep); >>> +} while (ret != 0 && errno == EINTR); >>> +if (ret != 0) { >>> +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed >>> %d", >>> +fd, offset, errno); >>> +return -errno; >>> +} >>> + >>> +if (!rep->nr_zones) { >>> +break; >>> +} >>> + >>> +for (i = 0; i < rep->nr_zones; i++, n++) { >>> +/* >>> + * The wp tracking cares only about sequential writes required >>> and >>> + * sequential write preferred zones so that the wp can advance >>> to >>> + * the right location. >>> + * Use the most significant bit of the wp location to indicate >>> the >>> + * zone type: 0 for SWR/SWP zones and 1 for conventional zones. >>> + */ >>> +if (!(blkz[i].type != BLK_ZONE_TYPE_CONVENTIONAL)) { >> >> Double negation... This can simply be: >> >> if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) { >> >>> +wps->wp[i] += 1ULL << 63; >> >> No need for the += here. This can be "=". >> >>> +} else { >>> +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS; >>> +} >>> +
Re: [PATCH v3 2/3] block: introduce zone append write for zoned devices
On 2022/10/13 16:27, Sam Li wrote: > Damien Le Moal 于2022年10月13日周四 13:55写道: >> >> On 10/10/22 11:33, Sam Li wrote: >>> A zone append command is a write operation that specifies the first >>> logical block of a zone as the write position. When writing to a zoned >>> block device using zone append, the byte offset of writes is pointing >>> to the write pointer of that zone. Upon completion the device will >>> respond with the position the data has been written in the zone. >>> >>> Signed-off-by: Sam Li >>> --- >>> block/block-backend.c | 64 +++ >>> block/file-posix.c| 64 --- >>> block/io.c| 21 ++ >>> block/raw-format.c| 7 >>> include/block/block-io.h | 3 ++ >>> include/block/block_int-common.h | 3 ++ >>> include/block/raw-aio.h | 4 +- >>> include/sysemu/block-backend-io.h | 9 + >>> 8 files changed, 168 insertions(+), 7 deletions(-) >>> >>> diff --git a/block/block-backend.c b/block/block-backend.c >>> index ddc569e3ac..bfdb719bc8 100644 >>> --- a/block/block-backend.c >>> +++ b/block/block-backend.c >>> @@ -1439,6 +1439,9 @@ typedef struct BlkRwCo { >>> struct { >>> BlockZoneOp op; >>> } zone_mgmt; >>> +struct { >>> +int64_t *append_sector; >> >> I would call this "sector", since it will always be referenced as >> "->zone_append.sector", you get the "append" for free :) >> >> That said, shouldn't this be a byte value, so called "offset" ? Not >> entirely sure... > > Yes, it can be changed to "offset"(byte) following QEMU's convention. > Just need to add conversions to virtio_blk_zone_append/*_complete, > which is easily done. > >> >>> +} zone_append; >>> }; >>> } BlkRwCo; >>> >>> @@ -1869,6 +1872,46 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, >>> BlockZoneOp op, >>> return &acb->common; >>> } >>> >>> +static void coroutine_fn blk_aio_zone_append_entry(void *opaque) { >>> +BlkAioEmAIOCB *acb = opaque; >>> +BlkRwCo *rwco = &acb->rwco; >>> + >>> +rwco->ret = blk_co_zone_append(rwco->blk, >>> rwco->zone_append.append_sector, >>> + rwco->iobuf, rwco->flags); >>> +blk_aio_complete(acb); >>> +} >>> + >>> +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset, >>> +QEMUIOVector *qiov, BdrvRequestFlags flags, >>> +BlockCompletionFunc *cb, void *opaque) { >>> +BlkAioEmAIOCB *acb; >>> +Coroutine *co; >>> +IO_CODE(); >>> + >>> +blk_inc_in_flight(blk); >>> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque); >>> +acb->rwco = (BlkRwCo) { >>> +.blk= blk, >>> +.ret= NOT_DONE, >>> +.flags = flags, >>> +.iobuf = qiov, >>> +.zone_append = { >>> +.append_sector = offset, >> >> See above comment. So since this is a byte value, this needs to be >> called "offset", no ? > > Yes, same answers above. > >> >>> +}, >>> +}; >>> +acb->has_returned = false; >>> + >>> +co = qemu_coroutine_create(blk_aio_zone_append_entry, acb); >>> +bdrv_coroutine_enter(blk_bs(blk), co); >>> +acb->has_returned = true; >>> +if (acb->rwco.ret != NOT_DONE) { >>> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk), >>> + blk_aio_complete_bh, acb); >>> +} >>> + >>> +return &acb->common; >>> +} >>> + >>> /* >>> * Send a zone_report command. >>> * offset is a byte offset from the start of the device. No alignment >>> @@ -1921,6 +1964,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, >>> BlockZoneOp op, >>> return ret; >>> } >>> >>> +/* >>> + * Send a zone_append command. >>> + */ >>> +int coroutine_fn blk_co_zone_app
Re: [RFC v3 1/2] include: update virtio_blk headers from Linux 5.19-rc2+
* Sequential Write Preferred zone */ >> +#define VIRTIO_BLK_ZT_SWP 3 >> + >> +/* >> + * Zone states that are available for zones of all types. >> + */ >> + >> +/* Not a write pointer (conventional zones only) */ >> +#define VIRTIO_BLK_ZS_NOT_WP 0 >> +/* Empty */ >> +#define VIRTIO_BLK_ZS_EMPTY 1 >> +/* Implicitly Open */ >> +#define VIRTIO_BLK_ZS_IOPEN 2 >> +/* Explicitly Open */ >> +#define VIRTIO_BLK_ZS_EOPEN 3 >> +/* Closed */ >> +#define VIRTIO_BLK_ZS_CLOSED 4 >> +/* Read-Only */ >> +#define VIRTIO_BLK_ZS_RDONLY 13 >> +/* Full */ >> +#define VIRTIO_BLK_ZS_FULL 14 >> +/* Offline */ >> +#define VIRTIO_BLK_ZS_OFFLINE 15 >> + >> /* Unmap this range (only valid for write zeroes command) */ >> #define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP 0x0001 >> >> @@ -198,4 +300,11 @@ struct virtio_scsi_inhdr { >> #define VIRTIO_BLK_S_OK0 >> #define VIRTIO_BLK_S_IOERR 1 >> #define VIRTIO_BLK_S_UNSUPP2 >> + >> +/* Error codes that are specific to zoned block devices */ >> +#define VIRTIO_BLK_S_ZONE_INVALID_CMD 3 >> +#define VIRTIO_BLK_S_ZONE_UNALIGNED_WP 4 >> +#define VIRTIO_BLK_S_ZONE_OPEN_RESOURCE 5 >> +#define VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE 6 >> + >> #endif /* _LINUX_VIRTIO_BLK_H */ > -- Damien Le Moal Western Digital Research
Re: [PATCH v12 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
ockBackend *blk, int64_t offset, > diff --git a/meson.build b/meson.build > index 44c1f92697..0aa99b64a0 100644 > --- a/meson.build > +++ b/meson.build > @@ -1928,6 +1928,7 @@ config_host_data.set('CONFIG_REPLICATION', > get_option('replication').allowed()) > # has_header > config_host_data.set('CONFIG_EPOLL', cc.has_header('sys/epoll.h')) > config_host_data.set('CONFIG_LINUX_MAGIC_H', cc.has_header('linux/magic.h')) > +config_host_data.set('CONFIG_BLKZONED', cc.has_header('linux/blkzoned.h')) > config_host_data.set('CONFIG_VALGRIND_H', > cc.has_header('valgrind/valgrind.h')) > config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h')) > config_host_data.set('HAVE_DRM_H', cc.has_header('libdrm/drm.h')) > @@ -2021,6 +2022,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID', > config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM', > cc.has_member('struct stat', 'st_atim', > prefix: '#include ')) > +config_host_data.set('HAVE_BLK_ZONE_REP_CAPACITY', > + cc.has_member('struct blk_zone', 'capacity', > + prefix: '#include ')) > > # has_type > config_host_data.set('CONFIG_IOVEC', > diff --git a/qapi/block-core.json b/qapi/block-core.json > index 882b266532..05a3b44731 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -2942,6 +2942,7 @@ > # @compress: Since 5.0 > # @copy-before-write: Since 6.2 > # @snapshot-access: Since 7.0 > +# @zoned_host_device: Since 7.2 > # > # Since: 2.9 > ## > @@ -2955,7 +2956,8 @@ > 'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels', > 'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd', > { 'name': 'replication', 'if': 'CONFIG_REPLICATION' }, > -'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] } > +'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat', > +{ 'name': 'zoned_host_device', 'if': 'CONFIG_BLKZONED' } ] } > > ## > # @BlockdevOptionsFile: > @@ -4329,7 +4331,9 @@ >'vhdx': 'BlockdevOptionsGenericFormat', >'vmdk': 'BlockdevOptionsGenericCOWFormat', >'vpc':'BlockdevOptionsGenericFormat', > - 'vvfat': 'BlockdevOptionsVVFAT' > + 'vvfat': 'BlockdevOptionsVVFAT', > + 'zoned_host_device': { 'type': 'BlockdevOptionsFile', > + 'if': 'CONFIG_BLKZONED' } >} } > > ## > diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c > index 952dc940f1..c1b28ea108 100644 > --- a/qemu-io-cmds.c > +++ b/qemu-io-cmds.c > @@ -1712,6 +1712,150 @@ static const cmdinfo_t flush_cmd = { > .oneline= "flush all in-core file state to disk", > }; > > +static inline int64_t tosector(int64_t bytes) > +{ > +return bytes >> BDRV_SECTOR_BITS; > +} > + > +static int zone_report_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset; > +unsigned int nr_zones; > + > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +nr_zones = cvtnum(argv[optind]); > + > +g_autofree BlockZoneDescriptor *zones = NULL; > +zones = g_new(BlockZoneDescriptor, nr_zones); > +ret = blk_zone_report(blk, offset, &nr_zones, zones); > +if (ret < 0) { > +printf("zone report failed: %s\n", strerror(-ret)); > +} else { > +for (int i = 0; i < nr_zones; ++i) { > +printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", " > + "cap"" 0x%" PRIx64 ", wptr 0x%" PRIx64 ", " > + "zcond:%u, [type: %u]\n", > +tosector(zones[i].start), tosector(zones[i].length), > +tosector(zones[i].cap), tosector(zones[i].wp), > +zones[i].cond, zones[i].type); > +} > +} > +return ret; > +} > + > +static const cmdinfo_t zone_report_cmd = { > +.name = "zone_report", > +.altname = "zrp", > +.cfunc = zone_report_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset number", > +.oneline = "report zone information", > +}; > + > +static int zone_open_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len); > +if (ret < 0) { > +printf("zone open failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_open_cmd = { > +.name = "zone_open", > +.altname = "zo", > +.cfunc = zone_open_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "explicit open a range of zones in zone block device", > +}; > + > +static int zone_close_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len); > +if (ret < 0) { > +printf("zone close failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_close_cmd = { > +.name = "zone_close", > +.altname = "zc", > +.cfunc = zone_close_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "close a range of zones in zone block device", > +}; > + > +static int zone_finish_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_FINISH, offset, len); > +if (ret < 0) { > +printf("zone finish failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_finish_cmd = { > +.name = "zone_finish", > +.altname = "zf", > +.cfunc = zone_finish_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "finish a range of zones in zone block device", > +}; > + > +static int zone_reset_f(BlockBackend *blk, int argc, char **argv) > +{ > +int ret; > +int64_t offset, len; > +++optind; > +offset = cvtnum(argv[optind]); > +++optind; > +len = cvtnum(argv[optind]); > +ret = blk_zone_mgmt(blk, BLK_ZO_RESET, offset, len); > +if (ret < 0) { > +printf("zone reset failed: %s\n", strerror(-ret)); > +} > +return ret; > +} > + > +static const cmdinfo_t zone_reset_cmd = { > +.name = "zone_reset", > +.altname = "zrs", > +.cfunc = zone_reset_f, > +.argmin = 2, > +.argmax = 2, > +.args = "offset len", > +.oneline = "reset a zone write pointer in zone block device", > +}; > + > static int truncate_f(BlockBackend *blk, int argc, char **argv); > static const cmdinfo_t truncate_cmd = { > .name = "truncate", > @@ -2504,6 +2648,11 @@ static void __attribute((constructor)) > init_qemuio_commands(void) > qemuio_add_command(&aio_write_cmd); > qemuio_add_command(&aio_flush_cmd); > qemuio_add_command(&flush_cmd); > +qemuio_add_command(&zone_report_cmd); > +qemuio_add_command(&zone_open_cmd); > +qemuio_add_command(&zone_close_cmd); > +qemuio_add_command(&zone_finish_cmd); > +qemuio_add_command(&zone_reset_cmd); > qemuio_add_command(&truncate_cmd); > qemuio_add_command(&length_cmd); > qemuio_add_command(&info_cmd); -- Damien Le Moal Western Digital Research
Re: [PATCH v4 1/3] file-posix: add the tracking of the zones write pointers
wps->wp[i] = i * bs->bl.zone_size; > +} > +} > +} else { > +wps->wp[index] = aiocb->aio_offset; > +} > +} else if (aiocb->zone_mgmt.zone_op == BLKFINISHZONE) { > +wps->wp[index] = wend_offset; > +} > +} > +qemu_mutex_unlock(&wps->lock); > + > return 0; > } > #endif > @@ -2483,6 +2621,12 @@ static void raw_close(BlockDriverState *bs) > BDRVRawState *s = bs->opaque; > > if (s->fd >= 0) { > +#if defined(CONFIG_BLKZONED) > +if (bs->bl.wps) { > +qemu_mutex_destroy(&bs->bl.wps->lock); > +g_free(bs->bl.wps); > +} > +#endif > qemu_close(s->fd); > s->fd = -1; > } > diff --git a/include/block/block-common.h b/include/block/block-common.h > index 36bd0e480e..52372f8252 100644 > --- a/include/block/block-common.h > +++ b/include/block/block-common.h > @@ -92,6 +92,14 @@ typedef struct BlockZoneDescriptor { > BlockZoneCondition cond; > } BlockZoneDescriptor; > > +/* > + * Track write pointers of a zone in bytes. > + */ > +typedef struct BlockZoneWps { > +QemuMutex lock; > +uint64_t wp[]; > +} BlockZoneWps; > + > typedef struct BlockDriverInfo { > /* in bytes, 0 if irrelevant */ > int cluster_size; > @@ -205,6 +213,12 @@ typedef enum { > #define BDRV_SECTOR_BITS 9 > #define BDRV_SECTOR_SIZE (1ULL << BDRV_SECTOR_BITS) > > +/* > + * Get the first most significant bit of wp. If it is zero, then > + * the zone type is SWR. > + */ > +#define BDRV_ZT_IS_CONV(wp)(wp & (1ULL << 63)) > + > #define BDRV_REQUEST_MAX_SECTORS MIN_CONST(SIZE_MAX >> BDRV_SECTOR_BITS, \ > INT_MAX >> BDRV_SECTOR_BITS) > #define BDRV_REQUEST_MAX_BYTES (BDRV_REQUEST_MAX_SECTORS << BDRV_SECTOR_BITS) > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index 37dddc603c..e3136146aa 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -857,6 +857,9 @@ typedef struct BlockLimits { > > /* device capacity expressed in bytes */ > int64_t capacity; > + > +/* array of write pointers' location of each zone in the zoned device. */ > +BlockZoneWps *wps; > } BlockLimits; > > typedef struct BdrvOpBlocker BdrvOpBlocker; -- Damien Le Moal Western Digital Research
Re: [PATCH v4 2/3] block: introduce zone append write for zoned devices
ck/block-io.h > +++ b/include/block/block-io.h > @@ -94,6 +94,9 @@ int coroutine_fn bdrv_co_zone_report(BlockDriverState *bs, > int64_t offset, > BlockZoneDescriptor *zones); > int coroutine_fn bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, > int64_t offset, int64_t len); > +int coroutine_fn bdrv_co_zone_append(BlockDriverState *bs, int64_t *offset, > + QEMUIOVector *qiov, > + BdrvRequestFlags flags); > > int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes); > bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs); > diff --git a/include/block/block_int-common.h > b/include/block/block_int-common.h > index e3136146aa..a7e7db5646 100644 > --- a/include/block/block_int-common.h > +++ b/include/block/block_int-common.h > @@ -701,6 +701,9 @@ struct BlockDriver { > BlockZoneDescriptor *zones); > int coroutine_fn (*bdrv_co_zone_mgmt)(BlockDriverState *bs, BlockZoneOp > op, > int64_t offset, int64_t len); > +int coroutine_fn (*bdrv_co_zone_append)(BlockDriverState *bs, > +int64_t *offset, QEMUIOVector *qiov, > +BdrvRequestFlags flags); > > /* removable device specific */ > bool (*bdrv_is_inserted)(BlockDriverState *bs); > @@ -860,6 +863,8 @@ typedef struct BlockLimits { > > /* array of write pointers' location of each zone in the zoned device. */ > BlockZoneWps *wps; > + > +int64_t write_granularity; > } BlockLimits; > > typedef struct BdrvOpBlocker BdrvOpBlocker; > diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h > index 877b2240b3..53033a5ca7 100644 > --- a/include/block/raw-aio.h > +++ b/include/block/raw-aio.h > @@ -31,6 +31,7 @@ > #define QEMU_AIO_TRUNCATE 0x0080 > #define QEMU_AIO_ZONE_REPORT 0x0100 > #define QEMU_AIO_ZONE_MGMT0x0200 > +#define QEMU_AIO_ZONE_APPEND 0x0400 > #define QEMU_AIO_TYPE_MASK \ > (QEMU_AIO_READ | \ > QEMU_AIO_WRITE | \ > @@ -41,7 +42,8 @@ > QEMU_AIO_COPY_RANGE | \ > QEMU_AIO_TRUNCATE | \ > QEMU_AIO_ZONE_REPORT | \ > - QEMU_AIO_ZONE_MGMT) > + QEMU_AIO_ZONE_MGMT | \ > + QEMU_AIO_ZONE_APPEND) > > /* AIO flags */ > #define QEMU_AIO_MISALIGNED 0x1000 > diff --git a/include/sysemu/block-backend-io.h > b/include/sysemu/block-backend-io.h > index 1b5fc7db6b..ff9f777f52 100644 > --- a/include/sysemu/block-backend-io.h > +++ b/include/sysemu/block-backend-io.h > @@ -52,6 +52,9 @@ BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t > offset, > BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op, >int64_t offset, int64_t len, >BlockCompletionFunc *cb, void *opaque); > +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset, > +QEMUIOVector *qiov, BdrvRequestFlags flags, > +BlockCompletionFunc *cb, void *opaque); > BlockAIOCB *blk_aio_pdiscard(BlockBackend *blk, int64_t offset, int64_t > bytes, > BlockCompletionFunc *cb, void *opaque); > void blk_aio_cancel_async(BlockAIOCB *acb); > @@ -173,6 +176,12 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, > BlockZoneOp op, >int64_t offset, int64_t len); > int generated_co_wrapper blk_zone_mgmt(BlockBackend *blk, BlockZoneOp op, > int64_t offset, int64_t len); > +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset, > +QEMUIOVector *qiov, > +BdrvRequestFlags flags); > +int generated_co_wrapper blk_zone_append(BlockBackend *blk, int64_t *offset, > + QEMUIOVector *qiov, > + BdrvRequestFlags flags); > > int generated_co_wrapper blk_pdiscard(BlockBackend *blk, int64_t offset, >int64_t bytes); -- Damien Le Moal Western Digital Research
Re: [PATCH] Use io_uring_register_ring_fd() to skip fd operations
On 2022/04/18 16:52, Sam Li wrote: > Linux recently added a new io_uring(7) optimization API that QEMU > doesn't take advantage of yet. The liburing library that QEMU uses > has added a corresponding new API calling io_uring_register_ring_fd(). > When this API is called after creating the ring, the io_uring_submit() > library function passes a flag to the io_uring_enter(2) syscall > allowing it to skip the ring file descriptor fdget()/fdput() > operations. This saves some CPU cycles. > > Signed-off-by: Sam Li > --- > block/io_uring.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/block/io_uring.c b/block/io_uring.c > index 782afdb433..2942967126 100644 > --- a/block/io_uring.c > +++ b/block/io_uring.c > @@ -435,6 +435,9 @@ LuringState *luring_init(Error **errp) > } > > ioq_init(&s->io_q); > +if (io_uring_register_ring_fd(&s->ring) < 0) { > +error_setg_errno(errp, errno, "failed to register linux > io_uring ring file descriptor"); This line seems broken. Probably your mailer wrapped the line. If you can, please use "git send-email" to avoid such problems. The line is very long anyway. So it probably is better to split it after errno: error_setg_errno(errp, errno, "failed to register linux io_uring ring file descriptor"); Also, if I understand this correctly, you ignore the error here and this will naturally fallback to the non-optimized iouring operation which will do the fdget()/fdput(), right ? If yes, it would be good to add a comment before the error_setg_errno() call to make it clear that the error being ignored is not an oversight. > +} Personally, I like a blank line before return. > return s; > And while at it, I would remove this useless blank line. Completely optional :) > } -- Damien Le Moal Western Digital Research
Re: [PATCH v2] block/file-posix: fix wps checking in raw_co_prw
On 6/8/23 03:57, Sam Li wrote: > If the write operation fails and the wps is NULL, then accessing it will > lead to data corruption. > > Solving the issue by adding a nullptr checking in get_zones_wp() where > the wps is used. > > This issue is found by Peter Maydell using the Coverity Tool (CID > 1512459). > > Signed-off-by: Sam Li > --- > block/file-posix.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index ac1ed54811..4a6c71c7f5 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -2523,7 +2523,7 @@ out: > } > } > } else { > -if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { > +if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND) && wps) { > update_zones_wp(bs, s->fd, 0, 1); Nit: this could be: } else if (wps && type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { However, both if & else side do something only if the above condition is true and we only need to that for a zoned drive. So the entire code block could really be simplified to be a lot more readable. Something like this (totally untested, not even compiled): #if defined(CONFIG_BLKZONED) if (bs->bl.zone_size && (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) { BlockZoneWps *wps = bs->wps; uint64_t *wp; if (!wps) { return ret; } if (ret) { /* write error: update the wp from the underlying device */ update_zones_wp(bs, s->fd, 0, 1); goto unlock; } wp = &wps->wp[offset / bs->bl.zone_size]; if (BDRV_ZT_IS_CONV(*wp)) { /* Conventional zones do not have a write pointer */ goto unlock; } /* Return the written position for zone append */ if (type & QEMU_AIO_ZONE_APPEND) { *s->offset = *wp; trace_zbd_zone_append_complete(bs, *s->offset >> BDRV_SECTOR_BITS); } /* Advance the wp if needed */ if (offset + bytes > *wp) { *wp = offset + bytes; } unlock: qemu_co_mutex_unlock(&wps->colock); } #endif And making this entire block a helper function (e.g. advance_zone_wp()) would further clean the code. But that should be done in another patch. Care to send one ? -- Damien Le Moal Western Digital Research
Re: [PATCH v6 3/4] qcow2: add zoned emulation capability
break; > +case BLK_ZO_RESET: > +ret = qcow2_reset_zone(bs, index, len); > +break; > +default: > +error_report("Unsupported zone op: 0x%x", op); > +ret = -ENOTSUP; > +break; > +} > +return ret; > + > +unlock: > +qemu_co_mutex_unlock(&wps->colock); > +return ret; > +} > + > +static int coroutine_fn > +qcow2_co_zone_append(BlockDriverState *bs, int64_t *offset, QEMUIOVector > *qiov, > + BdrvRequestFlags flags) > +{ > +assert(flags == 0); > +int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS; > +uint32_t index; > +int ret; > +int64_t zone_size_mask = bs->bl.zone_size - 1; > +int64_t iov_len = 0; > +int64_t len = 0; > + > +if (*offset >= capacity) { > +error_report("*offset %" PRId64 " is equal to or greater than the" > + "device capacity %" PRId64 "", *offset, capacity); > +return -EINVAL; > +} > + > +/* offset + len should not pass the end of that zone starting from > offset */ > +if (*offset & zone_size_mask) { > +error_report("sector offset %" PRId64 " is not aligned to zone size " > + "%" PRId32 "", *offset / 512, bs->bl.zone_size / 512); > +return -EINVAL; > +} > + > +int64_t wg = bs->bl.write_granularity; > +int64_t wg_mask = wg - 1; > +for (int i = 0; i < qiov->niov; i++) { > +iov_len = qiov->iov[i].iov_len; > +if (iov_len & wg_mask) { > +error_report("len of IOVector[%d] %" PRId64 " is not aligned to " > + "block size %" PRId64 "", i, iov_len, wg); > +return -EINVAL; > +} > +} > +len = qiov->size; > +index = *offset / bs->bl.zone_size; > + > +if ((len >> BDRV_SECTOR_BITS) > bs->bl.max_append_sectors) { > +return -ENOTSUP; > +} > + > +qemu_co_mutex_lock(&bs->wps->colock); > +uint64_t wp_i = bs->wps->wp[index]; > +ret = qcow2_co_pwritev_part(bs, wp_i, len, qiov, 0, 0); > +if (ret == 0) { > +*offset = wp_i; > +} else { > +error_report("qcow2: zap failed"); > +} > + > +qemu_co_mutex_unlock(&bs->wps->colock); > +return ret; > +} > + > static int coroutine_fn GRAPH_RDLOCK > qcow2_co_copy_range_from(BlockDriverState *bs, > BdrvChild *src, int64_t src_offset, > @@ -6383,6 +7116,10 @@ BlockDriver bdrv_qcow2 = { > .bdrv_co_pwritev_compressed_part= qcow2_co_pwritev_compressed_part, > .bdrv_make_empty= qcow2_make_empty, > > +.bdrv_co_zone_report= qcow2_co_zone_report, > +.bdrv_co_zone_mgmt= qcow2_co_zone_mgmt, > +.bdrv_co_zone_append= qcow2_co_zone_append, > + > .bdrv_snapshot_create = qcow2_snapshot_create, > .bdrv_snapshot_goto = qcow2_snapshot_goto, > .bdrv_snapshot_delete = qcow2_snapshot_delete, > diff --git a/block/trace-events b/block/trace-events > index 8e789e1f12..e35222e079 100644 > --- a/block/trace-events > +++ b/block/trace-events > @@ -82,6 +82,8 @@ qcow2_writev_data(void *co, uint64_t offset) "co %p offset > 0x%" PRIx64 > qcow2_pwrite_zeroes_start_req(void *co, int64_t offset, int64_t bytes) "co > %p offset 0x%" PRIx64 " bytes %" PRId64 > qcow2_pwrite_zeroes(void *co, int64_t offset, int64_t bytes) "co %p offset > 0x%" PRIx64 " bytes %" PRId64 > qcow2_skip_cow(void *co, uint64_t offset, int nb_clusters) "co %p offset > 0x%" PRIx64 " nb_clusters %d" > +qcow2_wp_tracking(int index, uint64_t wp) "wps[%d]: 0x%" PRIx64 > +qcow2_imp_open_zones(uint8_t op, int nrz) "nr_imp_open_zones after op 0x%x: > %d" > > # qcow2-cluster.c > qcow2_alloc_clusters_offset(void *co, uint64_t offset, int bytes) "co %p > offset 0x%" PRIx64 " bytes %d" > diff --git a/include/qemu/queue.h b/include/qemu/queue.h > index e029e7bf66..3f0a48740e 100644 > --- a/include/qemu/queue.h > +++ b/include/qemu/queue.h > @@ -179,6 +179,7 @@ struct { > \ > #define QLIST_EMPTY(head)((head)->lh_first == NULL) > #define QLIST_FIRST(head)((head)->lh_first) > #define QLIST_NEXT(elm, field) ((elm)->field.le_next) > +#define QLIST_LAST(head, field) (*(head)->lh_first->field.le_prev) > > > /* -- Damien Le Moal Western Digital Research
Re: [PATCH v5 1/4] docs/qcow2: add the zoned format feature
On 10/30/23 23:04, Eric Blake wrote: >> + >> + .. option:: zone_size >> + >> +The size of a zone in bytes. The device is divided into zones of this >> +size with the exception of the last zone, which may be smaller. >> + >> + .. option:: zone_capacity >> + >> +The initial capacity value, in bytes, for all zones. The capacity must >> +be less than or equal to zone size. If the last zone is smaller, then >> +its capacity is capped. >> + >> +The zone capacity is per zone and may be different between zones in real >> +devices. For simplicity, QCow2 sets all zones to the same capacity. > > Just making sure I understand: One possible setup would be to describe > a block device with zones of size 1024M but with capacity 1000M (that > is, the zone reserves 24M capacity for other purposes)? > > Otherwise, I'm having a hard time seeing when you would ever set a > capacity different from size. > > Are there requirements that one (or both) of these values must be > powers of 2? Or is the requirement merely that they must be a > multiple of 512 bytes (because sub-sector operations are not > permitted)? Is there any implicit requirement based on qcow2 > implementation that a zone size/capacity must be a multiple of cluster > size (other than possibly for the last zone)? Linux requires the zone size to be a power of 2 number of LBAs. As a value so defined may not align to a ZNS drive internal superblock size (e.g. align to erase blocks), the zone capacity can be smaller than the zone size. The zone capacity represents the number of LBAs that are usable within a zone. The LBAs between zone capacity and zone size are unusable: reads will return 0s and writes will fail for these LBAs. A drive capacity is reported as the sum of all zone sizes, so it may be larger than the actual usable capacity, which is the sum of all zone capacities. Qcow2 follows this model despite the fact that we do not have the constraint of aligning to some hardware erase block size. This is mainly to allow emulating a real drive configuration. If a real drive emulation is not the goal of the use-case at hand, most users will simply want to have zone size == zone capacity for their zoned qcow2 drives. > >> + >> + .. option:: zone_nr_conv >> + >> +The number of conventional zones of the zoned device. >> + >> + .. option:: max_open_zones >> + >> +The maximal allowed open zones. >> + >> + .. option:: max_active_zones >> + >> +The limit of the zones with implicit open, explicit open or closed >> state. >> + >> + .. option:: max_append_sectors >> + >> +The maximal number of 512-byte sectors in a zone append request. > > Why is this value in sectors instead of bytes? I understand that > drivers may be written with sectors in mind, but any time we mix units > in the public interface, it gets awkward. I'd lean towards having > bytes here, with a requirement that it be a multiple of 512. Agreed. Let's use bytes to avoid confusion. -- Damien Le Moal Western Digital Research
Re: [PATCH] block/file-posix: fix update_zones_wp() caller
On 8/25/23 02:39, Sam Li wrote: > When the zoned requests that may change wp fail, it needs to > update only wps of the zones within the range of the requests > for not disrupting the other in-flight requests. The wp is updated > successfully after the request completes. > > Fixed the callers with right offset and nr_zones. > > Signed-off-by: Sam Li > --- > block/file-posix.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index b16e9c21a1..22559d6c2d 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -2522,7 +2522,8 @@ out: > } > } else { > if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { > -update_zones_wp(bs, s->fd, 0, 1); > +update_zones_wp(bs, s->fd, offset, > +ROUND_UP(bytes, bs->bl.zone_size)); Write and zone append operations are not allowed to cross zone boundaries. So I the number of zones should always be 1. The above changes a number of zones to a number of bytes, which seems wrong. The correct fix is I think: update_zones_wp(bs, s->fd, offset, 1); > } > } > > @@ -3472,7 +3473,7 @@ static int coroutine_fn > raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, > len >> BDRV_SECTOR_BITS); > ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb); > if (ret != 0) { > -update_zones_wp(bs, s->fd, offset, i); > +update_zones_wp(bs, s->fd, offset, nrz); Same here. Why would you need to update all zones wp ? This will affect zones that do not have a write error and potentially change there correct in-memory wp to a wrong value. I think this also should be: update_zones_wp(bs, s->fd, offset, 1); > error_report("ioctl %s failed %d", op_name, ret); > return ret; > } -- Damien Le Moal Western Digital Research
Re: [PATCH] block/file-posix: fix update_zones_wp() caller
On 8/25/23 12:05, Sam Li wrote: > Damien Le Moal 于2023年8月25日周五 07:49写道: >> >> On 8/25/23 02:39, Sam Li wrote: >>> When the zoned requests that may change wp fail, it needs to >>> update only wps of the zones within the range of the requests >>> for not disrupting the other in-flight requests. The wp is updated >>> successfully after the request completes. >>> >>> Fixed the callers with right offset and nr_zones. >>> >>> Signed-off-by: Sam Li >>> --- >>> block/file-posix.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index b16e9c21a1..22559d6c2d 100644 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -2522,7 +2522,8 @@ out: >>> } >>> } else { >>> if (type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) { >>> -update_zones_wp(bs, s->fd, 0, 1); >>> +update_zones_wp(bs, s->fd, offset, >>> +ROUND_UP(bytes, bs->bl.zone_size)); >> >> Write and zone append operations are not allowed to cross zone boundaries. >> So I >> the number of zones should always be 1. The above changes a number of zones >> to a >> number of bytes, which seems wrong. The correct fix is I think: >> >> update_zones_wp(bs, s->fd, offset, 1); >> > > I see. I forgot this constraint. > >>> } >>> } >>> >>> @@ -3472,7 +3473,7 @@ static int coroutine_fn >>> raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op, >>> len >> BDRV_SECTOR_BITS); >>> ret = raw_thread_pool_submit(handle_aiocb_zone_mgmt, &acb); >>> if (ret != 0) { >>> -update_zones_wp(bs, s->fd, offset, i); >>> +update_zones_wp(bs, s->fd, offset, nrz); >> >> Same here. Why would you need to update all zones wp ? This will affect zones >> that do not have a write error and potentially change there correct >> in-memory wp >> to a wrong value. I think this also should be: >> >>update_zones_wp(bs, s->fd, offset, 1); >> > > Is update_zones_wp for cancelling the writes on invalid zones or > updating corrupted write pointers caused by caller (write, append or > zone_mgmt)? > > My thought is based on the latter. Zone_mgmt can manage multiple zones > with a single request. When the request fails, it's hard to tell which > zone is corrupted. The relation between the req (zone_mgmt) and > update_zones_wp is: if req succeeds, no updates; if req fails, > consider the req never happens and do again. You should update the wp of the zones that were touched by the operation that failed. No other zone should have its wp updated as that could cause corruptions of the wp if there are on-going writes on these other zones. so the call should be "update_zones_wp(bs, s->fd, offset, n);" with n being the number of zones that the operation targeted. > > If the former is right, then it assumes only the first zone may > contain an error. I am not sure it's right. > >>> error_report("ioctl %s failed %d", op_name, ret); >>> return ret; >>> } >> >> -- >> Damien Le Moal >> Western Digital Research >> -- Damien Le Moal Western Digital Research
Re: [PATCH v2 2/4] qcow2: add configurations for zoned format extension
On 8/28/23 18:22, Sam Li wrote: > Stefan Hajnoczi 于2023年8月21日周一 21:31写道: >> >> On Mon, Aug 14, 2023 at 04:58:00PM +0800, Sam Li wrote: >>> diff --git a/block/qcow2.h b/block/qcow2.h >>> index f789ce3ae0..3694c8d217 100644 >>> --- a/block/qcow2.h >>> +++ b/block/qcow2.h >>> @@ -236,6 +236,20 @@ typedef struct Qcow2CryptoHeaderExtension { >>> uint64_t length; >>> } QEMU_PACKED Qcow2CryptoHeaderExtension; >>> >>> +typedef struct Qcow2ZonedHeaderExtension { >>> +/* Zoned device attributes */ >>> +uint8_t zoned_profile; >>> +uint8_t zoned; >>> +uint16_t reserved16; >>> +uint32_t zone_size; >>> +uint32_t zone_capacity; >> >> Should zone capacity be stored individually for each zone (alongside the >> write pointer and other per zone metadata) instead of as a global value >> for all zones? My understanding is that NVMe ZNS does not have a global >> value and each zone could have a different zone capacity value. > > Though zone capacity is per-zone attribute, it remains same for all > zones in most cases. Referring to the NVMe ZNS spec, zone capacity > changes associate to RESET_ZONE op when the variable zone capacity bit > is '1'. It hasn't specifically tell what it is changed to. Current ZNS > emulation doesn't change zone capacity as well. > > If the Variable Zone Capacity bit is cleared to ‘0’ in the Zone > Operation Characteristics field in the Zoned > Namespace Command Set specific Identify Namespace data structure, then > this field does not change without a change to the format of the zoned > namespace. > > If the Variable Zone Capacity bit is set to ‘1’ in the Zone Operation > Characteristics field in the Zoned > Namespace Command Set specific Identify Namespace data structure, then > the zone capacity may > change upon successful completion of a Zone Management Send command > specifying the Zone Send > Action of Reset Zone. Regardless of the variable zone capacity feature, zone capacity is per zone and may be different between zones. That is why it is reported per zone in zone report. The IO path code should not assume that the zone capacity is the same for all zones. For this particular case though, given that this is QCow2 emulation, limiting ourselves to the same zone capacity for all zones is I think fine. But that should be clearly stated somewhere may be... > >> >>> +uint32_t nr_zones; >> >> Is this field necessary since it can be derived from other image >> options: nr_zones = DIV_ROUND_UP(total_length, zone_capacity)? > > It can be dropped. I added this for reducing duplication. Thanks! -- Damien Le Moal Western Digital Research
Re: [PATCH v2 2/4] qcow2: add configurations for zoned format extension
On 8/28/23 19:18, Sam Li wrote: > Damien Le Moal 于2023年8月28日周一 18:13写道: >> >> On 8/28/23 18:22, Sam Li wrote: >>> Stefan Hajnoczi 于2023年8月21日周一 21:31写道: >>>> >>>> On Mon, Aug 14, 2023 at 04:58:00PM +0800, Sam Li wrote: >>>>> diff --git a/block/qcow2.h b/block/qcow2.h >>>>> index f789ce3ae0..3694c8d217 100644 >>>>> --- a/block/qcow2.h >>>>> +++ b/block/qcow2.h >>>>> @@ -236,6 +236,20 @@ typedef struct Qcow2CryptoHeaderExtension { >>>>> uint64_t length; >>>>> } QEMU_PACKED Qcow2CryptoHeaderExtension; >>>>> >>>>> +typedef struct Qcow2ZonedHeaderExtension { >>>>> +/* Zoned device attributes */ >>>>> +uint8_t zoned_profile; >>>>> +uint8_t zoned; >>>>> +uint16_t reserved16; >>>>> +uint32_t zone_size; >>>>> +uint32_t zone_capacity; >>>> >>>> Should zone capacity be stored individually for each zone (alongside the >>>> write pointer and other per zone metadata) instead of as a global value >>>> for all zones? My understanding is that NVMe ZNS does not have a global >>>> value and each zone could have a different zone capacity value. >>> >>> Though zone capacity is per-zone attribute, it remains same for all >>> zones in most cases. Referring to the NVMe ZNS spec, zone capacity >>> changes associate to RESET_ZONE op when the variable zone capacity bit >>> is '1'. It hasn't specifically tell what it is changed to. Current ZNS >>> emulation doesn't change zone capacity as well. >>> >>> If the Variable Zone Capacity bit is cleared to ‘0’ in the Zone >>> Operation Characteristics field in the Zoned >>> Namespace Command Set specific Identify Namespace data structure, then >>> this field does not change without a change to the format of the zoned >>> namespace. >>> >>> If the Variable Zone Capacity bit is set to ‘1’ in the Zone Operation >>> Characteristics field in the Zoned >>> Namespace Command Set specific Identify Namespace data structure, then >>> the zone capacity may >>> change upon successful completion of a Zone Management Send command >>> specifying the Zone Send >>> Action of Reset Zone. >> >> Regardless of the variable zone capacity feature, zone capacity is per zone >> and >> may be different between zones. That is why it is reported per zone in zone >> report. The IO path code should not assume that the zone capacity is the same >> for all zones. > > How is zone capacity changed, by devices or commands? Can you give > some example please? If the device does not support variable zone capacity, the zone capacity is fixed at device manufacturing time and never changes. It is reported per zone and you have to make things work with whatever value you see. The user cannot change device zone capacity. For you qcow2 zoned image, the equivalent is to fix the zone capacity when the image is created and not allowing to change it. And for simplicity, the same zone capacity value can be used for all zones, so having the zone capacity value in the header is OK. > >> >> For this particular case though, given that this is QCow2 emulation, limiting >> ourselves to the same zone capacity for all zones is I think fine. But that >> should be clearly stated somewhere may be... > > I see. The qcow2 documentaion can add that. > >> >>> >>>> >>>>> +uint32_t nr_zones; >>>> >>>> Is this field necessary since it can be derived from other image >>>> options: nr_zones = DIV_ROUND_UP(total_length, zone_capacity)? >>> >>> It can be dropped. I added this for reducing duplication. Thanks! >> >> -- >> Damien Le Moal >> Western Digital Research >> -- Damien Le Moal Western Digital Research
Re: [PATCH v2 3/4] qcow2: add zoned emulation capability
On 8/28/23 20:55, Sam Li wrote: >>> +/* close one implicitly open zones to make it available */ >>> +for (int i = s->zoned_header.zone_nr_conv; >>> +i < bs->bl.nr_zones; ++i) { >>> +uint64_t *wp = &s->wps->wp[i]; >>> +if (qcow2_get_zs(*wp) == BLK_ZS_IOPEN) { >>> +ret = qcow2_write_wp_at(bs, wp, i, BLK_ZS_CLOSED); >> >> I'm wondering if it's correct to store the zone state persistently in >> the qcow2 file. If the guest or QEMU crashes, then zones will be left in >> states like EOPEN. Since the guest software will have forgotten about >> explicitly opened zones, the guest would need to recover zone states. >> I'm not sure if existing software is designed to do that. >> >> Damien: Should the zone state be persistent? Yes and no. Yes you need to preserve/maintain zone states but not as is. With a real drive, if you power cycle the device, you get the following states changes: Before | After power cycle +--- EMPTY | EMPTY FULL | FULL IMP. OPEN | CLOSED EXP. OPEN | CLOSED CLOSED | CLOSED READ=ONLY | READ-ONLY OFFLINE| OFFLINE So any open (implicit or explicit) zone will show up as closed after power cycle. That is, the number of "active" zones does not change. For the qcow2 emulation, as long as you do not also emulate read-only and offline zones, you actually do not need to save the zone state in the zone metadata. On startup, you can infer the state from the zone write pointer: zone wp == zone start -> EMPTY zone wp >= zone capacity -> FULL zone wp > zone start -> CLOSED And make sure that all closed zones are counted as the initial number of active zones. The initial number of open zones will always be 0. So it is easy :) -- Damien Le Moal Western Digital Research
Re: [PATCH v2 3/4] qcow2: add zoned emulation capability
On 8/29/23 15:27, Sam Li wrote: > Damien Le Moal 于2023年8月29日周二 14:06写道: >> >> On 8/28/23 20:55, Sam Li wrote: >>>>> +/* close one implicitly open zones to make it available */ >>>>> +for (int i = s->zoned_header.zone_nr_conv; >>>>> +i < bs->bl.nr_zones; ++i) { >>>>> +uint64_t *wp = &s->wps->wp[i]; >>>>> +if (qcow2_get_zs(*wp) == BLK_ZS_IOPEN) { >>>>> +ret = qcow2_write_wp_at(bs, wp, i, BLK_ZS_CLOSED); >>>> >>>> I'm wondering if it's correct to store the zone state persistently in >>>> the qcow2 file. If the guest or QEMU crashes, then zones will be left in >>>> states like EOPEN. Since the guest software will have forgotten about >>>> explicitly opened zones, the guest would need to recover zone states. >>>> I'm not sure if existing software is designed to do that. >>>> >>>> Damien: Should the zone state be persistent? >> >> Yes and no. Yes you need to preserve/maintain zone states but not as is. >> With a real drive, if you power cycle the device, you get the following >> states >> changes: >> >> Before | After power cycle >> +--- >> EMPTY | EMPTY >> FULL | FULL >> IMP. OPEN | CLOSED >> EXP. OPEN | CLOSED >> CLOSED | CLOSED >> READ=ONLY | READ-ONLY >> OFFLINE| OFFLINE >> >> So any open (implicit or explicit) zone will show up as closed after power >> cycle. That is, the number of "active" zones does not change. >> For the qcow2 emulation, as long as you do not also emulate read-only and >> offline zones, you actually do not need to save the zone state in the zone >> metadata. On startup, you can infer the state from the zone write pointer: >> >> zone wp == zone start -> EMPTY >> zone wp >= zone capacity -> FULL >> zone wp > zone start -> CLOSED >> >> And make sure that all closed zones are counted as the initial number of >> active >> zones. The initial number of open zones will always be 0. >> >> So it is easy :) > > Thanks for the explanations! > > Read-only and offline are device internal events. Does qcow2 emulation > need to emulate that? > > Current NVMe ZNS emulation in QEMU has a nvme_offline_zone() function. > Does it suggest keeping the offline state persistent? > https://github.com/qemu/qemu/blob/master/hw/nvme/ctrl.c#L3740 The offline state is useful for testing only. If a zone goes offline, it generally means that the device is dying... At least for now, I do not think it is needed for qcow2. That can always be added later. > > Sam -- Damien Le Moal Western Digital Research
Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
On 2020/09/29 6:25, Keith Busch wrote: > On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote: >> On Sep 28 02:33, Dmitry Fomichev wrote: >>> You are making it sound like the entire WDC series relies on this approach. >>> Actually, the persistency is introduced in the second to last patch in the >>> series and it only adds a couple of lines of code in the i/o path to mark >>> zones dirty. This is possible because of using mmap() and I find the way >>> it is done to be quite elegant, not ugly :) >>> >> >> No, I understand that your implementation works fine without >> persistance, but persistance is key. That is why my series adds it in >> the first patch. Without persistence it is just a toy. And the QEMU >> device is not just an "NVMe-version" of null_blk. > > I really think we should be a bit more cautious of commiting to an > on-disk format for the persistent state. Both this and Klaus' persistent > state feels a bit ad-hoc, and with all the other knobs provided, it > looks too easy to have out-of-sync states, or just not being able to > boot at all if a qemu versions have different on-disk formats. > > Is anyone really considering zone emulation for production level stuff > anyway? I can't imagine a real scenario where you'd want put yourself > through that: you are just giving yourself all the downsides of a zoned > block device and none of the benefits. AFAIK, this is provided as a > development vehicle, closer to a "toy". > > I think we should consider trimming this down to a more minimal set that > we *do* agree on and commit for inclusion ASAP. We can iterate all the > bells & whistles and flush out the meta data's data marshalling scheme > for persistence later. +1 on this. Removing the persistence also removes the debate on endianess. With that out of the way, it should be straightforward to get agreement on a series that can be merged quickly to get developers started with testing ZNS software with QEMU. That is the most important goal here. 5.9 is around the corner, we need something for people to get started with ZNS quickly. -- Damien Le Moal Western Digital Research
Re: [PATCH v4 00/14] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set
On 2020/09/29 19:46, Klaus Jensen wrote: > On Sep 28 22:54, Damien Le Moal wrote: >> On 2020/09/29 6:25, Keith Busch wrote: >>> On Mon, Sep 28, 2020 at 08:36:48AM +0200, Klaus Jensen wrote: >>>> On Sep 28 02:33, Dmitry Fomichev wrote: >>>>> You are making it sound like the entire WDC series relies on this >>>>> approach. >>>>> Actually, the persistency is introduced in the second to last patch in the >>>>> series and it only adds a couple of lines of code in the i/o path to mark >>>>> zones dirty. This is possible because of using mmap() and I find the way >>>>> it is done to be quite elegant, not ugly :) >>>>> >>>> >>>> No, I understand that your implementation works fine without >>>> persistance, but persistance is key. That is why my series adds it in >>>> the first patch. Without persistence it is just a toy. And the QEMU >>>> device is not just an "NVMe-version" of null_blk. >>> >>> I really think we should be a bit more cautious of commiting to an >>> on-disk format for the persistent state. Both this and Klaus' persistent >>> state feels a bit ad-hoc, and with all the other knobs provided, it >>> looks too easy to have out-of-sync states, or just not being able to >>> boot at all if a qemu versions have different on-disk formats. >>> >>> Is anyone really considering zone emulation for production level stuff >>> anyway? I can't imagine a real scenario where you'd want put yourself >>> through that: you are just giving yourself all the downsides of a zoned >>> block device and none of the benefits. AFAIK, this is provided as a >>> development vehicle, closer to a "toy". >>> >>> I think we should consider trimming this down to a more minimal set that >>> we *do* agree on and commit for inclusion ASAP. We can iterate all the >>> bells & whistles and flush out the meta data's data marshalling scheme >>> for persistence later. >> >> +1 on this. Removing the persistence also removes the debate on endianess. >> With >> that out of the way, it should be straightforward to get agreement on a >> series >> that can be merged quickly to get developers started with testing ZNS >> software >> with QEMU. That is the most important goal here. 5.9 is around the corner, we >> need something for people to get started with ZNS quickly. >> > > Wait. What. No. Stop! > > It is unmistakably clear that you are invalidating my arguments about > portability and endianness issues by suggesting that we just remove > persistent state and deal with it later, but persistence is the killer > feature that sets the QEMU emulated device apart from other emulation > options. It is not about using emulation in production (because yeah, > why would you?), but persistence is what makes it possible to develop > and test "zoned FTLs" or something that requires recovery at power up. > This is what allows testing of how your host software deals with opened > zones being transitioned to FULL on power up and the persistent tracking > of LBA allocation (in my series) can be used to properly test error > recovery if you lost state in the app. I am not invalidating anything. I am in violent agreement with you about the usefulness of persistence. My point was that I agree with Keith: let's first get the base emulation in and improve on top of it. And the base emulation does not need to include persistence and endianess of the saved zone meta for now. The result of this would still be super useful to have in stable. Then let's add persistence and others bells and whistles on top (see below). > Please, work with me on this instead of just removing such an essential > feature. Since persistence seems to be the only thing we are really > discussing, we should have plenty of time until the soft-freeze to come > up with a proper solution on that. > > I agree that my version had a format that was pretty ad-hoc and that > won't fly - it needs magic and version capabilities like in Dmitry's > series, which incidentially looks a lot like what we did in the > OpenChannel implementation, so I agree with the strategy. > > ZNS-wise, the only thing my implementation stores is the zone > descriptors (in spec-native little-endian format) and the zone > descriptor extensions. So there are no endian issues with those. The > allocation tracking bitmap is always stored in little endian, but > converted to big-endian if running on a big-endian host. > > Let me just conjure something up. >
Re: [PATCH v2] block/file-posix: optimize append write
On 10/18/24 23:37, Kevin Wolf wrote: > Am 04.10.2024 um 12:41 hat Sam Li geschrieben: >> When the file-posix driver emulates append write, it holds the lock >> whenever accessing wp, which limits the IO queue depth to one. >> >> The write IO flow can be optimized to allow concurrent writes. The lock >> is held in two cases: >> 1. Assumed that the write IO succeeds, update the wp before issuing the >> write. >> 2. If the write IO fails, report that zone and use the reported value >> as the current wp. > > What happens with the concurrent writes that started later and may not > have completed yet? Can we really just reset to the reported value > before all other requests have completed, too? Yes, because if one write fails, we know that the following writes will fail too as they will not be aligned to the write pointer. These subsequent failed writes will again trigger the report zones and update, but that is fine. All of them have failed and the report will give the same wp again. This is a typical pattern with zoned block device: if one write fails in a zone, the user has to expect failures for all other writes issued to the same zone, do a report zone to get the wp and restart writing from there. -- Damien Le Moal Western Digital Research
Re: [PATCH v2] block/file-posix: optimize append write
On 10/21/24 20:08, Kevin Wolf wrote: > Am 20.10.2024 um 03:03 hat Damien Le Moal geschrieben: >> On 10/18/24 23:37, Kevin Wolf wrote: >>> Am 04.10.2024 um 12:41 hat Sam Li geschrieben: >>>> When the file-posix driver emulates append write, it holds the lock >>>> whenever accessing wp, which limits the IO queue depth to one. >>>> >>>> The write IO flow can be optimized to allow concurrent writes. The lock >>>> is held in two cases: >>>> 1. Assumed that the write IO succeeds, update the wp before issuing the >>>> write. >>>> 2. If the write IO fails, report that zone and use the reported value >>>> as the current wp. >>> >>> What happens with the concurrent writes that started later and may not >>> have completed yet? Can we really just reset to the reported value >>> before all other requests have completed, too? >> >> Yes, because if one write fails, we know that the following writes >> will fail too as they will not be aligned to the write pointer. These >> subsequent failed writes will again trigger the report zones and >> update, but that is fine. All of them have failed and the report will >> give the same wp again. >> >> This is a typical pattern with zoned block device: if one write fails >> in a zone, the user has to expect failures for all other writes issued >> to the same zone, do a report zone to get the wp and restart writing >> from there. > > Ok, that makes sense. Can we be sure that requests are handled in the > order they were submitted, though? That is, if the failed request is > resubmitted, could the already pending next one still succeed if it's > overtaken by the resubmitted request? Not sure if this would even cause > a probem, but is it a case we have to consider? A zoned device will always handle writes in the order they were submitted (per zone) and that is true for emulated devices as well as real ones. The completions may not be seen in order though, but that is fine. So what you are saying above is not a problem. The resubmitted failed write will go after the ones already submitted (and about to be failed) and may succeed if it is aligned to the wp, or fail. Whichever will happen only after all the already submitted writes have failed. -- Damien Le Moal Western Digital Research
Re: [PATCH v2] block/file-posix: optimize append write
On 10/22/24 03:13, Stefan Hajnoczi wrote: > On Mon, Oct 21, 2024 at 09:32:50PM +0900, Damien Le Moal wrote: >> On 10/21/24 20:08, Kevin Wolf wrote: >>> Am 20.10.2024 um 03:03 hat Damien Le Moal geschrieben: >>>> On 10/18/24 23:37, Kevin Wolf wrote: >>>>> Am 04.10.2024 um 12:41 hat Sam Li geschrieben: >>>>>> When the file-posix driver emulates append write, it holds the lock >>>>>> whenever accessing wp, which limits the IO queue depth to one. >>>>>> >>>>>> The write IO flow can be optimized to allow concurrent writes. The lock >>>>>> is held in two cases: >>>>>> 1. Assumed that the write IO succeeds, update the wp before issuing the >>>>>> write. >>>>>> 2. If the write IO fails, report that zone and use the reported value >>>>>> as the current wp. >>>>> >>>>> What happens with the concurrent writes that started later and may not >>>>> have completed yet? Can we really just reset to the reported value >>>>> before all other requests have completed, too? >>>> >>>> Yes, because if one write fails, we know that the following writes >>>> will fail too as they will not be aligned to the write pointer. These >>>> subsequent failed writes will again trigger the report zones and >>>> update, but that is fine. All of them have failed and the report will >>>> give the same wp again. >>>> >>>> This is a typical pattern with zoned block device: if one write fails >>>> in a zone, the user has to expect failures for all other writes issued >>>> to the same zone, do a report zone to get the wp and restart writing >>>> from there. >>> >>> Ok, that makes sense. Can we be sure that requests are handled in the >>> order they were submitted, though? That is, if the failed request is >>> resubmitted, could the already pending next one still succeed if it's >>> overtaken by the resubmitted request? Not sure if this would even cause >>> a probem, but is it a case we have to consider? >> >> A zoned device will always handle writes in the order they were submitted >> (per >> zone) and that is true for emulated devices as well as real ones. > > Is there serialization code in the kernel so that zoned devices behind > multi-path keep requests ordered? Yes: the kernel only issues at most one write per zone at any time, to preserve ordering. So there should be no issues at all. > Normally I don't assume any ordering between concurrent requests to a > block device, so I'm surprised that it's safe to submit multiple writes. Correct, the normal case does not provide any guarantees. But writes to zoned block devices are a special case. More on this here: https://zonedstorage.io/docs/linux/sched -- Damien Le Moal Western Digital Research
Re: [PATCH v7 3/4] qcow2: add zoned emulation capability
On 2024/09/23 15:40, Sam Li wrote: > Hi Damien, > > Damien Le Moal 于2024年9月23日周一 15:22写道: >> >> On 2024/09/23 13:06, Sam Li wrote: >> >> [...] >> >>>>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, >>>>> int64_t offset, int64_t bytes, >>>>> qiov_offset += cur_bytes; >>>>> trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes); >>>>> } >>>>> + >>>>> +if (bs->bl.zoned == BLK_Z_HM) { >>>>> +index = start_offset / zone_size; >>>>> +wp = &bs->wps->wp[index]; >>>>> +if (!QCOW2_ZT_IS_CONV(*wp)) { >>>>> +/* Advance the write pointer when the write completes */ >>>> >>>> Updating the write pointer after I/O does not prevent other write >>>> requests from beginning at the same offset as this request. Multiple >>>> write request coroutines can run concurrently and only the first one >>>> should succeed. The others should fail if they are using the same >>>> offset. >>>> >>>> The comment above says "Real drives change states before it can write to >>>> the zone" and I think it's appropriate to update the write pointer >>>> before performing the write too. The qcow2 zone emulation code is >>>> different from the file-posix.c passthrough code. We are responsible for >>>> maintaining zoned metadata state and cannot wait for the result of the >>>> I/O to tell us what happened. >> >> Yes, correct. The wp MUST be updated when issuing the IO, with the assumption >> that the write IO will succeed (errors are rare !). >> >>> The problem of updating the write pointer before IO completion is the >>> failure case. It can't be predicted in advance if an IO fails or not. >>> When write I/O fails, the wp should not be updated. >> >> Correct, if an IO fails, the wp should not be updated. However, that is not >> difficult to deal with: >> 1) under the zone lock, advance the wp position when issuing the write IO >> 2) When the write IO completes with success, nothing else needs to be done. >> 3) When *any* write IO completes with error you need to: >> - Lock the zone >> - Do a report zone for the target zone of the failed write to get >> the current >> wp location >> - Update bs->wps->wp[index] using that current wp location >> - Unlock the zone >> >> With that, one may get a few errors if multiple async writes are being >> issued, >> but that behavior is consistent with the same happening with a real drive. >> So no >> issue. And since the report zones gets you the current wp location, the user >> can >> restart writing from that location once it has dealt with all the previous >> write >> failures. > > I see. To allow the concurrent writes, the lock will only be used on > the failure path while processing append writes. > >> >>> The alternative way is to hold the wps lock as is also required for wp >>> accessing. Therefore only one of multiple concurrent write requests >>> will succeed. >> >> That is a very simple solution that avoids the above error recovery, but that >> would be very bad for performance (especially for a pure sequential write >> workload as we would limit IOs to quue depth 1). So if we can avoid this >> simple >> approach, that would be a lot better. > > Yeah, I'll drop this approach. Although, it reminds me of how > file-posix driver emulates zone_append. It holds the lock whenever > accessing wps. Does that limit IOs to QD 1 too? If so, it can be > improved. > -- one zone_append starts >>> wp_lock() >>>> IO processing >>>>> wp_update >>>>>> wp_unlock() > -- ends Yes, this limits the IO queue depth to 1, so not great. file-posix can use the exact same error recovery mechanism as I explained. The write IO flow would thus become: On submission: 1) wp_lock() 2) Check write alignement with wp (or change zone append into regular write) 3) Issue write as an asynchronous IO 4) wp_update 5) wp_unlock() And on completion, all you need is: 1) If no error, return success 2) wp_lock() 3) Do a report zone and use the reported wp value as the current wp 4) wp_unlock() 5) return IO error With this simple scheme, when there are no IO errors, things are fast and there is no "big lock" limiting the number of writes that can be issued. Only write error recovery ends up locking everything during the report zones, but that is fine. Errors are rare :) -- Damien Le Moal Western Digital Research
Re: [PATCH v7 3/4] qcow2: add zoned emulation capability
On 2024/09/23 13:06, Sam Li wrote: [...] >>> @@ -2837,6 +3180,19 @@ qcow2_co_pwritev_part(BlockDriverState *bs, int64_t >>> offset, int64_t bytes, >>> qiov_offset += cur_bytes; >>> trace_qcow2_writev_done_part(qemu_coroutine_self(), cur_bytes); >>> } >>> + >>> +if (bs->bl.zoned == BLK_Z_HM) { >>> +index = start_offset / zone_size; >>> +wp = &bs->wps->wp[index]; >>> +if (!QCOW2_ZT_IS_CONV(*wp)) { >>> +/* Advance the write pointer when the write completes */ >> >> Updating the write pointer after I/O does not prevent other write >> requests from beginning at the same offset as this request. Multiple >> write request coroutines can run concurrently and only the first one >> should succeed. The others should fail if they are using the same >> offset. >> >> The comment above says "Real drives change states before it can write to >> the zone" and I think it's appropriate to update the write pointer >> before performing the write too. The qcow2 zone emulation code is >> different from the file-posix.c passthrough code. We are responsible for >> maintaining zoned metadata state and cannot wait for the result of the >> I/O to tell us what happened. Yes, correct. The wp MUST be updated when issuing the IO, with the assumption that the write IO will succeed (errors are rare !). > The problem of updating the write pointer before IO completion is the > failure case. It can't be predicted in advance if an IO fails or not. > When write I/O fails, the wp should not be updated. Correct, if an IO fails, the wp should not be updated. However, that is not difficult to deal with: 1) under the zone lock, advance the wp position when issuing the write IO 2) When the write IO completes with success, nothing else needs to be done. 3) When *any* write IO completes with error you need to: - Lock the zone - Do a report zone for the target zone of the failed write to get the current wp location - Update bs->wps->wp[index] using that current wp location - Unlock the zone With that, one may get a few errors if multiple async writes are being issued, but that behavior is consistent with the same happening with a real drive. So no issue. And since the report zones gets you the current wp location, the user can restart writing from that location once it has dealt with all the previous write failures. > The alternative way is to hold the wps lock as is also required for wp > accessing. Therefore only one of multiple concurrent write requests > will succeed. That is a very simple solution that avoids the above error recovery, but that would be very bad for performance (especially for a pure sequential write workload as we would limit IOs to quue depth 1). So if we can avoid this simple approach, that would be a lot better. -- Damien Le Moal Western Digital Research
Re: [PATCH] block/file-posix: optimize append write
On 9/30/24 01:03, Sam Li wrote: > When the file-posix driver emulates append write, it holds the lock > whenever accessing wp, which limits the IO queue depth to one. > > The write IO flow can be optimized to allow concurrent writes. The lock > is held in two cases: > 1. Assumed that the write IO succeeds, update the wp before issuing the > write. > 2. If the write IO fails, report that zone and use the reported value > as the current wp. > > Signed-off-by: Sam Li > --- > block/file-posix.c | 33 - > 1 file changed, 16 insertions(+), 17 deletions(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index ff928b5e85..64a57fadb1 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -2489,11 +2489,19 @@ static int coroutine_fn raw_co_prw(BlockDriverState > *bs, int64_t *offset_ptr, > #if defined(CONFIG_BLKZONED) > if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && > bs->bl.zoned != BLK_Z_NONE) { > -qemu_co_mutex_lock(&bs->wps->colock); > +BlockZoneWps *wps = bs->wps; > +int index = offset / bs->bl.zone_size; A blank line after this declaration (to separate declarations from code) woule make this code more readable... > +qemu_co_mutex_lock(&wps->colock); > +uint64_t *wp = &wps->wp[index]; > if (type & QEMU_AIO_ZONE_APPEND) { > -int index = offset / bs->bl.zone_size; > -offset = bs->wps->wp[index]; > +offset = *wp; > +*offset_ptr = offset; > +} > +/* Advance the wp if needed */ > +if (offset + bytes > *wp) { Why the if ? offset MUST be equal to wp for writes, and for zone append we do not need to check the offset at all. So advancing the wp should be unconditional. BUT ! where are the checks for "zone is full" and "offset == wp" for write operations ? These must be checked while holding the zone lock. > +*wp = offset + bytes; > } > +qemu_co_mutex_unlock(&bs->wps->colock); > } > #endif > > @@ -2540,28 +2548,19 @@ out: > #if defined(CONFIG_BLKZONED) > if ((type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) && > bs->bl.zoned != BLK_Z_NONE) { > -BlockZoneWps *wps = bs->wps; > if (ret == 0) { > -uint64_t *wp = &wps->wp[offset / bs->bl.zone_size]; > -if (!BDRV_ZT_IS_CONV(*wp)) { > -if (type & QEMU_AIO_ZONE_APPEND) { > -*offset_ptr = *wp; > -trace_zbd_zone_append_complete(bs, *offset_ptr > ->> BDRV_SECTOR_BITS); > -} > -/* Advance the wp if needed */ > -if (offset + bytes > *wp) { > -*wp = offset + bytes; > -} > +if (type & QEMU_AIO_ZONE_APPEND) { > +trace_zbd_zone_append_complete(bs, *offset_ptr > +>> BDRV_SECTOR_BITS); > } > } else { > +qemu_co_mutex_lock(&bs->wps->colock); > /* > * write and append write are not allowed to cross zone > boundaries > */ > update_zones_wp(bs, s->fd, offset, 1); > +qemu_co_mutex_unlock(&bs->wps->colock); > } > - > -qemu_co_mutex_unlock(&wps->colock); > } > #endif > return ret; -- Damien Le Moal Western Digital Research