Stefan Hajnoczi <stefa...@gmail.com> 于2022年6月2日周四 16:05写道: > > On Thu, 2 Jun 2022 at 06:43, Sam Li <faithilike...@gmail.com> wrote: > > > > Hi Stefan, > > > > Stefan Hajnoczi <stefa...@gmail.com> 于2022年6月1日周三 19:43写道: > > > > > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal > > > <damien.lem...@opensource.wdc.com> wrote: > > > > > > > > On 6/1/22 11:57, Sam Li wrote: > > > > > Hi Stefan, > > > > > > > > > > Stefan Hajnoczi <stefa...@gmail.com> 于2022年5月30日周一 19:19写道: > > > > > > > > > > > > > > >> > > > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilike...@gmail.com> wrote: > > > > >>> > > > > >>> Hi everyone, > > > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned > > > > >>> device support to QEMU's virtio-blk emulation. > > > > >>> > > > > >>> For the first goal, adding QEMU block layer APIs resembling Linux > > > > >>> ZBD > > > > >>> ioctls, I think the naive approach would be to introduce a new > > > > >>> stable > > > > >>> struct zbd_zone descriptor for the library function interface. More > > > > >>> specifically, what I'd like to add to the BlockDriver struct are: > > > > >>> 1. zbd_info as zone block device information: includes numbers of > > > > >>> zones, size of logical blocks, and physical blocks. > > > > >>> 2. zbd_zone_type and zbd_zone_state > > > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd > > > > >>> With those basic structs, we can start to implement new functions as > > > > >>> bdrv*() APIs for BLOCK*ZONE ioctls. > > > > >>> > > > > >>> I'll start to finish this task based on the above description. If > > > > >>> there is any problem or something I may miss in the design, please > > > > >>> let > > > > >>> me know. > > > > >> > > > > >> Hi Sam, > > > > >> Can you propose function prototypes for the new BlockDriver callbacks > > > > >> needed for zoned devices? > > > > > > > > > > I have made some modifications based on Damien's device in design part > > > > > 1 and added the function prototypes in design part 2. If there is any > > > > > problem or part I missed, please let me know. > > > > > > > > > > Design of Block Layer APIs in BlockDriver: > > > > > 1. introduce a new stable struct zbd_zone descriptor for the library > > > > > function interface. > > > > > a. zbd_info as zone block device information: includes numbers of > > > > > zones, size of blocks, write granularity in byte(minimal write size > > > > > and alignment > > > > > - write granularity: 512e SMRs: writes in units of physical block > > > > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block > > > > > size. > > > > > - zone descriptor: start, length, capacity, write pointer, zone > > > > > type > > > > > b. zbd_zone_type > > > > > - zone type: conventional, sequential write required, sequential > > > > > write preferred > > > > > c. zbd_dev_model: host-managed zbd, host-aware zbd > > > > > > > > This explanation is a little hard to understand. It seems to be mixing > > > > up > > > > device level information and per-zone information. I think it would be a > > > > lot simpler to write a struct definition to directly illustrate what you > > > > are planning. > > > > > > > > It is something like this ? > > > > > > > > struct zbd_zone { > > > > enum zone_type type; > > > > enum zone_cond cond; > > > > uint64_t start; > > > > uint32_t length; > > > > uint32_t cap; > > > > uint64_t wp; > > > > }; > > > > > > > > strcut zbd_dev { > > > > enum zone_model model; > > > > uint32_t block_size; > > > > uint32_t write_granularity; > > > > uint32_t nr_zones > > > > struct zbd_zone *zones; /* array of zones */ > > > > }; > > > > > > > > If yes, then my comments are as follows. > > > > > > > > For the device struct: It may be good to have also the maximum number of > > > > open zones and the maximum number of active zones. > > > > > > > > For the zone struct: You may need to add a read-write lock per zone to > > > > be > > > > able to write lock zones to ensure a sequential write pattern (virtio > > > > devices can be multi-queue and so writes may be coming in from different > > > > contexts) and to correctly emulate zone append operations with an atomic > > > > update of the wp field. > > > > > > > > These need to be integrated into the generic block driver interface in > > > > include/block/block_int-common.h or include/block/block-common.h. > > > > > > QEMU's block layer has a few ways of exposing information about block > > > devices: > > > > > > int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi); > > > ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs, > > > Error **errp); > > > > > > These fetch information from the BlockDriver and are good when a small > > > amount of data is reported occassionally and consumed by the caller. > > > > > > For data that is continuously accessed or that could be large, it may > > > be necessary for the data to reside inside BlockDriverState so that it > > > can be accessed in place (without copying): > > > > > > void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp); > > > > > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that > > > is continuously accessed by the block layer while processing I/O > > > requests. The "refresh" function updates the data in case the > > > underlying storage device has changed somehow. If no update function > > > is necessary then data can simply be populated during .bdrv_open() and > > > no new BlockDriver callback needs to be added. > > > > > > So in the simplest case BlockDriverState can be extended with a struct > > > zbd_dev field that is populated during .bdrv_open(). If the > > > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0 > > > or the model field indicates that this is not a zoned storage device. > > > > > > However, a BlockBackend (not BlockDriverState!) API will be needed to > > > expose this data to users like the hw/block/virtio-blk.c emulation > > > code or the qemu-io-cmds.c utility that can be used for testing. A > > > BlockBackend has a root pointer to a BlockDriverState graph (for > > > example, qcow2 on top of file-posix). It will be necessary to > > > propagate zoned storage information from the leaf BlockDriverState all > > > the way up to the BlockBackend. In simple cases the BB root points > > > directly to the file-posix BDS that has Linux ZBD support but the > > > design needs to account for additional BDS graph nodes. > > > > I think a simple way to think BlockBackend APIs is to use following > > callbacks: blk_aio_zone_mgmt() -> blk_aio_prwv() + > > blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt(). > > The last function call will call bdrv_co_zone_mgmt() in > > block/file-posix.c. If I understand the additional case correctly, the > > BlockBackend API can expose the zone information to the virtio-blk > > emulation now. > > Yes! > > block/raw-format.c also needs to implement .bdrv_co_zone_mgmt() by > calling bdrv_co_zone_mgmt(bs->file, ...). This is because the > raw-format.c driver usually sits on top of file-posix.c and has to > pass through requests. > > There are filter block drivers like block/throttle.c, > block/blkdebug.c, etc (git grep is_filter block/) that will also need > to be modified to pass through requests in the same way. >
Are the filter block drivers also on top of file-posix.c but below block-backend.c? I read that the filter block drivers, and formats are designed to be manageable pieces so as to make block device configuration easier and clearer. > Based on what I've read in Dmitry's virtio-blk spec proposal, the > BlockBackend API could look something like: > > typedef struct { ...model, zone_sectors, max_open_zones, etc... } > BlockZoneInfo; > void blk_get_zone_info(BlockBackend *blk, BlockZoneInfo *info); > > virtio-blk.c calls this to fill out configuration space fields and > determine whether the BlockBackend is a zoned device. > > Then there are 3 commands that happen in the I/O code path: > > typedef struct { ... } BlockZoneDescriptor; > BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset, > BlockZoneDescriptor *zones, size_t max_zones, BlockCompletionFunc *cb, > void *opaque); > > typedef enum { ... } BlockZoneMgmtCmd; > BlockAIOCB *blk_aio_zone_mgmt_send(BlockBackend *blk, int64_t offset, > BlockZoneMgmtCmd cmd, bool all, BlockCompletionFunc *cb, void > *opaque); > > typedef void BlockZoneAppendCompletionFunc(void *opaque, int ret, > int64_t new_wp); > BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t offset, > QEMUIOVector *qiov, BlockZoneAppendCompletionFunc *cb, void *opaque); > > > Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c > > which lead to include/block/block-io.h, we may need consider the case > > when calling block layer API from non-coroutine context. Meanwhile, > > using bdrv_co_writev()/bdrv_co_readv() instead of read-write lock per > > zone may be a option too. > > Yes, device emulation code usually uses the aio versions of the > BlockBackend I/O functions (read, write, flush). The QEMU block layer > runs the aio I/O request inside a coroutine and usually also exposes > coroutine versions of the same functions. For example, block jobs > (e.g. storage mirroring, backup, and migration background tasks) > usually call the coroutine versions of the BlockBackend APIs instead > of the aio ones. > > qemu-io-cmds.c will want synchronous versions of the aio commands > (blk_zone_report(), blk_zone_mgmt_send(), blk_zone_append()) that > block until the command completes. This is because the qemu-io utility > typically executes one command at a time and it's written mostly in > blocking style rather than async callbacks or coroutines. > docs/devel/block-coroutine-wrapper.rst describes how to generate > synchronous versions of coroutine functions. > > Do you want to start implementing blk_get_zone_info()? This will > require blk_*(), bdrv_*(), and BlockDriver (block/file-posix.c) > functions. I want to implement the smallest part that can be tested first and then move on to the next part. And I want to test zone report operation first. Does the qemu io-test require the following part to work: bdrv_co_zone_report in file-pisix.c, blk_get_zone_info() in block-backend.c, blk_aio_zone_report() in io code path and modify some test in test/qemu-iotests? If it does, then yes. Have a nice day! Sam > > Stefan