Hi Stefan, Stefan Hajnoczi <stefa...@gmail.com> 于2022年6月1日周三 19:43写道: > > On Wed, 1 Jun 2022 at 06:47, Damien Le Moal > <damien.lem...@opensource.wdc.com> wrote: > > > > On 6/1/22 11:57, Sam Li wrote: > > > Hi Stefan, > > > > > > Stefan Hajnoczi <stefa...@gmail.com> 于2022年5月30日周一 19:19写道: > > > > > > > > >> > > >> On Mon, 30 May 2022 at 06:09, Sam Li <faithilike...@gmail.com> wrote: > > >>> > > >>> Hi everyone, > > >>> I'm Sam Li, working on the Outreachy project which is to add zoned > > >>> device support to QEMU's virtio-blk emulation. > > >>> > > >>> For the first goal, adding QEMU block layer APIs resembling Linux ZBD > > >>> ioctls, I think the naive approach would be to introduce a new stable > > >>> struct zbd_zone descriptor for the library function interface. More > > >>> specifically, what I'd like to add to the BlockDriver struct are: > > >>> 1. zbd_info as zone block device information: includes numbers of > > >>> zones, size of logical blocks, and physical blocks. > > >>> 2. zbd_zone_type and zbd_zone_state > > >>> 3. zbd_dev_model: host-managed zbd, host-aware zbd > > >>> With those basic structs, we can start to implement new functions as > > >>> bdrv*() APIs for BLOCK*ZONE ioctls. > > >>> > > >>> I'll start to finish this task based on the above description. If > > >>> there is any problem or something I may miss in the design, please let > > >>> me know. > > >> > > >> Hi Sam, > > >> Can you propose function prototypes for the new BlockDriver callbacks > > >> needed for zoned devices? > > > > > > I have made some modifications based on Damien's device in design part > > > 1 and added the function prototypes in design part 2. If there is any > > > problem or part I missed, please let me know. > > > > > > Design of Block Layer APIs in BlockDriver: > > > 1. introduce a new stable struct zbd_zone descriptor for the library > > > function interface. > > > a. zbd_info as zone block device information: includes numbers of > > > zones, size of blocks, write granularity in byte(minimal write size > > > and alignment > > > - write granularity: 512e SMRs: writes in units of physical block > > > size, 4096 bytes; NVMe ZNS write granularity is equal to the block > > > size. > > > - zone descriptor: start, length, capacity, write pointer, zone type > > > b. zbd_zone_type > > > - zone type: conventional, sequential write required, sequential > > > write preferred > > > c. zbd_dev_model: host-managed zbd, host-aware zbd > > > > This explanation is a little hard to understand. It seems to be mixing up > > device level information and per-zone information. I think it would be a > > lot simpler to write a struct definition to directly illustrate what you > > are planning. > > > > It is something like this ? > > > > struct zbd_zone { > > enum zone_type type; > > enum zone_cond cond; > > uint64_t start; > > uint32_t length; > > uint32_t cap; > > uint64_t wp; > > }; > > > > strcut zbd_dev { > > enum zone_model model; > > uint32_t block_size; > > uint32_t write_granularity; > > uint32_t nr_zones > > struct zbd_zone *zones; /* array of zones */ > > }; > > > > If yes, then my comments are as follows. > > > > For the device struct: It may be good to have also the maximum number of > > open zones and the maximum number of active zones. > > > > For the zone struct: You may need to add a read-write lock per zone to be > > able to write lock zones to ensure a sequential write pattern (virtio > > devices can be multi-queue and so writes may be coming in from different > > contexts) and to correctly emulate zone append operations with an atomic > > update of the wp field. > > > > These need to be integrated into the generic block driver interface in > > include/block/block_int-common.h or include/block/block-common.h. > > QEMU's block layer has a few ways of exposing information about block devices: > > int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi); > ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs, > Error **errp); > > These fetch information from the BlockDriver and are good when a small > amount of data is reported occassionally and consumed by the caller. > > For data that is continuously accessed or that could be large, it may > be necessary for the data to reside inside BlockDriverState so that it > can be accessed in place (without copying): > > void (*bdrv_refresh_limits)(BlockDriverState *bs, Error **errp); > > QEMU uses this for the BlockLimits struct (BlockDriverState::bl) that > is continuously accessed by the block layer while processing I/O > requests. The "refresh" function updates the data in case the > underlying storage device has changed somehow. If no update function > is necessary then data can simply be populated during .bdrv_open() and > no new BlockDriver callback needs to be added. > > So in the simplest case BlockDriverState can be extended with a struct > zbd_dev field that is populated during .bdrv_open(). If the > BlockDriver doesn't support zones then the zbd_dev.nr_zones field is 0 > or the model field indicates that this is not a zoned storage device. > > However, a BlockBackend (not BlockDriverState!) API will be needed to > expose this data to users like the hw/block/virtio-blk.c emulation > code or the qemu-io-cmds.c utility that can be used for testing. A > BlockBackend has a root pointer to a BlockDriverState graph (for > example, qcow2 on top of file-posix). It will be necessary to > propagate zoned storage information from the leaf BlockDriverState all > the way up to the BlockBackend. In simple cases the BB root points > directly to the file-posix BDS that has Linux ZBD support but the > design needs to account for additional BDS graph nodes.
I think a simple way to think BlockBackend APIs is to use following callbacks: blk_aio_zone_mgmt() -> blk_aio_prwv() + blk_aio_zone_mgmt_entry() -> blk_co_do_zone_mgmt() -> blk_zone_mgmt(). The last function call will call bdrv_co_zone_mgmt() in block/file-posix.c. If I understand the additional case correctly, the BlockBackend API can expose the zone information to the virtio-blk emulation now. Besides, comparing blk_aio_flush() with blk_flush() in block-backend.c which lead to include/block/block-io.h, we may need consider the case when calling block layer API from non-coroutine context. Meanwhile, using bdrv_co_writev()/bdrv_co_readv() instead of read-write lock per zone may be a option too. If there is any problem, please let me know. Best regards, Sam > > In order to make progress on this interface I suggest looking at the > virtio-blk spec extension for zoned storage and thinking what the > BlockBackend API should look like that hw/block/virtio-blk.c will use. > Then it may be easier to decide how to report zone information from > BlockDriverState. > > > Stefan