Re: [qemu-web PATCH] Add a blog post about zoned storage emulation

2022-11-17 Thread Sam Li
On Fri, Nov 18, 2022 at 03:12, Stefan Hajnoczi wrote:
>
> Hi Sam,
> Please send a git repo URL so Thomas can fetch the commit without
> email/file size limitations.

I'll push it to the zbd branch after fixing the issues below.
https://github.com/sgzerolc/qemu-web/zbd

>
> > diff --git a/_posts/2022-11-17-zoned-emulation.md 
> > b/_posts/2022-11-17-zoned-emulation.md
> > new file mode 100644
> > index 000..69ce4d7
> > --- /dev/null
> > +++ b/_posts/2022-11-17-zoned-emulation.md
> > @@ -0,0 +1,45 @@
> > +---
> > +layout: post
> > +title:  "Introduction to Zoned Storage Emulation"
> > +date:   2022-11-17
> > +author: Sam Li
> > +categories: [storage, gsoc, outreachy, internships]
> > +---
> > +
> > +## Zoned block devices
> > +
> > +Aimed for at-scale data infrastructures,
>
> I don't know what at-scale data infrastructure is. Is it something
> readers can relate to? Otherwise there's a risk that readers will
> decide this doesn't apply to them and stop reading.

Yes, I'll remove it.

>
> > zoned block devices (ZBDs) divide the LBA space into block regions called 
> > zones that are larger than the LBA size.
>
> LBA is not defined and also not used again after this sentence.
> Readers will be familiar with disks but may not know what an LBA is.
> Since the concept isn't used again I suggest dropping it:
>
>   zoned block devices (ZBDs) are divided into regions called zones
> that can only be written sequentially.
>
> > By only allowing sequential writes, it can reduce write amplification in 
> > SSDs,
>
> This sounds more natural:
>
>   By only allowing sequential writes, SSD write amplification can be reduced
>
> It might also be nice to provide a little bit of extra context:
>
>   ... reduced by eliminating the need for a <a href="https://en.wikipedia.org/wiki/Flash_translation_layer">Flash
> Translation Layer</a>
>
> > and potentially lead to higher throughput and increased capacity. Providing 
> > new storage software stack,
>
> s/Providing new/Providing a new/
>
> > zoned storage concept is standardized as ZBC(SCSI standard), ZAC(ATA 
> > standard), ZNS(NVMe).
>
> Small tweaks:
>
>   zoned storage concepts are standardized in ZBC (SCSI standard), ZAC
> (ATA standard), ZNS (NVMe).
>
> There is a space before opening parentheses: hello (world) instead of
> hello(world). Please check the rest of the article for more instances
> of this.
>
> It would be nice to include links but I didn't find good pages for
> ZBC/ZAC/ZNS aside from the full standards that they are part of.
>
> This intro section would be a good place to link to https://zonedstorage.io/!

Good idea! Zoned storage site also has a brief introduction to those standards.
https://zonedstorage.io/docs/introduction/smr#governing-standards
https://zonedstorage.io/docs/introduction/zns

>
> > Meanwhile, the virtio protocol for block devices(virtio-blk) should also be 
> > aware of ZBDs instead of taking them as regular block devices. It should be 
> > able to pass such devices through to the guest. An overview of necessary 
> > work is as follows:
> > +
> > +1. Virtio protocol: [extend virtio-blk protocol with main zoned storage 
> > concept](https://lwn.net/Articles/914377/), Dmitry Fomichev
> > +2. Linux: [implement the virtio specification 
> > extensions](https://www.spinics.net/lists/linux-block/msg91944.html), 
> > Dmitry Fomichev
> > +3. QEMU: add zoned emulation support to virtio-blk, Sam Li, [Outreachy 
> > 2022 
> > project](https://wiki.qemu.org/Internships/ProjectIdeas/VirtIOBlkZonedBlockDevices)
>
> You could split the QEMU work into 2 points if you like:
> 3. QEMU: add zoned storage APIs to the block layer, Sam Li
> 4. QEMU: implement zoned storage support in virtio-blk emulation, Sam Li
>
> > +
> > +
> > +
> > +## Zoned emulation
> > +
> > +Currently, QEMU can support zoned devices by virtio-scsi or PCI device 
> > passthrough. It needs to specify the device type it is talking to. While 
> > storage controller emulation uses block layer APIs instead of directly 
> > accessing disk images. Extending virtio-blk emulation avoids code 
> > duplication and simplify the support by hiding the device types under a 
> > unified zoned storage interface, simplifying VM deployment for different 
> > type of zoned devices.
>
> Other advantages that come to mind:
> 1. virtio-blk can be implemented in hardware. If those devices wish to
> follow the zoned storage model then the virtio-blk specification needs
> to natively support zoned stora

Re: [qemu-web PATCH] Add a blog post about zoned storage emulation

2022-11-17 Thread Sam Li
On Fri, Nov 18, 2022 at 08:33, Sam Li wrote:
>
> On Fri, Nov 18, 2022 at 03:12, Stefan Hajnoczi wrote:
> >
> > Hi Sam,
> > Please send a git repo URL so Thomas can fetch the commit without
> > email/file size limitations.
>
> I'll push it to the zbd branch after fixing the issues below.
> https://github.com/sgzerolc/qemu-web/zbd

Sorry, I've pushed the latest commit and the link to it should be:
https://github.com/sgzerolc/qemu-web/tree/zbd

Thanks,
Sam



Re: [qemu-web PATCH] Add a blog post about zoned storage emulation

2022-11-23 Thread Sam Li
On Wed, Nov 23, 2022 at 20:48, Thomas Huth wrote:
>
> On 17/11/2022 20.12, Stefan Hajnoczi wrote:
> > Hi Sam,
> > Please send a git repo URL so Thomas can fetch the commit without
> > email/file size limitations.
>
> The size obviously comes from the PNG image ... since this seems to be a
> photo, I think JPG would be a better file type, so please convert it to JPG
> with an appropriate compression level. I assume this will help to shrink it
> to a reasonable size.
>
> >> +
>
> Another question: Where does the picture come from? Does it have a license
> that allows it to be used on websites like the QEMU blog?

It comes from slide 12 of this talk, and it doesn't have such a
license, so I'll remove the image.
https://kvmforum2022.sched.com/event/15jL3/whats-in-virtio-12-and-what-isnt-there-michael-s-tsirkin-red-hat?

Here is a link to the fixed version:
https://github.com/sgzerolc/qemu-web/tree/zbd


Thanks,
Sam



[qemu-web PATCH v2] Add a blog post about zoned storage emulation

2022-11-27 Thread Sam Li
Signed-off-by: Sam Li 
---
 _posts/2022-11-17-zoned-emulation.md | 69 
 1 file changed, 69 insertions(+)
 create mode 100644 _posts/2022-11-17-zoned-emulation.md

diff --git a/_posts/2022-11-17-zoned-emulation.md 
b/_posts/2022-11-17-zoned-emulation.md
new file mode 100644
index 000..1e16e99
--- /dev/null
+++ b/_posts/2022-11-17-zoned-emulation.md
@@ -0,0 +1,69 @@
+---
+layout: post
+title:  "Introduction to Zoned Storage Emulation"
+date:   2022-11-17
+author: Sam Li
+categories: [storage, gsoc, outreachy, internships]
+---
+
+This summer I worked on adding Zoned Block Device (ZBD) support to virtio-blk 
as part of the [Outreachy](https://www.outreachy.org/) internship program. QEMU 
hasn't directly supported ZBDs before so this article explains how they work 
and why QEMU needed to be extended.
+
+## Zoned block devices
+
+Zoned block devices (ZBDs) are divided into regions called zones that can only 
be written sequentially. By only allowing sequential writes, SSD write 
amplification can be reduced by eliminating the need for a [Flash Translation 
Layer](https://en.wikipedia.org/wiki/Flash_translation_layer), which can 
potentially lead to higher throughput and increased capacity. Zoned storage 
concepts are standardized as [ZBC (SCSI standard), ZAC (ATA 
standard)](https://zonedstorage.io/docs/introduction/smr#governing-standards), 
and [ZNS (NVMe)](https://zonedstorage.io/docs/introduction/zns), and come with 
a new storage software stack. The virtio protocol for block devices 
(virtio-blk) should therefore also be aware of ZBDs instead of treating them as 
regular block devices, so that such devices can be passed through to the 
guest. An overview of the necessary work is as follows:
+
+1. Virtio protocol: [extend virtio-blk protocol with main zoned storage 
concept](https://lwn.net/Articles/914377/), Dmitry Fomichev
+2. Linux: [implement the virtio specification 
extensions](https://www.spinics.net/lists/linux-block/msg91944.html), Dmitry 
Fomichev
+3. QEMU: [add zoned storage APIs to the block 
layer](https://lists.gnu.org/archive/html/qemu-devel/2022-10/msg05195.html), 
Sam Li
+4. QEMU: implement zoned storage support in virtio-blk emulation, Sam Li
+
+Once the QEMU and Linux patches have been merged it will be possible to expose 
a virtio-blk ZBD to the guest like this:
+
+```sh
+-blockdev node-name=drive0,driver=zoned_host_device,filename=/path/to/zbd,cache.direct=on \
+-device virtio-blk-pci,drive=drive0
+```
+
+We can then run zoned block commands on that device in the guest OS:
+
+```sh
+# blkzone report /dev/vda
+  start: 0x0, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0x2, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0x4, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0x6, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0x8, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0xa, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0xc, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0xe, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
+  start: 0x00010, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
+  start: 0x00012, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
+  start: 0x00014, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
+  start: 0x00016, len 0x02, cap 0x02, wptr 0x00 reset:0 
non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
+```
+
+## Zoned emulation
+
+Currently, QEMU can support zoned devices through virtio-scsi or PCI device 
passthrough, both of which require specifying the type of device being passed 
through. Storage controller emulation, in contrast, uses block layer APIs 
instead of accessing disk images directly. Extending virtio-blk emulation 
avoids code duplication and simplifies support by hiding the device types 
behind a unified zoned storage interface, which in turn simplifies VM 
deployment for different types of zoned devices. Virtio-blk can also be 
implemented in hardware. If those devices wish to follow the zoned storage 
model then the virtio-blk specification needs to natively support zoned 
storage. With such support, individual NVMe namespaces or anything that is a 
zoned Linux block device can be exposed to the guest without passing through a 
full device.
+
+For zoned storage emulation, zoned storage APIs supp

Re: [PATCH v13 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-11-29 Thread Sam Li
On Wed, Nov 30, 2022 at 10:01, Stefan Hajnoczi wrote:
>
> On Thu, 27 Oct 2022 at 11:46, Sam Li  wrote:
> >
> > Add a new zoned_host_device BlockDriver. The zoned_host_device option
> > accepts only zoned host block devices. By adding zone management
> > operations in this new BlockDriver, users can use the new block
> > layer APIs including Report Zone and five zone management operations
> > (open, close, finish, reset, reset_all).
> >
> > Qemu-io uses the new APIs to perform zoned storage commands of the device:
> > zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
> > zone_finish(zf).
> >
> > For example, to test zone_report, use following command:
> > $ ./build/qemu-io --image-opts -n driver=zoned_host_device,filename=/dev/nullb0
> > -c "zrp offset nr_zones"
> >
> > Signed-off-by: Sam Li 
> > Reviewed-by: Hannes Reinecke 
> > ---
> >  block/block-backend.c | 147 +
> >  block/file-posix.c| 348 ++
> >  block/io.c|  41 
> >  include/block/block-io.h  |   7 +
> >  include/block/block_int-common.h  |  21 ++
> >  include/block/raw-aio.h   |   6 +-
> >  include/sysemu/block-backend-io.h |  18 ++
> >  meson.build   |   4 +
> >  qapi/block-core.json  |   8 +-
> >  qemu-io-cmds.c| 149 +
> >  10 files changed, 746 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/block-backend.c b/block/block-backend.c
> > index aa4adf06ae..731f23e816 100644
> > --- a/block/block-backend.c
> > +++ b/block/block-backend.c
> > @@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
> >  void *iobuf;
> >  int ret;
> >  BdrvRequestFlags flags;
> > +union {
> > +struct {
> > +unsigned int *nr_zones;
> > +BlockZoneDescriptor *zones;
> > +} zone_report;
> > +struct {
> > +unsigned long op;
> > +} zone_mgmt;
> > +};
> >  } BlkRwCo;
> >
> >  int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
> > @@ -1775,6 +1784,144 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
> >  return ret;
> >  }
> >
> > +static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
> > +{
> > +BlkAioEmAIOCB *acb = opaque;
> > +BlkRwCo *rwco = &acb->rwco;
> > +
> > +rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
> > +   rwco->zone_report.nr_zones,
> > +   rwco->zone_report.zones);
> > +blk_aio_complete(acb);
> > +}
> > +
> > +BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
> > +unsigned int *nr_zones,
> > +BlockZoneDescriptor  *zones,
> > +BlockCompletionFunc *cb, void *opaque)
> > +{
> > +BlkAioEmAIOCB *acb;
> > +Coroutine *co;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk);
> > +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> > +acb->rwco = (BlkRwCo) {
> > +.blk= blk,
> > +.offset = offset,
> > +.ret= NOT_DONE,
> > +.zone_report = {
> > +.zones = zones,
> > +.nr_zones = nr_zones,
> > +},
> > +};
> > +acb->has_returned = false;
> > +
> > +co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
> > +bdrv_coroutine_enter(blk_bs(blk), co);
> > +
> > +acb->has_returned = true;
> > +if (acb->rwco.ret != NOT_DONE) {
> > +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> > + blk_aio_complete_bh, acb);
> > +}
> > +
> > +return &acb->common;
> > +}
> > +
> > +static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
> > +{
> > +BlkAioEmAIOCB *acb = opaque;
> > +BlkRwCo *rwco = &acb->rwco;
> > +
> > +rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
> > + rwco->offset, acb->bytes);
> > +blk_aio_complete(acb);
> > +}
> > +
> > +BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +  int64_t offset, int64_t len,
> > +   

Re: [PATCH v13 0/8] Add support for zoned device

2022-11-29 Thread Sam Li
On Wed, Nov 30, 2022 at 10:04, Stefan Hajnoczi wrote:
>
> On Thu, 27 Oct 2022 at 11:46, Sam Li  wrote:
> > v13:
> > - add some tracing points for new zone APIs [Dmitry]
> > - change error handling in zone_mgmt [Damien, Stefan]
>
> Hi Sam,
> This looks very close! I sent comments.

That's great! I'll fix them.

Sam



Re: [PATCH v13 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-12-05 Thread Sam Li
On Mon, Dec 5, 2022 at 20:20, Stefan Hajnoczi wrote:
>
> On Wed, Nov 30, 2022 at 10:24:10AM +0800, Sam Li wrote:
> > On Wed, Nov 30, 2022 at 10:01, Stefan Hajnoczi wrote:
> > > On Thu, 27 Oct 2022 at 11:46, Sam Li  wrote:
> > > > @@ -1374,9 +1428,11 @@ static int 
> > > > hdev_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
> > > >  int ret;
> > > >
> > > >  /* If DASD, get blocksizes */
> > > > +#ifndef CONFIG_BLKZONED
> > > >  if (check_for_dasd(s->fd) < 0) {
> > > >  return -ENOTSUP;
> > > >  }
> > > > +#endif
> > >
> > > What is the purpose of this #ifndef? .bdrv_probe_blocksizes() should
> > > only return block sizes for s390 DASD devices. I don't think zoned
> > > storage needs block size probing here.
> >
> > Zoned storage needs to be virtualized with the correct physical block
> > size and logical block size, and the probing here guarantees that.
> > Otherwise virtio-blk may report the wrong block size to the guest. Setting
> > the block size manually on the command line, as before, is less accurate.
>
> I see. I/O won't work if the guest block size differs from the physical
> zoned device's block size.
>
> However, we must not do this for regular host_device BlockDriverStates.
> The block size is manually controlled from those devices and defaults to
> 512B. That way the blocksize doesn't change across live migration and
> break the guest.
>
> Please use a run-time check instead of an #ifdef. Only probe blocksizes
> for dasd and zoned devices.

I see. Like this?

#ifndef CONFIG_BLKZONED
static int hdev_probe_zbd_blocksizes(BlockDriverState *bs, BlockSizes *bsz)
{
    int ret;
    /* check zbd */
    ...
    /* probe zbd */
}
#endif
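
For comparison, the run-time check requested above could be sketched roughly as
follows. This is only an illustration under stated assumptions: `ProbeInfo`,
`BlockSizesSketch` and `probe_blocksizes_sketch()` are hypothetical stand-ins
and not QEMU APIs, and the device facts are passed in rather than detected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: facts the driver would have discovered about the device. */
typedef struct ProbeInfo {
    bool is_dasd;                      /* s390 DASD device */
    bool is_zoned;                     /* host-managed or host-aware device */
    unsigned int logical_block_size;
    unsigned int physical_block_size;
} ProbeInfo;

/* Hypothetical stand-in for the structure the probe callback fills in. */
typedef struct BlockSizesSketch {
    unsigned int log;
    unsigned int phys;
} BlockSizesSketch;

/*
 * Report block sizes only for DASD and zoned devices. Any other device
 * gets -ENOTSUP, so the caller keeps the user-configured block size and
 * the value does not change across live migration.
 */
static int probe_blocksizes_sketch(const ProbeInfo *info,
                                   BlockSizesSketch *bsz)
{
    assert(info && bsz);
    if (!info->is_dasd && !info->is_zoned) {
        return -ENOTSUP;
    }
    bsz->log = info->logical_block_size;
    bsz->phys = info->physical_block_size;
    return 0;
}
```

The decision is made from values known at run time, so a single binary behaves
correctly for regular, DASD and zoned devices without a compile-time switch.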



Re: [RFC v4 0/9] Add support for zoned device

2022-07-27 Thread Sam Li
On Wed, Jul 27, 2022 at 23:06, Stefan Hajnoczi wrote:
>
> This patch series introduces the concept of zoned storage to the QEMU
> block layer. Documentation is needed so that users and developers know
> how to use and maintain this feature.
>
> As a minimum, let's document how to pass through zoned block devices on Linux:
>
> diff --git a/docs/system/qemu-block-drivers.rst.inc
> b/docs/system/qemu-block-drivers.rst.inc
> index dfe5d2293d..f6ba05710a 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -430,6 +430,12 @@ Hard disks
>you may corrupt your host data (use the ``-snapshot`` command
>line option or modify the device permissions accordingly).
>
> +Zoned block devices
> +  Zoned block devices can be passed through to the guest if the emulated
> +  storage controller supports zoned storage. Use ``--blockdev
> +  zoned_host_device,node-name=drive0,filename=/dev/nullb0`` to pass through
> +  ``/dev/nullb0`` as ``drive0``.
> +
>  Windows
>  ^^^
>
> For developers there should be an explanation of the zoned storage
> APIs and how BlockDrivers declare support. It should also mention the
> status of pass through (implemented in the zoned_host_device driver)
> vs zone emulation (not implemented in the QEMU block layer) so
> developers understand the block layer's zoned storage capabilities.
> You can add a docs/devel/zoned-storage.rst file to document this or
> let me know if you want me to write it.

I will write the document and address the issues in the reviews, which
should be in the next revision.
Thanks for reviewing!

Have a good day!
Sam



[RFC v5 00/11] Add support for zoned device

2022-07-31 Thread Sam Li
Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
that are larger than the LBA size. Zones can only be written sequentially, which
reduces write amplification in SSDs and can lead to higher throughput and
increased capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage
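
To make the sequential-write rule concrete, here is a minimal sketch of a single
zone with a write pointer. It is not code from this series: `ZoneSketch`,
`zone_write()` and `zone_reset()` are hypothetical names, and the error codes
are chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one sequential-write zone, all values in bytes. */
typedef struct ZoneSketch {
    uint64_t start;  /* first byte of the zone */
    uint64_t cap;    /* number of writable bytes in the zone */
    uint64_t wp;     /* write pointer: the only offset that may be written */
} ZoneSketch;

/* Accept a write only if it starts exactly at the write pointer. */
static int zone_write(ZoneSketch *z, uint64_t offset, uint64_t len)
{
    assert(z);
    if (offset != z->wp) {
        return -EIO;        /* not sequential */
    }
    if (len > z->start + z->cap - z->wp) {
        return -ENOSPC;     /* would run past the zone capacity */
    }
    z->wp += len;
    return 0;
}

/* Resetting a zone rewinds the write pointer to the zone start. */
static void zone_reset(ZoneSketch *z)
{
    assert(z);
    z->wp = z->start;
}
```

A write at any offset other than `wp` is rejected, which is the property that
lets the device avoid remapping data internally.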

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code. Currently,
the virtio-blk device is not aware of ZBDs; the guest sees host-managed drives
as regular drives that run correctly under the most common write workloads.

This patch series extends the block layer APIs and the virtio-blk emulation
with the minimum set of zoned commands that are necessary to support zoned
devices. The commands are Report Zones, four zone operations, and Zone Append
(in development).
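
As a rough illustration of what the four zone operations do, the following
sketch models a reduced set of zone states. It is a simplification and not the
code in this series: the names are hypothetical, implicit-open, read-only and
offline states are left out, open/active resource limits are ignored, and
rejecting OPEN on a full zone is an assumption (the governing standard defines
the exact behavior).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef enum { ZS_EMPTY, ZS_EOPEN, ZS_CLOSED, ZS_FULL } ZoneStateSketch;
typedef enum { ZO_OPEN, ZO_CLOSE, ZO_FINISH, ZO_RESET } ZoneOpSketch;

typedef struct ZoneSm {
    uint64_t start;         /* first byte of the zone */
    uint64_t cap;           /* writable bytes in the zone */
    uint64_t wp;            /* current write pointer */
    ZoneStateSketch state;
} ZoneSm;

/* Apply one zone management operation to a single zone. */
static int zone_apply(ZoneSm *z, ZoneOpSketch op)
{
    assert(z);
    switch (op) {
    case ZO_OPEN:
        if (z->state == ZS_FULL) {
            return -EINVAL;             /* assumption, see lead-in */
        }
        z->state = ZS_EOPEN;
        return 0;
    case ZO_CLOSE:
        /* Only open zones change: empty if nothing was written. */
        if (z->state == ZS_EOPEN) {
            z->state = (z->wp == z->start) ? ZS_EMPTY : ZS_CLOSED;
        }
        return 0;
    case ZO_FINISH:
        z->wp = z->start + z->cap;      /* no further writes possible */
        z->state = ZS_FULL;
        return 0;
    case ZO_RESET:
        z->wp = z->start;               /* discard the zone contents */
        z->state = ZS_EMPTY;
        return 0;
    }
    return -EINVAL;
}
```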

The series can be tested on a null_blk device using qemu-io, qemu-iotests, or
the blkzone(8) command in the guest OS. For example, the command line for zone
report using qemu-io is:

$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
"zrp offset nr_zones"

To enable zoned devices in the guest OS, the guest kernel must have a
virtio-blk driver with ZBD support. The kernel patches are available at:
https://github.com/dmitry-fomichev/virtblk-zbd

Then, add the following options to the QEMU command line:
-blockdev node-name=drive0,driver=zoned_host_device,filename=/dev/nullb0

After the guest OS boots, use blkzone(8) to test zone operations:
blkzone report -o offset -c nr_zones /dev/vda

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZonedBlockDevice ioctls

Sam Li (11):
  include: add zoned device structs
  include: import virtio_blk headers from linux with zoned storage
support
  file-posix: introduce get_sysfs_long_val for the long sysfs attribute
  file-posix: introduce get_sysfs_str_val for device zoned model
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  virtio-blk: add zoned storage APIs for zoned devices
  qemu-io: add zoned block device operations.
  qemu-iotests: test new zone operations
  docs/zoned-storage: add zoned device documentation

 block.c |  13 +
 block/block-backend.c   | 139 +++
 block/coroutines.h  |   6 +
 block/file-posix.c  | 383 +++-
 block/io.c  |  41 +++
 block/raw-format.c  |  14 +
 docs/devel/zoned-storage.rst|  68 
 docs/system/qemu-block-drivers.rst.inc  |   6 +
 hw/block/virtio-blk.c   | 172 -
 include/block/block-common.h|  44 ++-
 include/block/block-io.h|  13 +
 include/block/block_int-common.h|  35 +-
 include/block/raw-aio.h |   6 +-
 include/standard-headers/linux/virtio_blk.h | 118 ++
 include/sysemu/block-backend-io.h   |   6 +
 meson.build |   1 +
 qapi/block-core.json|   7 +-
 qemu-io-cmds.c  | 144 
 tests/qemu-iotests/tests/zoned.out  |  53 +++
 tests/qemu-iotests/tests/zoned.sh   |  86 +
 20 files changed, 1340 insertions(+), 15 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

-- 
2.37.1




[RFC v5 01/11] include: add zoned device structs

2022-07-31 Thread Sam Li
Signed-off-by: Sam Li 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index fdb7306e78..c9d28b1c51 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneCondition {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneCondition;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1,
+BLK_ZT_SWR = 0x2,
+BLK_ZT_SWP = 0x3,
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provide information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneCondition cond;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.37.1




[RFC v5 02/11] include: import virtio_blk headers from linux with zoned storage support

2022-07-31 Thread Sam Li
Import the header from Dmitry's "virtio-blk: add support for zoned block
devices" Linux patch using scripts/update-linux-headers.sh. More information
is available at: https://github.com/dmitry-fomichev/virtblk-zbd
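
As a sanity check on the zone report layout this header introduces, the sketch
below mirrors the two structures with plain fixed-width types and decodes one
descriptor from a raw buffer. The struct and function names are hypothetical,
and little-endian field encoding is an assumption of the sketch, not something
stated in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct virtio_blk_zone_descriptor with plain integer types. */
struct zone_descriptor_sketch {
    uint64_t z_cap;
    uint64_t z_start;
    uint64_t z_wp;
    uint8_t z_type;
    uint8_t z_state;
    uint8_t reserved[38];
};

/* Mirrors the fixed part of struct virtio_blk_zone_report. */
struct zone_report_hdr_sketch {
    uint64_t nr_zones;
    uint8_t reserved[56];
};

/* Both the report header and each descriptor occupy 64 bytes. */
_Static_assert(sizeof(struct zone_descriptor_sketch) == 64, "descriptor size");
_Static_assert(sizeof(struct zone_report_hdr_sketch) == 64, "header size");

/* Bytes needed for a report reply carrying nr_zones descriptors. */
static size_t zone_report_size(size_t nr_zones)
{
    return sizeof(struct zone_report_hdr_sketch) +
           nr_zones * sizeof(struct zone_descriptor_sketch);
}

/* Read a 64-bit little-endian value (assumed encoding). */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Decode one 64-byte descriptor from a raw reply buffer. */
static void decode_descriptor(const uint8_t *buf,
                              struct zone_descriptor_sketch *d)
{
    assert(buf && d);
    d->z_cap = load_le64(buf);
    d->z_start = load_le64(buf + 8);
    d->z_wp = load_le64(buf + 16);
    d->z_type = buf[24];
    d->z_state = buf[25];
}
```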

Signed-off-by: Sam Li 
---
 include/standard-headers/linux/virtio_blk.h | 118 
 1 file changed, 118 insertions(+)

diff --git a/include/standard-headers/linux/virtio_blk.h 
b/include/standard-headers/linux/virtio_blk.h
index 2dcc90826a..5c6856aec3 100644
--- a/include/standard-headers/linux/virtio_blk.h
+++ b/include/standard-headers/linux/virtio_blk.h
@@ -40,6 +40,7 @@
 #define VIRTIO_BLK_F_MQ        12  /* support more than one vq */
 #define VIRTIO_BLK_F_DISCARD   13  /* DISCARD is supported */
 #define VIRTIO_BLK_F_WRITE_ZEROES  14  /* WRITE ZEROES is supported */
+#define VIRTIO_BLK_F_ZONED 17  /* Zoned block device */
 
 /* Legacy feature bits */
 #ifndef VIRTIO_BLK_NO_LEGACY
@@ -119,6 +120,20 @@ struct virtio_blk_config {
uint8_t write_zeroes_may_unmap;
 
uint8_t unused1[3];
+
+   /* Secure erase fields that are defined in the virtio spec */
+   uint8_t sec_erase[12];
+
+   /* Zoned block device characteristics (if VIRTIO_BLK_F_ZONED) */
+   struct virtio_blk_zoned_characteristics {
+   __virtio32 zone_sectors;
+   __virtio32 max_open_zones;
+   __virtio32 max_active_zones;
+   __virtio32 max_append_sectors;
+   __virtio32 write_granularity;
+   uint8_t model;
+   uint8_t unused2[3];
+   } zoned;
 } QEMU_PACKED;
 
 /*
@@ -153,6 +168,24 @@ struct virtio_blk_config {
 /* Write zeroes command */
 #define VIRTIO_BLK_T_WRITE_ZEROES  13
 
+/* Zone append command */
+#define VIRTIO_BLK_T_ZONE_APPEND  15
+
+/* Report zones command */
+#define VIRTIO_BLK_T_ZONE_REPORT  16
+
+/* Open zone command */
+#define VIRTIO_BLK_T_ZONE_OPEN  18
+
+/* Close zone command */
+#define VIRTIO_BLK_T_ZONE_CLOSE 20
+
+/* Finish zone command */
+#define VIRTIO_BLK_T_ZONE_FINISH22
+
+/* Reset zone command */
+#define VIRTIO_BLK_T_ZONE_RESET 24
+
 #ifndef VIRTIO_BLK_NO_LEGACY
 /* Barrier before this op. */
 #define VIRTIO_BLK_T_BARRIER   0x8000
@@ -172,6 +205,84 @@ struct virtio_blk_outhdr {
__virtio64 sector;
 };
 
+/*
+ * Supported zoned device models.
+ */
+
+/* Regular block device */
+#define VIRTIO_BLK_Z_NONE  0
+/* Host-managed zoned device */
+#define VIRTIO_BLK_Z_HM    1
+/* Host-aware zoned device */
+#define VIRTIO_BLK_Z_HA    2
+
+/* ZBD Management Out ALL flag */
+#define VIRTIO_BLK_ZONED_FLAG_ALL  (1 << 0)
+
+/*
+ * Header for VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
+ * VIRTIO_BLK_T_ZONE_RESET, VIRTIO_BLK_T_ZONE_FINISH requests.
+ */
+struct virtio_blk_zone_mgmt_outhdr {
+   /* Zoned request flags */
+   __virtio32 flags;
+};
+
+/*
+ * Zone descriptor. A part of VIRTIO_BLK_T_ZONE_REPORT command reply.
+ */
+struct virtio_blk_zone_descriptor {
+   /* Zone capacity */
+   __virtio64 z_cap;
+   /* The starting sector of the zone */
+   __virtio64 z_start;
+   /* Zone write pointer position in sectors */
+   __virtio64 z_wp;
+   /* Zone type */
+   uint8_t z_type;
+   /* Zone state */
+   uint8_t z_state;
+   uint8_t reserved[38];
+};
+
+struct virtio_blk_zone_report {
+   __virtio64 nr_zones;
+   uint8_t reserved[56];
+   struct virtio_blk_zone_descriptor zones[];
+};
+
+/*
+ * Supported zone types.
+ */
+
+/* Conventional zone */
+#define VIRTIO_BLK_ZT_CONV 1
+/* Sequential Write Required zone */
+#define VIRTIO_BLK_ZT_SWR  2
+/* Sequential Write Preferred zone */
+#define VIRTIO_BLK_ZT_SWP  3
+
+/*
+ * Zone states that are available for zones of all types.
+ */
+
+/* Not a write pointer (conventional zones only) */
+#define VIRTIO_BLK_ZS_NOT_WP   0
+/* Empty */
+#define VIRTIO_BLK_ZS_EMPTY    1
+/* Implicitly Open */
+#define VIRTIO_BLK_ZS_IOPEN    2
+/* Explicitly Open */
+#define VIRTIO_BLK_ZS_EOPEN    3
+/* Closed */
+#define VIRTIO_BLK_ZS_CLOSED   4
+/* Read-Only */
+#define VIRTIO_BLK_ZS_RDONLY   13
+/* Full */
+#define VIRTIO_BLK_ZS_FULL 14
+/* Offline */
+#define VIRTIO_BLK_ZS_OFFLINE  15
+
 /* Unmap this range (only valid for write zeroes command) */
 #define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP 0x0001
 
@@ -198,4 +309,11 @@ struct virtio_scsi_inhdr {
 #define VIRTIO_BLK_S_OK        0
 #define VIRTIO_BLK_S_IOERR     1
 #define VIRTIO_BLK_S_UNSUPP    2
+
+/* Error codes that are specific to zoned block devices */
+#define VIRTIO_BLK_S_ZONE_INVALID_CMD 3
+#define VIRTIO_BLK_S_ZONE_UNALIGNED_WP    4
+#define VIRTIO_BLK_S_ZONE_OPEN_RESOURCE   5
+#define VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE 6
+
 #endif /* _LINUX_VIRTIO_BLK_H */
-- 
2.37.1




[RFC v5 04/11] file-posix: introduce get_sysfs_str_val for device zoned model

2022-07-31 Thread Sam Li
Use sysfs attribute files to get the string value of device
zoned model. Then get_sysfs_zoned_model can convert it to
BlockZoneModel type in QEMU.
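
The string-to-model conversion can be sketched in isolation as below. This is
not the patch code: `ZoneModelSketch` and `parse_zoned_attr()` are hypothetical
names, and the sketch takes the attribute text as an argument instead of
reading it from sysfs. It relies on the `queue/zoned` attribute containing
"none", "host-aware" or "host-managed", usually followed by a newline.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical mirror of the zone model values used in this series. */
typedef enum ZoneModelSketch {
    Z_NONE = 0,   /* regular block device */
    Z_HM = 1,     /* host-managed */
    Z_HA = 2,     /* host-aware */
} ZoneModelSketch;

/* Compare the first len bytes of text against a complete keyword. */
static int attr_matches(const char *text, size_t len, const char *word)
{
    return len == strlen(word) && strncmp(text, word, len) == 0;
}

/*
 * Convert the contents of the sysfs "zoned" attribute to a zone model.
 * A trailing newline is ignored. Unknown strings yield -ENOTSUP.
 */
static int parse_zoned_attr(const char *text, ZoneModelSketch *model)
{
    size_t len;

    assert(text && model);
    len = strcspn(text, "\n");
    if (attr_matches(text, len, "host-managed")) {
        *model = Z_HM;
    } else if (attr_matches(text, len, "host-aware")) {
        *model = Z_HA;
    } else if (attr_matches(text, len, "none")) {
        *model = Z_NONE;
    } else {
        return -ENOTSUP;
    }
    return 0;
}
```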

Signed-off-by: Sam Li 
---
 block/file-posix.c   | 86 
 include/block/block_int-common.h |  3 ++
 2 files changed, 89 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index bcf898f0cb..0d8b4acdc7 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1271,6 +1271,85 @@ out:
 #endif
 }
 
+/*
+ * Convert the zoned attribute file in sysfs to internal value.
+ */
+static int get_sysfs_str_val(int fd, struct stat *st,
+  const char *attribute,
+  char **val) {
+#ifdef CONFIG_LINUX
+char buf[32];
+char *sysfspath = NULL;
+int ret, offset;
+int sysfd = -1;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+sysfd = open(sysfspath, O_RDONLY);
+if (sysfd == -1) {
+ret = -errno;
+goto out;
+}
+offset = 0;
+do {
+ret = read(sysfd, buf + offset, sizeof(buf) - 1 - offset);
+if (ret > 0) {
+offset += ret;
+}
+} while (ret == -1);
+/* The file is ended with '\n' */
+if (buf[ret - 1] == '\n') {
+buf[ret - 1] = '\0';
+}
+
+if (!strncpy(*val, buf, ret)) {
+goto out;
+}
+
+out:
+if (sysfd != -1) {
+close(sysfd);
+}
+g_free(sysfspath);
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(int fd, struct stat *st,
+ BlockZoneModel *zoned) {
+g_autofree char *val = NULL;
+val = g_malloc(32);
+get_sysfs_str_val(fd, st, "zoned", &val);
+if (!val) {
+return -ENOTSUP;
+}
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
 static int hdev_get_max_segments(int fd, struct stat *st) {
 return get_sysfs_long_val(fd, st, "max_segments");
 }
@@ -1279,6 +1358,8 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 struct stat st;
+int ret;
+BlockZoneModel zoned;
 
 s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
@@ -1316,6 +1397,11 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.max_hw_iov = ret;
 }
 }
+
+ret = get_sysfs_zoned_model(s->fd, &st, &zoned);
+if (ret < 0)
+zoned = BLK_Z_NONE;
+bs->bl.zoned = zoned;
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..7f7863cc9e 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -825,6 +825,9 @@ typedef struct BlockLimits {
 
 /* maximum number of iovec elements */
 int max_iov;
+
+/* device zone model */
+BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
-- 
2.37.1




[RFC v5 05/11] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-07-31 Thread Sam Li
Add zone management operations to BlockDriver so that storage controller
emulation can use the new block layer APIs: Report Zones and four zone
management operations (open, close, finish, reset).

A BlockDriver can get zone information from a null_blk device by refreshing
BlockLimits.

Signed-off-by: Sam Li 
---
 block/block-backend.c|  47 ++
 block/coroutines.h   |   6 +
 block/file-posix.c   | 272 ++-
 block/io.c   |  57 +++
 include/block/block-common.h |   1 -
 include/block/block-io.h |  13 ++
 include/block/block_int-common.h |  22 ++-
 include/block/raw-aio.h  |   6 +-
 meson.build  |   1 +
 qapi/block-core.json |   7 +-
 10 files changed, 426 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d4a5df2ac2..ef6a1f33d5 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1775,6 +1775,53 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+ret = -ENOMEDIUM; /* returned if the backend is unavailable */
+if (blk_is_available(blk)) {
+ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+}
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * offset is the starting zone specified as a sector offset.
+ * len is the maximum number of sectors the command should operate on.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+int64_t offset, int64_t len)
+{
+int ret;
+IO_CODE();
+
+ret = blk_check_byte_request(blk, offset, len);
+if (ret < 0)
+return ret;
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+ret = -ENOMEDIUM; /* returned if the backend is unavailable */
+if (blk_is_available(blk)) {
+ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
+}
+blk_dec_in_flight(blk);
+return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
diff --git a/block/coroutines.h b/block/coroutines.h
index 3a2bad564f..e3f62d94e5 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -63,6 +63,12 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool 
blocking,
Error **errp);
 
 
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len);
+
 /*
  * "I/O or GS" API functions. These functions can run without
  * the BQL, but only in one specific iothread/main loop.
diff --git a/block/file-posix.c b/block/file-posix.c
index 0d8b4acdc7..6c045eb6e8 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -67,6 +67,9 @@
 #include 
 #include 
 #include 
+#if defined(CONFIG_BLKZONED)
+#include 
+#endif
 #include 
 #include 
 #include 
@@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
 PreallocMode prealloc;
 Error **errp;
 } truncate;
+struct {
+unsigned int *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+struct {
+BlockZoneOp op;
+} zone_mgmt;
 };
 } RawPosixAIOData;
 
@@ -1386,7 +1396,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 #endif
 
 if (bs->sg || S_ISBLK(st.st_mode)) {
-int ret = hdev_get_max_hw_transfer(s->fd, &st);
+ret = hdev_get_max_hw_transfer(s->fd, &st);
 
 if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
 bs->bl.max_hw_transfer = ret;
@@ -1402,6 +1412,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 if (ret < 0)
 zoned = BLK_Z_NONE;
 bs->bl.zoned = zoned;
+if (zoned != BLK_Z_NONE) {
+ret = get_sysfs_long_val(s->fd, &st, "chunk_sectors");
+if (ret > 0) {
+bs->bl.zone_sectors = ret;
+}
+
+ret = get_sysfs_long_val(s->fd, &st, "zone_append_max_bytes");
+if (ret > 0) {
+bs->bl.zone_append_max_bytes = ret;
+}
+
+ret = get_sysfs_long_val(s->fd, &st, "max_open_zones");
+if (ret > 0) {
+bs->bl.max_open_zones = re

[RFC v5 03/11] file-posix: introduce get_sysfs_long_val for the long sysfs attribute

2022-07-31 Thread Sam Li
Use sysfs attribute files to get the long value of zoned device
information.

Signed-off-by: Sam Li 
---
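As an illustration of the parsing step this helper performs, the sketch
below converts the text of a numeric sysfs attribute into a long and accepts
the trailing newline that sysfs files end with. It uses plain strtol()
rather than QEMU's qemu_strtol(), and parse_sysfs_long() is a hypothetical
name, so treat it as a standalone approximation of the idea only.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text of a numeric sysfs attribute such as "524288\n".
 * Returns 0 and stores the value in *out on success, or -EINVAL if the
 * text is not a decimal number optionally followed by one newline.
 */
static int parse_sysfs_long(const char *text, long *out)
{
    char *end = NULL;
    long val;

    assert(text != NULL && out != NULL);

    errno = 0;
    val = strtol(text, &end, 10);
    if (errno != 0 || end == text) {
        return -EINVAL;     /* overflow or no digits at all */
    }
    if (*end == '\n') {
        end++;              /* sysfs attribute files end with a newline */
    }
    if (*end != '\0') {
        return -EINVAL;     /* unexpected trailing characters */
    }
    *out = val;
    return 0;
}
```

Returning the status separately from the value avoids overloading one
return value with both errors and results.
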
 block/file-posix.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..bcf898f0cb 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1210,15 +1210,19 @@ static int hdev_get_max_hw_transfer(int fd, struct stat 
*st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get zoned device information (chunk_sectors, zone_append_max_bytes,
+ * max_open_zones, max_active_zones) through sysfs attribute files.
+ */
+static long get_sysfs_long_val(int fd, struct stat *st,
+   const char *attribute) {
 #ifdef CONFIG_LINUX
 char buf[32];
 const char *end;
 char *sysfspath = NULL;
 int ret;
 int sysfd = -1;
-long max_segments;
+long val;
 
 if (S_ISCHR(st->st_mode)) {
 if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
@@ -1231,8 +1235,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 return -ENOTSUP;
 }
 
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
 sysfd = open(sysfspath, O_RDONLY);
 if (sysfd == -1) {
 ret = -errno;
@@ -1250,9 +1255,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 }
 buf[ret] = 0;
 /* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
+ret = qemu_strtol(buf, &end, 10, &val);
 if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+ret = val;
 }
 
 out:
@@ -1266,6 +1271,10 @@ out:
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st) {
+return get_sysfs_long_val(fd, st, "max_segments");
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
-- 
2.37.1




[RFC v5 07/11] config: add check to block layer

2022-07-31 Thread Sam Li
Stacking a zoned BlockDriver below a BlockDriver that does not support
zoned children is not allowed.

Signed-off-by: Sam Li 
---
 block.c                          | 13 +++++++++++++
 block/file-posix.c               |  2 ++
 block/raw-format.c               |  1 +
 include/block/block_int-common.h | 10 ++++++++++
 4 files changed, 26 insertions(+)

diff --git a/block.c b/block.c
index bc85f46eed..8a259b158c 100644
--- a/block.c
+++ b/block.c
@@ -7947,6 +7947,19 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children && child_bs->drv->is_zoned) {
+error_setg(errp, "Cannot add zoned child node %s to parent node %s, "
+   "which does not support zoned children",
+   child_bs->node_name,
+   parent_bs->node_name);
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 6c045eb6e8..8eb0b7bc9b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -4023,6 +4023,8 @@ static BlockDriver bdrv_zoned_host_device = {
 .format_name = "zoned_host_device",
 .protocol_name = "zoned_host_device",
 .instance_size = sizeof(BDRVRawState),
+.is_zoned = true,
+.supports_zoned_children = true,
 .bdrv_needs_filename = true,
 .bdrv_probe_device  = hdev_probe_device,
 .bdrv_parse_filename = zoned_host_device_parse_filename,
diff --git a/block/raw-format.c b/block/raw-format.c
index 6b20bd22ef..9441536819 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index de44c7b6f4..0476cd0491 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -126,6 +126,16 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver is a zoned block driver.
+ */
+bool is_zoned;
+
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.37.1




[RFC v5 10/11] qemu-iotests: test new zone operations

2022-07-31 Thread Sam Li
Add a test for the new zoned block device APIs in the block layer. The
test creates a null_blk device, runs each zone operation on it and checks
that the reported zone information is correct.

Signed-off-by: Sam Li 
---
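The test script mixes two units: a comment in it notes that 524288 is the
zone size in sectors and that 268435456 / 512 = 524288, where 268435456 is
the byte offset given to the report command. The conversion is sketched
below; sectors_to_bytes() is a hypothetical helper and the 512-byte sector
size is an assumption taken from that comment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, as in the test script's comment. */
#define SECTOR_SIZE 512u

/* Convert a sector count or sector offset to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    assert(sectors <= UINT64_MAX / SECTOR_SIZE);   /* no overflow */
    return sectors * SECTOR_SIZE;
}
```

For the second zone, sectors_to_bytes(524288) gives 268435456.
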
 tests/qemu-iotests/tests/zoned.out | 53 ++
 tests/qemu-iotests/tests/zoned.sh  | 86 ++
 2 files changed, 139 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 0000000000..d09be2ffcd
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+Simple cases: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x80000, cap 0x80000,wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x80000, cap 0x80000,wptr 0x0, zcond:1, [type: 2]
+start: 0x80000, len 0x80000, cap 0x80000,wptr 0x80000, zcond:1, [type: 2]
+start: 0x100000, len 0x80000, cap 0x80000,wptr 0x100000, zcond:1, [type: 2]
+start: 0x180000, len 0x80000, cap 0x80000,wptr 0x180000, zcond:1, [type: 2]
+start: 0x200000, len 0x80000, cap 0x80000,wptr 0x200000, zcond:1, [type: 2]
+start: 0x280000, len 0x80000, cap 0x80000,wptr 0x280000, zcond:1, [type: 2]
+start: 0x300000, len 0x80000, cap 0x80000,wptr 0x300000, zcond:1, [type: 2]
+start: 0x380000, len 0x80000, cap 0x80000,wptr 0x380000, zcond:1, [type: 2]
+start: 0x400000, len 0x80000, cap 0x80000,wptr 0x400000, zcond:1, [type: 2]
+start: 0x480000, len 0x80000, cap 0x80000,wptr 0x480000, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f380000, len 0x80000, cap 0x80000,wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000,wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x80000, len 0x80000, cap 0x80000,wptr 0x80000, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000,wptr 0x1f380000, zcond:3, [type: 2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000,wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000,wptr 0x1f380000, zcond:1, [type: 2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x80000, len 0x80000, cap 0x80000,wptr 0x100000, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x80000, len 0x80000, cap 0x80000,wptr 0x80000, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 0000000000..db68aa88d4
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 0x80000"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 524288 0x80000" # 524288 is the zone sector size
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1" # 268435456 / 512 = 524288
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x1f380000 0x80000"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 0x80000"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x1f380000 0x80000"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(4) finishing the second 

[RFC v5 06/11] raw-format: add zone operations to pass through requests

2022-07-31 Thread Sam Li
The raw-format driver usually sits on top of the file-posix driver, so it
needs to pass zone command requests through to it.

Signed-off-by: Sam Li 
---
 block/raw-format.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 69fd650eaf..6b20bd22ef 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState 
*bs,
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones) {
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len) {
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
 int64_t len;
@@ -614,6 +625,8 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = &raw_co_pwritev,
 .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
 .bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
 .bdrv_co_block_status = &raw_co_block_status,
 .bdrv_co_copy_range_from = &raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.37.1




[RFC v5 09/11] qemu-io: add zoned block device operations.

2022-07-31 Thread Sam Li
Add zoned storage commands to qemu-io: zone_report (zrp), zone_open (zo),
zone_close (zc), zone_reset (zrs) and zone_finish (zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 \
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
---
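To make the zone_report output easier to read, here is a small standalone
sketch that formats one zone descriptor in the same layout as the printf()
in zone_report_f(). format_zone() and its parameter list are hypothetical;
only the format string follows the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/*
 * Format one zone descriptor into buf using the zrp output layout.
 * Returns the number of characters snprintf() would have written.
 */
static int format_zone(char *buf, size_t size,
                       uint64_t start, uint64_t len, uint64_t cap,
                       uint64_t wp, unsigned int cond, unsigned int type)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size,
                    "start: 0x%" PRIx64 ", len 0x%" PRIx64
                    ", cap 0x%" PRIx64 ",wptr 0x%" PRIx64
                    ", zcond:%u, [type: %u]",
                    start, len, cap, wp, cond, type);
}
```

All position and size fields are printed in hexadecimal, while the zone
condition and zone type are printed as plain decimal numbers.
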
 block/io.c |  24 ++---
 qemu-io-cmds.c | 144 +
 2 files changed, 148 insertions(+), 20 deletions(-)

diff --git a/block/io.c b/block/io.c
index a4625fb0e1..de9ec1d740 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3209,19 +3209,11 @@ int bdrv_co_zone_report(BlockDriverState *bs, int64_t 
offset,
 IO_CODE();
 
 bdrv_inc_in_flight(bs);
-if (!drv || (!drv->bdrv_co_zone_report)) {
+if (!drv || !drv->bdrv_co_zone_report) {
 co.ret = -ENOTSUP;
 goto out;
 }
-
-if (drv->bdrv_co_zone_report) {
-co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
-} else {
-co.ret = -ENOTSUP;
-goto out;
-qemu_coroutine_yield();
-}
-
+co.ret = drv->bdrv_co_zone_report(bs, offset, nr_zones, zones);
 out:
 bdrv_dec_in_flight(bs);
 return co.ret;
@@ -3237,19 +3229,11 @@ int bdrv_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
op,
 IO_CODE();
 
 bdrv_inc_in_flight(bs);
-if (!drv || (!drv->bdrv_co_zone_mgmt)) {
+if (!drv || !drv->bdrv_co_zone_mgmt) {
 co.ret = -ENOTSUP;
 goto out;
 }
-
-if (drv->bdrv_co_zone_mgmt) {
-co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
-} else {
-co.ret = -ENOTSUP;
-goto out;
-qemu_coroutine_yield();
-}
-
+co.ret = drv->bdrv_co_zone_mgmt(bs, op, offset, len);
 out:
 bdrv_dec_in_flight(bs);
 return co.ret;
diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 952dc940f1..5a215277c7 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1712,6 +1712,145 @@ static const cmdinfo_t flush_cmd = {
 .oneline= "flush all in-core file state to disk",
 };
 
+static int zone_report_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset;
+unsigned int nr_zones;
+
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+nr_zones = cvtnum(argv[optind]);
+
+g_autofree BlockZoneDescriptor *zones = NULL;
+zones = g_new(BlockZoneDescriptor, nr_zones);
+ret = blk_zone_report(blk, offset, &nr_zones, zones);
+if (ret < 0) {
+printf("zone report failed: %s\n", strerror(-ret));
+} else {
+for (int i = 0; i < nr_zones; ++i) {
+printf("start: 0x%" PRIx64 ", len 0x%" PRIx64 ", "
+   "cap"" 0x%" PRIx64 ",wptr 0x%" PRIx64 ", "
+   "zcond:%u, [type: %u]\n",
+   zones[i].start, zones[i].length, zones[i].cap, zones[i].wp,
+   zones[i].cond, zones[i].type);
+}
+}
+return ret;
+}
+
+static const cmdinfo_t zone_report_cmd = {
+.name = "zone_report",
+.altname = "zrp",
+.cfunc = zone_report_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset number",
+.oneline = "report zone information",
+};
+
+static int zone_open_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);
+ret = blk_zone_mgmt(blk, BLK_ZO_OPEN, offset, len);
+if (ret < 0) {
+printf("zone open failed: %s\n", strerror(-ret));
+}
+return ret;
+}
+
+static const cmdinfo_t zone_open_cmd = {
+.name = "zone_open",
+.altname = "zo",
+.cfunc = zone_open_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset len",
+.oneline = "explicitly open a range of zones in a zoned block device",
+};
+
+static int zone_close_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+len = cvtnum(argv[optind]);
+ret = blk_zone_mgmt(blk, BLK_ZO_CLOSE, offset, len);
+if (ret < 0) {
+printf("zone close failed: %s\n", strerror(-ret));
+}
+return ret;
+}
+
+static const cmdinfo_t zone_close_cmd = {
+.name = "zone_close",
+.altname = "zc",
+.cfunc = zone_close_f,
+.argmin = 2,
+.argmax = 2,
+.args = "offset len",
+.oneline = "close a range of zones in a zoned block device",
+};
+
+static int zone_finish_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+int64_t offset, len;
+++optind;
+offset = cvtnum(argv[optind]);
+++optind;
+

[RFC v5 11/11] docs/zoned-storage: add zoned device documentation

2022-07-31 Thread Sam Li
Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li 
---
 docs/devel/zoned-storage.rst   | 68 ++
 docs/system/qemu-block-drivers.rst.inc |  6 +++
 2 files changed, 74 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 0000000000..e62927dceb
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,68 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called
+zones that are larger than the LBA size. Zones can only be written
+sequentially, which reduces write amplification in SSDs and can lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+zone emulation
+--------------
+In its current state, the virtio-blk device is not aware of ZBDs, so the guest
+sees host-managed drives as regular drives that run correctly under the most
+common write workloads.
+
+The zoned device support aims to let guests (virtual machines) access zoned
+storage devices on the host (hypervisor) through a virtio-blk device. This
+involves extending QEMU's block layer and virtio-blk emulation code.
+
+If the host supports zoned block devices, it can set VIRTIO_BLK_F_ZONED. The
+guest then handles the following cases:
+1) If the guest virtio-blk driver sees the VIRTIO_BLK_F_ZONED bit set, it
+assumes that the zoned characteristics fields of the config space are valid.
+2) If the guest virtio-blk driver sees a zoned model of NONE, it treats the
+device as a regular block device.
+3) If the guest virtio-blk driver sees a zoned model of HM (or HA), it treats
+the device as a zoned block device and probes the other zone fields.
+
+On the QEMU side:
+1) The DEFINE_PROP_BIT macro must be used to declare that the host supports
+zones.
+2) BlockDrivers can declare zoned device support once the zoned model of the
+block device is known not to be NONE.
+
+zoned storage APIs
+------------------
+
+Zone emulation extends the block layer APIs and the virtio-blk emulation code
+with the minimum set of zoned commands necessary to support zoned devices:
+Report Zones, four zone management operations (open, close, finish, reset) and
+Zone Append (under development).
+
+testing
+-------
+
+Zoned device support can be tested on a null_blk device using qemu-io,
+qemu-iotests or the blkzone(8) command in the guest OS.
+
+1. For example, the command line for zone report using qemu-io is:
+
+$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
+"zrp offset nr_zones"
+
+To enable zoned devices in the guest OS, the guest kernel must have a
+virtio-blk driver with ZBD support. Such kernel patches are available at:
+
+https://github.com/dmitry-fomichev/virtblk-zbd
+
+Then, add the following options to the QEMU command line:
+-blockdev node-name=drive0,driver=zoned_host_device,filename=/dev/nullb0
+
+After the guest OS boots, use blkzone(8) to test zone operations:
+blkzone report -o offset -c nr_zones /dev/vda
+
+2. We can also use the qemu-iotests in ./tests/qemu-iotests/tests/zoned.sh.
+
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..2a761a4b80 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated
+  storage controller supports zoned storage. Use ``--blockdev
+  driver=zoned_host_device,node-name=drive0,filename=/dev/nullb0`` to pass
+  through ``/dev/nullb0`` as ``drive0``.
+
 Windows
 ^^^^^^^
 
-- 
2.37.1




[RFC v5 08/11] virtio-blk: add zoned storage APIs for zoned devices

2022-07-31 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zones and four zone operations (open,
close, finish, reset). The virtio-blk zoned device command specification
is currently under review.

VIRTIO_BLK_F_ZONED is only set if the host supports zoned block
devices; it is not set for regular block devices. A guest OS with
zoned device support can use blkzone(8) to test these commands.

Signed-off-by: Sam Li 
---
 block/block-backend.c |  92 
 hw/block/virtio-blk.c | 172 +-
 include/sysemu/block-backend-io.h |   6 ++
 3 files changed, 268 insertions(+), 2 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ef6a1f33d5..8f2cfcbd9d 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1431,6 +1431,15 @@ typedef struct BlkRwCo {
 void *iobuf;
 int ret;
 BdrvRequestFlags flags;
+union {
+struct {
+unsigned int *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+struct {
+BlockZoneOp op;
+} zone_mgmt;
+};
 } BlkRwCo;
 
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
@@ -1775,6 +1784,89 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+static void blk_aio_zone_report_entry(void *opaque) {
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+   rwco->zone_report.nr_zones,
+   rwco->zone_report.zones);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor  *zones,
+BlockCompletionFunc *cb, void *opaque)
+{
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.ret= NOT_DONE,
+.zone_report = {
+.zones = zones,
+.nr_zones = nr_zones,
+},
+};
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+static void blk_aio_zone_mgmt_entry(void *opaque) {
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
+ rwco->offset, acb->bytes);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len,
+  BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.ret= NOT_DONE,
+.zone_mgmt = {
+.op = op,
+},
+};
+acb->bytes = len;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
 /*
  * Send a zone_report command.
  * offset is a byte offset from the start of the device. No alignment
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e9ba752f6b..9722f447a2 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -37,6 +37,7 @@
 /* Config size before the discard support (hide associated config fields) */
 #define VIRTIO_BLK_CFG_SIZE offsetof(struct virtio_blk_config, \
  max_discard_sectors)
+
 /*
  * Starting from the discard feature, we can use this array to properly
  * set the config size depending on the features enabled.
@@ -46,6 +47,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct

[PATCH v6 2/8] file-posix: introduce get_sysfs_long_val for the long sysfs attribute

2022-08-05 Thread Sam Li
Use sysfs attribute files to get the long value of zoned device
information.

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
---
 block/file-posix.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..a40eab64a2 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1210,29 +1210,27 @@ static int hdev_get_max_hw_transfer(int fd, struct stat 
*st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get zoned device information (chunk_sectors, zone_append_max_bytes,
+ * max_open_zones, max_active_zones) through sysfs attribute files.
+ */
+static long get_sysfs_long_val(int fd, struct stat *st,
+   const char *attribute) {
 #ifdef CONFIG_LINUX
 char buf[32];
 const char *end;
 char *sysfspath = NULL;
 int ret;
 int sysfd = -1;
-long max_segments;
-
-if (S_ISCHR(st->st_mode)) {
-if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
-return ret;
-}
-return -ENOTSUP;
-}
+long val;
 
 if (!S_ISBLK(st->st_mode)) {
 return -ENOTSUP;
 }
 
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
 sysfd = open(sysfspath, O_RDONLY);
 if (sysfd == -1) {
 ret = -errno;
@@ -1250,9 +1248,9 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 }
 buf[ret] = 0;
 /* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
+ret = qemu_strtol(buf, &end, 10, &val);
 if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+ret = val;
 }
 
 out:
@@ -1266,6 +1264,17 @@ out:
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st) {
+int ret;
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+return get_sysfs_long_val(fd, st, "max_segments");
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
-- 
2.37.1




[PATCH v6 1/8] include: add zoned device structs

2022-08-05 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index fdb7306e78..36bd0e480e 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -49,6 +49,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneCondition {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneCondition;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+BLK_ZT_SWR = 0x2, /* Sequential writes required */
+BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneCondition cond;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.37.1




[PATCH v6 3/8] file-posix: introduce get_sysfs_str_val for device zoned model

2022-08-05 Thread Sam Li
Use sysfs attribute files to get the string value of the device's
zoned model. get_sysfs_zoned_model can then convert it to the
BlockZoneModel type in QEMU.

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
---
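The conversion from the sysfs string to a zone model can be sketched as a
standalone function. The three strings come from this patch;
parse_zoned_model(), str_eq() and the local enum are hypothetical and only
mirror the BlockZoneModel values, so this is an illustration rather than the
code under review.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical local copy of the zone model values. */
enum zone_model { Z_NONE = 0x0, Z_HM = 0x1, Z_HA = 0x2 };

/* Compare the first len bytes of text with a whole word. */
static int str_eq(const char *text, size_t len, const char *word)
{
    return len == strlen(word) && memcmp(text, word, len) == 0;
}

/*
 * Map the text of the "zoned" queue attribute to a model.
 * Returns 0 on success, -1 for an unrecognised string.
 */
static int parse_zoned_model(const char *text, enum zone_model *model)
{
    size_t len;

    assert(text != NULL && model != NULL);
    len = strcspn(text, "\n");  /* ignore the trailing newline, if any */

    if (str_eq(text, len, "host-managed")) {
        *model = Z_HM;
    } else if (str_eq(text, len, "host-aware")) {
        *model = Z_HA;
    } else if (str_eq(text, len, "none")) {
        *model = Z_NONE;
    } else {
        return -1;
    }
    return 0;
}
```

Measuring the text up to the newline avoids modifying or copying the buffer
that was read from sysfs.
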
 block/file-posix.c   | 70 
 include/block/block_int-common.h |  3 ++
 2 files changed, 73 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index a40eab64a2..4785203eea 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1264,6 +1264,68 @@ out:
 #endif
 }
 
+/*
+ * Convert the zoned attribute file in sysfs to internal value.
+ */
+static int get_sysfs_str_val(int fd, struct stat *st,
+  const char *attribute,
+  char **val) {
+#ifdef CONFIG_LINUX
+char *buf = NULL;
+g_autofree char *sysfspath = NULL;
+int ret;
+size_t len;
+
+if (!S_ISBLK(st->st_mode)) {
+return -ENOTSUP;
+}
+
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+/* g_file_get_contents() returns FALSE on failure, never -1. */
+if (!g_file_get_contents(sysfspath, &buf, &len, NULL)) {
+ret = -ENOTSUP;
+return ret;
+}
+
+/* The file is ended with '\n' */
+if (buf[len - 1] == '\n') {
+buf[len - 1] = '\0';
+}
+
+if (!strncpy(*val, buf, len)) {
+ret = -errno;
+return ret;
+}
+g_free(buf);
+return 0;
+#else
+return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(int fd, struct stat *st,
+ BlockZoneModel *zoned) {
+g_autofree char *val = NULL;
+val = g_malloc(32);
+get_sysfs_str_val(fd, st, "zoned", &val);
+if (!val) {
+return -ENOTSUP;
+}
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
 static int hdev_get_max_segments(int fd, struct stat *st) {
 int ret;
 if (S_ISCHR(st->st_mode)) {
@@ -1279,6 +1341,8 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 struct stat st;
+int ret;
+BlockZoneModel zoned;
 
 s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
@@ -1316,6 +1380,12 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.max_hw_iov = ret;
 }
 }
+
+ret = get_sysfs_zoned_model(s->fd, &st, &zoned);
+if (ret < 0) {
+zoned = BLK_Z_NONE;
+}
+bs->bl.zoned = zoned;
 }
 
 static int check_for_dasd(int fd)
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..7f7863cc9e 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -825,6 +825,9 @@ typedef struct BlockLimits {
 
 /* maximum number of iovec elements */
 int max_iov;
+
+/* device zone model */
+BlockZoneModel zoned;
 } BlockLimits;
 
 typedef struct BdrvOpBlocker BdrvOpBlocker;
-- 
2.37.1
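
The string-to-model mapping in get_sysfs_zoned_model() can be sketched on its own, without GLib or sysfs. The following is only an illustration: zoned_model_from_sysfs() is a hypothetical name, the enum is a local stand-in whose numeric values are assumptions, and the helper reports failure through its return value (in the patch above, the result of get_sysfs_str_val() is not checked and val is never NULL after g_malloc()).

```c
#include <string.h>

/* Local stand-in for QEMU's BlockZoneModel; numeric values are assumed. */
typedef enum BlockZoneModel {
    BLK_Z_NONE,
    BLK_Z_HM,
    BLK_Z_HA,
} BlockZoneModel;

/* Hypothetical helper: 0 on success, -1 if the string is not recognised. */
static int zoned_model_from_sysfs(const char *text, BlockZoneModel *model)
{
    static const struct {
        const char *name;
        BlockZoneModel model;
    } table[] = {
        { "host-managed", BLK_Z_HM },
        { "host-aware", BLK_Z_HA },
        { "none", BLK_Z_NONE },
    };
    size_t len = strcspn(text, "\n"); /* length without a trailing newline */

    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strlen(table[i].name) == len &&
            memcmp(text, table[i].name, len) == 0) {
            *model = table[i].model;
            return 0;
        }
    }
    return -1;
}
```

A caller can fall back to BLK_Z_NONE when the helper fails, which mirrors what raw_refresh_limits() does in the patch.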




[PATCH v6 0/8] Add support for zoned device

2022-08-05 Thread Sam Li
Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
that are larger than the LBA size. Zones can only be written sequentially, which
reduces write amplification in SSDs, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code. In its
current state, the virtio-blk device is not aware of ZBDs, but the guest sees
host-managed drives as regular drives that will run correctly under the most
common write workloads.

This patch series extends the block layer APIs with the minimum set of zoned
commands necessary to support zoned devices. The commands are Report Zones,
four zone operations, and Zone Append (in development).

It can be tested on a null_blk device using qemu-io or qemu-iotests. For 
example, the command line for zone report using qemu-io is:

$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c 
"zrp offset nr_zones"

v6:
- drop virtio-blk emulation changes
- address Stefan's review comments
  * fix CONFIG_BLKZONED configs in related functions
  * replace reading fd by g_file_get_contents() in get_sysfs_str_val()
  * rewrite documentation for zoned storage

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZoneBlockDevice ioctls

Sam Li (8):
  include: add zoned device structs
  file-posix: introduce get_sysfs_long_val for the long sysfs attribute
  file-posix: introduce get_sysfs_str_val for device zoned model
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  qemu-iotests: test new zone operations
  docs/zoned-storage: add zoned device documentation

 block.c|  13 +
 block/block-backend.c  |  50 +++
 block/coroutines.h |   6 +
 block/file-posix.c | 423 -
 block/io.c |  41 +++
 block/raw-format.c |  14 +
 docs/devel/zoned-storage.rst   |  41 +++
 docs/system/qemu-block-drivers.rst.inc |   6 +
 include/block/block-common.h   |  44 ++-
 include/block/block-io.h   |  13 +
 include/block/block_int-common.h   |  35 +-
 include/block/raw-aio.h|   6 +-
 meson.build|   1 +
 qapi/block-core.json   |   8 +-
 qemu-io-cmds.c | 144 +
 tests/qemu-iotests/tests/zoned.out |  53 
 tests/qemu-iotests/tests/zoned.sh  |  86 +
 17 files changed, 964 insertions(+), 20 deletions(-)
 create mode 100644 docs/devel/zoned-storage.rst
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

-- 
2.37.1




[PATCH v6 4/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-08-05 Thread Sam Li
By adding zone management operations in BlockDriver, storage controller
emulation can use the new block layer APIs including Report Zone and
four zone management operations (open, close, finish, reset).

Add zoned storage commands of the device: zone_report(zrp), zone_open(zo),
zone_close(zc), zone_reset(zrs), zone_finish(zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
---
 block/block-backend.c|  50 +
 block/coroutines.h   |   6 +
 block/file-posix.c   | 315 ++-
 block/io.c   |  41 
 include/block/block-common.h |   1 -
 include/block/block-io.h |  13 ++
 include/block/block_int-common.h |  22 ++-
 include/block/raw-aio.h  |   6 +-
 meson.build  |   1 +
 qapi/block-core.json |   8 +-
 qemu-io-cmds.c   | 144 ++
 11 files changed, 601 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d4a5df2ac2..fc639b0cd7 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1775,6 +1775,56 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * offset is the starting zone specified as a sector offset.
+ * len is the maximum number of sectors the command should operate on.
+ */
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+int64_t offset, int64_t len)
+{
+int ret;
+IO_CODE();
+
+ret = blk_check_byte_request(blk, offset, len);
+if (ret < 0) {
+return ret;
+}
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
+blk_dec_in_flight(blk);
+return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
diff --git a/block/coroutines.h b/block/coroutines.h
index 3a2bad564f..e3f62d94e5 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -63,6 +63,12 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool 
blocking,
Error **errp);
 
 
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones);
+int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len);
+
 /*
  * "I/O or GS" API functions. These functions can run without
  * the BQL, but only in one specific iothread/main loop.
diff --git a/block/file-posix.c b/block/file-posix.c
index 4785203eea..2627431581 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -67,6 +67,9 @@
 #include 
 #include 
 #include 
+#if defined(CONFIG_BLKZONED)
+#include 
+#endif
 #include 
 #include 
 #include 
@@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
 PreallocMode prealloc;
 Error **errp;
 } truncate;
+struct {
+unsigned int *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+struct {
+BlockZoneOp op;
+} zone_mgmt;
 };
 } RawPosixAIOData;
 
@@ -1369,7 +1379,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 #endif
 
 if (bs->sg || S_ISBLK(st.st_mode)) {
-int ret = hdev_get_max_hw_transfer(s->fd, &st);
+ret = hdev_get_max_hw_transfer(s->fd, &st);
 
 if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
 bs->bl.max_hw_transfer = ret;
@@ -1386,6 +1396,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 zoned = BLK_Z_NONE;
 }
 bs->bl.zoned = zoned;
+if (zoned != BLK_Z_NONE) {
+ret = get_sysfs_long_val(s->fd, &st, "chunk_sectors");
+if (ret > 0) {
+bs->bl.zone_sectors = ret
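
The "IN maximum and OUT actual" convention for nr_zones documented on blk_co_zone_report() can be shown with a small stand-alone sketch. Everything here is hypothetical: fake_zone_report(), the FakeZone type, and the fixed four-zone device exist only for the illustration, and rounding an unaligned offset down to the containing zone is an assumption about how "no alignment required" is interpreted.

```c
#include <stdint.h>

#define FAKE_ZONE_SIZE (256ULL * 1024 * 1024) /* bytes; assumed for the demo */
#define FAKE_NR_ZONES  4u                     /* zones on the pretend device */

typedef struct FakeZone {
    uint64_t start;  /* bytes */
    uint64_t length; /* bytes */
} FakeZone;

/*
 * Hypothetical stand-in for a zone report call.
 * IN:  *nr_zones is the capacity of the zones[] array.
 * OUT: *nr_zones is the number of entries actually filled in.
 */
static int fake_zone_report(int64_t offset, unsigned int *nr_zones,
                            FakeZone *zones)
{
    uint64_t first;
    unsigned int n = 0;

    if (offset < 0) {
        return -1;
    }
    /* Assumption: an unaligned offset selects the zone that contains it. */
    first = (uint64_t)offset / FAKE_ZONE_SIZE;

    while (n < *nr_zones && first + n < FAKE_NR_ZONES) {
        zones[n].start = (first + n) * FAKE_ZONE_SIZE;
        zones[n].length = FAKE_ZONE_SIZE;
        n++;
    }
    *nr_zones = n;
    return 0;
}
```

The caller sizes the array, passes that size in, and afterwards only looks at the first *nr_zones entries.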

[PATCH v6 5/8] raw-format: add zone operations to pass through requests

2022-08-05 Thread Sam Li
The raw-format driver usually sits on top of the file-posix driver, so it
needs to pass zone command requests through to it.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 block/raw-format.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 69fd650eaf..6b20bd22ef 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState 
*bs,
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t 
offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones) {
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len) {
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
 int64_t len;
@@ -614,6 +625,8 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = &raw_co_pwritev,
 .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
 .bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
 .bdrv_co_block_status = &raw_co_block_status,
 .bdrv_co_copy_range_from = &raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.37.1
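
The pass-through idea can be sketched independently of QEMU. This is a simplified model, not the real block layer: Node, Driver, and all function names below are hypothetical, and returning -ENOTSUP when a driver lacks the callback is an assumption about the generic entry point.

```c
#include <errno.h>
#include <stddef.h>

typedef struct Node Node;

typedef struct Driver {
    /* Optional callback; NULL means the driver has no zone support. */
    int (*zone_report)(Node *node, unsigned int *nr_zones);
} Driver;

struct Node {
    const Driver *drv;
    Node *child; /* the node this one sits on top of, or NULL for a leaf */
};

/* Generic entry point: dispatch to the driver or fail with -ENOTSUP. */
static int node_zone_report(Node *node, unsigned int *nr_zones)
{
    if (!node || !node->drv || !node->drv->zone_report) {
        return -ENOTSUP;
    }
    return node->drv->zone_report(node, nr_zones);
}

/* Leaf driver (stands in for the host device): pretend one zone was found. */
static int leaf_zone_report(Node *node, unsigned int *nr_zones)
{
    (void)node;
    *nr_zones = *nr_zones ? 1 : 0;
    return 0;
}

/* Pass-through driver (stands in for raw): forward to the child unchanged. */
static int passthrough_zone_report(Node *node, unsigned int *nr_zones)
{
    return node_zone_report(node->child, nr_zones);
}

static const Driver leaf_driver = { .zone_report = leaf_zone_report };
static const Driver passthrough_driver = {
    .zone_report = passthrough_zone_report,
};
static const Driver plain_driver = { .zone_report = NULL };
```

Stacking the pass-through node on a leaf that implements the callback forwards the request; stacking it on a driver without the callback surfaces the error unchanged.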




[PATCH v6 6/8] config: add check to block layer

2022-08-05 Thread Sam Li
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 block.c  | 13 +
 block/file-posix.c   |  1 +
 block/raw-format.c   |  1 +
 include/block/block_int-common.h | 10 ++
 4 files changed, 25 insertions(+)

diff --git a/block.c b/block.c
index bc85f46eed..8a259b158c 100644
--- a/block.c
+++ b/block.c
@@ -7947,6 +7947,19 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children && child_bs->drv->is_zoned) {
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->drv->is_zoned ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 2627431581..7ab39eb291 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -4048,6 +4048,7 @@ static BlockDriver bdrv_zoned_host_device = {
 .format_name = "zoned_host_device",
 .protocol_name = "zoned_host_device",
 .instance_size = sizeof(BDRVRawState),
+.is_zoned = true,
 .bdrv_needs_filename = true,
 .bdrv_probe_device  = hdev_probe_device,
 .bdrv_parse_filename = zoned_host_device_parse_filename,
diff --git a/block/raw-format.c b/block/raw-format.c
index 6b20bd22ef..9441536819 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index de44c7b6f4..0476cd0491 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -126,6 +126,16 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver is a zoned block driver.
+ */
+bool is_zoned;
+
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.37.1
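
The compatibility rule can be sketched as a small predicate. Note that inside the if branch of the check in block.c, child_bs->drv->is_zoned is always true and parent_bs->drv->supports_zoned_children is always false, so the two ternaries in the error message always pick the same strings. The sketch below uses a fixed message instead. DriverFlags, check_zoned_child(), and the message wording are hypothetical and only illustrate the rule.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced view of the two BlockDriver flags from the patch. */
typedef struct DriverFlags {
    const char *name;
    bool is_zoned;
    bool supports_zoned_children;
} DriverFlags;

/* Returns true if the child may be attached; otherwise fills in `err`. */
static bool check_zoned_child(const DriverFlags *parent,
                              const DriverFlags *child,
                              char *err, size_t err_len)
{
    if (child->is_zoned && !parent->supports_zoned_children) {
        snprintf(err, err_len,
                 "Cannot add zoned child '%s' to parent '%s', "
                 "which does not support zoned children",
                 child->name, parent->name);
        return false;
    }
    return true;
}
```

Only the combination "zoned child under a parent without zoned support" is rejected; a non-zoned child is accepted by any parent.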




[PATCH v6 7/8] qemu-iotests: test new zone operations

2022-08-05 Thread Sam Li
We have added new block layer APIs for zoned block devices. Test them as
follows: create a null_blk device, run each zone operation on it, and check
that the reported zone information is correct.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/tests/zoned.out | 53 ++
 tests/qemu-iotests/tests/zoned.sh  | 86 ++
 2 files changed, 139 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 00..d09be2ffcd
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+Simple cases: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:1, [type: 2]
+start: 0x8, len 0x8, cap 0x8,wptr 0x8, zcond:1, [type: 2]
+start: 0x10, len 0x8, cap 0x8,wptr 0x10, zcond:1, [type: 2]
+start: 0x18, len 0x8, cap 0x8,wptr 0x18, zcond:1, [type: 2]
+start: 0x20, len 0x8, cap 0x8,wptr 0x20, zcond:1, [type: 2]
+start: 0x28, len 0x8, cap 0x8,wptr 0x28, zcond:1, [type: 2]
+start: 0x30, len 0x8, cap 0x8,wptr 0x30, zcond:1, [type: 2]
+start: 0x38, len 0x8, cap 0x8,wptr 0x38, zcond:1, [type: 2]
+start: 0x40, len 0x8, cap 0x8,wptr 0x40, zcond:1, [type: 2]
+start: 0x48, len 0x8, cap 0x8,wptr 0x48, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f38, len 0x8, cap 0x8,wptr 0x1f38, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x8, len 0x8, cap 0x8,wptr 0x8, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8,wptr 0x1f38, zcond:3, [type: 2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8,wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8,wptr 0x1f38, zcond:1, [type: 2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x8, len 0x8, cap 0x8,wptr 0x10, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x8, len 0x8, cap 0x8,wptr 0x8, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 00..db68aa88d4
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts driver=zoned_host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "Simple cases: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 524288 0x8" # 524288 is the zone sector size
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1" # 268435456 / 512 = 524288
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x1f38 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x1f38 0x8"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(
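
The script's comments relate a byte offset of 268435456 to 524288 sectors of 512 bytes. A minimal sketch of that arithmetic follows; the helper names are hypothetical and the 512-byte sector size is taken from the script's own comment.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u /* bytes per sector, as in the script's comment */

/* Convert a byte count to 512-byte sectors (rounds down). */
static uint64_t bytes_to_sectors(uint64_t bytes)
{
    return bytes / SECTOR_SIZE;
}

/* Convert a sector count to bytes. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}

/* Byte offset of zone number `index` for a fixed zone size in sectors. */
static uint64_t zone_start_bytes(uint64_t index, uint64_t zone_sectors)
{
    return sectors_to_bytes(index * zone_sectors);
}
```

With a zone size of 524288 sectors, the second zone (index 1) therefore starts at byte 268435456.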

[PATCH v6 8/8] docs/zoned-storage: add zoned device documentation

2022-08-05 Thread Sam Li
Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li 
---
 docs/devel/zoned-storage.rst   | 41 ++
 docs/system/qemu-block-drivers.rst.inc |  6 
 2 files changed, 47 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 00..c3f1e477ac
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,41 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. Zones can only be written sequentially, which
+reduces write amplification in SSDs, leading to higher throughput and increased
+capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-------------------------------------
+The QEMU block layer has three zoned storage models:
+- BLK_Z_HM: This model only allows sequential write access. It supports a set
+of ZBD-specific I/O requests that are used by the host to manage device zones.
+- BLK_Z_HA: It deals with both sequential and random write access.
+- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as
+non-zoned devices.
+
+The block device information resides inside BlockDriverState. QEMU uses the
+BlockLimits struct (BlockDriverState::bl), which is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, the qemu-io command line for testing zone report on a null_blk
+device is:
+$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 -c
+"zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..0b97227fd9 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated 
storage
+  controller supports zoned storage. Use ``--blockdev zoned_host_device,
+  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
+  as ``drive0``.
+
 Windows
 ^^^
 
-- 
2.37.1




Re: [PATCH v6 4/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-08-07 Thread Sam Li
On Fri, Aug 5, 2022 at 15:58, Sam Li wrote:
>
> By adding zone management operations in BlockDriver, storage controller
> emulation can use the new block layer APIs including Report Zone and
> four zone management operations (open, close, finish, reset).
>
> Add zoned storage commands of the device: zone_report(zrp), zone_open(zo),
> zone_close(zc), zone_reset(zrs), zone_finish(zf).
>
> For example, to test zone_report, use following command:
> $ ./build/qemu-io --image-opts driver=zoned_host_device, filename=/dev/nullb0
> -c "zrp offset nr_zones"
>
> Signed-off-by: Sam Li 
> Reviewed-by: Hannes Reinecke 
> ---
>  block/block-backend.c|  50 +
>  block/coroutines.h   |   6 +
>  block/file-posix.c   | 315 ++-
>  block/io.c   |  41 
>  include/block/block-common.h |   1 -
>  include/block/block-io.h |  13 ++
>  include/block/block_int-common.h |  22 ++-
>  include/block/raw-aio.h  |   6 +-
>  meson.build  |   1 +
>  qapi/block-core.json |   8 +-
>  qemu-io-cmds.c   | 144 ++
>  11 files changed, 601 insertions(+), 6 deletions(-)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index d4a5df2ac2..fc639b0cd7 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1775,6 +1775,56 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
>  return ret;
>  }
>
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +unsigned int *nr_zones,
> +BlockZoneDescriptor *zones)
> +{
> +int ret;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk); /* increase before waiting */
> +blk_wait_while_drained(blk);
> +if (!blk_is_available(blk)) {
> +blk_dec_in_flight(blk);
> +return -ENOMEDIUM;
> +}
> +ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * offset is the starting zone specified as a sector offset.
> + * len is the maximum number of sectors the command should operate on.
> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +int64_t offset, int64_t len)
> +{
> +int ret;
> +IO_CODE();
> +
> +ret = blk_check_byte_request(blk, offset, len);
> +if (ret < 0) {
> +return ret;
> +}
> +blk_inc_in_flight(blk);
> +blk_wait_while_drained(blk);
> +if (!blk_is_available(blk)) {
> +blk_dec_in_flight(blk);
> +return -ENOMEDIUM;
> +}
> +ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>  BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/coroutines.h b/block/coroutines.h
> index 3a2bad564f..e3f62d94e5 100644
> --- a/block/coroutines.h
> +++ b/block/coroutines.h
> @@ -63,6 +63,12 @@ nbd_co_do_establish_connection(BlockDriverState *bs, bool 
> blocking,
> Error **errp);
>
>
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +unsigned int *nr_zones,
> +BlockZoneDescriptor *zones);
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +  int64_t offset, int64_t len);
> +
>  /*
>   * "I/O or GS" API functions. These functions can run without
>   * the BQL, but only in one specific iothread/main loop.
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 4785203eea..2627431581 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -67,6 +67,9 @@
>  #include 
>  #include 
>  #include 
> +#if defined(CONFIG_BLKZONED)
> +#include 
> +#endif
>  #include 
>  #include 
>  #include 
> @@ -216,6 +219,13 @@ typedef struct RawPosixAIOData {
>  PreallocMode prealloc;
>  Error **errp;
>  } truncate;
> +struct {
> +unsigned int *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +struct {
> +BlockZoneOp op;
> +} zone_mgmt;
>  };
>  } RawPosixAIOData;
>
> @@ -1369,7 +1379,7 @@ s

Re: [PATCH v6 2/8] file-posix: introduce get_sysfs_long_val for the long sysfs attribute

2022-08-08 Thread Sam Li
On Mon, Aug 8, 2022 at 21:52, Stefan Hajnoczi wrote:
>
> On Fri, Aug 05, 2022 at 03:57:45PM +0800, Sam Li wrote:
> > Use sysfs attribute files to get the long value of zoned device
> > information.
> >
> > Signed-off-by: Sam Li 
> > Reviewed-by: Hannes Reinecke 
> > ---
> >  block/file-posix.c | 37 +++--
> >  1 file changed, 23 insertions(+), 14 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 48cd096624..a40eab64a2 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1210,29 +1210,27 @@ static int hdev_get_max_hw_transfer(int fd, struct 
> > stat *st)
> >  #endif
> >  }
> >
> > -static int hdev_get_max_segments(int fd, struct stat *st)
> > -{
> > +/*
> > + * Get zoned device information (chunk_sectors, zoned_append_max_bytes,
> > + * max_open_zones, max_active_zones) through sysfs attribute files.
> > + */
> > +static long get_sysfs_long_val(int fd, struct stat *st,
> > +   const char *attribute) {
>
> Is the fd argument used or can it be removed?

Yes, it can be removed.



Re: [PATCH v6 3/8] file-posix: introduce get_sysfs_str_val for device zoned model

2022-08-08 Thread Sam Li
On Mon, Aug 8, 2022 at 21:52, Stefan Hajnoczi wrote:
>
> On Fri, Aug 05, 2022 at 03:57:46PM +0800, Sam Li wrote:
> > Use sysfs attribute files to get the string value of device
> > zoned model. Then get_sysfs_zoned_model can convert it to
> > BlockZoneModel type in QEMU.
> >
> > Signed-off-by: Sam Li 
> > Reviewed-by: Hannes Reinecke 
> > ---
> >  block/file-posix.c   | 70 
> >  include/block/block_int-common.h |  3 ++
> >  2 files changed, 73 insertions(+)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index a40eab64a2..4785203eea 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1264,6 +1264,68 @@ out:
> >  #endif
> >  }
> >
> > +/*
> > + * Convert the zoned attribute file in sysfs to internal value.
> > + */
> > +static int get_sysfs_str_val(int fd, struct stat *st,
> > +  const char *attribute,
> > +  char **val) {
>
> The fd argument is unused and can be dropped.
>
> > +#ifdef CONFIG_LINUX
> > +char *buf = NULL;
> > +g_autofree char *sysfspath = NULL;
> > +int ret;
> > +size_t len;
> > +
> > +if (!S_ISBLK(st->st_mode)) {
> > +return -ENOTSUP;
> > +}
> > +
> > +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> > +major(st->st_rdev), minor(st->st_rdev),
> > +attribute);
> > +ret = g_file_get_contents(sysfspath, &buf, &len, NULL);
> > +if (ret == -1) {
> > +ret = -errno;
>
> g_file_get_contents() does not set errno. You can either pass in a
> GError and report the message string by converting it into a QEMU Error
> object (grep for g_file_get_contents() to see example), or you can
> return a fixed error code like -ENOENT.
>
> > +return ret;
> > +}
> > +
> > +/* The file is ended with '\n' */
> > +if (buf[len - 1] == '\n') {
> > +buf[len - 1] = '\0';
> > +}
> > +
> > +if (!strncpy(*val, buf, len)) {
> > +ret = -errno;
> > +return ret;
> > +}
> > +g_free(buf);
>
> buf is not necessary. val can be passed directly to g_file_get_contents().
>
> > +return 0;
> > +#else
> > +return -ENOTSUP;
> > +#endif
> > +}
>
> Now get_sysfs_long_val() can be written using get_sysfs_str_val():
>
>   static long get_sysfs_long_val(struct stat *st, const char *attribute)
>   {
>   g_autofree char *str = NULL;
>   const char *end;
>   long val;
>   int ret;
>
>   ret = get_sysfs_str_val(st, attribute, &str);
>   if (ret < 0) {
>   return ret;
>   }
>
>   ret = qemu_strtol(str, &end, 10, &val);
>   if (ret == 0 && end && *end == '\0') {
>   ret = val;
>   }
>   return ret;
>   }
>
> The get_sysfs_long_val() patch can be moved after the
> get_sysfs_str_val() patch.

Cool! Will change it.

> > +
> > +static int get_sysfs_zoned_model(int fd, struct stat *st,
> > + BlockZoneModel *zoned) {
> > +g_autofree char *val = NULL;
> > +val = g_malloc(32);
> > +get_sysfs_str_val(fd, st, "zoned", &val);
>
> Once get_sysfs_str_val() passes val through to g_get_file_contents() the
> caller will no longer have to g_malloc() val themselves.
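
The numeric conversion step suggested in the review can be sketched with plain strtol() so that it builds without QEMU's qemu_strtol(). This is an illustration only: parse_long_strict() is a hypothetical name, and returning the value through an out parameter, separate from the error code, is a design choice of the sketch rather than something taken from the review.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a whole decimal string.
 * Returns 0 and stores the value in *out on success, -errno on failure.
 */
static int parse_long_strict(const char *str, long *out)
{
    char *end;
    long val;

    if (!str || *str == '\0') {
        return -EINVAL;
    }
    errno = 0;
    val = strtol(str, &end, 10);
    if (errno == ERANGE) {
        return -ERANGE; /* value does not fit in a long */
    }
    if (end == str || *end != '\0') {
        return -EINVAL; /* no digits, or trailing characters */
    }
    *out = val;
    return 0;
}
```

The string must already have its trailing newline removed, which is what the string helper in the patch takes care of before the value is parsed.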



Re: [PATCH 1/2] block/file-posix: fix g_file_get_contents return path

2023-07-27 Thread Sam Li
On Thu, Jul 27, 2023 at 19:46, Matthew Rosato wrote:
>
> On 7/5/23 10:54 AM, Matthew Rosato wrote:
> > On 6/4/23 2:16 AM, Sam Li wrote:
> >> The g_file_get_contents() function returns a g_boolean. If it fails, the
> >> returned value will be 0 instead of -1. Solve the issue by skipping
> >> assigning ret value.
> >>
> >> This issue was found by Matthew Rosato using virtio-blk-{pci,ccw} backed
> >> by an NVMe partition e.g. /dev/nvme0n1p1 on s390x.
> >>
> >> Signed-off-by: Sam Li 
> >
> > Polite ping on this patch -- this issue still exists in master as of today 
> > and this patch resolves it for me.  Just want to make sure it gets into 8.1
> >
>
> Ping -- I can still reproduce this crash on -rc1.  Any chance this patch can 
> get picked up for the 8.1 release?
>
> @Sam I see you sent a v2 of only patch #2 in this series ('block/file-posix: 
> fix wps checking in raw_co_prw')..  I wonder if this one just got forgotten 
> since it wasn't sent as part of v2.  Maybe try a resend of this patch by 
> itself (plus the review tags added)?

Ok, I will resend it as a separate patch.

>
> Thanks,
> Matt
>
> >
> >> ---
> >>  block/file-posix.c | 6 ++
> >>  1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/block/file-posix.c b/block/file-posix.c
> >> index ac1ed54811..0d9d179a35 100644
> >> --- a/block/file-posix.c
> >> +++ b/block/file-posix.c
> >> @@ -1232,7 +1232,6 @@ static int hdev_get_max_hw_transfer(int fd, struct 
> >> stat *st)
> >>  static int get_sysfs_str_val(struct stat *st, const char *attribute,
> >>   char **val) {
> >>  g_autofree char *sysfspath = NULL;
> >> -int ret;
> >>  size_t len;
> >>
> >>  if (!S_ISBLK(st->st_mode)) {
> >> @@ -1242,8 +1241,7 @@ static int get_sysfs_str_val(struct stat *st, const 
> >> char *attribute,
> >>  sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> >>  major(st->st_rdev), minor(st->st_rdev),
> >>  attribute);
> >> -ret = g_file_get_contents(sysfspath, val, &len, NULL);
> >> -if (ret == -1) {
> >> +if (!g_file_get_contents(sysfspath, val, &len, NULL)) {
> >>  return -ENOENT;
> >>  }
> >>
> >> @@ -1253,7 +1251,7 @@ static int get_sysfs_str_val(struct stat *st, const 
> >> char *attribute,
> >>  if (*(p + len - 1) == '\n') {
> >>  *(p + len - 1) = '\0';
> >>  }
> >> -return ret;
> >> +return 0;
> >>  }
> >>  #endif
> >>
> >
> >
>



[PATCH v2] block/file-posix: fix g_file_get_contents return path

2023-07-27 Thread Sam Li
The g_file_get_contents() function returns a gboolean. If it fails, the
returned value will be FALSE (0) instead of -1. Fix the issue by checking
the return value directly instead of assigning it to ret.

This issue was found by Matthew Rosato using virtio-blk-{pci,ccw} backed
by an NVMe partition e.g. /dev/nvme0n1p1 on s390x.

Signed-off-by: Sam Li 
Reviewed-by: Matthew Rosato 
Reviewed-by: Stefan Hajnoczi 
---
 block/file-posix.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 9e8e3d8ca5..b16e9c21a1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1232,7 +1232,6 @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
 static int get_sysfs_str_val(struct stat *st, const char *attribute,
  char **val) {
 g_autofree char *sysfspath = NULL;
-int ret;
 size_t len;
 
 if (!S_ISBLK(st->st_mode)) {
@@ -1242,8 +1241,7 @@ static int get_sysfs_str_val(struct stat *st, const char *attribute,
 sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
 major(st->st_rdev), minor(st->st_rdev),
 attribute);
-ret = g_file_get_contents(sysfspath, val, &len, NULL);
-if (ret == -1) {
+if (!g_file_get_contents(sysfspath, val, &len, NULL)) {
 return -ENOENT;
 }
 
@@ -1253,7 +1251,7 @@ static int get_sysfs_str_val(struct stat *st, const char *attribute,
 if (*(p + len - 1) == '\n') {
 *(p + len - 1) = '\0';
 }
-return ret;
+return 0;
 }
 #endif
 
-- 
2.40.1
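
The bug fixed above comes down to comparing a boolean result with -1. A
minimal, self-contained sketch of the pattern follows. Note that
read_contents() is a hypothetical stand-in for g_file_get_contents() (so
the example builds without GLib), and get_val_buggy()/get_val_fixed() are
illustrative names, not QEMU functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for g_file_get_contents(): returns true on
 * success and false on failure, never -1. */
static bool read_contents(const char *path, char *buf, size_t buflen)
{
    FILE *f;
    size_t n;

    if (buflen == 0) {
        return false;
    }
    f = fopen(path, "r");
    if (!f) {
        return false;
    }
    n = fread(buf, 1, buflen - 1, f);
    buf[n] = '\0';
    fclose(f);
    return true;
}

/* Old pattern: the boolean is stored in an int and compared with -1,
 * so a failure (false == 0) is never detected and 0 is returned. */
static int get_val_buggy(const char *path, char *buf, size_t buflen)
{
    int ret = read_contents(path, buf, buflen);

    if (ret == -1) {
        return -ENOENT;
    }
    return ret;
}

/* Fixed pattern: test the boolean directly and return 0 on success. */
static int get_val_fixed(const char *path, char *buf, size_t buflen)
{
    if (!read_contents(path, buf, buflen)) {
        return -ENOENT;
    }
    return 0;
}
```

With a path that does not exist, the buggy variant reports success (0)
while the fixed variant returns -ENOENT, which is the behaviour the patch
restores.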




Re: [PATCH v7 2/4] qcow2: add configurations for zoned format extension

2024-02-19 Thread Sam Li
Markus Armbruster  于2024年2月19日周一 13:05写道:
>
> One more thing...
>
> Markus Armbruster  writes:
>
> > I apologize for the delayed review.

No problem. Thanks for reviewing!

> >
> > Sam Li  writes:
> >
> >> To configure the zoned format feature on the qcow2 driver, it
> >> requires settings as: the device size, zone model, zone size,
> >> zone capacity, number of conventional zones, limits on zone
> >> resources (max append bytes, max open zones, and max_active_zones).
> >>
> >> To create a qcow2 image with zoned format feature, use command like
> >> this:
> >> qemu-img create -f qcow2 zbc.qcow2 -o size=768M \
> >> -o zone.size=64M -o zone.capacity=64M -o zone.conventional_zones=0 \
> >> -o zone.max_append_bytes=4096 -o zone.max_open_zones=6 \
> >> -o zone.max_active_zones=8 -o zone.mode=host-managed
> >>
> >> Signed-off-by: Sam Li 
> >
> > [...]
> >
> >> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >> index ca390c5700..e2e0ec21a5 100644
> >> --- a/qapi/block-core.json
> >> +++ b/qapi/block-core.json
> >> @@ -5038,6 +5038,67 @@
> >>  { 'enum': 'Qcow2CompressionType',
> >>'data': [ 'zlib', { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
> >>
> >> +##
> >> +# @Qcow2ZoneModel:
> >> +#
> >> +# Zoned device model used in qcow2 image file
> >> +#
> >> +# @host-managed: The host-managed model only allows sequential write over 
> >> the
> >> +# device zones.
> >> +#
> >> +# Since 8.2
> >> +##
> >> +{ 'enum': 'Qcow2ZoneModel',
> >> +  'data': [ 'host-managed'] }
> >> +
> >> +##
> >> +# @Qcow2ZoneHostManaged:
> >> +#
> >> +# The host-managed zone model.  It only allows sequential writes.
> >> +#
> >> +# @size: Total number of bytes within zones.
> >
> > Default?

It should be set by users. No default value provided. If it's unset
then it is zero and an error will be returned.

> >
> >> +#
> >> +# @capacity: The number of usable logical blocks within zones
> >> +# in bytes.  A zone capacity is always smaller or equal to the
> >> +# zone size.
> >
> > Default?

Same.

> >
> >> +# @max-append-bytes: The maximal number of bytes of a zone
> >> +# append request that can be issued to the device.  It must be
> >> +# 512-byte aligned and less than the zone capacity.
> >
> > Default?

Same.

For those values, I guess defaults could be set for the case where users
provide no information and still want a workable emulated zoned block device.

> >
> >> +#
> >> +# Since 8.2
> >> +##
> >> +{ 'struct': 'Qcow2ZoneHostManaged',
> >> +  'data': { '*size':  'size',
> >> +'*capacity':  'size',
> >> +'*conventional-zones': 'uint32',
> >> +'*max-open-zones': 'uint32',
> >> +'*max-active-zones':   'uint32',
> >> +'*max-append-bytes':   'size' } }
> >> +
> >> +##
> >> +# @Qcow2ZoneCreateOptions:
> >> +#
> >> +# The zone device model for the qcow2 image.
>
> Please document member @mode.
>
> Fails to build since merge commit 61e7a0d27c1:
>
> qapi/block-core.json: In union 'Qcow2ZoneCreateOptions':
> qapi/block-core.json:5135: member 'mode' lacks documentation
>

I see. Will update to the latest commit.

> >> +#
> >> +# Since 8.2
> >> +##
> >> +{ 'union': 'Qcow2ZoneCreateOptions',
> >> +  'base': { 'mode': 'Qcow2ZoneModel' },
> >> +  'discriminator': 'mode',
> >> +  'data': { 'host-managed': 'Qcow2ZoneHostManaged' } }
> >> +
> >>  ##
> >>  # @BlockdevCreateOptionsQcow2:
> >>  #
> >> @@ -5080,6 +5141,9 @@
> >>  # @compression-type: The image cluster compression method
> >>  # (default: zlib, since 5.1)
> >>  #
> >> +# @zone: The zone device model modes.  The default is that the device is
> >> +# not zoned.  (since 8.2)
> >> +#
> >>  # Since: 2.12
> >>  ##
> >>  { 'struct': 'BlockdevCreateOptionsQcow2',
> >> @@ -5096,7 +5160,8 @@
> >>  '*preallocation':   'PreallocMode',
> >>  '*lazy-refcounts':  'bool',
> >>  '*refcount-bits':   'int',
> >> -'*compression-type':'Qcow2CompressionType' } }
> >> +'*compression-type':'Qcow2CompressionType',
> >> +'*zone':'Qcow2ZoneCreateOptions' } }
> >>
> >>  ##
> >>  # @BlockdevCreateOptionsQed:
>



Re: [PATCH v7 2/4] qcow2: add configurations for zoned format extension

2024-02-19 Thread Sam Li
Markus Armbruster  于2024年2月19日周一 15:40写道:
>
> Sam Li  writes:
>
> > Markus Armbruster  于2024年2月19日周一 13:05写道:
> >>
> >> One more thing...
> >>
> >> Markus Armbruster  writes:
> >>
> >> > I apologize for the delayed review.
> >
> > No problems. Thanks for reviewing!
> >
> >> >
> >> > Sam Li  writes:
> >> >
> >> >> To configure the zoned format feature on the qcow2 driver, it
> >> >> requires settings as: the device size, zone model, zone size,
> >> >> zone capacity, number of conventional zones, limits on zone
> >> >> resources (max append bytes, max open zones, and max_active_zones).
> >> >>
> >> >> To create a qcow2 image with zoned format feature, use command like
> >> >> this:
> >> >> qemu-img create -f qcow2 zbc.qcow2 -o size=768M \
> >> >> -o zone.size=64M -o zone.capacity=64M -o zone.conventional_zones=0 \
> >> >> -o zone.max_append_bytes=4096 -o zone.max_open_zones=6 \
> >> >> -o zone.max_active_zones=8 -o zone.mode=host-managed
> >> >>
> >> >> Signed-off-by: Sam Li 
> >> >
> >> > [...]
> >> >
> >> >> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >> >> index ca390c5700..e2e0ec21a5 100644
> >> >> --- a/qapi/block-core.json
> >> >> +++ b/qapi/block-core.json
> >> >> @@ -5038,6 +5038,67 @@
> >> >>  { 'enum': 'Qcow2CompressionType',
> >> >>'data': [ 'zlib', { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
> >> >>
> >> >> +##
> >> >> +# @Qcow2ZoneModel:
> >> >> +#
> >> >> +# Zoned device model used in qcow2 image file
> >> >> +#
> >> >> +# @host-managed: The host-managed model only allows sequential write 
> >> >> over the
> >> >> +# device zones.
> >> >> +#
> >> >> +# Since 8.2
> >> >> +##
> >> >> +{ 'enum': 'Qcow2ZoneModel',
> >> >> +  'data': [ 'host-managed'] }
> >> >> +
> >> >> +##
> >> >> +# @Qcow2ZoneHostManaged:
> >> >> +#
> >> >> +# The host-managed zone model.  It only allows sequential writes.
> >> >> +#
> >> >> +# @size: Total number of bytes within zones.
> >> >
> >> > Default?
> >
> > It should be set by users. No default value provided. If it's unset
> > then it is zero and an error will be returned.
>
> If the user must provide @size, why is it optional then?

It is not optional when the zone model is host-managed. If it's
non-zoned, then we don't care about zone info. I am not sure how to
make it mandatory.

>
> >> >
> >> >> +#
> >> >> +# @capacity: The number of usable logical blocks within zones
> >> >> +# in bytes.  A zone capacity is always smaller or equal to the
> >> >> +# zone size.
> >> >
> >> > Default?
> >
> > Same.
> >
> >> >
> >> >> +# @max-append-bytes: The maximal number of bytes of a zone
> >> >> +# append request that can be issued to the device.  It must be
> >> >> +# 512-byte aligned and less than the zone capacity.
> >> >
> >> > Default?
> >
> > Same.
> >
> > For those values, I guess it could be set when users provide no
> > information and still want a workable emulated zoned block device.
> >
> >> >
> >> >> +#
> >> >> +# Since 8.2
> >> >> +##
> >> >> +{ 'struct': 'Qcow2ZoneHostManaged',
> >> >> +  'data': { '*size':  'size',
> >> >> +'*capacity':  'size',
> >> >> +'*conventional-zones': 'uint32',
> >> >> +'*max-open-zones': 'uint32',
> >> >> +'*max-active-zones':   'uint32',
> >> >> +'*max-append-bytes':   'size' } }
>
> [...]
>



Re: [PATCH v7 2/4] qcow2: add configurations for zoned format extension

2024-02-19 Thread Sam Li
Markus Armbruster  于2024年2月19日周一 16:56写道:
>
> Sam Li  writes:
>
> > Markus Armbruster  于2024年2月19日周一 15:40写道:
> >>
> >> Sam Li  writes:
> >>
> >> > Markus Armbruster  于2024年2月19日周一 13:05写道:
> >> >>
> >> >> One more thing...
> >> >>
> >> >> Markus Armbruster  writes:
> >> >>
> >> >> > I apologize for the delayed review.
> >> >
> >> > No problems. Thanks for reviewing!
> >> >
> >> >> >
> >> >> > Sam Li  writes:
> >> >> >
> >> >> >> To configure the zoned format feature on the qcow2 driver, it
> >> >> >> requires settings as: the device size, zone model, zone size,
> >> >> >> zone capacity, number of conventional zones, limits on zone
> >> >> >> resources (max append bytes, max open zones, and max_active_zones).
> >> >> >>
> >> >> >> To create a qcow2 image with zoned format feature, use command like
> >> >> >> this:
> >> >> >> qemu-img create -f qcow2 zbc.qcow2 -o size=768M \
> >> >> >> -o zone.size=64M -o zone.capacity=64M -o zone.conventional_zones=0 \
> >> >> >> -o zone.max_append_bytes=4096 -o zone.max_open_zones=6 \
> >> >> >> -o zone.max_active_zones=8 -o zone.mode=host-managed
> >> >> >>
> >> >> >> Signed-off-by: Sam Li 
> >> >> >
> >> >> > [...]
> >> >> >
> >> >> >> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >> >> >> index ca390c5700..e2e0ec21a5 100644
> >> >> >> --- a/qapi/block-core.json
> >> >> >> +++ b/qapi/block-core.json
> >> >> >> @@ -5038,6 +5038,67 @@
> >> >> >>  { 'enum': 'Qcow2CompressionType',
> >> >> >>'data': [ 'zlib', { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
> >> >> >>
> >> >> >> +##
> >> >> >> +# @Qcow2ZoneModel:
> >> >> >> +#
> >> >> >> +# Zoned device model used in qcow2 image file
> >> >> >> +#
> >> >> >> +# @host-managed: The host-managed model only allows sequential 
> >> >> >> write over the
> >> >> >> +# device zones.
> >> >> >> +#
> >> >> >> +# Since 8.2
> >> >> >> +##
> >> >> >> +{ 'enum': 'Qcow2ZoneModel',
> >> >> >> +  'data': [ 'host-managed'] }
> >> >> >> +
> >> >> >> +##
> >> >> >> +# @Qcow2ZoneHostManaged:
> >> >> >> +#
> >> >> >> +# The host-managed zone model.  It only allows sequential writes.
> >> >> >> +#
> >> >> >> +# @size: Total number of bytes within zones.
> >> >> >
> >> >> > Default?
> >> >
> >> > It should be set by users. No default value provided. If it's unset
> >> > then it is zero and an error will be returned.
> >>
> >> If the user must provide @size, why is it optional then?
> >
> > It is not optional when the zone model is host-managed. If it's
> > non-zoned, then we don't care about zone info. I am not sure how to
> > make it unoptional.
>
> We have:
>
>blockdev-create argument @options of type BlockdevCreateOptions
>
>BlockdevCreateOptions union branch @qcow2 of type
>BlockdevCreateOptionsQcow2, union tag member is @driver
>
>BlockdevCreateOptionsQcow2 optional member @zone of type
>Qcow2ZoneCreateOptions, default not zoned
>
>Qcow2ZoneCreateOptions union branch @host-managed of type
>Qcow2ZoneHostManaged, union tag member is @mode
>
>Qcow2ZoneHostManaged optional member @size of type size.
>
> Making this member @size mandatory means we must specify it when
> BlockdevCreateOptionsQcow2 member @zone is present and @zone's member
> @mode is "host-managed".  Feels right to me.  Am I missing anything?

That's right. The checks performed when creating such an image enforce
that. It's not specified in the .json file directly.
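
Such creation-time checks could look roughly like the sketch below. It
only encodes the constraints stated in the schema comments quoted in this
thread (zone size must be set, capacity no larger than the zone size,
max-append-bytes 512-byte aligned and less than the capacity). The struct
and function names are hypothetical, and this is not the code from the
patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical container for the host-managed zone options. */
struct zone_opts {
    uint64_t size;             /* zone size in bytes, 0 if unset */
    uint64_t capacity;         /* zone capacity in bytes */
    uint64_t max_append_bytes; /* zone append limit in bytes */
};

/* Returns 0 if the options are consistent, -EINVAL otherwise. */
static int zone_opts_check(const struct zone_opts *o)
{
    if (o->size == 0) {
        return -EINVAL;        /* zone size must be provided */
    }
    if (o->capacity > o->size) {
        return -EINVAL;        /* capacity is at most the zone size */
    }
    if (o->max_append_bytes % 512 != 0) {
        return -EINVAL;        /* must be 512-byte aligned */
    }
    if (o->max_append_bytes >= o->capacity) {
        return -EINVAL;        /* must be less than the capacity */
    }
    return 0;
}
```

The values from the qemu-img example in the commit message (64M zone
size, 64M capacity, 4096 max append bytes) pass these checks, while an
unset zone size is rejected.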

>
> >>
> >> >> >
> >> >> >> +#
> >> >> >> +# @capacity: The number of usable logical blocks within zones
> >> >> >> +# in bytes.  A zone capacity is always smaller or equal to the
> >> >> >> +# zone size.
> >> >> >
> >> >> > Default?
> >> >
> >> > Same.
> >> >
> >> >> >
> >> >> >> +# @max-append-bytes: The maximal number of bytes of a zone
> >> >> >> +# append request that can be issued to the device.  It must be
> >> >> >> +# 512-byte aligned and less than the zone capacity.
> >> >> >
> >> >> > Default?
> >> >
> >> > Same.
> >> >
> >> > For those values, I guess it could be set when users provide no
> >> > information and still want a workable emulated zoned block device.
> >> >
> >> >> >
> >> >> >> +#
> >> >> >> +# Since 8.2
> >> >> >> +##
> >> >> >> +{ 'struct': 'Qcow2ZoneHostManaged',
> >> >> >> +  'data': { '*size':  'size',
> >> >> >> +'*capacity':  'size',
> >> >> >> +'*conventional-zones': 'uint32',
> >> >> >> +'*max-open-zones': 'uint32',
> >> >> >> +'*max-active-zones':   'uint32',
> >> >> >> +'*max-append-bytes':   'size' } }
> >>
> >> [...]
> >>
>



Re: [PATCH v7 2/4] qcow2: add configurations for zoned format extension

2024-02-19 Thread Sam Li
Markus Armbruster  于2024年2月19日周一 21:42写道:
>
> Sam Li  writes:
>
> > Markus Armbruster  于2024年2月19日周一 16:56写道:
> >>
> >> Sam Li  writes:
> >>
> >> > Markus Armbruster  于2024年2月19日周一 15:40写道:
> >> >>
> >> >> Sam Li  writes:
> >> >>
> >> >> > Markus Armbruster  于2024年2月19日周一 13:05写道:
> >> >> >>
> >> >> >> One more thing...
> >> >> >>
> >> >> >> Markus Armbruster  writes:
> >> >> >>
> >> >> >> > I apologize for the delayed review.
> >> >> >
> >> >> > No problems. Thanks for reviewing!
> >> >> >
> >> >> >> >
> >> >> >> > Sam Li  writes:
> >> >> >> >
> >> >> >> >> To configure the zoned format feature on the qcow2 driver, it
> >> >> >> >> requires settings as: the device size, zone model, zone size,
> >> >> >> >> zone capacity, number of conventional zones, limits on zone
> >> >> >> >> resources (max append bytes, max open zones, and 
> >> >> >> >> max_active_zones).
> >> >> >> >>
> >> >> >> >> To create a qcow2 image with zoned format feature, use command 
> >> >> >> >> like
> >> >> >> >> this:
> >> >> >> >> qemu-img create -f qcow2 zbc.qcow2 -o size=768M \
> >> >> >> >> -o zone.size=64M -o zone.capacity=64M -o 
> >> >> >> >> zone.conventional_zones=0 \
> >> >> >> >> -o zone.max_append_bytes=4096 -o zone.max_open_zones=6 \
> >> >> >> >> -o zone.max_active_zones=8 -o zone.mode=host-managed
> >> >> >> >>
> >> >> >> >> Signed-off-by: Sam Li 
> >> >> >> >
> >> >> >> > [...]
> >> >> >> >
> >> >> >> >> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >> >> >> >> index ca390c5700..e2e0ec21a5 100644
> >> >> >> >> --- a/qapi/block-core.json
> >> >> >> >> +++ b/qapi/block-core.json
> >> >> >> >> @@ -5038,6 +5038,67 @@
> >> >> >> >>  { 'enum': 'Qcow2CompressionType',
> >> >> >> >>'data': [ 'zlib', { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
> >> >> >> >>
> >> >> >> >> +##
> >> >> >> >> +# @Qcow2ZoneModel:
> >> >> >> >> +#
> >> >> >> >> +# Zoned device model used in qcow2 image file
> >> >> >> >> +#
> >> >> >> >> +# @host-managed: The host-managed model only allows sequential 
> >> >> >> >> write over the
> >> >> >> >> +# device zones.
> >> >> >> >> +#
> >> >> >> >> +# Since 8.2
> >> >> >> >> +##
> >> >> >> >> +{ 'enum': 'Qcow2ZoneModel',
> >> >> >> >> +  'data': [ 'host-managed'] }
> >> >> >> >> +
> >> >> >> >> +##
> >> >> >> >> +# @Qcow2ZoneHostManaged:
> >> >> >> >> +#
> >> >> >> >> +# The host-managed zone model.  It only allows sequential writes.
> >> >> >> >> +#
> >> >> >> >> +# @size: Total number of bytes within zones.
> >> >> >> >
> >> >> >> > Default?
> >> >> >
> >> >> > It should be set by users. No default value provided. If it's unset
> >> >> > then it is zero and an error will be returned.
> >> >>
> >> >> If the user must provide @size, why is it optional then?
> >> >
> >> > It is not optional when the zone model is host-managed. If it's
> >> > non-zoned, then we don't care about zone info. I am not sure how to
> >> > make it unoptional.
> >>
> >> We have:
> >>
> >>blockdev-create argument @options of type BlockdevCreateOptions
> >>
> >>BlockdevCreateOptions union branch @qcow2 of type
> >>BlockdevCreateOptionsQcow2, union tag member is @driver
> >>
> >>BlockdevCreateOptionsQcow2 optional member @zone of type
> >>Qcow2ZoneCreateOptions, default not zoned
> >>
> >>Qcow2ZoneCreateOptions union branch @host-managed of type
> >>Qcow2ZoneHostManaged, union tag member is @mode
> >>
> >>Qcow2ZoneHostManaged optional member @size of type size.
> >>
> >> Making this member @size mandatory means we must specify it when
> >> BlockdevCreateOptionsQcow2 member @zone is present and @zone's member
> >> @mode is "host-managed".  Feels right to me.  Am I missing anything?
> >
> > That's right. And the checks when creating such an img can help do
> > that. It's not specified in the .json file directly.
>
> What would break if we did specify it in the QAPI schema directly?

Nothing, I think. We can keep the current schema and add a default zone
size such as 131072.

>
> [...]
>



Re: [PATCH] qemu-io: add cvtnum() error handling for zone commands

2024-05-07 Thread Sam Li
Stefan Hajnoczi  于2024年5月7日周二 20:06写道:
>
> cvtnum() parses positive int64_t values and returns a negative errno on
> failure. Print errors and return early when cvtnum() fails.
>
> While we're at it, also reject nr_zones values greater or equal to 2^32
> since they cannot be represented.
>
> Reported-by: Peter Maydell 
> Cc: Sam Li 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qemu-io-cmds.c | 48 +++-
>  1 file changed, 47 insertions(+), 1 deletion(-)

Reviewed-by: Sam Li 

Hi Stefan,

Thank you for fixing that. I've been a little busy with moving house lately :)

Sam
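
For readers who have not looked at the patch, the kind of check it adds
can be sketched as below. This is an illustration under assumptions:
parse_nr_zones() is a hypothetical helper built on strtoll(), whereas the
real code uses QEMU's cvtnum(), which also accepts size suffixes.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a non-negative count and reject values
 * that do not fit in 32 bits.  Returns 0 on success or a negative
 * errno, following the error convention described in the patch. */
static int parse_nr_zones(const char *s, uint32_t *out)
{
    char *end;
    long long v;

    errno = 0;
    v = strtoll(s, &end, 0);
    if (errno == ERANGE) {
        return -ERANGE;        /* does not even fit in long long */
    }
    if (end == s || *end != '\0' || v < 0) {
        return -EINVAL;        /* empty, trailing junk, or negative */
    }
    if (v > UINT32_MAX) {
        return -ERANGE;        /* 2^32 and above cannot be represented */
    }
    *out = (uint32_t)v;
    return 0;
}
```

A caller would print an error and return early whenever the result is
negative, instead of passing the unchecked value on.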



Re: [PATCH v5 1/4] file-posix: add tracking of the zone write pointers

2023-01-16 Thread Sam Li
Sam Li  于2022年10月27日周四 23:52写道:
>
> Since Linux doesn't have a user API to issue zone append operations to
> zoned devices from user space, the file-posix driver is modified to add
> zone append emulation using regular writes. To do this, the file-posix
> driver tracks the write pointer (wp) location of all zones of the device
> in an array of uint64_t. The most significant bit of each wp location
> indicates whether the zone is a conventional zone.
>
> A zone's wp can change as a result of the following operations:
> - zone reset: change the wp to the start offset of that zone
> - zone finish: change to the end location of that zone
> - write to a zone
> - zone append
>
> Signed-off-by: Sam Li 
> ---
>  block/file-posix.c   | 153 ++-
>  include/block/block-common.h |  14 +++
>  include/block/block_int-common.h |   3 +
>  3 files changed, 166 insertions(+), 4 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index fe52e91da4..fbab23f450 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1323,6 +1323,77 @@ static int hdev_get_max_segments(int fd, struct stat 
> *st)
>  #endif
>  }
>
> +#if defined(CONFIG_BLKZONED)
> +static int get_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
> +unsigned int nrz) {
> +struct blk_zone *blkz;
> +size_t rep_size;
> +uint64_t sector = offset >> BDRV_SECTOR_BITS;
> +int ret, n = 0, i = 0;
> +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
> +g_autofree struct blk_zone_report *rep = NULL;
> +
> +rep = g_malloc(rep_size);
> +blkz = (struct blk_zone *)(rep + 1);
> +while (n < nrz) {
> +memset(rep, 0, rep_size);
> +rep->sector = sector;
> +rep->nr_zones = nrz - n;
> +
> +do {
> +ret = ioctl(fd, BLKREPORTZONE, rep);
> +} while (ret != 0 && errno == EINTR);
> +if (ret != 0) {
> +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
> +fd, offset, errno);
> +return -errno;
> +}
> +
> +if (!rep->nr_zones) {
> +break;
> +}
> +
> +for (i = 0; i < rep->nr_zones; i++, n++) {
> +/*
> + * The wp tracking cares only about sequential writes required 
> and
> + * sequential write preferred zones so that the wp can advance to
> + * the right location.
> + * Use the most significant bit of the wp location to indicate 
> the
> + * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
> + */
> +if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> +wps->wp[i] = 1ULL << 63;
> +} else {
> +switch(blkz[i].cond) {
> +case BLK_ZONE_COND_FULL:
> +case BLK_ZONE_COND_READONLY:
> +/* Zone not writable */
> +wps->wp[i] = (blkz[i].start + blkz[i].len) << 
> BDRV_SECTOR_BITS;
> +break;
> +case BLK_ZONE_COND_OFFLINE:
> +/* Zone not writable nor readable */
> +wps->wp[i] = (blkz[i].start) << BDRV_SECTOR_BITS;
> +break;
> +default:
> +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
> +break;
> +}
> +}
> +}
> +sector = blkz[i - 1].start + blkz[i - 1].len;
> +}
> +
> +return 0;
> +}
> +
> +static void update_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
> +unsigned int nrz) {
> +if (get_zones_wp(fd, wps, offset, nrz) < 0) {
> +error_report("update zone wp failed");
> +}
> +}
> +#endif
> +
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
> @@ -1412,6 +1483,15 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  if (ret >= 0) {
>  bs->bl.max_active_zones = ret;
>  }
> +
> +bs->bl.wps = g_malloc(sizeof(BlockZoneWps) + sizeof(int64_t) * ret);
> +ret = get_zones_wp(s->fd, bs->bl.wps, 0, bs->bl.nr_zones);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "report wps failed");
> +g_free(bs->bl.wps);
> +return;
> +}
> +qemu_co_mutex_init(&bs
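
The write pointer encoding described in the commit message of this patch
can be sketched in isolation as follows. The helper names are
hypothetical and only illustrate the idea (MSB set for conventional
zones, wp kept as a byte offset derived from 512-byte sectors); they are
not functions from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_BITS 9                 /* 512-byte sectors */
#define WP_CONVENTIONAL (1ULL << 63)  /* MSB marks a conventional zone */

/* True if the tracked entry belongs to a conventional zone. */
static bool wp_is_conventional(uint64_t wp)
{
    return (wp & WP_CONVENTIONAL) != 0;
}

/* Byte offset of the write pointer with the type bit masked off. */
static uint64_t wp_offset(uint64_t wp)
{
    return wp & ~WP_CONVENTIONAL;
}

/* Zone reset: the wp moves back to the start of the zone. */
static uint64_t wp_after_reset(uint64_t start_sector)
{
    return start_sector << SECTOR_BITS;
}

/* Zone finish: the wp moves to the end of the zone. */
static uint64_t wp_after_finish(uint64_t start_sector, uint64_t len_sectors)
{
    return (start_sector + len_sectors) << SECTOR_BITS;
}
```

For a zone of 524288 sectors starting at sector 0, finishing the zone
puts the wp at byte offset 268435456, and resetting it returns the wp to
0.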

Re: [PATCH v8 0/4] Add zoned storage emulation to virtio-blk driver

2023-03-23 Thread Sam Li
Matias Bjørling  于2023年3月23日周四 21:26写道:
>
> On 23/03/2023 06.28, Sam Li wrote:
> > This patch adds zoned storage emulation to the virtio-blk driver.
> >
> > The patch implements the virtio-blk ZBD support standardization that was
> > recently accepted by virtio-spec. The related commit is at
> >
> > https://github.com/oasis-tcs/virtio-spec/commit/b4e8efa0fa6c8d844328090ad15db65af8d7d981
> >
> > The Linux zoned device code implemented by Dmitry Fomichev has been
> > released in the latest Linux version, v6.3-rc1.
> >
> > Aside: adding options such as zoned=on to the virtio-blk device will be
> > considered as a follow-up.
> >
> > v7:
> > - address Stefan's review comments
> >* rm aio_context_acquire/release in handle_req
> >* rename function return type
> >* rename BLOCK_ACCT_APPEND to BLOCK_ACCT_ZONE_APPEND for clarity
> >
> > v6:
> > - update headers to v6.3-rc1
> >
> > v5:
> > - address Stefan's review comments
> >* restore the way writing zone append result to buffer
> >* fix error checking case and other errands
> >
> > v4:
> > - change the way writing zone append request result to buffer
> > - change zone state, zone type value of virtio_blk_zone_descriptor
> > - add trace events for new zone APIs
> >
> > v3:
> > - use qemuio_from_buffer to write status bit [Stefan]
> > - avoid using req->elem directly [Stefan]
> > - fix error checkings and memory leak [Stefan]
> >
> > v2:
> > - change units of emulated zone op corresponding to block layer APIs
> > - modify error checking cases [Stefan, Damien]
> >
> > v1:
> > - add zoned storage emulation
> >
> > Sam Li (4):
> >include: update virtio_blk headers to v6.3-rc1
> >virtio-blk: add zoned storage emulation for zoned devices
> >block: add accounting for zone append operation
> >virtio-blk: add some trace events for zoned emulation
> >
> >   block/qapi-sysemu.c  |  11 +
> >   block/qapi.c |  18 +
> >   hw/block/trace-events|   7 +
> >   hw/block/virtio-blk-common.c |   2 +
> >   hw/block/virtio-blk.c| 405 +++
> >   include/block/accounting.h   |   1 +
> >   include/standard-headers/drm/drm_fourcc.h|  12 +
> >   include/standard-headers/linux/ethtool.h |  48 ++-
> >   include/standard-headers/linux/fuse.h|  45 ++-
> >   include/standard-headers/linux/pci_regs.h|   1 +
> >   include/standard-headers/linux/vhost_types.h |   2 +
> >   include/standard-headers/linux/virtio_blk.h  | 105 +
> >   linux-headers/asm-arm64/kvm.h|   1 +
> >   linux-headers/asm-x86/kvm.h  |  34 +-
> >   linux-headers/linux/kvm.h|   9 +
> >   linux-headers/linux/vfio.h   |  15 +-
> >   linux-headers/linux/vhost.h  |   8 +
> >   qapi/block-core.json |  62 ++-
> >   qapi/block.json  |   4 +
> >   19 files changed, 769 insertions(+), 21 deletions(-)
> >
>
>
> Hi Sam,
>
> I applied your patches and can report that they work with both SMR HDDs
> and ZNS SSDs. Very nice work!
>
> Regarding the documentation (docs/system/qemu-block-drivers.rst.inc). Is
> it possible to expose the host's zoned block device through something
> else than virtio-blk? If not, I wouldn't mind seeing the documentation
> updated to show a case when using the virtio-blk driver.
>
> For example (this also includes the device part):
>
> -device virtio-blk-pci,drive=drive0,id=virtblk0 \
> -blockdev
> host_device,node-name=drive0,filename=/dev/nullb0,cache.direct=on``
>
> It might also be nice to describe the shorthand for those who like to
> pass in the parameters using only the -drive parameter.
>
>   -drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on

Hi Matias,

I'm glad it works. Thanks for your feedback!

To answer the question: this patch exposes the zoned interface through
virtio-blk only. It's a good suggestion to put a use case in the
documentation. I will add it in a subsequent patch.

Thanks,
Sam



Re: [PATCH v8 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-03-23 Thread Sam Li
Matias Bjørling  于2023年3月23日周四 21:39写道:
>
> On 23/03/2023 06.28, Sam Li wrote:
> > This patch extends virtio-blk emulation to handle zoned device commands
> > by calling the new block layer APIs to perform zoned device I/O on
> > behalf of the guest. It supports Report Zone, four zone operations (open,
> > close, finish, reset), and Zone Append.
> >
> > The VIRTIO_BLK_F_ZONED feature bit will only be set if the host
> > supports zoned block devices. It will not be set for regular block
> > devices (conventional zones).
> >
> > The guest OS can use blktests and fio to test those commands on zoned devices.
> > Furthermore, using zonefs to test zone append write is also supported.
> >
> > Signed-off-by: Sam Li 
> > ---
> >   hw/block/virtio-blk-common.c |   2 +
> >   hw/block/virtio-blk.c| 389 +++
> >   2 files changed, 391 insertions(+)
> >
> > diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
> > index ac52d7c176..e2f8e2f6da 100644
> > --- a/hw/block/virtio-blk-common.c
> > +++ b/hw/block/virtio-blk-common.c
> > @@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
> >.end = endof(struct virtio_blk_config, discard_sector_alignment)},
> >   {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
> >.end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
> > +{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
> > + .end = endof(struct virtio_blk_config, zoned)},
> >   {}
> >   };
>
> I used the qemu monitor to inspect the state of the devices, and on the
> zoned block device specific entries, the zoned device feature shows up
> in the "unknown-features" field (info virtio-status )
>
> What is missing is an entry in the blk_feature_map structure within
> hw/virtio/virtio-qmp.c. The below fixes it up.
>
> diff --git i/hw/virtio/virtio-qmp.c w/hw/virtio/virtio-qmp.c
> index b70148aba9..3efa529bab 100644
> --- i/hw/virtio/virtio-qmp.c
> +++ w/hw/virtio/virtio-qmp.c
> @@ -176,6 +176,8 @@ static const qmp_virtio_feature_map_t
> virtio_blk_feature_map[] = {
>   "VIRTIO_BLK_F_DISCARD: Discard command supported"),
>   FEATURE_ENTRY(VIRTIO_BLK_F_WRITE_ZEROES, \
>   "VIRTIO_BLK_F_WRITE_ZEROES: Write zeroes command supported"),
> +FEATURE_ENTRY(VIRTIO_BLK_F_ZONED, \
> +"VIRTIO_BLK_F_ZONED: Zoned block device"),
>   #ifndef VIRTIO_BLK_NO_LEGACY
>   FEATURE_ENTRY(VIRTIO_BLK_F_BARRIER, \
>   "VIRTIO_BLK_F_BARRIER: Request barriers supported"),
>
> Which then lets qemu report the support like this:
>
> (qemu) info virtio-status /machine/peripheral/virtblk0/virtio-backend
> /machine/peripheral/virtblk0/virtio-backend:
>device_name: virtio-blk
>device_id:   2
>vhost_started:   false
>bus_name:(null)
>broken:  false
>disabled:false
>disable_legacy_check:false
>started: true
>use_started: true
>start_on_kick:   false
>use_guest_notifier_mask: true
>vm_running:  true
>num_vqs: 4
>queue_sel:   3
>isr: 1
>endianness:  little
>status:
>  VIRTIO_CONFIG_S_ACKNOWLEDGE: Valid virtio device found,
>  VIRTIO_CONFIG_S_DRIVER: Guest OS compatible with device,
>  VIRTIO_CONFIG_S_FEATURES_OK: Feature negotiation complete,
>  VIRTIO_CONFIG_S_DRIVER_OK: Driver setup and ready
>Guest features:
>  VIRTIO_RING_F_EVENT_IDX: Used & avail. event fields enabled,
>  VIRTIO_RING_F_INDIRECT_DESC: Indirect descriptors supported,
>  VIRTIO_F_VERSION_1: Device compliant for v1 spec (legacy)
>  VIRTIO_BLK_F_CONFIG_WCE: Cache writeback and ...,
>  VIRTIO_BLK_F_FLUSH: Flush command supported,
>  VIRTIO_BLK_F_ZONED: Zoned block device,
>  VIRTIO_BLK_F_WRITE_ZEROES: Write zeroes command supported,
>  VIRTIO_BLK_F_MQ: Multiqueue supported,
>  VIRTIO_BLK_F_TOPOLOGY: Topology information available,
>  VIRTIO_BLK_F_BLK_SIZE: Block size of disk available,
>  VIRTIO_BLK_F_GEOMETRY: Legacy geometry available,
>  VIRTIO_BLK_F_SEG_MAX: Max segments in a request is seg_max
>unknown-features(0x0100)
>Host features:
>  VIRTIO_RING_F_EVENT_IDX: Used & avail. event fields enabled,
>  VIRTIO_RING_F_INDIRECT_DESC: Indirect descriptors supported,
>  VIRTIO_F_VE
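
The "unknown-features" effect described above can be reproduced with a
small lookup-table sketch: any feature bit that has no entry in the map
is left over and reported as unknown. The struct, the function and the
EXAMPLE_* bit numbers are illustrative assumptions, not the definitions
from hw/virtio/virtio-qmp.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit numbers, assumed for this example only. */
#define EXAMPLE_F_WRITE_ZEROES 14
#define EXAMPLE_F_ZONED        17

/* Hypothetical map entry: a feature bit and its description. */
struct feature_entry {
    int bit;
    const char *desc;
};

/* Clear every bit that the map knows about and return what is left,
 * i.e. the mask that would be printed as unknown features. */
static uint64_t unknown_features(uint64_t features,
                                 const struct feature_entry *map, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        features &= ~(1ULL << map[i].bit);
    }
    return features;
}
```

With a map that lacks the zoned entry, the zoned bit stays in the
returned mask; adding the entry makes the mask empty, which corresponds
to the fix proposed in the quoted diff.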

[PATCH v9 5/5] docs/zoned-storage:add zoned emulation use case

2023-03-24 Thread Sam Li
Add documentation with an example of using the virtio-blk driver
to pass zoned block devices through to the guest.

Signed-off-by: Sam Li 
---
 docs/devel/zoned-storage.rst | 17 +
 1 file changed, 17 insertions(+)

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
index 6a36133e51..05ecf3729c 100644
--- a/docs/devel/zoned-storage.rst
+++ b/docs/devel/zoned-storage.rst
@@ -41,3 +41,20 @@ APIs for zoned storage emulation or testing.
 For example, to test zone_report on a null_blk device using qemu-io is:
 $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
 -c "zrp offset nr_zones"
+
+To expose the host's zoned block device through virtio-blk, the command line
+can be (includes the -device parameter):
+-blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,
+cache.direct=on \
+-device virtio-blk-pci,drive=drive0
+Or only use the -drive parameter:
+-drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
+
+Additionally, QEMU has several ways of supporting zoned storage, including:
+(1) Using virtio-scsi: --device scsi-block allows for the passing through of
+SCSI ZBC devices, enabling the attachment of ZBC or ZAC HDDs to QEMU.
+(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
+purposes, it cannot yet pass through a zoned device from the host. To pass on
+the NVMe ZNS device to the guest, use VFIO PCI to pass the entire NVMe PCI
+adapter through to the guest. Likewise, an HDD HBA can be passed through to
+QEMU along with all HDDs attached to the HBA.
-- 
2.39.2




[PATCH v18 6/8] qemu-iotests: test new zone operations

2023-03-24 Thread Sam Li
The new block layer APIs of zoned block devices can be tested by:
$ tests/qemu-iotests/check zoned
The test runs each zone operation on a newly created null_blk device
and checks that the reported zone information matches the expected output.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/tests/zoned | 89 ++
 tests/qemu-iotests/tests/zoned.out | 53 ++
 2 files changed, 142 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/zoned
 create mode 100644 tests/qemu-iotests/tests/zoned.out

diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
new file mode 100755
index 0000000000..56f60616b5
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo -n rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ../common.rc
+. ../common.filter
+. ../common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+sudo -n true || \
+_notrun 'Password-less sudo required'
+
+IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "case 1: if the operations work"
+sudo -n modprobe null_blk nr_devices=1 zoned=1
+sudo -n chmod 0666 /dev/nullb0
+
+echo "(1) report the first zone:"
+$QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+$QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+$QEMU_IO $IMG -c "zrp 0x3e7000 2" # 0x3e7000 / 512 = 0x1f38
+echo
+echo
+echo "(2) opening the first zone"
+$QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+$QEMU_IO $IMG -c "zo 268435456 268435456" #
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo "opening the last zone"
+$QEMU_IO $IMG -c "zo 0x3e7000 268435456"
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(3) closing the first zone"
+$QEMU_IO $IMG -c "zc 0 268435456"
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+$QEMU_IO $IMG -c "zc 0x3e7000 268435456"
+echo "report after:"
+$QEMU_IO $IMG -c "zrp 0x3e7000 2"
+echo
+echo
+echo "(4) finishing the second zone"
+$QEMU_IO $IMG -c "zf 268435456 268435456"
+echo "After finishing a zone:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(5) resetting the second zone"
+$QEMU_IO $IMG -c "zrs 268435456 268435456"
+echo "After resetting a zone:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/zoned.out b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 0000000000..b2d061da49
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned
+Testing a null_blk device:
+case 1: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:1, [type: 2]
+start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type: 2]
+start: 0x10, len 0x8, cap 0x8, wptr 0x10, zcond:1, [type: 2]
+start: 0x18, len 0x8, cap 0x8, wptr 0x18, zcond:1, [type: 2]
+start: 0x20, len 0x8, cap 0x8, wptr 0x20, zcond:1, [type: 2]
+start: 0x28, len 0x8, cap 0x8, wptr 0x28, zcond:1, [type: 2]
+start: 0x30, len 0x8, cap 0x8, wptr 0x30, zcond:1, [type: 2]
+start: 0x38, len 0x8, cap 0x8, wptr 0x38, zcond:1, [type: 2]
+start: 0x40, len 0x8, cap 0x8, wptr 0x40, zcond:1, [type: 2]
+start: 0x48, len 0x8, cap 0x8, wptr 0x48, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f38, len 0x8, cap 0x8, wptr 0x1f38, zcond:1, [type: 2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x8, cap 0x8, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f38, len 0x8, cap 0x8, wptr 0x1f38, zcond:3, [type: 2]
+
+
+(3) 

[PATCH v18 4/8] raw-format: add zone operations to pass through requests

2023-03-24 Thread Sam Li
The raw-format driver usually sits on top of the file-posix driver, so it
needs to pass zone command requests through to it.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 block/raw-format.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index 66783ed8e7..6e1b9394c8 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -317,6 +317,21 @@ raw_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_zone_report(BlockDriverState *bs, int64_t offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones)
+{
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn GRAPH_RDLOCK
+raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len)
+{
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t coroutine_fn GRAPH_RDLOCK
 raw_co_getlength(BlockDriverState *bs)
 {
@@ -617,6 +632,8 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = &raw_co_pwritev,
 .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
 .bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
 .bdrv_co_block_status = &raw_co_block_status,
 .bdrv_co_copy_range_from = &raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.39.2




[PATCH v18 2/8] file-posix: introduce helper functions for sysfs attributes

2023-03-24 Thread Sam Li
Use get_sysfs_str_val() to get the string value of the device's
zoned model. get_sysfs_zoned_model() then converts it to QEMU's
BlockZoneModel type.

Use get_sysfs_long_val() to get the long value of a zoned device
attribute.
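
The conversion described above can be sketched in isolation as follows.
This is a simplified illustration, not the code from the patch: the
ZoneModel enum and the functions strip_newline() and
zone_model_from_sysfs() are hypothetical stand-ins, and the sysfs text is
passed in as a string instead of being read from
/sys/dev/block/<major>:<minor>/queue/zoned.

```c
#include <string.h>

/* Hypothetical stand-ins for QEMU's BlockZoneModel values. */
typedef enum {
    ZONE_MODEL_NONE,
    ZONE_MODEL_HOST_MANAGED,
    ZONE_MODEL_HOST_AWARE,
} ZoneModel;

/* Strip one trailing '\n'; sysfs attribute files end with a newline. */
static void strip_newline(char *s)
{
    size_t len = strlen(s);

    if (len > 0 && s[len - 1] == '\n') {
        s[len - 1] = '\0';
    }
}

/*
 * Map the text of the sysfs "zoned" attribute to a model.
 * Returns 0 on success and -1 if the text is not recognized.
 */
static int zone_model_from_sysfs(char *text, ZoneModel *model)
{
    strip_newline(text);

    if (strcmp(text, "host-managed") == 0) {
        *model = ZONE_MODEL_HOST_MANAGED;
    } else if (strcmp(text, "host-aware") == 0) {
        *model = ZONE_MODEL_HOST_AWARE;
    } else if (strcmp(text, "none") == 0) {
        *model = ZONE_MODEL_NONE;
    } else {
        return -1;
    }
    return 0;
}
```

Stripping the newline once in the string helper means the numeric helper
can require the parsed number to be followed directly by the end of the
string.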

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
---
 block/file-posix.c   | 122 ++-
 include/block/block_int-common.h |   3 +
 2 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 5760cf22d1..496edc644c 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1202,64 +1202,112 @@ static int hdev_get_max_hw_transfer(int fd, struct stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get a sysfs attribute value as character string.
+ */
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+ char **val) {
 #ifdef CONFIG_LINUX
-char buf[32];
-const char *end;
-char *sysfspath = NULL;
+g_autofree char *sysfspath = NULL;
 int ret;
-int sysfd = -1;
-long max_segments;
+size_t len;
 
-if (S_ISCHR(st->st_mode)) {
-if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
-return ret;
-}
+if (!S_ISBLK(st->st_mode)) {
 return -ENOTSUP;
 }
 
-if (!S_ISBLK(st->st_mode)) {
-return -ENOTSUP;
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+ret = g_file_get_contents(sysfspath, val, &len, NULL);
+if (!ret) { /* g_file_get_contents() returns FALSE on failure */
+return -ENOENT;
 }
 
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
-sysfd = open(sysfspath, O_RDONLY);
-if (sysfd == -1) {
-ret = -errno;
-goto out;
+/* The file ends with '\n'; strip it. */
+char *p;
+p = *val;
+if (*(p + len - 1) == '\n') {
+*(p + len - 1) = '\0';
 }
-ret = RETRY_ON_EINTR(read(sysfd, buf, sizeof(buf) - 1));
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
+{
+g_autofree char *val = NULL;
+int ret;
+
+ret = get_sysfs_str_val(st, "zoned", &val);
 if (ret < 0) {
-ret = -errno;
-goto out;
-} else if (ret == 0) {
-ret = -EIO;
-goto out;
+return ret;
 }
-buf[ret] = 0;
-/* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
-if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
+static long get_sysfs_long_val(struct stat *st, const char *attribute)
+{
+#ifdef CONFIG_LINUX
+g_autofree char *str = NULL;
+const char *end;
+long val;
+int ret;
+
+ret = get_sysfs_str_val(st, attribute, &str);
+if (ret < 0) {
+return ret;
 }
 
-out:
-if (sysfd != -1) {
-close(sysfd);
+/* get_sysfs_str_val() already stripped the trailing '\n'. */
+ret = qemu_strtol(str, &end, 10, &val);
+if (ret == 0 && end && *end == '\0') {
+ret = val;
 }
-g_free(sysfspath);
 return ret;
 #else
 return -ENOTSUP;
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st)
+{
+#ifdef CONFIG_LINUX
+int ret;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+return get_sysfs_long_val(st, "max_segments");
+#else
+return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 struct stat st;
+int ret;
+BlockZoneModel zoned;
 
 s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
@@ -1297,6 +1345,12 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 bs->bl.max_hw_iov = ret;
 }
 }
+
+ret = get_sysfs_zoned_model(&st, &zoned);
+if (ret < 0) {
+zoned = BLK_Z_NONE;
+}
+bs->bl.zoned = zone

[PATCH v18 8/8] docs/zoned-storage: add zoned device documentation

2023-03-24 Thread Sam Li
Add the documentation about the zoned device support to virtio-blk
emulation.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
---
 docs/devel/zoned-storage.rst   | 43 ++
 docs/system/qemu-block-drivers.rst.inc |  6 
 2 files changed, 49 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 0000000000..6a36133e51
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,43 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. They only allow sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-------------------------------------
+The QEMU block layer supports three zoned storage models:
+- BLK_Z_HM: The host-managed zoned model only allows sequential write access
+to zones. It supports ZBD-specific I/O commands that can be used by a host to
+manage the zones of a device.
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
+zones, making it backward compatible with regular block devices.
+- BLK_Z_NONE: The non-zoned model has no zone support. It includes both
+regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
+supported.
+
+The block device information resides inside BlockDriverState. QEMU uses the
+BlockLimits struct (BlockDriverState::bl), which is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host devices.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--------------------------------------
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io is:
+$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
+-c "zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..105cb9679c 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated storage
+  controller supports zoned storage. Use ``--blockdev host_device,
+  node-name=drive0,filename=/dev/nullb0,cache.direct=on`` to pass through
+  ``/dev/nullb0`` as ``drive0``.
+
 Windows
 ^^^^^^^
 
-- 
2.39.2




[PATCH v9 1/5] include: update virtio_blk headers to v6.3-rc1

2023-03-24 Thread Sam Li
Use scripts/update-linux-headers.sh to update headers to 6.3-rc1.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 include/standard-headers/drm/drm_fourcc.h|  12 +++
 include/standard-headers/linux/ethtool.h |  48 -
 include/standard-headers/linux/fuse.h|  45 +++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +++
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 ++
 linux-headers/linux/vfio.h   |  15 +--
 linux-headers/linux/vhost.h  |   8 ++
 11 files changed, 270 insertions(+), 10 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index 69cab17b38..dc3e6112c1 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -87,6 +87,18 @@ extern "C" {
  *
  * The authoritative list of format modifier codes is found in
  * `include/uapi/drm/drm_fourcc.h`
+ *
+ * Open Source User Waiver
+ * ---
+ *
+ * Because this is the authoritative source for pixel formats and modifiers
+ * referenced by GL, Vulkan extensions and other standards and hence used both
+ * by open source and closed source driver stacks, the usual requirement for an
+ * upstream in-kernel or open source userspace user does not apply.
+ *
+ * To ensure, as much as feasible, compatibility across stacks and avoid
+ * confusion with incompatible enumerations stakeholders for all relevant driver
+ * stacks should approve additions.
  */
 
 #define fourcc_code(a, b, c, d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
diff --git a/include/standard-headers/linux/ethtool.h b/include/standard-headers/linux/ethtool.h
index 87176ab075..99fcddf04f 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -711,6 +711,24 @@ enum ethtool_stringset {
ETH_SS_COUNT
 };
 
+/**
+ * enum ethtool_mac_stats_src - source of ethtool MAC statistics
+ * @ETHTOOL_MAC_STATS_SRC_AGGREGATE:
+ * if device supports a MAC merge layer, this retrieves the aggregate
+ * statistics of the eMAC and pMAC. Otherwise, it retrieves just the
+ * statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_EMAC:
+ * if device supports a MM layer, this retrieves the eMAC statistics.
+ * Otherwise, it retrieves the statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_PMAC:
+ * if device supports a MM layer, this retrieves the pMAC statistics.
+ */
+enum ethtool_mac_stats_src {
+   ETHTOOL_MAC_STATS_SRC_AGGREGATE,
+   ETHTOOL_MAC_STATS_SRC_EMAC,
+   ETHTOOL_MAC_STATS_SRC_PMAC,
+};
+
 /**
  * enum ethtool_module_power_mode_policy - plug-in module power mode policy
  * @ETHTOOL_MODULE_POWER_MODE_POLICY_HIGH: Module is always in high power mode.
@@ -779,6 +797,31 @@ enum ethtool_podl_pse_pw_d_status {
ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
 };
 
+/**
+ * enum ethtool_mm_verify_status - status of MAC Merge Verify function
+ * @ETHTOOL_MM_VERIFY_STATUS_UNKNOWN:
+ * verification status is unknown
+ * @ETHTOOL_MM_VERIFY_STATUS_INITIAL:
+ * the 802.3 Verify State diagram is in the state INIT_VERIFICATION
+ * @ETHTOOL_MM_VERIFY_STATUS_VERIFYING:
+ * the Verify State diagram is in the state VERIFICATION_IDLE,
+ * SEND_VERIFY or WAIT_FOR_RESPONSE
+ * @ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED:
+ * indicates that the Verify State diagram is in the state VERIFIED
+ * @ETHTOOL_MM_VERIFY_STATUS_FAILED:
+ * the Verify State diagram is in the state VERIFY_FAIL
+ * @ETHTOOL_MM_VERIFY_STATUS_DISABLED:
+ * verification of preemption operation is disabled
+ */
+enum ethtool_mm_verify_status {
+   ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
+   ETHTOOL_MM_VERIFY_STATUS_INITIAL,
+   ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
+   ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
+   ETHTOOL_MM_VERIFY_STATUS_FAILED,
+   ETHTOOL_MM_VERIFY_STATUS_DISABLED,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1183,7 +1226,7 @@ struct ethtool_rxnfc {
uint32_t rule_cnt;
uint32_t rss_context;
};
-   uint32_t rule_locs[0];
+   uint32_t rule_locs[];
 };
 
 
@@ -1741,6 +1784,9 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_800000baseDR8_2_Full_BIT = 96,
ETHTOOL_LINK_MODE_800000baseSR8_Full_BIT = 97,
ETHTOOL_LINK_MODE_800000baseVR8_Full_BIT = 98,
+   ETHTOOL_LINK_MODE_10baseT1S_Full_BIT = 99,
+   ETHTOOL_LINK_MODE_10

[PATCH v18 0/8] Add support for zoned device

2023-03-24 Thread Sam Li
Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
that are larger than the LBA size. They only allow sequential writes, which
reduces write amplification in SSDs, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code. In its
current state, the virtio-blk device is not aware of ZBDs, but the guest sees
host-managed drives as regular drives that will run correctly under the most
common write workloads.

This patch series extends the block layer APIs with the minimum set of zoned
commands necessary to support zoned devices: Report Zones, four zone
operations, and Zone Append.
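
To make these commands concrete, here is a toy model of a single zone with
a write pointer. It is a deliberately simplified sketch and not QEMU's
implementation: the Zone struct, the ZoneState values and all zone_*()
functions are hypothetical, and real devices track more zone conditions
than shown here.

```c
#include <stdint.h>

typedef enum { ZS_EMPTY, ZS_OPEN, ZS_CLOSED, ZS_FULL } ZoneState;

typedef struct {
    uint64_t start;   /* first byte of the zone */
    uint64_t cap;     /* number of writable bytes in the zone */
    uint64_t wp;      /* write pointer: next byte that may be written */
    ZoneState state;
} Zone;

/* Sequential write: accepted only at the write pointer. Returns 0 or -1. */
static int zone_write(Zone *z, uint64_t offset, uint64_t len)
{
    uint64_t end = z->start + z->cap;

    if (z->state == ZS_FULL || offset != z->wp || len > end - z->wp) {
        return -1;
    }
    z->wp += len;
    z->state = (z->wp == end) ? ZS_FULL : ZS_OPEN;
    return 0;
}

/* Zone append: the data lands at the write pointer, which is reported back. */
static int zone_append(Zone *z, uint64_t len, uint64_t *written_at)
{
    uint64_t at = z->wp;

    if (zone_write(z, at, len) < 0) {
        return -1;
    }
    *written_at = at;
    return 0;
}

/* The four zone operations, reduced to their effect on state and wp. */
static void zone_open(Zone *z)
{
    if (z->state != ZS_FULL) {
        z->state = ZS_OPEN;
    }
}

static void zone_close(Zone *z)
{
    if (z->state == ZS_OPEN) {
        z->state = (z->wp == z->start) ? ZS_EMPTY : ZS_CLOSED;
    }
}

static void zone_finish(Zone *z)
{
    z->wp = z->start + z->cap;
    z->state = ZS_FULL;
}

static void zone_reset(Zone *z)
{
    z->wp = z->start;
    z->state = ZS_EMPTY;
}
```

In this model a write that does not start at the write pointer is
rejected, which is the sequential-write constraint described above, while
zone append lets the caller learn the offset only after the write.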

There has been a debate on whether to introduce a new zoned_host_device
BlockDriver specifically for zoned devices. In the end, it was decided to stick
to the existing host_device BlockDriver interface and only add new zoned
operations inside it. The benefit is that this avoids further changes (command
line syntax, for example) to applications such as libvirt that use QEMU zoned
emulation.

It can be tested on a null_blk device using qemu-io or qemu-iotests. For
example, to test zone report using qemu-io:
$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

v18:
- use 'sudo -n' in qemu-iotests [Stefan]

v17:
- fix qemu-iotests for zoned support patches [Dmitry]

v16:
- update zoned_host device name to host_device [Stefan]
- fix probing zoned device blocksizes [Stefan]
- Use empty fields instead of changing struct size of BlkRwCo [Kevin, Stefan]

v15:
- drop zoned_host_device BlockDriver
- add zoned device option to host_device driver instead of introducing a new
  zoned_host_device BlockDriver [Stefan]

v14:
- address Stefan's comments of probing block sizes

v13:
- add some tracing points for new zone APIs [Dmitry]
- change error handling in zone_mgmt [Damien, Stefan]

v12:
- address review comments
  * drop BLK_ZO_RESET_ALL bit [Damien]
  * fix error messages, style, and typos [Damien, Hannes]

v11:
- address review comments
  * fix possible BLKZONED config compiling warnings [Stefan]
  * fix capacity field compiling warnings on older kernel [Stefan, Damien]

v10:
- address review comments
  * deal with the last small zone case in zone_mgmt operations [Damien]
  * handle the capacity field outdated in old kernel(before 5.9) [Damien]
  * use byte unit in block layer to be consistent with QEMU [Eric]
  * fix coding style related problems [Stefan]

v9:
- address review comments
  * specify units of zone commands requests [Stefan]
  * fix some error handling in file-posix [Stefan]
  * introduce zoned_host_device in the commit message [Markus]

v8:
- address review comments
  * solve patch conflicts and merge sysfs helper functions into one patch
  * add cache.direct=on check in config

v7:
- address review comments
  * modify sysfs attribute helper functions
  * move the input validation and error checking into raw_co_zone_* function
  * fix checks in config

v6:
- drop virtio-blk emulation changes
- address Stefan's review comments
  * fix CONFIG_BLKZONED configs in related functions
  * replace reading fd by g_file_get_contents() in get_sysfs_str_val()
  * rewrite documentation for zoned storage

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZonedBlockDevice ioctls

Sam Li (8):
  include: add zoned device structs
  file-posix: introduce helper functions for sysfs attributes
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  qemu-iotests: test new zone operations
  block: add some trace events for new block layer APIs
  docs/zoned-storage: add zoned device documentation

 block.c|  19 ++
 block/block-backend.

Re: [PATCH v9 5/5] docs/zoned-storage:add zoned emulation use case

2023-03-24 Thread Sam Li
Sam Li wrote on Fri, Mar 24, 2023 at 18:54:
>
> Add the documentation about the example of using virtio-blk driver
> to pass the zoned block devices through to the guest.
>
> Signed-off-by: Sam Li 
> ---
>  docs/devel/zoned-storage.rst | 17 +
>  1 file changed, 17 insertions(+)
>
> diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
> index 6a36133e51..05ecf3729c 100644
> --- a/docs/devel/zoned-storage.rst
> +++ b/docs/devel/zoned-storage.rst
> @@ -41,3 +41,20 @@ APIs for zoned storage emulation or testing.
>  For example, to test zone_report on a null_blk device using qemu-io is:
>  $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
>  -c "zrp offset nr_zones"
> +
> +To expose the host's zoned block device through virtio-blk, the command line
> +can be (including the -device parameter):
> +-blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,
> +cache.direct=on \
> +-device virtio-blk-pci,drive=drive0
> +Or use only the -drive parameter:
> +-drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
> +
> +Additionally, QEMU has several ways of supporting zoned storage, including:
> +(1) Using virtio-scsi: --device scsi-block allows SCSI ZBC devices to be
> +passed through, enabling the attachment of ZBC or ZAC HDDs to QEMU.
> +(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
> +purposes, it cannot yet pass through a zoned device from the host. To pass
> +the NVMe ZNS device to the guest, use VFIO PCI to pass the entire NVMe PCI
> +adapter through to the guest. Likewise, an HDD HBA can be passed through to
> +the guest, together with all HDDs attached to the HBA.
> --
> 2.39.2
>



[PATCH v9 0/5] Add zoned storage emulation to virtio-blk driver

2023-03-24 Thread Sam Li
This patch series adds zoned storage emulation to the virtio-blk driver.

The series implements the virtio-blk ZBD support that was recently accepted
into the virtio specification. The related commit is at

https://github.com/oasis-tcs/virtio-spec/commit/b4e8efa0fa6c8d844328090ad15db65af8d7d981

The Linux zoned device code, implemented by Dmitry Fomichev, has been
released in Linux v6.3-rc1.

Aside: adding a zoned=on style option to the virtio-blk device will be
considered in a follow-up.

v9:
- address review comments
  * add docs for zoned emulation use case [Matias]
  * add the zoned feature bit to qmp monitor [Matias]
  * add the version number for newly added configs of accounting [Markus]

v8:
- address Stefan's review comments
  * rm aio_context_acquire/release in handle_req
  * rename function return type
  * rename BLOCK_ACCT_APPEND to BLOCK_ACCT_ZONE_APPEND for clarity

v7:
- update headers to v6.3-rc1

v6:
- address Stefan's review comments
  * add accounting for zone append operation
  * fix in_iov usage in handle_request, error handling and typos

v5:
- address Stefan's review comments
  * restore the way writing zone append result to buffer
  * fix error checking case and other errands

v4:
- change the way writing zone append request result to buffer
- change zone state, zone type value of virtio_blk_zone_descriptor
- add trace events for new zone APIs

v3:
- use qemuio_from_buffer to write status bit [Stefan]
- avoid using req->elem directly [Stefan]
- fix error checking and memory leak [Stefan]

v2:
- change units of emulated zone op corresponding to block layer APIs
- modify error checking cases [Stefan, Damien]

v1:
- add zoned storage emulation

Sam Li (5):
  include: update virtio_blk headers to v6.3-rc1
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation
  docs/zoned-storage:add zoned emulation use case

 block/qapi-sysemu.c  |  11 +
 block/qapi.c |  18 +
 docs/devel/zoned-storage.rst |  17 +
 hw/block/trace-events|   7 +
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 405 +++
 hw/virtio/virtio-qmp.c   |   2 +
 include/block/accounting.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h|  12 +
 include/standard-headers/linux/ethtool.h |  48 ++-
 include/standard-headers/linux/fuse.h|  45 ++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 +
 linux-headers/linux/vfio.h   |  15 +-
 linux-headers/linux/vhost.h  |   8 +
 qapi/block-core.json |  68 +++-
 qapi/block.json  |   4 +
 21 files changed, 794 insertions(+), 21 deletions(-)

-- 
2.39.2




[PATCH v18 5/8] config: add check to block layer

2023-03-24 Thread Sam Li
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.
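
The rule enforced by this patch can be sketched as a small predicate. This
is an illustration under assumptions, not the code added to
bdrv_add_child(): the Model enum and can_attach_child() are hypothetical
names, and the sketch only mirrors the condition that a host-managed child
needs a parent that declares support for zoned children.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three zoned models. */
typedef enum { MODEL_NONE, MODEL_HOST_AWARE, MODEL_HOST_MANAGED } Model;

/* Returns true if a child with the given model may be attached. */
static bool can_attach_child(bool parent_supports_zoned_children,
                             Model child_model)
{
    /*
     * Only host-managed children impose sequential-write constraints
     * that the parent driver must understand. Host-aware and non-zoned
     * children can be used like regular block devices.
     */
    if (child_model == MODEL_HOST_MANAGED) {
        return parent_supports_zoned_children;
    }
    return true;
}
```

Under this rule a raw-format parent, which sets supports_zoned_children,
accepts a host-managed child, while a parent without the flag is refused.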

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 block.c  | 19 +++
 block/file-posix.c   | 12 
 block/raw-format.c   |  1 +
 include/block/block_int-common.h |  5 +
 4 files changed, 37 insertions(+)

diff --git a/block.c b/block.c
index 0dd604d0f6..4ebf7bbc90 100644
--- a/block.c
+++ b/block.c
@@ -7953,6 +7953,25 @@ void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children &&
+child_bs->bl.zoned == BLK_Z_HM) {
+/*
+ * The host-aware model allows zoned storage constraints and random
+ * write. Allow mixing host-aware and non-zoned drivers. Using
+ * host-aware device as a regular device.
+ */
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 0c19cfb5cc..5fa80933c9 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -776,6 +776,18 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 goto fail;
 }
 }
+#ifdef CONFIG_BLKZONED
+/*
+ * The kernel page cache does not reliably work for writes to SWR zones
+ * of zoned block device because it can not guarantee the order of writes.
+ */
+if ((bs->bl.zoned != BLK_Z_NONE) &&
+(!(s->open_flags & O_DIRECT))) {
+error_setg(errp, "The driver supports zoned devices, and it requires "
+ "cache.direct=on, which was not specified.");
+return -EINVAL; /* No host kernel page cache */
+}
+#endif
 
 if (S_ISBLK(st.st_mode)) {
 #ifdef __linux__
diff --git a/block/raw-format.c b/block/raw-format.c
index 6e1b9394c8..72e23e7b55 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -621,6 +621,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild *c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index a3efb385e0..1bd2aef4d5 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -137,6 +137,11 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.39.2




[PATCH v18 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2023-03-24 Thread Sam Li
Add a zoned device option to the host_device BlockDriver. It will be presented
only for zoned host block devices. By adding zone management operations to the
host_device BlockDriver, users can use the new block layer APIs,
including Report Zone and four zone management operations
(open, close, finish, reset).

Qemu-io uses the new APIs to perform zoned storage commands on the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).
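
The abbreviations listed above can be read as a small lookup table. The
sketch below only illustrates that mapping. It is not how qemu-io
registers its commands, and the ZoneCmd enum, the zone_cmds table and
zone_cmd_lookup() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    CMD_REPORT, CMD_OPEN, CMD_CLOSE, CMD_RESET, CMD_FINISH, CMD_UNKNOWN
} ZoneCmd;

/* qemu-io abbreviation and the zone command it stands for. */
static const struct {
    const char *abbrev;
    ZoneCmd cmd;
} zone_cmds[] = {
    { "zrp", CMD_REPORT },   /* zone_report */
    { "zo",  CMD_OPEN },     /* zone_open   */
    { "zc",  CMD_CLOSE },    /* zone_close  */
    { "zrs", CMD_RESET },    /* zone_reset  */
    { "zf",  CMD_FINISH },   /* zone_finish */
};

/* Look up an abbreviation; returns CMD_UNKNOWN if it is not in the table. */
static ZoneCmd zone_cmd_lookup(const char *abbrev)
{
    for (size_t i = 0; i < sizeof(zone_cmds) / sizeof(zone_cmds[0]); i++) {
        if (strcmp(zone_cmds[i].abbrev, abbrev) == 0) {
            return zone_cmds[i].cmd;
        }
    }
    return CMD_UNKNOWN;
}
```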

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 \
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
Acked-by: Kevin Wolf 
---
 block/block-backend.c | 133 +
 block/file-posix.c| 307 +-
 block/io.c|  41 
 include/block/block-io.h  |   9 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h   |   6 +-
 include/sysemu/block-backend-io.h |  18 ++
 meson.build   |   4 +
 qemu-io-cmds.c| 149 +++
 9 files changed, 685 insertions(+), 3 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 278b04ce69..f70b08e3f6 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1806,6 +1806,139 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+   (unsigned int*)acb->bytes,rwco->iobuf);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor  *zones,
+BlockCompletionFunc *cb, void *opaque)
+{
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.iobuf  = zones,
+.ret= NOT_DONE,
+};
+acb->bytes = (int64_t)nr_zones;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_mgmt(rwco->blk, (BlockZoneOp)rwco->iobuf,
+ rwco->offset, acb->bytes);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len,
+  BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.iobuf  = (void *)op,
+.ret= NOT_DONE,
+};
+acb->bytes = len;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor *zones)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk); /* increase before waiting */
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
+blk_dec_in_flight(blk);
+return ret;
+}
+
+/*
+ * Send a zone_management command.
+ * op is the zone operation;
+ * offset is the byte offset fr

[PATCH v18 7/8] block: add some trace events for new block layer APIs

2023-03-24 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 5fa80933c9..65efe5147e 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3266,6 +3266,7 @@ static int coroutine_fn 
raw_co_zone_report(BlockDriverState *bs, int64_t offset,
 },
 };
 
+trace_zbd_zone_report(bs, *nr_zones, offset >> BDRV_SECTOR_BITS);
 return raw_thread_pool_submit(bs, handle_aiocb_zone_report, &acb);
 }
 #endif
@@ -3332,6 +3333,8 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState 
*bs, BlockZoneOp op,
 },
 };
 
+trace_zbd_zone_mgmt(bs, op_name, offset >> BDRV_SECTOR_BITS,
+len >> BDRV_SECTOR_BITS);
 ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
 if (ret != 0) {
 ret = -errno;
diff --git a/block/trace-events b/block/trace-events
index 48dbf10c66..3f4e1d088a 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -209,6 +209,8 @@ file_FindEjectableOpticalMedia(const char *media) "Matching 
using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
 file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
+zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
+zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.39.2
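
The trace events above log sector values, obtained by shifting byte offsets
and lengths right by BDRV_SECTOR_BITS. A minimal sketch of that conversion,
assuming 512-byte sectors (a shift of 9 bits). SECTOR_BITS and
bytes_to_sectors() are illustrative names, not QEMU identifiers.

```c
#include <stdint.h>

/* Assumption: 512-byte sectors, i.e. a shift of 9 bits. */
#define SECTOR_BITS 9

/* Convert a byte offset or length to a count of 512-byte sectors. */
static inline int64_t bytes_to_sectors(int64_t bytes)
{
    return bytes >> SECTOR_BITS;
}
```

For instance, a length of 268435456 bytes (256 MiB) corresponds to 524288
sectors under this assumption.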




[PATCH v9 2/5] virtio-blk: add zoned storage emulation for zoned devices

2023-03-24 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, four zone operations (open,
close, finish, reset), and Zone Append.

The VIRTIO_BLK_F_ZONED feature bit is set only if the host supports
zoned block devices. It is not set for regular block devices
(conventional zones).

The guest OS can use blktests and fio to test these commands on zoned devices.
Zone append writes can also be tested with zonefs.
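
Before a zoned request is issued, its byte range has to be validated against
the device capacity. The following is a minimal sketch of such a check in
isolation, not the code from this patch; zoned_range_is_valid() and its
parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [offset, offset + len) lies within a device of
 * `capacity` bytes. Comparing offset against capacity - len avoids
 * computing offset + len, which could overflow int64_t.
 */
static bool zoned_range_is_valid(int64_t offset, int64_t len, int64_t capacity)
{
    if (offset < 0 || len < 0 || len > capacity) {
        return false;
    }
    return offset <= capacity - len;
}
```

Because len is already known to be at most capacity, the subtraction cannot
go negative, so the final comparison is safe for any guest-supplied offset.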

Signed-off-by: Sam Li 
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 389 +++
 hw/virtio/virtio-qmp.c   |   2 +
 3 files changed, 393 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endof(struct virtio_blk_config, zoned)},
 {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..66c2bc4b16 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,335 @@ err:
 return err_status;
 }
 
+typedef struct ZoneCmdData {
+VirtIOBlockReq *req;
+struct iovec *in_iov;
+unsigned in_num;
+union {
+struct {
+unsigned int nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report_data;
+struct {
+int64_t offset;
+} zone_append_data;
+};
+} ZoneCmdData;
+
+/*
+ * check_zoned_request: validate a zoned request before issuing it. Returns
+ * true if all checks pass.
+ * append: true if the request is a zone append.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+ bool append, uint8_t *status) {
+BlockDriverState *bs = blk_bs(s->blk);
+int index;
+
+if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+*status = VIRTIO_BLK_S_UNSUPP;
+return false;
+}
+
+if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+|| offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (append) {
+if (bs->bl.write_granularity) {
+if ((offset % bs->bl.write_granularity) != 0) {
+*status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+return false;
+}
+}
+
+index = offset / bs->bl.zone_size;
+if (BDRV_ZT_IS_CONV(bs->bl.wps->wp[index])) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (len / 512 > bs->bl.max_append_sectors) {
+if (bs->bl.max_append_sectors == 0) {
+*status = VIRTIO_BLK_S_UNSUPP;
+} else {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+}
+return false;
+}
+}
+return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+ZoneCmdData *data = opaque;
+VirtIOBlockReq *req = data->req;
+VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+struct iovec *in_iov = data->in_iov;
+unsigned in_num = data->in_num;
+int64_t zrp_size, n, j = 0;
+int64_t nz = data->zone_report_data.nr_zones;
+int8_t err_status = VIRTIO_BLK_S_OK;
+
+if (ret) {
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
+zrp_size = sizeof(struct virtio_blk_zone_report)
+   + sizeof(struct virtio_blk_zone_descriptor) * nz;
+n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+if (n != sizeof(zrp_hdr)) {
+virtio_error(vdev, "Driver provided input buffer that is too small!");
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+struct virtio_blk_zone_descriptor desc =
+(struct virtio_blk_zone_

Re: [PATCH v17 6/8] qemu-iotests: test new zone operations

2023-03-24 Thread Sam Li
On Fri, Mar 24, 2023 at 03:31, Stefan Hajnoczi wrote:
>
> On Thu, Mar 23, 2023 at 01:08:32PM +0800, Sam Li wrote:
> > The new block layer APIs of zoned block devices can be tested by:
> > $ tests/qemu-iotests/check zoned
> > Run each zone operation on a newly created null_blk device
> > and see whether it outputs the same zone information.
> >
> > Signed-off-by: Sam Li 
> > Reviewed-by: Stefan Hajnoczi 
> > ---
> >  tests/qemu-iotests/tests/zoned | 86 ++
> >  tests/qemu-iotests/tests/zoned.out | 53 ++
> >  2 files changed, 139 insertions(+)
> >  create mode 100755 tests/qemu-iotests/tests/zoned
> >  create mode 100644 tests/qemu-iotests/tests/zoned.out
> >
> > diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
> > new file mode 100755
> > index 00..53097e44d9
> > --- /dev/null
> > +++ b/tests/qemu-iotests/tests/zoned
> > @@ -0,0 +1,86 @@
> > +#!/usr/bin/env bash
> > +#
> > +# Test zone management operations.
> > +#
> > +
> > +seq="$(basename $0)"
> > +echo "QA output created by $seq"
> > +status=1 # failure is the default!
> > +
> > +_cleanup()
> > +{
> > +  _cleanup_test_img
> > +  sudo rmmod null_blk
> > +}
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +# get standard environment, filters and checks
> > +. ../common.rc
> > +. ../common.filter
> > +. ../common.qemu
> > +
> > +# This test only runs on Linux hosts with raw image files.
> > +_supported_fmt raw
> > +_supported_proto file
> > +_supported_os Linux
> > +
> > +IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
> > +QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
> > +
> > +echo "Testing a null_blk device:"
> > +echo "case 1: if the operations work"
> > +sudo modprobe null_blk nr_devices=1 zoned=1
>
> I took a look at how existing qemu-iotests use sudo. They run it in
> non-interactive mode and skip the test if sudo is unavailable.
>
> Please do something like this to check for sudo support:
>
>   sudo -n true || _notrun 'Password-less sudo required'
>
> Then always use "sudo -n ...".

OK. So passwordless sudo has to be set up on the Linux host; otherwise the
test is skipped, just as the existing qemu-iotests do.

>
>
> > +sudo chmod 0666 /dev/nullb0
> > +
> > +echo "(1) report the first zone:"
> > +$QEMU_IO $IMG -c "zrp 0 1"
> > +echo
> > +echo "report the first 10 zones"
> > +$QEMU_IO $IMG -c "zrp 0 10"
> > +echo
> > +echo "report the last zone:"
> > +$QEMU_IO $IMG -c "zrp 0x3e7000 2" # 0x3e7000 / 512 = 0x1f38
> > +echo
> > +echo
> > +echo "(2) opening the first zone"
> > +$QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
> > +echo "report after:"
> > +$QEMU_IO $IMG -c "zrp 0 1"
> > +echo
> > +echo "opening the second zone"
> > +$QEMU_IO $IMG -c "zo 268435456 268435456" #
> > +echo "report after:"
> > +$QEMU_IO $IMG -c "zrp 268435456 1"
> > +echo
> > +echo "opening the last zone"
> > +$QEMU_IO $IMG -c "zo 0x3e7000 268435456"
> > +echo "report after:"
> > +$QEMU_IO $IMG -c "zrp 0x3e7000 2"
> > +echo
> > +echo
> > +echo "(3) closing the first zone"
> > +$QEMU_IO $IMG -c "zc 0 268435456"
> > +echo "report after:"
> > +$QEMU_IO $IMG -c "zrp 0 1"
> > +echo
> > +echo "closing the last zone"
> > +$QEMU_IO $IMG -c "zc 0x3e7000 268435456"
> > +echo "report after:"
> > +$QEMU_IO $IMG -c "zrp 0x3e7000 2"
> > +echo
> > +echo
> > +echo "(4) finishing the second zone"
> > +$QEMU_IO $IMG -c "zf 268435456 268435456"
> > +echo "After finishing a zone:"
> > +$QEMU_IO $IMG -c "zrp 268435456 1"
> > +echo
> > +echo
> > +echo "(5) resetting the second zone"
> > +$QEMU_IO $IMG -c "zrs 268435456 268435456"
> > +echo "After resetting a zone:"
> > +$QEMU_IO $IMG -c "zrp 268435456 1"
> > +
> > +# success, all done
> > +echo "*** done"
> > +rm -f $seq.full
> > +status=0
> > diff --git a/tests/qemu-iotests/tests/zoned.out 
> > b/tests/qemu-iotests/tests/zoned.out

[PATCH v9 1/5] include: update virtio_blk headers to v6.3-rc1

2023-03-24 Thread Sam Li
Use scripts/update-linux-headers.sh to update headers to 6.3-rc1.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 include/standard-headers/drm/drm_fourcc.h|  12 +++
 include/standard-headers/linux/ethtool.h |  48 -
 include/standard-headers/linux/fuse.h|  45 +++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +++
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 ++
 linux-headers/linux/vfio.h   |  15 +--
 linux-headers/linux/vhost.h  |   8 ++
 11 files changed, 270 insertions(+), 10 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 69cab17b38..dc3e6112c1 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -87,6 +87,18 @@ extern "C" {
  *
  * The authoritative list of format modifier codes is found in
  * `include/uapi/drm/drm_fourcc.h`
+ *
+ * Open Source User Waiver
+ * ---
+ *
+ * Because this is the authoritative source for pixel formats and modifiers
+ * referenced by GL, Vulkan extensions and other standards and hence used both
+ * by open source and closed source driver stacks, the usual requirement for an
+ * upstream in-kernel or open source userspace user does not apply.
+ *
+ * To ensure, as much as feasible, compatibility across stacks and avoid
+ * confusion with incompatible enumerations stakeholders for all relevant 
driver
+ * stacks should approve additions.
  */
 
 #define fourcc_code(a, b, c, d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 87176ab075..99fcddf04f 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -711,6 +711,24 @@ enum ethtool_stringset {
ETH_SS_COUNT
 };
 
+/**
+ * enum ethtool_mac_stats_src - source of ethtool MAC statistics
+ * @ETHTOOL_MAC_STATS_SRC_AGGREGATE:
+ * if device supports a MAC merge layer, this retrieves the aggregate
+ * statistics of the eMAC and pMAC. Otherwise, it retrieves just the
+ * statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_EMAC:
+ * if device supports a MM layer, this retrieves the eMAC statistics.
+ * Otherwise, it retrieves the statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_PMAC:
+ * if device supports a MM layer, this retrieves the pMAC statistics.
+ */
+enum ethtool_mac_stats_src {
+   ETHTOOL_MAC_STATS_SRC_AGGREGATE,
+   ETHTOOL_MAC_STATS_SRC_EMAC,
+   ETHTOOL_MAC_STATS_SRC_PMAC,
+};
+
 /**
  * enum ethtool_module_power_mode_policy - plug-in module power mode policy
  * @ETHTOOL_MODULE_POWER_MODE_POLICY_HIGH: Module is always in high power mode.
@@ -779,6 +797,31 @@ enum ethtool_podl_pse_pw_d_status {
ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
 };
 
+/**
+ * enum ethtool_mm_verify_status - status of MAC Merge Verify function
+ * @ETHTOOL_MM_VERIFY_STATUS_UNKNOWN:
+ * verification status is unknown
+ * @ETHTOOL_MM_VERIFY_STATUS_INITIAL:
+ * the 802.3 Verify State diagram is in the state INIT_VERIFICATION
+ * @ETHTOOL_MM_VERIFY_STATUS_VERIFYING:
+ * the Verify State diagram is in the state VERIFICATION_IDLE,
+ * SEND_VERIFY or WAIT_FOR_RESPONSE
+ * @ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED:
+ * indicates that the Verify State diagram is in the state VERIFIED
+ * @ETHTOOL_MM_VERIFY_STATUS_FAILED:
+ * the Verify State diagram is in the state VERIFY_FAIL
+ * @ETHTOOL_MM_VERIFY_STATUS_DISABLED:
+ * verification of preemption operation is disabled
+ */
+enum ethtool_mm_verify_status {
+   ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
+   ETHTOOL_MM_VERIFY_STATUS_INITIAL,
+   ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
+   ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
+   ETHTOOL_MM_VERIFY_STATUS_FAILED,
+   ETHTOOL_MM_VERIFY_STATUS_DISABLED,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1183,7 +1226,7 @@ struct ethtool_rxnfc {
uint32_t rule_cnt;
uint32_t rss_context;
};
-   uint32_t rule_locs[0];
+   uint32_t rule_locs[];
 };
 
 
@@ -1741,6 +1784,9 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_800000baseDR8_2_Full_BIT   = 96,
ETHTOOL_LINK_MODE_800000baseSR8_Full_BIT = 97,
ETHTOOL_LINK_MODE_800000baseVR8_Full_BIT = 98,
+   ETHTOOL_LINK_MODE_10baseT1S_Full_BIT = 99,
+   ETHTOOL_LINK_MODE_10

[PATCH v18 1/8] include: add zoned device structs

2023-03-24 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index b5122ef8ab..1576fcf2ed 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -75,6 +75,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneState {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneState;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+BLK_ZT_SWR = 0x2, /* Sequential writes required */
+BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneState state;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.39.2
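
As an illustration of how the descriptor fields relate, here is a minimal
sketch that computes how many bytes can still be written to a zone. ZoneDesc
is a simplified, hypothetical stand-in for BlockZoneDescriptor, and the sketch
assumes that wp is an absolute byte position on the device rather than an
offset within the zone.

```c
#include <stdint.h>

/* Simplified stand-in for BlockZoneDescriptor; all values are in bytes. */
typedef struct ZoneDesc {
    uint64_t start;  /* first byte of the zone */
    uint64_t length; /* zone size */
    uint64_t cap;    /* writable capacity, at most length */
    uint64_t wp;     /* write pointer (assumed absolute byte position) */
} ZoneDesc;

/* Bytes that can still be written sequentially before the zone is full. */
static uint64_t zone_remaining_bytes(const ZoneDesc *z)
{
    uint64_t end = z->start + z->cap;

    return z->wp < end ? end - z->wp : 0;
}
```

The capacity, not the length, bounds the writable area, since a zone's
capacity may be smaller than its size.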




[PATCH v18 5/8] config: add check to block layer

2023-03-24 Thread Sam Li
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.
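
The rule can be stated as a small predicate. This is a sketch under the
assumption that only host-managed children are restricted, as in the check
this patch adds to bdrv_add_child(); ZoneModel and can_add_child() are
illustrative names, not QEMU identifiers.

```c
#include <stdbool.h>

/* Mirrors the BlockZoneModel values from this series. */
typedef enum ZoneModel {
    ZONE_MODEL_NONE = 0x0, /* regular block device */
    ZONE_MODEL_HM   = 0x1, /* host-managed zoned block device */
    ZONE_MODEL_HA   = 0x2, /* host-aware zoned block device */
} ZoneModel;

/*
 * A child may be attached unless it is host-managed and the parent
 * driver does not declare support for zoned children.
 */
static bool can_add_child(bool parent_supports_zoned_children,
                          ZoneModel child_model)
{
    return parent_supports_zoned_children || child_model != ZONE_MODEL_HM;
}
```

Host-aware children pass the check because they also accept random writes
and can be used as regular devices.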

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 block.c  | 19 +++
 block/file-posix.c   | 12 
 block/raw-format.c   |  1 +
 include/block/block_int-common.h |  5 +
 4 files changed, 37 insertions(+)

diff --git a/block.c b/block.c
index 0dd604d0f6..4ebf7bbc90 100644
--- a/block.c
+++ b/block.c
@@ -7953,6 +7953,25 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children &&
+child_bs->bl.zoned == BLK_Z_HM) {
+/*
+ * The host-aware model accepts both sequential and random writes,
+ * so host-aware devices may be mixed with non-zoned drivers and
+ * used as regular devices.
+ */
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index 0c19cfb5cc..5fa80933c9 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -776,6 +776,18 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 goto fail;
 }
 }
+#ifdef CONFIG_BLKZONED
+/*
+ * The kernel page cache does not reliably work for writes to SWR zones
+ * of a zoned block device because it cannot guarantee the order of writes.
+ */
+if ((bs->bl.zoned != BLK_Z_NONE) &&
+(!(s->open_flags & O_DIRECT))) {
+error_setg(errp, "The driver supports zoned devices, and it requires "
+ "cache.direct=on, which was not specified.");
+return -EINVAL; /* No host kernel page cache */
+}
+#endif
 
 if (S_ISBLK(st.st_mode)) {
 #ifdef __linux__
diff --git a/block/raw-format.c b/block/raw-format.c
index 6e1b9394c8..72e23e7b55 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -621,6 +621,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index a3efb385e0..1bd2aef4d5 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -137,6 +137,11 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.39.2




[PATCH v9 4/5] virtio-blk: add some trace events for zoned emulation

2023-03-24 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 hw/block/trace-events |  7 +++
 hw/block/virtio-blk.c | 12 
 2 files changed, 19 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2c45a62bd5..34be8b9135 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -44,9 +44,16 @@ pflash_write_unknown(const char *name, uint8_t cmd) "%s: 
unknown command 0x%02x"
 # virtio-blk.c
 virtio_blk_req_complete(void *vdev, void *req, int status) "vdev %p req %p 
status %d"
 virtio_blk_rw_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_report_complete(void *vdev, void *req, unsigned int nr_zones, 
int ret) "vdev %p req %p nr_zones %u ret %d"
+virtio_blk_zone_mgmt_complete(void *vdev, void *req, int ret) "vdev %p req %p 
ret %d"
+virtio_blk_zone_append_complete(void *vdev, void *req, int64_t sector, int 
ret) "vdev %p req %p, append sector 0x%" PRIx64 " ret %d"
 virtio_blk_handle_write(void *vdev, void *req, uint64_t sector, size_t 
nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *vdev, void *req, uint64_t sector, size_t 
nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_submit_multireq(void *vdev, void *mrb, int start, int num_reqs, 
uint64_t offset, size_t size, bool is_write) "vdev %p mrb %p start %d num_reqs 
%d offset %"PRIu64" size %zu is_write %d"
+virtio_blk_handle_zone_report(void *vdev, void *req, int64_t sector, unsigned 
int nr_zones) "vdev %p req %p sector 0x%" PRIx64 " nr_zones %u"
+virtio_blk_handle_zone_mgmt(void *vdev, void *req, uint8_t op, int64_t sector, 
int64_t len) "vdev %p req %p op 0x%x sector 0x%" PRIx64 " len 0x%" PRIx64 ""
+virtio_blk_handle_zone_reset_all(void *vdev, void *req, int64_t sector, 
int64_t len) "vdev %p req %p sector 0x%" PRIx64 " cap 0x%" PRIx64 ""
+virtio_blk_handle_zone_append(void *vdev, void *req, int64_t sector) "vdev %p 
req %p, append sector 0x%" PRIx64 ""
 
 # hd-geometry.c
 hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS 
%d %d %d"
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 0d85c2c9b0..2afd5cf96c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -676,6 +676,7 @@ static void virtio_blk_zone_report_complete(void *opaque, 
int ret)
 int64_t nz = data->zone_report_data.nr_zones;
 int8_t err_status = VIRTIO_BLK_S_OK;
 
+trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
@@ -792,6 +793,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
 nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
 sizeof(struct virtio_blk_zone_report)) /
sizeof(struct virtio_blk_zone_descriptor);
+trace_virtio_blk_handle_zone_report(vdev, req,
+offset >> BDRV_SECTOR_BITS, nr_zones);
 
 zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
 data = g_malloc(sizeof(ZoneCmdData));
@@ -814,7 +817,9 @@ static void virtio_blk_zone_mgmt_complete(void *opaque, int 
ret)
 {
 VirtIOBlockReq *req = opaque;
 VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
 int8_t err_status = VIRTIO_BLK_S_OK;
+trace_virtio_blk_zone_mgmt_complete(vdev, req,ret);
 
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
@@ -841,6 +846,8 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 /* Entire drive capacity */
 offset = 0;
 len = capacity;
+trace_virtio_blk_handle_zone_reset_all(vdev, req, 0,
+   bs->total_sectors);
 } else {
 if (bs->bl.zone_size > capacity - offset) {
 /* The zoned device allows the last smaller zone. */
@@ -848,6 +855,9 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 } else {
 len = bs->bl.zone_size;
 }
+trace_virtio_blk_handle_zone_mgmt(vdev, req, op,
+  offset >> BDRV_SECTOR_BITS,
+  len >> BDRV_SECTOR_BITS);
 }
 
 if (!check_zoned_request(s, offset, len, false, &err_status)) {
@@ -888,6 +898,7 @@ static void virtio_blk_zone_append_complete(void *opaque, 
int ret)
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
 }
+trace_virtio_blk_zone_append_complete(vdev, req, append_sector, ret);
 
 out:
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -909,6 +920,7 @@ static int virtio_blk_handle_zone_append(VirtIOBlo

[PATCH v9 3/5] block: add accounting for zone append operation

2023-03-24 Thread Sam Li
To account for the new zone append write operation on zoned devices,
introduce the BLOCK_ACCT_ZONE_APPEND enum value alongside the other I/O
request types (read, write, flush).
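
Per-type accounting of this kind is typically an array of counters indexed by
the request type. A minimal sketch, assuming a reduced and hypothetical
AcctType enum and AcctStats struct rather than QEMU's actual accounting
structures:

```c
#include <stdint.h>

/* Hypothetical, reduced version of an I/O accounting type enum. */
typedef enum AcctType {
    ACCT_READ,
    ACCT_WRITE,
    ACCT_ZONE_APPEND,
    ACCT_FLUSH,
    ACCT_MAX,
} AcctType;

/* One counter per request type. */
typedef struct AcctStats {
    uint64_t nr_bytes[ACCT_MAX];
    uint64_t nr_ops[ACCT_MAX];
} AcctStats;

/* Record one completed request of the given type. */
static void acct_done(AcctStats *stats, AcctType type, uint64_t bytes)
{
    stats->nr_bytes[type] += bytes;
    stats->nr_ops[type]++;
}
```

Adding a new request type then only requires a new enum value before
ACCT_MAX; the arrays grow with it.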

Signed-off-by: Sam Li 
---
 block/qapi-sysemu.c| 11 ++
 block/qapi.c   | 18 ++
 hw/block/virtio-blk.c  |  4 +++
 include/block/accounting.h |  1 +
 qapi/block-core.json   | 68 --
 qapi/block.json|  4 +++
 6 files changed, 95 insertions(+), 11 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 7bd7554150..cec3c1afb4 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -517,6 +517,7 @@ void qmp_block_latency_histogram_set(
 bool has_boundaries, uint64List *boundaries,
 bool has_boundaries_read, uint64List *boundaries_read,
 bool has_boundaries_write, uint64List *boundaries_write,
+bool has_boundaries_append, uint64List *boundaries_append,
 bool has_boundaries_flush, uint64List *boundaries_flush,
 Error **errp)
 {
@@ -557,6 +558,16 @@ void qmp_block_latency_histogram_set(
 }
 }
 
+if (has_boundaries || has_boundaries_append) {
+ret = block_latency_histogram_set(
+stats, BLOCK_ACCT_ZONE_APPEND,
+has_boundaries_append ? boundaries_append : boundaries);
+if (ret) {
+error_setg(errp, "Device '%s' set append write boundaries fail", 
id);
+return;
+}
+}
+
 if (has_boundaries || has_boundaries_flush) {
 ret = block_latency_histogram_set(
 stats, BLOCK_ACCT_FLUSH,
diff --git a/block/qapi.c b/block/qapi.c
index c84147849d..2684484e9d 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -533,27 +533,36 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_bytes = stats->nr_bytes[BLOCK_ACCT_WRITE];
+ds->zone_append_bytes = stats->nr_bytes[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_bytes = stats->nr_bytes[BLOCK_ACCT_UNMAP];
 ds->rd_operations = stats->nr_ops[BLOCK_ACCT_READ];
 ds->wr_operations = stats->nr_ops[BLOCK_ACCT_WRITE];
+ds->zone_append_operations = stats->nr_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_operations = stats->nr_ops[BLOCK_ACCT_UNMAP];
 
 ds->failed_rd_operations = stats->failed_ops[BLOCK_ACCT_READ];
 ds->failed_wr_operations = stats->failed_ops[BLOCK_ACCT_WRITE];
+ds->failed_zone_append_operations =
+stats->failed_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->failed_flush_operations = stats->failed_ops[BLOCK_ACCT_FLUSH];
 ds->failed_unmap_operations = stats->failed_ops[BLOCK_ACCT_UNMAP];
 
 ds->invalid_rd_operations = stats->invalid_ops[BLOCK_ACCT_READ];
 ds->invalid_wr_operations = stats->invalid_ops[BLOCK_ACCT_WRITE];
+ds->invalid_zone_append_operations =
+stats->invalid_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->invalid_flush_operations =
 stats->invalid_ops[BLOCK_ACCT_FLUSH];
 ds->invalid_unmap_operations = stats->invalid_ops[BLOCK_ACCT_UNMAP];
 
 ds->rd_merged = stats->merged[BLOCK_ACCT_READ];
 ds->wr_merged = stats->merged[BLOCK_ACCT_WRITE];
+ds->zone_append_merged = stats->merged[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_merged = stats->merged[BLOCK_ACCT_UNMAP];
 ds->flush_operations = stats->nr_ops[BLOCK_ACCT_FLUSH];
 ds->wr_total_time_ns = stats->total_time_ns[BLOCK_ACCT_WRITE];
+ds->zone_append_total_time_ns =
+stats->total_time_ns[BLOCK_ACCT_ZONE_APPEND];
 ds->rd_total_time_ns = stats->total_time_ns[BLOCK_ACCT_READ];
 ds->flush_total_time_ns = stats->total_time_ns[BLOCK_ACCT_FLUSH];
 ds->unmap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_UNMAP];
@@ -571,6 +580,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 TimedAverage *rd = &ts->latency[BLOCK_ACCT_READ];
 TimedAverage *wr = &ts->latency[BLOCK_ACCT_WRITE];
+TimedAverage *zap = &ts->latency[BLOCK_ACCT_ZONE_APPEND];
 TimedAverage *fl = &ts->latency[BLOCK_ACCT_FLUSH];
 
 dev_stats->interval_length = ts->interval_length;
@@ -583,6 +593,10 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 dev_stats->max_wr_latency_ns = timed_average_max(wr);
 dev_stats->avg_wr_latency_ns = timed_average_avg(wr);
 
+dev_stats->min_zone_append_latency_ns = timed_average_min(zap);
+dev_stats->max_zone_append_latency_ns = timed_average_max(zap);
+dev_stats->avg_zone_append_latency_ns = timed_average_avg(zap);
+
 dev_stats->min_flush_latency_ns = timed_average_min(fl);
 dev_stats->max_flush_latency_ns = timed_average_max(fl);
 dev_s

[PATCH v9 0/5] Add zoned storage emulation to virtio-blk driver

2023-03-27 Thread Sam Li
This patch series adds zoned storage emulation to the virtio-blk driver. It
implements the virtio-blk ZBD support that was recently accepted into the
virtio specification. The related commit is at

https://github.com/oasis-tcs/virtio-spec/commit/b4e8efa0fa6c8d844328090ad15db65af8d7d981

The Linux zoned device code, implemented by Dmitry Fomichev, was released
in Linux v6.3-rc1.

Aside: adding options such as zoned=on to the virtio-blk device will be
considered in follow-up work.

Note: Sorry for resending. A network error left the previous submission
incomplete.

v9:
- address review comments
  * add docs for zoned emulation use case [Matias]
  * add the zoned feature bit to qmp monitor [Matias]
  * add the version number for newly added configs of accounting [Markus]

v8:
- address Stefan's review comments
  * rm aio_context_acquire/release in handle_req
  * rename function return type
  * rename BLOCK_ACCT_APPEND to BLOCK_ACCT_ZONE_APPEND for clarity

v7:
- update headers to v6.3-rc1

v6:
- address Stefan's review comments
  * add accounting for zone append operation
  * fix in_iov usage in handle_request, error handling and typos

v5:
- address Stefan's review comments
  * restore the way the zone append result is written to the buffer
  * fix error checking case and other errands

v4:
- change the way the zone append request result is written to the buffer
- change zone state, zone type value of virtio_blk_zone_descriptor
- add trace events for new zone APIs

v3:
- use qemuio_from_buffer to write status bit [Stefan]
- avoid using req->elem directly [Stefan]
- fix error checking and memory leak [Stefan]

v2:
- change units of emulated zone op corresponding to block layer APIs
- modify error checking cases [Stefan, Damien]

v1:
- add zoned storage emulation

Sam Li (5):
  include: update virtio_blk headers to v6.3-rc1
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation
  docs/zoned-storage:add zoned emulation use case

 block/qapi-sysemu.c  |  11 +
 block/qapi.c |  18 +
 docs/devel/zoned-storage.rst |  17 +
 hw/block/trace-events|   7 +
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 405 +++
 hw/virtio/virtio-qmp.c   |   2 +
 include/block/accounting.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h|  12 +
 include/standard-headers/linux/ethtool.h |  48 ++-
 include/standard-headers/linux/fuse.h|  45 ++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 +
 linux-headers/linux/vfio.h   |  15 +-
 linux-headers/linux/vhost.h  |   8 +
 qapi/block-core.json |  68 +++-
 qapi/block.json  |   4 +
 21 files changed, 794 insertions(+), 21 deletions(-)

-- 
2.39.2




[PATCH v9 5/5] docs/zoned-storage:add zoned emulation use case

2023-03-27 Thread Sam Li
Add documentation with an example of using the virtio-blk driver
to pass zoned block devices through to the guest.

Signed-off-by: Sam Li 
---
 docs/devel/zoned-storage.rst | 17 +
 1 file changed, 17 insertions(+)

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
index 6a36133e51..05ecf3729c 100644
--- a/docs/devel/zoned-storage.rst
+++ b/docs/devel/zoned-storage.rst
@@ -41,3 +41,20 @@ APIs for zoned storage emulation or testing.
 For example, to test zone_report on a null_blk device using qemu-io is:
 $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
 -c "zrp offset nr_zones"
+
+To expose the host's zoned block device through virtio-blk, the command line
+can be (includes the -device parameter):
+-blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,
+cache.direct=on \
+-device virtio-blk-pci,drive=drive0
+Or only use the -drive parameter:
+-drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
+
+Additionally, QEMU has several ways of supporting zoned storage, including:
+(1) Using virtio-scsi: --device scsi-block allows for the passing through of
+SCSI ZBC devices, enabling the attachment of ZBC or ZAC HDDs to QEMU.
+(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
+purposes, it cannot yet pass through a zoned device from the host. To pass the
+NVMe ZNS device to the guest, use VFIO PCI to pass the entire NVMe PCI adapter
+through to the guest. Likewise, an HDD HBA can be passed through to the guest
+together with all HDDs attached to the HBA.
-- 
2.39.2




[PATCH v9 4/5] virtio-blk: add some trace events for zoned emulation

2023-03-27 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 hw/block/trace-events |  7 +++
 hw/block/virtio-blk.c | 12 
 2 files changed, 19 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2c45a62bd5..34be8b9135 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -44,9 +44,16 @@ pflash_write_unknown(const char *name, uint8_t cmd) "%s: 
unknown command 0x%02x"
 # virtio-blk.c
 virtio_blk_req_complete(void *vdev, void *req, int status) "vdev %p req %p 
status %d"
 virtio_blk_rw_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_report_complete(void *vdev, void *req, unsigned int nr_zones, 
int ret) "vdev %p req %p nr_zones %u ret %d"
+virtio_blk_zone_mgmt_complete(void *vdev, void *req, int ret) "vdev %p req %p 
ret %d"
+virtio_blk_zone_append_complete(void *vdev, void *req, int64_t sector, int 
ret) "vdev %p req %p, append sector 0x%" PRIx64 " ret %d"
 virtio_blk_handle_write(void *vdev, void *req, uint64_t sector, size_t 
nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *vdev, void *req, uint64_t sector, size_t 
nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_submit_multireq(void *vdev, void *mrb, int start, int num_reqs, 
uint64_t offset, size_t size, bool is_write) "vdev %p mrb %p start %d num_reqs 
%d offset %"PRIu64" size %zu is_write %d"
+virtio_blk_handle_zone_report(void *vdev, void *req, int64_t sector, unsigned 
int nr_zones) "vdev %p req %p sector 0x%" PRIx64 " nr_zones %u"
+virtio_blk_handle_zone_mgmt(void *vdev, void *req, uint8_t op, int64_t sector, 
int64_t len) "vdev %p req %p op 0x%x sector 0x%" PRIx64 " len 0x%" PRIx64 ""
+virtio_blk_handle_zone_reset_all(void *vdev, void *req, int64_t sector, 
int64_t len) "vdev %p req %p sector 0x%" PRIx64 " cap 0x%" PRIx64 ""
+virtio_blk_handle_zone_append(void *vdev, void *req, int64_t sector) "vdev %p 
req %p, append sector 0x%" PRIx64 ""
 
 # hd-geometry.c
 hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS 
%d %d %d"
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 0d85c2c9b0..2afd5cf96c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -676,6 +676,7 @@ static void virtio_blk_zone_report_complete(void *opaque, 
int ret)
 int64_t nz = data->zone_report_data.nr_zones;
 int8_t err_status = VIRTIO_BLK_S_OK;
 
+trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
@@ -792,6 +793,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
 nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
 sizeof(struct virtio_blk_zone_report)) /
sizeof(struct virtio_blk_zone_descriptor);
+trace_virtio_blk_handle_zone_report(vdev, req,
+offset >> BDRV_SECTOR_BITS, nr_zones);
 
 zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
 data = g_malloc(sizeof(ZoneCmdData));
@@ -814,7 +817,9 @@ static void virtio_blk_zone_mgmt_complete(void *opaque, int 
ret)
 {
 VirtIOBlockReq *req = opaque;
 VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
 int8_t err_status = VIRTIO_BLK_S_OK;
+trace_virtio_blk_zone_mgmt_complete(vdev, req,ret);
 
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
@@ -841,6 +846,8 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 /* Entire drive capacity */
 offset = 0;
 len = capacity;
+trace_virtio_blk_handle_zone_reset_all(vdev, req, 0,
+   bs->total_sectors);
 } else {
 if (bs->bl.zone_size > capacity - offset) {
 /* The zoned device allows the last smaller zone. */
@@ -848,6 +855,9 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 } else {
 len = bs->bl.zone_size;
 }
+trace_virtio_blk_handle_zone_mgmt(vdev, req, op,
+  offset >> BDRV_SECTOR_BITS,
+  len >> BDRV_SECTOR_BITS);
 }
 
 if (!check_zoned_request(s, offset, len, false, &err_status)) {
@@ -888,6 +898,7 @@ static void virtio_blk_zone_append_complete(void *opaque, 
int ret)
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
 }
+trace_virtio_blk_zone_append_complete(vdev, req, append_sector, ret);
 
 out:
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -909,6 +920,7 @@ static int virtio_blk_handle_zone_append(VirtIOBlo

[PATCH v9 2/5] virtio-blk: add zoned storage emulation for zoned devices

2023-03-27 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zone, four zone operations (open,
close, finish, reset), and Zone Append.

The VIRTIO_BLK_F_ZONED feature bit will only be set if the host
supports zoned block devices. It will not be set for regular block
devices (conventional zones).

The guest OS can use blktests and fio to test these commands on zoned
devices. Testing zone append writes with zonefs is also supported.
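
As a rough illustration of the kind of validation done before a zoned
request is issued, here is a minimal standalone sketch. It is not the
code from this patch: the sketch_* names, the status enum and the
sketch_limits struct are hypothetical stand-ins, all sizes are in bytes,
and the conventional-zone check is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; the real ones live in virtio_blk.h. */
enum sketch_status {
    SKETCH_S_OK,
    SKETCH_S_UNSUPP,
    SKETCH_S_INVALID_CMD,
    SKETCH_S_UNALIGNED_WP,
};

/* Hypothetical, simplified device limits. All sizes are in bytes. */
struct sketch_limits {
    bool zoned;                 /* device exposes zones at all */
    int64_t capacity;           /* total device size */
    int64_t write_granularity;  /* 0 means no alignment requirement */
    int64_t max_append_bytes;   /* 0 means zone append is unsupported */
};

enum sketch_status sketch_check_zoned_request(const struct sketch_limits *lim,
                                              int64_t offset, int64_t len,
                                              bool append)
{
    assert(lim != NULL);

    if (!lim->zoned) {
        return SKETCH_S_UNSUPP;
    }

    /* Range check written so that offset + len is never computed. */
    if (offset < 0 || len < 0 || len > lim->capacity ||
        offset > lim->capacity - len) {
        return SKETCH_S_INVALID_CMD;
    }

    if (append) {
        /* Append requests must start on a write-granularity boundary. */
        if (lim->write_granularity &&
            offset % lim->write_granularity != 0) {
            return SKETCH_S_UNALIGNED_WP;
        }
        /* Too-large appends are invalid, or unsupported if the limit is 0. */
        if (len > lim->max_append_bytes) {
            return lim->max_append_bytes == 0 ? SKETCH_S_UNSUPP
                                              : SKETCH_S_INVALID_CMD;
        }
    }
    return SKETCH_S_OK;
}
```

Comparing offset against capacity - len, rather than comparing
offset + len against capacity, keeps the check free of signed overflow
for large guest-supplied values.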

Signed-off-by: Sam Li 
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 389 +++
 hw/virtio/virtio-qmp.c   |   2 +
 3 files changed, 393 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endof(struct virtio_blk_config, zoned)},
 {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..66c2bc4b16 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,335 @@ err:
 return err_status;
 }
 
+typedef struct ZoneCmdData {
+VirtIOBlockReq *req;
+struct iovec *in_iov;
+unsigned in_num;
+union {
+struct {
+unsigned int nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report_data;
+struct {
+int64_t offset;
+} zone_append_data;
+};
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+ bool append, uint8_t *status) {
+BlockDriverState *bs = blk_bs(s->blk);
+int index;
+
+if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+*status = VIRTIO_BLK_S_UNSUPP;
+return false;
+}
+
+if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+|| offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (append) {
+if (bs->bl.write_granularity) {
+if ((offset % bs->bl.write_granularity) != 0) {
+*status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+return false;
+}
+}
+
+index = offset / bs->bl.zone_size;
+if (BDRV_ZT_IS_CONV(bs->bl.wps->wp[index])) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (len / 512 > bs->bl.max_append_sectors) {
+if (bs->bl.max_append_sectors == 0) {
+*status = VIRTIO_BLK_S_UNSUPP;
+} else {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+}
+return false;
+}
+}
+return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+ZoneCmdData *data = opaque;
+VirtIOBlockReq *req = data->req;
+VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+struct iovec *in_iov = data->in_iov;
+unsigned in_num = data->in_num;
+int64_t zrp_size, n, j = 0;
+int64_t nz = data->zone_report_data.nr_zones;
+int8_t err_status = VIRTIO_BLK_S_OK;
+
+if (ret) {
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
+zrp_size = sizeof(struct virtio_blk_zone_report)
+   + sizeof(struct virtio_blk_zone_descriptor) * nz;
+n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+if (n != sizeof(zrp_hdr)) {
+virtio_error(vdev, "Driver provided input buffer that is too small!");
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+struct virtio_blk_zone_descriptor desc =
+(struct virtio_blk_zone_

[PATCH v9 1/5] include: update virtio_blk headers to v6.3-rc1

2023-03-27 Thread Sam Li
Use scripts/update-linux-headers.sh to update headers to 6.3-rc1.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 include/standard-headers/drm/drm_fourcc.h|  12 +++
 include/standard-headers/linux/ethtool.h |  48 -
 include/standard-headers/linux/fuse.h|  45 +++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +++
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 ++
 linux-headers/linux/vfio.h   |  15 +--
 linux-headers/linux/vhost.h  |   8 ++
 11 files changed, 270 insertions(+), 10 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 69cab17b38..dc3e6112c1 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -87,6 +87,18 @@ extern "C" {
  *
  * The authoritative list of format modifier codes is found in
  * `include/uapi/drm/drm_fourcc.h`
+ *
+ * Open Source User Waiver
+ * ---
+ *
+ * Because this is the authoritative source for pixel formats and modifiers
+ * referenced by GL, Vulkan extensions and other standards and hence used both
+ * by open source and closed source driver stacks, the usual requirement for an
+ * upstream in-kernel or open source userspace user does not apply.
+ *
+ * To ensure, as much as feasible, compatibility across stacks and avoid
+ * confusion with incompatible enumerations stakeholders for all relevant 
driver
+ * stacks should approve additions.
  */
 
 #define fourcc_code(a, b, c, d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 87176ab075..99fcddf04f 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -711,6 +711,24 @@ enum ethtool_stringset {
ETH_SS_COUNT
 };
 
+/**
+ * enum ethtool_mac_stats_src - source of ethtool MAC statistics
+ * @ETHTOOL_MAC_STATS_SRC_AGGREGATE:
+ * if device supports a MAC merge layer, this retrieves the aggregate
+ * statistics of the eMAC and pMAC. Otherwise, it retrieves just the
+ * statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_EMAC:
+ * if device supports a MM layer, this retrieves the eMAC statistics.
+ * Otherwise, it retrieves the statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_PMAC:
+ * if device supports a MM layer, this retrieves the pMAC statistics.
+ */
+enum ethtool_mac_stats_src {
+   ETHTOOL_MAC_STATS_SRC_AGGREGATE,
+   ETHTOOL_MAC_STATS_SRC_EMAC,
+   ETHTOOL_MAC_STATS_SRC_PMAC,
+};
+
 /**
  * enum ethtool_module_power_mode_policy - plug-in module power mode policy
  * @ETHTOOL_MODULE_POWER_MODE_POLICY_HIGH: Module is always in high power mode.
@@ -779,6 +797,31 @@ enum ethtool_podl_pse_pw_d_status {
ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
 };
 
+/**
+ * enum ethtool_mm_verify_status - status of MAC Merge Verify function
+ * @ETHTOOL_MM_VERIFY_STATUS_UNKNOWN:
+ * verification status is unknown
+ * @ETHTOOL_MM_VERIFY_STATUS_INITIAL:
+ * the 802.3 Verify State diagram is in the state INIT_VERIFICATION
+ * @ETHTOOL_MM_VERIFY_STATUS_VERIFYING:
+ * the Verify State diagram is in the state VERIFICATION_IDLE,
+ * SEND_VERIFY or WAIT_FOR_RESPONSE
+ * @ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED:
+ * indicates that the Verify State diagram is in the state VERIFIED
+ * @ETHTOOL_MM_VERIFY_STATUS_FAILED:
+ * the Verify State diagram is in the state VERIFY_FAIL
+ * @ETHTOOL_MM_VERIFY_STATUS_DISABLED:
+ * verification of preemption operation is disabled
+ */
+enum ethtool_mm_verify_status {
+   ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
+   ETHTOOL_MM_VERIFY_STATUS_INITIAL,
+   ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
+   ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
+   ETHTOOL_MM_VERIFY_STATUS_FAILED,
+   ETHTOOL_MM_VERIFY_STATUS_DISABLED,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1183,7 +1226,7 @@ struct ethtool_rxnfc {
uint32_trule_cnt;
uint32_trss_context;
};
-   uint32_trule_locs[0];
+   uint32_trule_locs[];
 };
 
 
@@ -1741,6 +1784,9 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_80baseDR8_2_Full_BIT   = 96,
ETHTOOL_LINK_MODE_80baseSR8_Full_BIT = 97,
ETHTOOL_LINK_MODE_80baseVR8_Full_BIT = 98,
+   ETHTOOL_LINK_MODE_10baseT1S_Full_BIT = 99,
+   ETHTOOL_LINK_MODE_10

[PATCH v9 3/5] block: add accounting for zone append operation

2023-03-27 Thread Sam Li
To account for the new zone append write operation on zoned devices,
introduce the BLOCK_ACCT_ZONE_APPEND enum value alongside the other I/O
request types (read, write, flush).
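
The idea of per-type accounting can be sketched in a few lines of
standalone C. This is only an illustration under assumptions: the
sketch_* enum, struct and helpers below are hypothetical and are not
the QEMU accounting API. Each statistic is an array indexed by request
type, so adding a new type means adding one enum value and reporting
its slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request types, mirroring the idea of BLOCK_ACCT_*. */
enum sketch_acct_type {
    SKETCH_ACCT_READ,
    SKETCH_ACCT_WRITE,
    SKETCH_ACCT_ZONE_APPEND,
    SKETCH_ACCT_FLUSH,
    SKETCH_MAX_IOTYPE,
};

/* One counter array per statistic, indexed by request type. */
struct sketch_acct_stats {
    uint64_t nr_bytes[SKETCH_MAX_IOTYPE];
    uint64_t nr_ops[SKETCH_MAX_IOTYPE];
    uint64_t failed_ops[SKETCH_MAX_IOTYPE];
};

/* Record a successfully completed request of the given type. */
void sketch_acct_done(struct sketch_acct_stats *s,
                      enum sketch_acct_type type, uint64_t bytes)
{
    assert(s != NULL && type < SKETCH_MAX_IOTYPE);
    s->nr_bytes[type] += bytes;
    s->nr_ops[type]++;
}

/* Record a failed request of the given type. */
void sketch_acct_failed(struct sketch_acct_stats *s,
                        enum sketch_acct_type type)
{
    assert(s != NULL && type < SKETCH_MAX_IOTYPE);
    s->failed_ops[type]++;
}
```

With this layout, zone append counters stay separate from regular write
counters, which is what lets a statistics query report them as distinct
fields.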

Signed-off-by: Sam Li 
---
 block/qapi-sysemu.c| 11 ++
 block/qapi.c   | 18 ++
 hw/block/virtio-blk.c  |  4 +++
 include/block/accounting.h |  1 +
 qapi/block-core.json   | 68 --
 qapi/block.json|  4 +++
 6 files changed, 95 insertions(+), 11 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 7bd7554150..cec3c1afb4 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -517,6 +517,7 @@ void qmp_block_latency_histogram_set(
 bool has_boundaries, uint64List *boundaries,
 bool has_boundaries_read, uint64List *boundaries_read,
 bool has_boundaries_write, uint64List *boundaries_write,
+bool has_boundaries_append, uint64List *boundaries_append,
 bool has_boundaries_flush, uint64List *boundaries_flush,
 Error **errp)
 {
@@ -557,6 +558,16 @@ void qmp_block_latency_histogram_set(
 }
 }
 
+if (has_boundaries || has_boundaries_append) {
+ret = block_latency_histogram_set(
+stats, BLOCK_ACCT_ZONE_APPEND,
+has_boundaries_append ? boundaries_append : boundaries);
+if (ret) {
+error_setg(errp, "Device '%s' set append write boundaries fail", 
id);
+return;
+}
+}
+
 if (has_boundaries || has_boundaries_flush) {
 ret = block_latency_histogram_set(
 stats, BLOCK_ACCT_FLUSH,
diff --git a/block/qapi.c b/block/qapi.c
index c84147849d..2684484e9d 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -533,27 +533,36 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_bytes = stats->nr_bytes[BLOCK_ACCT_WRITE];
+ds->zone_append_bytes = stats->nr_bytes[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_bytes = stats->nr_bytes[BLOCK_ACCT_UNMAP];
 ds->rd_operations = stats->nr_ops[BLOCK_ACCT_READ];
 ds->wr_operations = stats->nr_ops[BLOCK_ACCT_WRITE];
+ds->zone_append_operations = stats->nr_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_operations = stats->nr_ops[BLOCK_ACCT_UNMAP];
 
 ds->failed_rd_operations = stats->failed_ops[BLOCK_ACCT_READ];
 ds->failed_wr_operations = stats->failed_ops[BLOCK_ACCT_WRITE];
+ds->failed_zone_append_operations =
+stats->failed_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->failed_flush_operations = stats->failed_ops[BLOCK_ACCT_FLUSH];
 ds->failed_unmap_operations = stats->failed_ops[BLOCK_ACCT_UNMAP];
 
 ds->invalid_rd_operations = stats->invalid_ops[BLOCK_ACCT_READ];
 ds->invalid_wr_operations = stats->invalid_ops[BLOCK_ACCT_WRITE];
+ds->invalid_zone_append_operations =
+stats->invalid_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->invalid_flush_operations =
 stats->invalid_ops[BLOCK_ACCT_FLUSH];
 ds->invalid_unmap_operations = stats->invalid_ops[BLOCK_ACCT_UNMAP];
 
 ds->rd_merged = stats->merged[BLOCK_ACCT_READ];
 ds->wr_merged = stats->merged[BLOCK_ACCT_WRITE];
+ds->zone_append_merged = stats->merged[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_merged = stats->merged[BLOCK_ACCT_UNMAP];
 ds->flush_operations = stats->nr_ops[BLOCK_ACCT_FLUSH];
 ds->wr_total_time_ns = stats->total_time_ns[BLOCK_ACCT_WRITE];
+ds->zone_append_total_time_ns =
+stats->total_time_ns[BLOCK_ACCT_ZONE_APPEND];
 ds->rd_total_time_ns = stats->total_time_ns[BLOCK_ACCT_READ];
 ds->flush_total_time_ns = stats->total_time_ns[BLOCK_ACCT_FLUSH];
 ds->unmap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_UNMAP];
@@ -571,6 +580,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 TimedAverage *rd = &ts->latency[BLOCK_ACCT_READ];
 TimedAverage *wr = &ts->latency[BLOCK_ACCT_WRITE];
+TimedAverage *zap = &ts->latency[BLOCK_ACCT_ZONE_APPEND];
 TimedAverage *fl = &ts->latency[BLOCK_ACCT_FLUSH];
 
 dev_stats->interval_length = ts->interval_length;
@@ -583,6 +593,10 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 dev_stats->max_wr_latency_ns = timed_average_max(wr);
 dev_stats->avg_wr_latency_ns = timed_average_avg(wr);
 
+dev_stats->min_zone_append_latency_ns = timed_average_min(zap);
+dev_stats->max_zone_append_latency_ns = timed_average_max(zap);
+dev_stats->avg_zone_append_latency_ns = timed_average_avg(zap);
+
 dev_stats->min_flush_latency_ns = timed_average_min(fl);
 dev_stats->max_flush_latency_ns = timed_average_max(fl);
 dev_s

Re: [PATCH v7 1/4] file-posix: add tracking of the zone write pointers

2023-04-04 Thread Sam Li
Stefan Hajnoczi wrote on Tue, Apr 4, 2023 at 01:04:
>
> On Thu, Mar 23, 2023 at 01:19:04PM +0800, Sam Li wrote:
> > Since Linux doesn't have a user API to issue zone append operations to
> > zoned devices from user space, the file-posix driver is modified to add
> > zone append emulation using regular writes. To do this, the file-posix
> > driver tracks the wp location of all zones of the device. It uses an
> > array of uint64_t. The most significant bit of each wp location indicates
> > if the zone type is conventional zones.
> >
> > The zones wp can be changed due to the following operations issued:
> > - zone reset: change the wp to the start offset of that zone
> > - zone finish: change to the end location of that zone
> > - write to a zone
> > - zone append
> >
> > Signed-off-by: Sam Li 
> > ---
> >  block/file-posix.c   | 168 ++-
> >  include/block/block-common.h |  14 +++
> >  include/block/block_int-common.h |   5 +
> >  3 files changed, 183 insertions(+), 4 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 65efe5147e..0fb425dcae 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1324,6 +1324,85 @@ static int hdev_get_max_segments(int fd, struct stat 
> > *st)
> >  #endif
> >  }
> >
> > +#if defined(CONFIG_BLKZONED)
> > +/*
> > + * If the ra (reset_all) flag > 0, then the wp of that zone should be 
> > reset to
> > + * the start sector. Else, take the real wp of the device.
> > + */
> > +static int get_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
> > +unsigned int nrz, int ra) {
>
> Please use bool for true/false and use clear variable names:
> int ra -> bool reset_all
>
> > +struct blk_zone *blkz;
> > +size_t rep_size;
> > +uint64_t sector = offset >> BDRV_SECTOR_BITS;
> > +int ret, n = 0, i = 0;
> > +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> > blk_zone);
> > +g_autofree struct blk_zone_report *rep = NULL;
> > +
> > +rep = g_malloc(rep_size);
> > +blkz = (struct blk_zone *)(rep + 1);
> > +while (n < nrz) {
> > +memset(rep, 0, rep_size);
> > +rep->sector = sector;
> > +rep->nr_zones = nrz - n;
> > +
> > +do {
> > +ret = ioctl(fd, BLKREPORTZONE, rep);
> > +} while (ret != 0 && errno == EINTR);
> > +if (ret != 0) {
> > +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed 
> > %d",
> > +fd, offset, errno);
> > +return -errno;
> > +}
> > +
> > +if (!rep->nr_zones) {
> > +break;
> > +}
> > +
> > +for (i = 0; i < rep->nr_zones; i++, n++) {
> > +/*
> > + * The wp tracking cares only about sequential writes required 
> > and
> > + * sequential write preferred zones so that the wp can advance 
> > to
> > + * the right location.
> > + * Use the most significant bit of the wp location to indicate 
> > the
> > + * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
> > + */
> > +if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> > +wps->wp[i] &= 1ULL << 63;
> > +} else {
> > +switch(blkz[i].cond) {
> > +case BLK_ZONE_COND_FULL:
> > +case BLK_ZONE_COND_READONLY:
> > +/* Zone not writable */
> > +wps->wp[i] = (blkz[i].start + blkz[i].len) << 
> > BDRV_SECTOR_BITS;
> > +break;
> > +case BLK_ZONE_COND_OFFLINE:
> > +/* Zone not writable nor readable */
> > +wps->wp[i] = (blkz[i].start) << BDRV_SECTOR_BITS;
> > +break;
> > +default:
> > +if (ra > 0) {
> > +wps->wp[i] = blkz[i].start << BDRV_SECTOR_BITS;
> > +} else {
> > +wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
> > +}
> > +break;
> > +}
> > +}
> > +}
> > +sector = blkz[i - 1].start + blkz[i - 1].le

[PATCH v8 3/4] qemu-iotests: test zone append operation

2023-04-04 Thread Sam Li
This patch tests zone append writes by reporting the zone wp after
the call completes. The "zap -p" option prints the sector offset
returned on completion, which should be the start sector where the
append write began.
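
For readers checking the printed values by hand, the arithmetic can be
sketched as below. This is a standalone illustration, not code from the
patch: the sketch_* helpers are hypothetical, the sector size is assumed
to be 512 bytes (a shift of 9, matching BDRV_SECTOR_BITS), and the zone
is assumed to be empty before the first append.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sector shift: 512-byte sectors. */
#define SKETCH_SECTOR_BITS 9

/* Convert a byte offset into a 512-byte sector number. */
int64_t sketch_tosector(int64_t bytes)
{
    assert(bytes >= 0);
    return bytes >> SKETCH_SECTOR_BITS;
}

/*
 * Sector at which the next append is expected to begin, given the
 * zone start in bytes and the number of bytes already appended to
 * that (initially empty) zone.
 */
int64_t sketch_expected_append_sector(int64_t zone_start_bytes,
                                      int64_t bytes_already_appended)
{
    return sketch_tosector(zone_start_bytes + bytes_already_appended);
}
```

Each "zap -p <offset> 0x1000 0x2000" call in the test passes two buffer
lengths, so it appends 0x3000 bytes, which is 0x18 sectors under the
assumption above.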

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 qemu-io-cmds.c | 75 ++
 tests/qemu-iotests/tests/zoned | 16 +++
 tests/qemu-iotests/tests/zoned.out | 16 +++
 3 files changed, 107 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f35ea627d7..3f75d2f5a6 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1874,6 +1874,80 @@ static const cmdinfo_t zone_reset_cmd = {
 .oneline = "reset a zone write pointer in zone block device",
 };
 
+static int do_aio_zone_append(BlockBackend *blk, QEMUIOVector *qiov,
+  int64_t *offset, int flags, int *total)
+{
+int async_ret = NOT_DONE;
+
+blk_aio_zone_append(blk, offset, qiov, flags, aio_rw_done, &async_ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+*total = qiov->size;
+return async_ret < 0 ? async_ret : 1;
+}
+
+static int zone_append_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+bool pflag = false;
+int flags = 0;
+int total = 0;
+int64_t offset;
+char *buf;
+int c, nr_iov;
+int pattern = 0xcd;
+QEMUIOVector qiov;
+
+if (optind > argc - 3) {
+return -EINVAL;
+}
+
+if ((c = getopt(argc, argv, "p")) != -1) {
+pflag = true;
+}
+
+offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
+optind++;
+nr_iov = argc - optind;
+buf = create_iovec(blk, &qiov, &argv[optind], nr_iov, pattern,
+   flags & BDRV_REQ_REGISTERED_BUF);
+if (buf == NULL) {
+return -EINVAL;
+}
+ret = do_aio_zone_append(blk, &qiov, &offset, flags, &total);
+if (ret < 0) {
+printf("zone append failed: %s\n", strerror(-ret));
+goto out;
+}
+
+if (pflag) {
+printf("After zap done, the append sector is 0x%" PRIx64 "\n",
+   tosector(offset));
+}
+
+out:
+qemu_io_free(blk, buf, qiov.size,
+ flags & BDRV_REQ_REGISTERED_BUF);
+qemu_iovec_destroy(&qiov);
+return ret;
+}
+
+static const cmdinfo_t zone_append_cmd = {
+.name = "zone_append",
+.altname = "zap",
+.cfunc = zone_append_f,
+.argmin = 3,
+.argmax = 4,
+.args = "offset len [len..]",
+.oneline = "append write a number of bytes at a specified offset",
+};
+
 static int truncate_f(BlockBackend *blk, int argc, char **argv);
 static const cmdinfo_t truncate_cmd = {
 .name   = "truncate",
@@ -2672,6 +2746,7 @@ static void __attribute((constructor)) 
init_qemuio_commands(void)
 qemuio_add_command(&zone_close_cmd);
 qemuio_add_command(&zone_finish_cmd);
 qemuio_add_command(&zone_reset_cmd);
+qemuio_add_command(&zone_append_cmd);
 qemuio_add_command(&truncate_cmd);
 qemuio_add_command(&length_cmd);
 qemuio_add_command(&info_cmd);
diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
index 56f60616b5..3d23ce9cc1 100755
--- a/tests/qemu-iotests/tests/zoned
+++ b/tests/qemu-iotests/tests/zoned
@@ -82,6 +82,22 @@ echo "(5) resetting the second zone"
 $QEMU_IO $IMG -c "zrs 268435456 268435456"
 echo "After resetting a zone:"
 $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(6) append write" # the physical block size of the device is 4096
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
+echo "After appending the first zone firstly:"
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
+echo "After appending the first zone secondly:"
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
+echo "After appending the second zone firstly:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
+echo "After appending the second zone secondly:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
index b2d061da49..fe53ba4744 100644
--- a/tests/qemu-iotests/tests/zoned.out
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -50,4 +50,20 @@ start: 0x8, len 0x8, cap 0x8, wptr 0x10, 
zcond:14, [type: 2]
 (5) resetting the second zone
 After resetting a zone:
 start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type:

[PATCH v8 2/4] block: introduce zone append write for zoned devices

2023-04-04 Thread Sam Li
A zone append command is a write operation that specifies the first
logical block of a zone as the write position. When writing to a zoned
block device using zone append, the byte offset of the call may point at
any position within the zone to which the data is being appended. Upon
completion the device will respond with the position where the data has
been written in the zone.
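
The semantics described above can be sketched with a tracked write
pointer. This is a minimal standalone illustration under assumptions,
not the implementation in this series: struct sketch_zone and
sketch_zone_append() are hypothetical, offsets are in bytes, and
locking, alignment and the actual data write are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-zone state. All values are byte offsets. */
struct sketch_zone {
    int64_t start;  /* first byte of the zone */
    int64_t len;    /* zone size */
    int64_t wp;     /* write pointer, start <= wp <= start + len */
};

/*
 * Emulate a zone append. On entry *offset may point anywhere inside
 * the zone. On success the function returns 0 and stores in *offset
 * the position where the data landed (the old write pointer). It
 * returns -1 if the offset is outside the zone or the data does not
 * fit in the remaining space.
 */
int sketch_zone_append(struct sketch_zone *z, int64_t *offset,
                       int64_t nbytes)
{
    assert(z != NULL && offset != NULL);

    if (*offset < z->start || *offset >= z->start + z->len) {
        return -1;
    }
    if (nbytes < 0 || nbytes > z->start + z->len - z->wp) {
        return -1;
    }

    *offset = z->wp;   /* report where the data was written */
    z->wp += nbytes;   /* a regular write at the old wp would happen here */
    return 0;
}
```

The caller does not choose the write position. It only names the zone,
and learns the actual position from the value returned in *offset.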

Signed-off-by: Sam Li 
Reviewed-by: Dmitry Fomichev 
---
 block/block-backend.c | 60 +++
 block/file-posix.c| 56 +
 block/io.c| 27 ++
 block/io_uring.c  |  4 +++
 block/linux-aio.c |  3 ++
 block/raw-format.c|  8 +
 include/block/block-io.h  |  4 +++
 include/block/block_int-common.h  |  3 ++
 include/block/raw-aio.h   |  4 ++-
 include/sysemu/block-backend-io.h |  9 +
 10 files changed, 171 insertions(+), 7 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index f70b08e3f6..bcb3a1eff0 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1888,6 +1888,45 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 return &acb->common;
 }
 
+static void coroutine_fn blk_aio_zone_append_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_append(rwco->blk, (int64_t *)acb->bytes,
+   rwco->iobuf, rwco->flags);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags,
+BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.ret= NOT_DONE,
+.flags  = flags,
+.iobuf  = qiov,
+};
+acb->bytes = (int64_t)offset;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
 /*
  * Send a zone_report command.
  * offset is a byte offset from the start of the device. No alignment
@@ -1939,6 +1978,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 return ret;
 }
 
+/*
+ * Send a zone_append command.
+ */
+int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+
+ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
+blk_dec_in_flight(blk);
+return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
diff --git a/block/file-posix.c b/block/file-posix.c
index bc58f7193b..a7130b1024 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -160,6 +160,7 @@ typedef struct BDRVRawState {
 bool has_write_zeroes:1;
 bool use_linux_aio:1;
 bool use_linux_io_uring:1;
+int64_t *offset; /* offset of zone append operation */
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
 bool needs_alignment;
@@ -1685,7 +1686,7 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData 
*aiocb)
 ssize_t len;
 
 len = RETRY_ON_EINTR(
-(aiocb->aio_type & QEMU_AIO_WRITE) ?
+(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ?
 qemu_pwritev(aiocb->aio_fildes,
aiocb->io.iov,
aiocb->io.niov,
@@ -1714,7 +1715,7 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData 
*aiocb, char *buf)
 ssize_t len;
 
 while (offset < aiocb->aio_nbytes) {
-if (aiocb->aio_type & QEMU_AIO_WRITE) {
+if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
 len = pwrite(aiocb->aio_fildes,
  (const char *)buf + offset,
  aiocb->aio_nbytes - offset,
@@ -1807,7 +1808,7 @@ static int handle_aiocb_rw(void *opaque)
 }
 
 nbytes = handle_aiocb_rw_linear(aiocb, buf);
-if (!(aiocb->aio_type & QEMU_AIO_WRITE)) {
+if (!(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
 char *p = buf;
 size_t count = a

[PATCH v8 0/4] Add zone append write for zoned device

2023-04-04 Thread Sam Li
This patch series adds the zone append operation on top of the previous
zoned device support series. The file-posix driver is modified to
add zone append emulation using regular writes.

v8:
- address review comments [Stefan]
  * fix zone_mgmt covering multiple zones case
  * fix memory leak bug of wps in refresh_limits()
  * mv BlockZoneWps field from BlockLimits to BlockDriverState
  * add check_qiov_request() to bdrv_co_zone_append

v7:
- address review comments
  * fix wp assignment [Stefan]
  * fix reset_all cases, skip R/O & offline zones [Dmitry, Damien]
  * fix locking on non-zap related cases [Stefan]
  * cleanups and typos correction
- add "zap -p" option to qemuio-cmds [Stefan]

v6:
- add small fixes

v5:
- fix locking conditions and error handling
- drop some trivial optimizations
- add tracing points for zone append

v4:
- fix lock related issues [Damien]
- drop all field in zone_mgmt op [Damien]
- fix state checks in zone_mgmt command [Damien]
- return start sector of wp when issuing zap req [Damien]

v3:
- only read wps when it is locked [Damien]
- allow last smaller zone case [Damien]
- add zone type and state checks in zone_mgmt command [Damien]
- fix RESET_ALL related problems

v2:
- split patch to two patches for better reviewing
- change BlockZoneWps's structure to an array of integers
- use only mutex lock on locking conditions of zone wps
- coding styles and clean-ups

v1:
- introduce zone append write

Sam Li (4):
  file-posix: add tracking of the zone write pointers
  block: introduce zone append write for zoned devices
  qemu-iotests: test zone append operation
  block: add some trace events for zone append

 block/block-backend.c  |  60 
 block/file-posix.c | 221 -
 block/io.c |  27 
 block/io_uring.c   |   4 +
 block/linux-aio.c  |   3 +
 block/raw-format.c |   8 ++
 block/trace-events |   2 +
 include/block/block-common.h   |  14 ++
 include/block/block-io.h   |   4 +
 include/block/block_int-common.h   |   8 ++
 include/block/raw-aio.h|   4 +-
 include/sysemu/block-backend-io.h  |   9 ++
 qemu-io-cmds.c |  75 ++
 tests/qemu-iotests/tests/zoned |  16 +++
 tests/qemu-iotests/tests/zoned.out |  16 +++
 15 files changed, 464 insertions(+), 7 deletions(-)

-- 
2.39.2




[PATCH v8 1/4] file-posix: add tracking of the zone write pointers

2023-04-04 Thread Sam Li
Since Linux doesn't have an API to issue zone append operations to
zoned devices from user space, the file-posix driver is modified to add
zone append emulation using regular writes. To do this, the file-posix
driver tracks the wp location of all zones of the device. It uses an
array of uint64_t. The most significant bit of each wp location indicates
whether the zone is a conventional zone.

A zone's wp changes when one of the following operations is issued:
- zone reset: change the wp to the start offset of that zone
- zone finish: change to the end location of that zone
- write to a zone
- zone append
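
The flag-in-the-top-bit encoding described above can be sketched roughly as
follows. This is only an illustration: WP_CONV_FLAG and the wp_* helpers are
hypothetical names, not the ones used by the patch (which tests the bit with
BDRV_ZT_IS_CONV()), and the 512-byte sector size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define WP_CONV_FLAG (1ULL << 63)   /* set for conventional zones */
#define WP_SECTOR_BITS 9            /* assumed 512-byte sectors */

/* Convert a sector number to a byte-based wp entry. */
static uint64_t wp_from_sector(uint64_t sector)
{
    /* The byte offset must not reach bit 63, which is reserved. */
    assert((sector >> (63 - WP_SECTOR_BITS)) == 0);
    return sector << WP_SECTOR_BITS;
}

/* OR sets the flag; AND with the mask would clear the location bits. */
static uint64_t wp_mark_conventional(uint64_t wp)
{
    return wp | WP_CONV_FLAG;
}

static bool wp_is_conventional(uint64_t wp)
{
    return (wp & WP_CONV_FLAG) != 0;
}

/* Strip the flag to recover the plain byte offset. */
static uint64_t wp_byte_offset(uint64_t wp)
{
    return wp & ~WP_CONV_FLAG;
}
```

Keeping the type in the same word as the location means one array lookup
answers both "is this zone conventional?" and "where is its wp?", at the cost
of one bit of address range.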

Signed-off-by: Sam Li 
---
 block/file-posix.c   | 168 ++-
 include/block/block-common.h |  14 +++
 include/block/block_int-common.h |   5 +
 3 files changed, 184 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 65efe5147e..bc58f7193b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1324,6 +1324,88 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 #endif
 }
 
+#if defined(CONFIG_BLKZONED)
+/*
+ * If the reset_all flag is true, then the wps of zone whose state is
+ * not readonly or offline should be all reset to the start sector.
+ * Else, take the real wp of the device.
+ */
+static int get_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
+unsigned int nrz, bool reset_all)
+{
+struct blk_zone *blkz;
+size_t rep_size;
+uint64_t sector = offset >> BDRV_SECTOR_BITS;
+int ret, n = 0, i = 0;
+rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+g_autofree struct blk_zone_report *rep = NULL;
+
+rep = g_malloc(rep_size);
+blkz = (struct blk_zone *)(rep + 1);
+while (n < nrz) {
+memset(rep, 0, rep_size);
+rep->sector = sector;
+rep->nr_zones = nrz - n;
+
+do {
+ret = ioctl(fd, BLKREPORTZONE, rep);
+} while (ret != 0 && errno == EINTR);
+if (ret != 0) {
+error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+fd, offset, errno);
+return -errno;
+}
+
+if (!rep->nr_zones) {
+break;
+}
+
+for (i = 0; i < rep->nr_zones; i++, n++) {
+/*
+ * The wp tracking cares only about sequential writes required and
+ * sequential write preferred zones so that the wp can advance to
+ * the right location.
+ * Use the most significant bit of the wp location to indicate the
+ * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
+ */
+if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+wps->wp[i] &= 1ULL << 63;
+} else {
+switch(blkz[i].cond) {
+case BLK_ZONE_COND_FULL:
+case BLK_ZONE_COND_READONLY:
+/* Zone not writable */
+wps->wp[i] = (blkz[i].start + blkz[i].len) << 
BDRV_SECTOR_BITS;
+break;
+case BLK_ZONE_COND_OFFLINE:
+/* Zone not writable nor readable */
+wps->wp[i] = (blkz[i].start) << BDRV_SECTOR_BITS;
+break;
+default:
+if (reset_all) {
+wps->wp[i] = blkz[i].start << BDRV_SECTOR_BITS;
+} else {
+wps->wp[i] = blkz[i].wp << BDRV_SECTOR_BITS;
+}
+break;
+}
+}
+}
+sector = blkz[i - 1].start + blkz[i - 1].len;
+}
+
+return 0;
+}
+
+static void update_zones_wp(int fd, BlockZoneWps *wps, int64_t offset,
+unsigned int nrz)
+{
+if (get_zones_wp(fd, wps, offset, nrz, 0) < 0) {
+error_report("update zone wp failed");
+}
+}
+#endif
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
@@ -1413,6 +1495,23 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 if (ret >= 0) {
 bs->bl.max_active_zones = ret;
 }
+
+ret = get_sysfs_long_val(&st, "physical_block_size");
+if (ret >= 0) {
+bs->bl.write_granularity = ret;
+}
+
+/* The refresh_limits() function can be called multiple times. */
+bs->wps = NULL;
+bs->wps = g_malloc(sizeof(BlockZoneWps) +
+sizeof(int64_t) * bs->bl.nr_zones);
+ret = get_zones_wp(s->fd, bs->wps, 0, bs->bl.nr_zones, 0);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "report wps failed");
+bs->wps = NULL;
+return;
+ 

[PATCH v8 4/4] block: add some trace events for zone append

2023-04-04 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Dmitry Fomichev 
Reviewed-by: Stefan Hajnoczi 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index a7130b1024..825301467e 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2502,6 +2502,8 @@ out:
 if (!BDRV_ZT_IS_CONV(*wp)) {
 if (type & QEMU_AIO_ZONE_APPEND) {
 *s->offset = *wp;
+trace_zbd_zone_append_complete(bs, *s->offset
+>> BDRV_SECTOR_BITS);
 }
 /* Advance the wp if needed */
 if (offset + bytes > *wp) {
@@ -3546,6 +3548,7 @@ static int coroutine_fn 
raw_co_zone_append(BlockDriverState *bs,
 len += iov_len;
 }
 
+trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
 return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
 }
 #endif
diff --git a/block/trace-events b/block/trace-events
index 3f4e1d088a..32665158d6 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -211,6 +211,8 @@ file_hdev_is_sg(int type, int version) "SG device found: 
type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
 zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
 zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
+zbd_zone_append(void *bs, int64_t sector) "bs %p append at sector offset 0x%" 
PRIx64 ""
+zbd_zone_append_complete(void *bs, int64_t sector) "bs %p returns append 
sector 0x%" PRIx64 ""
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.39.2




Re: [PATCH v9 0/5] Add zoned storage emulation to virtio-blk driver

2023-04-04 Thread Sam Li
On Mon, Apr 3, 2023 at 20:18, Stefan Hajnoczi wrote:
>
> On Wed, 29 Mar 2023 at 01:01, Michael S. Tsirkin  wrote:
> >
> > On Mon, Mar 27, 2023 at 10:45:48PM +0800, Sam Li wrote:
> >
> > virtio bits look ok.
> >
> > Reviewed-by: Michael S. Tsirkin 
> >
> > merge through block layer tree I'm guessing?
>
> Sounds good. Thank you!

Hi Stefan,

I've sent v8 of the zone append write series to the list, in which the wps
field moves to BlockDriverState. This requires a small change to the
emulation code in hw/block/virtio-blk.c of [2/5] virtio-blk: add zoned
storage emulation for zoned devices:
- if (BDRV_ZT_IS_CONV(bs->bl.wps->wp[index])) {
+ if (BDRV_ZT_IS_CONV(bs->wps->wp[index])) {

Please let me know if you prefer a new version or not.

Thanks,
Sam



[PATCH v9 4/4] block: add some trace events for zone append

2023-04-07 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Dmitry Fomichev 
Reviewed-by: Stefan Hajnoczi 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 4e26641ce0..da986a33fd 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2504,6 +2504,8 @@ out:
 if (!BDRV_ZT_IS_CONV(*wp)) {
 if (type & QEMU_AIO_ZONE_APPEND) {
 *s->offset = *wp;
+trace_zbd_zone_append_complete(bs, *s->offset
+>> BDRV_SECTOR_BITS);
 }
 /* Advance the wp if needed */
 if (offset + bytes > *wp) {
@@ -3551,6 +3553,7 @@ static int coroutine_fn 
raw_co_zone_append(BlockDriverState *bs,
 len += iov_len;
 }
 
+trace_zbd_zone_append(bs, *offset >> BDRV_SECTOR_BITS);
 return raw_co_prw(bs, *offset, len, qiov, QEMU_AIO_ZONE_APPEND);
 }
 #endif
diff --git a/block/trace-events b/block/trace-events
index 3f4e1d088a..32665158d6 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -211,6 +211,8 @@ file_hdev_is_sg(int type, int version) "SG device found: 
type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
 zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
 zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
+zbd_zone_append(void *bs, int64_t sector) "bs %p append at sector offset 0x%" 
PRIx64 ""
+zbd_zone_append_complete(void *bs, int64_t sector) "bs %p returns append 
sector 0x%" PRIx64 ""
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.39.2




[PATCH v9 3/4] qemu-iotests: test zone append operation

2023-04-07 Thread Sam Li
The patch tests zone append writes by reporting the zone wp after
the completion of the call. The "zap -p" option prints the sector
offset after completion, which should be the start sector
where the append write begins.
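
As a rough aid for reading the test output, the sector that each append is
expected to report can be worked out by hand. The sketch below assumes
512-byte sectors (BDRV_SECTOR_BITS is 9), that the zone is empty before the
first append, and that every append writes the same number of bytes;
to_sector() and expected_append_sector() are hypothetical helpers, not part
of qemu-io.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BITS 9   /* 512-byte sectors */

/* Byte offset -> sector number. */
static int64_t to_sector(int64_t bytes)
{
    assert(bytes >= 0);
    return bytes >> SECTOR_BITS;
}

/*
 * Sector that the n-th append (n = 0 for the first one) should report
 * for an initially empty zone starting at zone_start_bytes.
 */
static int64_t expected_append_sector(int64_t zone_start_bytes, int n,
                                      int64_t bytes_per_append)
{
    assert(n >= 0);
    return to_sector(zone_start_bytes + (int64_t)n * bytes_per_append);
}
```

With the two lengths 0x1000 and 0x2000 used by the test, each append writes
0x3000 bytes, i.e. 0x18 sectors. Under the assumptions above, the appends to
the zone at byte offset 0 would report sectors 0x0 and 0x18, and the appends
to the zone at byte offset 268435456 would report 0x80000 and 0x80018.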

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 qemu-io-cmds.c | 75 ++
 tests/qemu-iotests/tests/zoned | 16 +++
 tests/qemu-iotests/tests/zoned.out | 16 +++
 3 files changed, 107 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index f35ea627d7..3f75d2f5a6 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -1874,6 +1874,80 @@ static const cmdinfo_t zone_reset_cmd = {
 .oneline = "reset a zone write pointer in zone block device",
 };
 
+static int do_aio_zone_append(BlockBackend *blk, QEMUIOVector *qiov,
+  int64_t *offset, int flags, int *total)
+{
+int async_ret = NOT_DONE;
+
+blk_aio_zone_append(blk, offset, qiov, flags, aio_rw_done, &async_ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+*total = qiov->size;
+return async_ret < 0 ? async_ret : 1;
+}
+
+static int zone_append_f(BlockBackend *blk, int argc, char **argv)
+{
+int ret;
+bool pflag = false;
+int flags = 0;
+int total = 0;
+int64_t offset;
+char *buf;
+int c, nr_iov;
+int pattern = 0xcd;
+QEMUIOVector qiov;
+
+if (optind > argc - 3) {
+return -EINVAL;
+}
+
+if ((c = getopt(argc, argv, "p")) != -1) {
+pflag = true;
+}
+
+offset = cvtnum(argv[optind]);
+if (offset < 0) {
+print_cvtnum_err(offset, argv[optind]);
+return offset;
+}
+optind++;
+nr_iov = argc - optind;
+buf = create_iovec(blk, &qiov, &argv[optind], nr_iov, pattern,
+   flags & BDRV_REQ_REGISTERED_BUF);
+if (buf == NULL) {
+return -EINVAL;
+}
+ret = do_aio_zone_append(blk, &qiov, &offset, flags, &total);
+if (ret < 0) {
+printf("zone append failed: %s\n", strerror(-ret));
+goto out;
+}
+
+if (pflag) {
+printf("After zap done, the append sector is 0x%" PRIx64 "\n",
+   tosector(offset));
+}
+
+out:
+qemu_io_free(blk, buf, qiov.size,
+ flags & BDRV_REQ_REGISTERED_BUF);
+qemu_iovec_destroy(&qiov);
+return ret;
+}
+
+static const cmdinfo_t zone_append_cmd = {
+.name = "zone_append",
+.altname = "zap",
+.cfunc = zone_append_f,
+.argmin = 3,
+.argmax = 4,
+.args = "offset len [len..]",
+.oneline = "append write a number of bytes at a specified offset",
+};
+
 static int truncate_f(BlockBackend *blk, int argc, char **argv);
 static const cmdinfo_t truncate_cmd = {
 .name   = "truncate",
@@ -2672,6 +2746,7 @@ static void __attribute((constructor)) 
init_qemuio_commands(void)
 qemuio_add_command(&zone_close_cmd);
 qemuio_add_command(&zone_finish_cmd);
 qemuio_add_command(&zone_reset_cmd);
+qemuio_add_command(&zone_append_cmd);
 qemuio_add_command(&truncate_cmd);
 qemuio_add_command(&length_cmd);
 qemuio_add_command(&info_cmd);
diff --git a/tests/qemu-iotests/tests/zoned b/tests/qemu-iotests/tests/zoned
index 56f60616b5..3d23ce9cc1 100755
--- a/tests/qemu-iotests/tests/zoned
+++ b/tests/qemu-iotests/tests/zoned
@@ -82,6 +82,22 @@ echo "(5) resetting the second zone"
 $QEMU_IO $IMG -c "zrs 268435456 268435456"
 echo "After resetting a zone:"
 $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo
+echo "(6) append write" # the physical block size of the device is 4096
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
+echo "After appending the first zone firstly:"
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 0 0x1000 0x2000"
+echo "After appending the first zone secondly:"
+$QEMU_IO $IMG -c "zrp 0 1"
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
+echo "After appending the second zone firstly:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
+$QEMU_IO $IMG -c "zap -p 268435456 0x1000 0x2000"
+echo "After appending the second zone secondly:"
+$QEMU_IO $IMG -c "zrp 268435456 1"
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
index b2d061da49..fe53ba4744 100644
--- a/tests/qemu-iotests/tests/zoned.out
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -50,4 +50,20 @@ start: 0x8, len 0x8, cap 0x8, wptr 0x10, 
zcond:14, [type: 2]
 (5) resetting the second zone
 After resetting a zone:
 start: 0x8, len 0x8, cap 0x8, wptr 0x8, zcond:1, [type:

[PATCH v9 2/4] block: introduce zone append write for zoned devices

2023-04-07 Thread Sam Li
A zone append command is a write operation that specifies the first
logical block of a zone as the write position. When writing to a zoned
block device using zone append, the byte offset of the call may point at
any position within the zone to which the data is being appended. Upon
completion the device will respond with the position where the data has
been written in the zone.
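
The semantics above can be modelled in a few lines. This is a sketch under
stated assumptions rather than the block layer code: ZoneModel and
zone_append_model() are hypothetical, everything is in bytes, and the error
codes are picked for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical in-memory model of one sequential zone (byte units). */
typedef struct {
    int64_t start;  /* first byte of the zone */
    int64_t len;    /* zone length in bytes */
    int64_t wp;     /* current write pointer */
} ZoneModel;

/*
 * Model a zone append of `bytes` bytes. On entry *offset may point
 * anywhere inside the zone. On success *offset is overwritten with the
 * position where the data landed (the old write pointer) and the write
 * pointer advances. Returns 0 or a negative errno value.
 */
static int zone_append_model(ZoneModel *z, int64_t *offset, int64_t bytes)
{
    assert(z && offset && bytes >= 0);
    if (*offset < z->start || *offset >= z->start + z->len) {
        return -EINVAL;     /* offset does not select this zone */
    }
    if (bytes > z->start + z->len - z->wp) {
        return -ENOSPC;     /* not enough room left in the zone */
    }
    *offset = z->wp;
    z->wp += bytes;
    return 0;
}
```

The in/out offset parameter mirrors the int64_t *offset argument that the
new blk_co_zone_append() takes: the caller only names the zone, and reads
back where the data actually went.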

Signed-off-by: Sam Li 
Reviewed-by: Dmitry Fomichev 
Reviewed-by: Stefan Hajnoczi 
---
 block/block-backend.c | 60 +++
 block/file-posix.c| 58 ++
 block/io.c| 27 ++
 block/io_uring.c  |  4 +++
 block/linux-aio.c |  3 ++
 block/raw-format.c|  8 +
 include/block/block-io.h  |  4 +++
 include/block/block_int-common.h  |  3 ++
 include/block/raw-aio.h   |  4 ++-
 include/sysemu/block-backend-io.h |  9 +
 10 files changed, 172 insertions(+), 8 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index f70b08e3f6..bcb3a1eff0 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1888,6 +1888,45 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 return &acb->common;
 }
 
+static void coroutine_fn blk_aio_zone_append_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_append(rwco->blk, (int64_t *)acb->bytes,
+   rwco->iobuf, rwco->flags);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags,
+BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.ret= NOT_DONE,
+.flags  = flags,
+.iobuf  = qiov,
+};
+acb->bytes = (int64_t)offset;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
+aio_co_enter(blk_get_aio_context(blk), co);
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
 /*
  * Send a zone_report command.
  * offset is a byte offset from the start of the device. No alignment
@@ -1939,6 +1978,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
BlockZoneOp op,
 return ret;
 }
 
+/*
+ * Send a zone_append command.
+ */
+int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
+QEMUIOVector *qiov, BdrvRequestFlags flags)
+{
+int ret;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+blk_wait_while_drained(blk);
+if (!blk_is_available(blk)) {
+blk_dec_in_flight(blk);
+return -ENOMEDIUM;
+}
+
+ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
+blk_dec_in_flight(blk);
+return ret;
+}
+
 void blk_drain(BlockBackend *blk)
 {
 BlockDriverState *bs = blk_bs(blk);
diff --git a/block/file-posix.c b/block/file-posix.c
index e7957f5559..4e26641ce0 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -160,6 +160,7 @@ typedef struct BDRVRawState {
 bool has_write_zeroes:1;
 bool use_linux_aio:1;
 bool use_linux_io_uring:1;
+int64_t *offset; /* offset of zone append operation */
 int page_cache_inconsistent; /* errno from fdatasync failure */
 bool has_fallocate;
 bool needs_alignment;
@@ -1687,7 +1688,7 @@ static ssize_t handle_aiocb_rw_vector(RawPosixAIOData 
*aiocb)
 ssize_t len;
 
 len = RETRY_ON_EINTR(
-(aiocb->aio_type & QEMU_AIO_WRITE) ?
+(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) ?
 qemu_pwritev(aiocb->aio_fildes,
aiocb->io.iov,
aiocb->io.niov,
@@ -1716,7 +1717,7 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData 
*aiocb, char *buf)
 ssize_t len;
 
 while (offset < aiocb->aio_nbytes) {
-if (aiocb->aio_type & QEMU_AIO_WRITE) {
+if (aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND)) {
 len = pwrite(aiocb->aio_fildes,
  (const char *)buf + offset,
  aiocb->aio_nbytes - offset,
@@ -1809,7 +1810,7 @@ static int handle_aiocb_rw(void *opaque)
 }
 
 nbytes = handle_aiocb_rw_linear(aiocb, buf);
-if (!(aiocb->aio_type & QEMU_AIO_WRITE)) {
+if (!(aiocb->aio_type & (QEMU_AIO_WRITE | QEMU_AIO_ZONE_APPEND))) {
 char *p 

[PATCH v9 0/4] Add zone append write for zoned device

2023-04-07 Thread Sam Li
This patch series adds the zone append operation on top of the previous
zoned device support series. The file-posix driver is modified to
add zone append emulation using regular writes.

v9:
- address review comments [Stefan]
  * fix get_zones_wp() for wrong offset index
  * fix misuses of QEMU_LOCK_GUARD()
  * free and allocate wps in refresh_limits for now

v8:
- address review comments [Stefan]
  * fix zone_mgmt covering multiple zones case
  * fix memory leak bug of wps in refresh_limits()
  * mv BlockZoneWps field from BlockLimits to BlockDriverState
  * add check_qiov_request() to bdrv_co_zone_append

v7:
- address review comments
  * fix wp assignment [Stefan]
  * fix reset_all cases, skip R/O & offline zones [Dmitry, Damien]
  * fix locking on non-zap related cases [Stefan]
  * cleanups and typos correction
- add "zap -p" option to qemuio-cmds [Stefan]

v6:
- add small fixes

v5:
- fix locking conditions and error handling
- drop some trivial optimizations
- add tracing points for zone append

v4:
- fix lock related issues [Damien]
- drop all field in zone_mgmt op [Damien]
- fix state checks in zone_mgmt command [Damien]
- return start sector of wp when issuing zap req [Damien]

v3:
- only read wps when it is locked [Damien]
- allow last smaller zone case [Damien]
- add zone type and state checks in zone_mgmt command [Damien]
- fix RESET_ALL related problems

v2:
- split patch to two patches for better reviewing
- change BlockZoneWps's structure to an array of integers
- use only mutex lock on locking conditions of zone wps
- coding styles and clean-ups

v1:
- introduce zone append write

Sam Li (4):
  file-posix: add tracking of the zone write pointers
  block: introduce zone append write for zoned devices
  qemu-iotests: test zone append operation
  block: add some trace events for zone append

 block/block-backend.c  |  60 
 block/file-posix.c | 226 -
 block/io.c |  27 
 block/io_uring.c   |   4 +
 block/linux-aio.c  |   3 +
 block/raw-format.c |   8 +
 block/trace-events |   2 +
 include/block/block-common.h   |  14 ++
 include/block/block-io.h   |   4 +
 include/block/block_int-common.h   |   8 +
 include/block/raw-aio.h|   4 +-
 include/sysemu/block-backend-io.h  |   9 ++
 qemu-io-cmds.c |  75 ++
 tests/qemu-iotests/tests/zoned |  16 ++
 tests/qemu-iotests/tests/zoned.out |  16 ++
 15 files changed, 469 insertions(+), 7 deletions(-)

-- 
2.39.2




[PATCH v9 1/4] file-posix: add tracking of the zone write pointers

2023-04-07 Thread Sam Li
Since Linux doesn't have an API to issue zone append operations to
zoned devices from user space, the file-posix driver is modified to add
zone append emulation using regular writes. To do this, the file-posix
driver tracks the wp location of all zones of the device. It uses an
array of uint64_t. The most significant bit of each wp location indicates
whether the zone is a conventional zone.

A zone's wp changes when one of the following operations is issued:
- zone reset: change the wp to the start offset of that zone
- zone finish: change to the end location of that zone
- write to a zone
- zone append
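
The wp transitions listed above could be summarised as a single function.
This is a minimal sketch, assuming byte units and a sequential zone;
ZoneOpModel and wp_after() are hypothetical names. Zone append is not a
separate case here because, once emulated, it behaves as a regular write
placed at the current wp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical operation codes, for illustration only. */
typedef enum { ZOP_RESET, ZOP_FINISH, ZOP_WRITE } ZoneOpModel;

/*
 * Return the new write pointer of a sequential zone after `op`.
 * For ZOP_WRITE, [offset, offset + bytes) is the written range; the
 * other operations ignore offset and bytes.
 */
static int64_t wp_after(ZoneOpModel op, int64_t zone_start, int64_t zone_len,
                        int64_t wp, int64_t offset, int64_t bytes)
{
    assert(wp >= zone_start && wp <= zone_start + zone_len);
    switch (op) {
    case ZOP_RESET:
        return zone_start;              /* back to the start of the zone */
    case ZOP_FINISH:
        return zone_start + zone_len;   /* jump to the end of the zone */
    case ZOP_WRITE:
        /* Advance only if the write ends past the current wp. */
        return offset + bytes > wp ? offset + bytes : wp;
    }
    return wp;
}
```

The "advance only if needed" rule in the write case follows the same
comparison the driver makes before updating the tracked wp
(offset + bytes > *wp).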

Signed-off-by: Sam Li 
---
 block/file-posix.c   | 173 ++-
 include/block/block-common.h |  14 +++
 include/block/block_int-common.h |   5 +
 3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 65efe5147e..e7957f5559 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1324,6 +1324,90 @@ static int hdev_get_max_segments(int fd, struct stat *st)
 #endif
 }
 
+#if defined(CONFIG_BLKZONED)
+/*
+ * If the reset_all flag is true, then the wps of zone whose state is
+ * not readonly or offline should be all reset to the start sector.
+ * Else, take the real wp of the device.
+ */
+static int get_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
+unsigned int nrz, bool reset_all)
+{
+struct blk_zone *blkz;
+size_t rep_size;
+uint64_t sector = offset >> BDRV_SECTOR_BITS;
+BlockZoneWps *wps = bs->wps;
+int j = offset / bs->bl.zone_size;
+int ret, n = 0, i = 0;
+rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct blk_zone);
+g_autofree struct blk_zone_report *rep = NULL;
+
+rep = g_malloc(rep_size);
+blkz = (struct blk_zone *)(rep + 1);
+while (n < nrz) {
+memset(rep, 0, rep_size);
+rep->sector = sector;
+rep->nr_zones = nrz - n;
+
+do {
+ret = ioctl(fd, BLKREPORTZONE, rep);
+} while (ret != 0 && errno == EINTR);
+if (ret != 0) {
+error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed %d",
+fd, offset, errno);
+return -errno;
+}
+
+if (!rep->nr_zones) {
+break;
+}
+
+for (i = 0; i < rep->nr_zones; ++i, ++n, ++j) {
+/*
+ * The wp tracking cares only about sequential writes required and
+ * sequential write preferred zones so that the wp can advance to
+ * the right location.
+ * Use the most significant bit of the wp location to indicate the
+ * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
+ */
+if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+wps->wp[j] |= 1ULL << 63;
+} else {
+switch(blkz[i].cond) {
+case BLK_ZONE_COND_FULL:
+case BLK_ZONE_COND_READONLY:
+/* Zone not writable */
+wps->wp[j] = (blkz[i].start + blkz[i].len) << 
BDRV_SECTOR_BITS;
+break;
+case BLK_ZONE_COND_OFFLINE:
+/* Zone not writable nor readable */
+wps->wp[j] = (blkz[i].start) << BDRV_SECTOR_BITS;
+break;
+default:
+if (reset_all) {
+wps->wp[j] = blkz[i].start << BDRV_SECTOR_BITS;
+} else {
+wps->wp[j] = blkz[i].wp << BDRV_SECTOR_BITS;
+}
+break;
+}
+}
+}
+sector = blkz[i - 1].start + blkz[i - 1].len;
+}
+
+return 0;
+}
+
+static void update_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
+unsigned int nrz)
+{
+if (get_zones_wp(bs, fd, offset, nrz, 0) < 0) {
+error_report("update zone wp failed");
+}
+}
+#endif
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
@@ -1413,6 +1497,23 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 if (ret >= 0) {
 bs->bl.max_active_zones = ret;
 }
+
+ret = get_sysfs_long_val(&st, "physical_block_size");
+if (ret >= 0) {
+bs->bl.write_granularity = ret;
+}
+
+/* The refresh_limits() function can be called multiple times. */
+g_free(bs->wps);
+bs->wps = g_malloc(sizeof(BlockZoneWps) +
+sizeof(int64_t) * bs->bl.nr_zones);
+ret = get_zones_wp(bs, s->fd, 0, bs->bl.nr_zones, 0);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "repor

[PATCH v10 0/5] Add zoned storage emulation to virtio-blk driver

2023-04-07 Thread Sam Li
This patch series adds zoned storage emulation to the virtio-blk driver. It
implements the virtio-blk ZBD support standardization that was
recently accepted by virtio-spec. The related commit is at

https://github.com/oasis-tcs/virtio-spec/commit/b4e8efa0fa6c8d844328090ad15db65af8d7d981

The Linux zoned device code implemented by Dmitry Fomichev was
released in Linux v6.3-rc1.

Aside: adding options such as zoned=on to the virtio-blk device will be
considered in a follow-up.

v10:
- adapt to the latest zone-append patches: rename bs->bl.wps to bs->wps

v9:
- address review comments
  * add docs for zoned emulation use case [Matias]
  * add the zoned feature bit to qmp monitor [Matias]
  * add the version number for newly added configs of accounting [Markus]

v8:
- address Stefan's review comments
  * rm aio_context_acquire/release in handle_req
  * rename function return type
  * rename BLOCK_ACCT_APPEND to BLOCK_ACCT_ZONE_APPEND for clarity

v7:
- update headers to v6.3-rc1

v6:
- address Stefan's review comments
  * add accounting for zone append operation
  * fix in_iov usage in handle_request, error handling and typos

v5:
- address Stefan's review comments
  * restore the way writing zone append result to buffer
  * fix error checking case and other errands

v4:
- change the way writing zone append request result to buffer
- change zone state, zone type value of virtio_blk_zone_descriptor
- add trace events for new zone APIs

v3:
- use qemuio_from_buffer to write status bit [Stefan]
- avoid using req->elem directly [Stefan]
- fix error checkings and memory leak [Stefan]

v2:
- change units of emulated zone op corresponding to block layer APIs
- modify error checking cases [Stefan, Damien]

v1:
- add zoned storage emulation

Sam Li (5):
  include: update virtio_blk headers to v6.3-rc1
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation
  docs/zoned-storage:add zoned emulation use case

 block/qapi-sysemu.c  |  11 +
 block/qapi.c |  18 +
 docs/devel/zoned-storage.rst |  17 +
 hw/block/trace-events|   7 +
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 405 +++
 hw/virtio/virtio-qmp.c   |   2 +
 include/block/accounting.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h|  12 +
 include/standard-headers/linux/ethtool.h |  48 ++-
 include/standard-headers/linux/fuse.h|  45 ++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 +
 linux-headers/linux/vfio.h   |  15 +-
 linux-headers/linux/vhost.h  |   8 +
 qapi/block-core.json |  68 +++-
 qapi/block.json  |   4 +
 21 files changed, 794 insertions(+), 21 deletions(-)

-- 
2.39.2




[PATCH v10 3/5] block: add accounting for zone append operation

2023-04-07 Thread Sam Li
To account for the new zone append write operation for zoned devices,
the BLOCK_ACCT_ZONE_APPEND enum value is introduced alongside the other
I/O request types (read, write, flush).
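
The general pattern, per-type counters indexed by an I/O type enum, can be
sketched as below. This is a reduced illustration and not QEMU's accounting
code: AcctType, AcctStats, acct_done() and acct_failed() are hypothetical
names, and only the array layout (nr_bytes, nr_ops, failed_ops) is borrowed
from the stats fields read in block/qapi.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced I/O type enum; ACCT_MAX sizes the arrays. */
typedef enum {
    ACCT_READ,
    ACCT_WRITE,
    ACCT_ZONE_APPEND,   /* the newly added request type */
    ACCT_FLUSH,
    ACCT_MAX
} AcctType;

typedef struct {
    uint64_t nr_bytes[ACCT_MAX];
    uint64_t nr_ops[ACCT_MAX];
    uint64_t failed_ops[ACCT_MAX];
} AcctStats;

/* Record a completed request of type t that moved `bytes` bytes. */
static void acct_done(AcctStats *s, AcctType t, uint64_t bytes)
{
    assert(s && t < ACCT_MAX);
    s->nr_bytes[t] += bytes;
    s->nr_ops[t]++;
}

/* Record a failed request of type t. */
static void acct_failed(AcctStats *s, AcctType t)
{
    assert(s && t < ACCT_MAX);
    s->failed_ops[t]++;
}
```

Because every counter is an array sized by the enum, adding a request type
grows all of them at once; the remaining work is exposing the new slots, as
the qapi changes in this patch do for the zone_append_* fields.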

Signed-off-by: Sam Li 
---
 block/qapi-sysemu.c| 11 ++
 block/qapi.c   | 18 ++
 hw/block/virtio-blk.c  |  4 +++
 include/block/accounting.h |  1 +
 qapi/block-core.json   | 68 --
 qapi/block.json|  4 +++
 6 files changed, 95 insertions(+), 11 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 7bd7554150..cec3c1afb4 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -517,6 +517,7 @@ void qmp_block_latency_histogram_set(
 bool has_boundaries, uint64List *boundaries,
 bool has_boundaries_read, uint64List *boundaries_read,
 bool has_boundaries_write, uint64List *boundaries_write,
+bool has_boundaries_append, uint64List *boundaries_append,
 bool has_boundaries_flush, uint64List *boundaries_flush,
 Error **errp)
 {
@@ -557,6 +558,16 @@ void qmp_block_latency_histogram_set(
 }
 }
 
+if (has_boundaries || has_boundaries_append) {
+ret = block_latency_histogram_set(
+stats, BLOCK_ACCT_ZONE_APPEND,
+has_boundaries_append ? boundaries_append : boundaries);
+if (ret) {
+error_setg(errp, "Device '%s' set append write boundaries fail", 
id);
+return;
+}
+}
+
 if (has_boundaries || has_boundaries_flush) {
 ret = block_latency_histogram_set(
 stats, BLOCK_ACCT_FLUSH,
diff --git a/block/qapi.c b/block/qapi.c
index c84147849d..2684484e9d 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -533,27 +533,36 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_bytes = stats->nr_bytes[BLOCK_ACCT_WRITE];
+ds->zone_append_bytes = stats->nr_bytes[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_bytes = stats->nr_bytes[BLOCK_ACCT_UNMAP];
 ds->rd_operations = stats->nr_ops[BLOCK_ACCT_READ];
 ds->wr_operations = stats->nr_ops[BLOCK_ACCT_WRITE];
+ds->zone_append_operations = stats->nr_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_operations = stats->nr_ops[BLOCK_ACCT_UNMAP];
 
 ds->failed_rd_operations = stats->failed_ops[BLOCK_ACCT_READ];
 ds->failed_wr_operations = stats->failed_ops[BLOCK_ACCT_WRITE];
+ds->failed_zone_append_operations =
+stats->failed_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->failed_flush_operations = stats->failed_ops[BLOCK_ACCT_FLUSH];
 ds->failed_unmap_operations = stats->failed_ops[BLOCK_ACCT_UNMAP];
 
 ds->invalid_rd_operations = stats->invalid_ops[BLOCK_ACCT_READ];
 ds->invalid_wr_operations = stats->invalid_ops[BLOCK_ACCT_WRITE];
+ds->invalid_zone_append_operations =
+stats->invalid_ops[BLOCK_ACCT_ZONE_APPEND];
 ds->invalid_flush_operations =
 stats->invalid_ops[BLOCK_ACCT_FLUSH];
 ds->invalid_unmap_operations = stats->invalid_ops[BLOCK_ACCT_UNMAP];
 
 ds->rd_merged = stats->merged[BLOCK_ACCT_READ];
 ds->wr_merged = stats->merged[BLOCK_ACCT_WRITE];
+ds->zone_append_merged = stats->merged[BLOCK_ACCT_ZONE_APPEND];
 ds->unmap_merged = stats->merged[BLOCK_ACCT_UNMAP];
 ds->flush_operations = stats->nr_ops[BLOCK_ACCT_FLUSH];
 ds->wr_total_time_ns = stats->total_time_ns[BLOCK_ACCT_WRITE];
+ds->zone_append_total_time_ns =
+stats->total_time_ns[BLOCK_ACCT_ZONE_APPEND];
 ds->rd_total_time_ns = stats->total_time_ns[BLOCK_ACCT_READ];
 ds->flush_total_time_ns = stats->total_time_ns[BLOCK_ACCT_FLUSH];
 ds->unmap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_UNMAP];
@@ -571,6 +580,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 TimedAverage *rd = &ts->latency[BLOCK_ACCT_READ];
 TimedAverage *wr = &ts->latency[BLOCK_ACCT_WRITE];
+TimedAverage *zap = &ts->latency[BLOCK_ACCT_ZONE_APPEND];
 TimedAverage *fl = &ts->latency[BLOCK_ACCT_FLUSH];
 
 dev_stats->interval_length = ts->interval_length;
@@ -583,6 +593,10 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 dev_stats->max_wr_latency_ns = timed_average_max(wr);
 dev_stats->avg_wr_latency_ns = timed_average_avg(wr);
 
+dev_stats->min_zone_append_latency_ns = timed_average_min(zap);
+dev_stats->max_zone_append_latency_ns = timed_average_max(zap);
+dev_stats->avg_zone_append_latency_ns = timed_average_avg(zap);
+
 dev_stats->min_flush_latency_ns = timed_average_min(fl);
 dev_stats->max_flush_latency_ns = timed_average_max(fl);
 dev_s

[PATCH v10 5/5] docs/zoned-storage:add zoned emulation use case

2023-04-07 Thread Sam Li
Add documentation with an example of using the virtio-blk driver
to pass zoned block devices through to the guest.

Signed-off-by: Sam Li 
---
 docs/devel/zoned-storage.rst | 17 +
 1 file changed, 17 insertions(+)

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
index 6a36133e51..05ecf3729c 100644
--- a/docs/devel/zoned-storage.rst
+++ b/docs/devel/zoned-storage.rst
@@ -41,3 +41,20 @@ APIs for zoned storage emulation or testing.
 For example, to test zone_report on a null_blk device using qemu-io is:
 $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
 -c "zrp offset nr_zones"
+
+To expose the host's zoned block device through virtio-blk, the command line
+can be (includes the -device parameter):
+-blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,
+cache.direct=on \
+-device virtio-blk-pci,drive=drive0
+Or only use the -drive parameter:
+-drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
+
+Additionally, QEMU has several ways of supporting zoned storage, including:
+(1) Using virtio-scsi: --device scsi-block allows for the passing through of
+SCSI ZBC devices, enabling the attachment of ZBC or ZAC HDDs to QEMU.
+(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
+purposes, it cannot yet pass through a zoned device from the host. To pass on
+the NVMe ZNS device to the guest, use VFIO PCI to pass the entire NVMe PCI
+adapter through to the guest. Likewise, an HDD HBA can be passed through to
+QEMU together with all HDDs attached to the HBA.
-- 
2.39.2




[PATCH v10 1/5] include: update virtio_blk headers to v6.3-rc1

2023-04-07 Thread Sam Li
Use scripts/update-linux-headers.sh to update headers to 6.3-rc1.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 include/standard-headers/drm/drm_fourcc.h|  12 +++
 include/standard-headers/linux/ethtool.h |  48 -
 include/standard-headers/linux/fuse.h|  45 +++-
 include/standard-headers/linux/pci_regs.h|   1 +
 include/standard-headers/linux/vhost_types.h |   2 +
 include/standard-headers/linux/virtio_blk.h  | 105 +++
 linux-headers/asm-arm64/kvm.h|   1 +
 linux-headers/asm-x86/kvm.h  |  34 +-
 linux-headers/linux/kvm.h|   9 ++
 linux-headers/linux/vfio.h   |  15 +--
 linux-headers/linux/vhost.h  |   8 ++
 11 files changed, 270 insertions(+), 10 deletions(-)

diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 69cab17b38..dc3e6112c1 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -87,6 +87,18 @@ extern "C" {
  *
  * The authoritative list of format modifier codes is found in
  * `include/uapi/drm/drm_fourcc.h`
+ *
+ * Open Source User Waiver
+ * ---
+ *
+ * Because this is the authoritative source for pixel formats and modifiers
+ * referenced by GL, Vulkan extensions and other standards and hence used both
+ * by open source and closed source driver stacks, the usual requirement for an
+ * upstream in-kernel or open source userspace user does not apply.
+ *
+ * To ensure, as much as feasible, compatibility across stacks and avoid
+ * confusion with incompatible enumerations stakeholders for all relevant driver
+ * stacks should approve additions.
  */
 
 #define fourcc_code(a, b, c, d) ((uint32_t)(a) | ((uint32_t)(b) << 8) | \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 87176ab075..99fcddf04f 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -711,6 +711,24 @@ enum ethtool_stringset {
ETH_SS_COUNT
 };
 
+/**
+ * enum ethtool_mac_stats_src - source of ethtool MAC statistics
+ * @ETHTOOL_MAC_STATS_SRC_AGGREGATE:
+ * if device supports a MAC merge layer, this retrieves the aggregate
+ * statistics of the eMAC and pMAC. Otherwise, it retrieves just the
+ * statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_EMAC:
+ * if device supports a MM layer, this retrieves the eMAC statistics.
+ * Otherwise, it retrieves the statistics of the single (express) MAC.
+ * @ETHTOOL_MAC_STATS_SRC_PMAC:
+ * if device supports a MM layer, this retrieves the pMAC statistics.
+ */
+enum ethtool_mac_stats_src {
+   ETHTOOL_MAC_STATS_SRC_AGGREGATE,
+   ETHTOOL_MAC_STATS_SRC_EMAC,
+   ETHTOOL_MAC_STATS_SRC_PMAC,
+};
+
 /**
  * enum ethtool_module_power_mode_policy - plug-in module power mode policy
  * @ETHTOOL_MODULE_POWER_MODE_POLICY_HIGH: Module is always in high power mode.
@@ -779,6 +797,31 @@ enum ethtool_podl_pse_pw_d_status {
ETHTOOL_PODL_PSE_PW_D_STATUS_ERROR,
 };
 
+/**
+ * enum ethtool_mm_verify_status - status of MAC Merge Verify function
+ * @ETHTOOL_MM_VERIFY_STATUS_UNKNOWN:
+ * verification status is unknown
+ * @ETHTOOL_MM_VERIFY_STATUS_INITIAL:
+ * the 802.3 Verify State diagram is in the state INIT_VERIFICATION
+ * @ETHTOOL_MM_VERIFY_STATUS_VERIFYING:
+ * the Verify State diagram is in the state VERIFICATION_IDLE,
+ * SEND_VERIFY or WAIT_FOR_RESPONSE
+ * @ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED:
+ * indicates that the Verify State diagram is in the state VERIFIED
+ * @ETHTOOL_MM_VERIFY_STATUS_FAILED:
+ * the Verify State diagram is in the state VERIFY_FAIL
+ * @ETHTOOL_MM_VERIFY_STATUS_DISABLED:
+ * verification of preemption operation is disabled
+ */
+enum ethtool_mm_verify_status {
+   ETHTOOL_MM_VERIFY_STATUS_UNKNOWN,
+   ETHTOOL_MM_VERIFY_STATUS_INITIAL,
+   ETHTOOL_MM_VERIFY_STATUS_VERIFYING,
+   ETHTOOL_MM_VERIFY_STATUS_SUCCEEDED,
+   ETHTOOL_MM_VERIFY_STATUS_FAILED,
+   ETHTOOL_MM_VERIFY_STATUS_DISABLED,
+};
+
 /**
  * struct ethtool_gstrings - string set for data tagging
  * @cmd: Command number = %ETHTOOL_GSTRINGS
@@ -1183,7 +1226,7 @@ struct ethtool_rxnfc {
uint32_trule_cnt;
uint32_trss_context;
};
-   uint32_trule_locs[0];
+   uint32_trule_locs[];
 };
 
 
@@ -1741,6 +1784,9 @@ enum ethtool_link_mode_bit_indices {
ETHTOOL_LINK_MODE_80baseDR8_2_Full_BIT   = 96,
ETHTOOL_LINK_MODE_80baseSR8_Full_BIT = 97,
ETHTOOL_LINK_MODE_80baseVR8_Full_BIT = 98,
+   ETHTOOL_LINK_MODE_10baseT1S_Full_BIT = 99,
+   ETHTOOL_LINK_MODE_10

[PATCH v10 4/5] virtio-blk: add some trace events for zoned emulation

2023-04-07 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 hw/block/trace-events |  7 +++
 hw/block/virtio-blk.c | 12 
 2 files changed, 19 insertions(+)

diff --git a/hw/block/trace-events b/hw/block/trace-events
index 2c45a62bd5..34be8b9135 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -44,9 +44,16 @@ pflash_write_unknown(const char *name, uint8_t cmd) "%s: unknown command 0x%02x"
 # virtio-blk.c
 virtio_blk_req_complete(void *vdev, void *req, int status) "vdev %p req %p status %d"
 virtio_blk_rw_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_report_complete(void *vdev, void *req, unsigned int nr_zones, int ret) "vdev %p req %p nr_zones %u ret %d"
+virtio_blk_zone_mgmt_complete(void *vdev, void *req, int ret) "vdev %p req %p ret %d"
+virtio_blk_zone_append_complete(void *vdev, void *req, int64_t sector, int ret) "vdev %p req %p, append sector 0x%" PRIx64 " ret %d"
 virtio_blk_handle_write(void *vdev, void *req, uint64_t sector, size_t nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_handle_read(void *vdev, void *req, uint64_t sector, size_t nsectors) "vdev %p req %p sector %"PRIu64" nsectors %zu"
 virtio_blk_submit_multireq(void *vdev, void *mrb, int start, int num_reqs, uint64_t offset, size_t size, bool is_write) "vdev %p mrb %p start %d num_reqs %d offset %"PRIu64" size %zu is_write %d"
+virtio_blk_handle_zone_report(void *vdev, void *req, int64_t sector, unsigned int nr_zones) "vdev %p req %p sector 0x%" PRIx64 " nr_zones %u"
+virtio_blk_handle_zone_mgmt(void *vdev, void *req, uint8_t op, int64_t sector, int64_t len) "vdev %p req %p op 0x%x sector 0x%" PRIx64 " len 0x%" PRIx64 ""
+virtio_blk_handle_zone_reset_all(void *vdev, void *req, int64_t sector, int64_t len) "vdev %p req %p sector 0x%" PRIx64 " cap 0x%" PRIx64 ""
+virtio_blk_handle_zone_append(void *vdev, void *req, int64_t sector) "vdev %p req %p, append sector 0x%" PRIx64 ""
 
 # hd-geometry.c
 hd_geometry_lchs_guess(void *blk, int cyls, int heads, int secs) "blk %p LCHS %d %d %d"
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index a9d3168770..7a66056c71 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -676,6 +676,7 @@ static void virtio_blk_zone_report_complete(void *opaque, 
int ret)
 int64_t nz = data->zone_report_data.nr_zones;
 int8_t err_status = VIRTIO_BLK_S_OK;
 
+trace_virtio_blk_zone_report_complete(vdev, req, nz, ret);
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
@@ -792,6 +793,8 @@ static void virtio_blk_handle_zone_report(VirtIOBlockReq 
*req,
 nr_zones = (req->in_len - sizeof(struct virtio_blk_inhdr) -
 sizeof(struct virtio_blk_zone_report)) /
sizeof(struct virtio_blk_zone_descriptor);
+trace_virtio_blk_handle_zone_report(vdev, req,
+offset >> BDRV_SECTOR_BITS, nr_zones);
 
 zone_size = sizeof(BlockZoneDescriptor) * nr_zones;
 data = g_malloc(sizeof(ZoneCmdData));
@@ -814,7 +817,9 @@ static void virtio_blk_zone_mgmt_complete(void *opaque, int 
ret)
 {
 VirtIOBlockReq *req = opaque;
 VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(s);
 int8_t err_status = VIRTIO_BLK_S_OK;
+trace_virtio_blk_zone_mgmt_complete(vdev, req,ret);
 
 if (ret) {
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
@@ -841,6 +846,8 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 /* Entire drive capacity */
 offset = 0;
 len = capacity;
+trace_virtio_blk_handle_zone_reset_all(vdev, req, 0,
+   bs->total_sectors);
 } else {
 if (bs->bl.zone_size > capacity - offset) {
 /* The zoned device allows the last smaller zone. */
@@ -848,6 +855,9 @@ static int virtio_blk_handle_zone_mgmt(VirtIOBlockReq *req, 
BlockZoneOp op)
 } else {
 len = bs->bl.zone_size;
 }
+trace_virtio_blk_handle_zone_mgmt(vdev, req, op,
+  offset >> BDRV_SECTOR_BITS,
+  len >> BDRV_SECTOR_BITS);
 }
 
 if (!check_zoned_request(s, offset, len, false, &err_status)) {
@@ -888,6 +898,7 @@ static void virtio_blk_zone_append_complete(void *opaque, 
int ret)
 err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
 goto out;
 }
+trace_virtio_blk_zone_append_complete(vdev, req, append_sector, ret);
 
 out:
 aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
@@ -909,6 +920,7 @@ static int virtio_blk_handle_zone_append(VirtIOBlo

[PATCH v10 2/5] virtio-blk: add zoned storage emulation for zoned devices

2023-04-07 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zones, four zone operations (open,
close, finish, reset), and Zone Append.

The VIRTIO_BLK_F_ZONED feature bit will only be set if the host
supports zoned block devices. It will not be set for regular block
devices (conventional zones).

The guest OS can use blktests and fio to test these commands on zoned devices.
Testing zone append writes with zonefs is also supported.
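
As a rough illustration of the request validation this patch performs before
issuing zone commands, the range check in check_zoned_request() can be sketched
as a standalone function. The name zoned_range_is_valid() and the plain
capacity parameter are assumptions made for this sketch; the patch itself
derives the capacity from bs->total_sectors << BDRV_SECTOR_BITS.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical standalone helper (not part of the patch): check that the
 * byte range [offset, offset + len) lies inside a device of `capacity` bytes.
 */
static bool zoned_range_is_valid(int64_t offset, int64_t len, int64_t capacity)
{
    if (offset < 0 || len < 0 || len > capacity) {
        return false;
    }
    /* Compare against capacity - len so that offset + len cannot overflow. */
    return offset <= capacity - len;
}
```

Writing the comparison as offset <= capacity - len, as the patch does, avoids
computing offset + len, which could overflow for guest-supplied values.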

Signed-off-by: Sam Li 
---
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 389 +++
 hw/virtio/virtio-qmp.c   |   2 +
 3 files changed, 393 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endof(struct virtio_blk_config, zoned)},
 {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index cefca93b31..8b6030b1a5 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,335 @@ err:
 return err_status;
 }
 
+typedef struct ZoneCmdData {
+VirtIOBlockReq *req;
+struct iovec *in_iov;
+unsigned in_num;
+union {
+struct {
+unsigned int nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report_data;
+struct {
+int64_t offset;
+} zone_append_data;
+};
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+ bool append, uint8_t *status) {
+BlockDriverState *bs = blk_bs(s->blk);
+int index;
+
+if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+*status = VIRTIO_BLK_S_UNSUPP;
+return false;
+}
+
+if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+|| offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (append) {
+if (bs->bl.write_granularity) {
+if ((offset % bs->bl.write_granularity) != 0) {
+*status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+return false;
+}
+}
+
+index = offset / bs->bl.zone_size;
+if (BDRV_ZT_IS_CONV(bs->wps->wp[index])) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (len / 512 > bs->bl.max_append_sectors) {
+if (bs->bl.max_append_sectors == 0) {
+*status = VIRTIO_BLK_S_UNSUPP;
+} else {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+}
+return false;
+}
+}
+return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+ZoneCmdData *data = opaque;
+VirtIOBlockReq *req = data->req;
+VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+struct iovec *in_iov = data->in_iov;
+unsigned in_num = data->in_num;
+int64_t zrp_size, n, j = 0;
+int64_t nz = data->zone_report_data.nr_zones;
+int8_t err_status = VIRTIO_BLK_S_OK;
+
+if (ret) {
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
+zrp_size = sizeof(struct virtio_blk_zone_report)
+   + sizeof(struct virtio_blk_zone_descriptor) * nz;
+n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+if (n != sizeof(zrp_hdr)) {
+virtio_error(vdev, "Driver provided input buffer that is too small!");
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+struct virtio_blk_zone_descriptor desc =
+(struct virtio_blk_zone_

Re: [PATCH v9 1/4] file-posix: add tracking of the zone write pointers

2023-04-10 Thread Sam Li
Stefan Hajnoczi wrote on Mon, Apr 10, 2023 at 21:04:
>
> On Fri, Apr 07, 2023 at 04:16:54PM +0800, Sam Li wrote:
> > Since Linux doesn't have a user API to issue zone append operations to
> > zoned devices from user space, the file-posix driver is modified to add
> > zone append emulation using regular writes. To do this, the file-posix
> > driver tracks the wp location of all zones of the device. It uses an
> > array of uint64_t. The most significant bit of each wp location indicates
> > if the zone type is conventional zones.
> >
> > The zones wp can be changed due to the following operations issued:
> > - zone reset: change the wp to the start offset of that zone
> > - zone finish: change to the end location of that zone
> > - write to a zone
> > - zone append
> >
> > Signed-off-by: Sam Li 
> > ---
> >  block/file-posix.c   | 173 ++-
> >  include/block/block-common.h |  14 +++
> >  include/block/block_int-common.h |   5 +
> >  3 files changed, 189 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 65efe5147e..e7957f5559 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1324,6 +1324,90 @@ static int hdev_get_max_segments(int fd, struct stat 
> > *st)
> >  #endif
> >  }
> >
> > +#if defined(CONFIG_BLKZONED)
> > +/*
> > + * If the reset_all flag is true, then the wps of zone whose state is
> > + * not readonly or offline should be all reset to the start sector.
> > + * Else, take the real wp of the device.
> > + */
> > +static int get_zones_wp(BlockDriverState *bs, int fd, int64_t offset,
> > +unsigned int nrz, bool reset_all)
> > +{
> > +struct blk_zone *blkz;
> > +size_t rep_size;
> > +uint64_t sector = offset >> BDRV_SECTOR_BITS;
> > +BlockZoneWps *wps = bs->wps;
> > +int j = offset / bs->bl.zone_size;
> > +int ret, n = 0, i = 0;
>
> I would feel more comfortable if i, j, and n were unsigned int like nrz.
> That way we don't need to worry about negative array indices when int
> wraps to INT_MIN.
>
> In practice we'll probably hit scalability problems before nrz becomes
> greater than INT_MAX. Also, such devices probably don't exist. A 5 TB
> drive with 256 MB zones only has 20,480 zones.
>
> So for now I think you can keep the code the way it is.
>
> > +rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> > blk_zone);
> > +g_autofree struct blk_zone_report *rep = NULL;
> > +
> > +rep = g_malloc(rep_size);
> > +blkz = (struct blk_zone *)(rep + 1);
> > +while (n < nrz) {
> > +memset(rep, 0, rep_size);
> > +rep->sector = sector;
> > +rep->nr_zones = nrz - n;
> > +
> > +do {
> > +ret = ioctl(fd, BLKREPORTZONE, rep);
> > +} while (ret != 0 && errno == EINTR);
> > +if (ret != 0) {
> > +error_report("%d: ioctl BLKREPORTZONE at %" PRId64 " failed 
> > %d",
> > +fd, offset, errno);
> > +return -errno;
> > +}
> > +
> > +if (!rep->nr_zones) {
> > +break;
> > +}
> > +
> > +for (i = 0; i < rep->nr_zones; ++i, ++n, ++j) {
> > +/*
> > + * The wp tracking cares only about sequential writes required 
> > and
> > + * sequential write preferred zones so that the wp can advance 
> > to
> > + * the right location.
> > + * Use the most significant bit of the wp location to indicate 
> > the
> > + * zone type: 0 for SWR/SWP zones and 1 for conventional zones.
> > + */
> > +if (blkz[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> > +wps->wp[j] |= 1ULL << 63;
> > +} else {
> > +switch(blkz[i].cond) {
> > +case BLK_ZONE_COND_FULL:
> > +case BLK_ZONE_COND_READONLY:
> > +/* Zone not writable */
> > +wps->wp[j] = (blkz[i].start + blkz[i].len) << 
> > BDRV_SECTOR_BITS;
> > +break;
> > +case BLK_ZONE_COND_OFFLINE:
> > +/* Zone not writable nor readable */
> > +wps->wp[j] = (blkz[i].start) << BDRV_SECTOR_BITS;
> > +   

Re: [PATCH] block/file-posix: use unsigned int for zones consistently

2023-04-10 Thread Sam Li
Stefan Hajnoczi wrote on Mon, Apr 10, 2023 at 21:49:
>
> Avoid mixing int and unsigned int for zone index and count values. This
> eliminates the possibility of accidental negative write pointer array
> indices. It also makes code review easier because we don't need to worry
> about signed/unsigned comparisons.
>
> In practice I don't think zoned devices are likely to exceed INT_MAX
> zones any time soon, so this is mostly a code cleanup.
>
> Cc: Sam Li 
> Cc: Dmitry Fomichev 
> Cc: Damien Le Moal 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/file-posix.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> This is a cleanup on top of "[PATCH v9 0/4] Add zone append write for
> zoned device".
>
> Based-on: <20230407081657.17947-1-faithilike...@gmail.com>

Reviewed-by: Sam Li 

>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 32b16bc4fb..77fbf9e33e 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1338,8 +1338,9 @@ static int get_zones_wp(BlockDriverState *bs, int fd, 
> int64_t offset,
>  size_t rep_size;
>  uint64_t sector = offset >> BDRV_SECTOR_BITS;
>  BlockZoneWps *wps = bs->wps;
> -int j = offset / bs->bl.zone_size;
> -int ret, n = 0, i = 0;
> +unsigned int j = offset / bs->bl.zone_size;
> +int ret;
> +unsigned int n = 0, i = 0;
>  rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
>  g_autofree struct blk_zone_report *rep = NULL;
>
> @@ -2092,7 +2093,8 @@ static int handle_aiocb_zone_report(void *opaque)
>  struct blk_zone *blkz;
>  size_t rep_size;
>  unsigned int nrz;
> -int ret, n = 0, i = 0;
> +int ret;
> +unsigned int n = 0, i = 0;
>
>  nrz = *nr_zones;
>  rep_size = sizeof(struct blk_zone_report) + nrz * sizeof(struct 
> blk_zone);
> @@ -3507,11 +3509,11 @@ static int coroutine_fn 
> raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
>  return ret;
>  }
>  } else if (zo == BLKRESETZONE) {
> -for (int j = 0; j < nrz; ++j) {
> +for (unsigned int j = 0; j < nrz; ++j) {
>  wp[j] = offset + j * zone_size;
>  }
>  } else if (zo == BLKFINISHZONE) {
> -for (int j = 0; j < nrz; ++j) {
> +for (unsigned int j = 0; j < nrz; ++j) {
>  /* The zoned device allows the last zone smaller that the
>   * zone size. */
>  wp[j] = MIN(offset + (j + 1) * zone_size, offset + len);
> --
> 2.39.2
>



[PATCH v15 1/8] include: add zoned device structs

2023-01-29 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 include/block/block-common.h | 43 
 1 file changed, 43 insertions(+)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index 41686810de..211fbc0847 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -58,6 +58,49 @@ typedef struct BlockDriver BlockDriver;
 typedef struct BdrvChild BdrvChild;
 typedef struct BdrvChildClass BdrvChildClass;
 
+typedef enum BlockZoneOp {
+BLK_ZO_OPEN,
+BLK_ZO_CLOSE,
+BLK_ZO_FINISH,
+BLK_ZO_RESET,
+} BlockZoneOp;
+
+typedef enum BlockZoneModel {
+BLK_Z_NONE = 0x0, /* Regular block device */
+BLK_Z_HM = 0x1, /* Host-managed zoned block device */
+BLK_Z_HA = 0x2, /* Host-aware zoned block device */
+} BlockZoneModel;
+
+typedef enum BlockZoneState {
+BLK_ZS_NOT_WP = 0x0,
+BLK_ZS_EMPTY = 0x1,
+BLK_ZS_IOPEN = 0x2,
+BLK_ZS_EOPEN = 0x3,
+BLK_ZS_CLOSED = 0x4,
+BLK_ZS_RDONLY = 0xD,
+BLK_ZS_FULL = 0xE,
+BLK_ZS_OFFLINE = 0xF,
+} BlockZoneState;
+
+typedef enum BlockZoneType {
+BLK_ZT_CONV = 0x1, /* Conventional random writes supported */
+BLK_ZT_SWR = 0x2, /* Sequential writes required */
+BLK_ZT_SWP = 0x3, /* Sequential writes preferred */
+} BlockZoneType;
+
+/*
+ * Zone descriptor data structure.
+ * Provides information on a zone with all position and size values in bytes.
+ */
+typedef struct BlockZoneDescriptor {
+uint64_t start;
+uint64_t length;
+uint64_t cap;
+uint64_t wp;
+BlockZoneType type;
+BlockZoneState state;
+} BlockZoneDescriptor;
+
 typedef struct BlockDriverInfo {
 /* in bytes, 0 if irrelevant */
 int cluster_size;
-- 
2.38.1




[PATCH v15 0/8] Add support for zoned device

2023-01-29 Thread Sam Li
Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
that are larger than the LBA size. Zones can only be written sequentially, which
reduces write amplification in SSDs, leading to higher throughput and increased
capacity. More details about ZBDs can be found at:

https://zonedstorage.io/docs/introduction/zoned-storage

The zoned device support aims to let guests (virtual machines) access zoned
storage devices on the host (hypervisor) through a virtio-blk device. This
involves extending QEMU's block layer and virtio-blk emulation code. In its
current state, the virtio-blk device is not aware of ZBDs, but the guest sees
host-managed drives as regular drives that will run correctly under the most
common write workloads.

This patch series extends the block layer APIs with the minimum set of zoned
commands necessary to support zoned devices: Report Zones, four zone
operations (open, close, finish, reset), and Zone Append.
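
To make the four zone operations more concrete, here is a minimal sketch of
how each one affects a zone's write pointer under the usual zoned-storage
semantics: reset moves it to the zone start, finish moves it to the zone end,
and open/close leave it unchanged. The function wp_after_zone_op() and its
parameters are hypothetical and only model the behaviour; this series does not
contain such a helper, and the enum merely mirrors the BlockZoneOp added in
patch 1/8.

```c
#include <stdint.h>

/* Mirrors the BlockZoneOp enum added in patch 1/8. */
typedef enum BlockZoneOp {
    BLK_ZO_OPEN,
    BLK_ZO_CLOSE,
    BLK_ZO_FINISH,
    BLK_ZO_RESET,
} BlockZoneOp;

/*
 * Hypothetical model: return the write pointer (in bytes) of a zone after
 * `op` has been applied. `capacity` clamps the end of a smaller last zone.
 */
static uint64_t wp_after_zone_op(BlockZoneOp op, uint64_t wp,
                                 uint64_t zone_start, uint64_t zone_size,
                                 uint64_t capacity)
{
    uint64_t zone_end = zone_start + zone_size;

    if (zone_end > capacity) {
        zone_end = capacity;    /* the last zone may be smaller */
    }

    switch (op) {
    case BLK_ZO_RESET:
        return zone_start;      /* zone becomes empty */
    case BLK_ZO_FINISH:
        return zone_end;        /* zone becomes full */
    case BLK_ZO_OPEN:
    case BLK_ZO_CLOSE:
    default:
        return wp;              /* open/close do not move the wp */
    }
}
```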

There has been a debate on whether to introduce a new zoned_host_device
BlockDriver specifically for zoned devices. In the end, it was decided to stick
with the existing host_device BlockDriver interface and only add the new zoned
operations to it. The benefit is that applications using QEMU zoned emulation,
such as libvirt, need no further changes (to command line syntax, for
example).

It can be tested on a null_blk device using qemu-io or qemu-iotests. For
example, to test zone report using qemu-io:
$ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0
-c "zrp offset nr_zones"

v15:
- drop zoned_host_device BlockDriver
- add zoned device option to host_device driver instead of introducing a new
  zoned_host_device BlockDriver [Stefan]

v14:
- address Stefan's comments of probing block sizes

v13:
- add some tracing points for new zone APIs [Dmitry]
- change error handling in zone_mgmt [Damien, Stefan]

v12:
- address review comments
  * drop BLK_ZO_RESET_ALL bit [Damien]
  * fix error messages, style, and typos [Damien, Hannes]

v11:
- address review comments
  * fix possible BLKZONED config compiling warnings [Stefan]
  * fix capacity field compiling warnings on older kernel [Stefan, Damien]

v10:
- address review comments
  * deal with the last small zone case in zone_mgmt operations [Damien]
  * handle the outdated capacity field on old kernels (before 5.9) [Damien]
  * use byte unit in block layer to be consistent with QEMU [Eric]
  * fix coding style related problems [Stefan]

v9:
- address review comments
  * specify units of zone commands requests [Stefan]
  * fix some error handling in file-posix [Stefan]
  * introduce zoned_host_device in the commit message [Markus]

v8:
- address review comments
  * solve patch conflicts and merge sysfs helper functions into one patch
  * add cache.direct=on check in config

v7:
- address review comments
  * modify sysfs attribute helper functions
  * move the input validation and error checking into raw_co_zone_* function
  * fix checks in config

v6:
- drop virtio-blk emulation changes
- address Stefan's review comments
  * fix CONFIG_BLKZONED configs in related functions
  * replace reading fd by g_file_get_contents() in get_sysfs_str_val()
  * rewrite documentation for zoned storage

v5:
- add zoned storage emulation to virtio-blk device
- add documentation for zoned storage
- address review comments
  * fix qemu-iotests
  * fix check to block layer
  * modify interfaces of sysfs helper functions
  * rename zoned device structs according to QEMU styles
  * reorder patches

v4:
- add virtio-blk headers for zoned device
- add configurations for zoned host device
- add zone operations for raw-format
- address review comments
  * fix memory leak bug in zone_report
  * add checks to block layers
  * fix qemu-iotests format
  * fix sysfs helper functions

v3:
- add helper functions to get sysfs attributes
- address review comments
  * fix zone report bugs
  * fix the qemu-io code path
  * use thread pool to avoid blocking ioctl() calls

v2:
- add qemu-io sub-commands
- address review comments
  * modify interfaces of APIs

v1:
- add block layer APIs resembling Linux ZonedBlockDevice ioctls

Sam Li (8):
  include: add zoned device structs
  file-posix: introduce helper functions for sysfs attributes
  block: add block layer APIs resembling Linux ZonedBlockDevice ioctls
  raw-format: add zone operations to pass through requests
  config: add check to block layer
  qemu-iotests: test new zone operations
  block: add some trace events for new block layer APIs
  docs/zoned-storage: add zoned device documentation

 block.c|  19 +
 block/block-backend.c  | 147 
 block/file-posix.c | 460 +++--
 block/io.c |  41 +++
 block/raw-format.c |  14 +
 block/trace-events |   2 +
 docs/devel/zoned-storage.rst   |  

[PATCH v15 4/8] raw-format: add zone operations to pass through requests

2023-01-29 Thread Sam Li
The raw-format driver usually sits on top of the file-posix driver, so it
needs to pass zone command requests through to it.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
 block/raw-format.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/raw-format.c b/block/raw-format.c
index b6a0ce58f4..dbbb8f3859 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -317,6 +317,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState 
*bs,
 return bdrv_co_pdiscard(bs->file, offset, bytes);
 }
 
+static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t 
offset,
+   unsigned int *nr_zones,
+   BlockZoneDescriptor *zones) {
+return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
+}
+
+static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
+ int64_t offset, int64_t len) {
+return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
+}
+
 static int64_t raw_getlength(BlockDriverState *bs)
 {
 int64_t len;
@@ -618,6 +629,8 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = &raw_co_pwritev,
 .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
 .bdrv_co_pdiscard = &raw_co_pdiscard,
+.bdrv_co_zone_report  = &raw_co_zone_report,
+.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
 .bdrv_co_block_status = &raw_co_block_status,
 .bdrv_co_copy_range_from = &raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = &raw_co_copy_range_to,
-- 
2.38.1




[PATCH v15 2/8] file-posix: introduce helper functions for sysfs attributes

2023-01-29 Thread Sam Li
Use get_sysfs_str_val() to read the device's zoned model from sysfs as a
string. get_sysfs_zoned_model() then converts it to QEMU's
BlockZoneModel type.

Use get_sysfs_long_val() to read numeric zoned device attributes as
long values.
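
The string-to-model conversion described above can be sketched as follows.
This is only an illustration: parse_zoned_model() is a hypothetical name, the
enum mirrors the BlockZoneModel from patch 1/8, and the strings are the values
the Linux kernel reports in the queue/zoned sysfs attribute ("none",
"host-aware", "host-managed").

```c
#include <string.h>

/* Mirrors the BlockZoneModel enum added in patch 1/8. */
typedef enum BlockZoneModel {
    BLK_Z_NONE = 0x0, /* Regular block device */
    BLK_Z_HM = 0x1,   /* Host-managed zoned block device */
    BLK_Z_HA = 0x2,   /* Host-aware zoned block device */
} BlockZoneModel;

/*
 * Hypothetical helper: convert the contents of the sysfs "zoned" attribute
 * (trailing newline already stripped) into a BlockZoneModel.
 * Returns 0 on success and -1 if the string is not recognized.
 */
static int parse_zoned_model(const char *val, BlockZoneModel *zoned)
{
    if (strcmp(val, "host-managed") == 0) {
        *zoned = BLK_Z_HM;
    } else if (strcmp(val, "host-aware") == 0) {
        *zoned = BLK_Z_HA;
    } else if (strcmp(val, "none") == 0) {
        *zoned = BLK_Z_NONE;
    } else {
        return -1;
    }
    return 0;
}
```

The comparison relies on the trailing '\n' having been removed first, which is
why get_sysfs_str_val() strips it before returning the string.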

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
---
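The parsing steps described in the commit message can be sketched in
isolation as follows. This is only an illustration under stated assumptions:
it works on an in-memory string instead of reading sysfs, and ZonedModel,
parse_zoned_model() and parse_long_attr() are hypothetical names, not the
helpers introduced by this patch. Unlike the patch, the sketch returns the
parsed number through an out parameter so that it cannot be confused with a
negative error code.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
    ZONED_NONE,
    ZONED_HOST_AWARE,
    ZONED_HOST_MANAGED
} ZonedModel;

/* sysfs attribute files end with '\n'; drop one trailing newline. */
static void strip_trailing_newline(char *s)
{
    size_t len = strlen(s);

    if (len > 0 && s[len - 1] == '\n') {
        s[len - 1] = '\0';
    }
}

/* Map the text of the "zoned" attribute to a model; modifies text in place. */
static int parse_zoned_model(char *text, ZonedModel *model)
{
    strip_trailing_newline(text);
    if (strcmp(text, "host-managed") == 0) {
        *model = ZONED_HOST_MANAGED;
    } else if (strcmp(text, "host-aware") == 0) {
        *model = ZONED_HOST_AWARE;
    } else if (strcmp(text, "none") == 0) {
        *model = ZONED_NONE;
    } else {
        return -ENOTSUP;
    }
    return 0;
}

/* Parse a decimal attribute; the whole string must be consumed. */
static int parse_long_attr(char *text, long *val)
{
    char *end;
    long v;

    strip_trailing_newline(text);
    if (*text == '\0') {
        return -EINVAL;
    }
    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0') {
        return -EINVAL;
    }
    *val = v;
    return 0;
}
```

Feeding "host-managed\n" to parse_zoned_model() yields ZONED_HOST_MANAGED,
and "128\n" parses to 128, while trailing garbage such as "12x\n" is
rejected.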
 block/file-posix.c   | 122 ++-
 include/block/block_int-common.h |   3 +
 2 files changed, 91 insertions(+), 34 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index fa227d9d14..43c59c6d56 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1202,64 +1202,112 @@ static int hdev_get_max_hw_transfer(int fd, struct 
stat *st)
 #endif
 }
 
-static int hdev_get_max_segments(int fd, struct stat *st)
-{
+/*
+ * Get a sysfs attribute value as character string.
+ */
+static int get_sysfs_str_val(struct stat *st, const char *attribute,
+ char **val) {
 #ifdef CONFIG_LINUX
-char buf[32];
-const char *end;
-char *sysfspath = NULL;
+g_autofree char *sysfspath = NULL;
 int ret;
-int sysfd = -1;
-long max_segments;
+size_t len;
 
-if (S_ISCHR(st->st_mode)) {
-if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
-return ret;
-}
+if (!S_ISBLK(st->st_mode)) {
 return -ENOTSUP;
 }
 
-if (!S_ISBLK(st->st_mode)) {
-return -ENOTSUP;
+sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
+major(st->st_rdev), minor(st->st_rdev),
+attribute);
+ret = g_file_get_contents(sysfspath, val, &len, NULL);
+if (!ret) {
+return -ENOENT;
 }
 
-sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
-major(st->st_rdev), minor(st->st_rdev));
-sysfd = open(sysfspath, O_RDONLY);
-if (sysfd == -1) {
-ret = -errno;
-goto out;
+/* The file ends with '\n'; strip it. */
+char *p;
+p = *val;
+if (*(p + len - 1) == '\n') {
+*(p + len - 1) = '\0';
 }
-ret = RETRY_ON_EINTR(read(sysfd, buf, sizeof(buf) - 1));
+return ret;
+#else
+return -ENOTSUP;
+#endif
+}
+
+static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned)
+{
+g_autofree char *val = NULL;
+int ret;
+
+ret = get_sysfs_str_val(st, "zoned", &val);
 if (ret < 0) {
-ret = -errno;
-goto out;
-} else if (ret == 0) {
-ret = -EIO;
-goto out;
+return ret;
 }
-buf[ret] = 0;
-/* The file is ended with '\n', pass 'end' to accept that. */
-ret = qemu_strtol(buf, &end, 10, &max_segments);
-if (ret == 0 && end && *end == '\n') {
-ret = max_segments;
+
+if (strcmp(val, "host-managed") == 0) {
+*zoned = BLK_Z_HM;
+} else if (strcmp(val, "host-aware") == 0) {
+*zoned = BLK_Z_HA;
+} else if (strcmp(val, "none") == 0) {
+*zoned = BLK_Z_NONE;
+} else {
+return -ENOTSUP;
+}
+return 0;
+}
+
+/*
+ * Get a sysfs attribute value as a long integer.
+ */
+static long get_sysfs_long_val(struct stat *st, const char *attribute)
+{
+#ifdef CONFIG_LINUX
+g_autofree char *str = NULL;
+const char *end;
+long val;
+int ret;
+
+ret = get_sysfs_str_val(st, attribute, &str);
+if (ret < 0) {
+return ret;
 }
 
-out:
-if (sysfd != -1) {
-close(sysfd);
+/* The trailing '\n' is already stripped, so require a full parse. */
+ret = qemu_strtol(str, &end, 10, &val);
+if (ret == 0 && end && *end == '\0') {
+ret = val;
 }
-g_free(sysfspath);
 return ret;
 #else
 return -ENOTSUP;
 #endif
 }
 
+static int hdev_get_max_segments(int fd, struct stat *st)
+{
+#ifdef CONFIG_LINUX
+int ret;
+
+if (S_ISCHR(st->st_mode)) {
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
+return ret;
+}
+return -ENOTSUP;
+}
+return get_sysfs_long_val(st, "max_segments");
+#else
+return -ENOTSUP;
+#endif
+}
+
 static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 struct stat st;
+int ret;
+BlockZoneModel zoned;
 
 s->needs_alignment = raw_needs_alignment(bs);
 raw_probe_alignment(bs, s->fd, errp);
@@ -1297,6 +1345,12 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.max_hw_iov = ret;
 }
 }
+
+ret = get_sysfs_zoned_model(&st, &zoned);
+if (ret < 0) {
+zoned = BLK_Z_NONE;
+}
+bs->bl.zoned = zone

[PATCH v15 6/8] qemu-iotests: test new zone operations

2023-01-29 Thread Sam Li
New block layer APIs for zoned block devices have been added. Test them
by running each zone operation on a newly created null_blk device and
checking that the output shows the correct zone information:
$ ./tests/qemu-iotests/tests/zoned.sh

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
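qemu-io takes byte offsets, while zone reports are easier to read in
512-byte sectors. The sketch below shows the conversion, assuming the
268435456-byte zone size that the test passes to the zone commands. The
macro and helper names are made up for this note and are not part of the
test or of QEMU.

```c
#include <stdint.h>

#define SECTOR_BITS 9                    /* 512-byte sectors */
#define ZONE_SIZE_BYTES 268435456LL      /* zone length used by the test */

/* Convert a byte offset or length to 512-byte sectors. */
static int64_t bytes_to_sectors(int64_t bytes)
{
    return bytes >> SECTOR_BITS;
}

/* Byte offset at which zone number zone_index starts. */
static int64_t zone_start_bytes(int64_t zone_index)
{
    return zone_index * ZONE_SIZE_BYTES;
}

/* Index of the zone that contains the given byte offset. */
static int64_t zone_index_of(int64_t offset_bytes)
{
    return offset_bytes / ZONE_SIZE_BYTES;
}
```

For example, the second zone starts at byte 268435456, which is sector
524288, and any offset inside that zone maps back to zone index 1.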
 tests/qemu-iotests/tests/zoned.out | 53 ++
 tests/qemu-iotests/tests/zoned.sh  | 86 ++
 2 files changed, 139 insertions(+)
 create mode 100644 tests/qemu-iotests/tests/zoned.out
 create mode 100755 tests/qemu-iotests/tests/zoned.sh

diff --git a/tests/qemu-iotests/tests/zoned.out 
b/tests/qemu-iotests/tests/zoned.out
new file mode 100644
index 0000000000..0c8f96deb9
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.out
@@ -0,0 +1,53 @@
+QA output created by zoned.sh
+Testing a null_blk device:
+case 1: if the operations work
+(1) report the first zone:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+report the first 10 zones
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+start: 0x100000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:1, [type: 2]
+start: 0x180000, len 0x80000, cap 0x80000, wptr 0x180000, zcond:1, [type: 2]
+start: 0x200000, len 0x80000, cap 0x80000, wptr 0x200000, zcond:1, [type: 2]
+start: 0x280000, len 0x80000, cap 0x80000, wptr 0x280000, zcond:1, [type: 2]
+start: 0x300000, len 0x80000, cap 0x80000, wptr 0x300000, zcond:1, [type: 2]
+start: 0x380000, len 0x80000, cap 0x80000, wptr 0x380000, zcond:1, [type: 2]
+start: 0x400000, len 0x80000, cap 0x80000, wptr 0x400000, zcond:1, [type: 2]
+start: 0x480000, len 0x80000, cap 0x80000, wptr 0x480000, zcond:1, [type: 2]
+
+report the last zone:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 
2]
+
+
+(2) opening the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:3, [type: 2]
+
+opening the second zone
+report after:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:3, [type: 2]
+
+opening the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:3, [type: 
2]
+
+
+(3) closing the first zone
+report after:
+start: 0x0, len 0x80000, cap 0x80000, wptr 0x0, zcond:1, [type: 2]
+
+closing the last zone
+report after:
+start: 0x1f380000, len 0x80000, cap 0x80000, wptr 0x1f380000, zcond:1, [type: 
2]
+
+
+(4) finishing the second zone
+After finishing a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x100000, zcond:14, [type: 2]
+
+
+(5) resetting the second zone
+After resetting a zone:
+start: 0x80000, len 0x80000, cap 0x80000, wptr 0x80000, zcond:1, [type: 2]
+*** done
diff --git a/tests/qemu-iotests/tests/zoned.sh 
b/tests/qemu-iotests/tests/zoned.sh
new file mode 100755
index 0000000000..9d7c15dde6
--- /dev/null
+++ b/tests/qemu-iotests/tests/zoned.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+#
+# Test zone management operations.
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+status=1 # failure is the default!
+
+_cleanup()
+{
+  _cleanup_test_img
+  sudo rmmod null_blk
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.qemu
+
+# This test only runs on Linux hosts with raw image files.
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+QEMU_IO="build/qemu-io"
+IMG="--image-opts -n driver=host_device,filename=/dev/nullb0"
+QEMU_IO_OPTIONS=$QEMU_IO_OPTIONS_NO_FMT
+
+echo "Testing a null_blk device:"
+echo "case 1: if the operations work"
+sudo modprobe null_blk nr_devices=1 zoned=1
+
+echo "(1) report the first zone:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "report the first 10 zones"
+sudo $QEMU_IO $IMG -c "zrp 0 10"
+echo
+echo "report the last zone:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2" # 0x3e70000000 / 512 = 0x1f380000
+echo
+echo
+echo "(2) opening the first zone"
+sudo $QEMU_IO $IMG -c "zo 0 268435456"  # 268435456 / 512 = 524288
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "opening the second zone"
+sudo $QEMU_IO $IMG -c "zo 268435456 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 268435456 1"
+echo
+echo "opening the last zone"
+sudo $QEMU_IO $IMG -c "zo 0x3e70000000 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0x3e70000000 2"
+echo
+echo
+echo "(3) closing the first zone"
+sudo $QEMU_IO $IMG -c "zc 0 268435456"
+echo "report after:"
+sudo $QEMU_IO $IMG -c "zrp 0 1"
+echo
+echo "closing the last zone"
+sudo $QEMU_IO $IMG -c "zc 0x3e70000000 268435456"
+echo "re

[PATCH v15 7/8] block: add some trace events for new block layer APIs

2023-01-29 Thread Sam Li
Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
---
 block/file-posix.c | 3 +++
 block/trace-events | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index f661f202a1..5cf92608db 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -3272,6 +3272,7 @@ static int coroutine_fn 
raw_co_zone_report(BlockDriverState *bs, int64_t offset,
BlockZoneDescriptor *zones) {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+trace_zbd_zone_report(bs, *nr_zones, offset >> BDRV_SECTOR_BITS);
 
 acb = (RawPosixAIOData) {
 .bs = bs,
@@ -3350,6 +3351,8 @@ static int coroutine_fn raw_co_zone_mgmt(BlockDriverState 
*bs, BlockZoneOp op,
 },
 };
 
+trace_zbd_zone_mgmt(bs, op_name, offset >> BDRV_SECTOR_BITS,
+len >> BDRV_SECTOR_BITS);
 ret = raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
 if (ret != 0) {
 ret = -errno;
diff --git a/block/trace-events b/block/trace-events
index 48dbf10c66..3f4e1d088a 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -209,6 +209,8 @@ file_FindEjectableOpticalMedia(const char *media) "Matching 
using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
 file_hdev_is_sg(int type, int version) "SG device found: type=%d, version=%d"
 file_flush_fdatasync_failed(int err) "errno %d"
+zbd_zone_report(void *bs, unsigned int nr_zones, int64_t sector) "bs %p report 
%d zones starting at sector offset 0x%" PRIx64 ""
+zbd_zone_mgmt(void *bs, const char *op_name, int64_t sector, int64_t len) "bs 
%p %s starts at sector offset 0x%" PRIx64 " over a range of 0x%" PRIx64 " 
sectors"
 
 # ssh.c
 sftp_error(const char *op, const char *ssh_err, int ssh_err_code, int 
sftp_err_code) "%s failed: %s (libssh error code: %d, sftp error code: %d)"
-- 
2.38.1




[PATCH v15 3/8] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2023-01-29 Thread Sam Li
Add a zoned device option to the host_device BlockDriver. It is presented only
for zoned host block devices. By adding zone management operations to the
host_device BlockDriver, users can use the new block layer APIs, including
Report Zones and the zone management operations (open, close, finish, reset,
reset_all).

qemu-io uses the new APIs to issue zoned storage commands to the device:
zone_report(zrp), zone_open(zo), zone_close(zc), zone_reset(zrs),
zone_finish(zf).

For example, to test zone_report, use the following command:
$ ./build/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 \
-c "zrp offset nr_zones"

Signed-off-by: Sam Li 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Stefan Hajnoczi 
---
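The qemu-io command names listed in the commit message can be pictured as a
small lookup table from command name to zone management operation. The
sketch below is illustrative only: ZoneOp, zone_cmds and lookup_zone_op() are
hypothetical names, and this is not how qemu-io-cmds.c registers its
commands. zrp is left out on purpose because it reports zones instead of
changing their state.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    ZONE_OP_OPEN,
    ZONE_OP_CLOSE,
    ZONE_OP_FINISH,
    ZONE_OP_RESET
} ZoneOp;

struct zone_cmd {
    const char *name;   /* qemu-io command name */
    ZoneOp op;          /* operation it requests */
};

static const struct zone_cmd zone_cmds[] = {
    { "zo",  ZONE_OP_OPEN },
    { "zc",  ZONE_OP_CLOSE },
    { "zf",  ZONE_OP_FINISH },
    { "zrs", ZONE_OP_RESET },
};

/* Returns 0 and stores the operation, or -1 if name is not a zone command. */
static int lookup_zone_op(const char *name, ZoneOp *op)
{
    size_t i;

    for (i = 0; i < sizeof(zone_cmds) / sizeof(zone_cmds[0]); i++) {
        if (strcmp(zone_cmds[i].name, name) == 0) {
            *op = zone_cmds[i].op;
            return 0;
        }
    }
    return -1;
}
```

Looking up "zrs" yields ZONE_OP_RESET, while "zrp" is not found because it
is handled as a report, not as a management operation.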
 block/block-backend.c | 147 ++
 block/file-posix.c| 323 ++
 block/io.c|  41 
 include/block/block-io.h  |   7 +
 include/block/block_int-common.h  |  21 ++
 include/block/raw-aio.h   |   6 +-
 include/sysemu/block-backend-io.h |  18 ++
 meson.build   |   4 +
 qemu-io-cmds.c| 149 ++
 9 files changed, 715 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index ba7bf1d6bc..a4847b9131 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1451,6 +1451,15 @@ typedef struct BlkRwCo {
 void *iobuf;
 int ret;
 BdrvRequestFlags flags;
+union {
+struct {
+unsigned int *nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report;
+struct {
+unsigned long op;
+} zone_mgmt;
+};
 } BlkRwCo;
 
 int blk_make_zero(BlockBackend *blk, BdrvRequestFlags flags)
@@ -1795,6 +1804,144 @@ int coroutine_fn blk_co_flush(BlockBackend *blk)
 return ret;
 }
 
+static void coroutine_fn blk_aio_zone_report_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_report(rwco->blk, rwco->offset,
+   rwco->zone_report.nr_zones,
+   rwco->zone_report.zones);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescriptor  *zones,
+BlockCompletionFunc *cb, void *opaque)
+{
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.ret= NOT_DONE,
+.zone_report = {
+.zones = zones,
+.nr_zones = nr_zones,
+},
+};
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_report_entry, acb);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+static void coroutine_fn blk_aio_zone_mgmt_entry(void *opaque)
+{
+BlkAioEmAIOCB *acb = opaque;
+BlkRwCo *rwco = &acb->rwco;
+
+rwco->ret = blk_co_zone_mgmt(rwco->blk, rwco->zone_mgmt.op,
+ rwco->offset, acb->bytes);
+blk_aio_complete(acb);
+}
+
+BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
+  int64_t offset, int64_t len,
+  BlockCompletionFunc *cb, void *opaque) {
+BlkAioEmAIOCB *acb;
+Coroutine *co;
+IO_CODE();
+
+blk_inc_in_flight(blk);
+acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
+acb->rwco = (BlkRwCo) {
+.blk= blk,
+.offset = offset,
+.ret= NOT_DONE,
+.zone_mgmt = {
+.op = op,
+},
+};
+acb->bytes = len;
+acb->has_returned = false;
+
+co = qemu_coroutine_create(blk_aio_zone_mgmt_entry, acb);
+bdrv_coroutine_enter(blk_bs(blk), co);
+
+acb->has_returned = true;
+if (acb->rwco.ret != NOT_DONE) {
+replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
+ blk_aio_complete_bh, acb);
+}
+
+return &acb->common;
+}
+
+/*
+ * Send a zone_report command.
+ * offset is a byte offset from the start of the device. No alignment
+ * required for offset.
+ * nr_zones represents IN maximum and OUT actual.
+ */
+int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
+unsigned int *nr_zones,
+BlockZoneDescri

[PATCH v15 5/8] config: add check to block layer

2023-01-29 Thread Sam Li
Putting zoned/non-zoned BlockDrivers on top of each other is not
allowed.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Dmitry Fomichev 
---
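The rule stated in the commit message boils down to a small predicate. The
following sketch uses hypothetical names (Model, can_attach_child()) and
assumes, as the code comment in this patch explains, that only host-managed
children are restricted, since host-aware devices also accept random writes
and can be used like regular devices.

```c
#include <stdbool.h>

typedef enum {
    MODEL_NONE,
    MODEL_HOST_AWARE,
    MODEL_HOST_MANAGED
} Model;

/*
 * A host-managed child may only be attached to a parent driver that
 * declares support for zoned children. Other models are always accepted.
 */
static bool can_attach_child(bool parent_supports_zoned_children,
                             Model child_model)
{
    if (child_model == MODEL_HOST_MANAGED) {
        return parent_supports_zoned_children;
    }
    return true;
}
```

A parent without zoned support therefore rejects a host-managed child but
still accepts host-aware and non-zoned children.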
 block.c                          | 19 +++++++++++++++++++
 block/file-posix.c               | 12 ++++++++++++
 block/raw-format.c               |  1 +
 include/block/block_int-common.h |  5 +++++
 4 files changed, 37 insertions(+)

diff --git a/block.c b/block.c
index b4a89207ad..5ab0b26510 100644
--- a/block.c
+++ b/block.c
@@ -7913,6 +7913,25 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 return;
 }
 
+/*
+ * Non-zoned block drivers do not follow zoned storage constraints
+ * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
+ * drivers in a graph.
+ */
+if (!parent_bs->drv->supports_zoned_children &&
+child_bs->bl.zoned == BLK_Z_HM) {
+/*
+ * The host-aware model allows zoned storage constraints and random
+ * write. Allow mixing host-aware and non-zoned drivers. Using
+ * host-aware device as a regular device.
+ */
+error_setg(errp, "Cannot add a %s child to a %s parent",
+   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
+   parent_bs->drv->supports_zoned_children ?
+   "support zoned children" : "not support zoned children");
+return;
+}
+
 if (!QLIST_EMPTY(&child_bs->parents)) {
 error_setg(errp, "The node %s already has a parent",
child_bs->node_name);
diff --git a/block/file-posix.c b/block/file-posix.c
index b6d88db208..f661f202a1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -776,6 +776,18 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 goto fail;
 }
 }
+#ifdef CONFIG_BLKZONED
+/*
+ * The kernel page cache does not reliably work for writes to SWR zones
+ * of zoned block device because it can not guarantee the order of writes.
+ */
+if ((strcmp(bs->drv->format_name, "zoned_host_device") == 0) &&
+(!(s->open_flags & O_DIRECT))) {
+error_setg(errp, "driver=zoned_host_device was specified, but it "
+   "requires cache.direct=on, which was not specified.");
+return -EINVAL; /* No host kernel page cache */
+}
+#endif
 
 if (S_ISBLK(st.st_mode)) {
 #ifdef __linux__
diff --git a/block/raw-format.c b/block/raw-format.c
index dbbb8f3859..772ce777ff 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -618,6 +618,7 @@ static void raw_child_perm(BlockDriverState *bs, BdrvChild 
*c,
 BlockDriver bdrv_raw = {
 .format_name  = "raw",
 .instance_size= sizeof(BDRVRawState),
+.supports_zoned_children = true,
 .bdrv_probe   = &raw_probe,
 .bdrv_reopen_prepare  = &raw_reopen_prepare,
 .bdrv_reopen_commit   = &raw_reopen_commit,
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 565228d8dd..cd631f94ed 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -137,6 +137,11 @@ struct BlockDriver {
  */
 bool is_format;
 
+/*
+ * Set to true if the BlockDriver supports zoned children.
+ */
+bool supports_zoned_children;
+
 /*
  * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
  * this field set to true, except ones that are defined only by their
-- 
2.38.1




[PATCH v15 8/8] docs/zoned-storage: add zoned device documentation

2023-01-29 Thread Sam Li
Add documentation about zoned device support in the virtio-blk
emulation.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Damien Le Moal 
Reviewed-by: Dmitry Fomichev 
---
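To make the difference between the zoned models more concrete, here is a
simplified sketch of the write rule of a sequential-write zone. It is an
assumption-laden model for illustration only: ZonedModel, Zone and
zone_write() are hypothetical names, zone conditions are ignored, and the
write pointer is only tracked for the host-managed case. It is not code from
QEMU or from any device.

```c
#include <errno.h>
#include <stdint.h>

typedef enum {
    MODEL_NONE,
    MODEL_HOST_AWARE,
    MODEL_HOST_MANAGED
} ZonedModel;

typedef struct {
    int64_t start;  /* first sector of the zone */
    int64_t len;    /* zone length in sectors */
    int64_t wp;     /* write pointer: next sector to be written */
} Zone;

/* Returns 0 if the write is accepted, a negative errno value otherwise. */
static int zone_write(ZonedModel model, Zone *z, int64_t sector, int64_t nr)
{
    /* The write must lie entirely inside the zone. */
    if (nr <= 0 || nr > z->len || sector < z->start ||
        sector - z->start > z->len - nr) {
        return -EINVAL;
    }
    if (model == MODEL_HOST_MANAGED) {
        /* Host-managed: writes must start exactly at the write pointer. */
        if (sector != z->wp) {
            return -EIO;
        }
        z->wp += nr;
    }
    /* Host-aware and non-zoned: random writes are accepted. */
    return 0;
}
```

Under the host-managed model, writing the same sectors twice fails the second
time because the write pointer has moved on, whereas the same request is
accepted under the host-aware model.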
 docs/devel/zoned-storage.rst   | 43 ++
 docs/system/qemu-block-drivers.rst.inc |  6 
 2 files changed, 49 insertions(+)
 create mode 100644 docs/devel/zoned-storage.rst

diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 0000000000..03e52efe2e
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,43 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. They can only allow sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-------------------------------------
+QEMU block layer supports three zoned storage models:
+- BLK_Z_HM: The host-managed zoned model only allows sequential write access
+to zones. It supports ZBD-specific I/O commands that can be used by a host to
+manage the zones of a device.
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
+zones, making it backward compatible with regular block devices.
+- BLK_Z_NONE: The non-zoned model has no zones support. It includes both
+regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
+supported.
+
+The block device information resides inside BlockDriverState. QEMU uses the
+BlockLimits struct (BlockDriverState::bl), which is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--------------------------------------
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io:
+$ path/to/qemu-io --image-opts -n driver=zoned_host_device,filename=/dev/nullb0
+-c "zrp offset nr_zones"
diff --git a/docs/system/qemu-block-drivers.rst.inc 
b/docs/system/qemu-block-drivers.rst.inc
index dfe5d2293d..0b97227fd9 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
   you may corrupt your host data (use the ``-snapshot`` command
   line option or modify the device permissions accordingly).
 
+Zoned block devices
+  Zoned block devices can be passed through to the guest if the emulated 
storage
+  controller supports zoned storage. Use ``--blockdev zoned_host_device,
+  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
+  as ``drive0``.
+
 Windows
 ^^^^^^^
 
-- 
2.38.1




[RFC v6 2/4] virtio-blk: add zoned storage emulation for zoned devices

2023-01-29 Thread Sam Li
This patch extends virtio-blk emulation to handle zoned device commands
by calling the new block layer APIs to perform zoned device I/O on
behalf of the guest. It supports Report Zones, four zone operations (open,
close, finish, reset), and Zone Append.

The VIRTIO_BLK_F_ZONED feature bit is set only if the host device is a
zoned block device. It is not set for regular block devices (conventional
zones).

The guest OS can use blktests or fio to test these commands on zoned devices.
Zone append writes can additionally be tested with zonefs.

Signed-off-by: Sam Li 
---
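Before a zoned request is issued on behalf of the guest, the guest-supplied
range has to be validated. The sketch below shows one overflow-safe way to
write such a check. The helper names and the capacity parameter are
hypothetical, and the code is a simplified illustration, not the
check_zoned_request() function from this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [offset, offset + len) lies inside a device of capacity bytes.
 * Comparing against capacity - len avoids computing offset + len, which
 * could overflow for hostile guest input.
 */
static bool zoned_range_ok(int64_t offset, int64_t len, int64_t capacity)
{
    if (offset < 0 || len < 0 || len > capacity) {
        return false;
    }
    return offset <= capacity - len;
}

/* Zone append offsets must be a multiple of the write granularity, if set. */
static bool append_offset_aligned(int64_t offset, uint32_t write_granularity)
{
    return write_granularity == 0 || offset % write_granularity == 0;
}
```

A request that ends exactly at the end of the device is accepted, while an
offset close to INT64_MAX is rejected without any arithmetic overflow.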
 hw/block/virtio-blk-common.c |   2 +
 hw/block/virtio-blk.c| 394 +++
 2 files changed, 396 insertions(+)

diff --git a/hw/block/virtio-blk-common.c b/hw/block/virtio-blk-common.c
index ac52d7c176..e2f8e2f6da 100644
--- a/hw/block/virtio-blk-common.c
+++ b/hw/block/virtio-blk-common.c
@@ -29,6 +29,8 @@ static const VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_blk_config, discard_sector_alignment)},
 {.flags = 1ULL << VIRTIO_BLK_F_WRITE_ZEROES,
  .end = endof(struct virtio_blk_config, write_zeroes_may_unmap)},
+{.flags = 1ULL << VIRTIO_BLK_F_ZONED,
+ .end = endof(struct virtio_blk_config, zoned)},
 {}
 };
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 1762517878..09220f400d 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -17,6 +17,7 @@
 #include "qemu/module.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
+#include "block/block_int.h"
 #include "trace.h"
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
@@ -601,6 +602,341 @@ err:
 return err_status;
 }
 
+typedef struct ZoneCmdData {
+VirtIOBlockReq *req;
+struct iovec *in_iov;
+unsigned in_num;
+union {
+struct {
+unsigned int nr_zones;
+BlockZoneDescriptor *zones;
+} zone_report_data;
+struct {
+int64_t offset;
+} zone_append_data;
+};
+} ZoneCmdData;
+
+/*
+ * check zoned_request: error checking before issuing requests. If all checks
+ * passed, return true.
+ * append: true if only zone append requests issued.
+ */
+static bool check_zoned_request(VirtIOBlock *s, int64_t offset, int64_t len,
+ bool append, uint8_t *status) {
+BlockDriverState *bs = blk_bs(s->blk);
+int index;
+
+if (!virtio_has_feature(s->host_features, VIRTIO_BLK_F_ZONED)) {
+*status = VIRTIO_BLK_S_UNSUPP;
+return false;
+}
+
+if (offset < 0 || len < 0 || len > (bs->total_sectors << BDRV_SECTOR_BITS)
+|| offset > (bs->total_sectors << BDRV_SECTOR_BITS) - len) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (append) {
+if (bs->bl.write_granularity) {
+if ((offset % bs->bl.write_granularity) != 0) {
+*status = VIRTIO_BLK_S_ZONE_UNALIGNED_WP;
+return false;
+}
+}
+
+index = offset / bs->bl.zone_size;
+if (BDRV_ZT_IS_CONV(bs->bl.wps->wp[index])) {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+return false;
+}
+
+if (len / 512 > bs->bl.max_append_sectors) {
+if (bs->bl.max_append_sectors == 0) {
+*status = VIRTIO_BLK_S_UNSUPP;
+} else {
+*status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+}
+return false;
+}
+}
+return true;
+}
+
+static void virtio_blk_zone_report_complete(void *opaque, int ret)
+{
+ZoneCmdData *data = opaque;
+VirtIOBlockReq *req = data->req;
+VirtIOBlock *s = req->dev;
+VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+struct iovec *in_iov = data->in_iov;
+unsigned in_num = data->in_num;
+int64_t zrp_size, n, j = 0;
+int64_t nz = data->zone_report_data.nr_zones;
+int8_t err_status = VIRTIO_BLK_S_OK;
+
+if (ret) {
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+struct virtio_blk_zone_report zrp_hdr = (struct virtio_blk_zone_report) {
+.nr_zones = cpu_to_le64(nz),
+};
+zrp_size = sizeof(struct virtio_blk_zone_report)
+   + sizeof(struct virtio_blk_zone_descriptor) * nz;
+n = iov_from_buf(in_iov, in_num, 0, &zrp_hdr, sizeof(zrp_hdr));
+if (n != sizeof(zrp_hdr)) {
+virtio_error(vdev, "Driver provided input buffer that is too small!");
+err_status = VIRTIO_BLK_S_ZONE_INVALID_CMD;
+goto out;
+}
+
+for (size_t i = sizeof(zrp_hdr); i < zrp_size;
+i += sizeof(struct virtio_blk_zone_descriptor), ++j) {
+struct virtio_blk_zone_descriptor desc =
+(struct virtio_blk_zone_descriptor) {
+

[RFC v6 3/4] block: add accounting for zone append operation

2023-01-29 Thread Sam Li
To account for the new zone append write operation on zoned devices, add a
BLOCK_ACCT_APPEND enum value alongside the other I/O request types (read,
write, flush).

Signed-off-by: Sam Li 
---
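Per-type accounting of this kind is typically a set of counters indexed by
the request type, so adding a type means adding one enum value and reporting
its counters. The sketch below illustrates the idea with hypothetical names
(AcctType, AcctStats, acct_done(), acct_failed()); the enum ordering is made
up and does not mirror include/block/accounting.h.

```c
#include <stdint.h>

enum AcctType {
    ACCT_READ,
    ACCT_WRITE,
    ACCT_APPEND,    /* zone append writes get their own slot */
    ACCT_FLUSH,
    ACCT_TYPE_MAX
};

typedef struct {
    uint64_t nr_bytes[ACCT_TYPE_MAX];
    uint64_t nr_ops[ACCT_TYPE_MAX];
    uint64_t failed_ops[ACCT_TYPE_MAX];
} AcctStats;

/* Record a successfully completed request of the given type. */
static void acct_done(AcctStats *s, enum AcctType type, uint64_t bytes)
{
    s->nr_bytes[type] += bytes;
    s->nr_ops[type]++;
}

/* Record a failed request of the given type. */
static void acct_failed(AcctStats *s, enum AcctType type)
{
    s->failed_ops[type]++;
}
```

Because every counter array is indexed by the type, append statistics stay
separate from ordinary write statistics without any further bookkeeping.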
 block/qapi-sysemu.c| 11 
 block/qapi.c   | 15 ++
 hw/block/virtio-blk.c  |  4 +++
 include/block/accounting.h |  1 +
 qapi/block-core.json   | 56 ++
 qapi/block.json|  4 +++
 6 files changed, 80 insertions(+), 11 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 7bd7554150..f7e56dfeb2 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -517,6 +517,7 @@ void qmp_block_latency_histogram_set(
 bool has_boundaries, uint64List *boundaries,
 bool has_boundaries_read, uint64List *boundaries_read,
 bool has_boundaries_write, uint64List *boundaries_write,
+bool has_boundaries_append, uint64List *boundaries_append,
 bool has_boundaries_flush, uint64List *boundaries_flush,
 Error **errp)
 {
@@ -557,6 +558,16 @@ void qmp_block_latency_histogram_set(
 }
 }
 
+if (has_boundaries || has_boundaries_append) {
+ret = block_latency_histogram_set(
+stats, BLOCK_ACCT_APPEND,
+has_boundaries_append ? boundaries_append : boundaries);
+if (ret) {
+error_setg(errp, "Device '%s' set append write boundaries fail", 
id);
+return;
+}
+}
+
 if (has_boundaries || has_boundaries_flush) {
 ret = block_latency_histogram_set(
 stats, BLOCK_ACCT_FLUSH,
diff --git a/block/qapi.c b/block/qapi.c
index 9b4da12966..0b37a21af7 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -424,27 +424,33 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_bytes = stats->nr_bytes[BLOCK_ACCT_WRITE];
+ds->zap_bytes = stats->nr_bytes[BLOCK_ACCT_APPEND];
 ds->unmap_bytes = stats->nr_bytes[BLOCK_ACCT_UNMAP];
 ds->rd_operations = stats->nr_ops[BLOCK_ACCT_READ];
 ds->wr_operations = stats->nr_ops[BLOCK_ACCT_WRITE];
+ds->zap_operations = stats->nr_ops[BLOCK_ACCT_APPEND];
 ds->unmap_operations = stats->nr_ops[BLOCK_ACCT_UNMAP];
 
 ds->failed_rd_operations = stats->failed_ops[BLOCK_ACCT_READ];
 ds->failed_wr_operations = stats->failed_ops[BLOCK_ACCT_WRITE];
+ds->failed_zap_operations = stats->failed_ops[BLOCK_ACCT_APPEND];
 ds->failed_flush_operations = stats->failed_ops[BLOCK_ACCT_FLUSH];
 ds->failed_unmap_operations = stats->failed_ops[BLOCK_ACCT_UNMAP];
 
 ds->invalid_rd_operations = stats->invalid_ops[BLOCK_ACCT_READ];
 ds->invalid_wr_operations = stats->invalid_ops[BLOCK_ACCT_WRITE];
+ds->invalid_zap_operations = stats->invalid_ops[BLOCK_ACCT_APPEND];
 ds->invalid_flush_operations =
 stats->invalid_ops[BLOCK_ACCT_FLUSH];
 ds->invalid_unmap_operations = stats->invalid_ops[BLOCK_ACCT_UNMAP];
 
 ds->rd_merged = stats->merged[BLOCK_ACCT_READ];
 ds->wr_merged = stats->merged[BLOCK_ACCT_WRITE];
+ds->zap_merged = stats->merged[BLOCK_ACCT_APPEND];
 ds->unmap_merged = stats->merged[BLOCK_ACCT_UNMAP];
 ds->flush_operations = stats->nr_ops[BLOCK_ACCT_FLUSH];
 ds->wr_total_time_ns = stats->total_time_ns[BLOCK_ACCT_WRITE];
+ds->zap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_APPEND];
 ds->rd_total_time_ns = stats->total_time_ns[BLOCK_ACCT_READ];
 ds->flush_total_time_ns = stats->total_time_ns[BLOCK_ACCT_FLUSH];
 ds->unmap_total_time_ns = stats->total_time_ns[BLOCK_ACCT_UNMAP];
@@ -462,6 +468,7 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 TimedAverage *rd = &ts->latency[BLOCK_ACCT_READ];
 TimedAverage *wr = &ts->latency[BLOCK_ACCT_WRITE];
+TimedAverage *zap = &ts->latency[BLOCK_ACCT_APPEND];
 TimedAverage *fl = &ts->latency[BLOCK_ACCT_FLUSH];
 
 dev_stats->interval_length = ts->interval_length;
@@ -474,6 +481,10 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 dev_stats->max_wr_latency_ns = timed_average_max(wr);
 dev_stats->avg_wr_latency_ns = timed_average_avg(wr);
 
+dev_stats->min_zap_latency_ns = timed_average_min(zap);
+dev_stats->max_zap_latency_ns = timed_average_max(zap);
+dev_stats->avg_zap_latency_ns = timed_average_avg(zap);
+
 dev_stats->min_flush_latency_ns = timed_average_min(fl);
 dev_stats->max_flush_latency_ns = timed_average_max(fl);
 dev_stats->avg_flush_latency_ns = timed_average_avg(fl);
@@ -482,6 +493,8 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
   

[RFC v6 1/4] include: update virtio_blk headers

2023-01-29 Thread Sam Li
Use scripts/update-linux-headers.sh to update virtio-blk headers
from Dmitry's "virtio-blk:add support for zoned block devices"
Linux patches.

Signed-off-by: Sam Li 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dmitry Fomichev 
---
 include/standard-headers/linux/virtio_blk.h | 158 ++--
 1 file changed, 142 insertions(+), 16 deletions(-)

diff --git a/include/standard-headers/linux/virtio_blk.h 
b/include/standard-headers/linux/virtio_blk.h
index 2dcc90826a..3744e4da1b 100644
--- a/include/standard-headers/linux/virtio_blk.h
+++ b/include/standard-headers/linux/virtio_blk.h
@@ -25,10 +25,10 @@
  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE. */
-#include "standard-headers/linux/types.h"
-#include "standard-headers/linux/virtio_ids.h"
-#include "standard-headers/linux/virtio_config.h"
-#include "standard-headers/linux/virtio_types.h"
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_types.h>
 
 /* Feature bits */
 #define VIRTIO_BLK_F_SIZE_MAX  1   /* Indicates maximum segment size */
@@ -40,6 +40,8 @@
 #define VIRTIO_BLK_F_MQ12  /* support more than one vq */
 #define VIRTIO_BLK_F_DISCARD   13  /* DISCARD is supported */
 #define VIRTIO_BLK_F_WRITE_ZEROES  14  /* WRITE ZEROES is supported */
+#define VIRTIO_BLK_F_SECURE_ERASE  16 /* Secure Erase is supported */
+#define VIRTIO_BLK_F_ZONED 17  /* Zoned block device */
 
 /* Legacy feature bits */
 #ifndef VIRTIO_BLK_NO_LEGACY
@@ -47,8 +49,10 @@
 #define VIRTIO_BLK_F_SCSI  7   /* Supports scsi command passthru */
 #define VIRTIO_BLK_F_FLUSH 9   /* Flush command supported */
 #define VIRTIO_BLK_F_CONFIG_WCE11  /* Writeback mode available in 
config */
+#ifndef __KERNEL__
 /* Old (deprecated) name for VIRTIO_BLK_F_FLUSH. */
 #define VIRTIO_BLK_F_WCE VIRTIO_BLK_F_FLUSH
+#endif
 #endif /* !VIRTIO_BLK_NO_LEGACY */
 
 #define VIRTIO_BLK_ID_BYTES20  /* ID string length */
@@ -63,8 +67,8 @@ struct virtio_blk_config {
/* geometry of the device (if VIRTIO_BLK_F_GEOMETRY) */
struct virtio_blk_geometry {
__virtio16 cylinders;
-   uint8_t heads;
-   uint8_t sectors;
+   __u8 heads;
+   __u8 sectors;
} geometry;
 
/* block size of device (if VIRTIO_BLK_F_BLK_SIZE) */
@@ -72,17 +76,17 @@ struct virtio_blk_config {
 
/* the next 4 entries are guarded by VIRTIO_BLK_F_TOPOLOGY  */
/* exponent for physical block per logical block. */
-   uint8_t physical_block_exp;
+   __u8 physical_block_exp;
/* alignment offset in logical blocks. */
-   uint8_t alignment_offset;
+   __u8 alignment_offset;
/* minimum I/O size without performance penalty in logical blocks. */
__virtio16 min_io_size;
/* optimal sustained I/O size in logical blocks. */
__virtio32 opt_io_size;
 
/* writeback mode (if VIRTIO_BLK_F_CONFIG_WCE) */
-   uint8_t wce;
-   uint8_t unused;
+   __u8 wce;
+   __u8 unused;
 
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__virtio16 num_queues;
@@ -116,10 +120,35 @@ struct virtio_blk_config {
 * Set if a VIRTIO_BLK_T_WRITE_ZEROES request may result in the
 * deallocation of one or more of the sectors.
 */
-   uint8_t write_zeroes_may_unmap;
+   __u8 write_zeroes_may_unmap;
 
-   uint8_t unused1[3];
-} QEMU_PACKED;
+   __u8 unused1[3];
+
+   /* the next 3 entries are guarded by VIRTIO_BLK_F_SECURE_ERASE */
+   /*
+* The maximum secure erase sectors (in 512-byte sectors) for
+* one segment.
+*/
+   __virtio32 max_secure_erase_sectors;
+   /*
+* The maximum number of secure erase segments in a
+* secure erase command.
+*/
+   __virtio32 max_secure_erase_seg;
+   /* Secure erase commands must be aligned to this number of sectors. */
+   __virtio32 secure_erase_sector_alignment;
+
+   /* Zoned block device characteristics (if VIRTIO_BLK_F_ZONED) */
+   struct virtio_blk_zoned_characteristics {
+   __virtio32 zone_sectors;
+   __virtio32 max_open_zones;
+   __virtio32 max_active_zones;
+   __virtio32 max_append_sectors;
+   __virtio32 write_granularity;
+   __u8 model;
+   __u8 unused2[3];
+   } zoned;
+} __attribute__((packed));
 
 /*
  * Command types
@@ -153,6 +182,30 @@ struct virtio_blk_config {
 /* Write zeroes command */
 #define VIRTIO_BLK_T_WRITE_ZEROES  13
 
+/* Secure erase command */
+#define VIRTIO_BLK_T_SECURE_ERASE  14
+
+/* Zone append command */
+#define VIRTIO_BLK_T_ZONE_APPEND   15
+
+/* Report zones command */
+#define VIRTIO_BLK_T_ZONE_REPORT   16
+
+/* Open 

[RFC v6 0/4] Add zoned storage emulation to virtio-blk driver

2023-01-29 Thread Sam Li
Note: the virtio-blk header changes aren't upstream in the kernel yet, so
this series is marked as an RFC. The VIRTIO spec changes have been merged.
The Linux virtio_blk guest driver patches are in Michael Tsirkin's vhost tree:
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/drivers/block/virtio_blk.c?h=vhost

v6:
- address Stefan's review comments
  * add accounting for zone append operation
  * fix in_iov usage in handle_request, error handling and typos

v5:
- address Stefan's review comments
  * restore the previous way of writing the zone append result to the buffer
  * fix an error checking case and other minor issues

v4:
- change how the zone append request result is written to the buffer
- change zone state, zone type value of virtio_blk_zone_descriptor
- add trace events for new zone APIs

v3:
- use qemuio_from_buffer to write status bit [Stefan]
- avoid using req->elem directly [Stefan]
- fix error checking and a memory leak [Stefan]

v2:
- change units of emulated zone ops to correspond to block layer APIs
- modify error checking cases [Stefan, Damien]

v1:
- add zoned storage emulation

Sam Li (4):
  include: update virtio_blk headers
  virtio-blk: add zoned storage emulation for zoned devices
  block: add accounting for zone append operation
  virtio-blk: add some trace events for zoned emulation

 block/qapi-sysemu.c |  11 +
 block/qapi.c|  15 +
 hw/block/trace-events   |   7 +
 hw/block/virtio-blk-common.c|   2 +
 hw/block/virtio-blk.c   | 410 
 include/block/accounting.h  |   1 +
 include/standard-headers/linux/virtio_blk.h | 158 +++-
 qapi/block-core.json|  56 ++-
 qapi/block.json |   4 +
 9 files changed, 637 insertions(+), 27 deletions(-)

-- 
2.38.1
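
For readers following the header update in patch 1, here is a minimal sketch
of how the new feature bits and the zoned configuration fields might be
consumed. It is not code from this series. The bit numbers and field names are
taken from the header diff; has_feature(), zoned_config_valid() and the
zoned_characteristics mirror struct are hypothetical names, and plain
fixed-width integers stand in for __virtio32/__u8 (no endianness conversion is
shown).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as they appear in the header diff. */
#define VIRTIO_BLK_F_SECURE_ERASE 16
#define VIRTIO_BLK_F_ZONED        17

/*
 * Illustrative mirror of the zoned fields added to struct virtio_blk_config.
 * Assumption: uint32_t/uint8_t replace __virtio32/__u8 for this sketch only.
 */
struct zoned_characteristics {
    uint32_t zone_sectors;
    uint32_t max_open_zones;
    uint32_t max_active_zones;
    uint32_t max_append_sectors;
    uint32_t write_granularity;
    uint8_t  model;
    uint8_t  unused2[3];
} __attribute__((packed));

/* Five 32-bit fields plus four bytes: the padding keeps the block 4-byte aligned. */
_Static_assert(sizeof(struct zoned_characteristics) == 24,
               "zoned characteristics should occupy 24 bytes");

/* Hypothetical helper: test a single bit in a 64-bit feature mask. */
static bool has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64);
    return (features >> bit) & 1;
}

/* Hypothetical helper: the zoned fields are only meaningful if ZONED was negotiated. */
static bool zoned_config_valid(uint64_t negotiated_features)
{
    return has_feature(negotiated_features, VIRTIO_BLK_F_ZONED);
}
```

A caller would pass the negotiated feature mask to zoned_config_valid() before
reading any of the zoned fields, mirroring the "(if VIRTIO_BLK_F_ZONED)" guard
in the header comment.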



