date:20220910

Re: [PATCH] bugfix:migrate with block-dirty-bitmap (disk size is big enough) can't be finished

2022-09-10 Thread Vladimir Sementsov-Ogievskiy


On 9/10/22 09:35, liuhaiwei wrote:

From: liuhaiwei 

Bug description: https://gitlab.com/qemu-project/qemu/-/issues/1203
Usually, we use precopy or postcopy mode to migrate the block dirty bitmap.
But if the block-dirty-bitmap size is larger than the threshold size, we can never enter
migration_completion in the migration_iteration_run function.
To solve this problem, if the pending size > threshold size, we can set the pending size
to a fake value (threshold-1 or 0) to tell the migration_iteration_run function to enter
migration_completion.




Actually, bitmaps migrate in postcopy. So, you should start postcopy for it to work (QMP
command migrate-start-postcopy). This command simply sets a boolean variable, so that in
migration_iteration_run() we'll move to postcopy when needed. So, you can issue this
command immediately after the migrate command, or even before it, but after setting the
"dirty-bitmaps" capability.

Faking the pending size is the wrong thing to do: it means the downtime will be
larger than expected.
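
The decision described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not QEMU's actual code: the enum, the function name and the split into "precopy" and "postcopy" pending sizes are hypothetical names chosen for the example. It shows why a large bitmap keeps the loop iterating until the postcopy flag is set, and why no fake pending value is needed once it is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of one pass of the migration loop. */
typedef enum {
    ITER_CONTINUE,        /* keep sending data, source keeps running */
    ITER_START_POSTCOPY,  /* switch to postcopy, target starts running */
    ITER_COMPLETE         /* pending data is small enough to finish */
} IterAction;

/*
 * Toy model of one loop pass.
 * pending_precopy:  bytes that must be sent before the switch-over.
 * pending_postcopy: bytes that may be sent after it (assumed here to be
 *                   where the dirty bitmaps are accounted).
 * postcopy_requested: the boolean set by the "start postcopy" request.
 */
static IterAction iteration_action(uint64_t pending_precopy,
                                   uint64_t pending_postcopy,
                                   uint64_t threshold,
                                   bool postcopy_requested)
{
    uint64_t pending = pending_precopy + pending_postcopy;

    if (pending != 0 && pending >= threshold) {
        /* Still a significant amount to transfer. */
        if (postcopy_requested && pending_precopy <= threshold) {
            return ITER_START_POSTCOPY;
        }
        return ITER_CONTINUE;
    }
    return ITER_COMPLETE;
}
```

In this model a bitmap of 1000 bytes against a threshold of 100 returns ITER_CONTINUE on every pass while the flag is clear, and ITER_START_POSTCOPY as soon as it is set.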

--
Best regards,
Vladimir

Re: [PATCH] bugfix:migrate with block-dirty-bitmap (disk size is big enough) can't be finished

2022-09-10 Thread Vladimir Sementsov-Ogievskiy


Hi!

On 9/10/22 13:47, Seaway Liu(刘海伟) wrote:

Hi, I have a question:
if a migration in postcopy mode fails, is there some way to restore the
memory data back to the source VM?




As far as I understand, no, there is not.

Postcopy having started actually means the target has started. So RAM is touched by
the target VM process, and there is no way to roll back.

Still, things are not so bad: when you enable the dirty-bitmaps capability, but not the
postcopy-ram capability, RAM is migrated in precopy as usual. So, when the target
starts, the only thing not yet migrated is the dirty bitmap. So, in the worst
case (migration failure after postcopy started) you'll lose your dirty bitmap.
The VM is migrated and running normally on the target. Unfinished bitmaps on the target are
automatically released (see cancel_incoming_locked()). So, in the worst case you'll
have to start your incremental backup chain from a new full backup.




--
Best regards,
Vladimir

Re: [PATCH] bugfix:migrate with block-dirty-bitmap (disk size is big enough) can't be finished

2022-09-10 Thread 刘海伟

Hi, I have a question:
if a migration in postcopy mode fails, is there some way to restore the
memory data back to the source VM?

Sent from my Xiaomi
On September 10, 2022 at 6:18 PM, Vladimir Sementsov-Ogievskiy wrote:

On 9/10/22 09:35, liuhaiwei wrote:
> From: liuhaiwei 
>
> Bug description: https://gitlab.com/qemu-project/qemu/-/issues/1203
> Usually, we use precopy or postcopy mode to migrate the block dirty bitmap.
> But if the block-dirty-bitmap size is larger than the threshold size, we can never enter
> migration_completion in the migration_iteration_run function.
> To solve this problem, if the pending size > threshold size, we can set the pending size
> to a fake value (threshold-1 or 0) to tell the migration_iteration_run function to enter
> migration_completion.
>

Actually, bitmaps migrate in postcopy. So, you should start postcopy for it to
work (QMP command migrate-start-postcopy). This command simply sets a boolean
variable, so that in migration_iteration_run() we'll move to postcopy when
needed. So, you can issue this command immediately after the migrate command, or
even before it, but after setting the "dirty-bitmaps" capability.

Faking the pending size is the wrong thing to do: it means the downtime will be
larger than expected.

--
Best regards,
Vladimir

Re:Re: [PATCH] bugfix:migrate with block-dirty-bitmap (disk size is big enough) can't be finished

2022-09-10 Thread liuhaiwei9699




Sometimes postcopy mode is not the best choice. For instance, suppose the
migration takes ten minutes, but the network is interrupted during that
time.
If that happens, the VM's memory data will be split into two parts and
cannot be rolled back. This is a bad situation.


So migrate-start-postcopy will not be used in conservative scenarios. In that
case, a migration with a block dirty bitmap may never finish.


The migration of the block dirty bitmap should not depend on postcopy or
precopy mode.


At 2022-09-10 18:58:12, "Vladimir Sementsov-Ogievskiy" wrote:
>Hi!
>
>On 9/10/22 13:47, Seaway Liu(刘海伟) wrote:
>> Hi, I have a question:
>> if a migration in postcopy mode fails, is there some way to restore the
>> memory data back to the source VM?
>> 
>
>
>As far as I understand, no, there is not.
>
>Postcopy having started actually means the target has started. So RAM is touched by
>the target VM process, and there is no way to roll back.
>
>Still, things are not so bad: when you enable the dirty-bitmaps capability, but
>not the postcopy-ram capability, RAM is migrated in precopy as usual. So, when
>the target starts, the only thing not yet migrated is the dirty bitmap. So,
>in the worst case (migration failure after postcopy started) you'll lose your
>dirty bitmap. The VM is migrated and running normally on the target. Unfinished
>bitmaps on the target are automatically released (see cancel_incoming_locked()).
>So, in the worst case you'll have to start your incremental backup chain from a
>new full backup.
>
>> 
>> 
>> Sent from my Xiaomi
>> On September 10, 2022 at 6:18 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 
>> On 9/10/22 09:35, liuhaiwei wrote:
>>> From: liuhaiwei 
>>>
>>> Bug description: https://gitlab.com/qemu-project/qemu/-/issues/1203
>>> Usually, we use precopy or postcopy mode to migrate the block dirty bitmap.
>>> But if the block-dirty-bitmap size is larger than the threshold size, we can never enter
>>> migration_completion in the migration_iteration_run function.
>>> To solve this problem, if the pending size > threshold size, we can set the pending size
>>> to a fake value (threshold-1 or 0) to tell the migration_iteration_run function to enter
>>> migration_completion.
>>>
>> 
>> 
>> Actually, bitmaps migrate in postcopy. So, you should start postcopy for it
>> to work (QMP command migrate-start-postcopy). This command simply sets a
>> boolean variable, so that in migration_iteration_run() we'll move to
>> postcopy when needed. So, you can issue this command immediately after the
>> migrate command, or even before it, but after setting the "dirty-bitmaps"
>> capability.
>>
>> Faking the pending size is the wrong thing to do: it means the downtime will
>> be larger than expected.
>> 
>> --
>> Best regards,
>> Vladimir
>
>
>-- 
>Best regards,
>Vladimir

Re: [PATCH v9 2/7] file-posix: introduce helper functions for sysfs attributes

2022-09-10 Thread Damien Le Moal

On 2022/09/10 14:27, Sam Li wrote:
> Use get_sysfs_str_val() to get the string value of device
> zoned model. Then get_sysfs_zoned_model() can convert it to
> BlockZoneModel type of QEMU.
> 
> Use get_sysfs_long_val() to get the long value of zoned device
> information.
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Hannes Reinecke 
> Reviewed-by: Stefan Hajnoczi 

Looks good to me.

Reviewed-by: Damien Le Moal 
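
For readers following along, the two steps the commit message describes (read the attribute as a string, then convert it) can be sketched in plain C. This is an illustrative stand-in, not the patch's code: it uses only the C standard library instead of GLib and qemu_strtol(), and both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Remove one trailing '\n'; sysfs attribute files end with a newline. */
static void strip_trailing_newline(char *s)
{
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    if (len > 0 && s[len - 1] == '\n') {
        s[len - 1] = '\0';
    }
}

/*
 * Parse a base-10 long from an already stripped string.
 * Returns 0 and stores the value in *out on success,
 * -EINVAL for empty or trailing-garbage input, -ERANGE on overflow.
 */
static int parse_long(const char *s, long *out)
{
    char *end;
    long v;

    assert(s != NULL && out != NULL);
    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0') {
        return -EINVAL;
    }
    if (errno == ERANGE) {
        return -ERANGE;
    }
    *out = v;
    return 0;
}
```

The length check in strip_trailing_newline() means an empty string is left untouched rather than indexed at position -1.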

> ---
>  block/file-posix.c   | 121 ++-
>  include/block/block_int-common.h |   3 +
>  2 files changed, 88 insertions(+), 36 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 48cd096624..0a8b4b426e 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1210,66 +1210,109 @@ static int hdev_get_max_hw_transfer(int fd, struct 
> stat *st)
>  #endif
>  }
>  
> -static int hdev_get_max_segments(int fd, struct stat *st)
> -{
> +/*
> + * Get a sysfs attribute value as character string.
> + */
> +static int get_sysfs_str_val(struct stat *st, const char *attribute,
> + char **val) {
>  #ifdef CONFIG_LINUX
> -char buf[32];
> -const char *end;
> -char *sysfspath = NULL;
> +g_autofree char *sysfspath = NULL;
>  int ret;
> -int sysfd = -1;
> -long max_segments;
> +size_t len;
>  
> -if (S_ISCHR(st->st_mode)) {
> -if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
> -return ret;
> -}
> +if (!S_ISBLK(st->st_mode)) {
>  return -ENOTSUP;
>  }
>  
> -if (!S_ISBLK(st->st_mode)) {
> -return -ENOTSUP;
> +sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/%s",
> +major(st->st_rdev), minor(st->st_rdev),
> +attribute);
> +ret = g_file_get_contents(sysfspath, val, &len, NULL);
> +if (ret == -1) {
> +return -ENOENT;
>  }
>  
> -sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> -major(st->st_rdev), minor(st->st_rdev));
> -sysfd = open(sysfspath, O_RDONLY);
> -if (sysfd == -1) {
> -ret = -errno;
> -goto out;
> +/* The file is ended with '\n' */
> +char *p;
> +p = *val;
> +if (*(p + len - 1) == '\n') {
> +*(p + len - 1) = '\0';
>  }
> -do {
> -ret = read(sysfd, buf, sizeof(buf) - 1);
> -} while (ret == -1 && errno == EINTR);
> +return ret;
> +#else
> +return -ENOTSUP;
> +#endif
> +}
> +
> +static int get_sysfs_zoned_model(struct stat *st, BlockZoneModel *zoned) {
> +g_autofree char *val = NULL;
> +int ret;
> +
> +ret = get_sysfs_str_val(st, "zoned", &val);
>  if (ret < 0) {
> -ret = -errno;
> -goto out;
> -} else if (ret == 0) {
> -ret = -EIO;
> -goto out;
> +return ret;
>  }
> -buf[ret] = 0;
> -/* The file is ended with '\n', pass 'end' to accept that. */
> -ret = qemu_strtol(buf, &end, 10, &max_segments);
> -if (ret == 0 && end && *end == '\n') {
> -ret = max_segments;
> +
> +if (strcmp(val, "host-managed") == 0) {
> +*zoned = BLK_Z_HM;
> +} else if (strcmp(val, "host-aware") == 0) {
> +*zoned = BLK_Z_HA;
> +} else if (strcmp(val, "none") == 0) {
> +*zoned = BLK_Z_NONE;
> +} else {
> +return -ENOTSUP;
>  }
> +return 0;
> +}
>  
> -out:
> -if (sysfd != -1) {
> -close(sysfd);
> +/*
> + * Get a sysfs attribute value as a long integer.
> + */
> +static long get_sysfs_long_val(struct stat *st, const char *attribute) {
> +#ifdef CONFIG_LINUX
> +g_autofree char *str = NULL;
> +const char *end;
> +long val;
> +int ret;
> +
> +ret = get_sysfs_str_val(st, attribute, &str);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +/* The file is ended with '\n', pass 'end' to accept that. */
> +ret = qemu_strtol(str, &end, 10, &val);
> +if (ret == 0 && end && *end == '\0') {
> +ret = val;
>  }
> -g_free(sysfspath);
>  return ret;
>  #else
>  return -ENOTSUP;
>  #endif
>  }
>  
> +static int hdev_get_max_segments(int fd, struct stat *st) {
> +#ifdef CONFIG_LINUX
> +int ret;
> +
> +if (S_ISCHR(st->st_mode)) {
> +if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
> +return ret;
> +}
> +return -ENOTSUP;
> +}
> +return get_sysfs_long_val(st, "max_segments");
> +#else
> +return -ENOTSUP;
> +#endif
> +}
> +
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
>  struct stat st;
> +int ret;
> +BlockZoneModel zoned;
>  
>  s->needs_alignment = raw_needs_alignment(bs);
>  raw_probe_alignment(bs, s->fd, errp);
> @@ -1307,6 +1350,12 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  bs->bl.max_hw_iov = ret;
>  }
>  }
> +
> +ret = get_sysf

Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-10 Thread Damien Le Moal

On 2022/09/10 14:27, Sam Li wrote:
[...]
> +/*
> + * Send a zone_report command.
> + * offset is a byte offset from the start of the device. No alignment
> + * required for offset.
> + * nr_zones represents IN maximum and OUT actual.
> + */
> +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> +unsigned int *nr_zones,
> +BlockZoneDescriptor *zones)
> +{
> +int ret;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk); /* increase before waiting */
> +blk_wait_while_drained(blk);
> +if (!blk_is_available(blk)) {
> +blk_dec_in_flight(blk);
> +return -ENOMEDIUM;
> +}
> +ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
> +/*
> + * Send a zone_management command.
> + * op is the zone operation;
> + * offset is the byte offset from the start of the zoned device;
> + * len is the maximum number of bytes the command should operate on. It
> + * should be aligned with the zone sector size.

This should read:

* offset is the byte offset of the start of the first zone to operate on;
* len is the maximum number of bytes the command should operate on. It
* should be aligned with the device zone size.

No ?
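
A caller-side check matching the proposed wording might look like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, all zones are assumed to have the same size, and a smaller last zone is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if offset is the start of a zone and len covers a whole
 * number of zones. All quantities are in bytes.
 */
static bool zone_mgmt_range_ok(int64_t offset, int64_t len,
                               int64_t zone_size_bytes)
{
    assert(zone_size_bytes > 0);
    if (offset < 0 || len <= 0) {
        return false;
    }
    return offset % zone_size_bytes == 0 && len % zone_size_bytes == 0;
}
```

With 256 MiB zones, an offset of 4096 is rejected because it does not point at the start of a zone, while an offset of two zone sizes with a length of three zone sizes is accepted.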

> + */
> +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> +int64_t offset, int64_t len)
> +{
> +int ret;
> +IO_CODE();
> +
> +
> +blk_inc_in_flight(blk);
> +blk_wait_while_drained(blk);
> +
> +ret = blk_check_byte_request(blk, offset, len);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>  BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 0a8b4b426e..4edfa25d04 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -67,6 +67,9 @@
>  #include 
>  #include 
>  #include 
> +#if defined(CONFIG_BLKZONED)
> +#include 
> +#endif
>  #include 
>  #include 
>  #include 
> @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
>  PreallocMode prealloc;
>  Error **errp;
>  } truncate;
> +struct {
> +unsigned int *nr_zones;
> +BlockZoneDescriptor *zones;
> +} zone_report;
> +struct {
> +unsigned long zone_op;
> +const char *zone_op_name;
> +bool all;
> +} zone_mgmt;
>  };
>  } RawPosixAIOData;
>  
> @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  #endif
>  
>  if (bs->sg || S_ISBLK(st.st_mode)) {
> -int ret = hdev_get_max_hw_transfer(s->fd, &st);
> +ret = hdev_get_max_hw_transfer(s->fd, &st);
>  
>  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
>  bs->bl.max_hw_transfer = ret;
> @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  zoned = BLK_Z_NONE;
>  }
>  bs->bl.zoned = zoned;
> +if (zoned != BLK_Z_NONE) {
> +ret = get_sysfs_long_val(&st, "chunk_sectors");
> +if (ret > 0) {
> +bs->bl.zone_sectors = ret;
> +}

It may be good to check that we are getting a valid zone size here. So maybe
change the check to something like this?

if (ret <= 0) {
*** print some error message mentioning the invalid zone size ***
bs->bl.zoned = BLK_Z_NONE;
return;
}
bs->bl.zone_sectors = ret;
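
Spelled out as compilable C, the suggestion could look roughly like this. It is only a sketch: the struct, the enum and the function name are hypothetical stand-ins for the block-layer limits, and the error is printed to stderr rather than reported through QEMU's error API.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the zoned model and the block limits. */
typedef enum { LIM_Z_NONE, LIM_Z_HA, LIM_Z_HM } LimZoneModel;

typedef struct {
    LimZoneModel zoned;
    long zone_sectors;
} SketchLimits;

/*
 * Record the zone size read from "chunk_sectors". An invalid value
 * (zero or a negative errno) downgrades the device to non-zoned.
 * Returns 0 on success, -EINVAL when the zone size is invalid.
 */
static int apply_zone_size(SketchLimits *bl, long chunk_sectors)
{
    assert(bl != NULL);
    if (chunk_sectors <= 0) {
        fprintf(stderr, "invalid zone size: %ld sectors\n", chunk_sectors);
        bl->zoned = LIM_Z_NONE;
        return -EINVAL;
    }
    bl->zone_sectors = chunk_sectors;
    return 0;
}
```

Returning early on the invalid case, as suggested, keeps the later zone limits from being filled in for a device whose zone size is unknown.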

> +
> +ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> +if (ret > 0) {
> +bs->bl.max_append_sectors = ret / 512;
> +}
> +
> +ret = get_sysfs_long_val(&st, "max_open_zones");
> +if (ret >= 0) {
> +bs->bl.max_open_zones = ret;
> +}
> +
> +ret = get_sysfs_long_val(&st, "max_active_zones");
> +if (ret >= 0) {
> +bs->bl.max_active_zones = ret;
> +}
> +}
>  }
>  
>  static int check_for_dasd(int fd)
> @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t 
> *in_off, int out_fd,
>  }
>  #endif
>  
> +/*
> + * parse_zone - Fill a zone descriptor
> + */
> +#if defined(CONFIG_BLKZONED)
> +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> +  const struct blk_zone *blkz) {
> +zone->start = blkz->start;
> +zone->length = blkz->len;
> +zone->cap = blkz->capacity;
> +zone->wp = blkz->wp;
> +
> +switch (blkz->type) {
> +case BLK_ZONE_TYPE_SEQWRITE_REQ:
> +zone->type = BLK_ZT_SWR;
> +break;
> +case BLK_ZONE_TYPE_SEQWRITE_PREF:
> +zone->type = BLK_ZT_SWP;
> +break;
> +case BLK_ZONE_TYPE_CONVENTIONAL:
> +zone->type = BLK_ZT_CONV;
> +break;
> +default:
> +g_

Re: [PATCH v9 5/7] config: add check to block layer

2022-09-10 Thread Damien Le Moal

On 2022/09/10 14:27, Sam Li wrote:
> Putting zoned/non-zoned BlockDrivers on top of each other is not
> allowed.
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  block.c  | 14 ++
>  block/file-posix.c   | 14 ++
>  block/raw-format.c   |  1 +
>  include/block/block_int-common.h |  5 +
>  4 files changed, 34 insertions(+)
> 
> diff --git a/block.c b/block.c
> index bc85f46eed..dad2ed3959 100644
> --- a/block.c
> +++ b/block.c
> @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
> BlockDriverState *child_bs,
>  return;
>  }
>  
> +/*
> + * Non-zoned block drivers do not follow zoned storage constraints
> + * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
> + * drivers in a graph.
> + */
> +if (!parent_bs->drv->supports_zoned_children &&
> +child_bs->bl.zoned == BLK_Z_HM) {

Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ?
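
The difference between the two conditions shows up for host-aware devices. The sketch below compares them side by side; the enum and both function names are hypothetical and only model the three zoned models mentioned in this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three zoned models. */
typedef enum { ZM_NONE, ZM_HOST_AWARE, ZM_HOST_MANAGED } ZoneModel;

/* Check as posted: only host-managed children are refused. */
static bool child_allowed_hm_only(bool parent_supports_zoned, ZoneModel child)
{
    return parent_supports_zoned || child != ZM_HOST_MANAGED;
}

/* Check as suggested in review: any zoned child is refused. */
static bool child_allowed(bool parent_supports_zoned, ZoneModel child)
{
    return parent_supports_zoned || child == ZM_NONE;
}
```

With a parent that does not support zoned children, the first variant lets a host-aware child through, while the second refuses it.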

> +error_setg(errp, "Cannot add a %s child to a %s parent",
> +   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
> +   parent_bs->drv->supports_zoned_children ?
> +   "support zoned children" : "not support zoned children");
> +return;
> +}
> +
>  if (!QLIST_EMPTY(&child_bs->parents)) {
>  error_setg(errp, "The node %s already has a parent",
> child_bs->node_name);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 4edfa25d04..354de22860 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict 
> *options,
>  goto fail;
>  }
>  }
> +#ifdef CONFIG_BLKZONED
> +/*
> + * The kernel page chache does not reliably work for writes to SWR zones
> + * of zoned block device because it can not guarantee the order of 
> writes.
> + */
> +if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
> +if (!(s->open_flags & O_DIRECT)) {
> +error_setg(errp, "driver=zoned_host_device was specified, but it 
> "
> + "requires cache.direct=on, which was not 
> specified.");
> +ret = -EINVAL;

This line is not needed. Simply "return -EINVAL;".

> +return ret; /* No host kernel page cache */
> +}
> +}
> +#endif
>  
>  if (S_ISBLK(st.st_mode)) {
>  #ifdef BLKDISCARDZEROES
> diff --git a/block/raw-format.c b/block/raw-format.c
> index 6b20bd22ef..9441536819 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, 
> BdrvChild *c,
>  BlockDriver bdrv_raw = {
>  .format_name  = "raw",
>  .instance_size= sizeof(BDRVRawState),
> +.supports_zoned_children = true,
>  .bdrv_probe   = &raw_probe,
>  .bdrv_reopen_prepare  = &raw_reopen_prepare,
>  .bdrv_reopen_commit   = &raw_reopen_commit,
> diff --git a/include/block/block_int-common.h 
> b/include/block/block_int-common.h
> index 078ddd7e67..043aa161a0 100644
> --- a/include/block/block_int-common.h
> +++ b/include/block/block_int-common.h
> @@ -127,6 +127,11 @@ struct BlockDriver {
>   */
>  bool is_format;
>  
> +/*
> + * Set to true if the BlockDriver supports zoned children.
> + */
> +bool supports_zoned_children;
> +
>  /*
>   * Drivers not implementing bdrv_parse_filename nor bdrv_open should have
>   * this field set to true, except ones that are defined only by their

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH v9 4/7] raw-format: add zone operations to pass through requests

2022-09-10 Thread Damien Le Moal

On 2022/09/10 14:27, Sam Li wrote:
> raw-format driver usually sits on top of file-posix driver. It needs to
> pass through requests of zone commands.
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Stefan Hajnoczi 

Reviewed-by: Damien Le Moal 

> ---
>  block/raw-format.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/block/raw-format.c b/block/raw-format.c
> index 69fd650eaf..6b20bd22ef 100644
> --- a/block/raw-format.c
> +++ b/block/raw-format.c
> @@ -314,6 +314,17 @@ static int coroutine_fn raw_co_pdiscard(BlockDriverState 
> *bs,
>  return bdrv_co_pdiscard(bs->file, offset, bytes);
>  }
>  
> +static int coroutine_fn raw_co_zone_report(BlockDriverState *bs, int64_t 
> offset,
> +   unsigned int *nr_zones,
> +   BlockZoneDescriptor *zones) {
> +return bdrv_co_zone_report(bs->file->bs, offset, nr_zones, zones);
> +}
> +
> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
> op,
> + int64_t offset, int64_t len) {
> +return bdrv_co_zone_mgmt(bs->file->bs, op, offset, len);
> +}
> +
>  static int64_t raw_getlength(BlockDriverState *bs)
>  {
>  int64_t len;
> @@ -614,6 +625,8 @@ BlockDriver bdrv_raw = {
>  .bdrv_co_pwritev  = &raw_co_pwritev,
>  .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
>  .bdrv_co_pdiscard = &raw_co_pdiscard,
> +.bdrv_co_zone_report  = &raw_co_zone_report,
> +.bdrv_co_zone_mgmt  = &raw_co_zone_mgmt,
>  .bdrv_co_block_status = &raw_co_block_status,
>  .bdrv_co_copy_range_from = &raw_co_copy_range_from,
>  .bdrv_co_copy_range_to  = &raw_co_copy_range_to,

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH v9 7/7] docs/zoned-storage: add zoned device documentation

2022-09-10 Thread Damien Le Moal

On 2022/09/10 14:27, Sam Li wrote:
> Add the documentation about the zoned device support to virtio-blk
> emulation.
> 
> Signed-off-by: Sam Li 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  docs/devel/zoned-storage.rst   | 41 ++
>  docs/system/qemu-block-drivers.rst.inc |  6 
>  2 files changed, 47 insertions(+)
>  create mode 100644 docs/devel/zoned-storage.rst
> 
> diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
> new file mode 100644
> index 00..ead2d149cc
> --- /dev/null
> +++ b/docs/devel/zoned-storage.rst
> @@ -0,0 +1,41 @@
> +=============
> +zoned-storage
> +=============
> +
> +Zoned Block Devices (ZBDs) devide the LBA space into block regions called 
> zones
> +that are larger than the LBA size. It can only allow sequential writes, which

s/It/They

> +reduces write amplification in SSDs, leading to higher throughput and 
> increased
> +capacity. More details about ZBDs can be found at:

I would rephrase this like this, to be less assertive about the potential
benefits (as they depend on the vendor implementation):

..., which can reduce write amplification in SSDs, and potentially lead to
higher throughput and increased device capacity.

> +
> +https://zonedstorage.io/docs/introduction/zoned-storage
> +
> +1. Block layer APIs for zoned storage
> +-------------------------------------
> +QEMU block layer has three zoned storage model:
> +- BLK_Z_HM: This model only allows sequential writes access. It supports a 
> set
> +of ZBD-specific I/O request that used by the host to manage device zones.
> +- BLK_Z_HA: It deals with both sequential writes and random writes access.
> +- BLK_Z_NONE: Regular block devices and drive-managed ZBDs are treated as
> +non-zoned devices.
> +
> +The block device information resides inside BlockDriverState. QEMU uses
> +BlockLimits struct(BlockDriverState::bl) that is continuously accessed by the
> +block layer while processing I/O requests. A BlockBackend has a root pointer 
> to
> +a BlockDriverState graph(for example, raw format on top of file-posix). The
> +zoned storage information can be propagated from the leaf BlockDriverState 
> all
> +the way up to the BlockBackend. If the zoned storage model in file-posix is
> +set to BLK_Z_HM, then block drivers will declare support for zoned host 
> device.
> +
> +The block layer APIs support commands needed for zoned storage devices,
> +including report zones, four zone operations, and zone append.
> +
> +2. Emulating zoned storage controllers
> +--------------------------------------
> +When the BlockBackend's BlockLimits model reports a zoned storage device, 
> users
> +like the virtio-blk emulation or the qemu-io-cmds.c utility can use block 
> layer
> +APIs for zoned storage emulation or testing.
> +
> +For example, the command line for zone report testing a null_blk device of
> +qemu-io-cmds.c is:
> +$ path/to/qemu-io --image-opts driver=zoned_host_device,filename=/dev/nullb0 
> -c
> +"zrp offset nr_zones"
> diff --git a/docs/system/qemu-block-drivers.rst.inc 
> b/docs/system/qemu-block-drivers.rst.inc
> index dfe5d2293d..0b97227fd9 100644
> --- a/docs/system/qemu-block-drivers.rst.inc
> +++ b/docs/system/qemu-block-drivers.rst.inc
> @@ -430,6 +430,12 @@ Hard disks
>you may corrupt your host data (use the ``-snapshot`` command
>line option or modify the device permissions accordingly).
>  
> +Zoned block devices
> +  Zoned block devices can be passed through to the guest if the emulated 
> storage
> +  controller supports zoned storage. Use ``--blockdev zoned_host_device,
> +  node-name=drive0,filename=/dev/nullb0`` to pass through ``/dev/nullb0``
> +  as ``drive0``.
> +
>  Windows
>  ^^^^^^^
>  

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH] block: introduce zone append write for zoned devices

2022-09-10 Thread Damien Le Moal

On 2022/09/10 15:38, Sam Li wrote:
> A zone append command is a write operation that specifies the first
> logical block of a zone as the write position. When writing to a zoned
> block device using zone append, the byte offset of the write is pointing
> to the write pointer of that zone. Upon completion the device will
> respond with the position the data has been placed in the zone.

s/placed/written

You need to explain more about what this patch does:

Since Linux does not provide a user API to issue zone append operations to zoned
devices from user space, the file-posix driver is modified to add zone append
emulation using regular write operations. To do this, the file-posix driver
tracks the wp location of all zones of the device Blah.

> 
> Signed-off-by: Sam Li 
> ---
>  block/block-backend.c  |  65 +++
>  block/file-posix.c | 169 -
>  block/io.c |  21 
>  block/raw-format.c |   7 ++
>  include/block/block-common.h   |   2 +
>  include/block/block-io.h   |   3 +
>  include/block/block_int-common.h   |   9 ++
>  include/block/raw-aio.h|   4 +-
>  include/sysemu/block-backend-io.h  |   9 ++
>  qemu-io-cmds.c |  62 +++
>  tests/qemu-iotests/tests/zoned.out |   7 ++
>  tests/qemu-iotests/tests/zoned.sh  |   9 ++
>  12 files changed, 360 insertions(+), 7 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index ebe8d7bdf3..b77a1cb24b 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1439,6 +1439,9 @@ typedef struct BlkRwCo {
>  struct {
>  BlockZoneOp op;
>  } zone_mgmt;
> +struct {
> +int64_t *append_sector;
> +} zone_append;
>  };
>  } BlkRwCo;
>  
> @@ -1869,6 +1872,47 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
> BlockZoneOp op,
>  return &acb->common;
>  }
>  
> +static void blk_aio_zone_append_entry(void *opaque) {
> +BlkAioEmAIOCB *acb = opaque;
> +BlkRwCo *rwco = &acb->rwco;
> +
> +rwco->ret = blk_co_zone_append(rwco->blk, 
> rwco->zone_append.append_sector,
> +   rwco->iobuf, rwco->flags);
> +blk_aio_complete(acb);
> +}
> +
> +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
> +QEMUIOVector *qiov, BdrvRequestFlags flags,
> +BlockCompletionFunc *cb, void *opaque) {
> +BlkAioEmAIOCB *acb;
> +Coroutine *co;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
> +acb->rwco = (BlkRwCo) {
> +.blk= blk,
> +.ret= NOT_DONE,
> +.flags  = flags,
> +.iobuf  = qiov,
> +.zone_append = {
> +.append_sector = offset,
> +},
> +};
> +acb->has_returned = false;
> +
> +co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
> +bdrv_coroutine_enter(blk_bs(blk), co);
> +
> +acb->has_returned = true;
> +if (acb->rwco.ret != NOT_DONE) {
> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
> + blk_aio_complete_bh, acb);
> +}
> +
> +return &acb->common;
> +}
> +
>  /*
>   * Send a zone_report command.
>   * offset is a byte offset from the start of the device. No alignment
> @@ -1920,6 +1964,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
> BlockZoneOp op,
>  return ret;
>  }
>  
> +/*
> + * Send a zone_append command.
> + */
> +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
> +QEMUIOVector *qiov, BdrvRequestFlags flags)
> +{
> +int ret;
> +IO_CODE();
> +
> +blk_inc_in_flight(blk);
> +blk_wait_while_drained(blk);
> +if (!blk_is_available(blk)) {
> +blk_dec_in_flight(blk);
> +return -ENOMEDIUM;
> +}
> +
> +ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
> +blk_dec_in_flight(blk);
> +return ret;
> +}
> +
>  void blk_drain(BlockBackend *blk)
>  {
>  BlockDriverState *bs = blk_bs(blk);
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 354de22860..65500e43f4 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -173,6 +173,7 @@ typedef struct BDRVRawState {
>  } stats;
>  
>  PRManager *pr_mgr;
> +CoRwlock zones_lock;
>  } BDRVRawState;
>  
>  typedef struct BDRVRawReopenState {
> @@ -206,6 +207,8 @@ typedef struct RawPosixAIOData {
>  struct {
>  struct iovec *iov;
>  int niov;
> +int64_t *append_sector;
> +BlockZoneDescriptor *zone;
>  } io;
>  struct {
>  uint64_t cmd;
> @@ -1333,6 +1336,9 @@ static int hdev_get_max_segments(int fd, struct stat 
> *st) {
>  #endif
>  }
>  
> +static inline void parse_zone(struct

Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-10 Thread Sam Li

On Sun, Sep 11, 2022 at 13:31, Damien Le Moal wrote:
>
> On 2022/09/10 14:27, Sam Li wrote:
> [...]
> > +/*
> > + * Send a zone_report command.
> > + * offset is a byte offset from the start of the device. No alignment
> > + * required for offset.
> > + * nr_zones represents IN maximum and OUT actual.
> > + */
> > +int coroutine_fn blk_co_zone_report(BlockBackend *blk, int64_t offset,
> > +unsigned int *nr_zones,
> > +BlockZoneDescriptor *zones)
> > +{
> > +int ret;
> > +IO_CODE();
> > +
> > +blk_inc_in_flight(blk); /* increase before waiting */
> > +blk_wait_while_drained(blk);
> > +if (!blk_is_available(blk)) {
> > +blk_dec_in_flight(blk);
> > +return -ENOMEDIUM;
> > +}
> > +ret = bdrv_co_zone_report(blk_bs(blk), offset, nr_zones, zones);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> > +/*
> > + * Send a zone_management command.
> > + * op is the zone operation;
> > + * offset is the byte offset from the start of the zoned device;
> > + * len is the maximum number of bytes the command should operate on. It
> > + * should be aligned with the zone sector size.
>
> This should read:
>
> * offset is the byte offset of the start of the first zone to operate on;
> * len is the maximum number of bytes the command should operate on. It
> * should be aligned with the device zone size.
>
> No ?

Right. By "zone sector size" I meant the zone size expressed in units of
512-byte sectors.

>
> > + */
> > +int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, BlockZoneOp op,
> > +int64_t offset, int64_t len)
> > +{
> > +int ret;
> > +IO_CODE();
> > +
> > +
> > +blk_inc_in_flight(blk);
> > +blk_wait_while_drained(blk);
> > +
> > +ret = blk_check_byte_request(blk, offset, len);
> > +if (ret < 0) {
> > +return ret;
> > +}
> > +
> > +ret = bdrv_co_zone_mgmt(blk_bs(blk), op, offset, len);
> > +blk_dec_in_flight(blk);
> > +return ret;
> > +}
> > +
> >  void blk_drain(BlockBackend *blk)
> >  {
> >  BlockDriverState *bs = blk_bs(blk);
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 0a8b4b426e..4edfa25d04 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -67,6 +67,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#if defined(CONFIG_BLKZONED)
> > +#include 
> > +#endif
> >  #include 
> >  #include 
> >  #include 
> > @@ -216,6 +219,15 @@ typedef struct RawPosixAIOData {
> >  PreallocMode prealloc;
> >  Error **errp;
> >  } truncate;
> > +struct {
> > +unsigned int *nr_zones;
> > +BlockZoneDescriptor *zones;
> > +} zone_report;
> > +struct {
> > +unsigned long zone_op;
> > +const char *zone_op_name;
> > +bool all;
> > +} zone_mgmt;
> >  };
> >  } RawPosixAIOData;
> >
> > @@ -1339,7 +1351,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> > Error **errp)
> >  #endif
> >
> >  if (bs->sg || S_ISBLK(st.st_mode)) {
> > -int ret = hdev_get_max_hw_transfer(s->fd, &st);
> > +ret = hdev_get_max_hw_transfer(s->fd, &st);
> >
> >  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> >  bs->bl.max_hw_transfer = ret;
> > @@ -1356,6 +1368,27 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> > Error **errp)
> >  zoned = BLK_Z_NONE;
> >  }
> >  bs->bl.zoned = zoned;
> > +if (zoned != BLK_Z_NONE) {
> > +ret = get_sysfs_long_val(&st, "chunk_sectors");
> > +if (ret > 0) {
> > +bs->bl.zone_sectors = ret;
> > +}
>
> > It may be good to check that we are getting a valid zone size here. So maybe
> > change the check to something like this?
>
> if (ret <= 0) {
> *** print some error message mentioning the invalid zone size ***
> bs->bl.zoned = BLK_Z_NONE;
> return;
> }
> bs->bl.zone_sectors = ret;
>

Ok, thanks!
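
A self-contained sketch of that check, for reference. ZoneLimits and
apply_zone_size() are hypothetical stand-ins for the block limits and the
sysfs handling in raw_refresh_limits(), not the actual code, and the error
message wording is only a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the block limits touched by the patch. */
typedef struct ZoneLimits {
    bool zoned;         /* false plays the role of BLK_Z_NONE */
    long zone_sectors;  /* zone size in 512-byte sectors */
} ZoneLimits;

/*
 * Apply a zone size read from sysfs (chunk_sectors). A value <= 0 is
 * treated as invalid: report it and fall back to "not zoned".
 * Returns 0 on success, -1 if the value was rejected.
 */
static int apply_zone_size(ZoneLimits *bl, long chunk_sectors)
{
    if (chunk_sectors <= 0) {
        fprintf(stderr, "invalid zone size: %ld sectors\n", chunk_sectors);
        bl->zoned = false;
        return -1;
    }
    bl->zone_sectors = chunk_sectors;
    return 0;
}
```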

> > +
> > +ret = get_sysfs_long_val(&st, "zone_append_max_bytes");
> > +if (ret > 0) {
> > +bs->bl.max_append_sectors = ret / 512;
> > +}
> > +
> > +ret = get_sysfs_long_val(&st, "max_open_zones");
> > +if (ret >= 0) {
> > +bs->bl.max_open_zones = ret;
> > +}
> > +
> > +ret = get_sysfs_long_val(&st, "max_active_zones");
> > +if (ret >= 0) {
> > +bs->bl.max_active_zones = ret;
> > +}
> > +}
> >  }
> >
> >  static int check_for_dasd(int fd)
> > @@ -1850,6 +1883,145 @@ static off_t copy_file_range(int in_fd, off_t 
> > *in_off, int out_fd,
> >  }
> >  #endif
> >
> > +/*
> > + * parse_zone - Fill a zone descriptor
> > + */
> > +#if defined(CONFIG_BLKZONED)
> > +static inline void parse_zone(struct BlockZoneDescriptor *zone,
> > +  const struct blk_zone *blkz) {
> > +zone->start = blkz->st

Re: [PATCH v9 3/7] block: add block layer APIs resembling Linux ZonedBlockDevice ioctls

2022-09-10 Thread Damien Le Moal

On 2022/09/11 15:33, Sam Li wrote:
> On Sun, Sep 11, 2022 at 13:31, Damien Le Moal wrote:
[...]
>>> +/*
>>> + * zone management operations - Execute an operation on a zone
>>> + */
>>> +static int coroutine_fn raw_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp 
>>> op,
>>> +int64_t offset, int64_t len) {
>>> +#if defined(CONFIG_BLKZONED)
>>> +BDRVRawState *s = bs->opaque;
>>> +RawPosixAIOData acb;
>>> +int64_t zone_sector, zone_sector_mask;
>>> +const char *zone_op_name;
>>> +unsigned long zone_op;
>>> +bool is_all = false;
>>> +
>>> +zone_sector = bs->bl.zone_sectors;
>>> +zone_sector_mask = zone_sector - 1;
>>> +if (offset & zone_sector_mask) {
>>> +error_report("sector offset %" PRId64 " is not aligned to zone 
>>> size "
>>> + "%" PRId64 "", offset, zone_sector);
>>> +return -EINVAL;
>>> +}
>>> +
>>> +if (len & zone_sector_mask) {
>>
>> Linux allows SMR drives to have a smaller last zone. So this needs to be
>> accounted for here. Otherwise, a zone operation that includes the last
>> smaller zone would always fail. Something like this would work:
>>
>> if (((offset + len) < capacity &&
>> len & zone_sector_mask) ||
>> offset + len > capacity) {
>>
> 
> I see. I think the offset can be removed, like:
> if ((len < capacity && len & zone_sector_mask) || len > capacity) {
> Then if we use the previous zone's len for the last smaller zone, it
> will be greater than its capacity.

Nope, you cannot remove the offset, since the zone operation may be for that last
zone only, that is, offset == last zone start and len == last zone smaller size.
In that case, len is always smaller than capacity and not aligned to the zone
size, so a check on len alone would reject a valid request.
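
A minimal sketch of the check with the offset kept in. zone_mgmt_len_ok() is a
hypothetical helper, not code from the patch. It assumes that all values are in
bytes, that the zone size is a power of two, and that offset has already been
checked for zone alignment. Overflow of offset + len is not handled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the range [offset, offset + len)
 * is acceptable for a zone management operation on a device whose last
 * zone may be smaller than zone_size.
 */
static bool zone_mgmt_len_ok(int64_t offset, int64_t len,
                             int64_t zone_size, int64_t capacity)
{
    int64_t zone_mask = zone_size - 1;

    assert(zone_size > 0 && (zone_size & zone_mask) == 0);

    if (offset + len > capacity) {
        return false;   /* range runs past the end of the device */
    }
    if (offset + len < capacity && (len & zone_mask)) {
        return false;   /* ends before the device end: len must be aligned */
    }
    return true;        /* aligned, or ends exactly at the device capacity */
}
```

For a device with three 1024-byte zones and a 512-byte last zone (capacity
3584), offset 3072 with len 512 is accepted because the range ends exactly at
the capacity. A check on len alone would see 512 < 3584 with len unaligned and
reject it.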

> 
> I will also include "opening the last zone" as a test case later.

Note that you can create such smaller last zone on the host with null_blk by
specifying a device capacity that is *not* a multiple of the zone size.

> 
>>> +error_report("number of sectors %" PRId64 " is not aligned to zone 
>>> size"
>>> +  " %" PRId64 "", len, zone_sector);
>>> +return -EINVAL;
>>> +}
>>> +
>>> +switch (op) {
>>> +case BLK_ZO_OPEN:
>>> +zone_op_name = "BLKOPENZONE";
>>> +zone_op = BLKOPENZONE;
>>> +break;
>>> +case BLK_ZO_CLOSE:
>>> +zone_op_name = "BLKCLOSEZONE";
>>> +zone_op = BLKCLOSEZONE;
>>> +break;
>>> +case BLK_ZO_FINISH:
>>> +zone_op_name = "BLKFINISHZONE";
>>> +zone_op = BLKFINISHZONE;
>>> +break;
>>> +case BLK_ZO_RESET:
>>> +zone_op_name = "BLKRESETZONE";
>>> +zone_op = BLKRESETZONE;
>>> +break;
>>> +default:
>>> +g_assert_not_reached();
>>> +}
>>> +
>>> +acb = (RawPosixAIOData) {
>>> +.bs = bs,
>>> +.aio_fildes = s->fd,
>>> +.aio_type   = QEMU_AIO_ZONE_MGMT,
>>> +.aio_offset = offset,
>>> +.aio_nbytes = len,
>>> +.zone_mgmt  = {
>>> +.zone_op = zone_op,
>>> +.zone_op_name = zone_op_name,
>>> +.all = is_all,
>>> +},
>>> +};
>>> +
>>> +return raw_thread_pool_submit(bs, handle_aiocb_zone_mgmt, &acb);
>>> +#else
>>> +return -ENOTSUP;
>>> +#endif
>>> +}

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH] block: introduce zone append write for zoned devices

2022-09-10 Thread Damien Le Moal

On 2022/09/11 15:06, Damien Le Moal wrote:
> On 2022/09/10 15:38, Sam Li wrote:
>> A zone append command is a write operation that specifies the first
>> logical block of a zone as the write position. When writing to a zoned
>> block device using zone append, the byte offset of the write is pointing
>> to the write pointer of that zone. Upon completion the device will
>> respond with the position the data has been placed in the zone.
> 
> s/placed/written
> 
> You need to explain more about what this patch does:
> 
> Since Linux does not provide a user API to issue zone append operations to
> zoned devices from user space, the file-posix driver is modified to add zone
> append emulation using regular write operations. To do this, the file-posix
> driver tracks the wp location of all zones of the device Blah.

Thinking more about this, I think you should split this patch in 2:
1) first patch adding the tracking of the zones wp.
2) second patch adding zone append emulation

That will make the review far easier.
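
As a rough illustration of how the two pieces relate, here is a sketch of zone
append emulation built on per-zone wp tracking. Everything in it is
hypothetical (ZoneWpState, zone_wp_init(), zone_append_emulated(), the fixed
geometry), the regular write itself is left out, and the locking needed to
serialize wp updates is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define NR_ZONES  4     /* illustrative device geometry */
#define ZONE_SIZE 1024  /* bytes per zone, illustrative */

/* Hypothetical per-device state: one write pointer (byte offset) per zone. */
typedef struct ZoneWpState {
    int64_t wp[NR_ZONES];
} ZoneWpState;

static void zone_wp_init(ZoneWpState *s)
{
    for (int i = 0; i < NR_ZONES; i++) {
        s->wp[i] = (int64_t)i * ZONE_SIZE;  /* empty zone: wp == zone start */
    }
}

/*
 * Emulated zone append: the caller passes the zone start in *offset.
 * On success, *offset is set to the position the data is written at and
 * the zone wp advances by len. Returns 0 on success, -1 if the request
 * is invalid or does not fit in the zone.
 */
static int zone_append_emulated(ZoneWpState *s, int64_t *offset, int64_t len)
{
    int64_t zone_start = *offset;
    int64_t idx = zone_start / ZONE_SIZE;

    if (zone_start < 0 || zone_start % ZONE_SIZE ||
        idx >= NR_ZONES || len <= 0) {
        return -1;
    }
    if (s->wp[idx] + len > zone_start + ZONE_SIZE) {
        return -1;  /* not enough room left in the zone */
    }
    *offset = s->wp[idx];  /* a regular write would be issued at this offset */
    s->wp[idx] += len;
    return 0;
}
```

The first half (the state and its initialization) corresponds to patch 1, the
append function to patch 2.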

> 
>>
>> Signed-off-by: Sam Li 
>> ---
>>  block/block-backend.c  |  65 +++
>>  block/file-posix.c | 169 -
>>  block/io.c |  21 
>>  block/raw-format.c |   7 ++
>>  include/block/block-common.h   |   2 +
>>  include/block/block-io.h   |   3 +
>>  include/block/block_int-common.h   |   9 ++
>>  include/block/raw-aio.h|   4 +-
>>  include/sysemu/block-backend-io.h  |   9 ++
>>  qemu-io-cmds.c |  62 +++
>>  tests/qemu-iotests/tests/zoned.out |   7 ++
>>  tests/qemu-iotests/tests/zoned.sh  |   9 ++
>>  12 files changed, 360 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/block-backend.c b/block/block-backend.c
>> index ebe8d7bdf3..b77a1cb24b 100644
>> --- a/block/block-backend.c
>> +++ b/block/block-backend.c
>> @@ -1439,6 +1439,9 @@ typedef struct BlkRwCo {
>>  struct {
>>  BlockZoneOp op;
>>  } zone_mgmt;
>> +struct {
>> +int64_t *append_sector;
>> +} zone_append;
>>  };
>>  } BlkRwCo;
>>  
>> @@ -1869,6 +1872,47 @@ BlockAIOCB *blk_aio_zone_mgmt(BlockBackend *blk, 
>> BlockZoneOp op,
>>  return &acb->common;
>>  }
>>  
>> +static void blk_aio_zone_append_entry(void *opaque) {
>> +BlkAioEmAIOCB *acb = opaque;
>> +BlkRwCo *rwco = &acb->rwco;
>> +
>> +rwco->ret = blk_co_zone_append(rwco->blk, 
>> rwco->zone_append.append_sector,
>> +   rwco->iobuf, rwco->flags);
>> +blk_aio_complete(acb);
>> +}
>> +
>> +BlockAIOCB *blk_aio_zone_append(BlockBackend *blk, int64_t *offset,
>> +QEMUIOVector *qiov, BdrvRequestFlags flags,
>> +BlockCompletionFunc *cb, void *opaque) {
>> +BlkAioEmAIOCB *acb;
>> +Coroutine *co;
>> +IO_CODE();
>> +
>> +blk_inc_in_flight(blk);
>> +acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
>> +acb->rwco = (BlkRwCo) {
>> +.blk= blk,
>> +.ret= NOT_DONE,
>> +.flags  = flags,
>> +.iobuf  = qiov,
>> +.zone_append = {
>> +.append_sector = offset,
>> +},
>> +};
>> +acb->has_returned = false;
>> +
>> +co = qemu_coroutine_create(blk_aio_zone_append_entry, acb);
>> +bdrv_coroutine_enter(blk_bs(blk), co);
>> +
>> +acb->has_returned = true;
>> +if (acb->rwco.ret != NOT_DONE) {
>> +replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
>> + blk_aio_complete_bh, acb);
>> +}
>> +
>> +return &acb->common;
>> +}
>> +
>>  /*
>>   * Send a zone_report command.
>>   * offset is a byte offset from the start of the device. No alignment
>> @@ -1920,6 +1964,27 @@ int coroutine_fn blk_co_zone_mgmt(BlockBackend *blk, 
>> BlockZoneOp op,
>>  return ret;
>>  }
>>  
>> +/*
>> + * Send a zone_append command.
>> + */
>> +int coroutine_fn blk_co_zone_append(BlockBackend *blk, int64_t *offset,
>> +QEMUIOVector *qiov, BdrvRequestFlags flags)
>> +{
>> +int ret;
>> +IO_CODE();
>> +
>> +blk_inc_in_flight(blk);
>> +blk_wait_while_drained(blk);
>> +if (!blk_is_available(blk)) {
>> +blk_dec_in_flight(blk);
>> +return -ENOMEDIUM;
>> +}
>> +
>> +ret = bdrv_co_zone_append(blk_bs(blk), offset, qiov, flags);
>> +blk_dec_in_flight(blk);
>> +return ret;
>> +}
>> +
>>  void blk_drain(BlockBackend *blk)
>>  {
>>  BlockDriverState *bs = blk_bs(blk);
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 354de22860..65500e43f4 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -173,6 +173,7 @@ typedef struct BDRVRawState {
>>  } stats;
>>  
>>  PRManager *pr_mgr;
>> +CoRwlock zones_lock;
>>  } BDRVRawState;
>>  
>>  typedef struct BDRVRawReopenState {
>> @@ -206,6 +207,8 @@ type

Re: [PATCH v9 5/7] config: add check to block layer

2022-09-10 Thread Sam Li

On Sun, Sep 11, 2022 at 13:34, Damien Le Moal wrote:
>
> On 2022/09/10 14:27, Sam Li wrote:
> > Putting zoned/non-zoned BlockDrivers on top of each other is not
> > allowed.
> >
> > Signed-off-by: Sam Li 
> > Reviewed-by: Stefan Hajnoczi 
> > ---
> >  block.c  | 14 ++
> >  block/file-posix.c   | 14 ++
> >  block/raw-format.c   |  1 +
> >  include/block/block_int-common.h |  5 +
> >  4 files changed, 34 insertions(+)
> >
> > diff --git a/block.c b/block.c
> > index bc85f46eed..dad2ed3959 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -7947,6 +7947,20 @@ void bdrv_add_child(BlockDriverState *parent_bs, 
> > BlockDriverState *child_bs,
> >  return;
> >  }
> >
> > +/*
> > + * Non-zoned block drivers do not follow zoned storage constraints
> > + * (i.e. sequential writes to zones). Refuse mixing zoned and non-zoned
> > + * drivers in a graph.
> > + */
> > +if (!parent_bs->drv->supports_zoned_children &&
> > +child_bs->bl.zoned == BLK_Z_HM) {
>
> Shouldn't this be "child_bs->bl.zoned != BLK_Z_NONE" ?

The host-aware model allows both sequential writes (the zoned storage
constraint) and random writes. Is mixing HA and non-zoned drivers allowed?
What's the difference?
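
To make the question concrete, here is a small sketch of the two conditions
side by side. The enum is a stand-in that mirrors the names used in this
series (BLK_Z_HA is assumed to be the host-aware value), and both function
names are made up. The two checks only differ for a host-aware child; whether
that case should be rejected is what I am asking.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the zone model enum; not the real definition. */
typedef enum {
    BLK_Z_NONE,  /* regular block device */
    BLK_Z_HM,    /* host-managed */
    BLK_Z_HA,    /* host-aware (assumed name) */
} ZoneModel;

/* Condition as written in the patch: only host-managed children are refused. */
static bool rejects_child_hm_only(bool parent_supports_zoned, ZoneModel child)
{
    return !parent_supports_zoned && child == BLK_Z_HM;
}

/* Condition as suggested in review: any zoned child is refused. */
static bool rejects_child_any_zoned(bool parent_supports_zoned, ZoneModel child)
{
    return !parent_supports_zoned && child != BLK_Z_NONE;
}
```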

>
> > +error_setg(errp, "Cannot add a %s child to a %s parent",
> > +   child_bs->bl.zoned == BLK_Z_HM ? "zoned" : "non-zoned",
> > +   parent_bs->drv->supports_zoned_children ?
> > +   "support zoned children" : "not support zoned 
> > children");
> > +return;
> > +}
> > +
> >  if (!QLIST_EMPTY(&child_bs->parents)) {
> >  error_setg(errp, "The node %s already has a parent",
> > child_bs->node_name);
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 4edfa25d04..354de22860 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -779,6 +779,20 @@ static int raw_open_common(BlockDriverState *bs, QDict 
> > *options,
> >  goto fail;
> >  }
> >  }
> > +#ifdef CONFIG_BLKZONED
> > +/*
> > + * The kernel page chache does not reliably work for writes to SWR 
> > zones
> > + * of zoned block device because it can not guarantee the order of 
> > writes.
> > + */
> > +if (strcmp(bs->drv->format_name, "zoned_host_device") == 0) {
> > +if (!(s->open_flags & O_DIRECT)) {
> > +error_setg(errp, "driver=zoned_host_device was specified, but 
> > it "
> > + "requires cache.direct=on, which was not 
> > specified.");
> > +ret = -EINVAL;
>
> This line is not needed. Simply "return -EINVAL;".
>
> > +return ret; /* No host kernel page cache */
> > +}
> > +}
> > +#endif
> >
> >  if (S_ISBLK(st.st_mode)) {
> >  #ifdef BLKDISCARDZEROES
> > diff --git a/block/raw-format.c b/block/raw-format.c
> > index 6b20bd22ef..9441536819 100644
> > --- a/block/raw-format.c
> > +++ b/block/raw-format.c
> > @@ -614,6 +614,7 @@ static void raw_child_perm(BlockDriverState *bs, 
> > BdrvChild *c,
> >  BlockDriver bdrv_raw = {
> >  .format_name  = "raw",
> >  .instance_size= sizeof(BDRVRawState),
> > +.supports_zoned_children = true,
> >  .bdrv_probe   = &raw_probe,
> >  .bdrv_reopen_prepare  = &raw_reopen_prepare,
> >  .bdrv_reopen_commit   = &raw_reopen_commit,
> > diff --git a/include/block/block_int-common.h 
> > b/include/block/block_int-common.h
> > index 078ddd7e67..043aa161a0 100644
> > --- a/include/block/block_int-common.h
> > +++ b/include/block/block_int-common.h
> > @@ -127,6 +127,11 @@ struct BlockDriver {
> >   */
> >  bool is_format;
> >
> > +/*
> > + * Set to true if the BlockDriver supports zoned children.
> > + */
> > +bool supports_zoned_children;
> > +
> >  /*
> >   * Drivers not implementing bdrv_parse_filename nor bdrv_open should 
> > have
> >   * this field set to true, except ones that are defined only by their
>
> --
> Damien Le Moal
> Western Digital Research
>
