date:20211005

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Michael S. Tsirkin

On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> Raise the maximum possible virtio transfer size to 128M
> (more precisely: 32k * PAGE_SIZE). See previous commit for a
> more detailed explanation for the reasons of this change.
> 
> For not breaking any virtio user, all virtio users transition
> to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> of 1k with this commit.
> 
> On the long-term, each virtio user should subsequently either
> switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> after checking that they support the new value of 32k, or
> otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> macro by an appropriate value supported by them.
> 
> Signed-off-by: Christian Schoenebeck 


I don't think we need this. Legacy isn't descriptive either.  Just leave
VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
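
To illustrate the suggestion, here is a minimal sketch that keeps
VIRTQUEUE_MAX_SIZE at its current value and adds a separately named
constant for the 32k limit. The name VIRTQUEUE_SPEC_MAX_SIZE, the helper
function and the fixed 4k page size are assumptions made for this
example only; they are not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Existing limit, left untouched. */
#define VIRTQUEUE_MAX_SIZE 1024

/* Hypothetical name for the maximum queue size allowed by the virtio spec. */
#define VIRTQUEUE_SPEC_MAX_SIZE 32768

/* Assumed page size for this example; 4k is typical but not universal. */
#define EXAMPLE_PAGE_SIZE 4096

/* Upper bound on one transfer if every descriptor covers one page. */
static inline uint64_t example_max_transfer_bytes(uint32_t queue_size)
{
    return (uint64_t)queue_size * EXAMPLE_PAGE_SIZE;
}
```

With these values the existing limit works out to 4M per transfer and the
spec limit to 128M, matching the numbers in the commit message.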

> ---
>  hw/9pfs/virtio-9p-device.c |  2 +-
>  hw/block/vhost-user-blk.c  |  6 +++---
>  hw/block/virtio-blk.c  |  6 +++---
>  hw/char/virtio-serial-bus.c|  2 +-
>  hw/input/virtio-input.c|  2 +-
>  hw/net/virtio-net.c| 12 ++--
>  hw/scsi/virtio-scsi.c  |  2 +-
>  hw/virtio/vhost-user-fs.c  |  6 +++---
>  hw/virtio/vhost-user-i2c.c |  2 +-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c |  2 +-
>  hw/virtio/virtio-crypto.c  |  2 +-
>  hw/virtio/virtio-iommu.c   |  2 +-
>  hw/virtio/virtio-mem.c |  2 +-
>  hw/virtio/virtio-mmio.c|  4 ++--
>  hw/virtio/virtio-pmem.c|  2 +-
>  hw/virtio/virtio-rng.c |  3 ++-
>  include/hw/virtio/virtio.h | 20 +++-
>  18 files changed, 49 insertions(+), 30 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index cd5d95dd51..9013e7df6e 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> Error **errp)
>  
>  v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
>  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> -VIRTQUEUE_MAX_SIZE);
> +VIRTQUEUE_LEGACY_MAX_SIZE);
>  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 336f56705c..e5e45262ab 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  error_setg(errp, "queue size must be non-zero");
>  return;
>  }
> -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>  error_setg(errp, "queue size must not exceed %d",
> -   VIRTQUEUE_MAX_SIZE);
> +   VIRTQUEUE_LEGACY_MAX_SIZE);
>  return;
>  }
>  
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  }
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> +sizeof(struct virtio_blk_config), VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>  s->virtqs = g_new(VirtQueue *, s->num_queues);
>  for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index 9c0f46815c..5883e3e7db 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  return;
>  }
>  if (!is_power_of_2(conf->queue_size) ||
> -conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> +conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
>  error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> "must be a power of 2 (max %d)",
> -   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> +   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
>  return;
>  }
>  
> @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  virtio_blk_set_config_size(s, s->host_features);
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> -VIRTQUEUE_MAX_SIZE);
> +VIRTQUEUE_LEGACY_MAX_SIZE);
>  
>  s->blk = conf->conf.blk;
>  s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index 9ad915..2d4285ab53 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState 
> *dev, Error **errp)
>  config_size = offsetof(struct virtio_console_config, emerg_wr);
>  }
>  virtio_init(vdev, "virtio-serial"

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Halil Pasic

On Mon, 4 Oct 2021 09:11:04 -0400
"Michael S. Tsirkin"  wrote:

> > >> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
> > >> {
> > >> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
> > >> return virtio_is_big_endian(vdev);
> > >> #elif defined(TARGET_WORDS_BIGENDIAN)
> > >> if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> > >> /* Devices conforming to VIRTIO 1.0 or later are always LE. */
> > >> return false;
> > >> }
> > >> return true;
> > >> #else
> > >> return false;
> > >> #endif
> > >> }
> > >>   
> > >
> > > ok so that's a QEMU bug. Any virtio 1.0 and up
> > > compatible device must use LE.
> > > It can also present a legacy config space where the
> > > endian depends on the guest.  
> > 
> > So, how is the virtio core supposed to determine this? A
> > transport-specific callback?  
> 
> I'd say a field in VirtIODevice is easiest.

Wouldn't a call from transport code into virtio core
be more handy? What I have in mind is stuff like vhost-user and vdpa. My
understanding is that for vhost setups where the config is outside QEMU,
we probably need a new command that tells the vhost backend what
endianness to use for config. I don't think we can use
VHOST_USER_SET_VRING_ENDIAN because that one is on a virtqueue basis
according to the doc. So for vhost-user and similar we would fire that
command and probably also set the field, while for devices for which
the control plane is handled by QEMU we would just set the field.

Does that sound about right?
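
A rough sketch of what such a call could look like, to make the question
concrete. Everything here is hypothetical: the struct is a stand-in for
the relevant part of VirtIODevice, and neither the field name nor the
backend hook exists in QEMU today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ExampleVirtIODevice ExampleVirtIODevice;

struct ExampleVirtIODevice {
    /* The proposed field: endianness of the config space. */
    bool config_big_endian;
    /* Hypothetical hook, set only when the config lives outside QEMU. */
    int (*backend_set_config_endian)(ExampleVirtIODevice *vdev,
                                     bool big_endian);
    /* For the example only: what the external backend was last told. */
    bool backend_big_endian;
};

/* Stand-in for sending a (not yet existing) command to a vhost backend. */
static int example_backend_set_config_endian(ExampleVirtIODevice *vdev,
                                             bool big_endian)
{
    vdev->backend_big_endian = big_endian;
    return 0;
}

/* Called by transport code once it knows which config layout is in use. */
static int example_virtio_set_config_endian(ExampleVirtIODevice *vdev,
                                            bool big_endian)
{
    vdev->config_big_endian = big_endian;
    if (vdev->backend_set_config_endian != NULL) {
        return vdev->backend_set_config_endian(vdev, big_endian);
    }
    return 0;
}
```

Devices whose control plane is handled by QEMU would leave the hook NULL
and only get the field set; vhost-user and similar would install the hook.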

Re: [PATCH v2 1/2] hw/adc: Add basic Aspeed ADC model

2021-10-05 Thread Cédric Le Goater


On 10/5/21 07:31, Peter Delevoryas wrote:




On Oct 4, 2021, at 12:49 AM, Cédric Le Goater  wrote:

On 10/3/21 21:18, p...@fb.com wrote:

From: Andrew Jeffery 
This model implements enough behaviour to do basic functionality tests
such as device initialisation and read out of dummy sample values. The
sample value generation strategy is similar to the STM ADC already in
the tree.
Signed-off-by: Andrew Jeffery 
[clg : support for multiple engines (AST2600) ]
Signed-off-by: Cédric Le Goater 
[pdel : refactored engine register struct fields to regs[] array field]
[pdel : added guest-error checking for upper-8 channel regs in AST2600]
Signed-off-by: Peter Delevoryas 


Reviewed-by: Cédric Le Goater 

Thanks,

C.


Hey Cedric,

Actually, I have just submitted a v3 of this patch series to support 16-bit
reads of the channel data registers. I don’t think I tested using the driver
to read from the ADC, and that’s what Patrick found crashed with these
changes. Since it’s relatively easy to enable 16-bit reads, I figured
I would just include that.


OK.

A Tested-by: tag would be welcome !

Thanks,

C.

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Greg Kurz

On Tue, 5 Oct 2021 03:16:07 -0400
"Michael S. Tsirkin"  wrote:

> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > more detailed explanation for the reasons of this change.
> > 
> > For not breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > of 1k with this commit.
> > 
> > On the long-term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that they support the new value of 32k, or
> > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > macro by an appropriate value supported by them.
> > 
> > Signed-off-by: Christian Schoenebeck 
> 
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 

Yes I agree. Only virtio-9p is going to benefit from the new
size in the short/medium term, so it looks a bit excessive to
patch all devices. Also in the end, you end up reverting the name
change in the last patch for virtio-9p... which is an indication
that this patch does too much.

Introduce the new macro in virtio-9p and use it only there.

> > ---
> >  hw/9pfs/virtio-9p-device.c |  2 +-
> >  hw/block/vhost-user-blk.c  |  6 +++---
> >  hw/block/virtio-blk.c  |  6 +++---
> >  hw/char/virtio-serial-bus.c|  2 +-
> >  hw/input/virtio-input.c|  2 +-
> >  hw/net/virtio-net.c| 12 ++--
> >  hw/scsi/virtio-scsi.c  |  2 +-
> >  hw/virtio/vhost-user-fs.c  |  6 +++---
> >  hw/virtio/vhost-user-i2c.c |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c |  2 +-
> >  hw/virtio/virtio-crypto.c  |  2 +-
> >  hw/virtio/virtio-iommu.c   |  2 +-
> >  hw/virtio/virtio-mem.c |  2 +-
> >  hw/virtio/virtio-mmio.c|  4 ++--
> >  hw/virtio/virtio-pmem.c|  2 +-
> >  hw/virtio/virtio-rng.c |  3 ++-
> >  include/hw/virtio/virtio.h | 20 +++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> > Error **errp)
> >  
> >  v->config_size = sizeof(struct virtio_9p_config) + 
> > strlen(s->fsconf.tag);
> >  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > -VIRTQUEUE_MAX_SIZE);
> > +VIRTQUEUE_LEGACY_MAX_SIZE);
> >  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  }
> >  
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  error_setg(errp, "queue size must be non-zero");
> >  return;
> >  }
> > -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >  error_setg(errp, "queue size must not exceed %d",
> > -   VIRTQUEUE_MAX_SIZE);
> > +   VIRTQUEUE_LEGACY_MAX_SIZE);
> >  return;
> >  }
> >  
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  }
> >  
> >  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +sizeof(struct virtio_blk_config), 
> > VIRTQUEUE_LEGACY_MAX_SIZE);
> >  
> >  s->virtqs = g_new(VirtQueue *, s->num_queues);
> >  for (i = 0; i < s->num_queues; i++) {
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  return;
> >  }
> >  if (!is_power_of_2(conf->queue_size) ||
> > -conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> >  error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> > "must be a power of 2 (max %d)",
> > -   conf->queue_size, VIRTQUEUE_MAX_SIZE);
> > +   conf->queue_size, VIRTQUEUE_LEGACY_MAX_SIZE);
> >  return;
> >  }
> >  
> > @@ -1214,7 +1214,7 @@ static void virtio_blk_device_realize(DeviceState 
> > *dev, Error **errp)
> >  virtio_blk_set_config_size(s, s->host_feature

Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Greg Kurz

On Mon, 4 Oct 2021 21:38:04 +0200
Christian Schoenebeck  wrote:

> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.
> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> maximum queue size possible. Which is actually the maximum
> queue size allowed by the virtio protocol. The appropriate
> value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
> Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> more or less arbitrary value of 1024 in the past, which
> limits the maximum transfer size with virtio to 4M
> (more precise: 1024 * PAGE_SIZE, with the latter typically
> being 4k).
> 
> (2) Additionally the current value of 1024 poses a hidden limit,
> invisible to guest, which causes a system hang with the
> following QEMU error if guest tries to exceed it:
> 
> virtio: too many write descriptors in indirect table
> 
> (3) Unfortunately not all virtio users in QEMU would currently
> work correctly with the new value of 32768.
> 
> So let's turn this hard coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value with virtio_init()
> call.
> 
> Signed-off-by: Christian Schoenebeck 
> ---

Reviewed-by: Greg Kurz 

>  hw/9pfs/virtio-9p-device.c |  3 ++-
>  hw/block/vhost-user-blk.c  |  2 +-
>  hw/block/virtio-blk.c  |  3 ++-
>  hw/char/virtio-serial-bus.c|  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c|  2 +-
>  hw/net/virtio-net.c| 15 ---
>  hw/scsi/virtio-scsi.c  |  2 +-
>  hw/virtio/vhost-user-fs.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c |  4 ++--
>  hw/virtio/virtio-crypto.c  |  3 ++-
>  hw/virtio/virtio-iommu.c   |  2 +-
>  hw/virtio/virtio-mem.c |  2 +-
>  hw/virtio/virtio-pmem.c|  2 +-
>  hw/virtio/virtio-rng.c |  2 +-
>  hw/virtio/virtio.c | 35 +++---
>  include/hw/virtio/virtio.h |  5 -
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> Error **errp)
>  }
>  
>  v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  }
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -sizeof(struct virtio_blk_config));
> +sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>  s->virtqs = g_new(VirtQueue *, s->num_queues);
>  for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  
>  virtio_blk_set_config_size(s, s->host_features);
>  
> -virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  
>  s->blk = conf->conf.blk;
>  s->rq = NULL;
> diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
> index f01ec2137c..9ad915 100644
> --- a/hw/char/virtio-serial-bus.c
> +++ b/hw/char/virtio-serial-bus.c
> @@ -1045,7 +1045,7 @@ static void virtio_serial_device_realize(DeviceState 
> *dev, Error **errp)
>  config_size = offsetof(struct virtio_console_config, emerg_wr);
>  }
>  virtio_init(vdev, "virtio-serial", VIRTIO_ID_CONSOLE,
> -config_size);
> +config_size, VIRTQUEUE_MAX_SIZE);
>  
>  /* Spawn a new virtio-serial bus on which the ports will ride as devices 
> */
>  qbus_init(&vser->bus, sizeof(vser->bus), TYPE_VIRTIO_SERIAL_BUS,
> diff --git a/hw/display/virtio-gpu-base.c b/hw/display/virtio-gpu-base.c
> index c8da4806e0..20b06a7adf 100644
> --- a/hw/display/virtio-gpu-base.c
> +++ b/hw/display/virtio-gpu-base.c
> @@ -171,7 +171,7 @@ virtio_gpu_ba

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread David Hildenbrand


On 04.10.21 21:38, Christian Schoenebeck wrote:

At the moment the maximum transfer size with virtio is limited to 4M
(1024 * PAGE_SIZE). This series raises this limit to its maximum
theoretical possible transfer size of 128M (32k pages) according to the
virtio specs:

https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006



I'm missing the "why do we care". Can you comment on that?


--
Thanks,

David / dhildenb

Re: Strange qemu6 regression causing disabled usb controller.

2021-10-05 Thread Remy Noel


On Thu, Sep 30, 2021 at 04:05:52PM +0100, Daniel P. Berrangé wrote:

Co-incidentally we've just had another bug report filed today that
suggests 7bed89958bfbf40df9ca681cefbdca63abdde39d as a buggy commit
causing deadlock in QEMU

 https://gitlab.com/qemu-project/qemu/-/issues/650


Is opening a GitLab ticket the preferred way to report issues now? Should I
do that?


Thanks.

Remy.

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 09:25:39AM +0200, Halil Pasic wrote:
> On Mon, 4 Oct 2021 09:11:04 -0400
> "Michael S. Tsirkin"  wrote:
> 
> > > >> static inline bool virtio_access_is_big_endian(VirtIODevice *vdev)
> > > >> {
> > > >> #if defined(LEGACY_VIRTIO_IS_BIENDIAN)
> > > >> return virtio_is_big_endian(vdev);
> > > >> #elif defined(TARGET_WORDS_BIGENDIAN)
> > > >> if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> > > >> /* Devices conforming to VIRTIO 1.0 or later are always LE. */
> > > >> return false;
> > > >> }
> > > >> return true;
> > > >> #else
> > > >> return false;
> > > >> #endif
> > > >> }
> > > >>   
> > > >
> > > > ok so that's a QEMU bug. Any virtio 1.0 and up
> > > > compatible device must use LE.
> > > > It can also present a legacy config space where the
> > > > endian depends on the guest.  
> > > 
> > > So, how is the virtio core supposed to determine this? A
> > > transport-specific callback?  
> > 
> > I'd say a field in VirtIODevice is easiest.
> 
> Wouldn't a call from transport code into virtio core
> be more handy? What I have in mind is stuff like vhost-user and vdpa. My
> understanding is that for vhost setups where the config is outside QEMU,
> we probably need a new command that tells the vhost backend what
> endianness to use for config. I don't think we can use
> VHOST_USER_SET_VRING_ENDIAN because that one is on a virtqueue basis
> according to the doc. So for vhost-user and similar we would fire that
> command and probably also set the field, while for devices for which
> the control plane is handled by QEMU we would just set the field.
> 
> Does that sound about right?

I'm fine either way, but when would you invoke this?
With my idea backends can check the field when get_config
is invoked.

As for using this in VHOST, can we maybe re-use SET_FEATURES?

Kind of hacky but nice in that it will actually make existing backends
work...
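
Roughly what I mean, as a sketch only: the backend remembers what it got
through SET_FEATURES and derives the config byte order from that when
config is read. The struct and function names are made up for this
example, and how a backend learns the legacy guest's endianness is
simply assumed here. VIRTIO_F_VERSION_1 is feature bit 32 per the spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32  /* feature bit number from the virtio spec */

/* Hypothetical backend state. */
typedef struct ExampleBackend {
    uint64_t features;       /* last value received via SET_FEATURES */
    bool legacy_big_endian;  /* assumed known for legacy (pre-1.0) guests */
} ExampleBackend;

static void example_set_features(ExampleBackend *b, uint64_t features)
{
    b->features = features;
}

/* Devices conforming to virtio 1.0 or later are always little endian. */
static bool example_config_is_big_endian(const ExampleBackend *b)
{
    if (b->features & (1ULL << VIRTIO_F_VERSION_1)) {
        return false;
    }
    return b->legacy_big_endian;
}

/* Store a 16-bit config field in the byte order the guest expects. */
static void example_put_config_u16(const ExampleBackend *b,
                                   uint8_t out[2], uint16_t v)
{
    if (example_config_is_big_endian(b)) {
        out[0] = (uint8_t)(v >> 8);
        out[1] = (uint8_t)(v & 0xff);
    } else {
        out[0] = (uint8_t)(v & 0xff);
        out[1] = (uint8_t)(v >> 8);
    }
}
```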

-- 
MST

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Ilya Dryomov

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:
>
> the qemu rbd driver currently lacks support for bdrv_co_block_status.
> This results mainly in incorrect progress during block operations (e.g.
> qemu-img convert with an rbd image as source).
>
> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
> allocated and unallocated (all zero areas).
>
> To avoid querying the ceph OSDs for the answer this is only done if
> the image has the fast-diff feature which depends on the object-map and
> exclusive-lock features. In this case it is guaranteed that the information
> is present in memory in the librbd client and thus very fast.
>
> If fast-diff is not available all areas are reported to be allocated
> which is the current behaviour if bdrv_co_block_status is not implemented.
>
> Signed-off-by: Peter Lieven 
> ---
> V2->V3:
> - check rbd_flags every time (they can change during runtime) [Ilya]
> - also check for fast-diff invalid flag [Ilya]
> - *map and *file cant be NULL [Ilya]
> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
>   unallocated area [Ilya]
> - typo: catched -> caught [Ilya]
> - changed wording about fast-diff, object-map and exclusive lock in
>   commit msg [Ilya]
>
> V1->V2:
> - add commit comment [Stefano]
> - use failed_post_open [Stefano]
> - remove redundant assert [Stefano]
> - add macro+comment for the magic -9000 value [Stefano]
> - always set *file if its non NULL [Stefano]
>
>  block/rbd.c | 126 
>  1 file changed, 126 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index dcf82b15b8..3cb24f9981 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
> *qemu_rbd_get_specific_info(BlockDriverState *bs,
>  return spec_info;
>  }
>
> +typedef struct rbd_diff_req {
> +uint64_t offs;
> +uint64_t bytes;
> +int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.

> +} rbd_diff_req;
> +
> +/*
> + * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
> + * value in the callback routine. Choose a value that does not conflict with
> + * an existing exitcode and return it if we want to prematurely stop the
> + * execution because we detected a change in the allocation status.
> + */
> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
> +
> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
> +   int exists, void *opaque)
> +{
> +struct rbd_diff_req *req = opaque;
> +
> +assert(req->offs + req->bytes <= offs);
> +
> +if (req->exists && offs > req->offs + req->bytes) {
> +/*
> + * we started in an allocated area and jumped over an unallocated 
> area,
> + * req->bytes contains the length of the allocated area before the
> + * unallocated area. stop further processing.
> + */
> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> +}
> +if (req->exists && !exists) {
> +/*
> + * we started in an allocated area and reached a hole. req->bytes
> + * contains the length of the allocated area before the hole.
> + * stop further processing.
> + */
> +return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?

> +}
> +if (!req->exists && exists && offs > req->offs) {
> +/*
> + * we started in an unallocated area and hit the first allocated
> + * block. req->bytes must be set to the length of the unallocated 
> area
> + * before the allocated area. stop further processing.
> + */
> +req->bytes = offs - req->offs;
> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> +}
> +
> +/*
> + * assert that we caught all cases above and allocation state has not
> + * changed during callbacks.
> + */
> +assert(exists == req->exists || !req->bytes);
> +req->exists = exists;
> +
> +/*
> + * assert that we either return an unallocated block or have got 
> callbacks
> + * for all allocated blocks present.
> + */
> +assert(!req->exists || offs == req->offs + req->bytes);
> +req->bytes = offs + len - req->offs;
> +
> +return 0;
> +}
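
To make it easier to reason about which branch is taken, here is a
self-contained sketch that drives the same callback logic from a plain
array of extents instead of librbd. The Example* names and the iterator
are stand-ins invented for this illustration; only the branch structure
follows the patch, and the struct already uses a bool as per the nit
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_EXIT_DIFF_ITERATE (-9000)

typedef struct ExampleDiffReq {
    uint64_t offs;
    uint64_t bytes;
    bool exists;
} ExampleDiffReq;

/* One extent as a diff iterator might report it (hypothetical type). */
typedef struct ExampleExtent {
    uint64_t offs;
    size_t len;
    int exists;
} ExampleExtent;

static int example_status_cb(uint64_t offs, size_t len, int exists,
                             void *opaque)
{
    ExampleDiffReq *req = opaque;

    assert(req->offs + req->bytes <= offs);

    if (req->exists && offs > req->offs + req->bytes) {
        /* allocated area followed by a gap: stop */
        return EXAMPLE_EXIT_DIFF_ITERATE;
    }
    if (req->exists && !exists) {
        /* allocated area followed by a reported hole: stop */
        return EXAMPLE_EXIT_DIFF_ITERATE;
    }
    if (!req->exists && exists && offs > req->offs) {
        /* unallocated area ends at the first allocated extent: stop */
        req->bytes = offs - req->offs;
        return EXAMPLE_EXIT_DIFF_ITERATE;
    }

    assert((exists != 0) == req->exists || !req->bytes);
    req->exists = (exists != 0);
    assert(!req->exists || offs == req->offs + req->bytes);
    req->bytes = offs + len - req->offs;
    return 0;
}

/* Stand-in iterator: reports extents at or after start until cb stops it. */
static int example_iterate(const ExampleExtent *ext, size_t n,
                           uint64_t start, ExampleDiffReq *req)
{
    req->offs = start;
    req->bytes = 0;
    req->exists = false;
    for (size_t i = 0; i < n; i++) {
        int r;

        if (ext[i].offs < start) {
            continue;
        }
        r = example_status_cb(ext[i].offs, ext[i].len, ext[i].exists, req);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}
```

Feeding it two adjacent allocated extents followed by a gap exercises the
first branch; starting inside the gap exercises the third. The second
branch needs an extent that is reported with exists == 0 directly after
an allocated one.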
> +
> +static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
> + bool want_zero, int64_t 
> offset,
> + int64_t bytes, int64_t 
> *pnum,
> + int64_t *map,
> + BlockDriverState **file)
> +{
> +BDRVRBDState *s = bs->opaque;
> +int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,
even better, drop it entirely and return one of the two bitm

[PATCH 0/3] vdpa: Check iova range on memory regions ops

2021-10-05 Thread Eugenio Pérez

At the moment vdpa will not send memory regions bigger than 1<<63.
However, the actual iova range could be far more restrictive than that.

Since we can obtain the range through a vdpa ioctl call, save it at the
beginning of the operation and check against it.
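
As a standalone illustration of the kind of check meant here (not the
code from the series), the sketch below tests whether a section fits
into a saved range. ExampleIovaRange mirrors the first/last fields of
struct vhost_vdpa_iova_range; the function name is hypothetical, and
treating "last" as an inclusive address is an assumption of this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct vhost_vdpa_iova_range. */
typedef struct ExampleIovaRange {
    uint64_t first;  /* lowest usable iova */
    uint64_t last;   /* highest usable iova, assumed inclusive */
} ExampleIovaRange;

/* Return true if [start, start + size) lies entirely inside the range. */
static bool example_section_in_iova_range(uint64_t start, uint64_t size,
                                          const ExampleIovaRange *r)
{
    uint64_t last;

    if (size == 0) {
        return false;  /* nothing to map */
    }
    if (start > UINT64_MAX - (size - 1)) {
        return false;  /* section would wrap around the address space */
    }
    last = start + (size - 1);
    return start >= r->first && last <= r->last;
}
```

Working with the last byte of the section rather than its exclusive end
avoids overflow for sections that end exactly at the top of the address
space.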

Eugenio Pérez (3):
  vdpa: Skip protected ram IOMMU mappings
  vdpa: Add vhost_vdpa_section_end
  vdpa: Check for iova range at mappings changes

 include/hw/virtio/vhost-vdpa.h |  2 +
 hw/virtio/vhost-vdpa.c | 83 +-
 hw/virtio/trace-events |  1 +
 3 files changed, 65 insertions(+), 21 deletions(-)

-- 
2.27.0

[PATCH 3/3] vdpa: Check for iova range at mappings changes

2021-10-05 Thread Eugenio Pérez

Check vdpa device range before updating memory regions so we don't add
any outside of it, and report the invalid change if any.

Signed-off-by: Eugenio Pérez 
---
 include/hw/virtio/vhost-vdpa.h |  2 +
 hw/virtio/vhost-vdpa.c | 68 ++
 hw/virtio/trace-events |  1 +
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index a8963da2d9..c288cf7ecb 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -13,6 +13,7 @@
 #define HW_VIRTIO_VHOST_VDPA_H
 
 #include "hw/virtio/virtio.h"
+#include "standard-headers/linux/vhost_types.h"
 
 typedef struct VhostVDPAHostNotifier {
 MemoryRegion mr;
@@ -24,6 +25,7 @@ typedef struct vhost_vdpa {
 uint32_t msg_type;
 bool iotlb_batch_begin_sent;
 MemoryListener listener;
+struct vhost_vdpa_iova_range iova_range;
 struct vhost_dev *dev;
 VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 } VhostVDPA;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index a1de6c7c9c..26d0258723 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -33,20 +33,34 @@ static Int128 vhost_vdpa_section_end(const 
MemoryRegionSection *section)
 return llend;
 }
 
-static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
-{
-return (!memory_region_is_ram(section->mr) &&
-!memory_region_is_iommu(section->mr)) ||
-memory_region_is_protected(section->mr) ||
-   /* vhost-vDPA doesn't allow MMIO to be mapped  */
-memory_region_is_ram_device(section->mr) ||
-   /*
-* Sizing an enabled 64-bit BAR can cause spurious mappings to
-* addresses in the upper part of the 64-bit address space.  These
-* are never accessed by the CPU and beyond the address width of
-* some IOMMU hardware.  TODO: VDPA should tell us the IOMMU width.
-*/
-   section->offset_within_address_space & (1ULL << 63);
+static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
+uint64_t iova_min,
+uint64_t iova_max)
+{
+Int128 llend;
+bool r = (!memory_region_is_ram(section->mr) &&
+  !memory_region_is_iommu(section->mr)) ||
+  memory_region_is_protected(section->mr) ||
+  /* vhost-vDPA doesn't allow MMIO to be mapped  */
+  memory_region_is_ram_device(section->mr);
+if (r) {
+return true;
+}
+
+if (section->offset_within_address_space < iova_min) {
+error_report("RAM section out of device range (min=%lu, addr=%lu)",
+ iova_min, section->offset_within_address_space);
+return true;
+}
+
+llend = vhost_vdpa_section_end(section);
+if (int128_make64(llend) > iova_max) {
+error_report("RAM section out of device range (max=%lu, end addr=%lu)",
+ iova_max, (uint64_t)int128_make64(llend));
+return true;
+}
+
+return false;
 }
 
 static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
@@ -158,7 +172,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
 void *vaddr;
 int ret;
 
-if (vhost_vdpa_listener_skipped_section(section)) {
+if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
+v->iova_range.last)) {
 return;
 }
 
@@ -216,7 +231,8 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
 Int128 llend, llsize;
 int ret;
 
-if (vhost_vdpa_listener_skipped_section(section)) {
+if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
+v->iova_range.last)) {
 return;
 }
 
@@ -284,9 +300,24 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, 
uint8_t status)
 vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
 }
 
+static int vhost_vdpa_get_iova_range(struct vhost_vdpa *v)
+{
+int ret;
+
+ret = vhost_vdpa_call(v->dev, VHOST_VDPA_GET_IOVA_RANGE, &v->iova_range);
+if (ret != 0) {
+return ret;
+}
+
+trace_vhost_vdpa_get_iova_range(v->dev, v->iova_range.first,
+v->iova_range.last);
+return ret;
+}
+
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 {
 struct vhost_vdpa *v;
+int r;
 assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
 trace_vhost_vdpa_init(dev, opaque);
 
@@ -296,6 +327,11 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
 v->listener = vhost_vdpa_memory_listener;
 v->msg_type = VHOST_IOTLB_MSG_V2;
 
+r = vhost_vdpa_get_iova_range(v);
+if (unlikely(!r)) {
+return r;
+}
+
 vhost_vdpa_add

Re: Deprecate the ppc405 boards in QEMU? (was: [PATCH v3 4/7] MAINTAINERS: Orphan obscure ppc platforms)

2021-10-05 Thread Alexey Kardashevskiy





On 05/10/2021 17:42, Thomas Huth wrote:

On 05/10/2021 08.18, Alexey Kardashevskiy wrote:



On 05/10/2021 15:44, Christophe Leroy wrote:



Le 05/10/2021 à 02:48, David Gibson a écrit :

On Fri, Oct 01, 2021 at 04:18:49PM +0200, Thomas Huth wrote:

On 01/10/2021 15.04, Christophe Leroy wrote:



Le 01/10/2021 à 14:04, Thomas Huth a écrit :

On 01/10/2021 13.12, Peter Maydell wrote:

On Fri, 1 Oct 2021 at 10:43, Thomas Huth  wrote:

Nevertheless, as long as nobody has a hint where to find that
ppc405_rom.bin, I think both boards are pretty useless in QEMU
(as far as I can see, they do not work without the bios at all,
so it's also not possible to use a Linux image with the "-kernel"
CLI option directly).


It is at least in theory possible to run bare-metal code on
either board, by passing either a pflash or a bios argument.


True. I did some more research, and seems like there was once
support for those boards in u-boot, but it got removed there a
couple of years ago already:

https://gitlab.com/qemu-project/u-boot/-/commit/98f705c9cefdf

https://gitlab.com/qemu-project/u-boot/-/commit/b147ff2f37d5b

https://gitlab.com/qemu-project/u-boot/-/commit/7514037bcdc37


But I agree that there seem to be no signs of anybody actually
successfully using these boards for anything, so we should
deprecate-and-delete them.


Yes, let's mark them as deprecated now ... if someone still uses
them and speaks up, we can still revert the deprecation again.


I really would like to be able to use them to validate Linux Kernel
changes, hence looking for that missing BIOS.

If we remove ppc405 from QEMU, we won't be able to do any regression
tests of Linux Kernel on those processors.


If you/someone managed to compile an old version of u-boot for one
of these two boards, so that we would finally have something for
regression testing, we can of course also keep the boards in QEMU...


I can see that it would be useful for some cases, but unless someone
volunteers to track down the necessary firmware and look after it, I
think we do need to deprecate it - I certainly don't have the capacity
to look into this.



I will look at it, please allow me a few weeks though.


Well, building it was not hard but now I'd like to know what board 
QEMU actually emulates, there are way too many codenames and PVRs.


Here is what I was building:
https://github.com/aik/u-boot/tree/ppc4xx-qemu

CONFIG_SYS_ARCH="powerpc"
CONFIG_SYS_CPU="ppc4xx"
CONFIG_SYS_VENDOR="esd"
CONFIG_SYS_BOARD="pmc405de"
CONFIG_SYS_CONFIG_NAME="PMC405DE"

Is this any use?


If I've got u-boot commit 98f705c9cefdfdba62c069821bbba10273a0a8
right, there used to be SYS_BOARD="405ep" config before that removal, so 
that sounds like a promising match for the ref405ep of QEMU?


Tricky. The board can be 405ep if 
TARGET_IO/TARGET_DLVISION/TARGET_DLVISION_10G selected. Neither compiles 
at 98f705c9cefdfdba62c^ due to missing CONFIG_SYS_PCI_PTM1PCI :-/




The support for "taihu" even got removed earlier, in u-boot commit 
123b6cd7a4f75536734a7bff97db6eebce614bd1 , and the commit message says 
that it did not compile anymore at the end, so you might need to check 
out an even older version for that one.



What is so special about taihu?



--
Alexey

[PATCH 1/3] vdpa: Skip protected ram IOMMU mappings

2021-10-05 Thread Eugenio Pérez

Following the logic of commit 56918a126ae ("memory: Add RAM_PROTECTED
flag to skip IOMMU mappings") with VFIO, skip memory sections
inaccessible via normal mechanisms, including DMA.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 47d7a5a23d..ea1aa71ad8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -28,6 +28,7 @@ static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
 return (!memory_region_is_ram(section->mr) &&
 !memory_region_is_iommu(section->mr)) ||
+memory_region_is_protected(section->mr) ||
/* vhost-vDPA doesn't allow MMIO to be mapped  */
 memory_region_is_ram_device(section->mr) ||
/*
-- 
2.27.0
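
For illustration only, the visible part of the predicate after this patch
can be sketched in standalone C. The struct toy_region and its boolean
fields are hypothetical stand-ins for the memory_region_is_*() queries, and
the real function has further conditions that the hunk above cuts off, so
this approximates only what the hunk shows.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the properties queried on section->mr. */
struct toy_region {
    bool is_ram;
    bool is_iommu;
    bool is_protected;   /* the property this patch starts checking */
    bool is_ram_device;
};

/*
 * Return true when the listener should skip the region: it is neither RAM
 * nor an IOMMU region, or it is protected RAM, or it is a RAM device
 * (MMIO-backed). Conditions beyond the visible hunk are not modelled.
 */
static bool toy_skipped_section(const struct toy_region *r)
{
    return (!r->is_ram && !r->is_iommu) ||
           r->is_protected ||
           r->is_ram_device;
}
```

With this sketch, a plain RAM region is kept, while the same region with
is_protected set is skipped.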

Re: [PATCH] MAINTAINERS: Split HPPA TCG vs HPPA machines/hardware

2021-10-05 Thread Helge Deller

On 10/4/21 10:38, Philippe Mathieu-Daudé wrote:
> Hardware emulated models don't belong to the TCG MAINTAINERS
> section. Move them to the 'HP-PARISC Machines' section.
>
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Helge Deller 

> ---
>  MAINTAINERS | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 50435b8d2f5..002620c6cad 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -205,10 +205,7 @@ HPPA (PA-RISC) TCG CPUs
>  M: Richard Henderson 
>  S: Maintained
>  F: target/hppa/
> -F: hw/hppa/
>  F: disas/hppa.c
> -F: hw/net/*i82596*
> -F: include/hw/net/lasi_82596.h
>
>  M68K TCG CPUs
>  M: Laurent Vivier 
> @@ -1098,6 +1095,8 @@ R: Helge Deller 
>  S: Odd Fixes
>  F: configs/devices/hppa-softmmu/default.mak
>  F: hw/hppa/
> +F: hw/net/*i82596*
> +F: include/hw/net/lasi_82596.h
>  F: pc-bios/hppa-firmware.img
>
>  M68K Machines
>

Re: Deprecate the ppc405 boards in QEMU? (was: [PATCH v3 4/7] MAINTAINERS: Orphan obscure ppc platforms)

2021-10-05 Thread Thomas Huth


On 05/10/2021 10.05, Alexey Kardashevskiy wrote:



On 05/10/2021 17:42, Thomas Huth wrote:

On 05/10/2021 08.18, Alexey Kardashevskiy wrote:



On 05/10/2021 15:44, Christophe Leroy wrote:



Le 05/10/2021 à 02:48, David Gibson a écrit :

On Fri, Oct 01, 2021 at 04:18:49PM +0200, Thomas Huth wrote:

On 01/10/2021 15.04, Christophe Leroy wrote:



Le 01/10/2021 à 14:04, Thomas Huth a écrit :

On 01/10/2021 13.12, Peter Maydell wrote:

On Fri, 1 Oct 2021 at 10:43, Thomas Huth  wrote:

Nevertheless, as long as nobody has a hint where to find that
ppc405_rom.bin, I think both boards are pretty useless in QEMU (as far as I
can see, they do not work without the bios at all, so it's also not possible
to use a Linux image with the "-kernel" CLI option directly).


It is at least in theory possible to run bare-metal code on
either board, by passing either a pflash or a bios argument.


True. I did some more research, and seems like there was once
support for those boards in u-boot, but it got removed there a
couple of years ago already:

https://gitlab.com/qemu-project/u-boot/-/commit/98f705c9cefdf

https://gitlab.com/qemu-project/u-boot/-/commit/b147ff2f37d5b

https://gitlab.com/qemu-project/u-boot/-/commit/7514037bcdc37


But I agree that there seem to be no signs of anybody actually
successfully using these boards for anything, so we should
deprecate-and-delete them.


Yes, let's mark them as deprecated now ... if someone still uses
them and speaks up, we can still revert the deprecation again.


I really would like to be able to use them to validate Linux Kernel
changes, hence looking for that missing BIOS.

If we remove ppc405 from QEMU, we won't be able to do any regression
tests of Linux Kernel on those processors.


If you/someone managed to compile an old version of u-boot for one of these
two boards, so that we would finally have something for regression testing,
we can of course also keep the boards in QEMU...


I can see that it would be useful for some cases, but unless someone
volunteers to track down the necessary firmware and look after it, I
think we do need to deprecate it - I certainly don't have the capacity
to look into this.



I will look at it, please allow me a few weeks though.


Well, building it was not hard but now I'd like to know what board QEMU 
actually emulates, there are way too many codenames and PVRs.


Here is what I was building:
https://github.com/aik/u-boot/tree/ppc4xx-qemu

CONFIG_SYS_ARCH="powerpc"
CONFIG_SYS_CPU="ppc4xx"
CONFIG_SYS_VENDOR="esd"
CONFIG_SYS_BOARD="pmc405de"
CONFIG_SYS_CONFIG_NAME="PMC405DE"

Is this any use?


If I've got u-boot commit 98f705c9cefdfdba62c069821bbba10273a0a8
right, there used to be SYS_BOARD="405ep" config before that removal, so 
that sounds like a promising match for the ref405ep of QEMU?


Tricky. The board can be 405ep if 
TARGET_IO/TARGET_DLVISION/TARGET_DLVISION_10G selected. Neither compiles at 
98f705c9cefdfdba62c^ due to missing CONFIG_SYS_PCI_PTM1PCI :-/




The support for "taihu" even got removed earlier, in u-boot commit 
123b6cd7a4f75536734a7bff97db6eebce614bd1 , and the commit message says 
that it did not compile anymore at the end, so you might need to check out 
an even older version for that one.



What is so special about taihu?


taihu is the other 405 board defined in hw/ppc/ppc405_boards.c (which I 
suggested to deprecate now)


 Thomas

[PATCH 2/3] vdpa: Add vhost_vdpa_section_end

2021-10-05 Thread Eugenio Pérez

Abstract this operation, that will be reused when validating the region
against the iova range that the device supports.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index ea1aa71ad8..a1de6c7c9c 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -24,6 +24,15 @@
 #include "trace.h"
 #include "qemu-common.h"
 
+static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
+{
+Int128 llend = int128_make64(section->offset_within_address_space);
+llend = int128_add(llend, section->size);
+llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+
+return llend;
+}
+
 static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
 return (!memory_region_is_ram(section->mr) &&
@@ -160,10 +169,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 }
 
 iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-llend = int128_make64(section->offset_within_address_space);
-llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
-
+llend = vhost_vdpa_section_end(section);
 if (int128_ge(int128_make64(iova), llend)) {
 return;
 }
@@ -221,9 +227,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 }
 
 iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-llend = int128_make64(section->offset_within_address_space);
-llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+llend = vhost_vdpa_section_end(section);
 
 trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend));
 
-- 
2.27.0
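
A rough standalone sketch of the arithmetic that the new helper and its
callers perform: the start is rounded up to a page boundary, the end is
rounded down, and the section is ignored when nothing is left in between.
The 4 KiB page size and the toy_* names are assumptions made for the
example, and plain uint64_t is used instead of Int128, so it is only valid
when offset + size does not wrap.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE UINT64_C(4096)            /* assumed page size */
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

/* Round an address up to the next page boundary (like TARGET_PAGE_ALIGN). */
static uint64_t toy_page_align_up(uint64_t addr)
{
    return (addr + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK;
}

/* Exclusive end of the section, rounded down to a page boundary. */
static uint64_t toy_section_end(uint64_t offset, uint64_t size)
{
    return (offset + size) & TOY_PAGE_MASK;
}

/* True when no whole page lies inside [offset, offset + size). */
static bool toy_section_is_empty(uint64_t offset, uint64_t size)
{
    return toy_page_align_up(offset) >= toy_section_end(offset, size);
}
```

For example, a section at offset 0x1001 of size 0xffe covers no complete
page, so a listener following this pattern would return early.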

Re: [PATCH 06/13] vhost-user-fs: Use helpers to create/cleanup virtqueue

2021-10-05 Thread Stefan Hajnoczi

On Mon, Oct 04, 2021 at 03:58:09PM -0400, Vivek Goyal wrote:
> On Mon, Oct 04, 2021 at 02:54:17PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Sep 30, 2021 at 11:30:30AM -0400, Vivek Goyal wrote:
> > > Add helpers to create/cleanup virtuqueues and use those helpers. I will
> > 
> > s/virtuqueues/virtqueues/
> > 
> > > need to reconfigure queues in later patches and using helpers will allow
> > > reusing the code.
> > > 
> > > Signed-off-by: Vivek Goyal 
> > > ---
> > >  hw/virtio/vhost-user-fs.c | 87 +++
> > >  1 file changed, 52 insertions(+), 35 deletions(-)
> > > 
> > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > > index c595957983..d1efbc5b18 100644
> > > --- a/hw/virtio/vhost-user-fs.c
> > > +++ b/hw/virtio/vhost-user-fs.c
> > > @@ -139,6 +139,55 @@ static void vuf_set_status(VirtIODevice *vdev, 
> > > uint8_t status)
> > >  }
> > >  }
> > >  
> > > +static void vuf_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> > > +{
> > > +/*
> > > + * Not normally called; it's the daemon that handles the queue;
> > > + * however virtio's cleanup path can call this.
> > > + */
> > > +}
> > > +
> > > +static void vuf_create_vqs(VirtIODevice *vdev)
> > > +{
> > > +VHostUserFS *fs = VHOST_USER_FS(vdev);
> > > +unsigned int i;
> > > +
> > > +/* Hiprio queue */
> > > +fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> > > + vuf_handle_output);
> > > +
> > > +/* Request queues */
> > > +fs->req_vqs = g_new(VirtQueue *, fs->conf.num_request_queues);
> > > +for (i = 0; i < fs->conf.num_request_queues; i++) {
> > > +fs->req_vqs[i] = virtio_add_queue(vdev, fs->conf.queue_size,
> > > +  vuf_handle_output);
> > > +}
> > > +
> > > +/* 1 high prio queue, plus the number configured */
> > > +fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
> > > +fs->vhost_dev.vqs = g_new0(struct vhost_virtqueue, 
> > > fs->vhost_dev.nvqs);
> > 
> > These two lines prepare for vhost_dev_init(), so moving them here is
> > debatable. If a caller is going to use this function again in the future
> > then they need to be sure to also call vhost_dev_init(). For now it
> > looks safe, so I guess it's okay.
> 
> Hmm..., I do call this function later from vuf_set_features() and
> reconfigure the queues. I see that I don't call vhost_dev_init()
> in that path. I am not even sure if I should be calling
> vhost_dev_init() from inside vuf_set_features().
> 
> So the core requirement is that at the time of first creating the device
> I have no idea if the driver supports the notification queue or not. So
> I do create the device with a notification queue. But later, if the driver
> (and possibly the vhost device) does not support the notification queue,
> then we need to reconfigure the queues. What's the correct way to
> do that?

Ah, I see. The simplest approach is to always allocate the maximum
number of virtqueues. QEMU's vhost-user-fs device shouldn't need to
worry about which virtqueues are actually in use. Let virtiofsd (the
vhost-user backend) worry about that.

I posted ideas about how to do that in a reply to another patch in this
series. I can't guarantee it will work, but I think it's worth
exploring.

Stefan
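
As a sketch of the "always allocate the maximum" idea, assuming the queue
layout used in this series (one hiprio queue, one notification queue, and
the configured request queues): the device side sizes its arrays with the
maximum, and only the backend needs the negotiated count. The toy_* names
are hypothetical and not part of QEMU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed layout: 1 hiprio + 1 notification + N request queues. */
enum { TOY_HIPRIO_VQS = 1, TOY_NOTIFICATION_VQS = 1 };

/* Queues to allocate up front, independent of feature negotiation. */
static size_t toy_max_vqs(size_t num_request_queues)
{
    return TOY_HIPRIO_VQS + TOY_NOTIFICATION_VQS + num_request_queues;
}

/* Queues actually in use once it is known whether notification is on. */
static size_t toy_used_vqs(size_t num_request_queues, bool notification)
{
    return TOY_HIPRIO_VQS +
           (notification ? TOY_NOTIFICATION_VQS : 0) +
           num_request_queues;
}
```

With this split the allocation never has to be redone when the driver turns
out not to support the notification queue; one slot simply stays unused.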


signature.asc
Description: PGP signature

[PATCH v1 1/2] migration: block-dirty-bitmap: add missing qemu_mutex_lock_iothread

2021-10-05 Thread Emanuele Giuseppe Esposito

init_dirty_bitmap_migration assumes the iothread lock (BQL)
to be held, but it isn't.

Instead of adding the lock to qemu_savevm_state_setup(),
follow the same pattern as the other ->save_setup callbacks
and lock+unlock inside dirty_bitmap_save_setup().

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 migration/block-dirty-bitmap.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 35f5ef688d..9aba7d9c22 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1215,7 +1215,10 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
 {
 DBMSaveState *s = &((DBMState *)opaque)->save;
 SaveBitmapState *dbms = NULL;
+
+qemu_mutex_lock_iothread();
 if (init_dirty_bitmap_migration(s) < 0) {
+qemu_mutex_unlock_iothread();
 return -1;
 }
 
@@ -1223,7 +1226,7 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
 send_bitmap_start(f, s, dbms);
 }
 qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
-
+qemu_mutex_unlock_iothread();
 return 0;
 }
 
-- 
2.27.0
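
The lock+unlock pattern in this patch, including the unlock on the early
error return, can be sketched in isolation as follows. The toy lock is a
single-threaded stand-in that only tracks whether the lock is held; it is
not the BQL, and toy_init() and toy_save_setup() are hypothetical names for
a callee that requires the lock and for the callback that takes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for a global lock: tracks "held" only. */
static bool toy_lock_held;

static void toy_lock(void)
{
    assert(!toy_lock_held);
    toy_lock_held = true;
}

static void toy_unlock(void)
{
    assert(toy_lock_held);
    toy_lock_held = false;
}

/* Callee that assumes the lock is held, like the init function above. */
static int toy_init(bool fail)
{
    assert(toy_lock_held);
    return fail ? -1 : 0;
}

/* Caller takes the lock itself and releases it on every exit path. */
static int toy_save_setup(bool fail)
{
    toy_lock();
    if (toy_init(fail) < 0) {
        toy_unlock();       /* do not leak the lock on the error path */
        return -1;
    }
    /* ... further work that needs the lock would go here ... */
    toy_unlock();
    return 0;
}
```

Whichever way toy_save_setup() returns, toy_lock_held is false afterwards,
which is the property the two added unlock calls provide.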

Re: [PATCH 2/3] vdpa: Add vhost_vdpa_section_end

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 10:01:30AM +0200, Eugenio Pérez wrote:
> Abstract this operation, that will be reused when validating the region
> against the iova range that the device supports.
> 
> Signed-off-by: Eugenio Pérez 

Note that, as defined, end is actually 1 byte beyond the end of the section.
As such it can overflow, e.g. if cast to u64.
So be careful to use int128 ops with it.
Also - document?

> ---
>  hw/virtio/vhost-vdpa.c | 18 +++---
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index ea1aa71ad8..a1de6c7c9c 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -24,6 +24,15 @@
>  #include "trace.h"
>  #include "qemu-common.h"
>  
> +static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
> +{
> +Int128 llend = int128_make64(section->offset_within_address_space);
> +llend = int128_add(llend, section->size);
> +llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> +
> +return llend;
> +}
> +
>  static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
>  {
>  return (!memory_region_is_ram(section->mr) &&
> @@ -160,10 +169,7 @@ static void 
> vhost_vdpa_listener_region_add(MemoryListener *listener,
>  }
>  
>  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> -llend = int128_make64(section->offset_within_address_space);
> -llend = int128_add(llend, section->size);
> -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> -
> +llend = vhost_vdpa_section_end(section);
>  if (int128_ge(int128_make64(iova), llend)) {
>  return;
>  }
> @@ -221,9 +227,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
> *listener,
>  }
>  
>  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> -llend = int128_make64(section->offset_within_address_space);
> -llend = int128_add(llend, section->size);
> -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> +llend = vhost_vdpa_section_end(section);
>  
>  trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend));
>  
> -- 
> 2.27.0
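
To make the overflow concern above concrete, here is a small sketch with
64-bit arithmetic only. The struct toy_section and both helpers are
hypothetical; in QEMU section->size is an Int128, which this example does
not model, and the page masking is left out. It shows that the exclusive
end of a section reaching the top of the address space does not fit in
64 bits, while the inclusive last byte does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical section: start offset and size in bytes. */
struct toy_section {
    uint64_t offset;
    uint64_t size;
};

/*
 * Exclusive end (offset + size) in 64 bits. Returns false when the sum
 * does not fit, which is the case the review comment warns about.
 */
static bool toy_end_u64(const struct toy_section *s, uint64_t *end)
{
    if (s->size > UINT64_MAX - s->offset) {
        return false;
    }
    *end = s->offset + s->size;
    return true;
}

/* Inclusive last byte (offset + size - 1); fails only for empty or wrapping sections. */
static bool toy_last_u64(const struct toy_section *s, uint64_t *last)
{
    if (s->size == 0 || s->size - 1 > UINT64_MAX - s->offset) {
        return false;
    }
    *last = s->offset + (s->size - 1);
    return true;
}
```

A section covering the last 4 KiB of a 64-bit address space makes
toy_end_u64() fail, whereas toy_last_u64() yields UINT64_MAX. Keeping the
value in a wider type, or documenting that the end is exclusive, avoids
the surprise.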

Re: [PATCH 08/13] virtiofsd: Create a notification queue

2021-10-05 Thread Stefan Hajnoczi

On Mon, Oct 04, 2021 at 05:01:07PM -0400, Vivek Goyal wrote:
> On Mon, Oct 04, 2021 at 03:30:38PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Sep 30, 2021 at 11:30:32AM -0400, Vivek Goyal wrote:
> > > Add a notification queue which will be used to send async notifications
> > > for file lock availability.
> > > 
> > > Signed-off-by: Vivek Goyal 
> > > Signed-off-by: Ioannis Angelakopoulos 
> > > ---
> > >  hw/virtio/vhost-user-fs-pci.c |  4 +-
> > >  hw/virtio/vhost-user-fs.c | 62 +--
> > >  include/hw/virtio/vhost-user-fs.h |  2 +
> > >  tools/virtiofsd/fuse_i.h  |  1 +
> > >  tools/virtiofsd/fuse_virtio.c | 70 +++
> > >  5 files changed, 116 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/hw/virtio/vhost-user-fs-pci.c b/hw/virtio/vhost-user-fs-pci.c
> > > index 2ed8492b3f..cdb9471088 100644
> > > --- a/hw/virtio/vhost-user-fs-pci.c
> > > +++ b/hw/virtio/vhost-user-fs-pci.c
> > > @@ -41,8 +41,8 @@ static void vhost_user_fs_pci_realize(VirtIOPCIProxy 
> > > *vpci_dev, Error **errp)
> > >  DeviceState *vdev = DEVICE(&dev->vdev);
> > >  
> > >  if (vpci_dev->nvectors == DEV_NVECTORS_UNSPECIFIED) {
> > > -/* Also reserve config change and hiprio queue vectors */
> > > -vpci_dev->nvectors = dev->vdev.conf.num_request_queues + 2;
> > > +/* Also reserve config change, hiprio and notification queue 
> > > vectors */
> > > +vpci_dev->nvectors = dev->vdev.conf.num_request_queues + 3;
> > >  }
> > >  
> > >  qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > > index d1efbc5b18..6bafcf0243 100644
> > > --- a/hw/virtio/vhost-user-fs.c
> > > +++ b/hw/virtio/vhost-user-fs.c
> > > @@ -31,6 +31,7 @@ static const int user_feature_bits[] = {
> > >  VIRTIO_F_NOTIFY_ON_EMPTY,
> > >  VIRTIO_F_RING_PACKED,
> > >  VIRTIO_F_IOMMU_PLATFORM,
> > > +VIRTIO_FS_F_NOTIFICATION,
> > >  
> > >  VHOST_INVALID_FEATURE_BIT
> > >  };
> > > @@ -147,7 +148,7 @@ static void vuf_handle_output(VirtIODevice *vdev, 
> > > VirtQueue *vq)
> > >   */
> > >  }
> > >  
> > > -static void vuf_create_vqs(VirtIODevice *vdev)
> > > +static void vuf_create_vqs(VirtIODevice *vdev, bool notification_vq)
> > >  {
> > >  VHostUserFS *fs = VHOST_USER_FS(vdev);
> > >  unsigned int i;
> > > @@ -155,6 +156,15 @@ static void vuf_create_vqs(VirtIODevice *vdev)
> > >  /* Hiprio queue */
> > >  fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> > >   vuf_handle_output);
> > > +/*
> > > + * Notification queue. Feature negotiation happens later. So at this
> > > + * point of time we don't know if driver will use notification queue
> > > + * or not.
> > > + */
> > > +if (notification_vq) {
> > > +fs->notification_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> > > +   vuf_handle_output);
> > > +}
> > >  
> > >  /* Request queues */
> > >  fs->req_vqs = g_new(VirtQueue *, fs->conf.num_request_queues);
> > > @@ -163,8 +173,12 @@ static void vuf_create_vqs(VirtIODevice *vdev)
> > >vuf_handle_output);
> > >  }
> > >  
> > > -/* 1 high prio queue, plus the number configured */
> > > -fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
> > > +/* 1 high prio queue, 1 notification queue plus the number 
> > > configured */
> > > +if (notification_vq) {
> > > +fs->vhost_dev.nvqs = 2 + fs->conf.num_request_queues;
> > > +} else {
> > > +fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
> > > +}
> > >  fs->vhost_dev.vqs = g_new0(struct vhost_virtqueue, 
> > > fs->vhost_dev.nvqs);
> > >  }
> > >  
> > > @@ -176,6 +190,11 @@ static void vuf_cleanup_vqs(VirtIODevice *vdev)
> > >  virtio_delete_queue(fs->hiprio_vq);
> > >  fs->hiprio_vq = NULL;
> > >  
> > > +if (fs->notification_vq) {
> > > +virtio_delete_queue(fs->notification_vq);
> > > +}
> > > +fs->notification_vq = NULL;
> > > +
> > >  for (i = 0; i < fs->conf.num_request_queues; i++) {
> > >  virtio_delete_queue(fs->req_vqs[i]);
> > >  }
> > > @@ -194,9 +213,43 @@ static uint64_t vuf_get_features(VirtIODevice *vdev,
> > >  {
> > >  VHostUserFS *fs = VHOST_USER_FS(vdev);
> > >  
> > > +virtio_add_feature(&features, VIRTIO_FS_F_NOTIFICATION);
> > > +
> > >  return vhost_get_features(&fs->vhost_dev, user_feature_bits, 
> > > features);
> > >  }
> > >  
> > > +static void vuf_set_features(VirtIODevice *vdev, uint64_t features)
> > > +{
> > > +VHostUserFS *fs = VHOST_USER_FS(vdev);
> > > +
> > > +if (virtio_has_feature(features, VIRTIO_FS_F_NOTIFICATION)) {
> > > +fs->notify_enabled = true;
> > > +/*
> > > + * If guest first booted with no notification queue support and
> > > +

Re: Deprecate the ppc405 boards in QEMU? (was: [PATCH v3 4/7] MAINTAINERS: Orphan obscure ppc platforms)

2021-10-05 Thread Cédric Le Goater


On 10/5/21 08:18, Alexey Kardashevskiy wrote:



On 05/10/2021 15:44, Christophe Leroy wrote:



Le 05/10/2021 à 02:48, David Gibson a écrit :

On Fri, Oct 01, 2021 at 04:18:49PM +0200, Thomas Huth wrote:

On 01/10/2021 15.04, Christophe Leroy wrote:



Le 01/10/2021 à 14:04, Thomas Huth a écrit :

On 01/10/2021 13.12, Peter Maydell wrote:

On Fri, 1 Oct 2021 at 10:43, Thomas Huth  wrote:

Nevertheless, as long as nobody has a hint where to find that
ppc405_rom.bin, I think both boards are pretty useless in QEMU (as far as I
can see, they do not work without the bios at all, so it's
also not possible
to use a Linux image with the "-kernel" CLI option directly).


It is at least in theory possible to run bare-metal code on
either board, by passing either a pflash or a bios argument.


True. I did some more research, and seems like there was once
support for those boards in u-boot, but it got removed there a
couple of years ago already:

https://gitlab.com/qemu-project/u-boot/-/commit/98f705c9cefdf

https://gitlab.com/qemu-project/u-boot/-/commit/b147ff2f37d5b

https://gitlab.com/qemu-project/u-boot/-/commit/7514037bcdc37


But I agree that there seem to be no signs of anybody actually
successfully using these boards for anything, so we should
deprecate-and-delete them.


Yes, let's mark them as deprecated now ... if someone still uses
them and speaks up, we can still revert the deprecation again.


I really would like to be able to use them to validate Linux Kernel
changes, hence looking for that missing BIOS.

If we remove ppc405 from QEMU, we won't be able to do any regression
tests of Linux Kernel on those processors.


If you/someone managed to compile an old version of u-boot for one of these
two boards, so that we would finally have something for regression testing,
we can of course also keep the boards in QEMU...


I can see that it would be useful for some cases, but unless someone
volunteers to track down the necessary firmware and look after it, I
think we do need to deprecate it - I certainly don't have the capacity
to look into this.



I will look at it, please allow me a few weeks though.


Well, building it was not hard but now I'd like to know what board QEMU 
actually emulates, there are way too many codenames and PVRs.


yes. We should try to reduce the list below. Deprecating embedded machines
is one way.

C.


$ ./install/bin/qemu-system-ppc -cpu ?
PowerPC 601_v0   PVR 00010001
PowerPC 601_v1   PVR 00010001
PowerPC 601_v2   PVR 00010002
PowerPC 601  (alias for 601_v2)
PowerPC 601v (alias for 601_v2)
PowerPC 603  PVR 00030100
PowerPC mpc8240  (alias for 603)
PowerPC vanilla  (alias for 603)
PowerPC 604  PVR 00040103
PowerPC ppc32(alias for 604)
PowerPC ppc  (alias for 604)
PowerPC default  (alias for 604)
PowerPC 602  PVR 00050100
PowerPC 603e_v1.1PVR 00060101
PowerPC 603e_v1.2PVR 00060102
PowerPC 603e_v1.3PVR 00060103
PowerPC 603e_v1.4PVR 00060104
PowerPC 603e_v2.2PVR 00060202
PowerPC 603e_v3  PVR 00060300
PowerPC 603e_v4  PVR 00060400
PowerPC 603e_v4.1PVR 00060401
PowerPC 603e (alias for 603e_v4.1)
PowerPC stretch  (alias for 603e_v4.1)
PowerPC 603p PVR 0007
PowerPC 603e7v   PVR 00070100
PowerPC vaillant (alias for 603e7v)
PowerPC 603e7v1  PVR 00070101
PowerPC 603e7PVR 00070200
PowerPC 603e7v2  PVR 00070201
PowerPC 603e7t   PVR 00071201
PowerPC 603r (alias for 603e7t)
PowerPC goldeneye(alias for 603e7t)
PowerPC 740_v1.0 PVR 00080100
PowerPC 740e PVR 00080100
PowerPC 750_v1.0 PVR 00080100
PowerPC 750_v2.0 PVR 00080200
PowerPC 740_v2.0 PVR 00080200
PowerPC 750e PVR 00080200
PowerPC 750_v2.1 PVR 00080201
PowerPC 740_v2.1 PVR 00080201
PowerPC 750_v2.2 PVR 00080202
PowerPC 740_v2.2 PVR 00080202
PowerPC 750_v3.0 PVR 00080300
PowerPC 740_v3.0 PVR 00080300
PowerPC 750_v3.1 PVR 00080301
PowerPC 750  (alias for 750_v3.1)
PowerPC typhoon  (alias for 750_v3.1)
PowerPC g3   (alias for 750_v3.1)
PowerPC 740_v3.1 PVR 00080301
PowerPC 740  (alias for 740_v3.1)
PowerPC arthur   (alias for 740_v3.1)
PowerPC 750cx_v1.0   PVR 00082100
PowerPC 750cx_v2.0   PVR 00082200
PowerPC 750cx_v2.1   PVR 00082201
PowerPC 750cx_v2.2   PVR 00082202
PowerPC 750cx(alias for 750cx_v2.2)
PowerPC 750cxe_v2.1  PVR 00082211
PowerPC 750cxe_v2.2  PVR 00082212
PowerPC 750cxe_v2.3  PVR 00082213
PowerPC 750cxe_v2.4  PVR 00082214
PowerPC 750cxe_v3.0  PVR 00082310
PowerPC 750cxe_v3.1  PVR 00082311
PowerPC 745_v1.0 PVR 00083100
PowerPC 755_v1.0 PVR 00083100
PowerPC 755_v1.1 PVR 00083101
Power

[PATCH v1 0/2] Migration: fix missing iothread locking

2021-10-05 Thread Emanuele Giuseppe Esposito

Some functions (in this case qemu_savevm_state_complete_postcopy() and
init_dirty_bitmap_migration()) assume and document that
qemu_mutex_lock_iothread() is held.

This seems to have been forgotten in some places, and this series
aims to fix that.

Patch 1 was part of my RFC block layer series "block layer: split
block APIs in graph and I/O" but I decided to do a separate series
for these two bugs, as they are independent from the API split.

Signed-off-by: Emanuele Giuseppe Esposito 

Emanuele Giuseppe Esposito (2):
  migration: block-dirty-bitmap: add missing qemu_mutex_lock_iothread
  migration: add missing qemu_mutex_lock_iothread in
migration_completion

 migration/block-dirty-bitmap.c | 5 -
 migration/migration.c  | 3 +++
 2 files changed, 7 insertions(+), 1 deletion(-)

-- 
2.27.0

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Peter Lieven


Am 05.10.21 um 09:54 schrieb Ilya Dryomov:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
   unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
   commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

  block/rbd.c | 126 
  1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
  return spec_info;
  }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?



That would happen if you diff from a snapshot, the question is if it can also 
happen if the image is a clone from a snapshot?





+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,
even better, drop it entirely and return one of the two bitmasks
directly.


+struct rbd_diff_req req = { .offs =

Re: [PATCH v4 2/3] docs: (further) remove non-reference uses of single backticks

2021-10-05 Thread Damien Hedde





On 10/4/21 23:52, John Snow wrote:

The series rotted already. Here's the new changes.

Signed-off-by: John Snow 


Reviewed-by: Damien Hedde 


---
  docs/system/i386/sgx.rst | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/system/i386/sgx.rst b/docs/system/i386/sgx.rst
index f103ae2a2fd..9aa161af1a1 100644
--- a/docs/system/i386/sgx.rst
+++ b/docs/system/i386/sgx.rst
@@ -77,9 +77,9 @@ CPUID
  
  Due to its myriad dependencies, SGX is currently not listed as supported

  in any of Qemu's built-in CPU configuration. To expose SGX (and SGX Launch
-Control) to a guest, you must either use `-cpu host` to pass-through the
+Control) to a guest, you must either use ``-cpu host`` to pass-through the
  host CPU model, or explicitly enable SGX when using a built-in CPU model,
-e.g. via `-cpu ,+sgx` or `-cpu ,+sgx,+sgxlc`.
+e.g. via ``-cpu ,+sgx`` or ``-cpu ,+sgx,+sgxlc``.
  
  All SGX sub-features enumerated through CPUID, e.g. SGX2, MISCSELECT,

  ATTRIBUTES, etc... can be restricted via CPUID flags. Be aware that enforcing
@@ -126,7 +126,7 @@ creating VM with SGX.
  Feature Control
  ~~~
  
-Qemu SGX updates the `etc/msr_feature_control` fw_cfg entry to set the SGX

+Qemu SGX updates the ``etc/msr_feature_control`` fw_cfg entry to set the SGX
  (bit 18) and SGX LC (bit 17) flags based on their respective CPUID support,
  i.e. existing guest firmware will automatically set SGX and SGX LC 
accordingly,
  assuming said firmware supports fw_cfg.msr_feature_control.

[PATCH v1 2/2] migration: add missing qemu_mutex_lock_iothread in migration_completion

2021-10-05 Thread Emanuele Giuseppe Esposito

qemu_savevm_state_complete_postcopy assumes the iothread lock (BQL)
to be held, but it isn't.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 migration/migration.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index 041b8451a6..215d5281f2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3182,7 +3182,10 @@ static void migration_completion(MigrationState *s)
     } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
         trace_migration_completion_postcopy_end();
 
+        qemu_mutex_lock_iothread();
         qemu_savevm_state_complete_postcopy(s->to_dst_file);
+        qemu_mutex_unlock_iothread();
+
         trace_migration_completion_postcopy_end_after_complete();
     } else if (s->state == MIGRATION_STATUS_CANCELLING) {
         goto fail;
-- 
2.27.0
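
For readers following along, here is a minimal sketch of the locking
pattern the patch applies. It is not QEMU code: toy_bql_lock(),
toy_bql_unlock(), toy_complete_postcopy() and toy_migration_completion()
are hypothetical stand-ins, and the lock is modelled as a plain flag
rather than a real mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the BQL: it only tracks whether the lock is held. */
static bool toy_bql_held;

static void toy_bql_lock(void)
{
    assert(!toy_bql_held);   /* the toy lock is not recursive */
    toy_bql_held = true;
}

static void toy_bql_unlock(void)
{
    assert(toy_bql_held);
    toy_bql_held = false;
}

/*
 * Hypothetical callee that requires the lock to be held by its caller.
 * It returns false instead of aborting so a missing lock is observable.
 */
static bool toy_complete_postcopy(int *completed)
{
    if (!toy_bql_held) {
        return false;
    }
    (*completed)++;
    return true;
}

/* Caller that brackets the call with lock/unlock, as the patch does. */
static bool toy_migration_completion(int *completed)
{
    bool ok;

    toy_bql_lock();
    ok = toy_complete_postcopy(completed);
    toy_bql_unlock();
    return ok;
}
```

Calling toy_complete_postcopy() directly fails in this model, while
going through toy_migration_completion() succeeds and leaves the lock
released afterwards.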

Re: [PATCH] qapi: Make some ObjectTypes depend on the build settings

2021-10-05 Thread Markus Armbruster

Thomas Huth  writes:

> Some of the ObjectType entries already depend on CONFIG_* switches.
> Some others also only make sense with certain configurations, but
> are currently always listed in the ObjectType enum. Let's make them
> depend on the corresponding CONFIG_* switches, too, so that upper
> layers (like libvirt) have a better way to determine which features
> are available in QEMU.
>
> Signed-off-by: Thomas Huth 

All these look good to me.  I didn't look for more.

Reviewed-by: Markus Armbruster

Re: [PATCH 3/3] vdpa: Check for iova range at mappings changes

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 10:01:31AM +0200, Eugenio Pérez wrote:
> Check vdpa device range before updating memory regions so we don't add
> any outside of it, and report the invalid change if any.
> 
> Signed-off-by: Eugenio Pérez 
> ---
>  include/hw/virtio/vhost-vdpa.h |  2 +
>  hw/virtio/vhost-vdpa.c | 68 ++
>  hw/virtio/trace-events |  1 +
>  3 files changed, 55 insertions(+), 16 deletions(-)
> 
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index a8963da2d9..c288cf7ecb 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -13,6 +13,7 @@
>  #define HW_VIRTIO_VHOST_VDPA_H
>  
>  #include "hw/virtio/virtio.h"
> +#include "standard-headers/linux/vhost_types.h"
>  
>  typedef struct VhostVDPAHostNotifier {
>  MemoryRegion mr;
> @@ -24,6 +25,7 @@ typedef struct vhost_vdpa {
>  uint32_t msg_type;
>  bool iotlb_batch_begin_sent;
>  MemoryListener listener;
> +struct vhost_vdpa_iova_range iova_range;
>  struct vhost_dev *dev;
>  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>  } VhostVDPA;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index a1de6c7c9c..26d0258723 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -33,20 +33,34 @@ static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
>  return llend;
>  }
>  
> -static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
> -{
> -return (!memory_region_is_ram(section->mr) &&
> -!memory_region_is_iommu(section->mr)) ||
> -memory_region_is_protected(section->mr) ||
> -   /* vhost-vDPA doesn't allow MMIO to be mapped  */
> -memory_region_is_ram_device(section->mr) ||
> -   /*
> -* Sizing an enabled 64-bit BAR can cause spurious mappings to
> -* addresses in the upper part of the 64-bit address space.  These
> -* are never accessed by the CPU and beyond the address width of
> -* some IOMMU hardware.  TODO: VDPA should tell us the IOMMU width.
> -*/
> -   section->offset_within_address_space & (1ULL << 63);
> +static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
> +uint64_t iova_min,
> +uint64_t iova_max)
> +{
> +Int128 llend;
> +bool r = (!memory_region_is_ram(section->mr) &&
> +  !memory_region_is_iommu(section->mr)) ||
> +  memory_region_is_protected(section->mr) ||
> +  /* vhost-vDPA doesn't allow MMIO to be mapped  */
> +  memory_region_is_ram_device(section->mr);
> +if (r) {
> +return true;
> +}
> +
> +if (section->offset_within_address_space < iova_min) {
> +error_report("RAM section out of device range (min=%lu, addr=%lu)",
> + iova_min, section->offset_within_address_space);
> +return true;
> +}
> +
> +llend = vhost_vdpa_section_end(section);
> +if (int128_make64(llend) > iova_max) {

I am puzzled by this.
You are taking an Int128, converting it to u64, converting it
back to Int128, and comparing the result to a u64.
My head spins. What is all this back and forth trying to achieve?
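
To make the concern concrete, here is a rough sketch of keeping the
whole comparison in 128 bits. It is only an illustration, not the QEMU
Int128 API: toy_u128 relies on the GCC/Clang unsigned __int128
extension, toy_section_end() and toy_section_above_range() are
hypothetical names, and treating iova_max as an inclusive bound is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for Int128; unsigned __int128 is a GCC/Clang extension. */
typedef unsigned __int128 toy_u128;

/* Exclusive end of a section. It can be 2^64, which no uint64_t holds. */
static toy_u128 toy_section_end(uint64_t offset, toy_u128 size)
{
    return (toy_u128)offset + size;
}

/*
 * True when the last byte of the section lies above iova_max.
 * Assumes size > 0 and that iova_max is an inclusive upper bound.
 * The 64-bit bound is widened once; nothing is narrowed back.
 */
static bool toy_section_above_range(uint64_t offset, toy_u128 size,
                                    uint64_t iova_max)
{
    toy_u128 last = toy_section_end(offset, size) - 1;

    return last > (toy_u128)iova_max;
}
```

A section covering the entire 64-bit space has an exclusive end of
2^64; truncating that to 64 bits would yield 0 and hide the overflow,
which is why the comparison stays in the wider type here.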

> +error_report("RAM section out of device range (max=%lu, end addr=%lu)",
> + iova_max, (uint64_t)int128_make64(llend));
> +return true;
> +}
> +
> +return false;
>  }
>  
>  static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
> @@ -158,7 +172,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>  void *vaddr;
>  int ret;
>  
> -if (vhost_vdpa_listener_skipped_section(section)) {
> +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
> +v->iova_range.last)) {
>  return;
>  }
>  
> @@ -216,7 +231,8 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
>  Int128 llend, llsize;
>  int ret;
>  
> -if (vhost_vdpa_listener_skipped_section(section)) {
> +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
> +v->iova_range.last)) {
>  return;
>  }
>  
> @@ -284,9 +300,24 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
>  vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
>  }
>  
> +static int vhost_vdpa_get_iova_range(struct vhost_vdpa *v)
> +{
> +int ret;
> +
> +ret = vhost_vdpa_call(v->dev, VHOST_VDPA_GET_IOVA_RANGE, &v->iova_range);
> +if (ret != 0) {
> +return ret;
> +}
> +
> +trace_vhost_vdpa_get_iova_range(v->dev, v->iova_range.first,
> +v->iova_range.last);
> +return ret;
> +}
> +
>  static int vhost_vdpa_init(stru

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Ilya Dryomov

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:
>
> On 05.10.21 at 09:54, Ilya Dryomov wrote:
> > On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:
> >> the qemu rbd driver currently lacks support for bdrv_co_block_status.
> >> This results mainly in incorrect progress during block operations (e.g.
> >> qemu-img convert with an rbd image as source).
> >>
> >> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
> >> allocated and unallocated (all zero areas).
> >>
> >> To avoid querying the ceph OSDs for the answer this is only done if
> >> the image has the fast-diff feature which depends on the object-map and
> >> exclusive-lock features. In this case it is guaranteed that the information
> >> is present in memory in the librbd client and thus very fast.
> >>
> >> If fast-diff is not available all areas are reported to be allocated
> >> which is the current behaviour if bdrv_co_block_status is not implemented.
> >>
> >> Signed-off-by: Peter Lieven 
> >> ---
> >> V2->V3:
> >> - check rbd_flags every time (they can change during runtime) [Ilya]
> >> - also check for fast-diff invalid flag [Ilya]
> >> - *map and *file cant be NULL [Ilya]
> >> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
> >>unallocated area [Ilya]
> >> - typo: catched -> caught [Ilya]
> >> - changed wording about fast-diff, object-map and exclusive lock in
> >>commit msg [Ilya]
> >>
> >> V1->V2:
> >> - add commit comment [Stefano]
> >> - use failed_post_open [Stefano]
> >> - remove redundant assert [Stefano]
> >> - add macro+comment for the magic -9000 value [Stefano]
> >> - always set *file if its non NULL [Stefano]
> >>
> >>   block/rbd.c | 126 
> >>   1 file changed, 126 insertions(+)
> >>
> >> diff --git a/block/rbd.c b/block/rbd.c
> >> index dcf82b15b8..3cb24f9981 100644
> >> --- a/block/rbd.c
> >> +++ b/block/rbd.c
> >> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
> >>   return spec_info;
> >>   }
> >>
> >> +typedef struct rbd_diff_req {
> >> +uint64_t offs;
> >> +uint64_t bytes;
> >> +int exists;
> > Hi Peter,
> >
> > Nit: make exists a bool.  The one in the callback has to be an int
> > because of the callback signature but let's not spread that.
> >
> >> +} rbd_diff_req;
> >> +
> >> +/*
> > >> + * rbd_diff_iterate2 allows interrupting the execution by returning a negative
> > >> + * value in the callback routine. Choose a value that does not conflict with
> >> + * an existing exitcode and return it if we want to prematurely stop the
> >> + * execution because we detected a change in the allocation status.
> >> + */
> >> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
> >> +
> >> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
> >> +   int exists, void *opaque)
> >> +{
> >> +struct rbd_diff_req *req = opaque;
> >> +
> >> +assert(req->offs + req->bytes <= offs);
> >> +
> >> +if (req->exists && offs > req->offs + req->bytes) {
> >> +/*
> > >> + * we started in an allocated area and jumped over an unallocated area,
> >> + * req->bytes contains the length of the allocated area before the
> >> + * unallocated area. stop further processing.
> >> + */
> >> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >> +}
> >> +if (req->exists && !exists) {
> >> +/*
> >> + * we started in an allocated area and reached a hole. req->bytes
> >> + * contains the length of the allocated area before the hole.
> >> + * stop further processing.
> >> + */
> >> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> > Do you have a test case for when this branch is taken?
>
>
> That would happen if you diff from a snapshot, the question is if it can also 
> happen if the image is a clone from a snapshot?
>
>
> >
> >> +}
> >> +if (!req->exists && exists && offs > req->offs) {
> >> +/*
> >> + * we started in an unallocated area and hit the first allocated
> > >> + * block. req->bytes must be set to the length of the unallocated area
> >> + * before the allocated area. stop further processing.
> >> + */
> >> +req->bytes = offs - req->offs;
> >> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
> >> +}
> >> +
> >> +/*
> >> + * assert that we caught all cases above and allocation state has not
> >> + * changed during callbacks.
> >> + */
> >> +assert(exists == req->exists || !req->bytes);
> >> +req->exists = exists;
> >> +
> >> +/*
> > >> + * assert that we either return an unallocated block or have got callbacks
> >> + * for all allocated blocks present.
> >> + */
> >> +assert(!req->exists || offs == req->offs + req->bytes);
> >> +req->bytes = offs + len - req->offs;
> >> +
> >> +return 0;
> >> +}
> >> +
> >> +static int cor

[PATCH v2 0/3] hw/arm/virt_acpi_build: Upgrade the IORT table up to revision E.b

2021-10-05 Thread Eric Auger

This series upgrades the ACPI IORT table up to the E.b
specification revision. One of the goals of this upgrade
is to allow the addition of RMR nodes along with the SMMUv3.

It applies on top of Igor's
[PATCH v4 00/35] acpi: refactor error prone build_header() and
packed structures usage in ACPI tables

The latest IORT specification (ARM DEN 0049E.b) can be found at
IO Remapping Table - Platform Design Document
https://developer.arm.com/documentation/den0049/latest/

This series and its dependency can be found at
https://github.com/eauger/qemu.git
branch: igor_acpi_refactoring_v4_dbg2_v3_rmr_v2

History:
v1 -> v2:
- fix Revision value in ITS and SMMUv3 nodes (Phil)
- Increment an identifier (Phil)


Eric Auger (3):
  tests/acpi: Get prepared for IORT E.b revision upgrade
  hw/arm/virt-acpi-build: IORT upgrade up to revision E.b
  tests/acpi: Generate reference blob for IORT rev E.b

 hw/arm/virt-acpi-build.c  |  48 ++
 tests/data/acpi/virt/IORT | Bin 124 -> 128 bytes
 tests/data/acpi/virt/IORT.memhp   | Bin 124 -> 128 bytes
 tests/data/acpi/virt/IORT.numamem | Bin 124 -> 128 bytes
 tests/data/acpi/virt/IORT.pxb | Bin 124 -> 128 bytes
 5 files changed, 29 insertions(+), 19 deletions(-)

-- 
2.26.3
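
As a cross-check of the 124 -> 128 byte change in the reference blobs,
here is a small worked calculation. The constants are copied from the
defines in hw/arm/virt-acpi-build.c as changed by patch 2/3. That the
default test configuration has no SMMUv3 node and that the root complex
node carries exactly one ID mapping are assumptions, and
toy_iort_size() is a hypothetical helper, not a QEMU function.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TOY_IORT_NODE_OFFSET      = 48,     /* IORT_NODE_OFFSET */
    TOY_ID_MAPPING_ENTRY_SIZE = 20,     /* ID_MAPPING_ENTRY_SIZE */
    TOY_ITS_GROUP_NODE_SIZE   = 20 + 4, /* fixed header + 1 ITS identifier */
    TOY_RC_ENTRY_SIZE_REV_B   = 32,     /* old ROOT_COMPLEX_ENTRY_SIZE */
    TOY_RC_ENTRY_SIZE_REV_EB  = 36      /* new ROOT_COMPLEX_ENTRY_SIZE */
};

/*
 * Total table size for a layout with one ITS group node and one root
 * complex node that has a single ID mapping (assumed layout, no SMMUv3).
 */
static uint32_t toy_iort_size(uint32_t rc_entry_size)
{
    return TOY_IORT_NODE_OFFSET + TOY_ITS_GROUP_NODE_SIZE +
           rc_entry_size + TOY_ID_MAPPING_ENTRY_SIZE;
}
```

Under these assumptions the old layout gives 48 + 24 + 32 + 20 = 124
bytes and the new one 48 + 24 + 36 + 20 = 128 bytes, matching the
diffstat above.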

[PATCH v2 2/3] hw/arm/virt-acpi-build: IORT upgrade up to revision E.b

2021-10-05 Thread Eric Auger

Upgrade the IORT table from B to E.b specification
revision (ARM DEN 0049E.b).

Signed-off-by: Eric Auger 

---

v1 -> v2:
- Fix Revision value for ITS node and SMMUv3 node
- increment an identifier
---
 hw/arm/virt-acpi-build.c | 48 
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 257d0fee17..789bac3134 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -241,19 +241,20 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
 #endif
 
 #define ID_MAPPING_ENTRY_SIZE 20
-#define SMMU_V3_ENTRY_SIZE 60
-#define ROOT_COMPLEX_ENTRY_SIZE 32
+#define SMMU_V3_ENTRY_SIZE 68
+#define ROOT_COMPLEX_ENTRY_SIZE 36
 #define IORT_NODE_OFFSET 48
 
 static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
   uint32_t id_count, uint32_t out_ref)
 {
-/* Identity RID mapping covering the whole input RID range */
+/* Table 4 ID mapping format */
 build_append_int_noprefix(table_data, input_base, 4); /* Input base */
 build_append_int_noprefix(table_data, id_count, 4); /* Number of IDs */
 build_append_int_noprefix(table_data, input_base, 4); /* Output base */
 build_append_int_noprefix(table_data, out_ref, 4); /* Output Reference */
-build_append_int_noprefix(table_data, 0, 4); /* Flags */
+/* Flags */
+build_append_int_noprefix(table_data, 0 /* Single mapping */, 4);
 }
 
 struct AcpiIortIdMapping {
@@ -298,7 +299,7 @@ static int iort_idmap_compare(gconstpointer a, gconstpointer b)
 /*
  * Input Output Remapping Table (IORT)
  * Conforms to "IO Remapping Table System Software on ARM Platforms",
- * Document number: ARM DEN 0049B, October 2015
+ * Document number: ARM DEN 0049E, Feb 2021
  */
 static void
 build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
@@ -307,10 +308,11 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 const uint32_t iort_node_offset = IORT_NODE_OFFSET;
 size_t node_size, smmu_offset = 0;
 AcpiIortIdMapping *idmap;
+uint32_t id = 0;
 GArray *smmu_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
 GArray *its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
 
-AcpiTable table = { .sig = "IORT", .rev = 0, .oem_id = vms->oem_id,
+AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id,
 .oem_table_id = vms->oem_table_id };
 /* Table 2 The IORT */
 acpi_table_begin(&table, table_data);
@@ -358,12 +360,12 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 build_append_int_noprefix(table_data, IORT_NODE_OFFSET, 4);
 build_append_int_noprefix(table_data, 0, 4); /* Reserved */
 
-/* 3.1.1.3 ITS group node */
+/* Table 12 ITS Group Format */
 build_append_int_noprefix(table_data, 0 /* ITS Group */, 1); /* Type */
 node_size =  20 /* fixed header size */ + 4 /* 1 GIC ITS Identifier */;
 build_append_int_noprefix(table_data, node_size, 2); /* Length */
-build_append_int_noprefix(table_data, 0, 1); /* Revision */
-build_append_int_noprefix(table_data, 0, 4); /* Reserved */
+build_append_int_noprefix(table_data, 1, 1); /* Revision */
+build_append_int_noprefix(table_data, id++, 4); /* Identifier */
 build_append_int_noprefix(table_data, 0, 4); /* Number of ID mappings */
 build_append_int_noprefix(table_data, 0, 4); /* Reference to ID Array */
 build_append_int_noprefix(table_data, 1, 4); /* Number of ITSs */
@@ -374,19 +376,19 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 int irq =  vms->irqmap[VIRT_SMMU] + ARM_SPI_BASE;
 
 smmu_offset = table_data->len - table.table_offset;
-/* 3.1.1.2 SMMUv3 */
+/* Table 9 SMMUv3 Format */
 build_append_int_noprefix(table_data, 4 /* SMMUv3 */, 1); /* Type */
 node_size =  SMMU_V3_ENTRY_SIZE + ID_MAPPING_ENTRY_SIZE;
 build_append_int_noprefix(table_data, node_size, 2); /* Length */
-build_append_int_noprefix(table_data, 0, 1); /* Revision */
-build_append_int_noprefix(table_data, 0, 4); /* Reserved */
+build_append_int_noprefix(table_data, 4, 1); /* Revision */
+build_append_int_noprefix(table_data, id++, 4); /* Identifier */
 build_append_int_noprefix(table_data, 1, 4); /* Number of ID mappings */
 /* Reference to ID Array */
 build_append_int_noprefix(table_data, SMMU_V3_ENTRY_SIZE, 4);
 /* Base address */
 build_append_int_noprefix(table_data, vms->memmap[VIRT_SMMU].base, 8);
 /* Flags */
-build_append_int_noprefix(table_data, 1 /* COHACC OverrideNote */, 4);
+build_append_int_noprefix(table_data, 1 /* COHACC Override */, 4);
 build_append_int_noprefix(table_data, 0, 4); /* Reserved */
 build_append_int_noprefix(table_da

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Peter Lieven


On 05.10.21 at 10:36, Ilya Dryomov wrote:

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:

On 05.10.21 at 09:54, Ilya Dryomov wrote:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

   block/rbd.c | 126 
   1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.
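
A small sketch of what this nit amounts to, under the assumption that
the callback keeps the int parameter shown in the patch. toy_diff_req
and toy_diff_cb() are hypothetical names and the body is reduced to the
conversion itself; it is not the driver's actual callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct toy_diff_req {
    uint64_t offs;
    uint64_t bytes;
    bool exists;    /* bool in our own state, as suggested */
} toy_diff_req;

/* Same parameter list as the callback in the patch: exists stays an int. */
static int toy_diff_cb(uint64_t offs, size_t len, int exists, void *opaque)
{
    toy_diff_req *req = opaque;

    /* Normalize once at the boundary so later comparisons are bool-to-bool. */
    req->exists = exists != 0;
    req->bytes = offs + len - req->offs;
    return 0;
}
```

Normalizing at the boundary also matters for checks of the form
exists == req->exists, which would misfire if the library ever passed
a non-zero value other than 1.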


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows interrupting the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?


That would happen if you diff from a snapshot, the question is if it can also 
happen if the image is a clone from a snapshot?



+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,
even better, dro

Re: [PATCH] gitlab: Escape git-describe match pattern on Windows hosts

2021-10-05 Thread Daniel P . Berrangé

On Tue, Oct 05, 2021 at 10:40:00AM +0200, Cédric Le Goater wrote:
> > I'm curious if you go to
> > 
> >https://gitlab.com/legoater/qemu/-/settings/ci_cd
> > 
> > and expand "General pipelines", what value is set for the
> > 
> >"Git shallow clone"
> > 
> > setting.  In my fork it is 0 which means unlimited depth, but in
> > gitlab docs I see reference to repos getting this set to 50
> > since a particular gitlab release.
> 
> Sorry for the late reply.
> 
> Setting the value to 0 fixed the windows build on gitlab.

Ok, so we've got two options

 - Change the code so it has sane fallback if the tags are all missing

 - Set GIT_DEPTH in the affected jobs to a value that is larger than
   the maximum number of commits we expect in the course of a single
   dev cycle, plus 20% grace on top, so that we're guaranteed enough
   history to describe one tag.
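
For the second option, the sizing rule can be sketched as below. The
helper toy_git_depth() and any commit count fed into it are
hypothetical; the real number of commits per dev cycle would have to
be measured from the repository.

```c
#include <assert.h>

/*
 * Hypothetical helper: smallest depth that covers max_commits_per_cycle
 * plus 20% grace, i.e. ceil(max * 1.2) in integer arithmetic.
 */
static unsigned toy_git_depth(unsigned max_commits_per_cycle)
{
    return (max_commits_per_cycle * 6u + 4u) / 5u;
}
```

With an assumed maximum of 1000 commits in a cycle this gives a depth
of 1200.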


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

[PATCH v2 1/3] tests/acpi: Get prepared for IORT E.b revision upgrade

2021-10-05 Thread Eric Auger

Ignore IORT till reference blob for E.b spec revision gets
added.

Signed-off-by: Eric Auger 
---
 tests/qtest/bios-tables-test-allowed-diff.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index dfb8523c8b..9a5a923d6b 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1 +1,2 @@
 /* List of comma-separated changed AML files to ignore */
+"tests/data/acpi/virt/IORT",
-- 
2.26.3

Re: [PATCH] gitlab: Escape git-describe match pattern on Windows hosts

2021-10-05 Thread Cédric Le Goater


> I'm curious if you go to
>
>    https://gitlab.com/legoater/qemu/-/settings/ci_cd
>
> and expand "General pipelines", what value is set for the
>
>    "Git shallow clone"
>
> setting.  In my fork it is 0 which means unlimited depth, but in
> gitlab docs I see reference to repos getting this set to 50
> since a particular gitlab release.


Sorry for the late reply.

Setting the value to 0 fixed the windows build on gitlab.

Thanks,

C.

[PATCH v2 3/3] tests/acpi: Generate reference blob for IORT rev E.b

2021-10-05 Thread Eric Auger

Re-generate reference blobs with rebuild-expected-aml.sh.

Signed-off-by: Eric Auger 
---
 tests/qtest/bios-tables-test-allowed-diff.h |   1 -
 tests/data/acpi/virt/IORT   | Bin 124 -> 128 bytes
 tests/data/acpi/virt/IORT.memhp | Bin 124 -> 128 bytes
 tests/data/acpi/virt/IORT.numamem   | Bin 124 -> 128 bytes
 tests/data/acpi/virt/IORT.pxb   | Bin 124 -> 128 bytes
 5 files changed, 1 deletion(-)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index 9a5a923d6b..dfb8523c8b 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1,2 +1 @@
 /* List of comma-separated changed AML files to ignore */
-"tests/data/acpi/virt/IORT",
diff --git a/tests/data/acpi/virt/IORT b/tests/data/acpi/virt/IORT
index 521acefe9ba66706c5607321a82d330586f3f280..7efd0ce8a6b3928efa7e1373f688ab4c5f50543b 100644
GIT binary patch
literal 128
zcmebD4+?2uU|?Y0?Bwt45v<@85#X!<1dKp25F11@0kHuPgMkDCNC*yK93~3}W)K^M
VRiHGGVg_O`aDdYP|3ers^8jQz3IPBB

literal 124
zcmebD4+^Pa00MR=e`k+i1*eDrX9XZ&1PX!JAesq?4S*O7Bw!2(4Uz`|CKCt^;wu0#
QRGb+i3L*dhhtM#y0PN=p0RR91

diff --git a/tests/data/acpi/virt/IORT.memhp b/tests/data/acpi/virt/IORT.memhp
index 521acefe9ba66706c5607321a82d330586f3f280..7efd0ce8a6b3928efa7e1373f688ab4c5f50543b 100644
GIT binary patch
literal 128
zcmebD4+?2uU|?Y0?Bwt45v<@85#X!<1dKp25F11@0kHuPgMkDCNC*yK93~3}W)K^M
VRiHGGVg_O`aDdYP|3ers^8jQz3IPBB

literal 124
zcmebD4+^Pa00MR=e`k+i1*eDrX9XZ&1PX!JAesq?4S*O7Bw!2(4Uz`|CKCt^;wu0#
QRGb+i3L*dhhtM#y0PN=p0RR91

diff --git a/tests/data/acpi/virt/IORT.numamem b/tests/data/acpi/virt/IORT.numamem
index 521acefe9ba66706c5607321a82d330586f3f280..7efd0ce8a6b3928efa7e1373f688ab4c5f50543b 100644
GIT binary patch
literal 128
zcmebD4+?2uU|?Y0?Bwt45v<@85#X!<1dKp25F11@0kHuPgMkDCNC*yK93~3}W)K^M
VRiHGGVg_O`aDdYP|3ers^8jQz3IPBB

literal 124
zcmebD4+^Pa00MR=e`k+i1*eDrX9XZ&1PX!JAesq?4S*O7Bw!2(4Uz`|CKCt^;wu0#
QRGb+i3L*dhhtM#y0PN=p0RR91

diff --git a/tests/data/acpi/virt/IORT.pxb b/tests/data/acpi/virt/IORT.pxb
index 521acefe9ba66706c5607321a82d330586f3f280..7efd0ce8a6b3928efa7e1373f688ab4c5f50543b 100644
GIT binary patch
literal 128
zcmebD4+?2uU|?Y0?Bwt45v<@85#X!<1dKp25F11@0kHuPgMkDCNC*yK93~3}W)K^M
VRiHGGVg_O`aDdYP|3ers^8jQz3IPBB

literal 124
zcmebD4+^Pa00MR=e`k+i1*eDrX9XZ&1PX!JAesq?4S*O7Bw!2(4Uz`|CKCt^;wu0#
QRGb+i3L*dhhtM#y0PN=p0RR91

-- 
2.26.3

Re: Deprecate the ppc405 boards in QEMU?

2021-10-05 Thread Thomas Huth


On 05/10/2021 10.07, Thomas Huth wrote:
> On 05/10/2021 10.05, Alexey Kardashevskiy wrote:
> [...]
>> What is so special about taihu?
>
> taihu is the other 405 board defined in hw/ppc/ppc405_boards.c (which I
> suggested to deprecate now)


I've now also played with the u-boot sources a little bit, and with a bit
of tweaking, it's indeed possible to compile the old taihu board there.
However, it does not really work with QEMU anymore; it immediately triggers
an assert():


$ qemu-system-ppc -M taihu -bios u-boot.bin -serial null -serial mon:stdio
**
ERROR:accel/tcg/tcg-accel-ops.c:79:tcg_handle_interrupt: assertion failed: (qemu_mutex_iothread_locked())

Aborted (core dumped)

Going back to QEMU v2.3.0, I can see at least a little bit of output, but it 
then also triggers an assert() during DRAM initialization:


$ qemu-system-ppc -M taihu -bios u-boot.bin -serial null -serial mon:stdio

Reset PowerPC core

U-Boot 2014.10-rc2-00123-g461be2f96e-dirty (Oct 05 2021 - 10:02:56)

CPU:   AMCC PowerPC 405EP Rev. B at 770 MHz (PLB=256 OPB=128 EBC=128)
   I2C boot EEPROM disabled
   Internal PCI arbiter enabled
   16 KiB I-Cache 16 KiB D-Cache
Board: Taihu - AMCC PPC405EP Evaluation Board
I2C:   ready
DRAM:  qemu-system-ppc: memory.c:1693: memory_region_del_subregion: Assertion `subregion->container == mr' failed.

Aborted (core dumped)

Not sure if this ever worked in QEMU, maybe in the early 0.15 time, but that 
version of QEMU also does not compile easily anymore on modern systems. So 
I'm afraid, getting this into a workable shape again will take a lot of 
time. At least I'll stop my efforts here now.


 Thomas

[RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5

2021-10-05 Thread Eric Auger

Add a 'preserve_config' field in struct GPEXConfig and
if set generate the DSM #5 for preserving PCI boot configurations.
The DSM presence is needed to expose RMRs.

At the moment the DSM generation is not yet enabled.

Signed-off-by: Eric Auger 
---
 include/hw/pci-host/gpex.h |  1 +
 hw/pci-host/gpex-acpi.c| 12 
 2 files changed, 13 insertions(+)

diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index fcf8b63820..3f8f8ec38d 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -64,6 +64,7 @@ struct GPEXConfig {
 MemMapEntry pio;
 int irq;
 PCIBus  *bus;
+boolpreserve_config;
 };
 
 int gpex_set_irq_num(GPEXHost *s, int index, int gsi);
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index e7e162a00a..7dab259379 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
 aml_append(dev, aml_name_decl("_PXM", aml_int(numa_node)));
 }
 
+if (cfg->preserve_config) {
+method = aml_method("_DSM", 5, AML_SERIALIZED);
+aml_append(method, aml_return(aml_int(0)));
+aml_append(dev, method);
+}
+
 acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 
 /*
@@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
 aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device")));
 aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
 
+if (cfg->preserve_config) {
+method = aml_method("_DSM", 5, AML_SERIALIZED);
+aml_append(method, aml_return(aml_int(0)));
+aml_append(dev, method);
+}
+
 acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 
 method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
-- 
2.26.3

Re: [PATCH 2/4] aspeed/smc: Dump address offset in trace events

2021-10-05 Thread Francisco Iglesias

On [2021 Oct 04] Mon 17:46:33, Cédric Le Goater wrote:
> The register index is currently printed and this is confusing.
> 
> Signed-off-by: Cédric Le Goater 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/ssi/aspeed_smc.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ssi/aspeed_smc.c b/hw/ssi/aspeed_smc.c
> index 7129341c129e..8a988c167604 100644
> --- a/hw/ssi/aspeed_smc.c
> +++ b/hw/ssi/aspeed_smc.c
> @@ -728,7 +728,7 @@ static uint64_t aspeed_smc_read(void *opaque, hwaddr addr, unsigned int size)
>   addr < R_SEG_ADDR0 + asc->max_peripherals) ||
>  (addr >= s->r_ctrl0 && addr < s->r_ctrl0 + asc->max_peripherals)) {
>  
> -trace_aspeed_smc_read(addr, size, s->regs[addr]);
> +trace_aspeed_smc_read(addr << 2, size, s->regs[addr]);
>  
>  return s->regs[addr];
>  } else {
> @@ -1029,10 +1029,10 @@ static void aspeed_smc_write(void *opaque, hwaddr addr, uint64_t data,
>  AspeedSMCClass *asc = ASPEED_SMC_GET_CLASS(s);
>  uint32_t value = data;
>  
> -addr >>= 2;
> -
>  trace_aspeed_smc_write(addr, size, data);
>  
> +addr >>= 2;
> +
>  if (addr == s->r_conf ||
>  (addr >= s->r_timings &&
>   addr < s->r_timings + asc->nregs_timings) ||
> -- 
> 2.31.1
> 
>
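
For context, the distinction the patch relies on can be sketched as
follows, assuming 32-bit wide registers. toy_offset_to_index() and
toy_index_to_offset() are hypothetical helpers, not functions from
hw/ssi/aspeed_smc.c.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset into the MMIO region -> index into an array of 32-bit regs. */
static uint64_t toy_offset_to_index(uint64_t offset)
{
    return offset >> 2;
}

/* Register index -> byte offset, which is what the trace should show. */
static uint64_t toy_index_to_offset(uint64_t index)
{
    return index << 2;
}
```

In the read path of the quoted patch addr already holds the index when
the trace point is reached, so it is shifted back; in the write path
the trace call is simply moved before the shift.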

Re: Deprecate the ppc405 boards in QEMU? (was: [PATCH v3 4/7] MAINTAINERS: Orphan obscure ppc platforms)

2021-10-05 Thread Daniel P . Berrangé

On Tue, Oct 05, 2021 at 06:44:23AM +0200, Christophe Leroy wrote:
> 
> 
> On 05/10/2021 at 02:48, David Gibson wrote:
> > On Fri, Oct 01, 2021 at 04:18:49PM +0200, Thomas Huth wrote:
> > > On 01/10/2021 15.04, Christophe Leroy wrote:
> > > > 
> > > > 
> > > > On 01/10/2021 at 14:04, Thomas Huth wrote:
> > > > > On 01/10/2021 13.12, Peter Maydell wrote:
> > > > > > On Fri, 1 Oct 2021 at 10:43, Thomas Huth  wrote:
> > > > > > > Nevertheless, as long as nobody has a hint where to find that
> > > > > > > ppc405_rom.bin, I think both boards are pretty useless in QEMU 
> > > > > > > (as far as I
> > > > > > > can see, they do not work without the bios at all, so it's
> > > > > > > also not possible
> > > > > > > to use a Linux image with the "-kernel" CLI option directly).
> > > > > > 
> > > > > > It is at least in theory possible to run bare-metal code on
> > > > > > either board, by passing either a pflash or a bios argument.
> > > > > 
> > > > > True. I did some more research, and seems like there was once
> > > > > support for those boards in u-boot, but it got removed there a
> > > > > couple of years ago already:
> > > > > 
> > > > > https://gitlab.com/qemu-project/u-boot/-/commit/98f705c9cefdf
> > > > > 
> > > > > https://gitlab.com/qemu-project/u-boot/-/commit/b147ff2f37d5b
> > > > > 
> > > > > https://gitlab.com/qemu-project/u-boot/-/commit/7514037bcdc37
> > > > > 
> > > > > > But I agree that there seem to be no signs of anybody actually
> > > > > > successfully using these boards for anything, so we should
> > > > > > deprecate-and-delete them.
> > > > > 
> > > > > Yes, let's mark them as deprecated now ... if someone still uses
> > > > > them and speaks up, we can still revert the deprecation again.
> > > > 
> > > > I really would like to be able to use them to validate Linux Kernel
> > > > changes, hence looking for that missing BIOS.
> > > > 
> > > > If we remove ppc405 from QEMU, we won't be able to do any regression
> > > > tests of Linux Kernel on those processors.
> > > 
> > > If you/someone managed to compile an old version of u-boot for one of 
> > > these
> > > two boards, so that we would finally have something for regression 
> > > testing,
> > > we can of course also keep the boards in QEMU...
> > 
> > I can see that it would be useful for some cases, but unless someone
> > volunteers to track down the necessary firmware and look after it, I
> > think we do need to deprecate it - I certainly don't have the capacity
> > to look into this.
> > 
> 
> I will look at it, please allow me a few weeks though.

Once something is deprecated, it remains in QEMU for a minimum of two
release cycles, before being deleted. At any time in that deprecation
period it can be returned to supported status, if someone provides a
good enough justification to keep it.

IOW, we can deprecate this now, and you still have plenty of time to
investigate more.


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

[RFC v2 0/2] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding

2021-10-05 Thread Eric Auger

To support SMMUv3 nested stage, it is practical to
expose reserved memory regions (RMRs) to the guest,
covering the IOVAs used by the host kernel to map
physical MSI doorbells.

Those IOVAs belong to [0x8000000, 0x8100000] matching
MSI_IOVA_BASE and MSI_IOVA_LENGTH definitions in kernel
arm-smmu-v3 driver. This is the window used to allocate
IOVAs matching physical MSI doorbells.

With those RMRs, the guest is forced to use a flat mapping
for this range. Hence the assigned device is programmed
with one IOVA from this range. Stage 1, owned by the guest,
has a flat mapping for this IOVA. Stage 2, owned by the VMM,
then enforces a mapping from this IOVA to the physical
MSI doorbell.

At IORT table level, due to the single mapping flag being
set on the ID mapping, 256 IORT RMR nodes need to be
created per bus. This looks awkward from a specification
and implementation point of view.

This may also produce a warning at execution time:
qemu-system-aarch64: warning: ACPI table size 114709 exceeds
65536 bytes, migration may not work
(here with 5 pcie root ports, ie. 256 * 6 = 1536 RMR nodes!).
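
The node count quoted above can be reproduced with a small sketch. It
assumes that the 256 nodes per bus correspond to one node per
device/function number, and that the bus count is the root bus plus one
bus behind each root port; both are readings of the text above, and the
names RMR_NODES_PER_BUS and rmr_node_count() are hypothetical, not taken
from the patch.

```c
#include <stdint.h>

/* Assumption: one RMR node per devfn (256 per bus), as the text implies. */
#define RMR_NODES_PER_BUS 256u

/*
 * Hypothetical helper: number of IORT RMR nodes for a topology with
 * the given number of PCIe root ports. Assumes one bus per root port
 * plus the root bus itself.
 */
static uint32_t rmr_node_count(uint32_t root_ports)
{
    uint32_t buses = root_ports + 1;

    return RMR_NODES_PER_BUS * buses;
}
```

With 5 root ports this gives 256 * 6 = 1536 nodes, which matches the
figure in the warning scenario above.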

The creation of those RMR nodes is only relevant if nested
stage SMMU is in use, along with VFIO. As VFIO devices can be
hotplugged, all RMRs need to be created in advance. Hence
the patch introduces a new arm virt "nested-smmuv3" iommu type.

ARM DEN 0049E.b IORT specification also mandates that when
RMRs are present, the OS must preserve PCIe configuration
performed by the boot FW. So along with the RMR IORT nodes,
a _DSM function #5, as defined by PCI FIRMWARE SPECIFICATION
REVISION 3.3, chapter 4.6.5, is added to PCIe host bridge
and PCIe expander bridge objects.

The series applies on top of Igor's
[1] [PATCH v4 00/35] acpi: refactor error prone build_header() and
packed structures usage in ACPI tables and
[2] [PATCH v2 0/3] hw/arm/virt_acpi_build: Upgrate the IORT table up
  to revision E.b

The guest can use RMRs with Shameer's series:
[3] [PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

The latest IORT specification (ARM DEN 0049E.b) can be found at
IO Remapping Table - Platform Design Document
https://developer.arm.com/documentation/den0049/latest/

This series and its dependency can be found at
https://github.com/eauger/qemu.git
branch: igor_acpi_refactoring_v4_dbg2_v3_rmr_v2

History:
v1 -> v2:
- add DSM #5

Eric Auger (2):
  hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
  hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
binding

 include/hw/arm/virt.h  |  7 
 include/hw/pci-host/gpex.h |  1 +
 hw/arm/virt-acpi-build.c   | 84 --
 hw/arm/virt.c  |  7 +++-
 hw/pci-host/gpex-acpi.c| 12 ++
 5 files changed, 98 insertions(+), 13 deletions(-)

-- 
2.26.3

[RFC v2 2/2] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding

2021-10-05 Thread Eric Auger

To support SMMUv3 nested stage, it is practical to
expose reserved memory regions (RMRs) to the guest,
covering the IOVAs used by the host kernel to map
physical MSI doorbells.

Those IOVAs belong to [0x8000000, 0x8100000] matching
MSI_IOVA_BASE and MSI_IOVA_LENGTH definitions in kernel
arm-smmu-v3 driver. This is the window used to allocate
IOVAs matching physical MSI doorbells.

With those RMRs, the guest is forced to use a flat mapping
for this range. Hence the assigned device is programmed
with one IOVA from this range. Stage 1, owned by the guest,
has a flat mapping for this IOVA. Stage 2, owned by the VMM,
then enforces a mapping from this IOVA to the physical
MSI doorbell.

At IORT table level, due to the single mapping flag being
set on the ID mapping, 256 IORT RMR nodes need to be
created per bus. This looks awkward from a specification
and implementation point of view.

This may also produce a warning at execution time:
qemu-system-aarch64: warning: ACPI table size 114709 exceeds
65536 bytes, migration may not work
(here with 5 pcie root ports, ie. 256 * 6 = 1536 RMR nodes!).

The creation of those RMR nodes is only relevant if nested
stage SMMU is in use, along with VFIO. As VFIO devices can be
hotplugged, all RMRs need to be created in advance. Hence
the patch introduces a new arm virt "nested-smmuv3" iommu type.

ARM DEN 0049E.b IORT specification also mandates that when
RMRs are present, the OS must preserve PCIe configuration
performed by the boot FW. So along with the RMR IORT nodes,
a _DSM function #5, as defined by PCI FIRMWARE SPECIFICATION
REVISION 3.3, chapter 4.6.5, is added to PCIe host bridge
and PCIe expander bridge objects.

Signed-off-by: Eric Auger 
Suggested-by: Jean-Philippe Brucker 

---

v1 -> v2:
- add DSM #5
- use identifier increment

Instead of introducing a new IOMMU type, we could introduce
an array of qdev_prop_reserved_region(s).

Guest can parse the IORT RMR nodes with Shameer's series:
[PATCH v7 0/9] ACPI/IORT: Support for IORT RMR node

The patch applies on Igor's v4 series [1]+ IORT E.b upgrade [2]
[1] [PATCH v4 00/35] acpi: refactor error prone build_header()
and packed structures usage in ACPI tables
[2] [PATCH 0/3] hw/arm/virt_acpi_build: Upgrate the IORT table
up to revision E.b
---
 include/hw/arm/virt.h|  7 
 hw/arm/virt-acpi-build.c | 84 ++--
 hw/arm/virt.c|  7 +++-
 3 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index b461b8d261..f2f8aee219 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -99,6 +99,7 @@ enum {
 typedef enum VirtIOMMUType {
 VIRT_IOMMU_NONE,
 VIRT_IOMMU_SMMUV3,
+VIRT_IOMMU_NESTED_SMMUV3,
 VIRT_IOMMU_VIRTIO,
 } VirtIOMMUType;
 
@@ -190,4 +191,10 @@ static inline int 
virt_gicv3_redist_region_count(VirtMachineState *vms)
 return MACHINE(vms)->smp.cpus > redist0_capacity ? 2 : 1;
 }
 
+static inline bool virt_has_smmuv3(const VirtMachineState *vms)
+{
+return vms->iommu == VIRT_IOMMU_SMMUV3 ||
+   vms->iommu == VIRT_IOMMU_NESTED_SMMUV3;
+}
+
 #endif /* QEMU_ARM_VIRT_H */
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 789bac3134..7260e47c83 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -169,6 +169,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const 
MemMapEntry *memmap,
 .bus= vms->bus,
 };
 
+/*
+ * Nested SMMU requires RMRs for MSI 1-1 mapping, which
+ * require _DSM for PreservingPCI Boot Configurations
+ */
+if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+cfg.preserve_config = true;
+}
+
 if (use_highmem) {
 cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
 }
@@ -245,16 +253,16 @@ static void acpi_dsdt_add_tpm(Aml *scope, 
VirtMachineState *vms)
 #define ROOT_COMPLEX_ENTRY_SIZE 36
 #define IORT_NODE_OFFSET 48
 
-static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
-  uint32_t id_count, uint32_t out_ref)
+static void
+build_iort_id_mapping(GArray *table_data, uint32_t input_base,
+  uint32_t id_count, uint32_t out_ref, uint32_t flags)
 {
 /* Table 4 ID mapping format */
 build_append_int_noprefix(table_data, input_base, 4); /* Input base */
 build_append_int_noprefix(table_data, id_count, 4); /* Number of IDs */
 build_append_int_noprefix(table_data, input_base, 4); /* Output base */
 build_append_int_noprefix(table_data, out_ref, 4); /* Output Reference */
-/* Flags */
-build_append_int_noprefix(table_data, 0 /* Single mapping */, 4);
+build_append_int_noprefix(table_data, flags, 4); /* Flags */
 }
 
 struct AcpiIortIdMapping {
@@ -296,6 +304,50 @@ static int iort_idmap_compare(gconstpointer a, 
gconstpointer b)
 return idmap_a->input_base - idmap_b->input_base;
 }
 
+static void
+build_iort_rmr_nodes(GArray *table_data, GArray *smmu_idmaps, int smmu

Re: [PATCH 1/4] aspeed/wdt: Add trace events

2021-10-05 Thread Francisco Iglesias

On [2021 Oct 04] Mon 17:46:32, Cédric Le Goater wrote:
> Signed-off-by: Cédric Le Goater 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/watchdog/wdt_aspeed.c | 5 +
>  hw/watchdog/trace-events | 4 
>  2 files changed, 9 insertions(+)
> 
> diff --git a/hw/watchdog/wdt_aspeed.c b/hw/watchdog/wdt_aspeed.c
> index 69c37af9a6e9..146ffcd71301 100644
> --- a/hw/watchdog/wdt_aspeed.c
> +++ b/hw/watchdog/wdt_aspeed.c
> @@ -19,6 +19,7 @@
>  #include "hw/sysbus.h"
>  #include "hw/watchdog/wdt_aspeed.h"
>  #include "migration/vmstate.h"
> +#include "trace.h"
>  
>  #define WDT_STATUS  (0x00 / 4)
>  #define WDT_RELOAD_VALUE(0x04 / 4)
> @@ -60,6 +61,8 @@ static uint64_t aspeed_wdt_read(void *opaque, hwaddr 
> offset, unsigned size)
>  {
>  AspeedWDTState *s = ASPEED_WDT(opaque);
>  
> +trace_aspeed_wdt_read(offset, size);
> +
>  offset >>= 2;
>  
>  switch (offset) {
> @@ -140,6 +143,8 @@ static void aspeed_wdt_write(void *opaque, hwaddr offset, 
> uint64_t data,
>  AspeedWDTClass *awc = ASPEED_WDT_GET_CLASS(s);
>  bool enable;
>  
> +trace_aspeed_wdt_write(offset, size, data);
> +
>  offset >>= 2;
>  
>  switch (offset) {
> diff --git a/hw/watchdog/trace-events b/hw/watchdog/trace-events
> index c3bafbffa911..e7523e22aaf2 100644
> --- a/hw/watchdog/trace-events
> +++ b/hw/watchdog/trace-events
> @@ -5,3 +5,7 @@ cmsdk_apb_watchdog_read(uint64_t offset, uint64_t data, 
> unsigned size) "CMSDK AP
>  cmsdk_apb_watchdog_write(uint64_t offset, uint64_t data, unsigned size) 
> "CMSDK APB watchdog write: offset 0x%" PRIx64 " data 0x%" PRIx64 " size %u"
>  cmsdk_apb_watchdog_reset(void) "CMSDK APB watchdog: reset"
>  cmsdk_apb_watchdog_lock(uint32_t lock) "CMSDK APB watchdog: lock %" PRIu32
> +
> +# wdt-aspeed.c
> +aspeed_wdt_read(uint64_t addr, uint32_t size) "@0x%" PRIx64 " size=%d"
> +aspeed_wdt_write(uint64_t addr, uint32_t size, uint64_t data) "@0x%" PRIx64 
> " size=%d value=0x%"PRIx64
> -- 
> 2.31.1
> 
>

Re: [PATCH v4 05/11] hw/arm/virt: Use object_property_set instead of qdev_prop_set

2021-10-05 Thread Eric Auger




On 10/1/21 7:33 PM, Jean-Philippe Brucker wrote:
> To propagate errors to the caller of the pre_plug callback, use the
> object_property_set*() functions directly instead of the qdev_prop_set*()
> helpers.
>
> Suggested-by: Igor Mammedov 
> Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Eric Auger 

Eric
> ---
>  hw/arm/virt.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 36f0261ef4..ac307b6030 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2465,8 +2465,9 @@ static void 
> virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
>  db_start, db_end,
>  VIRTIO_IOMMU_RESV_MEM_T_MSI);
>  
> -qdev_prop_set_uint32(dev, "len-reserved-regions", 1);
> -qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str);
> +object_property_set_uint(OBJECT(dev), "len-reserved-regions", 1, 
> errp);
> +object_property_set_str(OBJECT(dev), "reserved-regions[0]",
> +resv_prop_str, errp);
>  g_free(resv_prop_str);
>  }
>  }

[Bug 1884169] Re: There is no option group 'fsdev' for OSX

2021-10-05 Thread Waheed

But macOS (OS X) actually supports 9pfs and has its own
AppleVirtIO9PVFS, which makes things a bit strange. Wouldn't using
AppleVirtIO9PVFS be a good workaround?

All my best,

Waheed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1884169

Title:
  There is no option group 'fsdev' for OSX

Status in QEMU:
  Opinion

Bug description:
  When I try to use -fsoption on OSX I receive this error:

  -fsdev local,security_model=mapped,id=fsdev0,path=devel/dmos-example:
  There is no option group 'fsdev'


Re: [PATCH] virtiofsd: xattr mapping add a new type "unsupported"

2021-10-05 Thread Dr. David Alan Gilbert

* Vivek Goyal (vgo...@redhat.com) wrote:
> Right now for xattr remapping, we support types of "prefix", "ok" or "bad".
> Type "bad" returns -EPERM on setxattr and hides xattr in listxattr. For
> getxattr, mapping code returns -EPERM but getxattr code converts it to 
> -ENODATA.
> 
> I need new semantics where, if an xattr is unsupported,
> getxattr()/setxattr() return -ENOTSUP and listxattr() hides the xattr.
> This is needed to simulate that security.selinux is not supported by
> virtiofs filesystem and in that case client falls back to some default
> label specified by policy.
> 
> So add a new type "unsupported" which returns -ENOTSUP on getxattr() and
> setxattr() and hides xattrs in listxattr().
> 
> For example, one can use the following mapping rule to mark the
> security.selinux xattr as unsupported and allow all others.
> 
> "-o xattrmap=/unsupported/all/security.selinux/security.selinux//ok/all///"
> 
> Suggested-by: "Dr. David Alan Gilbert" 
> Signed-off-by: Vivek Goyal 

Yes, that's nice and simple.


Reviewed-by: Dr. David Alan Gilbert 
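
For readers skimming the thread, the semantics described in the commit
message can be sketched as a first-match rule table. This is only an
illustration and not the virtiofsd code: the enum, struct and function
names below are hypothetical, the real implementation uses the
XATTR_MAP_FLAG_* bits shown in the diff, and the "deny when nothing
matches" fallback is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule types mirroring 'ok', 'bad' and 'unsupported'. */
enum xattr_rule_type { RULE_OK, RULE_BAD, RULE_UNSUPPORTED };

struct xattr_rule {
    enum xattr_rule_type type;
    const char *key;            /* prefix matched against the xattr name */
};

/* Client side (get/set): 0 if allowed, negative errno otherwise. */
static int client_xattr_check(const struct xattr_rule *rules, size_t n,
                              const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strncmp(name, rules[i].key, strlen(rules[i].key)) != 0) {
            continue;           /* prefix does not match, try next rule */
        }
        switch (rules[i].type) {
        case RULE_BAD:
            return -EPERM;
        case RULE_UNSUPPORTED:
            return -ENOTSUP;
        case RULE_OK:
            return 0;
        }
    }
    return -EPERM;              /* assumption: deny when no rule matches */
}

/* Server side (listxattr): nonzero if the name must be hidden. */
static int server_xattr_hidden(const struct xattr_rule *rules, size_t n,
                               const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strncmp(name, rules[i].key, strlen(rules[i].key)) == 0) {
            return rules[i].type != RULE_OK;
        }
    }
    return 1;
}

/* Rough equivalent of the example mapping: selinux unsupported, rest ok. */
static const struct xattr_rule example_rules[] = {
    { RULE_UNSUPPORTED, "security.selinux" },
    { RULE_OK, "" },            /* empty key matches every name */
};
```

With these two rules, a lookup of "security.selinux" yields -ENOTSUP and
the name is hidden from listings, while any other name passes through
unchanged.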

> ---
>  docs/tools/virtiofsd.rst |6 ++
>  tools/virtiofsd/passthrough_ll.c |   17 ++---
>  2 files changed, 20 insertions(+), 3 deletions(-)
> 
> Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
> ===
> --- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c   2021-09-22 
> 08:37:16.070377732 -0400
> +++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c2021-09-22 
> 14:17:09.543016250 -0400
> @@ -2465,6 +2465,11 @@ static void lo_flock(fuse_req_t req, fus
>   * Automatically reversed on read
>   */
>  #define XATTR_MAP_FLAG_PREFIX  (1 <<  2)
> +/*
> + * The attribute is unsupported;
> + * ENOTSUP on write, hidden on read.
> + */
> +#define XATTR_MAP_FLAG_UNSUPPORTED (1 <<  3)
>  
>  /* scopes */
>  /* Apply rule to get/set/remove */
> @@ -2636,6 +2641,8 @@ static void parse_xattrmap(struct lo_dat
>  tmp_entry.flags |= XATTR_MAP_FLAG_OK;
>  } else if (strstart(map, "bad", &map)) {
>  tmp_entry.flags |= XATTR_MAP_FLAG_BAD;
> +} else if (strstart(map, "unsupported", &map)) {
> +tmp_entry.flags |= XATTR_MAP_FLAG_UNSUPPORTED;
>  } else if (strstart(map, "map", &map)) {
>  /*
>   * map is sugar that adds a number of rules, and must be
> @@ -2646,8 +2653,8 @@ static void parse_xattrmap(struct lo_dat
>  } else {
>  fuse_log(FUSE_LOG_ERR,
>   "%s: Unexpected type;"
> - "Expecting 'prefix', 'ok', 'bad' or 'map' in rule 
> %zu\n",
> - __func__, lo->xattr_map_nentries);
> + "Expecting 'prefix', 'ok', 'bad', 'unsupported' or 
> 'map'"
> + " in rule %zu\n", __func__, lo->xattr_map_nentries);
>  exit(1);
>  }
>  
> @@ -2749,6 +2756,9 @@ static int xattr_map_client(const struct
>  if (cur_entry->flags & XATTR_MAP_FLAG_BAD) {
>  return -EPERM;
>  }
> +if (cur_entry->flags & XATTR_MAP_FLAG_UNSUPPORTED) {
> +return -ENOTSUP;
> +}
>  if (cur_entry->flags & XATTR_MAP_FLAG_OK) {
>  /* Unmodified name */
>  return 0;
> @@ -2788,7 +2798,8 @@ static int xattr_map_server(const struct
>  
>  if ((cur_entry->flags & XATTR_MAP_FLAG_SERVER) &&
>  (strstart(server_name, cur_entry->prepend, &end))) {
> -if (cur_entry->flags & XATTR_MAP_FLAG_BAD) {
> +if (cur_entry->flags & XATTR_MAP_FLAG_BAD ||
> +cur_entry->flags & XATTR_MAP_FLAG_UNSUPPORTED) {
>  return -ENODATA;
>  }
>  if (cur_entry->flags & XATTR_MAP_FLAG_OK) {
> Index: rhvgoyal-qemu/docs/tools/virtiofsd.rst
> ===
> --- rhvgoyal-qemu.orig/docs/tools/virtiofsd.rst   2021-09-22 
> 08:37:15.938372097 -0400
> +++ rhvgoyal-qemu/docs/tools/virtiofsd.rst2021-09-22 14:44:09.814188712 
> -0400
> @@ -183,6 +183,12 @@ Using ':' as the separator a rule is of
>'ok' as either an explicit terminator or for special handling of certain
>patterns.
>  
> +- 'unsupported' - If a client tries to use a name matching 'key' it's
> +  denied using ENOTSUP; when the server passes an attribute
> +  name matching 'prepend' it's hidden.  In many ways it's use is very like
> +  'ok' as either an explicit terminator or for special handling of certain
> +  patterns.
> +
>  **key** is a string tested as a prefix on an attribute name originating
>  on the client.  It maybe empty in which case a 'client' rule
>  will always match on client names.
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH v2 08/12] macfb: add common monitor modes supported by the MacOS toolbox ROM

2021-10-05 Thread Laurent Vivier

Le 04/10/2021 à 23:19, Mark Cave-Ayland a écrit :
> The monitor modes table is found by experimenting with the Monitors Control
> Panel in MacOS and analysing the reads/writes. From this it can be found that
> the mode is controlled by writes to the DAFB_MODE_CTRL1 and DAFB_MODE_CTRL2
> registers.
> 
> Implement the first block of DAFB registers as a register array including the
> existing sense register, the newly discovered control registers above, and 
> also
> the DAFB_MODE_VADDR1 and DAFB_MODE_VADDR2 registers which are used by NetBSD 
> to
> determine the current video mode.
> 
> These experiments also show that the offset of the start of video RAM and the
> stride can change depending upon the monitor mode, so update 
> macfb_draw_graphic()
> and both the BI_MAC_VADDR and BI_MAC_VROW bootinfo for the q800 machine
> accordingly.
> 
> Finally update macfb_common_realize() so that only the resolution and depth
> supported by the display type can be specified on the command line.
> 
> Signed-off-by: Mark Cave-Ayland 
> Reviewed-by: Laurent Vivier 
> ---
>  hw/display/macfb.c | 124 -
>  hw/display/trace-events|   1 +
>  hw/m68k/q800.c |  11 ++--
>  include/hw/display/macfb.h |  16 -
>  4 files changed, 131 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/display/macfb.c b/hw/display/macfb.c
> index f98bcdec2d..357fe18be5 100644
> --- a/hw/display/macfb.c
> +++ b/hw/display/macfb.c
>
...
> +static MacFbMode *macfb_find_mode(MacfbDisplayType display_type,
> +  uint16_t width, uint16_t height,
> +  uint8_t depth)
> +{
> +MacFbMode *macfb_mode;
> +int i;
> +
> +for (i = 0; i < ARRAY_SIZE(macfb_mode_table); i++) {
> +macfb_mode = &macfb_mode_table[i];
> +
> +if (display_type == macfb_mode->type && width == macfb_mode->width &&
> +height == macfb_mode->height && depth == macfb_mode->depth) {
> +return macfb_mode;
> +}
> +}
> +
> +return NULL;
> +}
> +

I misunderstood this part when I reviewed v1...

Does it mean you have to provide the monitor type to QEMU to switch from the
default mode?

But, as a user, how do we know which modes are allowed with which resolution?

Is it possible to set the type internally here according to the resolution?

Could you provide a command-line example of how to start the q800 with the
1152x870 resolution?

Thanks,
Laurent

Re: [PATCH 2/3] vdpa: Add vhost_vdpa_section_end

2021-10-05 Thread Eugenio Perez Martin

On Tue, Oct 5, 2021 at 10:15 AM Michael S. Tsirkin  wrote:
>
> On Tue, Oct 05, 2021 at 10:01:30AM +0200, Eugenio Pérez wrote:
> > Abstract this operation, that will be reused when validating the region
> > against the iova range that the device supports.
> >
> > Signed-off-by: Eugenio Pérez 
>
> Note that, as defined, end is actually 1 byte beyond the end of the section.
> As such it can e.g. overflow if cast to u64.
> So be careful to use int128 ops with it.

You are right, but this is only the result of extracting "llend"
calculation in its own function, since it is going to be used a third
time in the next commit. This next commit contains a mistake because
of this, as you pointed out.

Since "last" would be a very misleading name, do you think we could
give a better name / type to it?

> Also - document?

It will be documented with that ("It returns one byte beyond end of
section" or similar) too.
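
To make the overflow concern concrete, here is a minimal sketch in plain
C using only 64-bit arithmetic. It is not the QEMU code: it ignores the
TARGET_PAGE_MASK rounding done by vhost_vdpa_section_end(), and the
function names are hypothetical. It shows why an exclusive "end" cannot
always be held in a u64, while an inclusive "last" can.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Exclusive end computed in 64 bits. Unsigned arithmetic wraps, so a
 * section that reaches the top of the address space yields 0.
 */
static uint64_t section_end_u64(uint64_t offset, uint64_t size)
{
    return offset + size;
}

/* True if offset + size does not fit in 64 bits. */
static bool section_end_overflows(uint64_t offset, uint64_t size)
{
    return size > UINT64_MAX - offset;
}

/* Inclusive last byte of the section; the caller must pass size > 0. */
static uint64_t section_last(uint64_t offset, uint64_t size)
{
    return offset + (size - 1);
}
```

For a 4 KiB section at 0xFFFFFFFFFFFFF000 the exclusive end wraps to 0,
whereas the inclusive last byte is UINT64_MAX and stays representable.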

Thanks!

>
> > ---
> >  hw/virtio/vhost-vdpa.c | 18 +++---
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index ea1aa71ad8..a1de6c7c9c 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -24,6 +24,15 @@
> >  #include "trace.h"
> >  #include "qemu-common.h"
> >
> > +static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
> > +{
> > +Int128 llend = int128_make64(section->offset_within_address_space);
> > +llend = int128_add(llend, section->size);
> > +llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > +
> > +return llend;
> > +}
> > +
> >  static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection 
> > *section)
> >  {
> >  return (!memory_region_is_ram(section->mr) &&
> > @@ -160,10 +169,7 @@ static void 
> > vhost_vdpa_listener_region_add(MemoryListener *listener,
> >  }
> >
> >  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > -llend = int128_make64(section->offset_within_address_space);
> > -llend = int128_add(llend, section->size);
> > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > -
> > +llend = vhost_vdpa_section_end(section);
> >  if (int128_ge(int128_make64(iova), llend)) {
> >  return;
> >  }
> > @@ -221,9 +227,7 @@ static void 
> > vhost_vdpa_listener_region_del(MemoryListener *listener,
> >  }
> >
> >  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > -llend = int128_make64(section->offset_within_address_space);
> > -llend = int128_add(llend, section->size);
> > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > +llend = vhost_vdpa_section_end(section);
> >
> >  trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend));
> >
> > --
> > 2.27.0
>

Re: [PATCH 3/3] vdpa: Check for iova range at mappings changes

2021-10-05 Thread Eugenio Perez Martin

On Tue, Oct 5, 2021 at 10:14 AM Michael S. Tsirkin  wrote:
>
> On Tue, Oct 05, 2021 at 10:01:31AM +0200, Eugenio Pérez wrote:
> > Check vdpa device range before updating memory regions so we don't add
> > any outside of it, and report the invalid change if any.
> >
> > Signed-off-by: Eugenio Pérez 
> > ---
> >  include/hw/virtio/vhost-vdpa.h |  2 +
> >  hw/virtio/vhost-vdpa.c | 68 ++
> >  hw/virtio/trace-events |  1 +
> >  3 files changed, 55 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index a8963da2d9..c288cf7ecb 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -13,6 +13,7 @@
> >  #define HW_VIRTIO_VHOST_VDPA_H
> >
> >  #include "hw/virtio/virtio.h"
> > +#include "standard-headers/linux/vhost_types.h"
> >
> >  typedef struct VhostVDPAHostNotifier {
> >  MemoryRegion mr;
> > @@ -24,6 +25,7 @@ typedef struct vhost_vdpa {
> >  uint32_t msg_type;
> >  bool iotlb_batch_begin_sent;
> >  MemoryListener listener;
> > +struct vhost_vdpa_iova_range iova_range;
> >  struct vhost_dev *dev;
> >  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >  } VhostVDPA;
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index a1de6c7c9c..26d0258723 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -33,20 +33,34 @@ static Int128 vhost_vdpa_section_end(const 
> > MemoryRegionSection *section)
> >  return llend;
> >  }
> >
> > -static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection 
> > *section)
> > -{
> > -return (!memory_region_is_ram(section->mr) &&
> > -!memory_region_is_iommu(section->mr)) ||
> > -memory_region_is_protected(section->mr) ||
> > -   /* vhost-vDPA doesn't allow MMIO to be mapped  */
> > -memory_region_is_ram_device(section->mr) ||
> > -   /*
> > -* Sizing an enabled 64-bit BAR can cause spurious mappings to
> > -* addresses in the upper part of the 64-bit address space.  
> > These
> > -* are never accessed by the CPU and beyond the address width of
> > -* some IOMMU hardware.  TODO: VDPA should tell us the IOMMU 
> > width.
> > -*/
> > -   section->offset_within_address_space & (1ULL << 63);
> > +static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection 
> > *section,
> > +uint64_t iova_min,
> > +uint64_t iova_max)
> > +{
> > +Int128 llend;
> > +bool r = (!memory_region_is_ram(section->mr) &&
> > +  !memory_region_is_iommu(section->mr)) ||
> > +  memory_region_is_protected(section->mr) ||
> > +  /* vhost-vDPA doesn't allow MMIO to be mapped  */
> > +  memory_region_is_ram_device(section->mr);
> > +if (r) {
> > +return true;
> > +}
> > +
> > +if (section->offset_within_address_space < iova_min) {
> > +error_report("RAM section out of device range (min=%lu, addr=%lu)",
> > + iova_min, section->offset_within_address_space);
> > +return true;
> > +}
> > +
> > +llend = vhost_vdpa_section_end(section);
> > +if (int128_make64(llend) > iova_max) {
>
> I am puzzled by this.
> You are taking a Int128, converting to u64, converting
> back to Int128, and comparing to u64.
> Head spins. What is all this back and forth trying to achieve?
>

You are totally right, this series was extracted from a longer one
where I didn't use vhost_vdpa_section_end, but raw addresses. Then I
applied int128_make64 to the wrong variable, too fast.

To be sure we are on the same page, the plan is to do:

    if (int128_ge(int128_make64(iova), llend)) {
        // error message
        return;
    }

Would that be ok, the same way as vhost_vdpa_listener_region_{add,del}
do it?

Thanks!
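
As a point of comparison for the range check under discussion, the
sketch below tests whether a section lies inside a device IOVA window
without ever forming the exclusive end, so no 128-bit type is needed.
It assumes that iova_max is inclusive (like the "last" field of the
range), which is an assumption here, and section_in_iova_range() is a
hypothetical name, not a function from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: does [offset, offset + size) fit inside the
 * inclusive window [iova_min, iova_max]? Written so that no
 * intermediate value can overflow 64 bits.
 */
static bool section_in_iova_range(uint64_t offset, uint64_t size,
                                  uint64_t iova_min, uint64_t iova_max)
{
    if (size == 0) {
        return false;           /* treat empty sections as not mappable */
    }
    if (offset < iova_min || offset > iova_max) {
        return false;           /* start lies outside the window */
    }
    /* Compare lengths instead of end addresses to avoid wraparound. */
    return size - 1 <= iova_max - offset;
}
```

The comparison is done on the remaining length of the window rather than
on end addresses, which is what keeps a section touching the top of the
address space from wrapping.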

> > +error_report("RAM section out of device range (max=%lu, end 
> > addr=%lu)",
> > + iova_max, (uint64_t)int128_make64(llend));
> > +return true;
> > +}
> > +
> > +return false;
> >  }
> >
> >  static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr 
> > size,
> > @@ -158,7 +172,8 @@ static void 
> > vhost_vdpa_listener_region_add(MemoryListener *listener,
> >  void *vaddr;
> >  int ret;
> >
> > -if (vhost_vdpa_listener_skipped_section(section)) {
> > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
> > +v->iova_range.last)) {
> >  return;
> >  }
> >
> > @@ -216,7 +231,8 @@ static void 
> > vhost_vdpa_listener_region_del(MemoryListener *listener,
> >  Int128 llend, llsize;
> >  int ret;
> >
> > -if (vhost_vdpa_listener_skipped_section(section)) {
> > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
> >

Re: [PATCH v4 10/11] tests/acpi: add expected blob for VIOT test on virt machine

2021-10-05 Thread Ani Sinha




On Fri, 1 Oct 2021, Jean-Philippe Brucker wrote:

> The VIOT blob contains the following:
>
> [000h 0000   4]                Signature : "VIOT"    [Virtual I/O Translation Table]
> [004h 0004   4]             Table Length : 00000058
> [008h 0008   1]                 Revision : 00
> [009h 0009   1]                 Checksum : 66
> [00Ah 0010   6]                   Oem ID : "BOCHS "
> [010h 0016   8]             Oem Table ID : "BXPC"
> [018h 0024   4]             Oem Revision : 00000001
> [01Ch 0028   4]          Asl Compiler ID : "BXPC"
> [020h 0032   4]    Asl Compiler Revision : 00000001
>
> [024h 0036   2]               Node count : 0002
> [026h 0038   2]              Node offset : 0030
> [028h 0040   8]                 Reserved : 0000000000000000
>
> [030h 0048   1]                     Type : 03 [VirtIO-PCI IOMMU]
> [031h 0049   1]                 Reserved : 00
> [032h 0050   2]                   Length : 0010
>
> [034h 0052   2]              PCI Segment : 0000
> [036h 0054   2]           PCI BDF number : 0008
> [038h 0056   8]                 Reserved : 0000000000000000
>
> [040h 0064   1]                     Type : 01 [PCI Range]
> [041h 0065   1]                 Reserved : 00
> [042h 0066   2]                   Length : 0018
>
> [044h 0068   4]           Endpoint start : 00000000
> [048h 0072   2]        PCI Segment start : 0000
> [04Ah 0074   2]          PCI Segment end : 0000
> [04Ch 0076   2]            PCI BDF start : 0000
> [04Eh 0078   2]              PCI BDF end : 00FF
> [050h 0080   2]              Output node : 0030
> [052h 0082   6]                 Reserved : 000000000000
>
> Signed-off-by: Jean-Philippe Brucker 

Acked-by: Ani Sinha 

Without looking at the other patches, the disassembly looks good (with
the latest iasl from upstream git).
One suggestion: maybe also add the raw table data (length 88) as well.


> ---
>  tests/qtest/bios-tables-test-allowed-diff.h |   1 -
>  tests/data/acpi/virt/VIOT   | Bin 0 -> 88 bytes
>  2 files changed, 1 deletion(-)
>
> diff --git a/tests/qtest/bios-tables-test-allowed-diff.h 
> b/tests/qtest/bios-tables-test-allowed-diff.h
> index 29b5b1eabc..fa213e4738 100644
> --- a/tests/qtest/bios-tables-test-allowed-diff.h
> +++ b/tests/qtest/bios-tables-test-allowed-diff.h
> @@ -1,4 +1,3 @@
>  /* List of comma-separated changed AML files to ignore */
> -"tests/data/acpi/virt/VIOT",
>  "tests/data/acpi/q35/DSDT.viot",
>  "tests/data/acpi/q35/VIOT.viot",
> diff --git a/tests/data/acpi/virt/VIOT b/tests/data/acpi/virt/VIOT
> index 
> e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..921f40d88c28ba2171a4d664e119914335309e7d
>  100644
> GIT binary patch
> literal 88
> zcmWIZ^bd((0D?3pe`k+i1*eDrX9XZ&1PX!JAexE60Hgv8m>C3sGzXN&z`)2L0cSHX
> I{D-Rq0Q5fy0RR91
>
> literal 0
> HcmV?d1
>
> --
> 2.33.0
>
>

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Cornelia Huck

On Mon, Oct 04 2021, "Michael S. Tsirkin"  wrote:

> On Mon, Oct 04, 2021 at 04:23:23AM +0200, Halil Pasic wrote:
>> --8<-
>> 
>> From: Halil Pasic 
>> Date: Thu, 30 Sep 2021 02:38:47 +0200
>> Subject: [PATCH] virtio: write back feature VERSION_1 before verify
>> 
>> This patch fixes a regression introduced by commit 82e89ea077b9
>> ("virtio-blk: Add validation for block size in config space") and
>> enables similar checks in verify() on big endian platforms.
>> 
>> The problem with checking multi-byte config fields in the verify
>> callback, on big endian platforms, and with a possibly transitional
>> device is the following. The verify() callback is called between
>> config->get_features() and virtio_finalize_features(). If we have a
>> device that offered F_VERSION_1, then we have the following options:
>> either the device is transitional, and then it has to present the legacy
>> interface, i.e. a big endian config space until F_VERSION_1 is
>> negotiated, or we have a non-transitional device, which makes
>> F_VERSION_1 mandatory, and only implements the non-legacy interface and
>> thus presents a little endian config space. Because at this point we
>> can't know if the device is transitional or non-transitional, we can't
>> know whether we need to byte swap or not.
>
> Well we established that we can know. Here's an alternative explanation:
>
>   The virtio specification virtio-v1.1-cs01 states:
>
>   Transitional devices MUST detect Legacy drivers by detecting that
>   VIRTIO_F_VERSION_1 has not been acknowledged by the driver.
>   This is exactly what QEMU as of 6.1 has done relying solely
>   on VIRTIO_F_VERSION_1 for detecting that.
>
>   However, the specification also says:
>   driver MAY read (but MUST NOT write) the device-specific
>   configuration fields to check that it can support the device before
>   accepting it.
>
>   In that case, any device relying solely on VIRTIO_F_VERSION_1
>   for detecting legacy drivers will return data in legacy format.
>   In particular, this implies that it is in big endian format
>   for big endian guests. This naturally confuses the driver
>   which expects little endian in the modern mode.
>
>   It is probably a good idea to amend the spec to clarify that
>   VIRTIO_F_VERSION_1 can only be relied on after the feature negotiation
>   is complete. However, we already have a regression so let's
>   try to address it.

I prefer that explanation.
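
The failure mode described in that explanation can be modelled in a few
lines of C. This is a toy model and not kernel or QEMU code: the device
and driver helpers are hypothetical, and only the VIRTIO_F_VERSION_1 bit
number (32) comes from the virtio specification. The device stores a
16-bit config field in guest-native order until VERSION_1 has been
acknowledged, while a modern driver always decodes little endian.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32   /* feature bit number from the virtio spec */

/* Toy transitional device: legacy (guest-endian) layout until VERSION_1 is acked. */
static void device_store16(uint8_t out[2], uint16_t val,
                           uint64_t acked_features, bool guest_big_endian)
{
    bool modern = (acked_features & (1ULL << VIRTIO_F_VERSION_1)) != 0;

    if (!modern && guest_big_endian) {
        out[0] = (uint8_t)(val >> 8);       /* legacy, big endian guest */
        out[1] = (uint8_t)(val & 0xff);
    } else {
        out[0] = (uint8_t)(val & 0xff);     /* little endian layout */
        out[1] = (uint8_t)(val >> 8);
    }
}

/* Toy modern driver: always decodes little endian. */
static uint16_t driver_load16_le(const uint8_t in[2])
{
    return (uint16_t)(in[0] | (in[1] << 8));
}

/* Value the driver sees in verify(), with or without the early write-back. */
static uint16_t probe_read(uint16_t device_value, bool write_back_first,
                           bool guest_big_endian)
{
    uint64_t acked = write_back_first ? (1ULL << VIRTIO_F_VERSION_1) : 0;
    uint8_t raw[2];

    device_store16(raw, device_value, acked, guest_big_endian);
    return driver_load16_le(raw);
}
```

In this model a block size of 512 read on a big endian guest without the
write-back is seen as 2, while writing VERSION_1 back first yields 512;
little endian guests see 512 either way.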

>
>
>> 
>> The virtio spec explicitly states that the driver MAY read config
>> between reading and writing the features so saying that first accessing
>> the config before feature negotiation is done is not an option. The
>> specification ain't clear about setting the features multiple times
>> before FEATURES_OK, so I guess that should be fine to set F_VERSION_1
>> since at this point we already know that we are about to negotiate
>> F_VERSION_1.
>> 
>> I don't consider this patch super clean, but frankly I don't think we
>> have a ton of options. Another option that may or may not be cleaner,
>> but is also IMHO much uglier is to figure out whether the device is
>> transitional by rejecting _F_VERSION_1, then resetting it and proceeding
>> according to what we have figured out, hoping that the characteristics
>> of the device didn't change.
>
> An empty line before tags.
>
>> Signed-off-by: Halil Pasic 
>> Fixes: 82e89ea077b9 ("virtio-blk: Add validation for block size in config space")
>> Reported-by: mark...@us.ibm.com
>
> Let's add more commits that are affected. E.g. virtio-net with MTU
> feature bit set is affected too.
>
> So let's add Fixes tag for:
> commit 14de9d114a82a564b94388c95af79a701dc93134
> Author: Aaron Conole 
> Date:   Fri Jun 3 16:57:12 2016 -0400
>
> virtio-net: Add initial MTU advice feature
> 
> I think that's all, but pls double check me.

I could not find anything else after a quick check.

>
>
>> ---
>>  drivers/virtio/virtio.c | 6 ++
>>  1 file changed, 6 insertions(+)
>> 
>> diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
>> index 0a5b54034d4b..2b9358f2e22a 100644
>> --- a/drivers/virtio/virtio.c
>> +++ b/drivers/virtio/virtio.c
>> @@ -239,6 +239,12 @@ static int virtio_dev_probe(struct device *_d)
>>  driver_features_legacy = driver_features;
>>  }
>>  
>> +/* Write F_VERSION_1 feature to pin down endianness */
>> +if (device_features & (1ULL << VIRTIO_F_VERSION_1) & driver_features) {
>> +dev->features = (1ULL << VIRTIO_F_VERSION_1);
>> +dev->config->finalize_features(dev);
>> +}
>> +
>>  if (device_features & (1ULL << VIRTIO_F_VERSION_1))
>>  dev->features = driver_features & device_features;
>>  else
>> -- 
>> 2.31.1

I think we should go with this just to fix the nasty regression for now.
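
The condition in the quoted hunk can be read in isolation. The following is a small sketch in C with a hypothetical helper name, not kernel code; the only fact taken from the virtio specification is that VIRTIO_F_VERSION_1 is feature bit 32. It returns true exactly when both the device and the driver offer VERSION_1, which is when the hunk writes the feature early.

```c
#include <assert.h>  /* for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Bit number of VIRTIO_F_VERSION_1 as defined by the virtio 1.x spec. */
#define VIRTIO_F_VERSION_1 32

/*
 * Hypothetical helper: true when both sides offer VERSION_1, i.e. when
 * the early feature write would happen.
 */
static bool should_pin_version_1(uint64_t device_features,
                                 uint64_t driver_features)
{
    const uint64_t bit = 1ULL << VIRTIO_F_VERSION_1;

    return (device_features & driver_features & bit) != 0;
}
```

For example, `assert(!should_pin_version_1(1ULL << 32, 0))` holds: a legacy-only driver never triggers the early write, so legacy operation is unaffected.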

Re: [PATCH v4 11/11] tests/acpi: add expected blobs for VIOT test on q35 machine

2021-10-05 Thread Ani Sinha




On Fri, 1 Oct 2021, Jean-Philippe Brucker wrote:

> Add expected blobs of the VIOT and DSDT table for the VIOT test on the
> q35 machine.
>
> Since the test instantiates a virtio device and two PCIe expander
> bridges, DSDT.viot has more blocks than the base DSDT (long diff not
> shown here).

For documentation and bisection of issues in the future, I think it's better to
provide the DSDT table ASL diff here as well.

> The VIOT table generated for the q35 test is:
>
> [000h    4]Signature : "VIOT"[Virtual I/O Translation Table]
> [004h 0004   4] Table Length : 0070
> [008h 0008   1] Revision : 00
> [009h 0009   1] Checksum : 3D
> [00Ah 0010   6]   Oem ID : "BOCHS "
> [010h 0016   8] Oem Table ID : "BXPC"
> [018h 0024   4] Oem Revision : 0001
> [01Ch 0028   4]  Asl Compiler ID : "BXPC"
> [020h 0032   4]Asl Compiler Revision : 0001
>
> [024h 0036   2]   Node count : 0003
> [026h 0038   2]  Node offset : 0030
> [028h 0040   8] Reserved : 
>
> [030h 0048   1] Type : 03 [VirtIO-PCI IOMMU]
> [031h 0049   1] Reserved : 00
> [032h 0050   2]   Length : 0010
>
> [034h 0052   2]  PCI Segment : 
> [036h 0054   2]   PCI BDF number : 0010
> [038h 0056   8] Reserved : 
>
> [040h 0064   1] Type : 01 [PCI Range]
> [041h 0065   1] Reserved : 00
> [042h 0066   2]   Length : 0018
>
> [044h 0068   4]   Endpoint start : 3000
> [048h 0072   2]PCI Segment start : 
> [04Ah 0074   2]  PCI Segment end : 
> [04Ch 0076   2]PCI BDF start : 3000
> [04Eh 0078   2]  PCI BDF end : 30FF
> [050h 0080   2]  Output node : 0030
> [052h 0082   6] Reserved : 
>
> [058h 0088   1] Type : 01 [PCI Range]
> [059h 0089   1] Reserved : 00
> [05Ah 0090   2]   Length : 0018
>
> [05Ch 0092   4]   Endpoint start : 1000
> [060h 0096   2]PCI Segment start : 
> [062h 0098   2]  PCI Segment end : 
> [064h 0100   2]PCI BDF start : 1000
> [066h 0102   2]  PCI BDF end : 10FF
> [068h 0104   2]  Output node : 0030
> [06Ah 0106   6] Reserved : 
>
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  tests/qtest/bios-tables-test-allowed-diff.h |   2 --
>  tests/data/acpi/q35/DSDT.viot   | Bin 0 -> 9398 bytes
>  tests/data/acpi/q35/VIOT.viot   | Bin 0 -> 112 bytes
>  3 files changed, 2 deletions(-)
>
> diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
> index fa213e4738..dfb8523c8b 100644
> --- a/tests/qtest/bios-tables-test-allowed-diff.h
> +++ b/tests/qtest/bios-tables-test-allowed-diff.h
> @@ -1,3 +1 @@
>  /* List of comma-separated changed AML files to ignore */
> -"tests/data/acpi/q35/DSDT.viot",
> -"tests/data/acpi/q35/VIOT.viot",
> diff --git a/tests/data/acpi/q35/DSDT.viot b/tests/data/acpi/q35/DSDT.viot
> index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..b41270ff6d63493c2ae379ddd1d3e28f190a6c01 100644
> GIT binary patch
> literal 9398
> zcmeHNO>7&-8J*>iv|O&FB}G~OOGG$M|57BBoWHhc5OS9yDTx$CQgH$r;8Idr*-4Q_
> z5(9Az1F`}niVsB-#zBvCpa8wKr(7GLxfJNZhXOUwQxCo5S`_gq>icGPq#2R|qEj!C
> zfZhFO-E)yj*v)_%j$|bWD4v61&3MJ6@sGF_Mv(
> z(Y~GJ$Ji9i%ul_-ddc|xw*RT`zx{!4bOW~WnR9oe8@#vYZ!iK~-v}&=4xHj-r&;K<
> zcU`OQR&r*iT=DGueakdEt~iRCoxImzW@o+PvCPVNXSM0Z?!3la@A7=V7VmARrY)yk
> z{pY1`=FY$P>E*ZcU;gqRzq<396$4-adlUOh0d4%7zBT9fosWB0jax+L=jQv zQRdK@z^9UXwkV>i=J#J~?>_G}@-A=VM7>texw(0?%WX7MbJqC}W*M`obLj6+2L}g#
> z7KhBa!JMioR2I#0z1Wf}4QL}(?VWPHRb@6~_rFcDSo^j^@$^f@nwPCNyiPXrY^T}E
> zvw%wcfQq{B`j+GO?T>ms>-oupgMHSY{HWJupLA{Zum8sP*}gR;+Lp2=-%n6m?tjZ-
> zjG;9@c#>K}{oUR@TWRJyyo-^34o#_78fy{Dw`^y5>Zzy%5~{uX^m4%iSX`qhT8~!A
> zG^eeZlHoI-8Ai$2Vq4f>h#*^g_hNN*{g5>^t+7liet~+Zy}PhdZ_UfPW8!)n8rHEU
> zO2#|UccP|wVTaee;I38=IdP!Tn zmOARAox0m>8Og6~%fzLjz(wD!XR-0J?VV zfm^7pSF`ns_j0yv6jt12mU+DH7MCLJ$0#~D2(}3k+%T>(s-yiwD&A+AC-UHoLQ!1-
> zZTt}HXS}hx*Q`$VSHhuj|GB^ZyZOw!)sJSsuAcdeTMekL*MH;pAM0IX{WHC*Rs z7Qc^d+_nd7KNU4@(}vxf?a%bCS>r)E9$^!#8~A%&#`e2rz2YvijNQTB2(~G5e*20+
> zH;dzb%?EP5(W z08WZ?oCl~3iHZ6-Ho}>}h7mC(G{QI&P|ie1Otgk$qns&Q5M{)a(5PSn%9#j>DYIZ)
> z2`sNC#+ect6HM87gsRTCrZdi&5*imw*?5Gi&M{5r7-vf8n649{s&ib^Ij-p(*L5OP
> zb()$^Q`2ecIuWWm@dQ$OI-%)I=sFRqIxS77rRlVEod{K(Nlj-`)0xzDB2;zaS*To3
> zThnRlIuWWmCp4WCn$8JbCqh-{q^5IH(>bZ@M5yYV(sWK~I;V7<2vwbqrqj`MI=W7T
> zs?LqMyPoYr(sYdWWOod{K(8BJ$K)0xqAB2;zGXgX&!
> zoin;lgsRR{n$A<2&QrQhgsM)=Byji1=g_RCb5_@hP}O-_(|KCcd0N+rP}O;cGxOn-
> z@C;`b!iU`%!E}#8VtOI=tj0X6G0*Bugevo##

[PATCH] ui/gtk: Update the refresh rate for gl-area too

2021-10-05 Thread Nikola Pavlica

This is a bugfix that stretches all the way back to January 2020,
where I initially introduced this problem and potential solutions.

A quick recap of the issue: QEMU did not sync up with the monitor's
refresh rate, causing the VM to render frames that were NOT displayed
to the user. That "fix" allowed QEMU to obtain the screen refresh rate
information from the system using GDK APIs and was for GTK only.

Well, I'm back with the same issue again. But this time on Wayland.

And I did NOT realize there was YET another screen refresh rate
function, this time for Wayland specifically. Thankfully the fix was
simple and without much hassle.

Thanks,
Nikola

Signed-off-by: Nikola Pavlica 
---
 ui/gtk-gl-area.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ui/gtk-gl-area.c b/ui/gtk-gl-area.c
index b23523748e..afcb29f658 100644
--- a/ui/gtk-gl-area.c
+++ b/ui/gtk-gl-area.c
@@ -112,6 +112,9 @@ void gd_gl_area_refresh(DisplayChangeListener *dcl)
 {
 VirtualConsole *vc = container_of(dcl, VirtualConsole, gfx.dcl);
 
+vc->gfx.dcl.update_interval = gd_monitor_update_interval(
+vc->window ? vc->window : vc->gfx.drawing_area);
+
 if (!vc->gfx.gls) {
 if (!gtk_widget_get_realized(vc->gfx.drawing_area)) {
 return;
-- 
2.33.0
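
For readers unfamiliar with how a refresh rate turns into an update interval, here is a rough sketch in C. It is not the body of gd_monitor_update_interval(); the function name and the 30 ms cap are assumptions made for illustration. It assumes the rate is reported in millihertz, as GDK's monitor API does, and converts it to a period in milliseconds.

```c
#include <assert.h>  /* for the checks in the usage note */

/* Assumed upper bound on the interval in milliseconds (illustrative value). */
#define SKETCH_DEFAULT_INTERVAL_MS 30

/*
 * Hypothetical helper: convert a refresh rate in millihertz to an update
 * interval in milliseconds. T = 1 / f = 1000 * 1000 [ms * mHz] / f.
 * Returns 0 when the rate is unknown so the caller can keep its default.
 */
static int interval_ms_from_millihertz(int refresh_mhz)
{
    int interval;

    if (refresh_mhz <= 0) {
        return 0;
    }
    interval = 1000 * 1000 / refresh_mhz;
    return interval < SKETCH_DEFAULT_INTERVAL_MS
           ? interval : SKETCH_DEFAULT_INTERVAL_MS;
}
```

A 60 Hz monitor (60000 mHz) gives `assert(interval_ms_from_millihertz(60000) == 16)`, and a 144 Hz monitor gives 6 ms, so a stale interval would make the guest render at the wrong cadence.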

Re: [PATCH v1 2/2] migration: add missing qemu_mutex_lock_iothread in migration_completion

2021-10-05 Thread Dr. David Alan Gilbert

* Emanuele Giuseppe Esposito (eespo...@redhat.com) wrote:
> qemu_savevm_state_complete_postcopy assumes the iothread lock (BQL)
> to be held, but instead it isn't.
> 
> Signed-off-by: Emanuele Giuseppe Esposito 

Interesting, I think you're right - and I think it's been missing it
from the start.

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/migration.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 041b8451a6..215d5281f2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3182,7 +3182,10 @@ static void migration_completion(MigrationState *s)
>  } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>  trace_migration_completion_postcopy_end();
>  
> +qemu_mutex_lock_iothread();
>  qemu_savevm_state_complete_postcopy(s->to_dst_file);
> +qemu_mutex_unlock_iothread();
> +
>  trace_migration_completion_postcopy_end_after_complete();
>  } else if (s->state == MIGRATION_STATUS_CANCELLING) {
>  goto fail;
> -- 
> 2.27.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [PATCH v4 09/11] tests/acpi: add test cases for VIOT

2021-10-05 Thread Ani Sinha




On Fri, 1 Oct 2021, Jean-Philippe Brucker wrote:

> Add two test cases for VIOT, one on the q35 machine and the other on
> virt. To test complex topologies the q35 test has two PCIe buses that
> bypass the IOMMU (and are therefore not described by VIOT), and two
> buses that are translated by virtio-iommu.
>
> Signed-off-by: Jean-Philippe Brucker 

This might be a stupid question but what about virtio-mmio and single mmio
cases? I see none of your tables has nodes for those and here too you do
not add test cases for it.

> ---
>  tests/qtest/bios-tables-test.c | 38 ++
>  1 file changed, 38 insertions(+)
>
> diff --git a/tests/qtest/bios-tables-test.c b/tests/qtest/bios-tables-test.c
> index 4f11d03055..b6cb383bd9 100644
> --- a/tests/qtest/bios-tables-test.c
> +++ b/tests/qtest/bios-tables-test.c
> @@ -1403,6 +1403,42 @@ static void test_acpi_virt_tcg(void)
>  free_test_data(&data);
>  }
>
> +static void test_acpi_q35_viot(void)
> +{
> +test_data data = {
> +.machine = MACHINE_Q35,
> +.variant = ".viot",
> +};
> +
> +/*
> + * To keep things interesting, two buses bypass the IOMMU.
> + * VIOT should only describes the other two buses.
> + */
> +test_acpi_one("-machine default_bus_bypass_iommu=on "
> +  "-device virtio-iommu "
> +  "-device pxb-pcie,bus_nr=0x10,id=pcie.100,bus=pcie.0 "
> +  "-device pxb-pcie,bus_nr=0x20,id=pcie.200,bus=pcie.0,bypass_iommu=on "
> +  "-device pxb-pcie,bus_nr=0x30,id=pcie.300,bus=pcie.0",
> +  &data);
> +free_test_data(&data);
> +}
> +
> +static void test_acpi_virt_viot(void)
> +{
> +test_data data = {
> +.machine = "virt",
> +.uefi_fl1 = "pc-bios/edk2-aarch64-code.fd",
> +.uefi_fl2 = "pc-bios/edk2-arm-vars.fd",
> +.cd = "tests/data/uefi-boot-images/bios-tables-test.aarch64.iso.qcow2",
> +.ram_start = 0x4000ULL,
> +.scan_len = 128ULL * 1024 * 1024,
> +};
> +
> +test_acpi_one("-cpu cortex-a57 "
> +  "-device virtio-iommu", &data);
> +free_test_data(&data);
> +}
> +
>  static void test_oem_fields(test_data *data)
>  {
>  int i;
> @@ -1567,12 +1603,14 @@ int main(int argc, char *argv[])
>  if (strcmp(arch, "x86_64") == 0) {
>  qtest_add_func("acpi/microvm/pcie", test_acpi_microvm_pcie_tcg);
>  }
> +qtest_add_func("acpi/q35/viot", test_acpi_q35_viot);
>  } else if (strcmp(arch, "aarch64") == 0) {
>  qtest_add_func("acpi/virt", test_acpi_virt_tcg);
>  qtest_add_func("acpi/virt/numamem", test_acpi_virt_tcg_numamem);
>  qtest_add_func("acpi/virt/memhp", test_acpi_virt_tcg_memhp);
>  qtest_add_func("acpi/virt/pxb", test_acpi_virt_tcg_pxb);
>  qtest_add_func("acpi/virt/oem-fields", test_acpi_oem_fields_virt);
> +qtest_add_func("acpi/virt/viot", test_acpi_virt_viot);
>  }
>  ret = g_test_run();
>  boot_sector_cleanup(disk);
> --
> 2.33.0
>
>

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Halil Pasic

On Mon, 4 Oct 2021 05:07:13 -0400
"Michael S. Tsirkin"  wrote:

> On Mon, Oct 04, 2021 at 04:23:23AM +0200, Halil Pasic wrote:
> > On Sat, 2 Oct 2021 14:13:37 -0400
> > "Michael S. Tsirkin"  wrote:
> >   
> > > > Anyone else have an idea? This is a nasty regression; we could revert
> > > > the patch, which would remove the symptoms and give us some time, but that
> > > > doesn't really feel right, I'd do that only as a last resort.
> > > 
> > > Well we have Halil's hack (except I would limit it
> > > to only apply to BE, only do devices with validate,
> > > and only in modern mode), and we will fix QEMU to be spec compliant.
> > > Between these why do we need any conditional compiles?  
> > 
> > We don't. As I stated before, this hack is flawed because it
> > effectively breaks fencing features by the driver with QEMU. Some
> > features can not be unset after once set, because we tend to try to
> > enable the corresponding functionality whenever we see a write
> > features operation with the feature bit set, and we don't disable, if a
> > subsequent features write operation stores the feature bit as not set.  
> 
> Something to fix in QEMU too, I think.

Possibly. But it is the same situation: it probably has a long
history. And it may even make some sense. The obvious trigger for
doing the conditional initialization for modern is the setting of
FEATURES_OK. The problem is, legacy doesn't do FEATURES_OK. So we would
need a different trigger.

> 
> > But it looks like VIRTIO_1 is fine to get cleared afterwards.  
> 
> We'd never clear it though - why would we?
> 

Right.

> > So my hack
> > should actually look like posted below, modulo conditions.  
> 
> 
> Looking at it some more, I see that vhost-user actually
> does not send features to the backend until FEATURES_OK.

I.e. the hack does not work for transitional vhost-user devices,
but it doesn't break them either.

Furthermore, I believe there is not much we can do to support
transitional devices with vhost-user and similar, without extending
the protocol. The transport specific detection idea would need a new
vhost-user thingy to tell the device what has been figured
out, right?

In theory modern only could work, if the backends were paying extra
attention to endianness, instead of just assuming that the code is
running little-endian.
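
What "paying extra attention to endianness" could look like in a backend is sketched below in C. This is an illustration with hypothetical helper names, not code from any existing vhost-user backend: instead of copying a native struct into the config space, each multi-byte field is stored byte by byte in little endian order, so the result is the same on any host.

```c
#include <assert.h>  /* for the checks in the usage note */
#include <stdint.h>

/* Hypothetical helpers: store a value in little endian order on any host. */
static void store_le16(uint8_t out[2], uint16_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)(v >> 8);
}

static void store_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)(v >> 24);
}
```

Storing 512 with store_le32() yields the bytes 00 02 00 00 regardless of host byte order, which is what a modern driver expects. A transitional device would still need to know when to present the legacy layout instead, which is the part the protocol currently cannot express.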

> However, the code in contrib for vhost-user-blk at least seems
> broken wrt endian-ness ATM.

Agree. For example config is native endian ATM AFAICT. 

> What about other backends though?

I think whenever the config is owned and managed by the vhost-backend
we have a problem with transitional. And we don't have everything in
the protocol to deal with this problem.

I didn't check modern for the different vhost-user backends. I don't
think we recommend our users on s390 to use those. My understanding
of the use-cases is far from complete.

> Hard to be sure right?

I agree.

> Cc Raphael and Stefan so they can take a look.
> And I guess it's time we CC'd qemu-devel too.
> 
> For now I am beginning to think we should either revert or just limit
> validation to LE and think about all this some more. And I am inclining
> to do a revert.

I'm fine with either of these as a quick fix, but we will eventually have
to find a solution. AFAICT this solution works for the s390 setups we
care about the most, but so would a revert.

> These are all hypervisors that shipped for a long time.
> Do we need a flag for early config space access then?

You mean a feature bit? I think it is a good idea even if
it weren't strictly necessary. We will have a behavior change
for some devices, and I think the ability to detect those
is valuable.

Your spec change proposal makes it IMHO pretty clear that
we are changing our understanding of how transitional should work.
Strictly, transitional is not a normative part of the spec AFAIU,
but still...

> 
> 
> 
> > 
> > Regarding the conditions I guess checking that driver_features has
> > F_VERSION_1 already satisfies "only modern mode", or?  
> 
> Right.
> 
> > For now
> > I've deliberately omitted the has verify and the is big endian
> > conditions so we have a better chance to see if something breaks
> > (i.e. the approach does not work). I can add in those extra conditions
> > later.  
> 
> Or maybe if we will go down that road just the verify check (for
> performance). I'm a bit unhappy we have the extra exit but consistency
> seems more important.
> 

I'm fine either way. The extra exit happens only during initialization,
once per device; I have no feeling for whether it has a measurable
performance impact.

> > 
> > --8<-
> > 
> > From: Halil Pasic 
> > Date: Thu, 30 Sep 2021 02:38:47 +0200
> > Subject: [PATCH] virtio: write back feature VERSION_1 before verify
> > 
> > This patch fixes a regression introduced by commit 82e89ea077b9
> > ("virtio-blk: Add validation for block size in config space") and
> > enables similar checks in verif

Re: [PATCH 3/3] vdpa: Check for iova range at mappings changes

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 11:58:12AM +0200, Eugenio Perez Martin wrote:
> On Tue, Oct 5, 2021 at 10:14 AM Michael S. Tsirkin  wrote:
> >
> > On Tue, Oct 05, 2021 at 10:01:31AM +0200, Eugenio Pérez wrote:
> > > Check vdpa device range before updating memory regions so we don't add
> > > any outside of it, and report the invalid change if any.
> > >
> > > Signed-off-by: Eugenio Pérez 
> > > ---
> > >  include/hw/virtio/vhost-vdpa.h |  2 +
> > >  hw/virtio/vhost-vdpa.c | 68 ++
> > >  hw/virtio/trace-events |  1 +
> > >  3 files changed, 55 insertions(+), 16 deletions(-)
> > >
> > > > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > > index a8963da2d9..c288cf7ecb 100644
> > > --- a/include/hw/virtio/vhost-vdpa.h
> > > +++ b/include/hw/virtio/vhost-vdpa.h
> > > @@ -13,6 +13,7 @@
> > >  #define HW_VIRTIO_VHOST_VDPA_H
> > >
> > >  #include "hw/virtio/virtio.h"
> > > +#include "standard-headers/linux/vhost_types.h"
> > >
> > >  typedef struct VhostVDPAHostNotifier {
> > >  MemoryRegion mr;
> > > @@ -24,6 +25,7 @@ typedef struct vhost_vdpa {
> > >  uint32_t msg_type;
> > >  bool iotlb_batch_begin_sent;
> > >  MemoryListener listener;
> > > +struct vhost_vdpa_iova_range iova_range;
> > >  struct vhost_dev *dev;
> > >  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > >  } VhostVDPA;
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index a1de6c7c9c..26d0258723 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -33,20 +33,34 @@ static Int128 vhost_vdpa_section_end(const 
> > > MemoryRegionSection *section)
> > >  return llend;
> > >  }
> > >
> > > > -static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
> > > -{
> > > -return (!memory_region_is_ram(section->mr) &&
> > > -!memory_region_is_iommu(section->mr)) ||
> > > -memory_region_is_protected(section->mr) ||
> > > -   /* vhost-vDPA doesn't allow MMIO to be mapped  */
> > > -memory_region_is_ram_device(section->mr) ||
> > > -   /*
> > > > -* Sizing an enabled 64-bit BAR can cause spurious mappings to
> > > > -* addresses in the upper part of the 64-bit address space.  These
> > > > -* are never accessed by the CPU and beyond the address width of
> > > > -* some IOMMU hardware.  TODO: VDPA should tell us the IOMMU width.
> > > -*/
> > > -   section->offset_within_address_space & (1ULL << 63);
> > > > +static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
> > > +uint64_t iova_min,
> > > +uint64_t iova_max)
> > > +{
> > > +Int128 llend;
> > > +bool r = (!memory_region_is_ram(section->mr) &&
> > > +  !memory_region_is_iommu(section->mr)) ||
> > > +  memory_region_is_protected(section->mr) ||
> > > +  /* vhost-vDPA doesn't allow MMIO to be mapped  */
> > > +  memory_region_is_ram_device(section->mr);
> > > +if (r) {
> > > +return true;
> > > +}
> > > +
> > > +if (section->offset_within_address_space < iova_min) {
> > > > +error_report("RAM section out of device range (min=%lu, addr=%lu)",
> > > + iova_min, section->offset_within_address_space);
> > > +return true;
> > > +}
> > > +
> > > +llend = vhost_vdpa_section_end(section);
> > > +if (int128_make64(llend) > iova_max) {
> >
> > I am puzzled by this.
> > You are taking a Int128, converting to u64, converting
> > back to Int128, and comparing to u64.
> > Head spins. What is all this back and forth trying to achieve?
> >
> 
> You are totally right, this series was extracted from a longer one
> where I didn't use vhost_vdpa_section_end, but raw addresses. Then I
> applied int128_make64 to the wrong variable, too fast.
> 
> To be sure we are on the same page, to do:
> 
> if (int128_ge(int128_make64(iova), llend)) {
> // error message
> return;
> }
> 
> The same way as vhost_vdpa_listener_region_{add,del} would be ok?
> 
> Thanks!


should be ok, yea
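
As a side note on the range check discussed above: the following is a sketch in C of one overflow-safe way to test a section against an inclusive IOVA range using only 64-bit arithmetic. It deliberately does not use QEMU's Int128 helpers, and the function name is hypothetical. The idea is to compare against the inclusive last address (start + size - 1) without ever forming the exclusive end, which may not fit in 64 bits. It assumes the section size itself fits in a uint64_t and ignores page alignment.

```c
#include <assert.h>  /* for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU API: true when [start, start + size) lies
 * entirely inside the inclusive range [iova_min, iova_max].
 */
static bool section_in_iova_range(uint64_t start, uint64_t size,
                                  uint64_t iova_min, uint64_t iova_max)
{
    if (size == 0) {
        return true;    /* empty section: nothing to map */
    }
    if (start < iova_min || start > iova_max) {
        return false;
    }
    /* last = start + size - 1, checked without computing the sum. */
    return size - 1 <= iova_max - start;
}
```

The topmost page of the address space is accepted when the device range covers it, for example `assert(section_in_iova_range(UINT64_MAX - 0xfff, 0x1000, 0, UINT64_MAX))`, even though its exclusive end is 2^64.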

> > > > +error_report("RAM section out of device range (max=%lu, end addr=%lu)",
> > > + iova_max, (uint64_t)int128_make64(llend));
> > > +return true;
> > > +}
> > > +
> > > +return false;
> > >  }
> > >
> > > >  static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
> > > @@ -158,7 +172,8 @@ static void 
> > > vhost_vdpa_listener_region_add(MemoryListener *listener,
> > >  void *vaddr;
> > >  int ret;
> > >
> > > -if (vhost_vdpa_listener_skipped_section(section)) {
> > > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
> > > +

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Halil Pasic

On Tue, 5 Oct 2021 03:53:17 -0400
"Michael S. Tsirkin"  wrote:

> > Wouldn't a call from transport code into virtio core
> > be more handy? What I have in mind is stuff like vhost-user and vdpa. My
> > understanding is, that for vhost setups where the config is outside qemu,
> > we probably need a new command that tells the vhost backend what
> > endianness to use for config. I don't think we can use
> > VHOST_USER_SET_VRING_ENDIAN because that one is on a virtqueue basis
> > according to the doc. So for vhost-user and similar we would fire that
> > command and probably also set the field, while for devices for which
> > control plane is handled by QEMU we would just set the field.
> > 
> > Does that sound about right?  
> 
> I'm fine either way, but when would you invoke this?
> With my idea backends can check the field when get_config
> is invoked.
> 
> As for using this in VHOST, can we maybe re-use SET_FEATURES?
> 
> Kind of hacky but nice in that it will actually make existing backends
> work...

Basically the equivalent of this patch, just on the vhost interface,
right? Could work, I have to look into it :)

Re: [PATCH 2/3] vdpa: Add vhost_vdpa_section_end

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 11:52:37AM +0200, Eugenio Perez Martin wrote:
> On Tue, Oct 5, 2021 at 10:15 AM Michael S. Tsirkin  wrote:
> >
> > On Tue, Oct 05, 2021 at 10:01:30AM +0200, Eugenio Pérez wrote:
> > > Abstract this operation, that will be reused when validating the region
> > > against the iova range that the device supports.
> > >
> > > Signed-off-by: Eugenio Pérez 
> >
> > Note that as defined end is actually 1 byte beyond end of section.
> > As such it can e.g. overflow if cast to u64.
> > So be careful to use int128 ops with it.
> 
> You are right, but this is only the result of extracting "llend"
> calculation in its own function, since it is going to be used a third
> time in the next commit. This next commit contains a mistake because
> of this, as you pointed out.
> 
> Since "last" would be a very misleading name, do you think we could
> give a better name / type to it?
> 
> > Also - document?
> 
> It will be documented with that ("It returns one byte beyond end of
> section" or similar) too.
> 
> Thanks!

that's how c++ containers work so maybe it's not too bad as long
as we document this carefully.
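
To illustrate why the exclusive end needs 128-bit care, here is a small sketch in C. The struct is a stand-in written for this note, not QEMU's Int128, and the function name is hypothetical. It computes start + size with an explicit carry and shows that truncating the result to 64 bits would lose information for a section that ends at the top of the address space.

```c
#include <assert.h>  /* for the checks in the usage note */
#include <stdint.h>

/* Minimal unsigned 128-bit value; a stand-in, not QEMU's Int128. */
struct u128_sketch {
    uint64_t lo;
    uint64_t hi;
};

/* Hypothetical helper: exclusive end (one byte past the last address). */
static struct u128_sketch section_end_exclusive(uint64_t start, uint64_t size)
{
    struct u128_sketch end;

    end.lo = start + size;              /* wraps modulo 2^64 */
    end.hi = end.lo < start ? 1 : 0;    /* carry out of the low word */
    return end;
}
```

For the last 4 KiB page, section_end_exclusive(0xFFFFFFFFFFFFF000, 0x1000) has hi == 1 and lo == 0, so a cast to a 64-bit type would report an end address of 0 and any "end > max" comparison on the truncated value would pass silently.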

> >
> > > ---
> > >  hw/virtio/vhost-vdpa.c | 18 +++---
> > >  1 file changed, 11 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index ea1aa71ad8..a1de6c7c9c 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -24,6 +24,15 @@
> > >  #include "trace.h"
> > >  #include "qemu-common.h"
> > >
> > > +static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
> > > +{
> > > +Int128 llend = int128_make64(section->offset_within_address_space);
> > > +llend = int128_add(llend, section->size);
> > > +llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > > +
> > > +return llend;
> > > +}
> > > +
> > > >  static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
> > >  {
> > >  return (!memory_region_is_ram(section->mr) &&
> > > @@ -160,10 +169,7 @@ static void 
> > > vhost_vdpa_listener_region_add(MemoryListener *listener,
> > >  }
> > >
> > >  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > > -llend = int128_make64(section->offset_within_address_space);
> > > -llend = int128_add(llend, section->size);
> > > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > > -
> > > +llend = vhost_vdpa_section_end(section);
> > >  if (int128_ge(int128_make64(iova), llend)) {
> > >  return;
> > >  }
> > > @@ -221,9 +227,7 @@ static void 
> > > vhost_vdpa_listener_region_del(MemoryListener *listener,
> > >  }
> > >
> > >  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > > -llend = int128_make64(section->offset_within_address_space);
> > > -llend = int128_add(llend, section->size);
> > > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > > +llend = vhost_vdpa_section_end(section);
> > >
> > >  trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend));
> > >
> > > --
> > > 2.27.0
> >

Re: [PATCH 06/11] qdev: Add Error parameter to qdev_set_id()

2021-10-05 Thread Kevin Wolf

Am 27.09.2021 um 12:33 hat Damien Hedde geschrieben:
> Hi Kevin,
> 
> I proposed a very similar patch in our rfc series because we needed some of
> the cleaning you do here.
> https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg05679.html
> I've added a bit of doc for the function, feel free to take it if you want.

Thanks, I'm replacing my patch with yours for v2.

Kevin

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck

On Dienstag, 5. Oktober 2021 09:38:53 CEST David Hildenbrand wrote:
> On 04.10.21 21:38, Christian Schoenebeck wrote:
> > At the moment the maximum transfer size with virtio is limited to 4M
> > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > theoretical possible transfer size of 128M (32k pages) according to the
> > virtio specs:
> > 
> > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> I'm missing the "why do we care". Can you comment on that?

Primary motivation is the possibility of improved performance, e.g. in case of 
9pfs, people can raise the maximum transfer size with the Linux 9p client's 
'msize' option on guest side (and only on guest side actually). If guest 
performs large chunk I/O, e.g. consider something "useful" like this one on 
guest side:

  time cat large_file_on_9pfs.dat > /dev/null

Then there is a noticeable performance increase with higher transfer size 
values. The gain continues as the transfer size rises, but with diminishing 
returns, as is typical for similar tunables such as cache sizes.

Then a secondary motivation is described in reason (2) of patch 2: if the 
transfer size is configurable on guest side (like it is the case with the 9pfs 
'msize' option), then there is the unpleasant side effect that the current 
virtio limit of 4M is invisible to guest; as this value of 4M is simply an 
arbitrarily limit set on QEMU side in the past (probably just implementation 
motivated on QEMU side at that point), i.e. it is not a limit specified by the 
virtio protocol, nor is this limit be made aware to guest via virtio protocol 
at all. The consequence with 9pfs would be if user tries to go higher than 4M, 
then the system would simply hang with this QEMU error:

  virtio: too many write descriptors in indirect table

Now whether this is an issue or not for individual virtio users, depends on 
whether the individual virtio user already had its own limitation <= 4M 
enforced on its side.
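
The 4M and 128M figures follow from simple arithmetic, sketched here in C. The helper name is hypothetical and the sketch assumes 4 KiB pages with each descriptor covering at most one page; other page sizes would scale the result accordingly.

```c
#include <assert.h>  /* for the checks in the usage note */
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed 4 KiB pages */

/* Hypothetical helper: upper bound on one transfer for a given queue size. */
static uint64_t max_transfer_bytes(uint32_t queue_size)
{
    return (uint64_t)queue_size * SKETCH_PAGE_SIZE;
}
```

With the current queue size, max_transfer_bytes(1024) is 4 MiB; with the proposed 32k, max_transfer_bytes(32768) is 128 MiB.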

Best regards,
Christian Schoenebeck

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Cornelia Huck

On Tue, Oct 05 2021, Halil Pasic  wrote:

> On Mon, 4 Oct 2021 05:07:13 -0400
> "Michael S. Tsirkin"  wrote:
>> Well we established that we can know. Here's an alternative explanation:
>
>
> I think we established how this should be in the future, where a transport
> specific mechanism is used to decide whether we are operating in legacy mode
> or in modern mode. But with the current QEMU reality, I don't think so.
> Namely currently the switch native-endian config -> little endian config
> happens when the VERSION_1 is negotiated, which may happen whenever
> the VERSION_1 bit is changed, or only when FEATURES_OK is set
> (vhost-user).
>
> This is consistent with device should detect a legacy driver by checking
> for VERSION_1, which is what the spec currently says.
>
> So for transitional we start out with native-endian config. For modern
> only the config is always LE.
>
> The guest can distinguish between a legacy only device and a modern
> capable device after the revision negotiation. A legacy device would
> reject the CCW.
>
> But both a transitional device and a modern only device would accept
> a revision > 0. So the guest does not know for ccw.

Well, for pci I think the driver knows that it is using either legacy or
modern, no?

And for ccw, the driver knows at that point in time which revision it
negotiated, so it should know that a revision > 0 will use LE (and the
device will obviously know that as well.)

Or am I misunderstanding what you're getting at?

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 12:43:03PM +0200, Halil Pasic wrote:
> On Mon, 4 Oct 2021 05:07:13 -0400
> "Michael S. Tsirkin"  wrote:
> 
> > On Mon, Oct 04, 2021 at 04:23:23AM +0200, Halil Pasic wrote:
> > > On Sat, 2 Oct 2021 14:13:37 -0400
> > > "Michael S. Tsirkin"  wrote:
> > >   
> > > > > Anyone else have an idea? This is a nasty regression; we could revert
> > > > > the patch, which would remove the symptoms and give us some time, but that
> > > > > doesn't really feel right, I'd do that only as a last resort.
> > > > 
> > > > Well we have Halil's hack (except I would limit it
> > > > to only apply to BE, only do devices with validate,
> > > > and only in modern mode), and we will fix QEMU to be spec compliant.
> > > > Between these why do we need any conditional compiles?  
> > > 
> > > We don't. As I stated before, this hack is flawed because it
> > > effectively breaks fencing features by the driver with QEMU. Some
> > > features can not be unset after once set, because we tend to try to
> > > enable the corresponding functionality whenever we see a write
> > > features operation with the feature bit set, and we don't disable, if a
> > > subsequent features write operation stores the feature bit as not set.  
> > 
> > Something to fix in QEMU too, I think.
> 
> Possibly. But it is the same situation: it probably has a long
> history. And it may even make some sense. The obvious trigger for
> doing the conditional initialization for modern is the setting of
> FEATURES_OK. The problem is, legacy doesn't do FEATURES_OK. So we would
> need a different trigger.
> 
> > 
> > > But it looks like VIRTIO_1 is fine to get cleared afterwards.  
> > 
> > We'd never clear it though - why would we?
> > 
> 
> Right.
> 
> > > So my hack
> > > should actually look like posted below, modulo conditions.  
> > 
> > 
> > Looking at it some more, I see that vhost-user actually
> > does not send features to the backend until FEATURES_OK.
> 
> I.e. the hack does not work for transitional vhost-user devices,
> but it doesn't break them either.
> 
> Furthermore, I believe there is not much we can do to support
> transitional devices with vhost-user and similar, without extending
> the protocol. The transport specific detection idea would need a new
> vhost-user thingy to tell the device what has been figured
> out, right?
> 
> In theory modern only could work, if the backends were paying extra
> attention to endianness, instead of just assuming that the code is
> running little-endian.

I think a reasonable thing is to send SET_FEATURES before each
GET_CONFIG, to tell backend which format is expected.

> > However, the code in contrib for vhost-user-blk at least seems
> > broken wrt endian-ness ATM.
> 
> Agree. For example config is native endian ATM AFAICT. 
> 
> > What about other backends though?
> 
> I think whenever the config is owned and managed by the vhost-backend
> we have a problem with transitional. And we don't have everything in
> the protocol to deal with this problem.
> 
> I didn't check modern for the different vhost-user backends. I don't
> think we recommend our users on s390 to use those. My understanding
> of the use-cases is far from complete.
> 
> > Hard to be sure right?
> 
> I agree.
> 
> > Cc Raphael and Stefan so they can take a look.
> > And I guess it's time we CC'd qemu-devel too.
> > 
> > For now I am beginning to think we should either revert or just limit
> > validation to LE and think about all this some more. And I am inclining
> > to do a revert.
> 
> I'm fine with either of these as a quick fix, but we will eventually have
> to find a solution. AFAICT this solution works for the s390 setups we
> care about the most, but so would a revert.

The reason I like this one is that it also fixes MTU for virtio net,
and that one we can't really revert.


> 
> 
> > These are all hypervisors that shipped for a long time.
> > Do we need a flag for early config space access then?
> 
> You mean a feature bit? I think it is a good idea even if
> it weren't strictly necessary. We will have a behavior change
> for some devices, and I think the ability to detect those
> is valuable.
> 
> Your spec change proposal, makes it IMHO pretty clear, that
> we are changing our understanding of how transitional should work.
> Strictly, transitional is not a normative part of the spec AFAIU,
> but still...
> 
> 
> > 
> > 
> > 
> > > 
> > > Regarding the conditions I guess checking that driver_features has
> > > F_VERSION_1 already satisfies "only modern mode", or?  
> > 
> > Right.
> > 
> > > For now
> > > I've deliberately omitted the has verify and the is big endian
> > > conditions so we have a better chance to see if something breaks
> > > (i.e. the approach does not work). I can add in those extra conditions
> > > later.  
> > 
> > Or maybe if we will go down that road just the verify check (for
> > performance). I'm a bit unhappy we have the extra exit but consistency
> > seems more important.
> > 
>

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck

On Tuesday, 5 October 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > Raise the maximum possible virtio transfer size to 128M
> > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > more detailed explanation for the reasons of this change.
> > 
> > For not breaking any virtio user, all virtio users transition
> > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > of 1k with this commit.
> > 
> > On the long-term, each virtio user should subsequently either
> > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > after checking that they support the new value of 32k, or
> > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > macro by an appropriate value supported by them.
> > 
> > Signed-off-by: Christian Schoenebeck 
> 
> I don't think we need this. Legacy isn't descriptive either.  Just leave
> VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.

Does this mean you disagree that on the long-term all virtio users should 
transition either to the new upper limit of 32k max queue size or introduce 
their own limit at their end?

Independent of the name, and I would appreciate suggestions for an
adequate macro name here, I still think this new limit should be placed in the
shared virtio.h file. Because this value is not something invented on virtio
user side. It rather reflects the theoretical upper limit possible with the
virtio protocol, which is and will be common for all virtio users.
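
The numbers above can be checked with a minimal sketch, assuming a 4 KiB
page size and one page per descriptor. VIRTQUEUE_SPEC_MAX_SIZE is a
hypothetical name used only for illustration (no name has been agreed on),
and EXAMPLE_PAGE_SIZE and max_transfer_bytes() are likewise made up here:

```c
#include <assert.h>
#include <stdint.h>

/* Value discussed in the thread: the limit all virtio users keep for now. */
#define VIRTQUEUE_MAX_SIZE 1024u
/* Hypothetical name for the 32k limit; not an agreed-upon macro. */
#define VIRTQUEUE_SPEC_MAX_SIZE 32768u
/* Assumed page size for the arithmetic below. */
#define EXAMPLE_PAGE_SIZE 4096u

/* Largest transfer in bytes if every descriptor covers exactly one page. */
uint64_t max_transfer_bytes(uint32_t queue_size)
{
    assert(queue_size > 0);
    return (uint64_t)queue_size * EXAMPLE_PAGE_SIZE;
}
```

With these assumptions, 1024 descriptors give 4M and 32768 descriptors give
128M, which matches the figures quoted in the commit message.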

> > ---
> > 
> >  hw/9pfs/virtio-9p-device.c |  2 +-
> >  hw/block/vhost-user-blk.c  |  6 +++---
> >  hw/block/virtio-blk.c  |  6 +++---
> >  hw/char/virtio-serial-bus.c|  2 +-
> >  hw/input/virtio-input.c|  2 +-
> >  hw/net/virtio-net.c| 12 ++--
> >  hw/scsi/virtio-scsi.c  |  2 +-
> >  hw/virtio/vhost-user-fs.c  |  6 +++---
> >  hw/virtio/vhost-user-i2c.c |  2 +-
> >  hw/virtio/vhost-vsock-common.c |  2 +-
> >  hw/virtio/virtio-balloon.c |  2 +-
> >  hw/virtio/virtio-crypto.c  |  2 +-
> >  hw/virtio/virtio-iommu.c   |  2 +-
> >  hw/virtio/virtio-mem.c |  2 +-
> >  hw/virtio/virtio-mmio.c|  4 ++--
> >  hw/virtio/virtio-pmem.c|  2 +-
> >  hw/virtio/virtio-rng.c |  3 ++-
> >  include/hw/virtio/virtio.h | 20 +++-
> >  18 files changed, 49 insertions(+), 30 deletions(-)
> > 
> > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > index cd5d95dd51..9013e7df6e 100644
> > --- a/hw/9pfs/virtio-9p-device.c
> > +++ b/hw/9pfs/virtio-9p-device.c
> > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > Error **errp)> 
> >  v->config_size = sizeof(struct virtio_9p_config) +
> >  strlen(s->fsconf.tag);
> >  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > 
> > -VIRTQUEUE_MAX_SIZE);
> > +VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> >  
> >  }
> > 
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 336f56705c..e5e45262ab 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >  error_setg(errp, "queue size must be non-zero");
> >  return;
> >  
> >  }
> > 
> > -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >  error_setg(errp, "queue size must not exceed %d",
> > 
> > -   VIRTQUEUE_MAX_SIZE);
> > +   VIRTQUEUE_LEGACY_MAX_SIZE);
> > 
> >  return;
> >  
> >  }
> > 
> > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >  }
> >  
> >  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > 
> > -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > +sizeof(struct virtio_blk_config),
> > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> >  s->virtqs = g_new(VirtQueue *, s->num_queues);
> >  for (i = 0; i < s->num_queues; i++) {
> > 
> > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > index 9c0f46815c..5883e3e7db 100644
> > --- a/hw/block/virtio-blk.c
> > +++ b/hw/block/virtio-blk.c
> > @@ -1171,10 +1171,10 @@ static void virtio_blk_device_realize(DeviceState
> > *dev, Error **errp)> 
> >  return;
> >  
> >  }
> >  if (!is_power_of_2(conf->queue_size) ||
> > 
> > -conf->queue_size > VIRTQUEUE_MAX_SIZE) {
> > +conf->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > 
> >  error_setg(errp, "invalid queue-size property (%" PRIu16 "), "
> >  
> > "must be a power of 2 (max %d)",
> > 
> >

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> On Tuesday, 5 October 2021 09:38:53 CEST David Hildenbrand wrote:
> > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > At the moment the maximum transfer size with virtio is limited to 4M
> > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > theoretical possible transfer size of 128M (32k pages) according to the
> > > virtio specs:
> > > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > I'm missing the "why do we care". Can you comment on that?
> 
> Primary motivation is the possibility of improved performance, e.g. in case
> of 9pfs, people can raise the maximum transfer size with the Linux 9p
> client's 'msize' option on guest side (and only on guest side actually). If
> guest performs large chunk I/O, e.g. consider something "useful" like this
> one on guest side:
> 
>   time cat large_file_on_9pfs.dat > /dev/null
> 
> Then there is a noticeable performance increase with higher transfer size
> values. That performance gain is continuous with rising transfer size values,
> but the performance increase obviously shrinks with rising transfer sizes as
> well, as with similar concepts in general like cache sizes, etc.
> 
> Then a secondary motivation is described in reason (2) of patch 2: if the
> transfer size is configurable on guest side (like it is the case with the
> 9pfs 'msize' option), then there is the unpleasant side effect that the
> current virtio limit of 4M is invisible to guest; as this value of 4M is
> simply an arbitrary limit set on QEMU side in the past (probably just
> implementation motivated on QEMU side at that point), i.e. it is not a limit
> specified by the virtio protocol,

According to the spec it's specified, sure enough: vq size limits the
size of indirect descriptors too.
However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
do not enforce it in the driver ...
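
A minimal sketch of the rule being referred to, assuming the split
virtqueue layout where each descriptor is 16 bytes and the length field of
an indirect descriptor gives the table size in bytes. The names
EXAMPLE_DESC_SIZE and indirect_table_fits() are made up for illustration
and are not QEMU or Linux functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One split-ring descriptor: 8 (addr) + 4 (len) + 2 (flags) + 2 (next). */
#define EXAMPLE_DESC_SIZE 16u

/*
 * Return true if an indirect table of table_len_bytes holds a whole number
 * of descriptors and no more of them than the queue size allows.
 */
bool indirect_table_fits(uint32_t table_len_bytes, uint32_t queue_size)
{
    assert(queue_size > 0);
    if (table_len_bytes == 0 || table_len_bytes % EXAMPLE_DESC_SIZE != 0) {
        return false;
    }
    return table_len_bytes / EXAMPLE_DESC_SIZE <= queue_size;
}
```

Under this reading, a table with 1025 entries is rejected for a queue size
of 1024 but would be accepted for a queue size of 32768.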

> nor is the guest made aware of this limit via the virtio protocol at all.
> The consequence with 9pfs would be that if the user tries to go higher than
> 4M, then the system would simply hang with this QEMU error:
> 
>   virtio: too many write descriptors in indirect table
> 
> Now whether this is an issue or not for individual virtio users, depends on 
> whether the individual virtio user already had its own limitation <= 4M 
> enforced on its side.
> 
> Best regards,
> Christian Schoenebeck
>

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 12:46:34PM +0200, Halil Pasic wrote:
> On Tue, 5 Oct 2021 03:53:17 -0400
> "Michael S. Tsirkin"  wrote:
> 
> > > Wouldn't a call from transport code into virtio core
> > > be more handy? What I have in mind is stuff like vhost-user and vdpa. My
> > > understanding is, that for vhost setups where the config is outside qemu,
> > > we probably need a new command that tells the vhost backend what
> > > endianness to use for config. I don't think we can use
> > > VHOST_USER_SET_VRING_ENDIAN because that one is on a virtqueue basis
> > > according to the doc. So for vhost-user and similar we would fire that
> > > command and probably also set the field, while for devices for which
> > > control plane is handled by QEMU we would just set the field.
> > > 
> > > Does that sound about right?  
> > 
> > I'm fine either way, but when would you invoke this?
> > With my idea backends can check the field when get_config
> > is invoked.
> > 
> > As for using this in VHOST, can we maybe re-use SET_FEATURES?
> > 
> > Kind of hacky but nice in that it will actually make existing backends
> > work...
> 
> Basically the equivalent of this patch, just on the vhost interface,
> right? Could work I have to look into it :)

yep

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 01:13:31PM +0200, Cornelia Huck wrote:
> On Tue, Oct 05 2021, Halil Pasic  wrote:
> 
> > On Mon, 4 Oct 2021 05:07:13 -0400
> > "Michael S. Tsirkin"  wrote:
> >> Well we established that we can know. Here's an alternative explanation:
> >
> >
> > I think we established how this should be in the future, where a transport
> > specific mechanism is used to decide are we operating in legacy mode or
> > in modern mode. But with the current QEMU reality, I don't think so.
> > Namely currently the switch native-endian config -> little endian config
> > happens when the VERSION_1 is negotiated, which may happen whenever
> > the VERSION_1 bit is changed, or only when FEATURES_OK is set
> > (vhost-user).
> >
> > This is consistent with device should detect a legacy driver by checking
> > for VERSION_1, which is what the spec currently says.
> >
> > So for transitional we start out with native-endian config. For modern
> > only the config is always LE.
> >
> > The guest can distinguish between a legacy only device and a modern
> > capable device after the revision negotiation. A legacy device would
> > reject the CCW.
> >
> > But both a transitional device and a modern only device would accept
> > a revision > 0. So the guest does not know for ccw.
> 
> Well, for pci I think the driver knows that it is using either legacy or
> modern, no?
> 
> And for ccw, the driver knows at that point in time which revision it
> negotiated, so it should know that a revision > 0 will use LE (and the
> device will obviously know that as well.)
> 
> Or am I misunderstanding what you're getting at?

Exactly what I'm saying.
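
To make the rule concrete, here is a minimal sketch of how the config
byte order follows from the negotiated features: little-endian once
VERSION_1 (feature bit 32) is set, guest-native otherwise. The function and
macro names are hypothetical and only illustrate the decision, they are not
taken from QEMU or Linux:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* VIRTIO_F_VERSION_1 is feature bit 32; the macro name here is made up. */
#define EXAMPLE_F_VERSION_1 (1ULL << 32)

/* Modern (VERSION_1 negotiated): always LE. Legacy: guest-native order. */
bool config_is_little_endian(uint64_t driver_features, bool guest_is_big_endian)
{
    if (driver_features & EXAMPLE_F_VERSION_1) {
        return true;
    }
    return !guest_is_big_endian;
}

/* Decode a 16-bit config field (for example an MTU) from its raw bytes. */
uint16_t config_read16(const uint8_t raw[2], bool little_endian)
{
    assert(raw != NULL);
    if (little_endian) {
        return (uint16_t)(raw[0] | (raw[1] << 8));
    }
    return (uint16_t)((raw[0] << 8) | raw[1]);
}
```

On a big-endian guest, the bytes 0xdc 0x05 decode to 1500 once VERSION_1 is
negotiated and to 0xdc05 before that, which is the kind of mismatch the
validation code runs into when it reads config too early.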

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Michael S. Tsirkin

On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
> On Tuesday, 5 October 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > Raise the maximum possible virtio transfer size to 128M
> > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > more detailed explanation for the reasons of this change.
> > > 
> > > For not breaking any virtio user, all virtio users transition
> > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > of 1k with this commit.
> > > 
> > > On the long-term, each virtio user should subsequently either
> > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > after checking that they support the new value of 32k, or
> > > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > > macro by an appropriate value supported by them.
> > > 
> > > Signed-off-by: Christian Schoenebeck 
> > 
> > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> 
> Does this mean you disagree that on the long-term all virtio users should 
> transition either to the new upper limit of 32k max queue size or introduce 
> their own limit at their end?


depends. if 9pfs is the only one unhappy, we can keep 4k as
the default. it's sure a safe one.

> Independent of the name, and I would appreciate suggestions for an
> adequate macro name here, I still think this new limit should be placed in
> the shared virtio.h file. Because this value is not something invented on
> virtio user side. It rather reflects the theoretical upper limit possible
> with the virtio protocol, which is and will be common for all virtio users.


We can add this to the linux uapi headers, sure.

> > > ---
> > > 
> > >  hw/9pfs/virtio-9p-device.c |  2 +-
> > >  hw/block/vhost-user-blk.c  |  6 +++---
> > >  hw/block/virtio-blk.c  |  6 +++---
> > >  hw/char/virtio-serial-bus.c|  2 +-
> > >  hw/input/virtio-input.c|  2 +-
> > >  hw/net/virtio-net.c| 12 ++--
> > >  hw/scsi/virtio-scsi.c  |  2 +-
> > >  hw/virtio/vhost-user-fs.c  |  6 +++---
> > >  hw/virtio/vhost-user-i2c.c |  2 +-
> > >  hw/virtio/vhost-vsock-common.c |  2 +-
> > >  hw/virtio/virtio-balloon.c |  2 +-
> > >  hw/virtio/virtio-crypto.c  |  2 +-
> > >  hw/virtio/virtio-iommu.c   |  2 +-
> > >  hw/virtio/virtio-mem.c |  2 +-
> > >  hw/virtio/virtio-mmio.c|  4 ++--
> > >  hw/virtio/virtio-pmem.c|  2 +-
> > >  hw/virtio/virtio-rng.c |  3 ++-
> > >  include/hw/virtio/virtio.h | 20 +++-
> > >  18 files changed, 49 insertions(+), 30 deletions(-)
> > > 
> > > diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> > > index cd5d95dd51..9013e7df6e 100644
> > > --- a/hw/9pfs/virtio-9p-device.c
> > > +++ b/hw/9pfs/virtio-9p-device.c
> > > @@ -217,7 +217,7 @@ static void virtio_9p_device_realize(DeviceState *dev,
> > > Error **errp)> 
> > >  v->config_size = sizeof(struct virtio_9p_config) +
> > >  strlen(s->fsconf.tag);
> > >  virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> > > 
> > > -VIRTQUEUE_MAX_SIZE);
> > > +VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> > >  
> > >  }
> > > 
> > > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > > index 336f56705c..e5e45262ab 100644
> > > --- a/hw/block/vhost-user-blk.c
> > > +++ b/hw/block/vhost-user-blk.c
> > > @@ -480,9 +480,9 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >  error_setg(errp, "queue size must be non-zero");
> > >  return;
> > >  
> > >  }
> > > 
> > > -if (s->queue_size > VIRTQUEUE_MAX_SIZE) {
> > > +if (s->queue_size > VIRTQUEUE_LEGACY_MAX_SIZE) {
> > > 
> > >  error_setg(errp, "queue size must not exceed %d",
> > > 
> > > -   VIRTQUEUE_MAX_SIZE);
> > > +   VIRTQUEUE_LEGACY_MAX_SIZE);
> > > 
> > >  return;
> > >  
> > >  }
> > > 
> > > @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState
> > > *dev, Error **errp)> 
> > >  }
> > >  
> > >  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> > > 
> > > -sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
> > > +sizeof(struct virtio_blk_config),
> > > VIRTQUEUE_LEGACY_MAX_SIZE);> 
> > >  s->virtqs = g_new(VirtQueue *, s->num_queues);
> > >  for (i = 0; i < s->num_queues; i++) {
> > > 
> > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> > > index 9c0f46815c..5883e3e7db 100644
> > > --- a/hw/block/virtio-blk.c
> > > +++ b/hw/block/virtio-blk.c
> > > @@ -1171,10 +1171,10 @@ static v

Re: [PATCH v2 08/12] macfb: add common monitor modes supported by the MacOS toolbox ROM

2021-10-05 Thread Mark Cave-Ayland


On 05/10/2021 10:50, Laurent Vivier wrote:


On 04/10/2021 at 23:19, Mark Cave-Ayland wrote:

The monitor modes table is found by experimenting with the Monitors Control
Panel in MacOS and analysing the reads/writes. From this it can be found that
the mode is controlled by writes to the DAFB_MODE_CTRL1 and DAFB_MODE_CTRL2
registers.

Implement the first block of DAFB registers as a register array including the
existing sense register, the newly discovered control registers above, and also
the DAFB_MODE_VADDR1 and DAFB_MODE_VADDR2 registers which are used by NetBSD to
determine the current video mode.

These experiments also show that the offset of the start of video RAM and the
stride can change depending upon the monitor mode, so update
macfb_draw_graphic() and both the BI_MAC_VADDR and BI_MAC_VROW bootinfo for
the q800 machine accordingly.

Finally update macfb_common_realize() so that only the resolution and depth
supported by the display type can be specified on the command line.

Signed-off-by: Mark Cave-Ayland 
Reviewed-by: Laurent Vivier 
---
  hw/display/macfb.c | 124 -
  hw/display/trace-events|   1 +
  hw/m68k/q800.c |  11 ++--
  include/hw/display/macfb.h |  16 -
  4 files changed, 131 insertions(+), 21 deletions(-)

diff --git a/hw/display/macfb.c b/hw/display/macfb.c
index f98bcdec2d..357fe18be5 100644
--- a/hw/display/macfb.c
+++ b/hw/display/macfb.c


...

+static MacFbMode *macfb_find_mode(MacfbDisplayType display_type,
+  uint16_t width, uint16_t height,
+  uint8_t depth)
+{
+MacFbMode *macfb_mode;
+int i;
+
+for (i = 0; i < ARRAY_SIZE(macfb_mode_table); i++) {
+macfb_mode = &macfb_mode_table[i];
+
+if (display_type == macfb_mode->type && width == macfb_mode->width &&
+height == macfb_mode->height && depth == macfb_mode->depth) {
+return macfb_mode;
+}
+}
+
+return NULL;
+}
+


I misunderstood this part when I reviewed v1...

It means you have to provide the monitor type to QEMU to switch from the 
default mode?


Not as such: both the MacOS toolbox ROM and MacOS itself offer a fixed set of 
resolutions and depths based upon the display type. What I've done for now is default 
the display type to VGA since it offers both 640x480 and 800x600 in 1, 2, 4, 8, 16 
and 24-bit colour which should cover the most common use cases of people wanting 
to boot using the MacOS toolbox ROM.


Even if you specify a default on the command line, MacOS still only cares about the 
display type and will allow you to change the resolution and depth dynamically, 
remembering the last resolution and depth across reboots.


During testing I found that having access to the 1152x870 resolution offered by the 
Apple 21" monitor display type was useful to allow larger screen sizes, although only 
up to 8-bit depth so I added a bit of code that will switch from a VGA display type 
to a 21" display type if the graphics resolution is set to 1152x870x8.


Finally if you boot a Linux kernel directly using -kernel then the provided XxYxD is 
placed directly into the relevant bootinfo fields with a VGA display type, unless a 
resolution of 1152x870x8 is specified in which case the 21" display type is used as 
above.



But, as a user, how do we know which modes are allowed with which resolution?

Is it possible to try to set internally the type here according to the resolution?

Could you provide a command line example how to start the q800 with the 
1152x870 resolution?


Sure - simply add "-g 1152x870x8" to your command line. If the -g parameter is 
omitted then the display type will default to VGA.
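
The selection rule described above can be sketched as follows. This is
only an illustration of the behaviour as described in this mail: the enum
and function names are hypothetical and are not the identifiers used in
hw/display/macfb.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical display types; the real macfb code uses its own names. */
typedef enum {
    EXAMPLE_DISPLAY_VGA,
    EXAMPLE_DISPLAY_APPLE_21_COLOR,
} ExampleDisplayType;

/*
 * Default to VGA, and switch to the 21" display type only for the one
 * mode that VGA does not offer: 1152x870 at 8-bit depth.
 */
ExampleDisplayType example_pick_display_type(uint16_t width, uint16_t height,
                                             uint8_t depth)
{
    assert(width > 0 && height > 0 && depth > 0);
    if (width == 1152 && height == 870 && depth == 8) {
        return EXAMPLE_DISPLAY_APPLE_21_COLOR;
    }
    return EXAMPLE_DISPLAY_VGA;
}
```

So "-g 1152x870x8" would map to the 21" type in this sketch, while
"-g 800x600x24" or no -g option at all stays on VGA.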



ATB,

Mark.

Re: [PATCH v6 05/10] ACPI ERST: support for ACPI ERST feature

2021-10-05 Thread Igor Mammedov

On Mon, 4 Oct 2021 16:13:09 -0500
Eric DeVolder  wrote:

> Igor, thanks for the close examination. Inline responses below.
> eric
> 
> On 9/21/21 10:30 AM, Igor Mammedov wrote:
> > On Thu,  5 Aug 2021 18:30:34 -0400
> > Eric DeVolder  wrote:
> >   
> >> This implements a PCI device for ACPI ERST. This implements the
> >> non-NVRAM "mode" of operation for ERST as it is supported by
> >> Linux and Windows.
> >>
> >> Signed-off-by: Eric DeVolder 
> >> ---
> >>   hw/acpi/erst.c   | 750 +++
> >>   hw/acpi/meson.build  |   1 +
> >>   hw/acpi/trace-events |  15 ++
> >>   3 files changed, 766 insertions(+)
> >>   create mode 100644 hw/acpi/erst.c
> >>
> >> diff --git a/hw/acpi/erst.c b/hw/acpi/erst.c
> >> new file mode 100644
> >> index 000..eb4ab34
> >> --- /dev/null
> >> +++ b/hw/acpi/erst.c
> >> @@ -0,0 +1,750 @@
> >> +/*
> >> + * ACPI Error Record Serialization Table, ERST, Implementation
> >> + *
> >> + * ACPI ERST introduced in ACPI 4.0, June 16, 2009.
> >> + * ACPI Platform Error Interfaces : Error Serialization
> >> + *
> >> + * Copyright (c) 2021 Oracle and/or its affiliates.
> >> + *
> >> + * SPDX-License-Identifier: GPL-2.0-or-later
> >> + */
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qapi/error.h"
> >> +#include "hw/qdev-core.h"
> >> +#include "exec/memory.h"
> >> +#include "qom/object.h"
> >> +#include "hw/pci/pci.h"
> >> +#include "qom/object_interfaces.h"
> >> +#include "qemu/error-report.h"
> >> +#include "migration/vmstate.h"
> >> +#include "hw/qdev-properties.h"
> >> +#include "hw/acpi/acpi.h"
> >> +#include "hw/acpi/acpi-defs.h"
> >> +#include "hw/acpi/aml-build.h"
> >> +#include "hw/acpi/bios-linker-loader.h"
> >> +#include "exec/address-spaces.h"
> >> +#include "sysemu/hostmem.h"
> >> +#include "hw/acpi/erst.h"
> >> +#include "trace.h"
> >> +
> >> +/* ACPI 4.0: Table 17-16 Serialization Actions */
> >> +#define ACTION_BEGIN_WRITE_OPERATION 0x0
> >> +#define ACTION_BEGIN_READ_OPERATION  0x1
> >> +#define ACTION_BEGIN_CLEAR_OPERATION 0x2
> >> +#define ACTION_END_OPERATION 0x3
> >> +#define ACTION_SET_RECORD_OFFSET 0x4
> >> +#define ACTION_EXECUTE_OPERATION 0x5
> >> +#define ACTION_CHECK_BUSY_STATUS 0x6
> >> +#define ACTION_GET_COMMAND_STATUS0x7
> >> +#define ACTION_GET_RECORD_IDENTIFIER 0x8
> >> +#define ACTION_SET_RECORD_IDENTIFIER 0x9
> >> +#define ACTION_GET_RECORD_COUNT  0xA
> >> +#define ACTION_BEGIN_DUMMY_WRITE_OPERATION   0xB
> >> +#define ACTION_RESERVED  0xC
> >> +#define ACTION_GET_ERROR_LOG_ADDRESS_RANGE   0xD
> >> +#define ACTION_GET_ERROR_LOG_ADDRESS_LENGTH  0xE
> >> +#define ACTION_GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES 0xF
> >> +#define ACTION_GET_EXECUTE_OPERATION_TIMINGS 0x10
> >> +
> >> +/* ACPI 4.0: Table 17-17 Command Status Definitions */
> >> +#define STATUS_SUCCESS0x00
> >> +#define STATUS_NOT_ENOUGH_SPACE   0x01
> >> +#define STATUS_HARDWARE_NOT_AVAILABLE 0x02
> >> +#define STATUS_FAILED 0x03
> >> +#define STATUS_RECORD_STORE_EMPTY 0x04
> >> +#define STATUS_RECORD_NOT_FOUND   0x05
> >> +
> >> +
> >> +/* UEFI 2.1: Appendix N Common Platform Error Record */
> >> +#define UEFI_CPER_RECORD_MIN_SIZE 128U
> >> +#define UEFI_CPER_RECORD_LENGTH_OFFSET 20U
> >> +#define UEFI_CPER_RECORD_ID_OFFSET 96U
> >> +#define IS_UEFI_CPER_RECORD(ptr) \
> >> +(((ptr)[0] == 'C') && \
> >> + ((ptr)[1] == 'P') && \
> >> + ((ptr)[2] == 'E') && \
> >> + ((ptr)[3] == 'R'))
> >> +#define THE_UEFI_CPER_RECORD_ID(ptr) \
> >> +(*(uint64_t *)(&(ptr)[UEFI_CPER_RECORD_ID_OFFSET]))
> >> +
> >> +/*
> >> + * This implementation is an ACTION (cmd) and VALUE (data)
> >> + * interface consisting of just two 64-bit registers.
> >> + */
> >> +#define ERST_REG_SIZE (16UL)
> >> +#define ERST_ACTION_OFFSET (0UL) /* action (cmd) */
> >> +#define ERST_VALUE_OFFSET  (8UL) /* argument/value (data) */
> >> +
> >> +/*
> >> + * ERST_RECORD_SIZE is the buffer size for exchanging ERST
> >> + * record contents. Thus, it defines the maximum record size.
> >> + * As this is mapped through a PCI BAR, it must be a power of
> >> + * two and larger than UEFI_CPER_RECORD_MIN_SIZE.
> >> + * The backing storage is divided into fixed size "slots",
> >> + * each ERST_RECORD_SIZE in length, and each "slot"
> >> + * storing a single record. No attempt at optimizing storage
> >> + * through compression, compaction, etc is attempted.
> >> + * NOTE that slot 0 is reserved for the backing storage header.
> >> + * Depending upon the size of the backing storage, additional
> >> + * slots will be part of the slot 0 header in order to account
> >> + * for a record_id for each available remaining slot.
> >> + */
> >> +/* 8KiB records, not too small, not too big */
> >> +#define ERST_RECORD_SIZE (8192UL)
> >> +
> >> +#define ACP

Re: [PATCH v2 0/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck

On Tuesday, 5 October 2021 13:19:43 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:10:56PM +0200, Christian Schoenebeck wrote:
> > On Tuesday, 5 October 2021 09:38:53 CEST David Hildenbrand wrote:
> > > On 04.10.21 21:38, Christian Schoenebeck wrote:
> > > > At the moment the maximum transfer size with virtio is limited to 4M
> > > > (1024 * PAGE_SIZE). This series raises this limit to its maximum
> > > > theoretical possible transfer size of 128M (32k pages) according to
> > > > the virtio specs:
> > > > 
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > > 
> > > I'm missing the "why do we care". Can you comment on that?
> > 
> > Primary motivation is the possibility of improved performance, e.g. in
> > case of 9pfs, people can raise the maximum transfer size with the Linux
> > 9p client's 'msize' option on guest side (and only on guest side
> > > actually). If guest performs large chunk I/O, e.g. consider something
> > > "useful" like this one on guest side:
> > > 
> > >   time cat large_file_on_9pfs.dat > /dev/null
> > 
> > > Then there is a noticeable performance increase with higher transfer size
> > values. That performance gain is continuous with rising transfer size
> > values, but the performance increase obviously shrinks with rising
> > transfer sizes as well, as with similar concepts in general like cache
> > sizes, etc.
> > 
> > Then a secondary motivation is described in reason (2) of patch 2: if the
> > transfer size is configurable on guest side (like it is the case with the
> > 9pfs 'msize' option), then there is the unpleasant side effect that the
> > current virtio limit of 4M is invisible to guest; as this value of 4M is
> > > simply an arbitrary limit set on QEMU side in the past (probably just
> > implementation motivated on QEMU side at that point), i.e. it is not a
> > limit specified by the virtio protocol,
> 
> According to the spec it's specified, sure enough: vq size limits the
> size of indirect descriptors too.

In the virtio specs the only hard limit that I see is the aforementioned 32k:

"Queue Size corresponds to the maximum number of buffers in the virtqueue. 
Queue Size value is always a power of 2. The maximum Queue Size value is 
32768. This value is specified in a bus-specific way."
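
Read literally, the quoted rule amounts to a check like the following
minimal sketch. EXAMPLE_SPEC_MAX_QUEUE_SIZE and queue_size_is_valid() are
hypothetical names, and the device_max parameter stands for whatever limit
an individual virtio user chooses at or below the spec maximum:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Maximum Queue Size from the spec text quoted above; the name is made up. */
#define EXAMPLE_SPEC_MAX_QUEUE_SIZE 32768u

/* A queue size is valid if it is a non-zero power of 2 within device_max. */
bool queue_size_is_valid(uint32_t size, uint32_t device_max)
{
    assert(device_max <= EXAMPLE_SPEC_MAX_QUEUE_SIZE);
    if (size == 0 || size > device_max) {
        return false;
    }
    return (size & (size - 1)) == 0;
}
```

A device that keeps the old limit would pass 1024 as device_max, one that
has been audited for larger queues could pass up to 32768.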

> However, ever since commit 44ed8089e991a60d614abe0ee4b9057a28b364e4 we
> do not enforce it in the driver ...

Then there is the current queue size (that you probably mean) which is 
transmitted to guest with whatever virtio was initialized with.

In case of 9p client however the virtio queue size is first initialized with 
some initial hard coded value when the 9p driver is loaded on Linux kernel 
guest side, then when some 9pfs is mounted later on by guest, it may include 
the 'msize' mount option to raise the transfer size, and that's the problem. I 
don't see any way for guest to see that it cannot go above that 4M transfer 
size now.

> > nor is the guest made aware of this limit via the virtio protocol at
> > all. The consequence with 9pfs would be that if the user tries to go
> > higher than 4M, then the system would simply hang with this QEMU error:
> > 
> >   virtio: too many write descriptors in indirect table
> > 
> > Now whether this is an issue or not for individual virtio users, depends
> > on whether the individual virtio user already had its own limitation
> > <= 4M enforced on its side.
> > 
> > Best regards,
> > Christian Schoenebeck

Re: [PATCH v3 2/3] hw/virtio: Acquire RCU read lock in virtqueue_packed_drop_all()

2021-10-05 Thread Stefano Garzarella


On Mon, Oct 04, 2021 at 11:27:12AM +0200, Philippe Mathieu-Daudé wrote:

On 10/4/21 11:23, Stefan Hajnoczi wrote:

On Mon, Sep 06, 2021 at 12:43:17PM +0200, Philippe Mathieu-Daudé wrote:

vring_get_region_caches() must be called with the RCU read lock
acquired. virtqueue_packed_drop_all() does not, and uses the
'caches' pointer. Fix that by using the RCU_READ_LOCK_GUARD()
macro.


Is this a bug that has been encountered, is it a latent bug, a code
cleanup, etc? The impact of this isn't clear but it sounds a little
scary so I wanted to check.


I'll defer to Stefano, but IIUC it is a latent bug discovered
during code audit.


Yep, I confirm this. We discovered it by discussing the documentation in 
a previous series.
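
For readers unfamiliar with the guard pattern, the following is a minimal
stand-alone sketch of a scope-based lock guard. It is not QEMU's
RCU_READ_LOCK_GUARD() implementation: the counter, the helper functions and
the macro are stand-ins invented for illustration, and the sketch relies on
the GCC/Clang cleanup variable attribute.

```c
#include <assert.h>

/* Stand-in for the real RCU read-side nesting depth. */
static int example_rcu_depth;

static void example_rcu_read_lock(void)
{
    example_rcu_depth++;
}

static void example_rcu_read_unlock(void)
{
    assert(example_rcu_depth > 0);
    example_rcu_depth--;
}

/* Cleanup handler: runs when the guard variable goes out of scope. */
static void example_guard_release(int *unused)
{
    (void)unused;
    example_rcu_read_unlock();
}

/* Take the lock now; drop it automatically on every exit from the scope. */
#define EXAMPLE_RCU_READ_LOCK_GUARD()                                     \
    int example_guard                                                     \
        __attribute__((cleanup(example_guard_release), unused)) =         \
        (example_rcu_read_lock(), 0)

/* Returns the nesting depth seen inside the critical section. */
int example_drop_all(int early_exit)
{
    EXAMPLE_RCU_READ_LOCK_GUARD();
    int depth_inside = example_rcu_depth;

    if (early_exit) {
        return depth_inside; /* the guard still unlocks on this path */
    }
    return depth_inside;
}

/* Helper so callers can observe that the lock was released. */
int example_depth(void)
{
    return example_rcu_depth;
}
```

The point of the pattern is that every return path leaves the read-side
critical section, so a pointer such as 'caches' is only used while the lock
is held.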


Thanks,
Stefano

Re: [PATCH v4 06/11] hw/i386: Move vIOMMU uniqueness check into pc.c

2021-10-05 Thread Eric Auger

Hi Jean,

On 10/1/21 7:33 PM, Jean-Philippe Brucker wrote:
> We're about to need this check for a third vIOMMU, virtio-iommu, which
> doesn't inherit X86IOMMUState as it doesn't support IRQ remapping and is
> a virtio device. Move the check into the pre_plug callback to be shared
> by all three vIOMMUs.
>
> Signed-off-by: Jean-Philippe Brucker 
Reviewed-by: Eric Auger 
Tested-by: Eric Auger 

Eric
> ---
>  hw/i386/pc.c| 10 +-
>  hw/i386/x86-iommu.c |  6 --
>  2 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 557d49c9f8..789ccb6ef4 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1367,6 +1367,13 @@ static void pc_virtio_md_pci_unplug(HotplugHandler 
> *hotplug_dev,
>  static void pc_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
>DeviceState *dev, Error **errp)
>  {
> +if (object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE) &&
> +x86_iommu_get_default()) {
> +error_setg(errp, "QEMU does not support multiple vIOMMUs "
> +   "for x86 yet.");
> +return;
> +}
> +
>  if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
>  pc_memory_pre_plug(hotplug_dev, dev, errp);
>  } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
> @@ -1428,7 +1435,8 @@ static HotplugHandler 
> *pc_get_hotplug_handler(MachineState *machine,
>  if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
>  object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
>  object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
> -object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
> +object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI) ||
> +object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
>  return HOTPLUG_HANDLER(machine);
>  }
>  
> diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
> index 86ad03972e..550e551993 100644
> --- a/hw/i386/x86-iommu.c
> +++ b/hw/i386/x86-iommu.c
> @@ -84,12 +84,6 @@ static void x86_iommu_set_default(X86IOMMUState *x86_iommu)
>  {
>  assert(x86_iommu);
>  
> -if (x86_iommu_default) {
> -error_report("QEMU does not support multiple vIOMMUs "
> - "for x86 yet.");
> -exit(1);
> -}
> -
>  x86_iommu_default = x86_iommu;
>  }
>
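
The behavioural change in this patch, reporting the duplicate vIOMMU through an error object from the pre-plug callback instead of calling exit(1), can be illustrated with a simplified model. Everything named `toy_*` below is hypothetical, and the sketch merges the "check" and "remember the default" steps that the real code keeps in separate functions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy machine state: remembers the one vIOMMU that has been plugged. */
struct toy_machine {
    const char *default_iommu;   /* NULL until a vIOMMU is plugged */
};

/* Hypothetical pre-plug hook: reports failure through *errmsg instead of
 * terminating the process, so the caller decides how to handle it. */
static bool toy_pre_plug_iommu(struct toy_machine *m, const char *name,
                               const char **errmsg)
{
    if (m->default_iommu != NULL) {
        *errmsg = "QEMU does not support multiple vIOMMUs for x86 yet.";
        return false;
    }
    m->default_iommu = name;
    return true;
}
```

A second call on the same machine fails and fills in the message, leaving the process running.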

Re: [PATCH v4 03/11] hw/arm/virt: Remove device tree restriction for virtio-iommu

2021-10-05 Thread Eric Auger

Hi Jean,

On 10/1/21 7:33 PM, Jean-Philippe Brucker wrote:
> virtio-iommu is now supported with ACPI VIOT as well as device tree.
> Remove the restriction that prevents from instantiating a virtio-iommu
> device under ACPI.
>
> Reviewed-by: Eric Auger 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  hw/arm/virt.c| 10 ++
>  hw/virtio/virtio-iommu-pci.c |  7 ---
>  2 files changed, 2 insertions(+), 15 deletions(-)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 1d59f0e59f..56e8fc7059 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2561,16 +2561,10 @@ static HotplugHandler 
> *virt_machine_get_hotplug_handler(MachineState *machine,
>  MachineClass *mc = MACHINE_GET_CLASS(machine);
>  
>  if (device_is_dynamic_sysbus(mc, dev) ||
> -   (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM))) {
> +object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
> +object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
>  return HOTPLUG_HANDLER(machine);
>  }
> -if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI)) {
> -VirtMachineState *vms = VIRT_MACHINE(machine);
> -
> -if (!vms->bootinfo.firmware_loaded || !virt_is_acpi_enabled(vms)) {
> -return HOTPLUG_HANDLER(machine);
> -}
> -}
>  return NULL;
>  }
>  
> diff --git a/hw/virtio/virtio-iommu-pci.c b/hw/virtio/virtio-iommu-pci.c
> index 770c286be7..f30eb16cbf 100644
> --- a/hw/virtio/virtio-iommu-pci.c
> +++ b/hw/virtio/virtio-iommu-pci.c
> @@ -48,16 +48,9 @@ static void virtio_iommu_pci_realize(VirtIOPCIProxy 
> *vpci_dev, Error **errp)
>  VirtIOIOMMU *s = VIRTIO_IOMMU(vdev);
>  
>  if (!qdev_get_machine_hotplug_handler(DEVICE(vpci_dev))) {
> -MachineClass *mc = MACHINE_GET_CLASS(qdev_get_machine());
> -
> -error_setg(errp,
> -   "%s machine fails to create iommu-map device tree 
> bindings",
> -   mc->name);
Actually this does not work. To add a hint, *errp needs to be set.
Otherwise, when running through this path you will get

qemu-system-x86_64: ../util/error.c:158: error_append_hint: Assertion
`err && errp != &error_abort && errp != &error_fatal' failed.

Replace the error_append_hint with an error_setg (without the \n).

Thanks

Eric
>  error_append_hint(errp,
>"Check your machine implements a hotplug handler "
>"for the virtio-iommu-pci device\n");
> -error_append_hint(errp, "Check the guest is booted without FW or 
> with "
> -  "-no-acpi\n");
>  return;
>  }
>  for (int i = 0; i < s->nb_reserved_regions; i++) {
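
The ordering constraint pointed out in the review (a hint can only be appended to an error that has already been set) can be shown with a toy error object. This is only a sketch: `struct toy_error` and the `toy_error_*` functions are invented for illustration and are much simpler than QEMU's Error API, and the sketch returns -1 where the real error_append_hint() would hit the assertion quoted above.

```c
#include <stdio.h>

/* Toy error object; QEMU's real Error type is richer than this. */
struct toy_error {
    int  is_set;
    char msg[128];
    char hint[128];
};

/* Set the primary error message and clear any previous hint. */
static void toy_error_set(struct toy_error *err, const char *msg)
{
    err->is_set = 1;
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
    err->hint[0] = '\0';
}

/* Returns 0 on success, -1 if no error has been set yet. */
static int toy_error_append_hint(struct toy_error *err, const char *hint)
{
    if (!err->is_set) {
        return -1;
    }
    snprintf(err->hint, sizeof(err->hint), "%s", hint);
    return 0;
}
```

Removing the error_setg() call while keeping the hint, as the hunk does, corresponds to calling `toy_error_append_hint()` on an unset error.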

Re: [RFC PATCH 1/1] virtio: write back features before verify

2021-10-05 Thread Halil Pasic

On Tue, 05 Oct 2021 13:13:31 +0200
Cornelia Huck  wrote:

> On Tue, Oct 05 2021, Halil Pasic  wrote:
> 
> > On Mon, 4 Oct 2021 05:07:13 -0400
> > "Michael S. Tsirkin"  wrote:  
> >> Well we established that we can know. Here's an alternative explanation:  
> >
> >
> > I think we established how this should be in the future, where a transport
> > specific mechanism is used to decide whether we are operating in legacy mode or
> > in modern mode. But with the current QEMU reality, I don't think so.
> > Namely currently the switch native-endian config -> little endian config
> > happens when the VERSION_1 is negotiated, which may happen whenever
> > the VERSION_1 bit is changed, or only when FEATURES_OK is set
> > (vhost-user).
> >
> > This is consistent with device should detect a legacy driver by checking
> > for VERSION_1, which is what the spec currently says.
> >
> > So for transitional we start out with native-endian config. For modern
> > only the config is always LE.
> >
> > The guest can distinguish between a legacy only device and a modern
> > capable device after the revision negotiation. A legacy device would
> > reject the CCW.
> >
> > But both a transitional device and a modern only device would accept
> > a revision > 0. So the guest does not know for ccw.  
> 
> Well, for pci I think the driver knows that it is using either legacy or
> modern, no?

It is mighty complicated. virtio-blk-pci-non-transitional and 
virtio-net-pci-non-transitional will give you BE, but 
virtio-crypto-pci, which is also non-transitional, will get you LE
before VERSION_1 is set (because virtio-crypto uses stl_le_p()). That is
a fact.

The deal is that virtio-blk and virtio-net were written with
transitional in mind, and the config code is the same for transitional and
non-transitional.

That is how things are now. With the QEMU changes things will be simpler.

> 
> And for ccw, the driver knows at that point in time which revision it
> negotiated, so it should know that a revision > 0 will use LE (and the
> device will obviously know that as well.)

With the future changes in QEMU, yes. Without these changes, no. Without
these changes we get BE when the guest code thinks it is going to get
LE. That is what causes the regression.

The commit message for this patch is written from the perspective of
right now, and not from the perspective of future changes.

Or can you hack up a guest patch that looks at the revision, figures out
what endianness the early config access is in, and does the right thing?

I don't think so. I tried to explain why that is impossible. Because
that would be preferable to messing with the device and introducing
another exit. 

> 
> Or am I misunderstanding what you're getting at?
> 

Probably. I'm talking about the state before "do transport specific legacy
detection in the device instead of looking at VERSION_1"; you are probably
talking about the post-state. If we had this new behavior for all relevant
hypervisors then we wouldn't need to do a thing in the guest. The current
code would work like a charm.

Does that answer your question?

Regards,
Halil
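
To make the BE/LE mismatch concrete: the same four config bytes decode to different values depending on which byte order the reader assumes. A minimal sketch, with helper names chosen for illustration (they are not QEMU or kernel functions):

```c
#include <stdint.h>

/* Read a 32-bit field from raw config bytes as little-endian. */
static uint32_t load32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read the same bytes as big-endian (the native order on s390x). */
static uint32_t load32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

If the device stores 512 in big-endian order (bytes 00 00 02 00) and the guest reads the field as little-endian, it sees 0x00020000 instead of 0x200.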

Re: [PATCH v2 2/3] virtio: increase VIRTQUEUE_MAX_SIZE to 32k

2021-10-05 Thread Christian Schoenebeck

On Dienstag, 5. Oktober 2021 13:24:36 CEST Michael S. Tsirkin wrote:
> On Tue, Oct 05, 2021 at 01:17:59PM +0200, Christian Schoenebeck wrote:
> > On Dienstag, 5. Oktober 2021 09:16:07 CEST Michael S. Tsirkin wrote:
> > > On Mon, Oct 04, 2021 at 09:38:08PM +0200, Christian Schoenebeck wrote:
> > > > Raise the maximum possible virtio transfer size to 128M
> > > > (more precisely: 32k * PAGE_SIZE). See previous commit for a
> > > > more detailed explanation for the reasons of this change.
> > > > 
> > > > For not breaking any virtio user, all virtio users transition
> > > > to using the new macro VIRTQUEUE_LEGACY_MAX_SIZE instead of
> > > > VIRTQUEUE_MAX_SIZE, so they are all still using the old value
> > > > of 1k with this commit.
> > > > 
> > > > On the long-term, each virtio user should subsequently either
> > > > switch from VIRTQUEUE_LEGACY_MAX_SIZE to VIRTQUEUE_MAX_SIZE
> > > > after checking that they support the new value of 32k, or
> > > > otherwise they should replace the VIRTQUEUE_LEGACY_MAX_SIZE
> > > > macro by an appropriate value supported by them.
> > > > 
> > > > Signed-off-by: Christian Schoenebeck 
> > > 
> > > I don't think we need this. Legacy isn't descriptive either.  Just leave
> > > VIRTQUEUE_MAX_SIZE alone, and come up with a new name for 32k.
> > 
> > Does this mean you disagree that on the long-term all virtio users should
> > transition either to the new upper limit of 32k max queue size or
> > introduce their own limit at their end?
> 
> depends. if 9pfs is the only one unhappy, we can keep 4k as
> the default. it's sure a safe one.
> 
> > Independent of the name, and I would appreciate suggestions for an
> > adequate macro name here, I still think this new limit should be placed in
> > the shared virtio.h file. Because this value is not something invented on
> > virtio user side. It rather reflects the theoretical upper limit
> > possible with the virtio protocol, which is and will be common for all
> > virtio users.
> We can add this to the linux uapi headers, sure.

Well, then I'll wait a few days, and if nobody else cares about this issue, 
I'll just hard-code 32k on the 9pfs side exclusively in v3 for now and that's 
it.

Best regards,
Christian Schoenebeck

Re: Deprecate the ppc405 boards in QEMU?

2021-10-05 Thread BALATON Zoltan


On Tue, 5 Oct 2021, Thomas Huth wrote:

On 05/10/2021 10.07, Thomas Huth wrote:

On 05/10/2021 10.05, Alexey Kardashevskiy wrote:

[...]

What is so special about taihu?


taihu is the other 405 board defined in hw/ppc/ppc405_boards.c (which I 
suggested to deprecate now)


I've now also played with the u-boot sources a little bit, and with some bit 
of tweaking, it's indeed possible to compile the old taihu board there. 
However, it does not really work with QEMU anymore, it immediately triggers 
an assert():


$ qemu-system-ppc -M taihu -bios u-boot.bin -serial null -serial mon:stdio
**
ERROR:accel/tcg/tcg-accel-ops.c:79:tcg_handle_interrupt: assertion failed: 
(qemu_mutex_iothread_locked())

Aborted (core dumped)


Maybe it's similar to this: 2025fc6766ab25501e0041c564c44bb0f7389774 The 
helper_load_dcr() and helper_store_dcr() in target/ppc/timebase_helper.c 
seem to lock/unlock the iothread but I'm not sure if that's necessary. 
Also not sure why this does not happen with 460ex but that maybe uses 
different code.


Going back to QEMU v2.3.0, I can see at least a little bit of output, but it 
then also triggers an assert() during DRAM initialization:


$ qemu-system-ppc -M taihu -bios u-boot.bin -serial null -serial mon:stdio

Reset PowerPC core

U-Boot 2014.10-rc2-00123-g461be2f96e-dirty (Oct 05 2021 - 10:02:56)

CPU:   AMCC PowerPC 405EP Rev. B at 770 MHz (PLB=256 OPB=128 EBC=128)
  I2C boot EEPROM disabled
  Internal PCI arbiter enabled
  16 KiB I-Cache 16 KiB D-Cache
Board: Taihu - AMCC PPC405EP Evaluation Board
I2C:   ready
DRAM:  qemu-system-ppc: memory.c:1693: memory_region_del_subregion: Assertion 
`subregion->container == mr' failed.

Aborted (core dumped)

Not sure if this ever worked in QEMU, maybe in the early 0.15 time, but that 
version of QEMU also does not compile easily anymore on modern systems. So 
I'm afraid, getting this into a workable shape again will take a lot of time. 
At least I'll stop my efforts here now.


Do you have this u-boot binary somewhere just for others who want to try it?

Regards,
BALATON Zoltan

Re: Deprecate the ppc405 boards in QEMU? (was: [PATCH v3 4/7] MAINTAINERS: Orphan obscure ppc platforms)

2021-10-05 Thread BALATON Zoltan


On Tue, 5 Oct 2021, Cédric Le Goater wrote:

On 10/5/21 08:18, Alexey Kardashevskiy wrote:

On 05/10/2021 15:44, Christophe Leroy wrote:

Le 05/10/2021 à 02:48, David Gibson a écrit :

On Fri, Oct 01, 2021 at 04:18:49PM +0200, Thomas Huth wrote:

On 01/10/2021 15.04, Christophe Leroy wrote:

Le 01/10/2021 à 14:04, Thomas Huth a écrit :

On 01/10/2021 13.12, Peter Maydell wrote:

On Fri, 1 Oct 2021 at 10:43, Thomas Huth  wrote:

Nevertheless, as long as nobody has a hint where to find that
ppc405_rom.bin, I think both boards are pretty useless in QEMU (as 
far as I

can see, they do not work without the bios at all, so it's
also not possible
to use a Linux image with the "-kernel" CLI option directly).


It is at least in theory possible to run bare-metal code on
either board, by passing either a pflash or a bios argument.


True. I did some more research, and seems like there was once
support for those boards in u-boot, but it got removed there a
couple of years ago already:

https://gitlab.com/qemu-project/u-boot/-/commit/98f705c9cefdf

https://gitlab.com/qemu-project/u-boot/-/commit/b147ff2f37d5b

https://gitlab.com/qemu-project/u-boot/-/commit/7514037bcdc37


But I agree that there seem to be no signs of anybody actually
successfully using these boards for anything, so we should
deprecate-and-delete them.


Yes, let's mark them as deprecated now ... if someone still uses
them and speaks up, we can still revert the deprecation again.


I really would like to be able to use them to validate Linux Kernel
changes, hence looking for that missing BIOS.

If we remove ppc405 from QEMU, we won't be able to do any regression
tests of Linux Kernel on those processors.


If you/someone managed to compile an old version of u-boot for one of 
these
two boards, so that we would finally have something for regression 
testing,

we can of course also keep the boards in QEMU...


I can see that it would be useful for some cases, but unless someone
volunteers to track down the necessary firmware and look after it, I
think we do need to deprecate it - I certainly don't have the capacity
to look into this.



I will look at it, please allow me a few weeks though.


Well, building it was not hard but now I'd like to know what board QEMU 
actually emulates, there are way too many codenames and PVRs.


yes. We should try to reduce the list below. Deprecating embedded machines
is one way.


Why should we reduce that list? It's good to have different cpu options 
when one wants to test code for different PPC versions (maybe also in user 
mode) or just to have a quick list of these at one place.


Regards,
BALATON Zoltan

Re: [PATCH 13/13] virtiofsd, seccomp: Add clock_nanosleep() to allow list

2021-10-05 Thread Stefan Hajnoczi

On Thu, Sep 30, 2021 at 11:30:37AM -0400, Vivek Goyal wrote:
> g_usleep() calls nanosleep() and that now seems to call clock_nanosleep()
> syscall. Now these patches are making use of g_usleep(). So add
> clock_nanosleep() to list of allowed syscalls.
> 
> Signed-off-by: Vivek Goyal 
> ---
>  tools/virtiofsd/passthrough_seccomp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/virtiofsd/passthrough_seccomp.c 
> b/tools/virtiofsd/passthrough_seccomp.c
> index cd24b40b78..03080806c0 100644
> --- a/tools/virtiofsd/passthrough_seccomp.c
> +++ b/tools/virtiofsd/passthrough_seccomp.c
> @@ -117,6 +117,7 @@ static const int syscall_allowlist[] = {
>  SCMP_SYS(writev),
>  SCMP_SYS(umask),
>  SCMP_SYS(nanosleep),
> +SCMP_SYS(clock_nanosleep),

This patch can be dropped once sleep has been replaced by a condvar.



Re: [PATCH 12/13] virtiofsd: Implement blocking posix locks

2021-10-05 Thread Stefan Hajnoczi

On Thu, Sep 30, 2021 at 11:30:36AM -0400, Vivek Goyal wrote:
> As of now we don't support fcntl(F_SETLKW) and if we see one, we return
> -EOPNOTSUPP.
> 
> Change that by accepting these requests and returning a reply
> immediately asking caller to wait. Once lock is available, send a
> notification to the waiter indicating lock is available.
> 
> In response to lock request, we are returning error value as "1", which
> signals to client to queue the lock request internally and later client
> will get a notification which will signal lock is taken (or error). And
> then fuse client should wake up the guest process.
> 
> Signed-off-by: Vivek Goyal 
> Signed-off-by: Ioannis Angelakopoulos 
> ---
>  tools/virtiofsd/fuse_lowlevel.c  | 37 -
>  tools/virtiofsd/fuse_lowlevel.h  | 26 
>  tools/virtiofsd/fuse_virtio.c| 50 ---
>  tools/virtiofsd/passthrough_ll.c | 70 
>  4 files changed, 167 insertions(+), 16 deletions(-)
> 
> diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
> index e4679c73ab..2e7f4b786d 100644
> --- a/tools/virtiofsd/fuse_lowlevel.c
> +++ b/tools/virtiofsd/fuse_lowlevel.c
> @@ -179,8 +179,8 @@ int fuse_send_reply_iov_nofree(fuse_req_t req, int error, 
> struct iovec *iov,
>  .unique = req->unique,
>  .error = error,
>  };
> -
> -if (error <= -1000 || error > 0) {
> +/* error = 1 has been used to signal client to wait for notificaiton */
> +if (error <= -1000 || error > 1) {
>  fuse_log(FUSE_LOG_ERR, "fuse: bad error value: %i\n", error);
>  out.error = -ERANGE;
>  }
> @@ -290,6 +290,11 @@ int fuse_reply_err(fuse_req_t req, int err)
>  return send_reply(req, -err, NULL, 0);
>  }
>  
> +int fuse_reply_wait(fuse_req_t req)
> +{
> +return send_reply(req, 1, NULL, 0);
> +}
> +
>  void fuse_reply_none(fuse_req_t req)
>  {
>  fuse_free_req(req);
> @@ -2165,6 +2170,34 @@ static void do_destroy(fuse_req_t req, fuse_ino_t 
> nodeid,
>  send_reply_ok(req, NULL, 0);
>  }
>  
> +static int send_notify_iov(struct fuse_session *se, int notify_code,
> +   struct iovec *iov, int count)
> +{
> +struct fuse_out_header out;
> +if (!se->got_init) {
> +return -ENOTCONN;
> +}
> +out.unique = 0;
> +out.error = notify_code;
> +iov[0].iov_base = &out;
> +iov[0].iov_len = sizeof(struct fuse_out_header);
> +return fuse_send_msg(se, NULL, iov, count);
> +}
> +
> +int fuse_lowlevel_notify_lock(struct fuse_session *se, uint64_t unique,
> +  int32_t error)
> +{
> +struct fuse_notify_lock_out outarg = {0};
> +struct iovec iov[2];
> +
> +outarg.unique = unique;
> +outarg.error = -error;
> +
> +iov[1].iov_base = &outarg;
> +iov[1].iov_len = sizeof(outarg);
> +return send_notify_iov(se, FUSE_NOTIFY_LOCK, iov, 2);
> +}
> +
>  int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
> off_t offset, struct fuse_bufvec *bufv)
>  {
> diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
> index c55c0ca2fc..64624b48dc 100644
> --- a/tools/virtiofsd/fuse_lowlevel.h
> +++ b/tools/virtiofsd/fuse_lowlevel.h
> @@ -1251,6 +1251,22 @@ struct fuse_lowlevel_ops {
>   */
>  int fuse_reply_err(fuse_req_t req, int err);
>  
> +/**
> + * Ask caller to wait for lock.
> + *
> + * Possible requests:
> + *   setlkw
> + *
> + * If caller sends a blocking lock request (setlkw), then reply to caller
> + * that wait for lock to be available. Once lock is available caller will

I can't parse the first sentence.

s/that wait for lock to be available/that waiting for the lock is
necessary/?

> + * receive a notification with request's unique id. Notification will
> + * carry info whether lock was successfully obtained or not.
> + *
> + * @param req request handle
> + * @return zero for success, -errno for failure to send reply
> + */
> +int fuse_reply_wait(fuse_req_t req);
> +
>  /**
>   * Don't send reply
>   *
> @@ -1685,6 +1701,16 @@ int fuse_lowlevel_notify_delete(struct fuse_session 
> *se, fuse_ino_t parent,
>  int fuse_lowlevel_notify_store(struct fuse_session *se, fuse_ino_t ino,
> off_t offset, struct fuse_bufvec *bufv);
>  
> +/**
> + * Notify event related to previous lock request
> + *
> + * @param se the session object
> + * @param unique the unique id of the request which requested setlkw
> + * @param error zero for success, -errno for the failure
> + */
> +int fuse_lowlevel_notify_lock(struct fuse_session *se, uint64_t unique,
> +  int32_t error);
> +
>  /*
>   * Utility functions
>   */
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index a87e88e286..bb2d4456fc 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -273,6 +273,23 @@ static void vq_send_element(struct fv
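
The relaxed range check in fuse_send_reply_iov_nofree() quoted above can be isolated into a small function to see which values pass. This is a sketch mirroring the patched condition; the function name is made up for illustration and does not exist in virtiofsd:

```c
#include <errno.h>

/* Mirrors the patched range check: 0 and negative errno values above -1000
 * pass through, 1 is the "wait for notification" reply, and anything else
 * is replaced by -ERANGE. */
static int toy_checked_reply_error(int error)
{
    if (error <= -1000 || error > 1) {
        return -ERANGE;
    }
    return error;
}
```

Before the patch the upper bound was `error > 0`, so the value 1 used by fuse_reply_wait() would have been rewritten to -ERANGE.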

Re: Deprecate the ppc405 boards in QEMU? (was: [PATCH v3 4/7] MAINTAINERS: Orphan obscure ppc platforms)

2021-10-05 Thread Thomas Huth


On 05/10/2021 14.20, BALATON Zoltan wrote:

On Tue, 5 Oct 2021, Cédric Le Goater wrote:

On 10/5/21 08:18, Alexey Kardashevskiy wrote:

On 05/10/2021 15:44, Christophe Leroy wrote:

Le 05/10/2021 à 02:48, David Gibson a écrit :

On Fri, Oct 01, 2021 at 04:18:49PM +0200, Thomas Huth wrote:

On 01/10/2021 15.04, Christophe Leroy wrote:

Le 01/10/2021 à 14:04, Thomas Huth a écrit :

On 01/10/2021 13.12, Peter Maydell wrote:

On Fri, 1 Oct 2021 at 10:43, Thomas Huth  wrote:

Nevertheless, as long as nobody has a hint where to find that
ppc405_rom.bin, I think both boards are pretty useless in QEMU (as 
far as I

can see, they do not work without the bios at all, so it's
also not possible
to use a Linux image with the "-kernel" CLI option directly).


It is at least in theory possible to run bare-metal code on
either board, by passing either a pflash or a bios argument.


True. I did some more research, and seems like there was once
support for those boards in u-boot, but it got removed there a
couple of years ago already:

https://gitlab.com/qemu-project/u-boot/-/commit/98f705c9cefdf

https://gitlab.com/qemu-project/u-boot/-/commit/b147ff2f37d5b

https://gitlab.com/qemu-project/u-boot/-/commit/7514037bcdc37


But I agree that there seem to be no signs of anybody actually
successfully using these boards for anything, so we should
deprecate-and-delete them.


Yes, let's mark them as deprecated now ... if someone still uses
them and speaks up, we can still revert the deprecation again.


I really would like to be able to use them to validate Linux Kernel
changes, hence looking for that missing BIOS.

If we remove ppc405 from QEMU, we won't be able to do any regression
tests of Linux Kernel on those processors.


If you/someone managed to compile an old version of u-boot for one of 
these
two boards, so that we would finally have something for regression 
testing,

we can of course also keep the boards in QEMU...


I can see that it would be useful for some cases, but unless someone
volunteers to track down the necessary firmware and look after it, I
think we do need to deprecate it - I certainly don't have the capacity
to look into this.



I will look at it, please allow me a few weeks though.


Well, building it was not hard but now I'd like to know what board QEMU 
actually emulates, there are way too many codenames and PVRs.


yes. We should try to reduce the list below. Deprecating embedded machines
is one way.


Why should we reduce that list? It's good to have different cpu options when 
one wants to test code for different PPC versions (maybe also in user mode) 
or just to have a quick list of these at one place.


I think there are many CPUs in that list which cannot be used with any 
board, some of them might be also in a very incomplete state. So presenting 
such a big list to the users is confusing and might create wrong 
expectations. It would be good to remove at least the CPUs which are really 
completely useless.


 Thomas

Re: Deprecate the ppc405 boards in QEMU?

2021-10-05 Thread Thomas Huth


On 05/10/2021 14.17, BALATON Zoltan wrote:

On Tue, 5 Oct 2021, Thomas Huth wrote:

On 05/10/2021 10.07, Thomas Huth wrote:

On 05/10/2021 10.05, Alexey Kardashevskiy wrote:

[...]

What is so special about taihu?


taihu is the other 405 board defined in hw/ppc/ppc405_boards.c (which I 
suggested to deprecate now)


I've now also played with the u-boot sources a little bit, and with some 
bit of tweaking, it's indeed possible to compile the old taihu board 
there. However, it does not really work with QEMU anymore, it immediately 
triggers an assert():


$ qemu-system-ppc -M taihu -bios u-boot.bin -serial null -serial mon:stdio
**
ERROR:accel/tcg/tcg-accel-ops.c:79:tcg_handle_interrupt: assertion failed: 
(qemu_mutex_iothread_locked())

Aborted (core dumped)


Maybe it's similar to this: 2025fc6766ab25501e0041c564c44bb0f7389774 The 
helper_load_dcr() and helper_store_dcr() in target/ppc/timebase_helper.c 
seem to lock/unlock the iothread but I'm not sure if that's necessary. Also 
not sure why this does not happen with 460ex but that maybe uses different 
code.


It's rather the other way round, the locking is missing here instead. I can 
get the serial output with the current QEMU when I add the following patch 
(not sure whether that's the right spot, though):


diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index f5d012f860..bb57f1c9ed 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -336,6 +336,8 @@ void store_40x_dbcr0(CPUPPCState *env, uint32_t val)
 {
 PowerPCCPU *cpu = env_archcpu(env);

+qemu_mutex_lock_iothread();
+
 switch ((val >> 28) & 0x3) {
 case 0x0:
 /* No action */
@@ -353,6 +355,8 @@ void store_40x_dbcr0(CPUPPCState *env, uint32_t val)
 ppc40x_system_reset(cpu);
 break;
 }
+
+qemu_mutex_unlock_iothread();
 }

 /* PowerPC 40x internal IRQ controller */


Going back to QEMU v2.3.0, I can see at least a little bit of output, but 
it then also triggers an assert() during DRAM initialization:


$ qemu-system-ppc -M taihu -bios u-boot.bin -serial null -serial mon:stdio

Reset PowerPC core

U-Boot 2014.10-rc2-00123-g461be2f96e-dirty (Oct 05 2021 - 10:02:56)

CPU:   AMCC PowerPC 405EP Rev. B at 770 MHz (PLB=256 OPB=128 EBC=128)
  I2C boot EEPROM disabled
  Internal PCI arbiter enabled
  16 KiB I-Cache 16 KiB D-Cache
Board: Taihu - AMCC PPC405EP Evaluation Board
I2C:   ready
DRAM:  qemu-system-ppc: memory.c:1693: memory_region_del_subregion: 
Assertion `subregion->container == mr' failed.

Aborted (core dumped)

Not sure if this ever worked in QEMU, maybe in the early 0.15 time, but 
that version of QEMU also does not compile easily anymore on modern 
systems. So I'm afraid, getting this into a workable shape again will take 
a lot of time. At least I'll stop my efforts here now.


Do you have this u-boot binary somewhere just for others who want to try it?


FWIW:
http://people.redhat.com/~thuth/data/u-boot-taihu.bin

 Thomas
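
The failure mode addressed by the store_40x_dbcr0() diff in this message (a callee that requires a lock, reached from a helper that does not hold it) can be modelled in a few lines. The sketch below is a toy: the "lock" is a boolean flag, all `toy_*` names are invented, and the callee returns false where the real code would trip assert(qemu_mutex_iothread_locked()).

```c
#include <stdbool.h>

/* Toy stand-in for the iothread mutex; only tracks whether it is held. */
static bool toy_lock_held;

static void toy_lock(void)   { toy_lock_held = true; }
static void toy_unlock(void) { toy_lock_held = false; }

/* Stand-in for a callee that requires the lock to be held. */
static bool toy_handle_interrupt(void)
{
    return toy_lock_held;
}

/* Helper without locking: the callee's precondition is violated. */
static bool toy_store_dbcr0_unlocked(void)
{
    return toy_handle_interrupt();
}

/* Helper with the lock taken around the call, as in the proposed diff. */
static bool toy_store_dbcr0_locked(void)
{
    toy_lock();
    bool ok = toy_handle_interrupt();
    toy_unlock();
    return ok;
}
```

Only the locked variant satisfies the callee's precondition, and it leaves the lock released afterwards.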

Re: [PATCH 08/13] virtiofsd: Create a notification queue

2021-10-05 Thread Vivek Goyal

On Tue, Oct 05, 2021 at 09:14:14AM +0100, Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 05:01:07PM -0400, Vivek Goyal wrote:
> > On Mon, Oct 04, 2021 at 03:30:38PM +0100, Stefan Hajnoczi wrote:
> > > On Thu, Sep 30, 2021 at 11:30:32AM -0400, Vivek Goyal wrote:
> > > > Add a notification queue which will be used to send async notifications
> > > > for file lock availability.
> > > > 
> > > > Signed-off-by: Vivek Goyal 
> > > > Signed-off-by: Ioannis Angelakopoulos 
> > > > ---
> > > >  hw/virtio/vhost-user-fs-pci.c |  4 +-
> > > >  hw/virtio/vhost-user-fs.c | 62 +--
> > > >  include/hw/virtio/vhost-user-fs.h |  2 +
> > > >  tools/virtiofsd/fuse_i.h  |  1 +
> > > >  tools/virtiofsd/fuse_virtio.c | 70 +++
> > > >  5 files changed, 116 insertions(+), 23 deletions(-)
> > > > 
> > > > diff --git a/hw/virtio/vhost-user-fs-pci.c 
> > > > b/hw/virtio/vhost-user-fs-pci.c
> > > > index 2ed8492b3f..cdb9471088 100644
> > > > --- a/hw/virtio/vhost-user-fs-pci.c
> > > > +++ b/hw/virtio/vhost-user-fs-pci.c
> > > > @@ -41,8 +41,8 @@ static void vhost_user_fs_pci_realize(VirtIOPCIProxy 
> > > > *vpci_dev, Error **errp)
> > > >  DeviceState *vdev = DEVICE(&dev->vdev);
> > > >  
> > > >  if (vpci_dev->nvectors == DEV_NVECTORS_UNSPECIFIED) {
> > > > -/* Also reserve config change and hiprio queue vectors */
> > > > -vpci_dev->nvectors = dev->vdev.conf.num_request_queues + 2;
> > > > +/* Also reserve config change, hiprio and notification queue 
> > > > vectors */
> > > > +vpci_dev->nvectors = dev->vdev.conf.num_request_queues + 3;
> > > >  }
> > > >  
> > > >  qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> > > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > > > index d1efbc5b18..6bafcf0243 100644
> > > > --- a/hw/virtio/vhost-user-fs.c
> > > > +++ b/hw/virtio/vhost-user-fs.c
> > > > @@ -31,6 +31,7 @@ static const int user_feature_bits[] = {
> > > >  VIRTIO_F_NOTIFY_ON_EMPTY,
> > > >  VIRTIO_F_RING_PACKED,
> > > >  VIRTIO_F_IOMMU_PLATFORM,
> > > > +VIRTIO_FS_F_NOTIFICATION,
> > > >  
> > > >  VHOST_INVALID_FEATURE_BIT
> > > >  };
> > > > @@ -147,7 +148,7 @@ static void vuf_handle_output(VirtIODevice *vdev, 
> > > > VirtQueue *vq)
> > > >   */
> > > >  }
> > > >  
> > > > -static void vuf_create_vqs(VirtIODevice *vdev)
> > > > +static void vuf_create_vqs(VirtIODevice *vdev, bool notification_vq)
> > > >  {
> > > >  VHostUserFS *fs = VHOST_USER_FS(vdev);
> > > >  unsigned int i;
> > > > @@ -155,6 +156,15 @@ static void vuf_create_vqs(VirtIODevice *vdev)
> > > >  /* Hiprio queue */
> > > >  fs->hiprio_vq = virtio_add_queue(vdev, fs->conf.queue_size,
> > > >   vuf_handle_output);
> > > > +/*
> > > > + * Notification queue. Feature negotiation happens later. So at 
> > > > this
> > > > + * point of time we don't know if driver will use notification 
> > > > queue
> > > > + * or not.
> > > > + */
> > > > +if (notification_vq) {
> > > > +fs->notification_vq = virtio_add_queue(vdev, 
> > > > fs->conf.queue_size,
> > > > +   vuf_handle_output);
> > > > +}
> > > >  
> > > >  /* Request queues */
> > > >  fs->req_vqs = g_new(VirtQueue *, fs->conf.num_request_queues);
> > > > @@ -163,8 +173,12 @@ static void vuf_create_vqs(VirtIODevice *vdev)
> > > >vuf_handle_output);
> > > >  }
> > > >  
> > > > -/* 1 high prio queue, plus the number configured */
> > > > -fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
> > > > +/* 1 high prio queue, 1 notification queue plus the number 
> > > > configured */
> > > > +if (notification_vq) {
> > > > +fs->vhost_dev.nvqs = 2 + fs->conf.num_request_queues;
> > > > +} else {
> > > > +fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
> > > > +}
> > > >  fs->vhost_dev.vqs = g_new0(struct vhost_virtqueue, 
> > > > fs->vhost_dev.nvqs);
> > > >  }
> > > >  
> > > > @@ -176,6 +190,11 @@ static void vuf_cleanup_vqs(VirtIODevice *vdev)
> > > >  virtio_delete_queue(fs->hiprio_vq);
> > > >  fs->hiprio_vq = NULL;
> > > >  
> > > > +if (fs->notification_vq) {
> > > > +virtio_delete_queue(fs->notification_vq);
> > > > +}
> > > > +fs->notification_vq = NULL;
> > > > +
> > > >  for (i = 0; i < fs->conf.num_request_queues; i++) {
> > > >  virtio_delete_queue(fs->req_vqs[i]);
> > > >  }
> > > > @@ -194,9 +213,43 @@ static uint64_t vuf_get_features(VirtIODevice 
> > > > *vdev,
> > > >  {
> > > >  VHostUserFS *fs = VHOST_USER_FS(vdev);
> > > >  
> > > > +virtio_add_feature(&features, VIRTIO_FS_F_NOTIFICATION);
> > > > +
> > > >  return vhost_get_features(&fs->vhost_dev, user_feature_bits, 
> > > > features);
> > > >  }
> > > >  
> > > > +static void vuf_set_fea
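
The queue and vector accounting in the hunks quoted above reduces to two small formulas. A sketch with hypothetical helper names, assuming the counts are exactly as written in the patch:

```c
#include <stdbool.h>

/* Number of vhost virtqueues: one hiprio queue, an optional notification
 * queue, plus the configured request queues. */
static unsigned int toy_nvqs(unsigned int num_request_queues,
                             bool notification_vq)
{
    return (notification_vq ? 2u : 1u) + num_request_queues;
}

/* Default vector count from the vhost-user-fs-pci hunk: request queues plus
 * config change, hiprio and notification queue vectors. */
static unsigned int toy_nvectors(unsigned int num_request_queues)
{
    return num_request_queues + 3u;
}
```

Note that the vector count reserves a slot for the notification queue unconditionally, while the virtqueue count depends on whether the queue is actually created.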

Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Stefan Hajnoczi

On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> variable per virtio user.

virtio user == virtio device model?

> 
> Reasons:
> 
> (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> maximum queue size possible. Which is actually the maximum
> queue size allowed by the virtio protocol. The appropriate
> value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> 
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> 
> Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> more or less arbitrary value of 1024 in the past, which
> limits the maximum transfer size with virtio to 4M
> (more precise: 1024 * PAGE_SIZE, with the latter typically
> being 4k).

Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
that cannot be passed to host system calls (sendmsg(2), pwritev(2),
etc).
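
To illustrate the host-side limit, here is a minimal sketch (not QEMU code) that asks the host for its per-call iovec limit. The helper names host_iov_max() and fits_in_one_syscall() are made up for this example, and the fallback of 16 is the minimum POSIX allows for IOV_MAX:

```c
#include <unistd.h>

/* POSIX guarantees IOV_MAX is at least 16; used only as a fallback. */
#define FALLBACK_IOV_MAX 16

/* Hypothetical helper: largest iovec count the host accepts per call. */
static long host_iov_max(void)
{
    long n = sysconf(_SC_IOV_MAX);
    if (n < 0) {
        /* Limit is indeterminate or the query failed. */
        n = FALLBACK_IOV_MAX;
    }
    return n;
}

/* Returns 1 if a buffer with 'niov' segments fits in one writev()-style call. */
static int fits_in_one_syscall(unsigned int niov)
{
    return niov <= (unsigned long)host_iov_max();
}
```

A descriptor chain longer than this value would have to be split or bounced before being handed to the host kernel.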

> (2) Additionally the current value of 1024 poses a hidden limit,
> invisible to guest, which causes a system hang with the
> following QEMU error if guest tries to exceed it:
> 
> virtio: too many write descriptors in indirect table

I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:

  The number of descriptors in the table is defined by the queue size for this 
virtqueue: this is the maximum possible descriptor chain length.

and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:

  A driver MUST NOT create a descriptor chain longer than the Queue Size of the 
device.

Do you mean a broken/malicious guest driver that is violating the spec?
That's not a hidden limit, it's defined by the spec.

> (3) Unfortunately not all virtio users in QEMU would currently
> work correctly with the new value of 32768.
> 
> So let's turn this hard coded global value into a runtime
> variable as a first step in this commit, configurable for each
> virtio user by passing a corresponding value with virtio_init()
> call.

virtio_add_queue() already has an int queue_size argument, why isn't
that enough to deal with the maximum queue size? There's probably a good
reason for it, but please include it in the commit description.

> 
> Signed-off-by: Christian Schoenebeck 
> ---
>  hw/9pfs/virtio-9p-device.c |  3 ++-
>  hw/block/vhost-user-blk.c  |  2 +-
>  hw/block/virtio-blk.c  |  3 ++-
>  hw/char/virtio-serial-bus.c|  2 +-
>  hw/display/virtio-gpu-base.c   |  2 +-
>  hw/input/virtio-input.c|  2 +-
>  hw/net/virtio-net.c| 15 ---
>  hw/scsi/virtio-scsi.c  |  2 +-
>  hw/virtio/vhost-user-fs.c  |  2 +-
>  hw/virtio/vhost-user-i2c.c |  3 ++-
>  hw/virtio/vhost-vsock-common.c |  2 +-
>  hw/virtio/virtio-balloon.c |  4 ++--
>  hw/virtio/virtio-crypto.c  |  3 ++-
>  hw/virtio/virtio-iommu.c   |  2 +-
>  hw/virtio/virtio-mem.c |  2 +-
>  hw/virtio/virtio-pmem.c|  2 +-
>  hw/virtio/virtio-rng.c |  2 +-
>  hw/virtio/virtio.c | 35 +++---
>  include/hw/virtio/virtio.h |  5 -
>  19 files changed, 57 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index 54ee93b71f..cd5d95dd51 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -216,7 +216,8 @@ static void virtio_9p_device_realize(DeviceState *dev, 
> Error **errp)
>  }
>  
>  v->config_size = sizeof(struct virtio_9p_config) + strlen(s->fsconf.tag);
> -virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size);
> +virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P, v->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  v->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  }
>  
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index ba13cb87e5..336f56705c 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -491,7 +491,7 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  }
>  
>  virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK,
> -sizeof(struct virtio_blk_config));
> +sizeof(struct virtio_blk_config), VIRTQUEUE_MAX_SIZE);
>  
>  s->virtqs = g_new(VirtQueue *, s->num_queues);
>  for (i = 0; i < s->num_queues; i++) {
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index f139cd7cc9..9c0f46815c 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -1213,7 +1213,8 @@ static void virtio_blk_device_realize(DeviceState *dev, 
> Error **errp)
>  
>  virtio_blk_set_config_size(s, s->host_features);
>  
> -virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size);
> +virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, s->config_size,
> +VIRTQUEUE_MAX_SIZE);
>  
>  s->blk = conf->conf.blk;
>  s->rq = NULL;
> diff --git a/hw/char/virti

Re: [PATCH 10/13] virtiofsd: Custom threadpool for remote blocking posix locks requests

2021-10-05 Thread Vivek Goyal

On Mon, Oct 04, 2021 at 03:54:31PM +0100, Stefan Hajnoczi wrote:
> On Thu, Sep 30, 2021 at 11:30:34AM -0400, Vivek Goyal wrote:
> > Add a new custom threadpool using posix threads that specifically
> > service locking requests.
> > 
> > In the case of a fcntl(SETLKW) request, if the guest is waiting
> > for a lock or locks and issues a hard-reboot through SYSRQ then virtiofsd
> > unblocks the blocked threads by sending a signal to them and waking
> > them up.
> > 
> > The current threadpool (GThreadPool) is not adequate to service the
> > locking requests that result in a thread blocking. That is because
> > GLib does not provide an API to cancel the request while it is
> > serviced by a thread. In addition, a user might be running virtiofsd
> > without a threadpool (--thread-pool-size=0), thus a locking request
> > that blocks, will block the main virtqueue thread that services requests
> > from servicing any other requests.
> > 
> > The only exception occurs when the lock is of type F_UNLCK. In this case
> > the request is serviced by the main virtqueue thread or a GThreadPool
> > thread to avoid a deadlock, when all the threads in the custom threadpool
> > are blocked.
> > 
> > Then virtiofsd proceeds to cleanup the state of the threads, release
> > them back to the system and re-initialize.
> 
> Is there another way to cancel SETLKW without resorting to a new thread
> pool? Since this only matters when shutting down or restarting, can we
> close all plock->fd file descriptors to kick the GThreadPool workers out
> > of fcntl()?

I don't think that closing plock->fd will unblock fcntl().  

SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
struct fd f = fdget_raw(fd);
}

IIUC, fdget_raw() will take a reference on the associated "struct file", and
after that the rest of the code will work with that "struct file".

static int do_lock_file_wait(struct file *filp, unsigned int cmd,
 struct file_lock *fl)
{
..
..
error = wait_event_interruptible(fl->fl_wait,
list_empty(&fl->fl_blocked_member));

..
..
}

And this should break upon receiving a signal. The man page says the
same thing.

   F_OFD_SETLKW (struct flock *)
  As for F_OFD_SETLK, but if a conflicting lock  is  held  on  the
  file,  then  wait  for that lock to be released.  If a signal is
  caught while waiting, then the call is  interrupted  and  (after
  the  signal  handler has returned) returns immediately (with re‐
  turn value -1 and errno set to EINTR; see signal(7)).

It would be nice if we don't have to implement our own custom threadpool
just for locking. Would have been better if glib thread pool provided
some facility for this.

[..]
> > diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> > index 3b720c5d4a..c67c2e0e7a 100644
> > --- a/tools/virtiofsd/fuse_virtio.c
> > +++ b/tools/virtiofsd/fuse_virtio.c
> > @@ -20,6 +20,7 @@
> >  #include "fuse_misc.h"
> >  #include "fuse_opt.h"
> >  #include "fuse_virtio.h"
> > +#include "tpool.h"
> >  
> >  #include 
> >  #include 
> > @@ -612,6 +613,60 @@ out:
> >  free(req);
> >  }
> >  
> > +/*
> > + * If the request is a locking request, use a custom locking thread pool.
> > + */
> > +static bool use_lock_tpool(gpointer data, gpointer user_data)
> > +{
> > +struct fv_QueueInfo *qi = user_data;
> > +struct fuse_session *se = qi->virtio_dev->se;
> > +FVRequest *req = data;
> > +VuVirtqElement *elem = &req->elem;
> > +struct fuse_buf fbuf = {};
> > +struct fuse_in_header *inhp;
> > +struct fuse_lk_in *lkinp;
> > +size_t lk_req_len;
> > +/* The 'out' part of the elem is from qemu */
> > +unsigned int out_num = elem->out_num;
> > +struct iovec *out_sg = elem->out_sg;
> > +size_t out_len = iov_size(out_sg, out_num);
> > +bool use_custom_tpool = false;
> > +
> > +/*
> > > + * If notifications are not enabled, no point in using custom lock
> > + * thread pool.
> > + */
> > +if (!se->notify_enabled) {
> > +return false;
> > +}
> > +
> > +assert(se->bufsize > sizeof(struct fuse_in_header));
> > +lk_req_len = sizeof(struct fuse_in_header) + sizeof(struct fuse_lk_in);
> > +
> > +if (out_len < lk_req_len) {
> > +return false;
> > +}
> > +
> > +fbuf.mem = g_malloc(se->bufsize);
> > +copy_from_iov(&fbuf, out_num, out_sg, lk_req_len);
> 
> This looks inefficient: for every FUSE request we now malloc se->bufsize
> and then copy lk_req_len bytes, only to free the memory again.
> 
> Is it possible to keep lk_req_len bytes on the stack instead?

I guess it should be possible. se->bufsize is variable but lk_req_len
is known at compile time.

lk_req_len = sizeof(struct fuse_in_header) + sizeof(struct fuse_lk_in);

So we should be able to allocate this much space on stack and point
fbuf.mem to it.

char buf[

RE: [PATCH v3 9/9] vfio: defer to commit kvm irq routing when enable msi/msix

2021-10-05 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)




> -----Original Message-----
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Saturday, October 2, 2021 7:05 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> Cc: phi...@redhat.com; pbonz...@redhat.com; marcel.apfelb...@gmail.com;
> m...@redhat.com; qemu-devel@nongnu.org; Gonglei (Arei)
> ; chenjiashang 
> Subject: Re: [PATCH v3 9/9] vfio: defer to commit kvm irq routing when enable
> msi/msix
> 
> On Tue, 21 Sep 2021 07:02:02 +0800
> "Longpeng(Mike)"  wrote:
> 
> > In migration resume phase, all unmasked msix vectors need to be
> > setup when load the VF state. However, the setup operation would
> 
> s/load/loading/
> 
> > take longer if the VM has more VFs and each VF has more unmasked
> > vectors.
> >
> > The hot spot is kvm_irqchip_commit_routes, it'll scan and update
> > all irqfds that already assigned each invocation, so more vectors
> 
> s/that/that are/
> 
> > means need more time to process them.
> >
> > vfio_pci_load_config
> >   vfio_msix_enable
> > msix_set_vector_notifiers
> >   for (vector = 0; vector < dev->msix_entries_nr; vector++) {
> > vfio_msix_vector_do_use
> >   vfio_add_kvm_msi_virq
> > kvm_irqchip_commit_routes <-- expensive
> >   }
> >
> > We can reduce the cost by only commit once outside the loop. The
> 
> s/commit/committing/
> 

OK, will fix in the next version, thanks.

> > routes is cached in kvm_state, we commit them first and then bind
> 
> s/is/are/
> 

OK.

> > irqfd for each vector.
> >
> > The test VM has 128 vcpus and 8 VF (each one has 65 vectors),
> > we measure the cost of the vfio_msix_enable for each VF, and
> > we can see 90+% costs can be reduce.
> >
> > VF      Count of irqfds[*]    Original    With this patch
> >
> > 1st     65                    8           2
> > 2nd     130                   15          2
> > 3rd     195                   22          2
> > 4th     260                   24          3
> > 5th     325                   36          2
> > 6th     390                   44          3
> > 7th     455                   51          3
> > 8th     520                   58          4
> > Total                         258ms       21ms
> >
> > [*] Count of irqfds
> > How many irqfds that already assigned and need to process in this
> > round.
> >
> > The optimition can be applied to msi type too.
> 
> s/optimition/optimization/
> 

OK, thanks.

> >
> > Signed-off-by: Longpeng(Mike) 
> > ---
> >  hw/vfio/pci.c | 36 
> >  1 file changed, 28 insertions(+), 8 deletions(-)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 2de1cc5425..b26129bddf 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -513,11 +513,13 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev,
> unsigned int nr,
> >   * increase them as needed.
> >   */
> >  if (vdev->nr_vectors < nr + 1) {
> > -vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
> >  vdev->nr_vectors = nr + 1;
> > -ret = vfio_enable_vectors(vdev, true);
> > -if (ret) {
> > -error_report("vfio: failed to enable vectors, %d", ret);
> > +if (!vdev->defer_kvm_irq_routing) {
> > +vfio_disable_irqindex(&vdev->vbasedev,
> VFIO_PCI_MSIX_IRQ_INDEX);
> > +ret = vfio_enable_vectors(vdev, true);
> > +if (ret) {
> > +error_report("vfio: failed to enable vectors, %d", ret);
> > +}
> >  }
> >  } else {
> >  Error *err = NULL;
> > @@ -579,8 +581,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev,
> unsigned int nr)
> >  }
> >  }
> >
> > -/* TODO: invoked when enclabe msi/msix vectors */
> > -static __attribute__((unused)) void vfio_commit_kvm_msi_virq(VFIOPCIDevice
> *vdev)
> > +static void vfio_commit_kvm_msi_virq(VFIOPCIDevice *vdev)
> >  {
> >  int i;
> >  VFIOMSIVector *vector;
> > @@ -610,6 +611,9 @@ static __attribute__((unused)) void
> vfio_commit_kvm_msi_virq(VFIOPCIDevice *vdev
> >
> >  static void vfio_msix_enable(VFIOPCIDevice *vdev)
> >  {
> > +PCIDevice *pdev = &vdev->pdev;
> > +int ret;
> > +
> >  vfio_disable_interrupts(vdev);
> >
> >  vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->msix->entries);
> > @@ -632,11 +636,22 @@ static void vfio_msix_enable(VFIOPCIDevice *vdev)
> >  vfio_msix_vector_do_use(&vdev->pdev, 0, NULL, NULL);
> >  vfio_msix_vector_release(&vdev->pdev, 0);
> >
> > -if (msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
> > -  vfio_msix_vector_release, NULL)) {
> 
> A comment would be useful here, maybe something like:
> 
> /*
>  * Setting vector notifiers triggers synchronous vector-use
>  * callbacks for each active vector.  Deferring to commit the KVM
>  * routes once rather than per vector provides a substantial
>  * performance improvement.
>  */
> 

Will add in the nex

Re: [PATCH 2/3] vdpa: Add vhost_vdpa_section_end

2021-10-05 Thread Eugenio Perez Martin

On Tue, Oct 5, 2021 at 12:47 PM Michael S. Tsirkin  wrote:
>
> On Tue, Oct 05, 2021 at 11:52:37AM +0200, Eugenio Perez Martin wrote:
> > On Tue, Oct 5, 2021 at 10:15 AM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Oct 05, 2021 at 10:01:30AM +0200, Eugenio Pérez wrote:
> > > > Abstract this operation, that will be reused when validating the region
> > > > against the iova range that the device supports.
> > > >
> > > > Signed-off-by: Eugenio Pérez 
> > >
> > > Note that as defined end is actually 1 byte beyond end of section.
> > > As such it can e.g. overflow if cast to u64.
> > > So be careful to use int128 ops with it.
> >
> > You are right, but this is only the result of extracting "llend"
> > calculation in its own function, since it is going to be used a third
> > time in the next commit. This next commit contains a mistake because
> > of this, as you pointed out.
> >
> > Since "last" would be a very misleading name, do you think we could
> > give a better name / type to it?
> >
> > > Also - document?
> >
> > It will be documented with that ("It returns one byte beyond end of
> > section" or similar) too.
> >
> > Thanks!
>
> that's how c++ containers work so maybe it's not too bad as long
> as we document this carefully.
>

I tend to see it that way, except when the name is "last", which I read
as "last one addressable/valid", as discussed in the
VHOST_VDPA_GET_IOVA_RANGE call mail thread. So end = one past the range,
last = last valid one.

It would be great to have something like void * / hwaddr, or C++
chrono time_point vs time_point, that moves the check against mixing
different range types into the type system. But this may be
overthinking it at this moment.
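
To make the end/last distinction concrete, a small sketch in plain C (hypothetical names, not the vhost-vdpa code, and using uint64_t rather than Int128) shows why an exclusive "end" is risky in 64 bits while an inclusive "last" is not:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical range type: [first, last], both addresses valid. */
typedef struct {
    uint64_t first;
    uint64_t last;
} InclusiveRange;

/*
 * Build an inclusive range from offset and size.
 * Returns false if size is zero or the range would run past 2^64 - 1.
 */
static bool range_from_size(uint64_t offset, uint64_t size, InclusiveRange *r)
{
    if (size == 0 || size - 1 > UINT64_MAX - offset) {
        return false;
    }
    r->first = offset;
    r->last = offset + (size - 1);
    return true;
}

/* The exclusive "end" (one past last) wraps to 0 when last == UINT64_MAX. */
static uint64_t exclusive_end_u64(const InclusiveRange *r)
{
    return r->last + 1;
}
```

A section covering the top page of a 64-bit address space is a valid inclusive range, but its exclusive end no longer fits in 64 bits, which is the case the int128 ops guard against.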

Thanks!

> > >
> > > > ---
> > > >  hw/virtio/vhost-vdpa.c | 18 +++---
> > > >  1 file changed, 11 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > index ea1aa71ad8..a1de6c7c9c 100644
> > > > --- a/hw/virtio/vhost-vdpa.c
> > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > @@ -24,6 +24,15 @@
> > > >  #include "trace.h"
> > > >  #include "qemu-common.h"
> > > >
> > > > +static Int128 vhost_vdpa_section_end(const MemoryRegionSection 
> > > > *section)
> > > > +{
> > > > +Int128 llend = int128_make64(section->offset_within_address_space);
> > > > +llend = int128_add(llend, section->size);
> > > > +llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > > > +
> > > > +return llend;
> > > > +}
> > > > +
> > > >  static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection 
> > > > *section)
> > > >  {
> > > >  return (!memory_region_is_ram(section->mr) &&
> > > > @@ -160,10 +169,7 @@ static void 
> > > > vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > >  }
> > > >
> > > >  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > > > -llend = int128_make64(section->offset_within_address_space);
> > > > -llend = int128_add(llend, section->size);
> > > > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > > > -
> > > > +llend = vhost_vdpa_section_end(section);
> > > >  if (int128_ge(int128_make64(iova), llend)) {
> > > >  return;
> > > >  }
> > > > @@ -221,9 +227,7 @@ static void 
> > > > vhost_vdpa_listener_region_del(MemoryListener *listener,
> > > >  }
> > > >
> > > >  iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> > > > -llend = int128_make64(section->offset_within_address_space);
> > > > -llend = int128_add(llend, section->size);
> > > > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> > > > +llend = vhost_vdpa_section_end(section);
> > > >
> > > >  trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend));
> > > >
> > > > --
> > > > 2.27.0
> > >
>

Re: [PATCH v2 1/3] virtio: turn VIRTQUEUE_MAX_SIZE into a variable

2021-10-05 Thread Christian Schoenebeck

On Dienstag, 5. Oktober 2021 14:45:56 CEST Stefan Hajnoczi wrote:
> On Mon, Oct 04, 2021 at 09:38:04PM +0200, Christian Schoenebeck wrote:
> > Refactor VIRTQUEUE_MAX_SIZE to effectively become a runtime
> > variable per virtio user.
> 
> virtio user == virtio device model?

Yes

> > Reasons:
> > 
> > (1) VIRTQUEUE_MAX_SIZE should reflect the absolute theoretical
> > 
> > maximum queue size possible. Which is actually the maximum
> > queue size allowed by the virtio protocol. The appropriate
> > value for VIRTQUEUE_MAX_SIZE would therefore be 32768:
> > 
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-240006
> > 
> > Apparently VIRTQUEUE_MAX_SIZE was instead defined with a
> > more or less arbitrary value of 1024 in the past, which
> > limits the maximum transfer size with virtio to 4M
> > (more precise: 1024 * PAGE_SIZE, with the latter typically
> > being 4k).
> 
> Being equal to IOV_MAX is a likely reason. Buffers with more iovecs than
> that cannot be passed to host system calls (sendmsg(2), pwritev(2),
> etc).

Yes, that's use case dependent. Hence the solution to opt-in if it is desired 
and feasible.

> > (2) Additionally the current value of 1024 poses a hidden limit,
> > 
> > invisible to guest, which causes a system hang with the
> > following QEMU error if guest tries to exceed it:
> > 
> > virtio: too many write descriptors in indirect table
> 
> I don't understand this point. 2.6.5 The Virtqueue Descriptor Table says:
> 
>   The number of descriptors in the table is defined by the queue size for
> this virtqueue: this is the maximum possible descriptor chain length.
> 
> and 2.6.5.3.1 Driver Requirements: Indirect Descriptors says:
> 
>   A driver MUST NOT create a descriptor chain longer than the Queue Size of
> the device.
> 
> Do you mean a broken/malicious guest driver that is violating the spec?
> That's not a hidden limit, it's defined by the spec.

https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00781.html
https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg00788.html

You can already go beyond that queue size at runtime with the indirection 
table. The only actual limit is the currently hard coded value of 1k pages. 
Hence the suggestion to turn that into a variable.
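
As a quick sanity check of the numbers in this thread, a sketch of the arithmetic, assuming (as the commit message does) that each descriptor maps one page; max_transfer_bytes() is a made-up helper, not a QEMU function:

```c
#include <stdint.h>

/* Upper bound on one transfer, assuming each descriptor maps one page. */
static uint64_t max_transfer_bytes(uint32_t max_descriptors, uint32_t page_size)
{
    /* Widen before multiplying so large values cannot overflow 32 bits. */
    return (uint64_t)max_descriptors * page_size;
}
```

With 4k pages this gives 4M for the current limit of 1024 and 128M for the protocol maximum of 32768.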

> > (3) Unfortunately not all virtio users in QEMU would currently
> > 
> > work correctly with the new value of 32768.
> > 
> > So let's turn this hard coded global value into a runtime
> > variable as a first step in this commit, configurable for each
> > virtio user by passing a corresponding value with virtio_init()
> > call.
> 
> virtio_add_queue() already has an int queue_size argument, why isn't
> that enough to deal with the maximum queue size? There's probably a good
> reason for it, but please include it in the commit description.
[...]
> Can you make this value per-vq instead of per-vdev since virtqueues can
> have different queue sizes?
> 
> The same applies to the rest of this patch. Anything using
> vdev->queue_max_size should probably use vq->vring.num instead.

I would like to avoid that and keep it per device. The maximum size stored
there is the maximum size supported by the virtio user (or virtio device model,
however you want to call it). So that's really a limit per device, not per
queue, as no queue of the device would ever exceed that limit.

Plus a lot more code would need to be refactored, which I think is 
unnecessary.

Best regards,
Christian Schoenebeck

[PATCH] docs: Add spec of OVMF GUIDed table for SEV guests

2021-10-05 Thread Dov Murik

Add docs/specs/sev-guest-firmware.rst which describes the GUIDed table
in the end of OVMF's image which is parsed by QEMU, and currently used
to describe some values for SEV and SEV-ES guests.

Signed-off-by: Dov Murik 
---
 docs/specs/index.rst  |   1 +
 docs/specs/sev-guest-firmware.rst | 125 ++
 2 files changed, 126 insertions(+)
 create mode 100644 docs/specs/sev-guest-firmware.rst

diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index ecc43896bb..2a35700fb3 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -18,3 +18,4 @@ guest hardware that is specific to QEMU.
acpi_mem_hotplug
acpi_pci_hotplug
acpi_nvdimm
+   sev-guest-firmware
diff --git a/docs/specs/sev-guest-firmware.rst 
b/docs/specs/sev-guest-firmware.rst
new file mode 100644
index 0000000000..3f7f082df5
--- /dev/null
+++ b/docs/specs/sev-guest-firmware.rst
@@ -0,0 +1,125 @@
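+====================================================
+QEMU/Guest Firmware Interface for AMD SEV and SEV-ES
+====================================================
+
+Overview
+========
+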
+The guest firmware image (OVMF) may contain some configuration entries
+which are used by QEMU before the guest launches.  These are listed in a
+GUIDed table at a known location in the firmware image.  QEMU parses
+this table when it loads the firmware image into memory, and then QEMU
+reads individual entries when their values are needed.
+
+Though nothing in the table structure is SEV-specific, currently all the
+entries in the table are related to SEV and SEV-ES features.
+
+
+Table parsing in QEMU
+---------------------
+
+The table is parsed from the footer: first the presence of the table
+footer GUID (96b582de-1fb2-45f7-baea-a366c55a082d) at 0xffffffd0 is
+verified.  If that is found, two bytes at 0xffffffce are the entire
+table length.
+
+Then the table is scanned backwards looking for the specific entry GUID.
+
+QEMU files related to parsing and scanning the OVMF table:
+ - ``hw/i386/pc_sysfw_ovmf.c``
+
+The edk2 firmware code that constructs this structure is in the
+`OVMF Reset Vector file`_.
+
+
+Table memory layout
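+-------------------
+
++------------+--------+-----------------------------------------+
+|    GPA     | Length |               Description               |
++============+========+=========================================+
+| 0xffffff80 | 4      | Zero padding                            |
++------------+--------+-----------------------------------------+
+| 0xffffff84 | 4      | SEV hashes table base address           |
++------------+--------+-----------------------------------------+
+| 0xffffff88 | 4      | SEV hashes table size (=0x400)          |
++------------+--------+-----------------------------------------+
+| 0xffffff8c | 2      | SEV hashes table entry length (=0x1a)   |
++------------+--------+-----------------------------------------+
+| 0xffffff8e | 16     | SEV hashes table GUID:                  |
+|            |        | 7255371f-3a3b-4b04-927b-1da6efa8d454    |
++------------+--------+-----------------------------------------+
+| 0xffffff9e | 4      | SEV secret block base address           |
++------------+--------+-----------------------------------------+
+| 0xffffffa2 | 4      | SEV secret block size (=0xc00)          |
++------------+--------+-----------------------------------------+
+| 0xffffffa6 | 2      | SEV secret block entry length (=0x1a)   |
++------------+--------+-----------------------------------------+
+| 0xffffffa8 | 16     | SEV secret block GUID:                  |
+|            |        | 4c2eb361-7d9b-4cc3-8081-127c90d3d294    |
++------------+--------+-----------------------------------------+
+| 0xffffffb8 | 4      | SEV-ES AP reset RIP                     |
++------------+--------+-----------------------------------------+
+| 0xffffffbc | 2      | SEV-ES reset block entry length (=0x16) |
++------------+--------+-----------------------------------------+
+| 0xffffffbe | 16     | SEV-ES reset block entry GUID:          |
+|            |        | 00f771de-1a7e-4fcb-890e-68c77e2fb44e    |
++------------+--------+-----------------------------------------+
+| 0xffffffce | 2      | Length of entire table including table  |
+|            |        | footer GUID and length (=0x72)          |
++------------+--------+-----------------------------------------+
+| 0xffffffd0 | 16     | OVMF GUIDed table footer GUID:          |
+|            |        | 96b582de-1fb2-45f7-baea-a366c55a082d    |
++------------+--------+-----------------------------------------+
+| 0xffffffe0 | 8      | Application processor entry point code  |
++------------+--------+-----------------------------------------+
+| 0xffffffe8 | 8      | "\0\0\0\0VTF\0"                         |
++------------+--------+-----------------------------------------+
+| 0xfffffff0 | 16     | Reset vector code                       |
++------------+--------+-----------------------------------------+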
+
+
+T

Re: [PATCH] monitor: Tidy up find_device_state()

2021-10-05 Thread Damien Hedde





On 9/16/21 13:17, Markus Armbruster wrote:

Commit 6287d827d4 "monitor: allow device_del to accept QOM paths"
extended find_device_state() to accept QOM paths in addition to qdev
IDs.  This added a checked conversion to TYPE_DEVICE at the end, which
duplicates the check done for the qdev ID case earlier, except it sets
a *different* error: GenericError "ID is not a hotpluggable device"
when passed a QOM path, and DeviceNotFound "Device 'ID' not found"
when passed a qdev ID.  Fortunately, the latter won't happen as long
as we add only devices to /machine/peripheral/.

Earlier, commit b6cc36abb2 "qdev: device_del: Search for to be
unplugged device in 'peripheral' container" rewrote the lookup by qdev
ID to use QOM instead of qdev_find_recursive(), so it can handle
bus-less devices.  It does so by constructing an absolute QOM path.
Works, but object_resolve_path_component() is easier.  Switching to it
also gets rid of the unclean duplication described above.

While there, avoid converting to TYPE_DEVICE twice, first to check
whether it's possible, and then for real.

Signed-off-by: Markus Armbruster 


Reviewed-by: Damien Hedde 

---
  softmmu/qdev-monitor.c | 13 +
  1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index a304754ab9..d1ab3c25fb 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -831,16 +831,12 @@ void qmp_device_add(QDict *qdict, QObject **ret_data, 
Error **errp)
  static DeviceState *find_device_state(const char *id, Error **errp)
  {
  Object *obj;
+DeviceState *dev;
  
  if (id[0] == '/') {

  obj = object_resolve_path(id, NULL);
  } else {
-char *root_path = object_get_canonical_path(qdev_get_peripheral());
-char *path = g_strdup_printf("%s/%s", root_path, id);
-
-g_free(root_path);
-obj = object_resolve_path_type(path, TYPE_DEVICE, NULL);
-g_free(path);
+obj = object_resolve_path_component(qdev_get_peripheral(), id);
  }
  
  if (!obj) {

@@ -849,12 +845,13 @@ static DeviceState *find_device_state(const char *id, 
Error **errp)
  return NULL;
  }
  
-if (!object_dynamic_cast(obj, TYPE_DEVICE)) {

+dev = (DeviceState *)object_dynamic_cast(obj, TYPE_DEVICE);
+if (!dev) {
  error_setg(errp, "%s is not a hotpluggable device", id);
  return NULL;
  }
  
-return DEVICE(obj);

+return dev;
  }
  
  void qdev_unplug(DeviceState *dev, Error **errp)

Re: [PATCH 12/13] virtiofsd: Implement blocking posix locks

2021-10-05 Thread Vivek Goyal

On Mon, Oct 04, 2021 at 04:07:04PM +0100, Stefan Hajnoczi wrote:
> On Thu, Sep 30, 2021 at 11:30:36AM -0400, Vivek Goyal wrote:
> > As of now we don't support fcntl(F_SETLKW) and if we see one, we return
> > -EOPNOTSUPP.
> > 
> > Change that by accepting these requests and returning a reply
> > immediately asking caller to wait. Once lock is available, send a
> > notification to the waiter indicating lock is available.
> > 
> > In response to lock request, we are returning error value as "1", which
> > signals to client to queue the lock request internally and later client
> > will get a notification which will signal lock is taken (or error). And
> > then fuse client should wake up the guest process.
> > 
> > Signed-off-by: Vivek Goyal 
> > Signed-off-by: Ioannis Angelakopoulos 
> > ---
> >  tools/virtiofsd/fuse_lowlevel.c  | 37 -
> >  tools/virtiofsd/fuse_lowlevel.h  | 26 
> >  tools/virtiofsd/fuse_virtio.c| 50 ---
> >  tools/virtiofsd/passthrough_ll.c | 70 
> >  4 files changed, 167 insertions(+), 16 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> > b/tools/virtiofsd/fuse_lowlevel.c
> > index e4679c73ab..2e7f4b786d 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.c
> > +++ b/tools/virtiofsd/fuse_lowlevel.c
> > @@ -179,8 +179,8 @@ int fuse_send_reply_iov_nofree(fuse_req_t req, int 
> > error, struct iovec *iov,
> >  .unique = req->unique,
> >  .error = error,
> >  };
> > -
> > -if (error <= -1000 || error > 0) {
> > +/* error = 1 has been used to signal client to wait for notificaiton */
> 
> s/notificaiton/notification/

Will fix. I have made too many spelling mistakes. :-(

> 
> > +if (error <= -1000 || error > 1) {
> >  fuse_log(FUSE_LOG_ERR, "fuse: bad error value: %i\n", error);
> >  out.error = -ERANGE;
> >  }
> > @@ -290,6 +290,11 @@ int fuse_reply_err(fuse_req_t req, int err)
> >  return send_reply(req, -err, NULL, 0);
> >  }
> >  
> > +int fuse_reply_wait(fuse_req_t req)
> > +{
> > +return send_reply(req, 1, NULL, 0);
> > +}
> > +
> >  void fuse_reply_none(fuse_req_t req)
> >  {
> >  fuse_free_req(req);
> > @@ -2165,6 +2170,34 @@ static void do_destroy(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  send_reply_ok(req, NULL, 0);
> >  }
> >  
> > +static int send_notify_iov(struct fuse_session *se, int notify_code,
> > +   struct iovec *iov, int count)
> > +{
> > +struct fuse_out_header out;
> > +if (!se->got_init) {
> > +return -ENOTCONN;
> > +}
> > +out.unique = 0;
> > +out.error = notify_code;
> 
> Please fully initialize all fuse_out_header fields so it's obvious that
> there is no accidental information leak from virtiofsd to the guest:
> 
>   struct fuse_out_header out = {
>   .error = notify_code,
>   };
> 
> The host must not expose uninitialized memory to the guest (just like
> the kernel vs userspace). fuse_send_msg() initializes out.len later, but
> to be on the safe side I think we should be explicit here.

Agreed. It's better to be explicit here and initialize fuse_out_header
fully. Will do.
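
For reference, a minimal standalone sketch of the initialization style suggested above. The struct is a local stand-in that assumes the same three fields as fuse_out_header, and make_notify_header() is a made-up name, not a virtiofsd function:

```c
#include <stdint.h>

/* Local stand-in; assumed to mirror the fields of struct fuse_out_header. */
struct out_header_sketch {
    uint32_t len;
    int32_t error;
    uint64_t unique;
};

static struct out_header_sketch make_notify_header(int32_t notify_code)
{
    /* Members not named in the initializer (len, unique) are set to zero. */
    struct out_header_sketch out = {
        .error = notify_code,
    };
    return out;
}
```

With a designated initializer, no named field is left holding stale stack contents, even if a later code path forgets to fill one in.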

Vivek

Re: Moving QEMU downloads to GitLab Releases?

2021-10-05 Thread Stefan Hajnoczi

On Mon, Oct 04, 2021 at 02:34:49PM -0500, Michael Roth wrote:
> Quoting Stefan Hajnoczi (2021-10-04 04:01:22)
> > On Fri, Oct 01, 2021 at 09:39:13AM +0200, Philippe Mathieu-Daudé wrote:
> > > On 9/30/21 15:40, Stefan Hajnoczi wrote:
> > > > Hi Mike,
> > > > QEMU downloads are currently hosted on qemu.org's Apache web server.
> > > > Paolo and I were discussing ways to reduce qemu.org network traffic to
> > > > save money and eventually turn off the qemu.org server since there is no
> > > > full-time sysadmin for it. I'd like to discuss moving QEMU downloads to
> > > > GitLab Releases.
> > > > 
> > > > Since you create and sign QEMU releases I wanted to see what you think
> > > > about the idea. GitLab Releases has two ways of creating release assets:
> > > > archiving a git tree and attaching arbitrary binaries. The
> > > > scripts/make-release script fetches submodules and generates version
> > > > files, so it may be necessary to treat QEMU tarballs as arbitrary
> > > > binaries instead of simply letting GitLab create git tree archives:
> > > > https://docs.gitlab.com/ee/user/project/releases/#use-a-generic-package-for-attaching-binaries
> > > > 
> > > > Releases can be uploaded via the GitLab API from your local machine or
> > > > deployed as a GitLab CI job. Uploading from your local machine would be
> > > > the closest to the current workflow.
> > > > 
> > > > In the long term we could have a CI job that automatically publishes
> > > > QEMU releases when a new qemu.git tag is pushed. The release process
> > > > could be fully automated so that manual steps are no longer necessary,
> > > > although we'd have to trust GitLab with QEMU GPG signing keys.
> > > 
> > > Before having to trust a SaaS for GPG signing, could this work?
> > > 
> > > - make-release script should produce a reproducible tarball in a
> > >   gitlab job, along with a file containing the tarball hash.
> > > 
> > > - Mike (or whoever is responsible of releases) keeps doing a local
> > >   manual build
> > > 
> > > - Mike checks the local hash matches the Gitlab artifact hash
> > > 
> > > - Mike signs its local build/hash and uses the GitLab API to upload
> > >   that .sig to job artifacts
> > > 
> > > - we can have an extra manual job that checks the tarball.sig
> > >   (hash and pubkey) and on success deploys updating the download
> > >   page, adding the new release
> > 
> > I wonder what Mike sees as the way forward.
> 
> Hi Stefan, Philippe,
> 
> In general I like the idea, since we could also have the CI do the full
> gamut of testing against the binaries built from said tarball, so the
> Release Person can just regenerate the tarball and provide a sig to
> attest that it came from the proper sources. Currently I do make check
> and make check-acceptance and a few other sanity checks, which I guess
> would no longer be needed then.
> 
> But I think the more immediate issue is where/how to host those
> tarballs. Even moving all the ROMs/capstone out of the source tree still
> results in an xz-compressed tarball size ~25MB, which is well above the
> 10MB limit mentioned earlier. We could break it out into target-specific
> tarballs, maybe further into softmmu/user variants, but that sounds
> painful for both users and maintainers who need to deal with the
> resulting source tree differences.
> 
> What I'm wondering is whether we could just use the archive files
> generated by gitlab when we tag our releases? E.g.:
> 
>   https://gitlab.com/qemu-project/qemu/-/archive/v6.1.0/qemu-v6.1.0.tar.bz2
> 
> If we paired that with an in-tree script similar to make-release for
> users to download individual ROM sources/subprojects used for a particular
> tagged release, would that be sufficient for GPL compliance and verifying
> what sources the binaries were built from? Are there any other
> considerations WRT ROMs/etc.?
> 
> With something like that in place, Release Person could just do a git
> checkout, verify the Maintainer's sig/tag (in case we don't necessarily
> trust the git host), generate the tarball, verify the hash matches what
> gitlab published (or verify/diff individual files if the bz/gz hashes
> require a specific environment), then sign the gitlab tarball and add
> the sig to qemu.org download page along with a link to the gitlab-generated
> tarball.
> 
> We could also publish the Maintainer and Release Person public keys on
> qemu.org download page so users can verify this as well using the same
> process.
> 
> Users that want the additional sources can then do it locally via
> above-mentioned script, which would be part of the now-signed tarball
> and so could be 'trusted' assuming the individual project hosts weren't
> compromised (which is still an assumption that's needed with the current
> process anyway).
> 
> I guess the main question is who is using the ROM/BIOS sources in the
> tarballs, and would this 2-step process work for those users? If there
> are distros relying on them then maybe there are some more logistics

Re: [PATCH 11/13] virtiofsd: Shutdown notification queue in the end

2021-10-05 Thread Vivek Goyal

On Mon, Oct 04, 2021 at 04:01:02PM +0100, Stefan Hajnoczi wrote:
> On Thu, Sep 30, 2021 at 11:30:35AM -0400, Vivek Goyal wrote:
> > So far we did not have the notion of cross queue traffic. That is, we
> > get request on a queue and send back response on same queue. So if a
> > request is being processed and at the same time a stop queue request
> > comes in, we wait for all pending requests to finish and then queue
> > is stopped and associated data structure cleaned.
> > 
> > But with notification queue, now it is possible that we get a locking
> > request on request queue and send the notification back on a different
> > queue (notificaiton queue). This means, we need to make sure that
> 
> s/notificaiton/notification/
> 
> > notifiation queue has not already been shutdown or is not being
> 
> s/notifiation/notification/

Will fix both.

[..]
> >  /* Callback from libvhost-user on start or stop of a queue */
> > @@ -934,7 +950,16 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
> >   * the queue thread doesn't block in virtio_send_msg().
> >   */
> >  vu_dispatch_unlock(vud);
> > -fv_queue_cleanup_thread(vud, qidx);
> > +
> > +/*
> > + * If queue 0 is being shutdown, treat it as if device is being
> > + * shutdown and stop all queues.
> > + */
> 
> Please expand this comment so it's clear why we do this.

Ok, will do. I put the justification in commit message but it is a good
idea to put it here as well.

Vivek

[PATCH v2] hw/usb/vt82c686-uhci-pci: Use ISA instead of PCI interrupts

2021-10-05 Thread BALATON Zoltan

This device is part of a superio/ISA bridge chip and IRQs from it are
routed to an ISA interrupt set by the Interrupt Line PCI config
register. Change uhci_update_irq() to allow this and use it from
vt82c686-uhci-pci.

Signed-off-by: BALATON Zoltan 
Reviewed-by: Jiaxun Yang 
---
v2: Do it differently to confine isa reference to vt82c686-uhci-pci as
hcd-uhci is also used on machines that don't have isa. Left Jiaxun's
R-b there as he checked it's the same for VT82C686B and gave R-b for
the approach which still holds but speak up if you think otherwise.

 hw/usb/hcd-uhci.c  | 15 +--
 hw/usb/hcd-uhci.h  |  8 +---
 hw/usb/vt82c686-uhci-pci.c | 10 ++
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/hw/usb/hcd-uhci.c b/hw/usb/hcd-uhci.c
index 0cb02a6432..7924cfffdb 100644
--- a/hw/usb/hcd-uhci.c
+++ b/hw/usb/hcd-uhci.c
@@ -288,9 +288,14 @@ static UHCIAsync *uhci_async_find_td(UHCIState *s, uint32_t td_addr)
 return NULL;
 }
 
+static void uhci_pci_set_irq(UHCIState *s, int level)
+{
+pci_set_irq(&s->dev, level);
+}
+
 static void uhci_update_irq(UHCIState *s)
 {
-int level;
+int level = 0;
 if (((s->status2 & 1) && (s->intr & (1 << 2))) ||
 ((s->status2 & 2) && (s->intr & (1 << 3))) ||
 ((s->status & UHCI_STS_USBERR) && (s->intr & (1 << 0))) ||
@@ -298,10 +303,8 @@ static void uhci_update_irq(UHCIState *s)
 (s->status & UHCI_STS_HSERR) ||
 (s->status & UHCI_STS_HCPERR)) {
 level = 1;
-} else {
-level = 0;
 }
-pci_set_irq(&s->dev, level);
+s->set_irq(s, level);
 }
 
 static void uhci_reset(DeviceState *dev)
@@ -1170,9 +1173,9 @@ void usb_uhci_common_realize(PCIDevice *dev, Error **errp)
 
 pci_conf[PCI_CLASS_PROG] = 0x00;
 /* TODO: reset value should be 0. */
-pci_conf[USB_SBRN] = USB_RELEASE_1; // release number
-
+pci_conf[USB_SBRN] = USB_RELEASE_1; /* release number */
 pci_config_set_interrupt_pin(pci_conf, u->info.irq_pin + 1);
+s->set_irq = uhci_pci_set_irq;
 
 if (s->masterbus) {
 USBPort *ports[NB_PORTS];
diff --git a/hw/usb/hcd-uhci.h b/hw/usb/hcd-uhci.h
index e61d8fcb19..ecd19762d6 100644
--- a/hw/usb/hcd-uhci.h
+++ b/hw/usb/hcd-uhci.h
@@ -42,7 +42,9 @@ typedef struct UHCIPort {
 uint16_t ctrl;
 } UHCIPort;
 
-typedef struct UHCIState {
+typedef struct UHCIState UHCIState;
+
+struct UHCIState {
 PCIDevice dev;
 MemoryRegion io_bar;
 USBBus bus; /* Note unused when we're a companion controller */
@@ -60,7 +62,7 @@ typedef struct UHCIState {
 uint32_t frame_bandwidth;
 bool completions_only;
 UHCIPort ports[NB_PORTS];
-
+void (*set_irq)(UHCIState *s, int level);
 /* Interrupts that should be raised at the end of the current frame.  */
 uint32_t pending_int_mask;
 
@@ -72,7 +74,7 @@ typedef struct UHCIState {
 char *masterbus;
 uint32_t firstport;
 uint32_t maxframes;
-} UHCIState;
+};
 
 #define TYPE_UHCI "pci-uhci-usb"
 DECLARE_INSTANCE_CHECKER(UHCIState, UHCI, TYPE_UHCI)
diff --git a/hw/usb/vt82c686-uhci-pci.c b/hw/usb/vt82c686-uhci-pci.c
index b109c21603..f6bae704be 100644
--- a/hw/usb/vt82c686-uhci-pci.c
+++ b/hw/usb/vt82c686-uhci-pci.c
@@ -1,6 +1,15 @@
 #include "qemu/osdep.h"
+#include "hw/irq.h"
 #include "hcd-uhci.h"
 
+static void uhci_isa_set_irq(UHCIState *s, int level)
+{
+uint8_t irq = pci_get_byte(s->dev.config + PCI_INTERRUPT_LINE);
+if (irq > 0 && irq < 15) {
+qemu_set_irq(isa_get_irq(NULL, irq), level);
+}
+}
+
 static void usb_uhci_vt82c686b_realize(PCIDevice *dev, Error **errp)
 {
 UHCIState *s = UHCI(dev);
@@ -14,6 +23,7 @@ static void usb_uhci_vt82c686b_realize(PCIDevice *dev, Error **errp)
 pci_set_long(pci_conf + 0xc0, 0x2000);
 
 usb_uhci_common_realize(dev, errp);
+s->set_irq = uhci_isa_set_irq;
 }
 
 static UHCIInfo uhci_info[] = {
-- 
2.21.4
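
The routing hook introduced by this patch can be illustrated outside QEMU. The following is a minimal sketch with mock types, not the QEMU code itself: MockUHCI and every mock_* function are hypothetical stand-ins, and the arrays merely record which line would have been driven. It shows the common realize step installing the PCI default and the VT82C686-specific realize step overriding it afterwards, as the diff does with s->set_irq.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UHCIState; only what the sketch needs. */
typedef struct MockUHCI MockUHCI;

struct MockUHCI {
    uint8_t interrupt_line;                  /* mirrors PCI_INTERRUPT_LINE */
    int pci_level;                           /* last level sent to the PCI pin */
    int isa_level[16];                       /* last level sent per ISA IRQ */
    void (*set_irq)(MockUHCI *s, int level); /* interrupt routing hook */
};

/* Default routing: drive the PCI interrupt pin. */
static void mock_pci_set_irq(MockUHCI *s, int level)
{
    s->pci_level = level;
}

/*
 * Superio-style routing: drive the ISA IRQ selected by the Interrupt Line
 * register. The bound (0 < irq < 15) is copied from the patch as posted.
 */
static void mock_isa_set_irq(MockUHCI *s, int level)
{
    uint8_t irq = s->interrupt_line;

    if (irq > 0 && irq < 15) {
        s->isa_level[irq] = level;
    }
}

/* Common realize installs the default hook. */
static void mock_common_realize(MockUHCI *s)
{
    assert(s != NULL);
    s->set_irq = mock_pci_set_irq;
}

/* Device-specific realize runs the common part, then overrides the hook. */
static void mock_vt82c686_realize(MockUHCI *s)
{
    mock_common_realize(s);
    s->set_irq = mock_isa_set_irq;
}

/* The shared update path only knows about the hook, not about ISA or PCI. */
static void mock_update_irq(MockUHCI *s, int pending)
{
    s->set_irq(s, pending ? 1 : 0);
}
```

Keeping the override in the device-specific realize function is what confines the ISA dependency to one file: the shared update path never needs to know which bus the interrupt ends up on.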

Re: [Virtio-fs] [PATCH 07/13] virtiofsd: Release file locks using F_UNLCK

2021-10-05 Thread Christophe de Dinechin



On 2021-09-30 at 11:30 -04, Vivek Goyal  wrote...
> We are emulating posix locks for guest using open file description locks
> in virtiofsd. When any of the fd is closed in guest, we find associated
> OFD lock fd (if there is one) and close it to release all the locks.
>
> Assumption here is that there is no other thread using lo_inode_plock
> structure or plock->fd, hence it is safe to do so.
>
> But now we are about to introduce blocking variant of locks (SETLKW),
> and that means we might be waiting for a lock to be available and
> using plock->fd. And that means there are still users of plock
> structure.
>
> So release locks using fcntl(SETLK, F_UNLCK) instead of closing fd
> and plock will be freed later when lo_inode is being freed.
>
> Signed-off-by: Vivek Goyal 
> Signed-off-by: Ioannis Angelakopoulos 
> ---
>  tools/virtiofsd/passthrough_ll.c | 21 +
>  1 file changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 38b2af8599..6928662e22 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -1557,9 +1557,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
>  lo_map_remove(&lo->ino_map, inode->fuse_ino);
>  g_hash_table_remove(lo->inodes, &inode->key);
>  if (lo->posix_lock) {
> -if (g_hash_table_size(inode->posix_locks)) {
> -fuse_log(FUSE_LOG_WARNING, "Hash table is not empty\n");
> -}
>  g_hash_table_destroy(inode->posix_locks);
>  pthread_mutex_destroy(&inode->plock_mutex);
>  }
> @@ -2266,6 +2263,8 @@ static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>  (void)ino;
>  struct lo_inode *inode;
>  struct lo_data *lo = lo_data(req);
> +struct lo_inode_plock *plock;
> +struct flock flock;
>
>  inode = lo_inode(req, ino);
>  if (!inode) {
> @@ -2282,8 +2281,22 @@ static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>  /* An fd is going away. Cleanup associated posix locks */
>  if (lo->posix_lock) {
>  pthread_mutex_lock(&inode->plock_mutex);
> -g_hash_table_remove(inode->posix_locks,

I'm curious why the g_hash_table_remove above is not in the 'if' below?

> +plock = g_hash_table_lookup(inode->posix_locks,
>  GUINT_TO_POINTER(fi->lock_owner));
> +
> +if (plock) {
> +/*
> + * An fd is being closed. For posix locks, this means
> + * drop all the associated locks.
> + */
> +memset(&flock, 0, sizeof(struct flock));
> +flock.l_type = F_UNLCK;
> +flock.l_whence = SEEK_SET;
> +/* Unlock whole file */
> +flock.l_start = flock.l_len = 0;
> +fcntl(plock->fd, F_OFD_SETLK, &flock);
> +}
> +
>  pthread_mutex_unlock(&inode->plock_mutex);
>  }
>  res = close(dup(lo_fi_fd(req, fi)));


--
Cheers,
Christophe de Dinechin (IRC c3d)
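
For readers unfamiliar with open file description (OFD) locks, the technique in the quoted hunk can be sketched in isolation: unlock the whole file through the descriptor instead of closing it, so the descriptor stays usable by a thread that may still be blocked on it. This is a minimal Linux-only sketch, not the virtiofsd code; the three helper names are hypothetical, and the fallback constants are the asm-generic Linux values, supplied only in case the headers hide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Fallback for toolchains whose headers hide the Linux OFD commands. */
#ifndef F_OFD_GETLK
#define F_OFD_GETLK  36
#define F_OFD_SETLK  37
#define F_OFD_SETLKW 38
#endif

/* Fill a whole-file lock request; l_pid must stay 0 for OFD commands. */
static void whole_file_request(struct flock *fl, short type)
{
    memset(fl, 0, sizeof(*fl));
    fl->l_type = type;
    fl->l_whence = SEEK_SET;
    fl->l_start = 0;
    fl->l_len = 0; /* 0 means "up to end of file" */
}

/* Take a whole-file OFD write lock through fd (non-blocking). */
static int take_whole_file_ofd_lock(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    whole_file_request(&fl, F_WRLCK);
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/* Drop every OFD lock held through fd while keeping fd open. */
static int release_all_ofd_locks(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    whole_file_request(&fl, F_UNLCK);
    return fcntl(fd, F_OFD_SETLK, &fl);
}

/*
 * Ask whether a write lock taken through fd would conflict with a lock
 * held by another open file description. Returns F_UNLCK when nothing
 * conflicts, the conflicting lock type otherwise, or -1 on error.
 */
static int probe_whole_file_ofd_lock(int fd)
{
    struct flock fl;

    assert(fd >= 0);
    whole_file_request(&fl, F_WRLCK);
    if (fcntl(fd, F_OFD_GETLK, &fl) < 0) {
        return -1;
    }
    return fl.l_type;
}
```

Because OFD locks belong to the open file description rather than to the process, a second open() of the same file inside one process is enough to observe whether the first description still holds a lock.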

[PATCH v2 1/3] vdpa: Skip protected ram IOMMU mappings

2021-10-05 Thread Eugenio Pérez

Following the logic of commit 56918a126ae ("memory: Add RAM_PROTECTED
flag to skip IOMMU mappings") with VFIO, skip memory sections
inaccessible via normal mechanisms, including DMA.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 47d7a5a23d..ea1aa71ad8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -28,6 +28,7 @@ static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
 return (!memory_region_is_ram(section->mr) &&
 !memory_region_is_iommu(section->mr)) ||
+memory_region_is_protected(section->mr) ||
/* vhost-vDPA doesn't allow MMIO to be mapped  */
 memory_region_is_ram_device(section->mr) ||
/*
-- 
2.27.0

Re: [PATCH v3 6/6] tests/qapi-schema: Test cases for aliases

2021-10-05 Thread Markus Armbruster

Kevin Wolf  writes:

> Am 02.10.2021 um 15:33 hat Markus Armbruster geschrieben:
>> I apologize for this wall of text.  It's a desperate attempt to cut
>> through the complexity and my confusion, and make sense of the actual
>> problems we're trying to solve.
>> 
>> So, what problems exactly are we trying to solve?
>
> I'll start with replying to your final question because I think it's
> more helpful to start with the big picture than with details.
>
> So tools like libvirt want to have a single consistent interface to
> configure things on startup and at runtime. We also intend to support
> configuration files that should this time support all of the options and
> not just a few chosen ones.

Yes.

> The hypothesis is that QAPIfying the command line is the correct
> solution for both of these problems, i.e. all available command line
> options must be present in the QAPI schema and will be processed by a
> single parser shared with QMP to make sure they are consistent.

Yes.

This leads us to JSON option arguments and configuration files.
Well-suited for management applications that already use QMP.

> Adding QAPIfied versions of individual command line options are steps
> towards this goal. As soon as they exist for every option, the final
> conversion from an open coded getopt() loop (or in fact a hand crafted
> parser in the case of vl.c) to something generated from the QAPI schema
> should be reasonably easy.

Yes.

> You're right that adding a second JSON-based command line interface for
> every option can already achieve the goal of providing a unified
> external interface, at the cost of (interface and code) duplication. Is
> this duplication desirable? Certainly not. Is it acceptable? You might
> get different opinions on this one.

We'd certainly prefer CLI options to match corresponding QMP commands
exactly.

Unfortunately, existing CLI options deviate from corresponding QMP
commands, and existing CLI options without corresponding QMP commands
may violate QMP design rules.

Note: these issues pertain to human-friendly option syntax.  The
machine-friendly option syntax is still limited to just a few options,
and it does match QMP there.

> In my opinion, we should try to get rid of hand crafted parsers where
> it's reasonably possible, and take advantage of the single unified
> option structure that QAPI provides. -chardev specifically has a hand
> crafted parser that essentially duplicates the automatically generated
> QAPI visitor code, except for the nesting and some differences in option
> names.

We should definitely parse JSON option arguments with the QAPI
machinery, and nothing more.

Parsing human-friendly arguments with it is desirable, but the need for
backward compatibility can make it difficult.  Even where compatibility
is of no concern, simply swapping concrete JSON syntax for dotted keys
may result in human interfaces that are less than friendly.

Are we in agreement that this is the problem at hand?

> Aliases are one tool that can help bridge these differences in a generic
> way with minimal effort in the individual case. They are not _necessary_
> to solve the problem; we could instead just use manually written code to
> manipulate input QDicts so that QAPI visitors accept them. Even with
> aliases, there are a few things left in the chardev code that are
> converted this way. Aliases just greatly reduce the amount of this code
> and make the conversion declarative instead.

Understood.
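
To make "declarative instead of manually written" concrete, here is a minimal sketch of the idea in C. It is not the QAPI alias implementation from the series: the table contents, the SketchAlias type and sketch_resolve_key() are all hypothetical, and real aliases operate on QAPI schema members rather than on dotted strings. The point is only that the mapping lives in data, so adding a compatibility name means adding a table row rather than more QDict-rewriting code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One declarative alias: a flat legacy key and the nested key it stands for. */
typedef struct SketchAlias {
    const char *alias;
    const char *target;
} SketchAlias;

/* Hypothetical table; the key names are illustrative only. */
static const SketchAlias sketch_aliases[] = {
    { "host", "addr.host" },
    { "port", "addr.port" },
};

/* Return the canonical key for an input key, or the key itself if no alias. */
static const char *sketch_resolve_key(const char *key)
{
    size_t i;

    assert(key != NULL);
    for (i = 0; i < sizeof(sketch_aliases) / sizeof(sketch_aliases[0]); i++) {
        if (strcmp(key, sketch_aliases[i].alias) == 0) {
            return sketch_aliases[i].target;
        }
    }
    return key;
}
```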

> Now a key point in the previous paragraph is that aliases add a generic
> way to do this. So even if they are immediately motivated by -chardev,
> it might be worth looking at other cases they could enable if you think
> that -chardev alone isn't sufficient justification to have this tool.
> I guess this is the point where things become a bit less clear because
> people are just hand waving with vague ideas for additional uses.
>
> Do we need to invest more thought on these other cases? We probably do
> if it makes a difference for the answer to the question whether we want
> to add aliases to our toolbox. Does it?

I hope we can make a case for aliases without looking beyond CLI
QAPIfication.  That's a wide field already, with enough opportunity to
get lost in details.

If we later put aliases to other uses, we might have to adapt them some.
That's okay.  Designing for one problem we have and understand has a
much better chance of success than trying to design for all problems we
might have.

There are many CLI options to be QAPIfied.  -chardev is one of the
thornier ones, which makes it a useful example.

>> But what about the dotted keys argument?
>> 
>> One point of view is the difference between the JSON and the dotted keys
>> argument should be concrete syntax only.  Fine print: there may be
>> arguments dotted keys can't express, but let's ignore that here.
>> 
>> Another point of view is that dotted keys arguments are to JSON
>> arguments what HMP is to QMP: a (hopefully) human-f

[PATCH v2 0/3] vdpa: Check iova range on memory regions ops

2021-10-05 Thread Eugenio Pérez

At this moment vdpa will not send memory regions bigger than 1<<63.
However, actual iova range could be way more restrictive than that.

Since we can obtain the range through vdpa ioctl call, just save it
from the beginning of the operation and check against it.

Changes from v1:
* Use of int128_gt instead of plain uint64_t < comparison on memory
  range end.
* Document vhost_vdpa_section_end's return value so it's clear that
  it returns "one past end".

Eugenio Pérez (3):
  vdpa: Skip protected ram IOMMU mappings
  vdpa: Add vhost_vdpa_section_end
  vdpa: Check for iova range at mappings changes

 include/hw/virtio/vhost-vdpa.h |  2 +
 hw/virtio/vhost-vdpa.c | 87 ++
 hw/virtio/trace-events |  1 +
 3 files changed, 69 insertions(+), 21 deletions(-)

-- 
2.27.0
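
The reason for moving away from a plain uint64_t comparison on the range end can be shown with a small standalone sketch. It is not the code from this series, which uses QEMU's Int128 helpers; the function names are hypothetical, and the sketch assumes the device range [first, last] is inclusive at both ends. The naive version wraps when start + size crosses 2^64 and then accepts a section it should reject, while the safe version never adds to start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Naive check: start + size - 1 wraps around for sections that reach
 * past the top of the 64-bit space, so the comparison can pass wrongly.
 */
static bool naive_section_in_range(uint64_t start, uint64_t size,
                                   uint64_t first, uint64_t last)
{
    return start >= first && start + size - 1 <= last;
}

/*
 * Overflow-free check: is [start, start + size) inside [first, last]?
 * Instead of computing the end, compare the offset of the last byte
 * (size - 1) against the room left between start and last.
 */
static bool section_in_iova_range(uint64_t start, uint64_t size,
                                  uint64_t first, uint64_t last)
{
    assert(first <= last);
    if (size == 0) {
        return true; /* nothing to map */
    }
    if (start < first || start > last) {
        return false;
    }
    return size - 1 <= last - start;
}
```

Using a 128-bit type for the end address, as the series does, is the other common way to avoid the wraparound; the subtraction form above simply stays within 64 bits.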

[PATCH v2 3/3] vdpa: Check for iova range at mappings changes

2021-10-05 Thread Eugenio Pérez

Check vdpa device range before updating memory regions so we don't add
any outside of it, and report the invalid change if any.

Signed-off-by: Eugenio Pérez 
---
 include/hw/virtio/vhost-vdpa.h |  2 +
 hw/virtio/vhost-vdpa.c | 68 ++
 hw/virtio/trace-events |  1 +
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index a8963da2d9..c288cf7ecb 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -13,6 +13,7 @@
 #define HW_VIRTIO_VHOST_VDPA_H
 
 #include "hw/virtio/virtio.h"
+#include "standard-headers/linux/vhost_types.h"
 
 typedef struct VhostVDPAHostNotifier {
 MemoryRegion mr;
@@ -24,6 +25,7 @@ typedef struct vhost_vdpa {
 uint32_t msg_type;
 bool iotlb_batch_begin_sent;
 MemoryListener listener;
+struct vhost_vdpa_iova_range iova_range;
 struct vhost_dev *dev;
 VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 } VhostVDPA;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index be7c63b4ba..6654287050 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -37,20 +37,34 @@ static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
 return llend;
 }
 
-static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
-{
-return (!memory_region_is_ram(section->mr) &&
-!memory_region_is_iommu(section->mr)) ||
-memory_region_is_protected(section->mr) ||
-   /* vhost-vDPA doesn't allow MMIO to be mapped  */
-memory_region_is_ram_device(section->mr) ||
-   /*
-* Sizing an enabled 64-bit BAR can cause spurious mappings to
-* addresses in the upper part of the 64-bit address space.  These
-* are never accessed by the CPU and beyond the address width of
-* some IOMMU hardware.  TODO: VDPA should tell us the IOMMU width.
-*/
-   section->offset_within_address_space & (1ULL << 63);
+static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
+uint64_t iova_min,
+uint64_t iova_max)
+{
+Int128 llend;
+
+if ((!memory_region_is_ram(section->mr) &&
+ !memory_region_is_iommu(section->mr)) ||
+memory_region_is_protected(section->mr) ||
+/* vhost-vDPA doesn't allow MMIO to be mapped  */
+memory_region_is_ram_device(section->mr)) {
+return true;
+}
+
+if (section->offset_within_address_space < iova_min) {
+error_report("RAM section out of device range (min=%" PRIu64 ", addr=%" PRIu64 ")",
+ iova_min, section->offset_within_address_space);
+return true;
+}
+
+llend = vhost_vdpa_section_end(section);
+if (int128_gt(llend, int128_make64(iova_max))) {
+error_report("RAM section out of device range (max=%" PRIu64 ", end addr=%" PRIu64 ")",
+ iova_max, int128_get64(llend));
+return true;
+}
+
+return false;
 }
 
 static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,
@@ -162,7 +176,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 void *vaddr;
 int ret;
 
-if (vhost_vdpa_listener_skipped_section(section)) {
+if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
+v->iova_range.last)) {
 return;
 }
 
@@ -220,7 +235,8 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 Int128 llend, llsize;
 int ret;
 
-if (vhost_vdpa_listener_skipped_section(section)) {
+if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first,
+v->iova_range.last)) {
 return;
 }
 
@@ -288,9 +304,24 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
 vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
 }
 
+static int vhost_vdpa_get_iova_range(struct vhost_vdpa *v)
+{
+int ret;
+
+ret = vhost_vdpa_call(v->dev, VHOST_VDPA_GET_IOVA_RANGE, &v->iova_range);
+if (ret != 0) {
+return ret;
+}
+
+trace_vhost_vdpa_get_iova_range(v->dev, v->iova_range.first,
+v->iova_range.last);
+return ret;
+}
+
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 {
 struct vhost_vdpa *v;
+int r;
 assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);
 trace_vhost_vdpa_init(dev, opaque);
 
@@ -300,6 +331,11 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 v->listener = vhost_vdpa_memory_listener;
 v->msg_type = VHOST_IOTLB_MSG_V2;
 
+r = vhost_vdpa_get_iova_range(v);
+if (unlikely(r != 0)) {
+return r;
+}
+
 vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDG

[PATCH v2 2/3] vdpa: Add vhost_vdpa_section_end

2021-10-05 Thread Eugenio Pérez

Abstract this operation, which will be reused when validating the region
against the iova range that the device supports.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index ea1aa71ad8..be7c63b4ba 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -24,6 +24,19 @@
 #include "trace.h"
 #include "qemu-common.h"
 
+/*
+ * Return one past the end of the section. Be careful with uint64_t
+ * conversions!
+ */
+static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section)
+{
+Int128 llend = int128_make64(section->offset_within_address_space);
+llend = int128_add(llend, section->size);
+llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+
+return llend;
+}
+
 static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
 return (!memory_region_is_ram(section->mr) &&
@@ -160,10 +173,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
 }
 
 iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-llend = int128_make64(section->offset_within_address_space);
-llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
-
+llend = vhost_vdpa_section_end(section);
 if (int128_ge(int128_make64(iova), llend)) {
 return;
 }
@@ -221,9 +231,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener *listener,
 }
 
 iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
-llend = int128_make64(section->offset_within_address_space);
-llend = int128_add(llend, section->size);
-llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+llend = vhost_vdpa_section_end(section);
 
 trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend));
 
-- 
2.27.0
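
The warning about uint64_t conversions in the new comment can be demonstrated with a standalone sketch. It does not use QEMU's Int128 API: it relies on the unsigned __int128 extension available in GCC and Clang, assumes a 4 KiB page, and all sketch_* names are hypothetical. The end is "one past the last byte", rounded down to a page boundary, so for a section that ends exactly at the top of the address space it equals 2^64 and truncates to 0 when converted to uint64_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* GCC/Clang extension, assumed available */

#define SKETCH_PAGE_SIZE ((u128)4096) /* assumption: 4 KiB target pages */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* One past the end of the section, rounded down to a page boundary. */
static u128 sketch_section_end(uint64_t start, u128 size)
{
    return ((u128)start + size) & SKETCH_PAGE_MASK;
}

/* Start of the section, rounded up to a page boundary. */
static u128 sketch_section_begin(uint64_t start)
{
    return ((u128)start + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
}

/* True when no whole page is left after aligning both ends inwards. */
static bool sketch_section_is_empty(uint64_t start, u128 size)
{
    return sketch_section_begin(start) >= sketch_section_end(start, size);
}
```

The emptiness test mirrors the int128_ge(int128_make64(iova), llend) early return in the region_add hunk: a section smaller than a page, or one that straddles a page boundary without covering a full page, yields nothing to map.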

Re: [PATCH v0 0/2] virtio-blk and vhost-user-blk cross-device migration

2021-10-05 Thread Dr. David Alan Gilbert

* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Tue, Oct 05, 2021 at 02:18:40AM +0300, Roman Kagan wrote:
> > On Mon, Oct 04, 2021 at 11:11:00AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Oct 04, 2021 at 06:07:29PM +0300, Denis Plotnikov wrote:
> > > > It might be useful for cases where a slow block layer should be
> > > > replaced with a more performant one on a running VM without stopping
> > > > it, i.e. with very low downtime, comparable to that of a migration.
> > > > 
> > > > It's possible to achieve that for two reasons:
> > > > 
> > > > 1. The VMStates of "virtio-blk" and "vhost-user-blk" are almost the same.
> > > >    They consist of the identical VMSTATE_VIRTIO_DEVICE and differ from
> > > >    each other only in the values of the migration service fields.
> > > > 2. The device driver used in the guest is the same: virtio-blk.
> > > > 
> > > > In the series, cross-migration is achieved by adding a new type.
> > > > The new type uses the virtio-blk VMState instead of the
> > > > vhost-user-blk-specific VMState; it also implements migration
> > > > save/load callbacks to be compatible with the migration stream
> > > > produced by the "virtio-blk" device.
> > > > 
> > > > Adding the new type instead of modifying the existing one is convenient.
> > > > It makes it easy to distinguish the new virtio-blk-compatible
> > > > vhost-user-blk device from the existing non-compatible one using qemu
> > > > machinery without any other modifications. That gives all the variety
> > > > of qemu device-related constraints out of the box.
> > > 
> > > Hmm I'm not sure I understand. What is the advantage for the user?
> > > What if vhost-user-blk became an alias for vhost-user-virtio-blk?
> > > We could add some hacks to make it compatible for old machine types.
> > 
> > The point is that virtio-blk and vhost-user-blk are not
> > migration-compatible ATM.  OTOH they are the same device from the guest
> > POV so there's nothing fundamentally preventing the migration between
> > the two.  In particular, we see it as a means to switch between the
> > storage backend transports via live migration without disrupting the
> > guest.
> > 
> > Migration-wise virtio-blk and vhost-user-blk have in common
> > 
> > - the content of the VMState -- VMSTATE_VIRTIO_DEVICE
> > 
> > The two differ in
> > 
> > - the name and the version of the VMStateDescription
> > 
> > - virtio-blk has an extra migration section (via .save/.load callbacks
> >   on VirtioDeviceClass) containing requests in flight
> > 
> > It looks like to become migration-compatible with virtio-blk,
> > vhost-user-blk has to start using VMStateDescription of virtio-blk and
> > provide compatible .save/.load callbacks.  It isn't entirely obvious how
> > to make this machine-type-dependent, so we came up with a simpler idea
> > of defining a new device that shares most of the implementation with the
> > original vhost-user-blk except for the migration stuff.  We're certainly
> > open to suggestions on how to reconcile this under a single
> > vhost-user-blk device, as this would be more user-friendly indeed.
> > 
> > We considered using a class property for this and defining the
> > respective compat clause, but IIUC the class constructors (where .vmsd
> > and .save/.load are defined) are not supposed to depend on class
> > properties.
> > 
> > Thanks,
> > Roman.
> 
> So the question is how to make vmsd depend on machine type.
> CC Eduardo who poked at this kind of compat stuff recently,
> paolo who looked at qom things most recently and dgilbert
> for advice on migration.

I don't think I've seen anyone change vmsd name dependent on machine
type; making fields appear/disappear is easy - that just ends up as a
property on the device that's checked;  I guess if that property is
global (rather than per instance) then you can check it in
vhost_user_blk_class_init and swing the dc->vmsd pointer?

Dave


> -- 
> MST
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
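
The suggestion above, checking a global switch in class init and pointing dc->vmsd at one of two descriptions, can be sketched with mock types. This is only an illustration of the shape of the idea, not QEMU code: MockVMSD, MockDeviceClass, mock_class_init() and the virtio_blk_compat flag are hypothetical, the version numbers are placeholders, and whether a QEMU class initializer may legitimately depend on such a switch is exactly the open question in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for VMStateDescription and DeviceClass. */
typedef struct MockVMSD {
    const char *name;
    int version_id; /* placeholder values below, not taken from QEMU */
} MockVMSD;

typedef struct MockDeviceClass {
    const MockVMSD *vmsd;
} MockDeviceClass;

static const MockVMSD mock_vmsd_vhost_user_blk = { "vhost-user-blk", 1 };
static const MockVMSD mock_vmsd_virtio_blk = { "virtio-blk", 2 };

/*
 * Class init picks the migration description once, based on a global
 * compatibility switch (for example one set by machine-type compat code).
 */
static void mock_class_init(MockDeviceClass *dc, bool virtio_blk_compat)
{
    assert(dc != NULL);
    dc->vmsd = virtio_blk_compat ? &mock_vmsd_virtio_blk
                                 : &mock_vmsd_vhost_user_blk;
}
```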
