Re: [RFC PATCH] tests/qtest: properly initialise the vring used idx

2022-04-07 Thread Stefan Hajnoczi
On Wed, Apr 06, 2022 at 06:33:56PM +0100, Alex Bennée wrote:
> Eric noticed while attempting to enable the vhost-user-blk-test for
> Aarch64 that things didn't work unless he put in a dummy
> guest_malloc() at the start of the test. Without it
> qvirtio_wait_used_elem() would assert when it reads a junk value for
> idx resulting in:
> 
>   qvirtqueue_get_buf: idx:2401 last_idx:0
>   qvirtqueue_get_buf: 0x7ffcb6d3fe74, (nil)
>   qvirtio_wait_used_elem: 300/0
>   ERROR:../../tests/qtest/libqos/virtio.c:226:qvirtio_wait_used_elem: assertion failed (got_desc_idx == desc_idx): (50331648 == 0)
>   Bail out! ERROR:../../tests/qtest/libqos/virtio.c:226:qvirtio_wait_used_elem: assertion failed (got_desc_idx == desc_idx): (50331648 == 0)
> 
> What was actually happening is the guest_malloc() effectively pushed
> the allocation of the vring into the next page which just happened to
> have clear memory. After much tedious tracing of the code I could see
> that qvring_init() does attempt to initialise a bunch of the vring
> structures but skips the vring->used.idx value. It is probably not
> wise to assume guest memory is zeroed anyway. Once the ring is
> properly initialised the hack is no longer needed to get things
> working.
> 
> Thanks-to: John Snow  for helping debug
> Cc: Eric Auger 
> Cc: Stefan Hajnoczi 
> Cc: Michael S. Tsirkin 
> Cc: Raphael Norwitz 
> Signed-off-by: Alex Bennée 
> ---
>  tests/qtest/libqos/virtio.c | 2 ++
>  1 file changed, 2 insertions(+)

Nice work!

Reviewed-by: Stefan Hajnoczi 
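
A minimal sketch of the failure mode described in the commit message, assuming a simplified used-ring header. The struct and the `demo_*` helpers are hypothetical and are not the libqos code; they only show why an init routine that skips one field works on zeroed memory and breaks on dirty memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the used ring header; not the libqos layout. */
struct demo_vring_used {
    uint16_t flags;
    uint16_t idx;
};

/* Fill the "guest memory" with junk so that no field is zero by accident. */
static void demo_dirty(struct demo_vring_used *used)
{
    memset(used, 0xA5, sizeof(*used));
}

/* Mimics an init routine that sets up the ring but forgets used.idx. */
static void demo_init_incomplete(struct demo_vring_used *used)
{
    used->flags = 0;
    /* used->idx keeps whatever the memory happened to contain. */
}

/* Mimics the fixed init: every field that is read later is written first. */
static void demo_init_complete(struct demo_vring_used *used)
{
    used->flags = 0;
    used->idx = 0;
}
```

With dirty memory the incomplete init leaves a junk `idx` behind, which is the kind of value a later "wait for used element" loop would trip over; the complete init does not depend on the allocator handing out clean pages.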




Re: [PATCH 4/7] virtio: don't read pending event on host notifier if disabled

2022-04-07 Thread Jason Wang



On 2022/4/6 at 3:18 AM, Si-Wei Liu wrote:



On 4/1/2022 7:00 PM, Jason Wang wrote:

On Sat, Apr 2, 2022 at 4:37 AM Si-Wei Liu  wrote:



On 3/31/2022 1:36 AM, Jason Wang wrote:
On Thu, Mar 31, 2022 at 12:41 AM Si-Wei Liu  
wrote:


On 3/30/2022 2:14 AM, Jason Wang wrote:
On Wed, Mar 30, 2022 at 2:33 PM Si-Wei Liu 
 wrote:

Previous commit prevents vhost-user and vhost-vdpa from using
userland vq handler via disable_ioeventfd_handler. The same
needs to be done for host notifier cleanup too, as the
virtio_queue_host_notifier_read handler still tends to read
pending event left behind on ioeventfd and attempts to handle
outstanding kicks from QEMU userland vq.
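
A minimal sketch of the idea in the paragraph above, assuming a toy notifier model. None of the `demo_*` names are QEMU APIs; the point is only that cleanup must not consume a leftover event and dispatch it to the userland handler when that handler has been disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a host notifier; not QEMU code. */
struct demo_notifier {
    bool pending;           /* an event was left behind on the eventfd */
    bool handler_disabled;  /* the userland vq handler must not run */
    int  handled;           /* how many times the handler was invoked */
};

/* Stand-in for the userland vq handler (e.g. control vq processing). */
static void demo_vq_handler(struct demo_notifier *n)
{
    n->handled++;
}

/* Test-and-clear the pending event, as a notifier read would. */
static bool demo_test_and_clear(struct demo_notifier *n)
{
    bool was_pending = n->pending;
    n->pending = false;
    return was_pending;
}

/* Cleanup path: skip the read entirely when the handler is disabled. */
static void demo_cleanup(struct demo_notifier *n)
{
    if (n->handler_disabled) {
        return;
    }
    if (demo_test_and_clear(n)) {
        demo_vq_handler(n);
    }
}
```

In this model a disabled handler never runs during cleanup, so a leftover kick cannot re-enter the device's status handling while it is being stopped.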

If vq handler is not disabled on cleanup, it may lead to sigsegv
with recursive virtio_net_set_status call on the control vq:

0  0x7f8ce3ff3387 in raise () at /lib64/libc.so.6
1  0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6
2  0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6
3  0x7f8ce3fec252 in  () at /lib64/libc.so.6
4  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=<optimized out>, idx=) at ../hw/virtio/vhost-vdpa.c:563
5  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=<optimized out>, idx=) at ../hw/virtio/vhost-vdpa.c:558
6  0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, vdev=0x558f568f91f0, n=2, mask=<optimized out>) at ../hw/virtio/vhost.c:1557

I feel it's probably a bug elsewhere, e.g. when we fail to start
vhost-vDPA, it's up to QEMU to poll the host notifier and we
will fall back to the userspace vq handler.
Apologies, an incorrect stack trace was pasted, which actually came from
patch #1. I will post a v2 with the corresponding one as below:

0  0x55f800df1780 in qdev_get_parent_bus (dev=0x0) at
../hw/core/qdev.c:376
1  0x55f800c68ad8 in virtio_bus_device_iommu_enabled
(vdev=vdev@entry=0x0) at ../hw/virtio/virtio-bus.c:331
2  0x55f800d70d7f in vhost_memory_unmap (dev=) at
../hw/virtio/vhost.c:318
3  0x55f800d70d7f in vhost_memory_unmap (dev=,
buffer=0x7fc19bec5240, len=2052, is_write=1, access_len=2052) at
../hw/virtio/vhost.c:336
4  0x55f800d71867 in vhost_virtqueue_stop
(dev=dev@entry=0x55f8037ccc30, vdev=vdev@entry=0x55f8044ec590,
vq=0x55f8037cceb0, idx=0) at ../hw/virtio/vhost.c:1241
5  0x55f800d7406c in vhost_dev_stop (hdev=hdev@entry=0x55f8037ccc30,
vdev=vdev@entry=0x55f8044ec590) at ../hw/virtio/vhost.c:1839
6  0x55f800bf00a7 in vhost_net_stop_one (net=0x55f8037ccc30,
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:315
7  0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590,
ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7,
cvq=cvq@entry=1)
  at ../hw/net/vhost_net.c:423
8  0x55f800d4e628 in virtio_net_set_status (status=<optimized out>,
n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
9  0x55f800d4e628 in virtio_net_set_status
(vdev=vdev@entry=0x55f8044ec590, status=15 '\017') at
../hw/net/virtio-net.c:370
I don't understand why virtio_net_handle_ctrl() calls
virtio_net_set_status()...

The pending request left over on the ctrl vq was a VIRTIO_NET_CTRL_MQ
command, i.e. in virtio_net_handle_mq():

Completely forgot that the code was actually written by me :\


1413 n->curr_queue_pairs = queue_pairs;
1414 /* stop the backend before changing the number of queue_pairs to avoid handling a
1415  * disabled queue */
1416 virtio_net_set_status(vdev, vdev->status);
1417 virtio_net_set_queue_pairs(n);

Note that before the vdpa multiqueue support, there was never a vhost_dev
for ctrl_vq exposed, i.e. there's no host notifier set up for the
ctrl_vq on vhost_kernel as it is emulated in QEMU software.


10 0x55f800d534d8 in virtio_net_handle_ctrl (iov_cnt=, iov=, cmd=0 '\000', n=0x55f8044ec590) at
../hw/net/virtio-net.c:1408
11 0x55f800d534d8 in virtio_net_handle_ctrl (vdev=0x55f8044ec590,
vq=0x7fc1a7e888d0) at ../hw/net/virtio-net.c:1452
12 0x55f800d69f37 in virtio_queue_host_notifier_read
(vq=0x7fc1a7e888d0) at ../hw/virtio/virtio.c:2331
13 0x55f800d69f37 in virtio_queue_host_notifier_read
(n=n@entry=0x7fc1a7e8894c) at ../hw/virtio/virtio.c:3575
14 0x55f800c688e6 in virtio_bus_cleanup_host_notifier
(bus=, n=n@entry=14) at ../hw/virtio/virtio-bus.c:312
15 0x55f800d73106 in vhost_dev_disable_notifiers
(hdev=hdev@entry=0x55f8035b51b0, vdev=vdev@entry=0x55f8044ec590)
  at ../../../include/hw/virtio/virtio-bus.h:35
16 0x55f800bf00b2 in vhost_net_stop_one (net=0x55f8035b51b0,
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:316
17 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590,
ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7,
cvq=cvq@entry=1)
  at ../hw/net/vhost_net.c:423
18 0x55f800d4e628 in virtio_net_set_status (status=<optimized out>,
n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
19 0x55f800d4e628 in virtio_net_set_status (vdev=0x55f8044ec590,
status=15 '\017') at ../hw/net/virtio-net.c:370
20 0x55f800d6c4b2 in virtio_set_status (vdev=0x55f8044ec590,
val=) at ../hw/virtio/virti

Re: [PATCH v3 1/7] block/copy-before-write: refactor option parsing

2022-04-07 Thread Hanna Reitz

On 06.04.22 20:07, Vladimir Sementsov-Ogievskiy wrote:

We are going to add one more option of enum type. Let's refactor option
parsing so that we can simply work with BlockdevOptionsCbw object.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/copy-before-write.c | 55 ---
  1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index a8a06fdc09..6877ff893a 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c


[...]


@@ -376,6 +365,14 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
     BDRVCopyBeforeWriteState *s = bs->opaque;
     BdrvDirtyBitmap *bitmap = NULL;
     int64_t cluster_size;
+    g_autoptr(BlockdevOptions) full_opts = NULL;
+    BlockdevOptionsCbw *opts;
+
+    full_opts = cbw_parse_options(options, errp);
+    if (!full_opts) {
+        return -EINVAL;
+    }
+    opts = &full_opts->u.copy_before_write;


I would prefer an `assert(full_opts->driver == 
BLOCKDEV_DRIVER_COPY_BEFORE_WRITE);` here, but, either way:


Reviewed-by: Hanna Reitz 
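
A minimal sketch of the assertion suggested above, assuming a toy tagged union. The `demo_*` types are hypothetical and much smaller than the generated QAPI types; the sketch only shows checking the discriminator before taking a pointer into the union.

```c
#include <assert.h>

/* Hypothetical miniature of a QAPI-style tagged union; names are illustrative. */
enum demo_driver { DEMO_DRIVER_FILE, DEMO_DRIVER_CBW };

struct demo_file_opts { const char *filename; };
struct demo_cbw_opts  { int on_cbw_error; };

struct demo_options {
    enum demo_driver driver;          /* discriminator */
    union {
        struct demo_file_opts file;
        struct demo_cbw_opts  cbw;
    } u;
};

/* Return the copy-before-write branch, asserting the discriminator first. */
static struct demo_cbw_opts *demo_get_cbw(struct demo_options *o)
{
    assert(o->driver == DEMO_DRIVER_CBW);
    return &o->u.cbw;
}
```

The assert documents the invariant that the parser only ever hands back this one branch, and it fails loudly if a later change breaks that assumption.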




Re: [libvirt RFC] virFile: new VIR_FILE_WRAPPER_BIG_PIPE to improve performance

2022-04-07 Thread Claudio Fontana
On 4/5/22 10:35 AM, Dr. David Alan Gilbert wrote:
> * Claudio Fontana (cfont...@suse.de) wrote:
>> On 3/28/22 10:31 AM, Daniel P. Berrangé wrote:
>>> On Sat, Mar 26, 2022 at 04:49:46PM +0100, Claudio Fontana wrote:
 On 3/25/22 12:29 PM, Daniel P. Berrangé wrote:
> On Fri, Mar 18, 2022 at 02:34:29PM +0100, Claudio Fontana wrote:
>> On 3/17/22 4:03 PM, Dr. David Alan Gilbert wrote:
>>> * Claudio Fontana (cfont...@suse.de) wrote:
 On 3/17/22 2:41 PM, Claudio Fontana wrote:
> On 3/17/22 11:25 AM, Daniel P. Berrangé wrote:
>> On Thu, Mar 17, 2022 at 11:12:11AM +0100, Claudio Fontana wrote:
>>> On 3/16/22 1:17 PM, Claudio Fontana wrote:
 On 3/14/22 6:48 PM, Daniel P. Berrangé wrote:
> On Mon, Mar 14, 2022 at 06:38:31PM +0100, Claudio Fontana wrote:
>> On 3/14/22 6:17 PM, Daniel P. Berrangé wrote:
>>> On Sat, Mar 12, 2022 at 05:30:01PM +0100, Claudio Fontana wrote:
 the first user is the qemu driver,

 virsh save/resume would slow to a crawl with a default pipe 
 size (64k).

 This improves the situation by 400%.

 Going through io_helper still seems to incur some penalty (~15%-ish)
 compared with direct qemu migration to a nc socket to a file.

 Signed-off-by: Claudio Fontana 
 ---
  src/qemu/qemu_driver.c|  6 +++---
  src/qemu/qemu_saveimage.c | 11 ++-
  src/util/virfile.c| 12 
  src/util/virfile.h|  1 +
  4 files changed, 22 insertions(+), 8 deletions(-)

 Hello, I initially thought this to be a qemu performance issue,
 so you can find the discussion about this in qemu-devel:

 "Re: bad virsh save /dev/null performance (600 MiB/s max)"

 https://lists.gnu.org/archive/html/qemu-devel/2022-03/msg03142.html
>>
>>
>>> Current results show these experimental averages maximum throughput
>>> migrating to /dev/null per each FdWrapper Pipe Size (as per QEMU QMP
>>> "query-migrate", tests repeated 5 times for each).
>>> VM Size is 60G, most of the memory effectively touched before 
>>> migration,
>>> through user application allocating and touching all memory with
>>> pseudorandom data.
>>>
>>> 64K: 5200 Mbps (current situation)
>>> 128K:5800 Mbps
>>> 256K:   20900 Mbps
>>> 512K:   21600 Mbps
>>> 1M: 22800 Mbps
>>> 2M: 22800 Mbps
>>> 4M: 22400 Mbps
>>> 8M: 22500 Mbps
>>> 16M:22800 Mbps
>>> 32M:22900 Mbps
>>> 64M:22900 Mbps
>>> 128M:   22800 Mbps
>>>
>>> This above is the throughput out of patched libvirt with multiple 
>>> Pipe Sizes for the FDWrapper.
>>
>> Ok, it's bouncing around with noise after 1 MB. So I'd suggest that
>> libvirt attempt to raise the pipe limit to 1 MB by default, but
>> not try to go higher.
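
A minimal sketch of raising a pipe's capacity, assuming Linux and the `F_SETPIPE_SZ` / `F_GETPIPE_SZ` fcntl commands. The helper name `demo_grow_pipe` is hypothetical and this is not the libvirt implementation; unprivileged callers are capped by `/proc/sys/fs/pipe-max-size`, so the request is treated as best effort.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallbacks in case the feature-test macro took effect too late;
 * the values are the ones from the Linux UAPI headers. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/*
 * Try to grow the pipe behind `fd` to `want` bytes.
 * Returns the capacity actually in effect afterwards, or -1 if it
 * could not be queried. A failed resize is not treated as fatal.
 */
static long demo_grow_pipe(int fd, int want)
{
    if (fcntl(fd, F_SETPIPE_SZ, want) < 0) {
        /* Keep the current size, e.g. when `want` exceeds pipe-max-size. */
    }
    return fcntl(fd, F_GETPIPE_SZ);
}
```

A caller would create the pipe as usual and then call `demo_grow_pipe(fds[1], 1 << 20)` to ask for the 1 MB suggested above, carrying on with whatever capacity the kernel grants.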
>>
>>> As for the theoretical limit for the libvirt architecture,
>>> I ran a qemu migration directly issuing the appropriate QMP
>>> commands, setting the same migration parameters as per libvirt,
>>> and then migrating to a socket netcatted to /dev/null via
>>> {"execute": "migrate", "arguments": { "uri": "unix:///tmp/netcat.sock" } } :
>>>
>>> QMP:37000 Mbps
>>
>>> So although the Pipe size improves things (in particular the
>>> large jump is for the 256K size, although 1M seems a very good 
>>> value),
>>> there is still a second bottleneck in there somewhere that
>>> accounts for a loss of ~14200 Mbps in throughput.


 Interesting addition: I tested quickly on a system with faster cpus 
 and larger VM sizes, up to 200GB,
 and the difference in throughput libvirt vs qemu is basically the same 
 ~14500 Mbps.

 ~5 mbps qemu to netcat socket to /dev/null
 ~35500 mbps virsh save to /dev/null

 Seems it is not proportional to cpu speed by the looks of it (not a 
 totally fair comparison because the VM sizes are different).
>>>
>>> It might be closer to RAM or cache bandwidth limited though; for an 
>>> extra copy.
>>
>> I was thinking about sendfile(2) in iohelper, but that probably
>> can't work as the input fd is a socket, I am getting EINVAL.
>
> Yep, sendfile() requires the input to be a mmapable FD,
> and the output to be a socket.
>
> Try splic

Re: [RFC v2 1/8] blkio: add io_uring block driver using libblkio

2022-04-07 Thread Stefan Hajnoczi
On Wed, Apr 06, 2022 at 07:32:04PM +0200, Kevin Wolf wrote:
> Am 05.04.2022 um 17:33 hat Stefan Hajnoczi geschrieben:
> > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > high-performance disk I/O. It currently supports io_uring with
> > additional drivers planned.
> > 
> > One of the reasons for developing libblkio is that other applications
> > besides QEMU can use it. This will be particularly useful for
> > vhost-user-blk which applications may wish to use for connecting to
> > qemu-storage-daemon.
> > 
> > libblkio also gives us an opportunity to develop in Rust behind a C API
> > that is easy to consume from QEMU.
> > 
> > This commit adds an io_uring BlockDriver to QEMU using libblkio. For now
> > I/O buffers are copied through bounce buffers if the libblkio driver
> > requires it. Later commits add an optimization for pre-registering guest
> > RAM to avoid bounce buffers. It will be easy to add other libblkio
> > drivers since they will share the majority of code.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> 
> > +static BlockDriver bdrv_io_uring = {
> > +.format_name= "io_uring",
> > +.protocol_name  = "io_uring",
> > +.instance_size  = sizeof(BDRVBlkioState),
> > +.bdrv_needs_filename= true,
> > +.bdrv_parse_filename= blkio_parse_filename_io_uring,
> > +.bdrv_file_open = blkio_file_open,
> > +.bdrv_close = blkio_close,
> > +.bdrv_getlength = blkio_getlength,
> > +.has_variable_length= true,
> 
> This one is a bad idea. It means that every request will call
> blkio_getlength() first, which looks up the "capacity" property in
> libblkio and then calls lseek() for the io_uring backend.

Thanks for pointing this out. I didn't think this through. More below on
what I was trying to do.

> For other backends like the vhost_user one (where I just copied your
> definition and then noticed this behaviour), it involve a message over
> the vhost socket, which is even worse.

(A vhost-user-blk driver could cache the capacity field and update it
when a Configuration Change Notification is received. There is no need
to send a vhost-user protocol message every time.)
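
A minimal sketch of the caching idea in the parenthetical above, assuming a toy driver state. The `demo_*` names are hypothetical; `demo_query_capacity` stands in for whatever expensive lookup the backend needs (an lseek() or a message over the vhost socket).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical driver state; not a libblkio or QEMU structure. */
struct demo_blk {
    uint64_t backend_capacity;  /* what the backend would report */
    uint64_t cached_capacity;   /* what the driver hands out per request */
    int      queries;           /* number of expensive lookups performed */
};

/* Stand-in for the expensive backend lookup. */
static uint64_t demo_query_capacity(struct demo_blk *b)
{
    b->queries++;
    return b->backend_capacity;
}

/* Query once when the device is opened. */
static void demo_open(struct demo_blk *b)
{
    b->cached_capacity = demo_query_capacity(b);
}

/* Per-request length lookup: served from the cache, no round trip. */
static uint64_t demo_getlength(const struct demo_blk *b)
{
    return b->cached_capacity;
}

/* Refresh the cache when a configuration change notification arrives. */
static void demo_config_changed(struct demo_blk *b)
{
    b->cached_capacity = demo_query_capacity(b);
}
```

The trade-off is that the cached value is only as fresh as the last notification, which matches the model where a resize has to be announced rather than discovered on every request.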

> .has_variable_length was only meant for the host_floppy/cdrom drivers
> that have to deal with media change. Everything else just requires an
> explicit block_resize monitor command to be resized.

I was trying to support devices that can be resized below QEMU (e.g.
vhost-user-blk, vhost-vdpa-blk, and virtio-blk-pci). That was
unnecessary since QEMU doesn't support that model. If an LVM volume is
resized, for example, you still need to execute a monitor command to let
QEMU know.

I'll drop .has_variable_length.

Stefan




Re: [PATCH 1/3] vhost: Refactor vhost_reset_device() in VhostOps

2022-04-07 Thread Jason Wang



On 2022/4/2 at 1:14 PM, Michael Qiu wrote:



On 2022/4/2 10:38, Jason Wang wrote:


On 2022/4/1 at 7:06 PM, Michael Qiu wrote:

Currently in the vhost framework, vhost_reset_device() is misnamed.
Actually, it should be vhost_reset_owner().

In vhost-user, it is made compatible with the reset device op, but
vhost kernel is not compatible with it; vhost-vdpa
only implements the reset device action.

So we need to separate the function into vhost_reset_owner() and
vhost_reset_device(), so that different backends can use the
correct function.



I see no reason when RESET_OWNER needs to be done for kernel backend.



In kernel vhost, RESET_OWNER does indeed perform a vhost device-level reset:
vhost_net_reset_owner()


static long vhost_net_reset_owner(struct vhost_net *n)
{
[...]
    err = vhost_dev_check_owner(&n->dev);
    if (err)
        goto done;
    umem = vhost_dev_reset_owner_prepare();
    if (!umem) {
        err = -ENOMEM;
        goto done;
    }
    vhost_net_stop(n, &tx_sock, &rx_sock);
    vhost_net_flush(n);
    vhost_dev_stop(&n->dev);
    vhost_dev_reset_owner(&n->dev, umem);
    vhost_net_vq_reset(n);
[...]
}

In the history of QEMU, there is a commit:
commit d1f8b30ec8dde0318fd1b98d24a64926feae9625
Author: Yuanhan Liu 
Date:   Wed Sep 23 12:19:57 2015 +0800

    vhost: rename VHOST_RESET_OWNER to VHOST_RESET_DEVICE

    Quote from Michael:

    We really should rename VHOST_RESET_OWNER to VHOST_RESET_DEVICE.

but finally, it has been reverted by the author:
commit 60915dc4691768c4dc62458bb3e16c843fab091d
Author: Yuanhan Liu 
Date:   Wed Nov 11 21:24:37 2015 +0800

    vhost: rename RESET_DEVICE backto RESET_OWNER

    This patch basically reverts commit d1f8b30e.

    It turned out that it breaks stuff, so revert it:

http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg00949.html

It seems the kernel treats RESET_OWNER as a reset, but QEMU never calls this
function to do a reset.



The question is, we have managed to survive without using RESET_OWNER for the past
10 years. Is there any reason that we want to use it now?


Note that RESET_OWNER is only useful when the process wants to drop
its mm refcnt from vhost; it doesn't reset the device (e.g. it does not
even call vhost_vq_reset()).


(In particular, it was deprecated by the vhost-user protocol since its
semantics are ambiguous.)





And if I understand the code correctly, vhost-user "abuses"
RESET_OWNER for reset. So the current code looks fine?





Signed-off-by: Michael Qiu 
---
  hw/scsi/vhost-user-scsi.c |  6 +-
  hw/virtio/vhost-backend.c |  4 ++--
  hw/virtio/vhost-user.c    | 22 ++
  include/hw/virtio/vhost-backend.h |  2 ++
  4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 1b2f7ee..f179626 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -80,8 +80,12 @@ static void vhost_user_scsi_reset(VirtIODevice *vdev)
  return;
  }
-    if (dev->vhost_ops->vhost_reset_device) {
+    if (virtio_has_feature(dev->protocol_features,
+ VHOST_USER_PROTOCOL_F_RESET_DEVICE) &&
+ dev->vhost_ops->vhost_reset_device) {
  dev->vhost_ops->vhost_reset_device(dev);
+    } else if (dev->vhost_ops->vhost_reset_owner) {
+    dev->vhost_ops->vhost_reset_owner(dev);



Actually, I fail to understand why we need an indirection via 
vhost_ops. It's guaranteed to be vhost_user_ops.
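
A minimal sketch of the dispatch pattern in the hunk above, assuming a toy ops table. The `demo_*` names and the feature bit number are hypothetical; the sketch only shows preferring the device reset when the feature was negotiated and the op exists, and falling back to the owner reset otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit; not the real protocol feature number. */
#define DEMO_PROTOCOL_F_RESET_DEVICE 0

struct demo_dev;

struct demo_ops {
    int (*reset_device)(struct demo_dev *dev);
    int (*reset_owner)(struct demo_dev *dev);
};

struct demo_dev {
    uint64_t protocol_features;
    const struct demo_ops *ops;
    int last_call;  /* 1 = reset_device, 2 = reset_owner, 0 = none */
};

static int demo_reset_device(struct demo_dev *dev) { dev->last_call = 1; return 0; }
static int demo_reset_owner(struct demo_dev *dev)  { dev->last_call = 2; return 0; }

/* Prefer the device reset when negotiated; otherwise fall back. */
static int demo_reset(struct demo_dev *dev)
{
    int negotiated =
        (int)((dev->protocol_features >> DEMO_PROTOCOL_F_RESET_DEVICE) & 1);

    if (negotiated && dev->ops->reset_device) {
        return dev->ops->reset_device(dev);
    }
    if (dev->ops->reset_owner) {
        return dev->ops->reset_owner(dev);
    }
    return -1;  /* no reset operation available */
}
```

When only one backend can ever be behind the device, as the reply above points out for vhost-user-scsi, calling the backend function directly would avoid the function-pointer indirection altogether.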




  }
  }
diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index e409a86..abbaa8b 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -191,7 +191,7 @@ static int vhost_kernel_set_owner(struct vhost_dev *dev)
  return vhost_kernel_call(dev, VHOST_SET_OWNER, NULL);
  }
-static int vhost_kernel_reset_device(struct vhost_dev *dev)
+static int vhost_kernel_reset_owner(struct vhost_dev *dev)
  {
  return vhost_kernel_call(dev, VHOST_RESET_OWNER, NULL);
  }
@@ -317,7 +317,7 @@ const VhostOps kernel_ops = {
  .vhost_get_features = vhost_kernel_get_features,
  .vhost_set_backend_cap = vhost_kernel_set_backend_cap,
  .vhost_set_owner = vhost_kernel_set_owner,
-    .vhost_reset_device = vhost_kernel_reset_device,
+    .vhost_reset_owner = vhost_kernel_reset_owner,



I think we can delete the current vhost_reset_device() since it is not
used in any code path.




I planned to use it for vDPA reset, 



For vhost-vDPA it can call vhost_vdpa_reset_device() directly.

As I mentioned before, the only user of vhost_reset_device config ops is 
vhost-user-scsi but it should directly call the vhost_user_reset_device().


Thanks



and vhost-user-scsi also use device reset.

Thanks,
Michael


Thanks



  .vhost_get_vq_index = vhost_kernel_get_vq_index,
  #ifdef CONFIG_VHOST_VSOCK
  .vhost_vsock_set_guest_cid = 
vhost_kernel_vsock_set_guest_cid,

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c

Re: [PATCH v4] vdpa: reset the backend device in the end of vhost_net_stop()

2022-04-07 Thread Jason Wang



On 2022/4/6 at 8:56 AM, Si-Wei Liu wrote:



On 4/1/2022 7:20 PM, Jason Wang wrote:

Adding Michael.

On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu  wrote:



On 3/31/2022 7:53 PM, Jason Wang wrote:
On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu  
wrote:

Currently, when a VM powers off, it triggers a reset of the vdpa
device (such as an mlx bluefield2 VF) many times (with 1 datapath
queue pair and one control queue, it is triggered 3 times). This
leads to the issue below:

vhost VQ 2 ring restore failed: -22: Invalid argument (22)

This is because in vhost_net_stop(), it will stop all vhost devices bound to
this virtio device, and in vhost_dev_stop(), qemu tries to stop the device,
then stop the queue: vhost_virtqueue_stop().

In vhost_dev_stop(), it resets the device, which clears some flags
in the low-level driver, and in the next loop (stopping other vhost backends),
qemu tries to stop the queue corresponding to the vhost backend;
the driver finds that the VQ is invalid, and this is the root cause.

To solve the issue, vdpa should set vring unready, and
remove reset ops in device stop: vhost_dev_start(hdev, false).

and implement a new function vhost_dev_reset, only reset backend
device after all vhost(per-queue) stoped.

Typo.


Signed-off-by: Michael Qiu
Acked-by: Jason Wang 

Rethink this patch, consider there're devices that don't support
set_vq_ready(). I wonder if we need

1) uAPI to tell the user space whether or not it supports 
set_vq_ready()

I guess what's more relevant here is to define the uAPI semantics for
unready i.e. set_vq_ready(0) for resuming/stopping virtqueue 
processing,

as starting vq is comparatively less ambiguous.

Yes.


Considering the
likelihood that this interface may be used for live migration, it would
be nice to come up with variants such as 1) discard inflight request
v.s. 2) waiting for inflight processing to be done,

Or inflight descriptor reporting (which seems to be tricky). But we
can start from net that a discarding may just work.


and 3) timeout in
waiting.

Actually, that's the plan and Eugenio is proposing something like this
via virtio spec:

https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html

Thanks for the pointer, I seem to recall I saw it some time back 
though I wonder if there's follow-up for the v3? My impression was 
that this is still a work-in-progress spec proposal, while the 
semantics of various F_STOP scenario is unclear yet and not all of the 
requirements (ex: STOP_FAILED, rewind & !IN_ORDER) for live migration 
do seem to get accommodated?



My understanding is that the reason for STOP_FAILED and IN_ORDER is
that we don't have a way to report inflight descriptors. We will try
to overcome those by allowing the device to report inflight descriptors in
the next version.








2) userspace will call SET_VRING_ENABLE() when the device supports
otherwise it will use RESET.

Are you looking to making virtqueue resume-able through the new
SET_VRING_ENABLE() uAPI?

I think RESET is inevitable in some case, i.e. when guest initiates
device reset by writing 0 to the status register.

Yes, that's all my plan.


For suspend/resume and
live migration use cases, indeed RESET can be substituted with
SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
accommodate this change. Although I'm all for it, it'd be the best to
lay out the plan for multiple phases rather than overload this single
patch too much. You can count my time on this endeavor if you don't 
mind. :)

You're welcome, I agree we should choose a way to go first:

1) manage to use SET_VRING_ENABLE (more like a workaround anyway)
For networking device and the vq suspend/resume and live migration use 
cases to support, I thought it might suffice?



Without a config space change it would be sufficient. And anyhow the vDPA
parent can prevent the config change if all the virtqueues are disabled.




We may drop inflight or unused ones for Ethernet...



Yes.


What other part do you think may limit its extension to become a 
general uAPI or add new uAPI to address similar VQ stop requirement if 
need be? 



For networking, we don't need other.


Or we might well define subsystem specific uAPI to stop the virtqueue, 
for vdpa device specifically?



Anyhow we need a uAPI, considering we have some parents that don't support
virtqueue stop. So this could be another way to go.


But if we decide to bother with a new uAPI, I would rather go with a new
uAPI for stopping the device. It can help with the config space change as well.



I think the point here is given that we would like to avoid guest side 
modification to support live migration, we can define specific uAPI 
for specific live migration requirement without having to involve 
guest driver change. It'd be easy to get started this way and 
generalize them all to a full blown _S_STOP when things are eventually 
settled.



Yes, note that the status seen b

Re: [PATCH v4] vdpa: reset the backend device in the end of vhost_net_stop()

2022-04-07 Thread Jason Wang



On 2022/4/2 at 11:53 AM, Michael Qiu wrote:



On 2022/4/2 10:20, Jason Wang wrote:

Adding Michael.

On Sat, Apr 2, 2022 at 7:08 AM Si-Wei Liu  wrote:




On 3/31/2022 7:53 PM, Jason Wang wrote:
On Fri, Apr 1, 2022 at 9:31 AM Michael Qiu  
wrote:

Currently, when a VM powers off, it triggers a reset of the vdpa
device (such as an mlx bluefield2 VF) many times (with 1 datapath
queue pair and one control queue, it is triggered 3 times). This
leads to the issue below:

vhost VQ 2 ring restore failed: -22: Invalid argument (22)

This is because in vhost_net_stop(), it will stop all vhost devices bound to
this virtio device, and in vhost_dev_stop(), qemu tries to stop the device,
then stop the queue: vhost_virtqueue_stop().

In vhost_dev_stop(), it resets the device, which clears some flags
in the low-level driver, and in the next loop (stopping other vhost backends),
qemu tries to stop the queue corresponding to the vhost backend;
the driver finds that the VQ is invalid, and this is the root cause.

To solve the issue, vdpa should set vring unready, and
remove reset ops in device stop: vhost_dev_start(hdev, false).

and implement a new function vhost_dev_reset, only reset backend
device after all vhost(per-queue) stoped.

Typo.


Signed-off-by: Michael Qiu
Acked-by: Jason Wang 

Rethink this patch, consider there're devices that don't support
set_vq_ready(). I wonder if we need

1) uAPI to tell the user space whether or not it supports 
set_vq_ready()

I guess what's more relevant here is to define the uAPI semantics for
unready i.e. set_vq_ready(0) for resuming/stopping virtqueue 
processing,

as starting vq is comparatively less ambiguous.


Yes.


Considering the
likelihood that this interface may be used for live migration, it would
be nice to come up with variants such as 1) discard inflight request
v.s. 2) waiting for inflight processing to be done,


Or inflight descriptor reporting (which seems to be tricky). But we
can start from net that a discarding may just work.


and 3) timeout in
waiting.


Actually, that's the plan and Eugenio is proposing something like this
via virtio spec:

https://lists.oasis-open.org/archives/virtio-dev/202111/msg00020.html




2) userspace will call SET_VRING_ENABLE() when the device supports
otherwise it will use RESET.

Are you looking to making virtqueue resume-able through the new
SET_VRING_ENABLE() uAPI?

I think RESET is inevitable in some case, i.e. when guest initiates
device reset by writing 0 to the status register.


Yes, that's all my plan.


For suspend/resume and
live migration use cases, indeed RESET can be substituted with
SET_VRING_ENABLE. Again, it'd need quite some code refactoring to
accommodate this change. Although I'm all for it, it'd be the best to
lay out the plan for multiple phases rather than overload this single
patch too much. You can count my time on this endeavor if you don't 
mind. :)


You're welcome, I agree we should choose a way to go first:

1) manage to use SET_VRING_ENABLE (more like a workaround anyway)
2) go with virtio-spec (may take a while)
3) don't wait for the spec, have a vDPA specific uAPI first. Note that
I've chatted with most of the vendors and they seem to be fine with
the _S_STOP. If we go this way, we can still provide the forward
compatibility of _S_STOP
4) or do them all (in parallel)

Any thoughts?



virtio-spec should be long-term, not only because the spec goes very 
slowly, but also the hardware upgrade should be a problem.


For short-term, better to take the first one?



Considering we need a new uAPI anyhow, I prefer 2), but you can try 1)
and see what people think.


Thanks




Thanks,
Michael

Thanks





And for safety, I suggest tagging this as 7.1.

+1

Regards,
-Siwei




---
v4 --> v3
  Nothing changed. Because of an issue with mimecast,
  when the From: tag is different from the sender,
  some mail clients will take the patch as an
  attachment. RESEND v3 does not work, so resend
  the patch as v4.

v3 --> v2:
  Call vhost_dev_reset() at the end of vhost_net_stop().

  Since the vDPA device need re-add the status bit
  VIRTIO_CONFIG_S_ACKNOWLEDGE and VIRTIO_CONFIG_S_DRIVER,
  simply, add them inside vhost_vdpa_reset_device, and
  the only way calling vhost_vdpa_reset_device is in
  vhost_net_stop(), so it keeps the same behavior as before.

v2 --> v1:
 Implement a new function vhost_dev_reset,
 reset the backend kernel device at last.
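
A minimal sketch of the ordering problem this changelog addresses, assuming a toy model in which several vhost backends share one device. The `demo_*` names are hypothetical and the model is not the QEMU code; it counts resets and failed ring restores for the two orderings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one backend device shared by several vhost devs. */
struct demo_device {
    bool vq_valid;        /* cleared by a device reset */
    int  resets;          /* how many times the device was reset */
    int  restore_errors;  /* "ring restore failed" occurrences */
};

/* Stop one vhost backend; optionally reset the shared device as part of it. */
static void demo_stop_backend(struct demo_device *d, bool reset_in_stop)
{
    if (!d->vq_valid) {
        d->restore_errors++;  /* an earlier reset already invalidated the VQ */
    }
    if (reset_in_stop) {
        d->resets++;
        d->vq_valid = false;
    }
}

/* Stop all backends; when not resetting per backend, reset once at the end. */
static void demo_net_stop(struct demo_device *d, int nbackends, bool reset_in_stop)
{
    for (int i = 0; i < nbackends; i++) {
        demo_stop_backend(d, reset_in_stop);
    }
    if (!reset_in_stop) {
        d->resets++;
        d->vq_valid = false;
    }
}
```

With three backends (one datapath queue pair split across two, plus a control queue, in the terms of the commit message), resetting inside each stop yields three resets and failures on the later backends, while a single reset after the loop yields one reset and no failures in this model.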
---
   hw/net/vhost_net.c    | 24 +---
   hw/virtio/vhost-vdpa.c    | 15 +--
   hw/virtio/vhost.c | 15 ++-
   include/hw/virtio/vhost.h |  1 +
   4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2..422c9bf 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -325,7 +325,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
   int total_notifiers = data_queue_pairs * 2 + cvq;
   VirtIONet *n = VIRTIO_NET(dev);
   

[PATCH] hw/arm/smmuv3: Pass the real perm to returned IOMMUTLBEntry in smmuv3_translate()

2022-04-07 Thread chenxiang via
From: Xiang Chen 

In function memory_region_iommu_replay(), it decides whether to notify() or not
according to the perm of the returned IOMMUTLBEntry. But for smmuv3, the
returned perm is always IOMMU_NONE even if the translation succeeds.
Pass the real perm to the returned IOMMUTLBEntry to avoid the issue.

Signed-off-by: Xiang Chen 
---
 hw/arm/smmuv3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 674623aabe..707eb430c2 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -760,7 +760,7 @@ epilogue:
 qemu_mutex_unlock(&s->mutex);
 switch (status) {
 case SMMU_TRANS_SUCCESS:
-entry.perm = flag;
+entry.perm = cached_entry->entry.perm;
 entry.translated_addr = cached_entry->entry.translated_addr +
 (addr & cached_entry->entry.addr_mask);
 entry.addr_mask = cached_entry->entry.addr_mask;
-- 
2.33.0
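
A minimal sketch of the behaviour described in the commit message, assuming that the replay path probes each address with a "no access" flag and notifies only when the returned permission is not "none". The `demo_*` names are hypothetical and do not mirror the QEMU types.

```c
#include <assert.h>

/* Hypothetical permission values, loosely modelled on an IOMMU access flag. */
enum demo_perm {
    DEMO_PERM_NONE = 0,
    DEMO_PERM_RO   = 1,
    DEMO_PERM_WO   = 2,
    DEMO_PERM_RW   = 3
};

struct demo_entry {
    enum demo_perm perm;
};

/*
 * Successful translation. With use_cached_perm == 0 the entry echoes the
 * caller's flag (the old behaviour); otherwise it reports the permission
 * stored in the cached mapping (the behaviour after the patch).
 */
static struct demo_entry demo_translate(enum demo_perm flag,
                                        enum demo_perm cached_perm,
                                        int use_cached_perm)
{
    struct demo_entry e;
    e.perm = use_cached_perm ? cached_perm : flag;
    return e;
}

/* Replay probes with DEMO_PERM_NONE and notifies only for usable mappings. */
static int demo_replay_notifies(enum demo_perm cached_perm, int use_cached_perm)
{
    struct demo_entry e =
        demo_translate(DEMO_PERM_NONE, cached_perm, use_cached_perm);
    return e.perm != DEMO_PERM_NONE;
}
```

Under that assumption, echoing the probe flag means a valid read-write mapping is never reported during replay, while returning the cached permission lets the notifier fire.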




[PATCH v2] display/qxl-render: fix race condition in qxl_cursor (CVE-2021-4207)

2022-04-07 Thread Mauro Matteo Cascella
Avoid fetching 'width' and 'height' a second time to prevent a possible
race condition. Refer to security advisory
https://starlabs.sg/advisories/22-4207/ for more information.

Fixes: CVE-2021-4207
Signed-off-by: Mauro Matteo Cascella 
---
v2:
- fix CVE id (CVE-2021-4207 instead of CVE-2022-4207)

 hw/display/qxl-render.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
index d28849b121..237ed293ba 100644
--- a/hw/display/qxl-render.c
+++ b/hw/display/qxl-render.c
@@ -266,7 +266,7 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl, QXLCursor *cursor,
 }
 break;
 case SPICE_CURSOR_TYPE_ALPHA:
-size = sizeof(uint32_t) * cursor->header.width * cursor->header.height;
+size = sizeof(uint32_t) * c->width * c->height;
 qxl_unpack_chunks(c->data, size, qxl, &cursor->chunk, group_id);
 if (qxl->debug > 2) {
 cursor_print_ascii_art(c, "qxl/alpha");
-- 
2.35.1
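
A minimal sketch of the double-fetch pattern this patch avoids, assuming a toy header that lives in memory the guest can rewrite at any time. The `demo_*` names are hypothetical; the sketch shows taking one snapshot of the dimensions and deriving every later size from the snapshot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header in guest-writable memory: values may change between two reads. */
struct demo_shared_header {
    volatile uint16_t width;
    volatile uint16_t height;
};

/* Host-side copy, taken once at allocation time. */
struct demo_cursor {
    uint16_t width;
    uint16_t height;
};

/* Single fetch: read each shared field exactly once. */
static struct demo_cursor demo_snapshot(const struct demo_shared_header *h)
{
    struct demo_cursor c;
    c.width = h->width;
    c.height = h->height;
    return c;
}

/* Size derived from the snapshot, never from the shared header again. */
static size_t demo_data_size(const struct demo_cursor *c)
{
    return sizeof(uint32_t) * c->width * c->height;
}
```

Because the buffer was sized from the snapshot, computing the copy length from the same snapshot keeps both in agreement even if the shared header is modified in between.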




[PATCH v3] ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

2022-04-07 Thread Mauro Matteo Cascella
Prevent potential integer overflow by limiting 'width' and 'height' to
512x512. Also change 'datasize' type to size_t. Refer to security
advisory https://starlabs.sg/advisories/22-4206/ for more information.

Fixes: CVE-2021-4206
Signed-off-by: Mauro Matteo Cascella 
---
v3:
- fix CVE id (CVE-2021-4206 instead of CVE-2022-4206)

 hw/display/qxl-render.c | 7 +++
 hw/display/vmware_vga.c | 2 ++
 ui/cursor.c | 8 +++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
index d28849b121..dc3c4edd05 100644
--- a/hw/display/qxl-render.c
+++ b/hw/display/qxl-render.c
@@ -247,6 +247,13 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl, QXLCursor *cursor,
 size_t size;
 
 c = cursor_alloc(cursor->header.width, cursor->header.height);
+
+if (!c) {
+qxl_set_guest_bug(qxl, "%s: cursor %ux%u alloc error", __func__,
+cursor->header.width, cursor->header.height);
+goto fail;
+}
+
 c->hot_x = cursor->header.hot_spot_x;
 c->hot_y = cursor->header.hot_spot_y;
 switch (cursor->header.type) {
diff --git a/hw/display/vmware_vga.c b/hw/display/vmware_vga.c
index 98c83474ad..45d06cbe25 100644
--- a/hw/display/vmware_vga.c
+++ b/hw/display/vmware_vga.c
@@ -515,6 +515,8 @@ static inline void vmsvga_cursor_define(struct vmsvga_state_s *s,
 int i, pixels;
 
 qc = cursor_alloc(c->width, c->height);
+assert(qc != NULL);
+
 qc->hot_x = c->hot_x;
 qc->hot_y = c->hot_y;
 switch (c->bpp) {
diff --git a/ui/cursor.c b/ui/cursor.c
index 1d62ddd4d0..835f0802f9 100644
--- a/ui/cursor.c
+++ b/ui/cursor.c
@@ -46,6 +46,8 @@ static QEMUCursor *cursor_parse_xpm(const char *xpm[])
 
 /* parse pixel data */
 c = cursor_alloc(width, height);
+assert(c != NULL);
+
 for (pixel = 0, y = 0; y < height; y++, line++) {
 for (x = 0; x < height; x++, pixel++) {
 idx = xpm[line][x];
@@ -91,7 +93,11 @@ QEMUCursor *cursor_builtin_left_ptr(void)
 QEMUCursor *cursor_alloc(int width, int height)
 {
 QEMUCursor *c;
-int datasize = width * height * sizeof(uint32_t);
+size_t datasize = width * height * sizeof(uint32_t);
+
+if (width > 512 || height > 512) {
+return NULL;
+}
 
 c = g_malloc0(sizeof(QEMUCursor) + datasize);
 c->width  = width;
-- 
2.35.1
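
As an illustration of the idea described in the commit message (bound the dimensions, then compute the size in `size_t`), here is a standalone sketch. It is a variation rather than the patch itself: it validates before multiplying, also rejects non-positive dimensions, and uses `calloc()` instead of `g_malloc0()`. `struct cursor`, `cursor_alloc_checked` and `CURSOR_MAX_DIM` are hypothetical names; only the 512 limit comes from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CURSOR_MAX_DIM 512   /* limit taken from the patch description */

/* Hypothetical stand-in for QEMUCursor, reduced to what the sketch needs. */
struct cursor {
    int width;
    int height;
    uint32_t data[];         /* width * height pixels */
};

static struct cursor *cursor_alloc_checked(int width, int height)
{
    /* Validate first so the multiplication below cannot overflow. */
    if (width <= 0 || height <= 0 ||
        width > CURSOR_MAX_DIM || height > CURSOR_MAX_DIM) {
        return NULL;
    }

    /* Widen to size_t before multiplying; at most 512 * 512 * 4 bytes. */
    size_t datasize = (size_t)width * (size_t)height * sizeof(uint32_t);
    assert(datasize <= (size_t)CURSOR_MAX_DIM * CURSOR_MAX_DIM *
                       sizeof(uint32_t));

    struct cursor *c = calloc(1, sizeof(*c) + datasize);
    if (c) {
        c->width = width;
        c->height = height;
    }
    return c;
}
```

Callers have to handle a `NULL` return, which is what the `qxl_set_guest_bug()` path and the added `assert()` calls in the patch do for the real `cursor_alloc()`.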




Re: [RFC PATCH] tests/qtest: properly initialise the vring used idx

2022-04-07 Thread Alex Bennée


Peter Maydell  writes:

> On Wed, 6 Apr 2022 at 21:07, Alex Bennée  wrote:
>>
>>
>> Peter Maydell  writes:
>> > Guest memory is generally zero at startup. Do we manage to
>> > hit the bit of memory at the start of the virt machine's RAM
>> > where we store the DTB ? (As you say, initializing the data
>> > structures is the right thing anyway.)
>>
>> I don't know - where is the DTB loaded?
>
> Start of RAM (that's physaddr 0x4000_). The thing I'm not sure
> about is whether these qtests go through the code in hw/arm/boot.c
> that loads the DTB into guest RAM or not.

Yes because it's linked to the machine creation:

Thread 1 hit Breakpoint 1, arm_load_dtb (addr=1073741824, 
binfo=binfo@entry=0x55bc4ce26970, addr_limit=0, as=as@entry=0x55bc4d119c50, 
ms=ms@entry=0x55bc4ce26800) at ../../hw/arm/boot.c:534
534 {
(rr) bt
#0  arm_load_dtb (addr=1073741824, binfo=binfo@entry=0x55bc4ce26970, 
addr_limit=0, as=as@entry=0x55bc4d119c50, ms=ms@entry=0x55bc4ce26800) at 
../../hw/arm/boot.c:534
#1  0x55bc4a9f7ded in virt_machine_done (notifier=0x55bc4ce26910, 
data=) at ../../hw/arm/virt.c:1637
#2  0x55bc4aebc807 in notifier_list_notify (list=list@entry=0x55bc4b5f8b20 
, data=data@entry=0x0) at ../../util/notify.c:39
#3  0x55bc4a7f82db in qdev_machine_creation_done () at 
../../hw/core/machine.c:1235
#4  0x55bc4a744b19 in qemu_machine_creation_done () at 
../../softmmu/vl.c:2725
#5  qmp_x_exit_preconfig (errp=) at ../../softmmu/vl.c:2748
#6  0x55bc4a748a14 in qmp_x_exit_preconfig (errp=) at 
../../softmmu/vl.c:2741
#7  qemu_init (argc=, argv=, envp=) at ../../softmmu/vl.c:3776
#8  0x55bc4a6de639 in main (argc=, argv=, 
envp=) at ../../softmmu/main.c:49

(ION: yay, I can capture qtest runs in rr now ;-)

>
>> Currently we are using the first
>> couple of pages in qtest because that's where the qtest allocator is
>> initialised:
>>
>>   static void *qos_create_machine_arm_virt(QTestState *qts)
>>   {
>>   QVirtMachine *machine = g_new0(QVirtMachine, 1);
>>
>>   alloc_init(&machine->alloc, 0,
>>  ARM_VIRT_RAM_ADDR,
>>  ARM_VIRT_RAM_ADDR + ARM_VIRT_RAM_SIZE,
>>  ARM_PAGE_SIZE);
>>   qvirtio_mmio_init_device(&machine->virtio_mmio, qts, 
>> VIRTIO_MMIO_BASE_ADDR,
>>VIRTIO_MMIO_SIZE);
>>
>>   qos_create_generic_pcihost(&machine->bridge, qts, &machine->alloc);
>>
>>   machine->obj.get_device = virt_get_device;
>>   machine->obj.get_driver = virt_get_driver;
>>   machine->obj.destructor = virt_destructor;
>>   return machine;
>>   }
>>
>> I don't know if there is a more sane piece of memory we should be using.
>
> The first part of RAM is fine, it's just you can't assume it's
> all zeroes :-)
>
> -- PMM


-- 
Alex Bennée



Re: [PATCH v3 2/7] block/copy-before-write: add on-cbw-error open parameter

2022-04-07 Thread Hanna Reitz

On 06.04.22 20:07, Vladimir Sementsov-Ogievskiy wrote:

Currently, behavior on copy-before-write operation failure is simple:
report error to the guest.

Let's implement alternative behavior: break the whole copy-before-write
process (and corresponding backup job or NBD client) but keep guest
working. It's needed if we consider guest stability as more important.

The realisation is simple: on copy-before-write failure we set
s->snapshot_ret and continue guest operations. Once s->snapshot_ret is
set, all further snapshot API requests fail. Note that all
in-flight snapshot-API requests may still succeed: we do wait for them
on the BREAK_SNAPSHOT failure path in cbw_do_copy_before_write().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qapi/block-core.json  | 25 -
  block/copy-before-write.c | 32 ++--
  2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index beeb91952a..085f1666af 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json


[...]


@@ -4184,11 +4203,15 @@
  #  modifications (or removing) of specified bitmap doesn't
  #  influence the filter. (Since 7.0)
  #
+# @on-cbw-error: Behavior on failure of copy-before-write operation.
+#Default is @break-guest-write. (Since 7.0)


With *7.1:

Reviewed-by: Hanna Reitz 


+#
  # Since: 6.2
  ##
  { 'struct': 'BlockdevOptionsCbw',
'base': 'BlockdevOptionsGenericFormat',
-  'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap' } }
+  'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap',
+'*on-cbw-error': 'OnCbwError' } }
  
  ##

  # @BlockdevOptions:





Re: [RFC v2 1/8] blkio: add io_uring block driver using libblkio

2022-04-07 Thread Kevin Wolf
On 07.04.2022 at 09:22, Stefan Hajnoczi wrote:
> On Wed, Apr 06, 2022 at 07:32:04PM +0200, Kevin Wolf wrote:
> > On 05.04.2022 at 17:33, Stefan Hajnoczi wrote:
> > > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > > high-performance disk I/O. It currently supports io_uring with
> > > additional drivers planned.
> > > 
> > > One of the reasons for developing libblkio is that other applications
> > > besides QEMU can use it. This will be particularly useful for
> > > vhost-user-blk which applications may wish to use for connecting to
> > > qemu-storage-daemon.
> > > 
> > > libblkio also gives us an opportunity to develop in Rust behind a C API
> > > that is easy to consume from QEMU.
> > > 
> > > This commit adds an io_uring BlockDriver to QEMU using libblkio. For now
> > > I/O buffers are copied through bounce buffers if the libblkio driver
> > > requires it. Later commits add an optimization for pre-registering guest
> > > RAM to avoid bounce buffers. It will be easy to add other libblkio
> > > drivers since they will share the majority of code.
> > > 
> > > Signed-off-by: Stefan Hajnoczi 
> > 
> > > +static BlockDriver bdrv_io_uring = {
> > > +.format_name= "io_uring",
> > > +.protocol_name  = "io_uring",
> > > +.instance_size  = sizeof(BDRVBlkioState),
> > > +.bdrv_needs_filename= true,
> > > +.bdrv_parse_filename= blkio_parse_filename_io_uring,
> > > +.bdrv_file_open = blkio_file_open,
> > > +.bdrv_close = blkio_close,
> > > +.bdrv_getlength = blkio_getlength,
> > > +.has_variable_length= true,
> > 
> > This one is a bad idea. It means that every request will call
> > blkio_getlength() first, which looks up the "capacity" property in
> > libblkio and then calls lseek() for the io_uring backend.
> 
> Thanks for pointing this out. I didn't think this through. More below on
> what I was trying to do.
> 
> > For other backends like the vhost_user one (where I just copied your
> > definition and then noticed this behaviour), it involves a message over
> > the vhost socket, which is even worse.
> 
> (A vhost-user-blk driver could cache the capacity field and update it
> when a Configuration Change Notification is received. There is no need
> to send a vhost-user protocol message every time.)

In theory we could cache in libblkio, but then we would need a mechanism
to invalidate the cache so we can support resizing an image (similar to
what block_resize does in QEMU, except that it wouldn't set the new
size from a parameter, but just get the new value from the backend).

I think it's simpler to leave caching to the application - and QEMU
already does this automatically if we don't set .has_variable_length =
true.

Kevin




Re: [RFC v2 1/8] blkio: add io_uring block driver using libblkio

2022-04-07 Thread Kevin Wolf
On 07.04.2022 at 10:25, Kevin Wolf wrote:
> On 07.04.2022 at 09:22, Stefan Hajnoczi wrote:
> > On Wed, Apr 06, 2022 at 07:32:04PM +0200, Kevin Wolf wrote:
> > > On 05.04.2022 at 17:33, Stefan Hajnoczi wrote:
> > > > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > > > high-performance disk I/O. It currently supports io_uring with
> > > > additional drivers planned.
> > > > 
> > > > One of the reasons for developing libblkio is that other applications
> > > > besides QEMU can use it. This will be particularly useful for
> > > > vhost-user-blk which applications may wish to use for connecting to
> > > > qemu-storage-daemon.
> > > > 
> > > > libblkio also gives us an opportunity to develop in Rust behind a C API
> > > > that is easy to consume from QEMU.
> > > > 
> > > > This commit adds an io_uring BlockDriver to QEMU using libblkio. For now
> > > > I/O buffers are copied through bounce buffers if the libblkio driver
> > > > requires it. Later commits add an optimization for pre-registering guest
> > > > RAM to avoid bounce buffers. It will be easy to add other libblkio
> > > > drivers since they will share the majority of code.
> > > > 
> > > > Signed-off-by: Stefan Hajnoczi 
> > > 
> > > > +static BlockDriver bdrv_io_uring = {
> > > > +.format_name= "io_uring",
> > > > +.protocol_name  = "io_uring",
> > > > +.instance_size  = sizeof(BDRVBlkioState),
> > > > +.bdrv_needs_filename= true,
> > > > +.bdrv_parse_filename= blkio_parse_filename_io_uring,
> > > > +.bdrv_file_open = blkio_file_open,
> > > > +.bdrv_close = blkio_close,
> > > > +.bdrv_getlength = blkio_getlength,
> > > > +.has_variable_length= true,
> > > 
> > > This one is a bad idea. It means that every request will call
> > > blkio_getlength() first, which looks up the "capacity" property in
> > > libblkio and then calls lseek() for the io_uring backend.
> > 
> > Thanks for pointing this out. I didn't think this through. More below on
> > what I was trying to do.
> > 
> > > For other backends like the vhost_user one (where I just copied your
> > > definition and then noticed this behaviour), it involves a message over
> > > the vhost socket, which is even worse.
> > 
> > (A vhost-user-blk driver could cache the capacity field and update it
> > when a Configuration Change Notification is received. There is no need
> > to send a vhost-user protocol message every time.)
> 
> In theory we could cache in libblkio, but then we would need a mechanism
> to invalidate the cache so we can support resizing an image (similar to
> what block_resize does in QEMU, except that it wouldn't set the new
> size from a parameter, but just get the new value from the backend).

Oh, sorry, I misread. VHOST_USER_SLAVE_CONFIG_CHANGE_MSG is probably
what you mean, so that the backend triggers the update. It exists in the
spec, but neither libvhost-user nor rust-vmm seem to support it
currently. We also don't set up the backchannel yet where this message
could even be passed.

So it's an option, but probably only for later because it involves
extending several places.

> I think it's simpler to leave caching to the application - and QEMU
> already does this automatically if we don't set .has_variable_length =
> true.

I still think the application shouldn't query the capacity more often
than necessary, so optimising it is probably not very important.

Kevin




Re: [PATCH v3 4/5] tests/qtest/vhost-user-blk-test: Temporary hack to get tests passing on aarch64

2022-04-07 Thread Eric Auger
Hi Alex,

On 4/6/22 7:34 PM, Alex Bennée wrote:
> Eric Auger  writes:
>
>> When run on ARM, basic and indirect tests currently fail with the
>> following error:
>>
>> ERROR:../tests/qtest/libqos/virtio.c:224:qvirtio_wait_used_elem:
>> assertion failed (got_desc_idx == desc_idx): (50331648 == 0)
>> Bail out! ERROR:../tests/qtest/libqos/virtio.c:224: qvirtio_wait_used_elem:
>> assertion failed (got_desc_idx == desc_idx): (50331648 == 0)
>>
>> I noticed it worked when I set up MSI and I further reduced the
>> code to a simple guest_alloc() that removes the error. At the moment
>> I am not able to identify where this issue is and this blocks the
>> whole pci/aarch64 enablement.
>>
>> Signed-off-by: Eric Auger 
> Hi Eric,
>
> I sent a RFC patch which I think avoids the need for this hack:
>
>   Subject: [RFC PATCH] tests/qtest: properly initialise the vring used idx
>   Date: Wed,  6 Apr 2022 18:33:56 +0100
>   Message-Id: <20220406173356.1891500-1-alex.ben...@linaro.org>
>
Indeed this fixes my issue! Many thanks for the debug & fix.

I will respin with your R-b's.

Eric




Re: [RFC PATCH] tests/qtest: properly initialise the vring used idx

2022-04-07 Thread Eric Auger
Hi Alex,

On 4/6/22 7:33 PM, Alex Bennée wrote:
> Eric noticed while attempting to enable the vhost-user-blk-test for
> Aarch64 that things didn't work unless he put in a dummy
> guest_malloc() at the start of the test. Without it
> qvirtio_wait_used_elem() would assert when it reads a junk value for
> idx resulting in:
>
>   qvirtqueue_get_buf: idx:2401 last_idx:0
>   qvirtqueue_get_buf: 0x7ffcb6d3fe74, (nil)
>   qvirtio_wait_used_elem: 300/0
>   ERROR:../../tests/qtest/libqos/virtio.c:226:qvirtio_wait_used_elem: 
> assertion failed (got_desc_idx == desc_idx): (50331648 == 0)
>   Bail out! 
> ERROR:../../tests/qtest/libqos/virtio.c:226:qvirtio_wait_used_elem: assertion 
> failed (got_desc_idx == desc_idx): (50331648 == 0)
>
> What was actually happening is the guest_malloc() effectively pushed
> the allocation of the vring into the next page which just happened to
> have clear memory. After much tedious tracing of the code I could see
Many thanks for the tedious investigation!
> that qvring_init() does attempt to initialise a bunch of the vring
> structures but skips the vring->used.idx value. It is probably not
> wise to assume guest memory is zeroed anyway. Once the ring is
> properly initialised the hack is no longer needed to get things
> working.
>
> Thanks-to: John Snow  for helping debug
> Cc: Eric Auger 
> Cc: Stefan Hajnoczi 
> Cc: Michael S. Tsirkin 
> Cc: Raphael Norwitz 
> Signed-off-by: Alex Bennée 
> ---
>  tests/qtest/libqos/virtio.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tests/qtest/libqos/virtio.c b/tests/qtest/libqos/virtio.c
> index 6fe7bf9555..fba9186659 100644
> --- a/tests/qtest/libqos/virtio.c
> +++ b/tests/qtest/libqos/virtio.c
> @@ -260,6 +260,8 @@ void qvring_init(QTestState *qts, const QGuestAllocator 
> *alloc, QVirtQueue *vq,
>  
>  /* vq->used->flags */
>  qvirtio_writew(vq->vdev, qts, vq->used, 0);
> +/* vq->used->idx */
> +qvirtio_writew(vq->vdev, qts, vq->used + 2, 0);
>  /* vq->used->avail_event */
>  qvirtio_writew(vq->vdev, qts, vq->used + 2 +
> sizeof(struct vring_used_elem) * vq->size, 0);
Reviewed-by: Eric Auger 
Tested-by: Eric Auger 

Eric
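
For readers wondering why the new write targets `vq->used + 2`: in the split-ring layout of the virtio specification the used ring starts with a 16-bit `flags` field followed by a 16-bit `idx` field. The sketch below only illustrates that layout and the effect of stale memory. It is not libqos code: `used_ring`, `used_ring_init` and `used_ring_idx` are hypothetical names, and a local byte buffer stands in for guest RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Split-ring "used" layout as described by the virtio specification. */
struct used_elem {
    uint32_t id;
    uint32_t len;
};

struct used_ring {
    uint16_t flags;
    uint16_t idx;
    struct used_elem ring[];
};

/* idx sits right after the 16-bit flags field, hence "used + 2". */
static_assert(offsetof(struct used_ring, idx) == 2, "unexpected layout");

/* Zero the header fields explicitly instead of trusting memory contents. */
static void used_ring_init(uint8_t *mem)
{
    const uint16_t zero = 0;

    memcpy(mem + offsetof(struct used_ring, flags), &zero, sizeof(zero));
    memcpy(mem + offsetof(struct used_ring, idx), &zero, sizeof(zero));
}

/* Read idx back the way a driver polling the ring would. */
static uint16_t used_ring_idx(const uint8_t *mem)
{
    uint16_t v;

    memcpy(&v, mem + offsetof(struct used_ring, idx), sizeof(v));
    return v;
}
```

With a buffer that holds leftover data, `used_ring_idx()` returns junk until `used_ring_init()` has run, which matches the symptom reported in this thread.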




[PATCH 1/1] qemu-img: properly list formats which have consistency check implemented

2022-04-07 Thread Denis V. Lunev
Simple grep for the .bdrv_co_check callback presence gives the following
list of block drivers
* QED
* VDI
* VHDX
* VMDK
* Parallels
which have this callback. The presence of the callback means that
the consistency check is supported.

The patch updates documentation accordingly.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Hanna Reitz 
---
 docs/tools/qemu-img.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 8885ea11cf..85a6e05b35 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -332,8 +332,8 @@ by the used format or see the format descriptions below for details.
   ``-r all`` fixes all kinds of errors, with a higher risk of choosing the
   wrong fix or hiding corruption that has already occurred.
 
-  Only the formats ``qcow2``, ``qed`` and ``vdi`` support
-  consistency checks.
+  Only the formats ``qcow2``, ``qed``, ``parallels``, ``vhdx``, ``vmdk`` and
+  ``vdi`` support consistency checks.
 
   In case the image does not have any inconsistencies, check exits with ``0``.
   Other exit codes indicate the kind of inconsistency found or if another error
-- 
2.32.0




Re: [PATCH v2] display/qxl-render: fix race condition in qxl_cursor (CVE-2021-4207)

2022-04-07 Thread Marc-André Lureau
On Thu, Apr 7, 2022 at 12:11 PM Mauro Matteo Cascella 
wrote:

> Avoid fetching 'width' and 'height' a second time to prevent possible
> race condition. Refer to security advisory
> https://starlabs.sg/advisories/22-4207/ for more information.
>
> Fixes: CVE-2021-4207
> Signed-off-by: Mauro Matteo Cascella 
>

Reviewed-by: Marc-André Lureau 


> ---
> v2:
> - fix CVE id (CVE-2021-4207 instead of CVE-2022-4207)
>
>  hw/display/qxl-render.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
> index d28849b121..237ed293ba 100644
> --- a/hw/display/qxl-render.c
> +++ b/hw/display/qxl-render.c
> @@ -266,7 +266,7 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl,
> QXLCursor *cursor,
>  }
>  break;
>  case SPICE_CURSOR_TYPE_ALPHA:
> -size = sizeof(uint32_t) * cursor->header.width *
> cursor->header.height;
> +size = sizeof(uint32_t) * c->width * c->height;
>  qxl_unpack_chunks(c->data, size, qxl, &cursor->chunk, group_id);
>  if (qxl->debug > 2) {
>  cursor_print_ascii_art(c, "qxl/alpha");
> --
> 2.35.1
>
>
>

-- 
Marc-André Lureau


Re: [PATCH v3] ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

2022-04-07 Thread Marc-André Lureau
On Thu, Apr 7, 2022 at 12:23 PM Mauro Matteo Cascella 
wrote:

> Prevent potential integer overflow by limiting 'width' and 'height' to
> 512x512. Also change 'datasize' type to size_t. Refer to security
> advisory https://starlabs.sg/advisories/22-4206/ for more information.
>
> Fixes: CVE-2021-4206
>

(the Starlabs advisory has 2022, I guess it's wrong then)

> Signed-off-by: Mauro Matteo Cascella 
>

Reviewed-by: Marc-André Lureau 



> ---
> v3:
> - fix CVE id (CVE-2021-4206 instead of CVE-2022-4206)
>
>  hw/display/qxl-render.c | 7 +++
>  hw/display/vmware_vga.c | 2 ++
>  ui/cursor.c | 8 +++-
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
> index d28849b121..dc3c4edd05 100644
> --- a/hw/display/qxl-render.c
> +++ b/hw/display/qxl-render.c
> @@ -247,6 +247,13 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl,
> QXLCursor *cursor,
>  size_t size;
>
>  c = cursor_alloc(cursor->header.width, cursor->header.height);
> +
> +if (!c) {
> +qxl_set_guest_bug(qxl, "%s: cursor %ux%u alloc error", __func__,
> +cursor->header.width, cursor->header.height);
> +goto fail;
> +}
> +
>  c->hot_x = cursor->header.hot_spot_x;
>  c->hot_y = cursor->header.hot_spot_y;
>  switch (cursor->header.type) {
> diff --git a/hw/display/vmware_vga.c b/hw/display/vmware_vga.c
> index 98c83474ad..45d06cbe25 100644
> --- a/hw/display/vmware_vga.c
> +++ b/hw/display/vmware_vga.c
> @@ -515,6 +515,8 @@ static inline void vmsvga_cursor_define(struct
> vmsvga_state_s *s,
>  int i, pixels;
>
>  qc = cursor_alloc(c->width, c->height);
> +assert(qc != NULL);
> +
>  qc->hot_x = c->hot_x;
>  qc->hot_y = c->hot_y;
>  switch (c->bpp) {
> diff --git a/ui/cursor.c b/ui/cursor.c
> index 1d62ddd4d0..835f0802f9 100644
> --- a/ui/cursor.c
> +++ b/ui/cursor.c
> @@ -46,6 +46,8 @@ static QEMUCursor *cursor_parse_xpm(const char *xpm[])
>
>  /* parse pixel data */
>  c = cursor_alloc(width, height);
> +assert(c != NULL);
> +
>  for (pixel = 0, y = 0; y < height; y++, line++) {
>  for (x = 0; x < height; x++, pixel++) {
>  idx = xpm[line][x];
> @@ -91,7 +93,11 @@ QEMUCursor *cursor_builtin_left_ptr(void)
>  QEMUCursor *cursor_alloc(int width, int height)
>  {
>  QEMUCursor *c;
> -int datasize = width * height * sizeof(uint32_t);
> +size_t datasize = width * height * sizeof(uint32_t);
> +
> +if (width > 512 || height > 512) {
> +return NULL;
> +}
>
>  c = g_malloc0(sizeof(QEMUCursor) + datasize);
>  c->width  = width;
> --
> 2.35.1
>
>
>

-- 
Marc-André Lureau


Re: [PATCH v3 7/7] iotests: copy-before-write: add cases for cbw-timeout option

2022-04-07 Thread Hanna Reitz

On 06.04.22 20:08, Vladimir Sementsov-Ogievskiy wrote:

Add two simple test-cases: timeout failure with
break-snapshot-on-cbw-error behavior and similar with
break-guest-write-on-cbw-error behavior.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/copy-before-write| 78 +++
  .../qemu-iotests/tests/copy-before-write.out  |  4 +-
  2 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/tests/copy-before-write 
b/tests/qemu-iotests/tests/copy-before-write
index a32608f597..5c90b8cd50 100755
--- a/tests/qemu-iotests/tests/copy-before-write
+++ b/tests/qemu-iotests/tests/copy-before-write
@@ -122,6 +122,84 @@ read 1048576/1048576 bytes at offset 0
  1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  """)
  
+def do_cbw_timeout(self, on_cbw_error):

+result = self.vm.qmp('object-add', {
+'qom-type': 'throttle-group',
+'id': 'group0',
+'limits': {'bps-write': 300 * 1024}


Hm, yes, I can’t find a way to make this work without your other 
series.  For some reason, not even -accel tcg helps; and using qtest to 
advance the virtual clock doesn’t really help because the qemu-io 
commands block while the request is throttled.


One thing that should work would be to run everything in a 
qemu-storage-daemon instance, and then having qemu-io access an NBD 
export...





Re: [PULL 0/3] virtio,pc: bugfixes

2022-04-07 Thread Peter Maydell
On Wed, 6 Apr 2022 at 22:11, Michael S. Tsirkin  wrote:
>
> The following changes since commit 128e050d41794e61e5849c6c507160da5556ea61:
>
>   hw/acpi/microvm: turn on 8042 bit in FADT boot architecture flags if 
> present (2022-03-07 17:43:14 -0500)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
>
> for you to fetch changes up to f556b9a0cd20d41493afd403fb7f016c8fb01eb3:
>
>   virtio-iommu: use-after-free fix (2022-04-06 17:11:03 -0400)
>
> 
> virtio,pc: bugfixes
>
> Several fixes all over the place
>
> Signed-off-by: Michael S. Tsirkin 

RC3 has gone. We're not taking any more changes unless they
are absolutely release-critical, which means that you need
to be clearly describing in the pull request cover letter
what the changes are and why they are release critical.

thanks
-- PMM



Re: [PATCH v3 4/7] util: add qemu-co-timeout

2022-04-07 Thread Hanna Reitz

On 06.04.22 20:07, Vladimir Sementsov-Ogievskiy wrote:

Add a new API to make a time-limited call of a coroutine.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/qemu/coroutine.h | 13 ++
  util/qemu-co-timeout.c   | 89 
  util/meson.build |  1 +
  3 files changed, 103 insertions(+)
  create mode 100644 util/qemu-co-timeout.c


Reviewed-by: Hanna Reitz 




Re: [PATCH v3 5/7] block/block-copy: block_copy(): add timeout_ns parameter

2022-04-07 Thread Hanna Reitz

On 06.04.22 20:07, Vladimir Sementsov-Ogievskiy wrote:

Add the possibility to limit a block_copy() call in time. To be used in the
next commit.

As a timed-out block_copy() call will continue in the background anyway (we
can't immediately cancel an IO operation), it's also important to give the
user the possibility to pass a callback that performs some additional
actions when the block-copy call finishes.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block-copy.h |  4 +++-
  block/block-copy.c | 33 ++---
  block/copy-before-write.c  |  2 +-
  3 files changed, 30 insertions(+), 9 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PATCH v1 1/4] hw/arm: versal: Create an APU CPU Cluster

2022-04-07 Thread Francisco Iglesias
On Wed, Apr 06, 2022 at 06:43:00PM +0100, Edgar E. Iglesias wrote:
> From: "Edgar E. Iglesias" 
> 
> Create an APU CPU Cluster. This is in preparation to add the RPU.
> 
> Signed-off-by: Edgar E. Iglesias 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/arm/xlnx-versal.c | 9 -
>  include/hw/arm/xlnx-versal.h | 2 ++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
> index 2551dfc22d..4415ee413f 100644
> --- a/hw/arm/xlnx-versal.c
> +++ b/hw/arm/xlnx-versal.c
> @@ -34,10 +34,15 @@ static void versal_create_apu_cpus(Versal *s)
>  {
>  int i;
>  
> +object_initialize_child(OBJECT(s), "apu-cluster", &s->fpd.apu.cluster,
> +TYPE_CPU_CLUSTER);
> +qdev_prop_set_uint32(DEVICE(&s->fpd.apu.cluster), "cluster-id", 0);
> +
>  for (i = 0; i < ARRAY_SIZE(s->fpd.apu.cpu); i++) {
>  Object *obj;
>  
> -object_initialize_child(OBJECT(s), "apu-cpu[*]", &s->fpd.apu.cpu[i],
> +object_initialize_child(OBJECT(&s->fpd.apu.cluster),
> +"apu-cpu[*]", &s->fpd.apu.cpu[i],
>  XLNX_VERSAL_ACPU_TYPE);
>  obj = OBJECT(&s->fpd.apu.cpu[i]);
>  if (i) {
> @@ -52,6 +57,8 @@ static void versal_create_apu_cpus(Versal *s)
>   &error_abort);
>  qdev_realize(DEVICE(obj), NULL, &error_fatal);
>  }
> +
> +qdev_realize(DEVICE(&s->fpd.apu.cluster), NULL, &error_fatal);
>  }
>  
>  static void versal_create_apu_gic(Versal *s, qemu_irq *pic)
> diff --git a/include/hw/arm/xlnx-versal.h b/include/hw/arm/xlnx-versal.h
> index 0728316ec7..d2d3028e18 100644
> --- a/include/hw/arm/xlnx-versal.h
> +++ b/include/hw/arm/xlnx-versal.h
> @@ -14,6 +14,7 @@
>  
>  #include "hw/sysbus.h"
>  #include "hw/arm/boot.h"
> +#include "hw/cpu/cluster.h"
>  #include "hw/or-irq.h"
>  #include "hw/sd/sdhci.h"
>  #include "hw/intc/arm_gicv3.h"
> @@ -49,6 +50,7 @@ struct Versal {
>  struct {
>  struct {
>  MemoryRegion mr;
> +CPUClusterState cluster;
>  ARMCPU cpu[XLNX_VERSAL_NR_ACPUS];
>  GICv3State gic;
>  } apu;
> -- 
> 2.25.1
> 



Re: [PATCH v3 6/7] block/copy-before-write: implement cbw-timeout option

2022-04-07 Thread Hanna Reitz

On 06.04.22 20:08, Vladimir Sementsov-Ogievskiy wrote:

In some scenarios, when a copy-before-write operation takes too
long, it's better to cancel it.

The new option is most useful together with
on-cbw-error=break-snapshot: this way, if a cbw operation takes too
long, we just cancel the backup process and do not disturb the guest too
much.

Note the tricky point of realization: we keep additional point in
bs->in_fligth during block_copy operation even if it's timed-out.


*flight


Background "cancelled" block_copy operations will finish at some point
and will want to access state. We should care to not free the state in
.bdrv_close() earlier.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qapi/block-core.json  |  8 +++-
  block/copy-before-write.c | 23 ++-
  2 files changed, 29 insertions(+), 2 deletions(-)


Reviewed-by: Hanna Reitz 




Re: [PULL 0/3] virtio,pc: bugfixes

2022-04-07 Thread Michael S. Tsirkin
On Thu, Apr 07, 2022 at 10:18:24AM +0100, Peter Maydell wrote:
> On Wed, 6 Apr 2022 at 22:11, Michael S. Tsirkin  wrote:
> >
> > The following changes since commit 128e050d41794e61e5849c6c507160da5556ea61:
> >
> >   hw/acpi/microvm: turn on 8042 bit in FADT boot architecture flags if 
> > present (2022-03-07 17:43:14 -0500)
> >
> > are available in the Git repository at:
> >
> >   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
> >
> > for you to fetch changes up to f556b9a0cd20d41493afd403fb7f016c8fb01eb3:
> >
> >   virtio-iommu: use-after-free fix (2022-04-06 17:11:03 -0400)
> >
> > 
> > virtio,pc: bugfixes
> >
> > Several fixes all over the place
> >
> > Signed-off-by: Michael S. Tsirkin 
> 
> RC3 has gone. We're not taking any more changes unless they
> are absolutely release-critical, which means that you need
> to be clearly describing in the pull request cover letter
> what the changes are and why they are release critical.
> 
> thanks
> -- PMM

Will do, thanks!

-- 
MST




Re: [PATCH v1 2/4] hw/arm: versal: Add the Cortex-R5Fs

2022-04-07 Thread Francisco Iglesias
On Wed, Apr 06, 2022 at 06:43:01PM +0100, Edgar E. Iglesias wrote:
> From: "Edgar E. Iglesias" 
> 
> Add the Cortex-R5Fs of the Versal RPU (Real-time Processing Unit)
> subsystem.
> 
> Signed-off-by: Edgar E. Iglesias 

Reviewed-by: Francisco Iglesias 


> ---
>  hw/arm/xlnx-versal-virt.c|  6 +++---
>  hw/arm/xlnx-versal.c | 36 
>  include/hw/arm/xlnx-versal.h | 10 ++
>  3 files changed, 49 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/arm/xlnx-versal-virt.c b/hw/arm/xlnx-versal-virt.c
> index 7c7baff8b7..66a2de7e13 100644
> --- a/hw/arm/xlnx-versal-virt.c
> +++ b/hw/arm/xlnx-versal-virt.c
> @@ -721,9 +721,9 @@ static void versal_virt_machine_class_init(ObjectClass 
> *oc, void *data)
>  
>  mc->desc = "Xilinx Versal Virtual development board";
>  mc->init = versal_virt_init;
> -mc->min_cpus = XLNX_VERSAL_NR_ACPUS;
> -mc->max_cpus = XLNX_VERSAL_NR_ACPUS;
> -mc->default_cpus = XLNX_VERSAL_NR_ACPUS;
> +mc->min_cpus = XLNX_VERSAL_NR_ACPUS + XLNX_VERSAL_NR_RCPUS;
> +mc->max_cpus = XLNX_VERSAL_NR_ACPUS + XLNX_VERSAL_NR_RCPUS;
> +mc->default_cpus = XLNX_VERSAL_NR_ACPUS + XLNX_VERSAL_NR_RCPUS;
>  mc->no_cdrom = true;
>  mc->default_ram_id = "ddr";
>  }
> diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
> index 4415ee413f..ebad8dbb6d 100644
> --- a/hw/arm/xlnx-versal.c
> +++ b/hw/arm/xlnx-versal.c
> @@ -25,6 +25,7 @@
>  #include "hw/sysbus.h"
>  
>  #define XLNX_VERSAL_ACPU_TYPE ARM_CPU_TYPE_NAME("cortex-a72")
> +#define XLNX_VERSAL_RCPU_TYPE ARM_CPU_TYPE_NAME("cortex-r5f")
>  #define GEM_REVISION0x40070106
>  
>  #define VERSAL_NUM_PMC_APB_IRQS 3
> @@ -130,6 +131,35 @@ static void versal_create_apu_gic(Versal *s, qemu_irq 
> *pic)
>  }
>  }
>  
> +static void versal_create_rpu_cpus(Versal *s)
> +{
> +int i;
> +
> +object_initialize_child(OBJECT(s), "rpu-cluster", &s->lpd.rpu.cluster,
> +TYPE_CPU_CLUSTER);
> +qdev_prop_set_uint32(DEVICE(&s->lpd.rpu.cluster), "cluster-id", 1);
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.rpu.cpu); i++) {
> +Object *obj;
> +
> +object_initialize_child(OBJECT(&s->lpd.rpu.cluster),
> +"rpu-cpu[*]", &s->lpd.rpu.cpu[i],
> +XLNX_VERSAL_RCPU_TYPE);
> +obj = OBJECT(&s->lpd.rpu.cpu[i]);
> +object_property_set_bool(obj, "start-powered-off", true,
> + &error_abort);
> +
> +object_property_set_int(obj, "mp-affinity", 0x100 | i, &error_abort);
> +object_property_set_int(obj, "core-count", 
> ARRAY_SIZE(s->lpd.rpu.cpu),
> +&error_abort);
> +object_property_set_link(obj, "memory", OBJECT(&s->lpd.rpu.mr),
> + &error_abort);
> +qdev_realize(DEVICE(obj), NULL, &error_fatal);
> +}
> +
> +qdev_realize(DEVICE(&s->lpd.rpu.cluster), NULL, &error_fatal);
> +}
> +
>  static void versal_create_uarts(Versal *s, qemu_irq *pic)
>  {
>  int i;
> @@ -638,6 +668,7 @@ static void versal_realize(DeviceState *dev, Error **errp)
>  
>  versal_create_apu_cpus(s);
>  versal_create_apu_gic(s, pic);
> +versal_create_rpu_cpus(s);
>  versal_create_uarts(s, pic);
>  versal_create_usbs(s, pic);
>  versal_create_gems(s, pic);
> @@ -659,6 +690,8 @@ static void versal_realize(DeviceState *dev, Error **errp)
>  
>  memory_region_add_subregion_overlap(&s->mr_ps, MM_OCM, &s->lpd.mr_ocm, 
> 0);
>  memory_region_add_subregion_overlap(&s->fpd.apu.mr, 0, &s->mr_ps, 0);
> +memory_region_add_subregion_overlap(&s->lpd.rpu.mr, 0,
> +&s->lpd.rpu.mr_ps_alias, 0);
>  }
>  
>  static void versal_init(Object *obj)
> @@ -666,7 +699,10 @@ static void versal_init(Object *obj)
>  Versal *s = XLNX_VERSAL(obj);
>  
>  memory_region_init(&s->fpd.apu.mr, obj, "mr-apu", UINT64_MAX);
> +memory_region_init(&s->lpd.rpu.mr, obj, "mr-rpu", UINT64_MAX);
>  memory_region_init(&s->mr_ps, obj, "mr-ps-switch", UINT64_MAX);
> +memory_region_init_alias(&s->lpd.rpu.mr_ps_alias, OBJECT(s),
> + "mr-rpu-ps-alias", &s->mr_ps, 0, UINT64_MAX);
>  }
>  
>  static Property versal_properties[] = {
> diff --git a/include/hw/arm/xlnx-versal.h b/include/hw/arm/xlnx-versal.h
> index d2d3028e18..155e8c4b8c 100644
> --- a/include/hw/arm/xlnx-versal.h
> +++ b/include/hw/arm/xlnx-versal.h
> @@ -35,6 +35,7 @@
>  OBJECT_DECLARE_SIMPLE_TYPE(Versal, XLNX_VERSAL)
>  
>  #define XLNX_VERSAL_NR_ACPUS   2
> +#define XLNX_VERSAL_NR_RCPUS   2
>  #define XLNX_VERSAL_NR_UARTS   2
>  #define XLNX_VERSAL_NR_GEMS   2
>  #define XLNX_VERSAL_NR_ADMAS   8
> @@ -73,6 +74,15 @@ struct Versal {
>  VersalUsb2 usb;
>  } iou;
>  
> +/* Real-time Processing Unit.  */
> +struct {
> +MemoryRegion mr;
> +MemoryRegion mr_ps_alias;
> +
> + 

Re: [PATCH v3] ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

2022-04-07 Thread Mauro Matteo Cascella
On Thu, Apr 7, 2022 at 11:17 AM Marc-André Lureau
 wrote:
>
>
>
> On Thu, Apr 7, 2022 at 12:23 PM Mauro Matteo Cascella  
> wrote:
>>
>> Prevent potential integer overflow by limiting 'width' and 'height' to
>> 512x512. Also change 'datasize' type to size_t. Refer to security
>> advisory https://starlabs.sg/advisories/22-4206/ for more information.
>>
>> Fixes: CVE-2021-4206
>
>
> (the Starlabs advisory has 2022, I guess it's wrong then)

Yep, that is wrong. I asked them to update the page.

Thanks.

>> Signed-off-by: Mauro Matteo Cascella 
>
>
> Reviewed-by: Marc-André Lureau 
>
>
>>
>> ---
>> v3:
>> - fix CVE id (CVE-2021-4206 instead of CVE-2022-4206)
>>
>>  hw/display/qxl-render.c | 7 +++
>>  hw/display/vmware_vga.c | 2 ++
>>  ui/cursor.c | 8 +++-
>>  3 files changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
>> index d28849b121..dc3c4edd05 100644
>> --- a/hw/display/qxl-render.c
>> +++ b/hw/display/qxl-render.c
>> @@ -247,6 +247,13 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl, 
>> QXLCursor *cursor,
>>  size_t size;
>>
>>  c = cursor_alloc(cursor->header.width, cursor->header.height);
>> +
>> +if (!c) {
>> +qxl_set_guest_bug(qxl, "%s: cursor %ux%u alloc error", __func__,
>> +cursor->header.width, cursor->header.height);
>> +goto fail;
>> +}
>> +
>>  c->hot_x = cursor->header.hot_spot_x;
>>  c->hot_y = cursor->header.hot_spot_y;
>>  switch (cursor->header.type) {
>> diff --git a/hw/display/vmware_vga.c b/hw/display/vmware_vga.c
>> index 98c83474ad..45d06cbe25 100644
>> --- a/hw/display/vmware_vga.c
>> +++ b/hw/display/vmware_vga.c
>> @@ -515,6 +515,8 @@ static inline void vmsvga_cursor_define(struct 
>> vmsvga_state_s *s,
>>  int i, pixels;
>>
>>  qc = cursor_alloc(c->width, c->height);
>> +assert(qc != NULL);
>> +
>>  qc->hot_x = c->hot_x;
>>  qc->hot_y = c->hot_y;
>>  switch (c->bpp) {
>> diff --git a/ui/cursor.c b/ui/cursor.c
>> index 1d62ddd4d0..835f0802f9 100644
>> --- a/ui/cursor.c
>> +++ b/ui/cursor.c
>> @@ -46,6 +46,8 @@ static QEMUCursor *cursor_parse_xpm(const char *xpm[])
>>
>>  /* parse pixel data */
>>  c = cursor_alloc(width, height);
>> +assert(c != NULL);
>> +
>>  for (pixel = 0, y = 0; y < height; y++, line++) {
>>  for (x = 0; x < height; x++, pixel++) {
>>  idx = xpm[line][x];
>> @@ -91,7 +93,11 @@ QEMUCursor *cursor_builtin_left_ptr(void)
>>  QEMUCursor *cursor_alloc(int width, int height)
>>  {
>>  QEMUCursor *c;
>> -int datasize = width * height * sizeof(uint32_t);
>> +size_t datasize = width * height * sizeof(uint32_t);
>> +
>> +if (width > 512 || height > 512) {
>> +return NULL;
>> +}
>>
>>  c = g_malloc0(sizeof(QEMUCursor) + datasize);
>>  c->width  = width;
>> --
>> 2.35.1
>>
>>
>
>
> --
> Marc-André Lureau



-- 
Mauro Matteo Cascella
Red Hat Product Security
PGP-Key ID: BB3410B0
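
For readers following the thread, here is a minimal standalone sketch of the bounds-check idea discussed above. It is not the patch itself: the names (sketch_cursor, sketch_cursor_alloc, SKETCH_CURSOR_MAX_DIM) are made up for illustration, it uses calloc() rather than g_malloc0(), and it additionally rejects non-positive dimensions, which is an assumption on my side and not something the patch does. The point it tries to show is that the limit is checked before any multiplication, and that the size is computed in size_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names for this sketch; QEMU's QEMUCursor has more fields. */
#define SKETCH_CURSOR_MAX_DIM 512

struct sketch_cursor {
    int width;
    int height;
    uint32_t data[];    /* width * height pixels, 32 bits each */
};

/* The largest permitted cursor must fit in size_t. */
static_assert((uint64_t)SKETCH_CURSOR_MAX_DIM * SKETCH_CURSOR_MAX_DIM *
              sizeof(uint32_t) < SIZE_MAX,
              "bounded cursor size fits in size_t");

static struct sketch_cursor *sketch_cursor_alloc(int width, int height)
{
    struct sketch_cursor *c;
    size_t datasize;

    /* Reject out-of-range dimensions before doing any arithmetic on them. */
    if (width <= 0 || height <= 0 ||
        width > SKETCH_CURSOR_MAX_DIM || height > SKETCH_CURSOR_MAX_DIM) {
        return NULL;
    }

    /* Widen first so the multiplication is done in size_t, not int. */
    datasize = (size_t)width * (size_t)height * sizeof(uint32_t);

    c = calloc(1, sizeof(*c) + datasize);
    if (!c) {
        return NULL;
    }
    c->width = width;
    c->height = height;
    return c;
}
```

Callers then have to cope with a NULL return, which is what the qxl and vmware_vga hunks in the patch are about.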




Re: [PATCH v1 3/4] hw/misc: Add a model of the Xilinx Versal CRL

2022-04-07 Thread Francisco Iglesias
On Wed, Apr 06, 2022 at 06:43:02PM +0100, Edgar E. Iglesias wrote:
> From: "Edgar E. Iglesias" 
> 
> Add a model of the Xilinx Versal CRL.
> 
> Signed-off-by: Edgar E. Iglesias 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/misc/meson.build   |   1 +
>  hw/misc/xlnx-versal-crl.c | 421 ++
>  include/hw/misc/xlnx-versal-crl.h | 235 +
>  3 files changed, 657 insertions(+)
>  create mode 100644 hw/misc/xlnx-versal-crl.c
>  create mode 100644 include/hw/misc/xlnx-versal-crl.h
> 
> diff --git a/hw/misc/meson.build b/hw/misc/meson.build
> index 6fb69612e0..2ff05c7afa 100644
> --- a/hw/misc/meson.build
> +++ b/hw/misc/meson.build
> @@ -86,6 +86,7 @@ softmmu_ss.add(when: 'CONFIG_SLAVIO', if_true: 
> files('slavio_misc.c'))
>  softmmu_ss.add(when: 'CONFIG_ZYNQ', if_true: files('zynq_slcr.c'))
>  specific_ss.add(when: 'CONFIG_XLNX_ZYNQMP_ARM', if_true: 
> files('xlnx-zynqmp-crf.c'))
>  specific_ss.add(when: 'CONFIG_XLNX_ZYNQMP_ARM', if_true: 
> files('xlnx-zynqmp-apu-ctrl.c'))
> +specific_ss.add(when: 'CONFIG_XLNX_VERSAL', if_true: 
> files('xlnx-versal-crl.c'))
>  softmmu_ss.add(when: 'CONFIG_XLNX_VERSAL', if_true: files(
>'xlnx-versal-xramc.c',
>'xlnx-versal-pmc-iou-slcr.c',
> diff --git a/hw/misc/xlnx-versal-crl.c b/hw/misc/xlnx-versal-crl.c
> new file mode 100644
> index 00..767106b7a3
> --- /dev/null
> +++ b/hw/misc/xlnx-versal-crl.c
> @@ -0,0 +1,421 @@
> +/*
> + * QEMU model of the Clock-Reset-LPD (CRL).
> + *
> + * Copyright (c) 2022 Advanced Micro Devices, Inc.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Written by Edgar E. Iglesias 
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/log.h"
> +#include "qemu/bitops.h"
> +#include "migration/vmstate.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/sysbus.h"
> +#include "hw/irq.h"
> +#include "hw/register.h"
> +#include "hw/resettable.h"
> +
> +#include "target/arm/arm-powerctl.h"
> +#include "hw/misc/xlnx-versal-crl.h"
> +
> +#ifndef XLNX_VERSAL_CRL_ERR_DEBUG
> +#define XLNX_VERSAL_CRL_ERR_DEBUG 0
> +#endif
> +
> +static void crl_update_irq(XlnxVersalCRL *s)
> +{
> +bool pending = s->regs[R_IR_STATUS] & ~s->regs[R_IR_MASK];
> +qemu_set_irq(s->irq, pending);
> +}
> +
> +static void crl_status_postw(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +crl_update_irq(s);
> +}
> +
> +static uint64_t crl_enable_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +uint32_t val = val64;
> +
> +s->regs[R_IR_MASK] &= ~val;
> +crl_update_irq(s);
> +return 0;
> +}
> +
> +static uint64_t crl_disable_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +uint32_t val = val64;
> +
> +s->regs[R_IR_MASK] |= val;
> +crl_update_irq(s);
> +return 0;
> +}
> +
> +static void crl_reset_dev(XlnxVersalCRL *s, DeviceState *dev,
> +  bool rst_old, bool rst_new)
> +{
> +device_cold_reset(dev);
> +}
> +
> +static void crl_reset_cpu(XlnxVersalCRL *s, ARMCPU *armcpu,
> +  bool rst_old, bool rst_new)
> +{
> +if (rst_new) {
> +arm_set_cpu_off(armcpu->mp_affinity);
> +} else {
> +arm_set_cpu_on_and_reset(armcpu->mp_affinity);
> +}
> +}
> +
> +#define REGFIELD_RESET(type, s, reg, f, new_val, dev) { \
> +bool old_f = ARRAY_FIELD_EX32((s)->regs, reg, f);   \
> +bool new_f = FIELD_EX32(new_val, reg, f);   \
> +\
> +/* Detect edges.  */\
> +if (dev && old_f != new_f) {\
> +crl_reset_ ## type(s, dev, old_f, new_f);   \
> +}   \
> +}
> +
> +static uint64_t crl_rst_r5_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +
> +REGFIELD_RESET(cpu, s, RST_CPU_R5, RESET_CPU0, val64, s->cfg.cpu_r5[0]);
> +REGFIELD_RESET(cpu, s, RST_CPU_R5, RESET_CPU1, val64, s->cfg.cpu_r5[1]);
> +return val64;
> +}
> +
> +static uint64_t crl_rst_adma_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +int i;
> +
> +/* A single register fans out to all ADMA reset inputs.  */
> +for (i = 0; i < ARRAY_SIZE(s->cfg.adma); i++) {
> +REGFIELD_RESET(dev, s, RST_ADMA, RESET, val64, s->cfg.adma[i]);
> +}
> +return val64;
> +}
> +
> +static uint64_t crl_rst_uart0_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +
> +REGFIELD_RESET(dev, s, RST_UART0, RESET, val64, s->cfg.uart[0]);
> +return val64;
> +}
> +
> +static uint64_t crl_rst_uart1_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XL

Re: [PATCH v1 4/4] hw/arm: versal: Connect the CRL

2022-04-07 Thread Francisco Iglesias
On Wed, Apr 06, 2022 at 06:43:03PM +0100, Edgar E. Iglesias wrote:
> From: "Edgar E. Iglesias" 
> 
> Connect the CRL (Clock Reset LPD) to the Versal SoC.
> 
> Signed-off-by: Edgar E. Iglesias 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/arm/xlnx-versal.c | 54 ++--
>  include/hw/arm/xlnx-versal.h |  4 +++
>  2 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
> index ebad8dbb6d..57276e1506 100644
> --- a/hw/arm/xlnx-versal.c
> +++ b/hw/arm/xlnx-versal.c
> @@ -539,6 +539,57 @@ static void versal_create_ospi(Versal *s, qemu_irq *pic)
>  qdev_connect_gpio_out(orgate, 0, pic[VERSAL_OSPI_IRQ]);
>  }
>  
> +static void versal_create_crl(Versal *s, qemu_irq *pic)
> +{
> +SysBusDevice *sbd;
> +int i;
> +
> +object_initialize_child(OBJECT(s), "crl", &s->lpd.crl,
> +TYPE_XLNX_VERSAL_CRL);
> +sbd = SYS_BUS_DEVICE(&s->lpd.crl);
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.rpu.cpu); i++) {
> +g_autofree gchar *name = g_strdup_printf("cpu_r5[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.rpu.cpu[i]),
> + &error_abort);
> +}
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.iou.gem); i++) {
> +g_autofree gchar *name = g_strdup_printf("gem[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.iou.gem[i]),
> + &error_abort);
> +}
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.iou.adma); i++) {
> +g_autofree gchar *name = g_strdup_printf("adma[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.iou.adma[i]),
> + &error_abort);
> +}
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.iou.uart); i++) {
> +g_autofree gchar *name = g_strdup_printf("uart[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.iou.uart[i]),
> + &error_abort);
> +}
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + "usb", OBJECT(&s->lpd.iou.usb),
> + &error_abort);
> +
> +sysbus_realize(sbd, &error_fatal);
> +memory_region_add_subregion(&s->mr_ps, MM_CRL,
> +sysbus_mmio_get_region(sbd, 0));
> +sysbus_connect_irq(sbd, 0, pic[VERSAL_CRL_IRQ]);
> +}
> +
>  /* This takes the board allocated linear DDR memory and creates aliases
>   * for each split DDR range/aperture on the Versal address map.
>   */
> @@ -622,8 +673,6 @@ static void versal_unimp(Versal *s)
>  
>  versal_unimp_area(s, "psm", &s->mr_ps,
>  MM_PSM_START, MM_PSM_END - MM_PSM_START);
> -versal_unimp_area(s, "crl", &s->mr_ps,
> -MM_CRL, MM_CRL_SIZE);
>  versal_unimp_area(s, "crf", &s->mr_ps,
>  MM_FPD_CRF, MM_FPD_CRF_SIZE);
>  versal_unimp_area(s, "apu", &s->mr_ps,
> @@ -681,6 +730,7 @@ static void versal_realize(DeviceState *dev, Error **errp)
>  versal_create_efuse(s, pic);
>  versal_create_pmc_iou_slcr(s, pic);
>  versal_create_ospi(s, pic);
> +versal_create_crl(s, pic);
>  versal_map_ddr(s);
>  versal_unimp(s);
>  
> diff --git a/include/hw/arm/xlnx-versal.h b/include/hw/arm/xlnx-versal.h
> index 155e8c4b8c..cbe8a19c10 100644
> --- a/include/hw/arm/xlnx-versal.h
> +++ b/include/hw/arm/xlnx-versal.h
> @@ -29,6 +29,7 @@
>  #include "hw/nvram/xlnx-versal-efuse.h"
>  #include "hw/ssi/xlnx-versal-ospi.h"
>  #include "hw/dma/xlnx_csu_dma.h"
> +#include "hw/misc/xlnx-versal-crl.h"
>  #include "hw/misc/xlnx-versal-pmc-iou-slcr.h"
>  
>  #define TYPE_XLNX_VERSAL "xlnx-versal"
> @@ -87,6 +88,8 @@ struct Versal {
>  qemu_or_irq irq_orgate;
>  XlnxXramCtrl ctrl[XLNX_VERSAL_NR_XRAM];
>  } xram;
> +
> +XlnxVersalCRL crl;
>  } lpd;
>  
>  /* The Platform Management Controller subsystem.  */
> @@ -127,6 +130,7 @@ struct Versal {
>  #define VERSAL_TIMER_NS_EL1_IRQ 14
>  #define VERSAL_TIMER_NS_EL2_IRQ 10
>  
> +#define VERSAL_CRL_IRQ 10
>  #define VERSAL_UART0_IRQ_0 18
>  #define VERSAL_UART1_IRQ_0 19
>  #define VERSAL_USB0_IRQ_0  22
> -- 
> 2.25.1
> 



[PATCH v4] dump: Remove the sh_info variable

2022-04-07 Thread Janosch Frank
There's no need to have phdr_num and sh_info at the same time. We can
make phdr_num 32 bit and set PN_XNUM when we write the header if
phdr_num >= PN_XNUM.

Signed-off-by: Janosch Frank 
Reviewed-by: Richard Henderson 
---

A question out of general curiosity:
Is PN_XNUM a real concern anyway?
Are architectures using >65k segments in real life?

* Uses MIN()
* Added explanation for the PN_XNUM usage
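
As an illustration only (not part of the patch), the PN_XNUM rule described above can be sketched as a pair of small helpers. The names are hypothetical and the constant is defined locally, assuming the usual ELF value of 0xffff, so that the snippet does not depend on <elf.h>:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ELF value of PN_XNUM; defined locally for the sketch. */
#define SKETCH_PN_XNUM 0xffffu

static_assert(SKETCH_PN_XNUM <= UINT16_MAX, "e_phnum is a 16-bit field");

/* Value to store in the 16-bit e_phnum field of the ELF header. */
static uint16_t sketch_ehdr_phnum(uint32_t phdr_num)
{
    return phdr_num < SKETCH_PN_XNUM ? (uint16_t)phdr_num : SKETCH_PN_XNUM;
}

/* True when the real count has to go into sh_info of section header 0. */
static int sketch_needs_section(uint32_t phdr_num)
{
    return phdr_num >= SKETCH_PN_XNUM;
}
```

With this split, the 32-bit phdr_num is the single source of truth and the 16-bit header value is derived only at write time.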

---
 dump/dump.c   | 44 +++
 include/sysemu/dump.h |  3 +--
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/dump/dump.c b/dump/dump.c
index 58c4923fce..56cd1b2bb8 100644
--- a/dump/dump.c
+++ b/dump/dump.c
@@ -124,6 +124,12 @@ static int fd_write_vmcore(const void *buf, size_t size, 
void *opaque)
 
 static void write_elf64_header(DumpState *s, Error **errp)
 {
+/*
+ * phnum in the elf header is 16 bit, if we have more segments we
+ * set phnum to PN_XNUM and write the real number of segments to a
+ * special section.
+ */
+uint16_t phnum = MIN(s->phdr_num, PN_XNUM);
 Elf64_Ehdr elf_header;
 int ret;
 
@@ -138,9 +144,9 @@ static void write_elf64_header(DumpState *s, Error **errp)
 elf_header.e_ehsize = cpu_to_dump16(s, sizeof(elf_header));
 elf_header.e_phoff = cpu_to_dump64(s, sizeof(Elf64_Ehdr));
 elf_header.e_phentsize = cpu_to_dump16(s, sizeof(Elf64_Phdr));
-elf_header.e_phnum = cpu_to_dump16(s, s->phdr_num);
+elf_header.e_phnum = cpu_to_dump16(s, phnum);
 if (s->have_section) {
-uint64_t shoff = sizeof(Elf64_Ehdr) + sizeof(Elf64_Phdr) * s->sh_info;
+uint64_t shoff = sizeof(Elf64_Ehdr) + sizeof(Elf64_Phdr) * s->phdr_num;
 
 elf_header.e_shoff = cpu_to_dump64(s, shoff);
 elf_header.e_shentsize = cpu_to_dump16(s, sizeof(Elf64_Shdr));
@@ -155,6 +161,12 @@ static void write_elf64_header(DumpState *s, Error **errp)
 
 static void write_elf32_header(DumpState *s, Error **errp)
 {
+/*
+ * phnum in the elf header is 16 bit, if we have more segments we
+ * set phnum to PN_XNUM and write the real number of segments to a
+ * special section.
+ */
+uint16_t phnum = MIN(s->phdr_num, PN_XNUM);
 Elf32_Ehdr elf_header;
 int ret;
 
@@ -169,9 +181,9 @@ static void write_elf32_header(DumpState *s, Error **errp)
 elf_header.e_ehsize = cpu_to_dump16(s, sizeof(elf_header));
 elf_header.e_phoff = cpu_to_dump32(s, sizeof(Elf32_Ehdr));
 elf_header.e_phentsize = cpu_to_dump16(s, sizeof(Elf32_Phdr));
-elf_header.e_phnum = cpu_to_dump16(s, s->phdr_num);
+elf_header.e_phnum = cpu_to_dump16(s, phnum);
 if (s->have_section) {
-uint32_t shoff = sizeof(Elf32_Ehdr) + sizeof(Elf32_Phdr) * s->sh_info;
+uint32_t shoff = sizeof(Elf32_Ehdr) + sizeof(Elf32_Phdr) * s->phdr_num;
 
 elf_header.e_shoff = cpu_to_dump32(s, shoff);
 elf_header.e_shentsize = cpu_to_dump16(s, sizeof(Elf32_Shdr));
@@ -358,12 +370,12 @@ static void write_elf_section(DumpState *s, int type, 
Error **errp)
 if (type == 0) {
 shdr_size = sizeof(Elf32_Shdr);
 memset(&shdr32, 0, shdr_size);
-shdr32.sh_info = cpu_to_dump32(s, s->sh_info);
+shdr32.sh_info = cpu_to_dump32(s, s->phdr_num);
 shdr = &shdr32;
 } else {
 shdr_size = sizeof(Elf64_Shdr);
 memset(&shdr64, 0, shdr_size);
-shdr64.sh_info = cpu_to_dump32(s, s->sh_info);
+shdr64.sh_info = cpu_to_dump32(s, s->phdr_num);
 shdr = &shdr64;
 }
 
@@ -478,13 +490,6 @@ static void write_elf_loads(DumpState *s, Error **errp)
 hwaddr offset, filesz;
 MemoryMapping *memory_mapping;
 uint32_t phdr_index = 1;
-uint32_t max_index;
-
-if (s->have_section) {
-max_index = s->sh_info;
-} else {
-max_index = s->phdr_num;
-}
 
 QTAILQ_FOREACH(memory_mapping, &s->list.head, next) {
 get_offset_range(memory_mapping->phys_addr,
@@ -502,7 +507,7 @@ static void write_elf_loads(DumpState *s, Error **errp)
 return;
 }
 
-if (phdr_index >= max_index) {
+if (phdr_index >= s->phdr_num) {
 break;
 }
 }
@@ -1801,22 +1806,21 @@ static void dump_init(DumpState *s, int fd, bool 
has_format,
 s->phdr_num += s->list.num;
 s->have_section = false;
 } else {
+/* sh_info of section 0 holds the real number of phdrs */
 s->have_section = true;
-s->phdr_num = PN_XNUM;
-s->sh_info = 1; /* PT_NOTE */
 
 /* the type of shdr->sh_info is uint32_t, so we should avoid overflow 
*/
 if (s->list.num <= UINT32_MAX - 1) {
-s->sh_info += s->list.num;
+s->phdr_num += s->list.num;
 } else {
-s->sh_info = UINT32_MAX;
+s->phdr_num = UINT32_MAX;
 }
 }
 
 if (s->dump_info.d_class == ELFCLASS64) {
 if (s->have_section) {
 s->memory_offset = sizeof(Elf64_Ehdr) +
-  

[PATCH for-7.0] virtio-iommu: use-after-free fix

2022-04-07 Thread Michael S. Tsirkin
From: Wentao Liang 

A potential Use-after-free was reported in virtio_iommu_handle_command
when using virtio-iommu:

> I find a potential Use-after-free in QEMU 6.2.0, which is in
> virtio_iommu_handle_command() (./hw/virtio/virtio-iommu.c).
>
>
> Specifically, in the loop body, the variable 'buf' allocated at line 639 can 
> be
> freed by g_free() at line 659. However, if the execution path enters the loop
> body again and the if branch takes true at line 616, the control will directly
> jump to 'out' at line 651. At this time, 'buf' is a freed pointer, which is 
> not
> assigned with an allocated memory but used at line 653. As a result, a UAF bug
> is triggered.
>
>
>
> 599 for (;;) {
> ...
> 615 sz = iov_to_buf(iov, iov_cnt, 0, &head, sizeof(head));
> 616 if (unlikely(sz != sizeof(head))) {
> 617 tail.status = VIRTIO_IOMMU_S_DEVERR;
> 618 goto out;
> 619 }
> ...
> 639 buf = g_malloc0(output_size);
> ...
> 651 out:
> 652 sz = iov_from_buf(elem->in_sg, elem->in_num, 0,
> 653   buf ? buf : &tail, output_size);
> ...
> 659 g_free(buf);
>
> We can fix it by setting 'buf' to NULL after freeing it:
>
>
> 651 out:
> 652 sz = iov_from_buf(elem->in_sg, elem->in_num, 0,
> 653   buf ? buf : &tail, output_size);
> ...
> 659 g_free(buf);
> +++ buf = NULL;
> 660 }

Fix as suggested by the reporter.
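
To make the pattern easier to see outside of QEMU, here is a reduced standalone sketch of the loop shape from the report. The struct and function names are hypothetical and the I/O is replaced by a counter; only the control flow (an early "goto out" in a later iteration, after an earlier iteration freed buf) is meant to match:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one queued request; not a QEMU type. */
struct sketch_req {
    int header_ok;      /* 0 models the short-header error path */
    size_t out_len;     /* size of the reply buffer to allocate */
};

/* Returns how many replies were served from a heap buffer, or -1 on OOM. */
static int sketch_process(const struct sketch_req *reqs, size_t n)
{
    char tail = 0;
    char *buf = NULL;
    int used_heap = 0;

    assert(reqs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const char *src;

        if (!reqs[i].header_ok) {
            goto out;           /* early exit: buf must be NULL here */
        }
        buf = calloc(1, reqs[i].out_len);
        if (!buf) {
            return -1;
        }
out:
        src = buf ? buf : &tail;
        if (src != &tail) {
            used_heap++;
        }
        free(buf);
        buf = NULL;             /* without this, the next early exit sees a stale pointer */
    }
    return used_heap;
}
```

Dropping the "buf = NULL" line makes the second request in a sequence like ok, bad-header, ok pick the freed pointer instead of &tail.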

Signed-off-by: Wentao Liang 
Signed-off-by: Michael S. Tsirkin 
Message-ID: <20220406040445-mutt-send-email-...@kernel.org>
---
 hw/virtio/virtio-iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/virtio-iommu.c b/hw/virtio/virtio-iommu.c
index 239fe97b12..2b1d21edd1 100644
--- a/hw/virtio/virtio-iommu.c
+++ b/hw/virtio/virtio-iommu.c
@@ -683,6 +683,7 @@ out:
 virtio_notify(vdev, vq);
 g_free(elem);
 g_free(buf);
+buf = NULL;
 }
 }
 
-- 
MST




Re: [PULL 0/3] virtio,pc: bugfixes

2022-04-07 Thread Michael S. Tsirkin
On Thu, Apr 07, 2022 at 10:18:24AM +0100, Peter Maydell wrote:
> On Wed, 6 Apr 2022 at 22:11, Michael S. Tsirkin  wrote:
> >
> > The following changes since commit 128e050d41794e61e5849c6c507160da5556ea61:
> >
> >   hw/acpi/microvm: turn on 8042 bit in FADT boot architecture flags if 
> > present (2022-03-07 17:43:14 -0500)
> >
> > are available in the Git repository at:
> >
> >   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
> >
> > for you to fetch changes up to f556b9a0cd20d41493afd403fb7f016c8fb01eb3:
> >
> >   virtio-iommu: use-after-free fix (2022-04-06 17:11:03 -0400)
> >
> > 
> > virtio,pc: bugfixes
> >
> > Several fixes all over the place
> >
> > Signed-off-by: Michael S. Tsirkin 
> 
> RC3 has gone. We're not taking any more changes unless they
> are absolutely release-critical, which means that you need
> to be clearly describing in the pull request cover letter
> what the changes are and why they are release critical.
> 
> thanks
> -- PMM

You know, after looking at it critically ;) I just sent the release critical
patch on list. Feel free to pick it up.

-- 
MST




Re: [PATCH for-7.0] virtio-iommu: use-after-free fix

2022-04-07 Thread Peter Maydell
On Thu, 7 Apr 2022 at 10:52, Michael S. Tsirkin  wrote:
>
> From: Wentao Liang 
>
> A potential Use-after-free was reported in virtio_iommu_handle_command
> when using virtio-iommu:
>
> > I find a potential Use-after-free in QEMU 6.2.0, which is in
> > virtio_iommu_handle_command() (./hw/virtio/virtio-iommu.c).

So, this isn't a regression. Do you think it's critically necessary
it goes in 7.0, or is it in the category "put it into 7.0 if we
need an rc4 for some other reason anyway" ?

(I have a feeling we'll need an rc4, but we'll see.)

thanks
-- PMM



Re: [PULL 09/12] virtiofsd: Create new file with security context

2022-04-07 Thread Peter Maydell
On Thu, 17 Feb 2022 at 17:40, Dr. David Alan Gilbert (git)
 wrote:
>
> From: Vivek Goyal 
>
> This patch adds support for creating new file with security context
> as sent by client. It basically takes three paths.
>
> - If no security context enabled, then it continues to create files without
>   security context.
>
> - If security context is enabled but security.selinux has not been
>   remapped, then it uses /proc/thread-self/attr/fscreate knob to set
>   security context and then create the file. This will make sure that
>   newly created file gets the security context as set in "fscreate" and
>   this is atomic w.r.t file creation.
>
>   This is useful when host and guest SELinux policies don't conflict and
>   can work with each other. In that case, the guest security.selinux xattr
>   is not remapped and is passed through as the "security.selinux" xattr
>   on the host.
>
> - If security context is enabled but security.selinux xattr has been
>   remapped to something else, then it first creates the file and then
>   uses setxattr() to set the remapped xattr with the security context.
>   This is a non-atomic operation w.r.t file creation.
>
>   This mode will be most versatile and allow host and guest to have their
>   own separate SELinux xattrs and have their own separate SELinux policies.
>
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Vivek Goyal 
> Message-Id: <20220208204813.682906-9-vgo...@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert 

Hi; Coverity reports some issues (CID 1487142, 1487195), because
it is not a fan of the error-handling pattern used in this code:

> +static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
> +   const char *name, const char *secctx_name)
> +{
> +int path_fd, err;
> +char procname[64];
> +struct lo_data *lo = lo_data(req);
> +
> +if (!req->secctx.ctxlen) {
> +return 0;
> +}
> +
> +/* Open newly created element with O_PATH */
> +path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
> +err = path_fd == -1 ? errno : 0;
> +if (err) {
> +return err;
> +}

We set err based on whether path_fd is -1 or not, but we decide
whether to early-return based on the value of err. Coverity
doesn't know that openat() will always set errno to something
non-zero if it returns -1, so it complains because it thinks
there's a code path where openat() returns -1, but errno is 0,
and so we don't take the early-return and instead continue
through all the code below to the "close(path_fd)", which
should not be passed a negative value for the file descriptor.

I could just mark these as false-positives, but it does seem a bit
odd that we are using two different conditions here. Perhaps it would
be better to rephrase? For instance, for the openat() we could write:

   path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
   if (path_fd == -1) {
   return errno;
   }
and similarly for the openat() in open_set_proc_fscreate().

> +sprintf(procname, "%i", path_fd);
> +FCHDIR_NOFAIL(lo->proc_self_fd);
> +/* Set security context. This is not atomic w.r.t file creation */
> +err = setxattr(procname, secctx_name, req->secctx.ctx, 
> req->secctx.ctxlen,
> +   0);
> +if (err) {
> +err = errno;
> +}

> +FCHDIR_NOFAIL(lo->root.fd);
> +close(path_fd);
> +return err;
> +}

thanks
-- PMM



Re: [Qemu-devel] [PULL 28/30] introduce xlnx-dp

2022-04-07 Thread Peter Maydell
On Tue, 14 Jun 2016 at 15:40, Peter Maydell  wrote:
>
> From: KONRAD Frederic 
>
> This is the implementation of the DisplayPort.
> It has an aux-bus to access dpcd and edid.
>
> Graphic plane is connected to the channel 3.
> Video plane is connected to the channel 0.
> Audio streams are connected to the channels 4 and 5.

Very old patch, but Coverity has just pointed out an array
overrun in it (CID 1487260):

We define a set of offsets for V_BLEND registers, of which
the largest is this one:

> +#define V_BLEND_CHROMA_KEY_COMP3 (0x01DC >> 2)

> +static void xlnx_dp_vblend_write(void *opaque, hwaddr offset,
> + uint64_t value, unsigned size)
> +{
> +XlnxDPState *s = XLNX_DP(opaque);
> +bool alpha_was_enabled;
> +
> +DPRINTF("vblend: write @0x%" HWADDR_PRIX " = 0x%" PRIX32 "\n", offset,
> +(uint32_t)value);
> +offset = offset >> 2;
> +
> +switch (offset) {

> +case V_BLEND_CHROMA_KEY_COMP1:
> +case V_BLEND_CHROMA_KEY_COMP2:
> +case V_BLEND_CHROMA_KEY_COMP3:
> +s->vblend_registers[offset] = value & 0x0FFF0FFF;

We use V_BLEND_CHROMA_KEY_COMP3 as an index into the vblend_registers array...

> +break;
> +default:
> +s->vblend_registers[offset] = value;
> +break;
> +}
> +}

> +#define DP_CORE_REG_ARRAY_SIZE  (0x3AF >> 2)
> +#define DP_AVBUF_REG_ARRAY_SIZE (0x238 >> 2)
> +#define DP_VBLEND_REG_ARRAY_SIZE (0x1DF >> 2)
> +#define DP_AUDIO_REG_ARRAY_SIZE (0x50 >> 2)

> +uint32_t vblend_registers[DP_VBLEND_REG_ARRAY_SIZE];

...but we have defined DP_VBLEND_REG_ARRAY_SIZE to 0x1DF >> 2,
which is the same as 0x1DC >> 2, and so the array size is too small.

The size of the memory region is also suspicious:

+memory_region_init_io(&s->vblend_iomem, obj, &vblend_ops, s, TYPE_XLNX_DP
+  ".v_blend", 0x1DF);

This is a "32-bit accesses only" region, but we have defined it with a
size that is not a multiple of 4. That looks wrong... (It also means
that rather than having an array overrun I think the actual effect
is that the guest won't be able to access the last register, because
it's not entirely within the memoryregion.)

Coverity doesn't complain about it, but the DP_CORE_REG_ARRAY_SIZE
may also have a similar problem.
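
To spell out the arithmetic, here is a small compile-time sketch. The SKETCH_* names are made up, and the "new" sizes are just one possible correction (last index plus one, and a region that covers the whole last 32-bit register); I am not claiming this is what the eventual fix should look like:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of V_BLEND_CHROMA_KEY_COMP3, the largest register offset. */
#define SKETCH_LAST_VBLEND_OFFSET 0x01DC

/* Array size as defined in the original patch. */
#define SKETCH_OLD_ARRAY_SIZE (0x1DF >> 2)
/* One possible correction: index of the last register plus one. */
#define SKETCH_NEW_ARRAY_SIZE ((SKETCH_LAST_VBLEND_OFFSET >> 2) + 1)
/* A region size that fully contains the last 32-bit register. */
#define SKETCH_NEW_REGION_SIZE (SKETCH_LAST_VBLEND_OFFSET + sizeof(uint32_t))

/* The old size equals the last index, so that index is out of bounds. */
static_assert(SKETCH_OLD_ARRAY_SIZE == (SKETCH_LAST_VBLEND_OFFSET >> 2),
              "old array size equals the largest index");
static_assert(SKETCH_NEW_ARRAY_SIZE == 120, "0x1DC >> 2 is 119");
static_assert(SKETCH_NEW_REGION_SIZE == 0x1E0, "region ends after the last register");
static_assert(SKETCH_NEW_REGION_SIZE % 4 == 0, "region is a multiple of 4 bytes");

static int sketch_index_in_bounds(unsigned index, unsigned array_size)
{
    return index < array_size;
}
```

The same check could be repeated for DP_CORE_REG_ARRAY_SIZE once the largest core register offset is known.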

thanks
-- PMM



Re: [PATCH 1/2] block/throttle-groups: use QEMU_CLOCK_REALTIME for qtest too

2022-04-07 Thread Vladimir Sementsov-Ogievskiy

Thanks for the explanation!

07.04.2022 09:42, Hanna Reitz wrote:

On 06.04.22 17:32, Vladimir Sementsov-Ogievskiy wrote:

The virtual clock just doesn't tick for iotests, so throttling just
doesn't work. Let's use the realtime clock.


It does tick when you make it tick, specifically with the clock_step qtest 
command.  093 does this, and so with this patch, it fails, because it is no 
longer deterministic.

So far, if I needed realtime throttling, I simply switched the accelerator to 
tcg (e.g. in stream-error-on-reset).


Hm, I tried that (adding vm.add_args('-accel', 'tcg') before vm.launch() in 
the test), but it doesn't help: "-accel qtest" is kept anyway, so 
do_configure_accelerator is called for qtest and qtest_allowed ends up set to true.

But using the QEMUMachine class instead of VM helps.



I’m not really opposed to this, but it does break 093, and without looking too 
closely into it, I would guess that it’d be difficult to rewrite 093 in a 
deterministic way without it relying on throttling using the virtual clock.  (A 
runtime option for the throttle-group object to choose the clock type might be 
an option.)


OK, I don't think we need these patches now.




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/throttle-groups.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index fb203c3ced..029158d797 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -753,10 +753,6 @@ static void throttle_group_obj_init(Object *obj)
  ThrottleGroup *tg = THROTTLE_GROUP(obj);
  tg->clock_type = QEMU_CLOCK_REALTIME;
-    if (qtest_enabled()) {
-    /* For testing block IO throttling only */
-    tg->clock_type = QEMU_CLOCK_VIRTUAL;
-    }
  tg->is_initialized = false;
  qemu_mutex_init(&tg->lock);
  throttle_init(&tg->ts);





--
Best regards,
Vladimir



Re: [PATCH v3 7/7] iotests: copy-before-write: add cases for cbw-timeout option

2022-04-07 Thread Vladimir Sementsov-Ogievskiy

07.04.2022 12:19, Hanna Reitz wrote:

On 06.04.22 20:08, Vladimir Sementsov-Ogievskiy wrote:

Add two simple test-cases: timeout failure with
break-snapshot-on-cbw-error behavior and similar with
break-guest-write-on-cbw-error behavior.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/copy-before-write    | 78 +++
  .../qemu-iotests/tests/copy-before-write.out  |  4 +-
  2 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/tests/copy-before-write 
b/tests/qemu-iotests/tests/copy-before-write
index a32608f597..5c90b8cd50 100755
--- a/tests/qemu-iotests/tests/copy-before-write
+++ b/tests/qemu-iotests/tests/copy-before-write
@@ -122,6 +122,84 @@ read 1048576/1048576 bytes at offset 0
  1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  """)
+    def do_cbw_timeout(self, on_cbw_error):
+    result = self.vm.qmp('object-add', {
+    'qom-type': 'throttle-group',
+    'id': 'group0',
+    'limits': {'bps-write': 300 * 1024}


Hm, yes, I can’t find a way to make this work without your other series.  For 
some reason, not even -accel tcg helps; and using qtest to advance the virtual 
clock doesn’t really help because the qemu-io commands block while the request 
is throttled.

One thing that should work would be to run everything in a qemu-storage-daemon 
instance, and then having qemu-io access an NBD export...



Simply switching to QEMUMachine helps. I'll resend soon.

--
Best regards,
Vladimir



Re: [PATCH 12/32] qga: replace deprecated g_get_current_time()

2022-04-07 Thread Marc-André Lureau
Hi

On Thu, Apr 7, 2022 at 9:54 AM Markus Armbruster  wrote:

> marcandre.lur...@redhat.com writes:
>
> > From: Marc-André Lureau 
> >
> > According to GLib API:
> > g_get_current_time has been deprecated since version 2.62 and should not
> > be used in newly-written code. GTimeVal is not year-2038-safe. Use
> > g_get_real_time() instead.
> >
> > Signed-off-by: Marc-André Lureau 
> > ---
> >  qga/main.c | 7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/qga/main.c b/qga/main.c
> > index b9dd19918e47..1deb0ee2fbfe 100644
> > --- a/qga/main.c
> > +++ b/qga/main.c
> > @@ -314,7 +314,6 @@ static void ga_log(const gchar *domain,
> GLogLevelFlags level,
> > const gchar *msg, gpointer opaque)
> >  {
> >  GAState *s = opaque;
> > -GTimeVal time;
> >  const char *level_str = ga_log_level_str(level);
> >
> >  if (!ga_logging_enabled(s)) {
> > @@ -329,9 +328,11 @@ static void ga_log(const gchar *domain,
> GLogLevelFlags level,
> >  #else
> >  if (level & s->log_level) {
> >  #endif
> > -g_get_current_time(&time);
> > +gint64 t = g_get_real_time();
> >  fprintf(s->log_file,
> > -"%lu.%lu: %s: %s\n", time.tv_sec, time.tv_usec,
> level_str, msg);
>
> The old code is kind of wrong.  Say it's 1649309843.000001 seconds past
> the epoch.  Prints "1649309843.1".  9us later, it prints
> "1649309843.10".  Should really use %06lu for the microseconds part.
>

good idea
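
For reference, a minimal sketch of the zero-padded formatting being discussed, using int64_t and PRId64. The helper name and the local SKETCH_USEC_PER_SEC constant are hypothetical (the patch itself uses G_USEC_PER_SEC and writes straight to the log file), and it assumes a non-negative timestamp:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

#define SKETCH_USEC_PER_SEC INT64_C(1000000)

/* Formats t_us (microseconds since the epoch) as "<sec>.<6-digit usec>". */
static int sketch_format_timestamp(char *buf, size_t len, int64_t t_us)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%" PRId64 ".%06" PRId64,
                    t_us / SKETCH_USEC_PER_SEC, t_us % SKETCH_USEC_PER_SEC);
}
```

With the %06 width, timestamps that are 9us apart no longer differ only by a trailing zero in the fractional part.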


> Whether you want to fix this in this patch, or just note it for later in
> the commit message, or ignore it altogether is up to you.
>
> > +"%" G_GINT64_FORMAT ".%" G_GINT64_FORMAT
>
> This gives me flashbacks to the 90s.  Please use PRId64 like we do
> everywhere else.
>
> I'd ditch gint64 for int64_t, too.
>

ack, ack
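
The two review points above (zero-padding the microseconds, and using
int64_t with PRId64) can be sketched in isolation. This is only a minimal
sketch and not the code from the patch: format_timestamp() and USEC_PER_SEC
are names invented here for illustration, and the timestamp is passed in as
a parameter instead of being read from g_get_real_time().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for G_USEC_PER_SEC, so the sketch needs no GLib. */
#define USEC_PER_SEC INT64_C(1000000)

/*
 * Format a microsecond timestamp as "seconds.microseconds".
 * The fractional part is zero-padded to six digits, so 1us and 10us
 * past the same second print differently.
 * Returns the snprintf() result.
 */
static int format_timestamp(char *buf, size_t len, int64_t t_us)
{
    assert(t_us >= 0);   /* sketch only: negative timestamps not handled */
    return snprintf(buf, len, "%" PRId64 ".%06" PRId64,
                    t_us / USEC_PER_SEC, t_us % USEC_PER_SEC);
}
```

With the padding in place, a timestamp 1us past the second and one 10us
past it no longer share a prefix that makes them look like the same value.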


>
> > +": %s: %s\n", t / G_USEC_PER_SEC, t % G_USEC_PER_SEC,
> > +level_str, msg);
> >  fflush(s->log_file);
> >  }
> >  }
>
>
>

-- 
Marc-André Lureau


Re: [Qemu-devel] [PULL 28/30] introduce xlnx-dp

2022-04-07 Thread Frederic Konrad




Le 4/7/22 à 12:32, Peter Maydell a écrit :

On Tue, 14 Jun 2016 at 15:40, Peter Maydell  wrote:


From: KONRAD Frederic 

This is the implementation of the DisplayPort.
It has an aux-bus to access dpcd and edid.

Graphic plane is connected to the channel 3.
Video plane is connected to the channel 0.
Audio streams are connected to the channels 4 and 5.


Very old patch, but Coverity has just pointed out an array
overrun in it (CID 1487260):

We define a set of offsets for V_BLEND registers, of which
the largest is this one:


+#define V_BLEND_CHROMA_KEY_COMP3 (0x01DC >> 2)



+static void xlnx_dp_vblend_write(void *opaque, hwaddr offset,
+ uint64_t value, unsigned size)
+{
+XlnxDPState *s = XLNX_DP(opaque);
+bool alpha_was_enabled;
+
+DPRINTF("vblend: write @0x%" HWADDR_PRIX " = 0x%" PRIX32 "\n", offset,
+ (uint32_t)value);
+offset = offset >> 2;
+
+switch (offset) {



+case V_BLEND_CHROMA_KEY_COMP1:
+case V_BLEND_CHROMA_KEY_COMP2:
+case V_BLEND_CHROMA_KEY_COMP3:
+s->vblend_registers[offset] = value & 0x0FFF0FFF;


We use V_BLEND_CHROMA_KEY_COMP3 as an index into the vblend_registers array...


+break;
+default:
+s->vblend_registers[offset] = value;
+break;
+}
+}



+#define DP_CORE_REG_ARRAY_SIZE  (0x3AF >> 2)
+#define DP_AVBUF_REG_ARRAY_SIZE (0x238 >> 2)
+#define DP_VBLEND_REG_ARRAY_SIZE (0x1DF >> 2)
+#define DP_AUDIO_REG_ARRAY_SIZE (0x50 >> 2)



+uint32_t vblend_registers[DP_VBLEND_REG_ARRAY_SIZE];


..but we have defined DP_VBLEND_REG_ARRAY_SIZE to 0x1DF >> 2,
which is the same as 0x1DC >> 2, and so the array size is too small.

The size of the memory region is also suspicious:

+memory_region_init_io(&s->vblend_iomem, obj, &vblend_ops, s, TYPE_XLNX_DP
+  ".v_blend", 0x1DF);

This is a "32-bit accesses only" region, but we have defined it with a
size that is not a multiple of 4. That looks wrong... (It also means
that rather than having an array overrun I think the actual effect
is that the guest won't be able to access the last register, because
it's not entirely within the memoryregion.)
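
The arithmetic behind both observations can be checked with a small
standalone sketch. Only the 0x1DC and 0x1DF values come from the patch; the
macro and function names below are hypothetical, and the "NEW_" sizes are
one possible correction rather than an agreed fix. The region check is pure
offset arithmetic; whether QEMU actually rejects such an access is not
established here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LAST_REG_OFFSET  0x1DCu  /* byte offset of V_BLEND_CHROMA_KEY_COMP3 */
#define REG_WIDTH        4u      /* registers are 32 bits wide */

/* Sizes as defined in the patch. */
#define OLD_ARRAY_SIZE   (0x1DFu >> 2)
#define OLD_REGION_SIZE  0x1DFu

/* Hypothetical corrected sizes: end one past the last register's last byte. */
#define NEW_REGION_SIZE  (LAST_REG_OFFSET + REG_WIDTH)
#define NEW_ARRAY_SIZE   (NEW_REGION_SIZE >> 2)

/* True if idx is a valid index into an array of n elements. */
static bool index_in_bounds(uint32_t idx, uint32_t n)
{
    return idx < n;
}

/* True if a width-byte access at offset lies entirely inside the region. */
static bool access_in_region(uint32_t offset, uint32_t width,
                             uint32_t region_size)
{
    return offset + width <= region_size;
}

/* Sanity checks for the arithmetic discussed above. */
static void check_vblend_sizes(void)
{
    /* 0x1DF >> 2 and 0x1DC >> 2 are both 119: the index equals the size. */
    assert(OLD_ARRAY_SIZE == (LAST_REG_OFFSET >> 2));
    assert(!index_in_bounds(LAST_REG_OFFSET >> 2, OLD_ARRAY_SIZE));
    assert(index_in_bounds(LAST_REG_OFFSET >> 2, NEW_ARRAY_SIZE));
}
```

The same off-by-one shows up in the region size: a 4-byte access at 0x1DC
ends at 0x1E0, which is past a region of 0x1DF bytes.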


arg, sorry for that..

I share your point, it should not be possible to access it, but using
the monitor:

(qemu) info mtree
...
fd4aa000-fd4aa1de (prio 0, i/o): xlnx.v-dp.v_blend
...

I can actually read that register (at least it doesn't complain, on an
older qemu version though):
(qemu) xp /w 0xfd4aa1dc
fd4aa1dc: 0x

So I'm not totally sure.. do you need a patch for 7.0.0?



Coverity doesn't complain about it, but the DP_CORE_REG_ARRAY_SIZE
may also have a similar problem.


I think it doesn't complain because writing to the last register doesn't
actually write into the array but updates the mask register instead:

case DP_INT_DS:
s->core_registers[DP_INT_MASK] |= ~value;
xlnx_dp_update_irq(s);
break;



thanks
-- PMM


Best Regards,
Fred



Re: [PATCH for-7.1 02/18] hw/intc/exynos4210_gic: Remove unused TYPE_EXYNOS4210_IRQ_GATE

2022-04-07 Thread Francisco Iglesias
On [2022 Apr 04] Mon 16:46:42, Peter Maydell wrote:
> Now we have removed the only use of TYPE_EXYNOS4210_IRQ_GATE we can
> delete the device entirely.
> 
> Signed-off-by: Peter Maydell 

Reviewed-by: Francisco Iglesias 

> ---
>  hw/intc/exynos4210_gic.c | 107 ---
>  1 file changed, 107 deletions(-)
> 
> diff --git a/hw/intc/exynos4210_gic.c b/hw/intc/exynos4210_gic.c
> index bc73d1f1152..794f6b5ac72 100644
> --- a/hw/intc/exynos4210_gic.c
> +++ b/hw/intc/exynos4210_gic.c
> @@ -373,110 +373,3 @@ static void exynos4210_gic_register_types(void)
>  }
>  
>  type_init(exynos4210_gic_register_types)
> -
> -/* IRQ OR Gate struct.
> - *
> - * This device models an OR gate. There are n_in input qdev gpio lines and one
> - * output sysbus IRQ line. The output IRQ level is formed as OR between all
> - * gpio inputs.
> - */
> -
> -#define TYPE_EXYNOS4210_IRQ_GATE "exynos4210.irq_gate"
> -OBJECT_DECLARE_SIMPLE_TYPE(Exynos4210IRQGateState, EXYNOS4210_IRQ_GATE)
> -
> -struct Exynos4210IRQGateState {
> -SysBusDevice parent_obj;
> -
> -uint32_t n_in;  /* inputs amount */
> -uint32_t *level;/* input levels */
> -qemu_irq out;   /* output IRQ */
> -};
> -
> -static Property exynos4210_irq_gate_properties[] = {
> -DEFINE_PROP_UINT32("n_in", Exynos4210IRQGateState, n_in, 1),
> -DEFINE_PROP_END_OF_LIST(),
> -};
> -
> -static const VMStateDescription vmstate_exynos4210_irq_gate = {
> -.name = "exynos4210.irq_gate",
> -.version_id = 2,
> -.minimum_version_id = 2,
> -.fields = (VMStateField[]) {
> -VMSTATE_VBUFFER_UINT32(level, Exynos4210IRQGateState, 1, NULL, n_in),
> -VMSTATE_END_OF_LIST()
> -}
> -};
> -
> -/* Process a change in IRQ input. */
> -static void exynos4210_irq_gate_handler(void *opaque, int irq, int level)
> -{
> -Exynos4210IRQGateState *s = (Exynos4210IRQGateState *)opaque;
> -uint32_t i;
> -
> -assert(irq < s->n_in);
> -
> -s->level[irq] = level;
> -
> -for (i = 0; i < s->n_in; i++) {
> -if (s->level[i] >= 1) {
> -qemu_irq_raise(s->out);
> -return;
> -}
> -}
> -
> -qemu_irq_lower(s->out);
> -}
> -
> -static void exynos4210_irq_gate_reset(DeviceState *d)
> -{
> -Exynos4210IRQGateState *s = EXYNOS4210_IRQ_GATE(d);
> -
> -memset(s->level, 0, s->n_in * sizeof(*s->level));
> -}
> -
> -/*
> - * IRQ Gate initialization.
> - */
> -static void exynos4210_irq_gate_init(Object *obj)
> -{
> -Exynos4210IRQGateState *s = EXYNOS4210_IRQ_GATE(obj);
> -SysBusDevice *sbd = SYS_BUS_DEVICE(obj);
> -
> -sysbus_init_irq(sbd, &s->out);
> -}
> -
> -static void exynos4210_irq_gate_realize(DeviceState *dev, Error **errp)
> -{
> -Exynos4210IRQGateState *s = EXYNOS4210_IRQ_GATE(dev);
> -
> -/* Allocate general purpose input signals and connect a handler to each of
> - * them */
> -qdev_init_gpio_in(dev, exynos4210_irq_gate_handler, s->n_in);
> -
> -s->level = g_malloc0(s->n_in * sizeof(*s->level));
> -}
> -
> -static void exynos4210_irq_gate_class_init(ObjectClass *klass, void *data)
> -{
> -DeviceClass *dc = DEVICE_CLASS(klass);
> -
> -dc->reset = exynos4210_irq_gate_reset;
> -dc->vmsd = &vmstate_exynos4210_irq_gate;
> -device_class_set_props(dc, exynos4210_irq_gate_properties);
> -dc->realize = exynos4210_irq_gate_realize;
> -}
> -
> -static const TypeInfo exynos4210_irq_gate_info = {
> -.name  = TYPE_EXYNOS4210_IRQ_GATE,
> -.parent= TYPE_SYS_BUS_DEVICE,
> -.instance_size = sizeof(Exynos4210IRQGateState),
> -.instance_init = exynos4210_irq_gate_init,
> -.class_init= exynos4210_irq_gate_class_init,
> -};
> -
> -static void exynos4210_irq_gate_register_types(void)
> -{
> -type_register_static(&exynos4210_irq_gate_info);
> -}
> -
> -type_init(exynos4210_irq_gate_register_types)
> -- 
> 2.25.1
> 
> 



Re: [Qemu-devel] [PULL 28/30] introduce xlnx-dp

2022-04-07 Thread Peter Maydell
On Thu, 7 Apr 2022 at 12:28, Frederic Konrad  wrote:
> So I'm not totally sure.. do you need a patch for 7.0.0?

It's not a regression, so we can fix this for 7.1.

thanks
-- PMM



Re: [PULL 09/12] virtiofsd: Create new file with security context

2022-04-07 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:
> On Thu, 17 Feb 2022 at 17:40, Dr. David Alan Gilbert (git)
>  wrote:
> >
> > From: Vivek Goyal 
> >
> > This patch adds support for creating new file with security context
> > as sent by client. It basically takes three paths.
> >
> > - If no security context enabled, then it continues to create files without
> >   security context.
> >
> > - If security context is enabled but security.selinux has not been
> >   remapped, then it uses /proc/thread-self/attr/fscreate knob to set
> >   security context and then create the file. This will make sure that
> >   newly created file gets the security context as set in "fscreate" and
> >   this is atomic w.r.t file creation.
> >
> >   This is useful when host and guest SELinux policies don't conflict and
> >   can work with each other. In that case, guest security.selinux xattr
> >   is not remapped and it is passthrough as "security.selinux" xattr
> >   on host.
> >
> > - If security context is enabled but security.selinux xattr has been
> >   remapped to something else, then it first creates the file and then
> >   uses setxattr() to set the remapped xattr with the security context.
> >   This is a non-atomic operation w.r.t file creation.
> >
> >   This mode will be most versatile and allow host and guest to have their
> >   own separate SELinux xattrs and have their own separate SELinux policies.
> >
> > Reviewed-by: Dr. David Alan Gilbert 
> > Signed-off-by: Vivek Goyal 
> > Message-Id: <20220208204813.682906-9-vgo...@redhat.com>
> > Signed-off-by: Dr. David Alan Gilbert 
> 
> Hi; Coverity reports some issues (CID 1487142, 1487195), because
> it is not a fan of the error-handling pattern used in this code:
> 
> > +static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
> > +   const char *name, const char *secctx_name)
> > +{
> > +int path_fd, err;
> > +char procname[64];
> > +struct lo_data *lo = lo_data(req);
> > +
> > +if (!req->secctx.ctxlen) {
> > +return 0;
> > +}
> > +
> > +/* Open newly created element with O_PATH */
> > +path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
> > +err = path_fd == -1 ? errno : 0;
> > +if (err) {
> > +return err;
> > +}
> 
> We set err based on whether path_fd is -1 or not, but we decide
> whether to early-return based on the value of err. Coverity
> doesn't know that openat() will always set errno to something
> non-zero if it returns -1, so it complains because it thinks
> there's a code path where openat() returns -1, but errno is 0,
> and so we don't take the early-return and instead continue
> through all the code below to the "close(path_fd)", which
> should not be being passed a negative value for the file descriptor.
> 
> I could just mark these as false-positives, but it does seem a bit
> odd that we are using two different conditions here. Perhaps it would
> be better to rephrase? For instance, for the openat() we could write:
> 
>path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
>if (path_fd == -1) {
>return errno;
>}

That looks OK to me; please send a patch.

Some of the cases look like they need to just be a little careful that
'err' always gets set to 0 if there are later cases that might set err.

Dave

> and similarly for the openat() in open_set_proc_fscreate().
> 
> > +sprintf(procname, "%i", path_fd);
> > +FCHDIR_NOFAIL(lo->proc_self_fd);
> > +/* Set security context. This is not atomic w.r.t file creation */
> > +err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
> > +   0);
> > +if (err) {
> > +err = errno;
> > +}
> 
> > +FCHDIR_NOFAIL(lo->root.fd);
> > +close(path_fd);
> > +return err;
> > +}
> 
> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v2 1/5] qdev: add user_creatable_requires_machine_allowance class flag

2022-04-07 Thread Edgar E. Iglesias
On Thu, Mar 31, 2022 at 01:53:08PM +0200, Damien Hedde wrote:
> This flag will be used in device_add to check if
> the device needs special allowance from the machine
> model.
> 
> It will replace the current check based only on the
> device being a TYPE_SYS_BUS_DEVICE.
> 

Looks good to me!
Reviewed-by: Edgar E. Iglesias 


> Signed-off-by: Damien Hedde 
> ---
> 
> v2:
>  + change the flag name and put it just below user_creatable
> ---
>  include/hw/qdev-core.h | 9 +
>  hw/core/qdev.c | 1 +
>  hw/core/sysbus.c   | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
> index 92c3d65208..6a040fcd3b 100644
> --- a/include/hw/qdev-core.h
> +++ b/include/hw/qdev-core.h
> @@ -122,6 +122,15 @@ struct DeviceClass {
>   * TODO remove once we're there
>   */
>  bool user_creatable;
> +/*
> + * Some devices can be user created under certain conditions (eg:
> + * specific machine support for sysbus devices), but it is
> + * preferable to prevent global allowance for the reasons
> + * described above.
> + * This flag is an additional constraint over user_creatable:
> + * user_creatable still needs to be set to true.
> + */
> +bool user_creatable_requires_machine_allowance;
>  bool hotpluggable;
>  
>  /* callbacks */
> diff --git a/hw/core/qdev.c b/hw/core/qdev.c
> index 84f3019440..0844c85a21 100644
> --- a/hw/core/qdev.c
> +++ b/hw/core/qdev.c
> @@ -833,6 +833,7 @@ static void device_class_init(ObjectClass *class, void *data)
>   */
>  dc->hotpluggable = true;
>  dc->user_creatable = true;
> +dc->user_creatable_requires_machine_allowance = false;
>  vc->get_id = device_vmstate_if_get_id;
>  rc->get_state = device_get_reset_state;
>  rc->child_foreach = device_reset_child_foreach;
> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
> index 05c1da3d31..5f771ed1e9 100644
> --- a/hw/core/sysbus.c
> +++ b/hw/core/sysbus.c
> @@ -325,6 +325,7 @@ static void sysbus_device_class_init(ObjectClass *klass, void *data)
>   * subclass needs to override it and set user_creatable=true.
>   */
>  k->user_creatable = false;
> +k->user_creatable_requires_machine_allowance = true;
>  }
>  
>  static const TypeInfo sysbus_device_type_info = {
> -- 
> 2.35.1
> 
> 



Re: [PATCH v2 2/5] machine: update machine allowed list related functions/fields

2022-04-07 Thread Edgar E. Iglesias
On Thu, Mar 31, 2022 at 01:53:09PM +0200, Damien Hedde wrote:
> The list will now accept any device (not only sysbus devices) so
> we rename the related code and documentation.
> 
> Create some temporary inline functions with old names until
> we've updated callsites as well.
> 
> Signed-off-by: Damien Hedde 
> Reviewed-by: Philippe Mathieu-Daudé 

Reviewed-by: Edgar E. Iglesias 


> ---
>  include/hw/boards.h | 50 +++--
>  hw/core/machine.c   | 10 -
>  2 files changed, 35 insertions(+), 25 deletions(-)
> 
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index c92ac8815c..1814793175 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -38,35 +38,45 @@ void machine_parse_smp_config(MachineState *ms,
>const SMPConfiguration *config, Error **errp);
>  
>  /**
> - * machine_class_allow_dynamic_sysbus_dev: Add type to list of valid devices
> + * machine_class_allow_dynamic_device: Add type to list of valid devices
>   * @mc: Machine class
> - * @type: type to allow (should be a subtype of TYPE_SYS_BUS_DEVICE)
> + * @type: type to allow (should be a subtype of TYPE_DEVICE having the
> + *uc_requires_machine_allowance flag)
>   *
>   * Add the QOM type @type to the list of devices of which are subtypes
> - * of TYPE_SYS_BUS_DEVICE but which are still permitted to be dynamically
> - * created (eg by the user on the command line with -device).
> - * By default if the user tries to create any devices on the command line
> - * that are subtypes of TYPE_SYS_BUS_DEVICE they will get an error message;
> - * for the special cases which are permitted for this machine model, the
> - * machine model class init code must call this function to add them
> - * to the list of specifically permitted devices.
> + * of TYPE_DEVICE but which are only permitted to be dynamically
> + * created (eg by the user on the command line with -device) if the
> + * machine allowed it.
> + *
> + * Otherwise if the user tries to create such a device on the command line,
> + * it will get an error message.
>   */
> -void machine_class_allow_dynamic_sysbus_dev(MachineClass *mc, const char *type);
> +void machine_class_allow_dynamic_device(MachineClass *mc, const char *type);
> +static inline void machine_class_allow_dynamic_sysbus_dev(MachineClass *mc,
> +  const char *type)
> +{
> +machine_class_allow_dynamic_device(mc, type);
> +}
>  
>  /**
> - * device_type_is_dynamic_sysbus: Check if type is an allowed sysbus device
> + * device_type_is_dynamic_allowed: Check if type is an allowed device
>   * type for the machine class.
>   * @mc: Machine class
> - * @type: type to check (should be a subtype of TYPE_SYS_BUS_DEVICE)
> + * @type: type to check (should be a subtype of TYPE_DEVICE)
>   *
>   * Returns: true if @type is a type in the machine's list of
> - * dynamically pluggable sysbus devices; otherwise false.
> + * dynamically pluggable devices; otherwise false.
>   *
> - * Check if the QOM type @type is in the list of allowed sysbus device
> - * types (see machine_class_allowed_dynamic_sysbus_dev()).
> + * Check if the QOM type @type is in the list of allowed device
> + * types (see machine_class_allowed_dynamic_device()).
>   * Note that if @type has a parent type in the list, it is allowed too.
>   */
> -bool device_type_is_dynamic_sysbus(MachineClass *mc, const char *type);
> +bool device_type_is_dynamic_allowed(MachineClass *mc, const char *type);
> +static inline bool device_type_is_dynamic_sysbus(MachineClass *mc,
> + const char *type)
> +{
> +return device_type_is_dynamic_allowed(mc, type);
> +}
>  
>  /**
>   * device_is_dynamic_sysbus: test whether device is a dynamic sysbus device
> @@ -74,12 +84,12 @@ bool device_type_is_dynamic_sysbus(MachineClass *mc, const char *type);
>   * @dev: device to check
>   *
>   * Returns: true if @dev is a sysbus device on the machine's list
> - * of dynamically pluggable sysbus devices; otherwise false.
> + * of dynamically pluggable devices; otherwise false.
>   *
>   * This function checks whether @dev is a valid dynamic sysbus device,
>   * by first confirming that it is a sysbus device and then checking it
> - * against the list of permitted dynamic sysbus devices which has been
> - * set up by the machine using machine_class_allow_dynamic_sysbus_dev().
> + * against the list of permitted dynamic devices which has been
> + * set up by the machine using machine_class_allow_dynamic_device().
>   *
>   * It is valid to call this with something that is not a subclass of
>   * TYPE_SYS_BUS_DEVICE; the function will return false in this case.
> @@ -263,7 +273,7 @@ struct MachineClass {
>  bool ignore_memory_transaction_failures;
>  int numa_mem_align_shift;
>  const char **valid_cpu_types;
> -strList *allowed_dynamic_sysbus_devices;
> +strList *allowed_dyna

Re: [PATCH v2 3/5] qdev-monitor: use the new user_creatable_requires_machine_allowance

2022-04-07 Thread Edgar E. Iglesias
On Thu, Mar 31, 2022 at 01:53:10PM +0200, Damien Hedde wrote:
> Instead of checking if the device is a sysbus device, just check
> the newly added flag in device class.
> 
> Signed-off-by: Damien Hedde 

Reviewed-by: Edgar E. Iglesias 
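
For readers following the series, the decision that device_add ends up
making can be sketched outside of QEMU. This is a simplified illustration
under stated assumptions: struct dev_class, struct machine_class,
machine_allows() and can_user_create() are hypothetical stand-ins invented
here, not QEMU types or functions, and the list lookup matches exact type
names only, whereas the real check also accepts a type whose parent type is
in the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for DeviceClass. */
struct dev_class {
    const char *type_name;
    bool user_creatable;
    bool user_creatable_requires_machine_allowance;
};

/* Hypothetical, simplified stand-in for MachineClass. */
struct machine_class {
    const char *const *allowed_types;   /* NULL-terminated list of names */
};

/* Exact-name lookup in the machine's allow list (no parent-type walk). */
static bool machine_allows(const struct machine_class *mc, const char *type)
{
    for (size_t i = 0; mc->allowed_types && mc->allowed_types[i]; i++) {
        if (strcmp(mc->allowed_types[i], type) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * The new flag is an additional constraint on top of user_creatable:
 * user_creatable must be true in every case, and flagged devices must
 * also be on the machine's allow list.
 */
static bool can_user_create(const struct machine_class *mc,
                            const struct dev_class *dc)
{
    assert(mc && dc);
    if (!dc->user_creatable) {
        return false;
    }
    if (dc->user_creatable_requires_machine_allowance) {
        return machine_allows(mc, dc->type_name);
    }
    return true;
}
```

The point of the flag is visible in can_user_create(): the test no longer
depends on the device being a sysbus device, only on what its class
declares.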


> ---
> 
> v2: update the flag name
> ---
>  softmmu/qdev-monitor.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
> index 12fe60c467..77f468358d 100644
> --- a/softmmu/qdev-monitor.c
> +++ b/softmmu/qdev-monitor.c
> @@ -258,12 +258,12 @@ static DeviceClass *qdev_get_device_class(const char **driver, Error **errp)
>  return NULL;
>  }
>  
> -if (object_class_dynamic_cast(oc, TYPE_SYS_BUS_DEVICE)) {
> -/* sysbus devices need to be allowed by the machine */
> +if (dc->user_creatable_requires_machine_allowance) {
> +/* some devices need to be allowed by the machine */
>  MachineClass *mc = MACHINE_CLASS(object_get_class(qdev_get_machine()));
> -if (!device_type_is_dynamic_sysbus(mc, *driver)) {
> +if (!device_type_is_dynamic_allowed(mc, *driver)) {
>  error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "driver",
> -   "a dynamic sysbus device type for the machine");
> +   "the device type is not allowed for this machine");
>  return NULL;
>  }
>  }
> -- 
> 2.35.1
> 
> 



Re: [PATCH v2 4/5] rename machine_class_allow_dynamic_sysbus_dev

2022-04-07 Thread Edgar E. Iglesias
On Thu, Mar 31, 2022 at 01:53:11PM +0200, Damien Hedde wrote:
> All callsites are updated to the new function name
> "machine_class_allow_dynamic_device"
> 
> Signed-off-by: Damien Hedde 
> Reviewed-by: Philippe Mathieu-Daudé 

Reviewed-by: Edgar E. Iglesias 


> ---
>  hw/arm/virt.c   | 10 +-
>  hw/i386/microvm.c   |  2 +-
>  hw/i386/pc_piix.c   |  4 ++--
>  hw/i386/pc_q35.c|  8 
>  hw/ppc/e500plat.c   |  2 +-
>  hw/ppc/spapr.c  |  2 +-
>  hw/riscv/virt.c |  2 +-
>  hw/xen/xen-legacy-backend.c |  2 +-
>  8 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index d2e5ecd234..1442b8840b 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2829,12 +2829,12 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
>   * configuration of the particular instance.
>   */
>  mc->max_cpus = 512;
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_CALXEDA_XGMAC);
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_AMD_XGBE);
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE);
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_PLATFORM);
> +machine_class_allow_dynamic_device(mc, TYPE_VFIO_CALXEDA_XGMAC);
> +machine_class_allow_dynamic_device(mc, TYPE_VFIO_AMD_XGBE);
> +machine_class_allow_dynamic_device(mc, TYPE_RAMFB_DEVICE);
> +machine_class_allow_dynamic_device(mc, TYPE_VFIO_PLATFORM);
>  #ifdef CONFIG_TPM
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_TPM_TIS_SYSBUS);
> +machine_class_allow_dynamic_device(mc, TYPE_TPM_TIS_SYSBUS);
>  #endif
>  mc->block_default_type = IF_VIRTIO;
>  mc->no_cdrom = 1;
> diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
> index 4b3b1dd262..4f8f423d31 100644
> --- a/hw/i386/microvm.c
> +++ b/hw/i386/microvm.c
> @@ -756,7 +756,7 @@ static void microvm_class_init(ObjectClass *oc, void *data)
>  MICROVM_MACHINE_AUTO_KERNEL_CMDLINE,
>  "Set off to disable adding virtio-mmio devices to the kernel 
> cmdline");
>  
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE);
> +machine_class_allow_dynamic_device(mc, TYPE_RAMFB_DEVICE);
>  }
>  
>  static const TypeInfo microvm_machine_info = {
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index b72c03d0a6..27373cb16a 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -411,8 +411,8 @@ static void pc_i440fx_machine_options(MachineClass *m)
>  m->desc = "Standard PC (i440FX + PIIX, 1996)";
>  m->default_machine_opts = "firmware=bios-256k.bin";
>  m->default_display = "std";
> -machine_class_allow_dynamic_sysbus_dev(m, TYPE_RAMFB_DEVICE);
> -machine_class_allow_dynamic_sysbus_dev(m, TYPE_VMBUS_BRIDGE);
> +machine_class_allow_dynamic_device(m, TYPE_RAMFB_DEVICE);
> +machine_class_allow_dynamic_device(m, TYPE_VMBUS_BRIDGE);
>  }
>  
>  static void pc_i440fx_7_0_machine_options(MachineClass *m)
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 1780f79bc1..8221615fa4 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -353,10 +353,10 @@ static void pc_q35_machine_options(MachineClass *m)
>  m->default_display = "std";
>  m->default_kernel_irqchip_split = false;
>  m->no_floppy = 1;
> -machine_class_allow_dynamic_sysbus_dev(m, TYPE_AMD_IOMMU_DEVICE);
> -machine_class_allow_dynamic_sysbus_dev(m, TYPE_INTEL_IOMMU_DEVICE);
> -machine_class_allow_dynamic_sysbus_dev(m, TYPE_RAMFB_DEVICE);
> -machine_class_allow_dynamic_sysbus_dev(m, TYPE_VMBUS_BRIDGE);
> +machine_class_allow_dynamic_device(m, TYPE_AMD_IOMMU_DEVICE);
> +machine_class_allow_dynamic_device(m, TYPE_INTEL_IOMMU_DEVICE);
> +machine_class_allow_dynamic_device(m, TYPE_RAMFB_DEVICE);
> +machine_class_allow_dynamic_device(m, TYPE_VMBUS_BRIDGE);
>  m->max_cpus = 288;
>  }
>  
> diff --git a/hw/ppc/e500plat.c b/hw/ppc/e500plat.c
> index fc911bbb7b..273cde9d06 100644
> --- a/hw/ppc/e500plat.c
> +++ b/hw/ppc/e500plat.c
> @@ -102,7 +102,7 @@ static void e500plat_machine_class_init(ObjectClass *oc, void *data)
>  mc->max_cpus = 32;
>  mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("e500v2_v30");
>  mc->default_ram_id = "mpc8544ds.ram";
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_ETSEC_COMMON);
> +machine_class_allow_dynamic_device(mc, TYPE_ETSEC_COMMON);
>   }
>  
>  static const TypeInfo e500plat_info = {
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index a4372ba189..70e12d9037 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4586,7 +4586,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>  mc->default_ram_id = "ppc_spapr.ram";
>  mc->default_display = "std";
>  mc->kvm_type = spapr_kvm_type;
> -machine_class_allow_dynamic_sysbus_dev(mc, TYPE_SPAPR_PCI_HOST_BRIDGE);
> +machine_class_allow_dynamic_device(mc, TYPE_SPAPR_PCI_HOST_BRI

Re: [PATCH v2 5/5] machine: remove temporary inline functions

2022-04-07 Thread Edgar E. Iglesias
On Thu, Mar 31, 2022 at 01:53:12PM +0200, Damien Hedde wrote:
> Now we have renamed all calls to these old functions, we
> can delete the temporary inline we've defined.
> 
> Signed-off-by: Damien Hedde 
> Reviewed-by: Philippe Mathieu-Daudé 

Reviewed-by: Edgar E. Iglesias 


> ---
>  include/hw/boards.h | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 1814793175..7efba048e9 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -52,11 +52,6 @@ void machine_parse_smp_config(MachineState *ms,
>   * it will get an error message.
>   */
>  void machine_class_allow_dynamic_device(MachineClass *mc, const char *type);
> -static inline void machine_class_allow_dynamic_sysbus_dev(MachineClass *mc,
> -  const char *type)
> -{
> -machine_class_allow_dynamic_device(mc, type);
> -}
>  
>  /**
>   * device_type_is_dynamic_allowed: Check if type is an allowed device
> @@ -72,11 +67,6 @@ static inline void machine_class_allow_dynamic_sysbus_dev(MachineClass *mc,
>   * Note that if @type has a parent type in the list, it is allowed too.
>   */
>  bool device_type_is_dynamic_allowed(MachineClass *mc, const char *type);
> -static inline bool device_type_is_dynamic_sysbus(MachineClass *mc,
> - const char *type)
> -{
> -return device_type_is_dynamic_allowed(mc, type);
> -}
>  
>  /**
>   * device_is_dynamic_sysbus: test whether device is a dynamic sysbus device
> -- 
> 2.35.1
> 
> 



Re: [PULL 09/12] virtiofsd: Create new file with security context

2022-04-07 Thread Vivek Goyal
On Thu, Apr 07, 2022 at 01:44:35PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Maydell (peter.mayd...@linaro.org) wrote:
> > On Thu, 17 Feb 2022 at 17:40, Dr. David Alan Gilbert (git)
> >  wrote:
> > >
> > > From: Vivek Goyal 
> > >
> > > This patch adds support for creating new file with security context
> > > as sent by client. It basically takes three paths.
> > >
> > > > - If no security context enabled, then it continues to create files without
> > >   security context.
> > >
> > > > - If security context is enabled but security.selinux has not been
> > >   remapped, then it uses /proc/thread-self/attr/fscreate knob to set
> > >   security context and then create the file. This will make sure that
> > >   newly created file gets the security context as set in "fscreate" and
> > >   this is atomic w.r.t file creation.
> > >
> > > >   This is useful when host and guest SELinux policies don't conflict and
> > >   can work with each other. In that case, guest security.selinux xattr
> > >   is not remapped and it is passthrough as "security.selinux" xattr
> > >   on host.
> > >
> > > - If security context is enabled but security.selinux xattr has been
> > >   remapped to something else, then it first creates the file and then
> > >   uses setxattr() to set the remapped xattr with the security context.
> > >   This is a non-atomic operation w.r.t file creation.
> > >
> > >   This mode will be most versatile and allow host and guest to have their
> > > >   own separate SELinux xattrs and have their own separate SELinux policies.
> > >
> > > Reviewed-by: Dr. David Alan Gilbert 
> > > Signed-off-by: Vivek Goyal 
> > > Message-Id: <20220208204813.682906-9-vgo...@redhat.com>
> > > Signed-off-by: Dr. David Alan Gilbert 
> > 
> > Hi; Coverity reports some issues (CID 1487142, 1487195), because
> > it is not a fan of the error-handling pattern used in this code:
> > 
> > > +static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
> > > > +   const char *name, const char *secctx_name)
> > > +{
> > > +int path_fd, err;
> > > +char procname[64];
> > > +struct lo_data *lo = lo_data(req);
> > > +
> > > +if (!req->secctx.ctxlen) {
> > > +return 0;
> > > +}
> > > +
> > > +/* Open newly created element with O_PATH */
> > > +path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
> > > +err = path_fd == -1 ? errno : 0;
> > > +if (err) {
> > > +return err;
> > > +}
> > 
> > We set err based on whether path_fd is -1 or not, but we decide
> > whether to early-return based on the value of err. Coverity
> > doesn't know that openat() will always set errno to something
> > non-zero if it returns -1, so it complains because it thinks
> > there's a code path where openat() returns -1, but errno is 0,
> > and so we don't take the early-return and instead continue
> > through all the code below to the "close(path_fd)", which
> > should not be being passed a negative value for the file descriptor.
> > 
> > I could just mark these as false-positives, but it does seem a bit
> > odd that we are using two different conditions here. Perhaps it would
> > be better to rephrase? For instance, for the openat() we could write:
> > 
> >path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
> >if (path_fd == -1) {
> >return errno;
> >}
> 
> That looks OK to me; please send a patch.
> 
> Some of the cases look like they need to just be a little careful that
> 'err' always gets set to 0 if there are later cases that might set err.

I think the pattern of using "err" to save errno is there because in some
cases we can't return immediately after error. Instead we have to
take some actions to restore some state and then return.

So for this specific case, it looks fine because we don't have to
restore any state before returning.
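
The distinction between the two cases can be sketched as follows. This is
a minimal illustration and not virtiofsd code: create_in_dir() is a
hypothetical helper invented here. The first failure has nothing to undo,
so it returns errno directly as suggested in the review; the second failure
happens while a directory fd is still open, so errno is saved in "err"
before the cleanup runs, because close() may overwrite it.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) "name" inside "dirpath".
 * Returns 0 on success or an errno value on failure.
 */
static int create_in_dir(const char *dirpath, const char *name)
{
    int dir_fd, fd, err = 0;

    assert(dirpath && name);

    dir_fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dir_fd == -1) {
        /* Nothing to restore yet: test the return value, return errno. */
        return errno;
    }

    fd = openat(dir_fd, name, O_WRONLY | O_CREAT, 0600);
    if (fd == -1) {
        /* Cleanup still has to run, so save errno first. */
        err = errno;
    } else {
        close(fd);
    }

    /* State restore happens on both the success and the failure path. */
    close(dir_fd);
    return err;
}
```

In both branches the decision is made on the return value of the call, not
on the saved errno, which is the property the static analyser was asking
for.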

Vivek
> 
> Dave
> 
> > and similarly for the openat() in open_set_proc_fscreate().
> > 
> > > +sprintf(procname, "%i", path_fd);
> > > +FCHDIR_NOFAIL(lo->proc_self_fd);
> > > +/* Set security context. This is not atomic w.r.t file creation */
> > > > +err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
> > > +   0);
> > > +if (err) {
> > > +err = errno;
> > > +}
> > 
> > > +FCHDIR_NOFAIL(lo->root.fd);
> > > +close(path_fd);
> > > +return err;
> > > +}
> > 
> > thanks
> > -- PMM
> > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 




Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Kirill A. Shutemov
On Wed, Apr 06, 2022 at 10:34:41PM -0500, Richard Henderson wrote:
> On 4/6/22 20:01, Kirill A. Shutemov wrote:
> > Linear Address Masking feature makes CPU ignore some bits of the virtual
> > address. These bits can be used to encode metadata.
> > 
> > The feature is enumerated with CPUID.(EAX=07H, ECX=01H):EAX.LAM[bit 26].
> > 
> > CR3.LAM_U57[bit 62] allows to encode 6 bits of metadata in bits 62:57 of
> > user pointers.
> > 
> > CR3.LAM_U48[bit 61] allows to encode 15 bits of metadata in bits 62:48
> > of user pointers.
> > 
> > CR4.LAM_SUP[bit 28] allows to encode metadata of supervisor pointers.
> > If 5-level paging is in use, 6 bits of metadata can be encoded in 62:57.
> > For 4-level paging, 15 bits of metadata can be encoded in bits 62:48.
> > 
> > QEMU strips address from the metadata bits and gets it to canonical
> > shape before handling memory access. It has to be done very early before
> > TLB lookup.
> 
> The new hook is incorrect, in that it doesn't apply to addresses along
> the tlb fast path.

I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
the tag bits before tlb lookup.

Could you elaborate?
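
For reference, the stripping of metadata bits described in the quoted
commit message can be sketched like this. It is only an illustration under
stated assumptions, not the code from the patch: lam_strip() is a
hypothetical name, and the sketch assumes that "canonical shape" means bits
62:lsb become copies of bit 63, with lsb = 57 for the 6-bit case and
lsb = 48 for the 15-bit case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Replace the metadata bits 62:lsb of addr with copies of bit 63.
 * lsb is 57 (6 metadata bits) or 48 (15 metadata bits) in the cases
 * described above. Bit 63 itself is left untouched.
 */
static uint64_t lam_strip(uint64_t addr, unsigned lsb)
{
    uint64_t mask;

    assert(lsb > 0 && lsb < 63);

    /* Mask covering bits 62:lsb. */
    mask = ((UINT64_C(1) << 63) - 1) & ~((UINT64_C(1) << lsb) - 1);

    if (addr >> 63) {
        return addr | mask;    /* supervisor pointer: fill with ones */
    }
    return addr & ~mask;       /* user pointer: fill with zeroes */
}
```

A pointer that carries no metadata passes through unchanged, so applying
the function before every lookup does not disturb untagged accesses.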

> But it isn't really needed.  You can do all of the work in the existing
> tlb_fill hook. AArch64 has a similar feature, and that works fine.

To be honest I don't fully understand how TBI emulation works.

Consider store_helper(). I failed to find where the tag bits get stripped
before getting there for !CONFIG_USER_ONLY. clean_data_tbi() only covers
the user-only case.

And if we get there with tags, I don't see how we will ever hit the fast
path: tlb_hit() should never return true if any bit in the top byte is
set, because the cached tlb_addr has them stripped.

tlb_fill() will handle it correctly, but it is wasteful to go through a
page walk on every tagged pointer dereference.

Hm?
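
The concern can be shown with a toy version of the comparison. This is a
simplified model under stated assumptions: 4 KiB pages, a cached entry that
holds the untagged page address, and no flag bits (the real tlb_hit() also
has to account for flag bits kept in tlb_addr).

```python
# Toy model of the TLB fast-path comparison; not QEMU code.
PAGE_MASK = ~0xFFF & (2**64 - 1)  # assumption: 4 KiB pages


def tlb_hit(tlb_addr, addr):
    """Fast-path check: does addr fall on the cached page?"""
    return (addr & PAGE_MASK) == tlb_addr


cached_page = 0x00007F1234567000        # entry filled with an untagged address
plain_ptr = 0x00007F1234567ABC
tagged_ptr = plain_ptr | (0xAB << 56)   # hypothetical top-byte tag

hit_plain = tlb_hit(cached_page, plain_ptr)
hit_tagged = tlb_hit(cached_page, tagged_ptr)
```

In this model the tagged pointer always misses, so every tagged access would
fall through to the slow path unless the tag is removed before the lookup.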

-- 
 Kirill A. Shutemov



[PATCH v4 5/7] block/block-copy: block_copy(): add timeout_ns parameter

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
Add the possibility to limit the duration of a block_copy() call. To be
used in the next commit.

As a timed-out block_copy() call will continue in the background anyway
(we can't immediately cancel an IO operation), it's also important to give
the user a way to pass a callback that does some additional actions when
the block-copy call finishes.
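
The intended behaviour can be modelled outside QEMU. The sketch below uses
Python's asyncio rather than QEMU coroutines, so every name in it is an
assumption made for illustration; it only shows the shape of the contract:
the caller gets -ETIMEDOUT, the work keeps running, and the callback fires
when the work really ends.

```python
import asyncio
import errno


async def call_with_timeout(work, timeout_s, on_finish):
    """Wait for work() at most timeout_s seconds without cancelling it."""
    task = asyncio.ensure_future(work())
    # Runs when the work really ends, whether or not the caller timed out.
    task.add_done_callback(lambda _task: on_finish())
    done, _pending = await asyncio.wait({task}, timeout=timeout_s)
    if task in done:
        return task.result()
    return -errno.ETIMEDOUT  # the task keeps running in the background


async def demo():
    finished = []

    async def slow_copy():
        await asyncio.sleep(0.3)  # stands in for a slow copy operation
        return 0

    ret = await call_with_timeout(slow_copy, 0.01,
                                  lambda: finished.append(True))
    timed_out_first = (ret == -errno.ETIMEDOUT and not finished)
    await asyncio.sleep(0.6)  # give the background work time to complete
    return timed_out_first, bool(finished)
```

Running `asyncio.run(demo())` reports whether the caller timed out before the
work finished and whether the callback was called afterwards.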

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 include/block/block-copy.h |  4 +++-
 block/block-copy.c | 33 ++---
 block/copy-before-write.c  |  2 +-
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index 68bbd344b2..ba0b425d78 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -40,7 +40,9 @@ int64_t block_copy_reset_unallocated(BlockCopyState *s,
  int64_t offset, int64_t *count);
 
 int coroutine_fn block_copy(BlockCopyState *s, int64_t offset, int64_t bytes,
-bool ignore_ratelimit);
+bool ignore_ratelimit, uint64_t timeout_ns,
+BlockCopyAsyncCallbackFunc cb,
+void *cb_opaque);
 
 /*
  * Run block-copy in a coroutine, create corresponding BlockCopyCallState
diff --git a/block/block-copy.c b/block/block-copy.c
index ec46775ea5..bb947afdda 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -883,23 +883,42 @@ static int coroutine_fn block_copy_common(BlockCopyCallState *call_state)
 return ret;
 }
 
+static void coroutine_fn block_copy_async_co_entry(void *opaque)
+{
+block_copy_common(opaque);
+}
+
 int coroutine_fn block_copy(BlockCopyState *s, int64_t start, int64_t bytes,
-bool ignore_ratelimit)
+bool ignore_ratelimit, uint64_t timeout_ns,
+BlockCopyAsyncCallbackFunc cb,
+void *cb_opaque)
 {
-BlockCopyCallState call_state = {
+int ret;
+BlockCopyCallState *call_state = g_new(BlockCopyCallState, 1);
+
+*call_state = (BlockCopyCallState) {
 .s = s,
 .offset = start,
 .bytes = bytes,
 .ignore_ratelimit = ignore_ratelimit,
 .max_workers = BLOCK_COPY_MAX_WORKERS,
+.cb = cb,
+.cb_opaque = cb_opaque,
 };
 
-return block_copy_common(&call_state);
-}
+ret = qemu_co_timeout(block_copy_async_co_entry, call_state, timeout_ns,
+  g_free);
+if (ret < 0) {
+assert(ret == -ETIMEDOUT);
+block_copy_call_cancel(call_state);
+/* call_state will be freed by running coroutine. */
+return ret;
+}
 
-static void coroutine_fn block_copy_async_co_entry(void *opaque)
-{
-block_copy_common(opaque);
+ret = call_state->ret;
+g_free(call_state);
+
+return ret;
 }
 
 BlockCopyCallState *block_copy_async(BlockCopyState *s,
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index c8a11a09d2..fc13c7cd44 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -111,7 +111,7 @@ static coroutine_fn int cbw_do_copy_before_write(BlockDriverState *bs,
 off = QEMU_ALIGN_DOWN(offset, cluster_size);
 end = QEMU_ALIGN_UP(offset + bytes, cluster_size);
 
-ret = block_copy(s->bcs, off, end - off, true);
+ret = block_copy(s->bcs, off, end - off, true, 0, NULL, NULL);
 if (ret < 0 && s->on_cbw_error == ON_CBW_ERROR_BREAK_GUEST_WRITE) {
 return ret;
 }
-- 
2.35.1




[PATCH v4 0/7] copy-before-write: on-cbw-error and cbw-timeout

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
Hi all!

v4: Now based on master
01: add assertion and r-b
02: s/7.0/7.1/ and r-b
03: switch to QEMUMachine, touch-up pylintrc,  drop r-b
04,05,06: add r-b
07: switch to QEMUMachine


Here are two new options for copy-before-write filter:

on-cbw-error alters the behavior on copy-before-write operation failure:
instead of breaking the guest write, break the snapshot (and therefore
the backup process).

cbw-timeout limits the duration of a cbw operation.

So, for example, using cbw-timeout=60 and on-cbw-error=break-snapshot
you can be sure that a guest write will not be stuck for more than 60
seconds and will never fail due to backup problems.
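
As a sketch of that example, the blockdev-add arguments could be built like
this. The node layout follows the iotest added later in this series; the file
paths and the helper name are placeholders, and qcow2 is assumed for both
images.

```python
import json


def cbw_blockdev_add_args(source, target, timeout_s, on_error):
    """Build blockdev-add arguments for a copy-before-write filter node."""
    return {
        'node-name': 'cbw',
        'driver': 'copy-before-write',
        'on-cbw-error': on_error,
        'cbw-timeout': timeout_s,  # seconds; zero means no limit
        'file': {
            'driver': 'qcow2',
            'file': {'driver': 'file', 'filename': source},
        },
        'target': {
            'driver': 'qcow2',
            'file': {'driver': 'file', 'filename': target},
        },
    }


# Placeholder paths, not taken from the series.
args = cbw_blockdev_add_args('/path/to/source.qcow2', '/path/to/temp.qcow2',
                             60, 'break-snapshot')
command = json.dumps({'execute': 'blockdev-add', 'arguments': args}, indent=2)
```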

Vladimir Sementsov-Ogievskiy (7):
  block/copy-before-write: refactor option parsing
  block/copy-before-write: add on-cbw-error open parameter
  iotests: add copy-before-write: on-cbw-error tests
  util: add qemu-co-timeout
  block/block-copy: block_copy(): add timeout_ns parameter
  block/copy-before-write: implement cbw-timeout option
  iotests: copy-before-write: add cases for cbw-timeout option

 qapi/block-core.json  |  31 ++-
 include/block/block-copy.h|   4 +-
 include/qemu/coroutine.h  |  13 ++
 block/block-copy.c|  33 ++-
 block/copy-before-write.c | 111 ++---
 util/qemu-co-timeout.c|  89 
 tests/qemu-iotests/pylintrc   |   5 +
 tests/qemu-iotests/tests/copy-before-write| 213 ++
 .../qemu-iotests/tests/copy-before-write.out  |   5 +
 util/meson.build  |   1 +
 10 files changed, 466 insertions(+), 39 deletions(-)
 create mode 100644 util/qemu-co-timeout.c
 create mode 100755 tests/qemu-iotests/tests/copy-before-write
 create mode 100644 tests/qemu-iotests/tests/copy-before-write.out

-- 
2.35.1




[PATCH v4 1/7] block/copy-before-write: refactor option parsing

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
We are going to add one more option of enum type. Let's refactor option
parsing so that we can simply work with a BlockdevOptionsCbw object.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 block/copy-before-write.c | 56 ---
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index a8a06fdc09..e29c46cd7a 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -24,6 +24,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/qmp/qjson.h"
 
 #include "sysemu/block-backend.h"
 #include "qemu/cutils.h"
@@ -328,46 +329,34 @@ static void cbw_child_perm(BlockDriverState *bs, BdrvChild *c,
 }
 }
 
-static bool cbw_parse_bitmap_option(QDict *options, BdrvDirtyBitmap **bitmap,
-Error **errp)
+static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp)
 {
-QDict *bitmap_qdict = NULL;
-BlockDirtyBitmap *bmp_param = NULL;
+BlockdevOptions *opts = NULL;
 Visitor *v = NULL;
-bool ret = false;
 
-*bitmap = NULL;
+qdict_put_str(options, "driver", "copy-before-write");
 
-qdict_extract_subqdict(options, &bitmap_qdict, "bitmap.");
-if (!qdict_size(bitmap_qdict)) {
-ret = true;
-goto out;
-}
-
-v = qobject_input_visitor_new_flat_confused(bitmap_qdict, errp);
+v = qobject_input_visitor_new_flat_confused(options, errp);
 if (!v) {
 goto out;
 }
 
-visit_type_BlockDirtyBitmap(v, NULL, &bmp_param, errp);
-if (!bmp_param) {
+visit_type_BlockdevOptions(v, NULL, &opts, errp);
+if (!opts) {
 goto out;
 }
 
-*bitmap = block_dirty_bitmap_lookup(bmp_param->node, bmp_param->name, NULL,
-errp);
-if (!*bitmap) {
-goto out;
-}
-
-ret = true;
+/*
+ * Delete options which we are going to parse through BlockdevOptions
+ * object for original options.
+ */
+qdict_extract_subqdict(options, NULL, "bitmap");
 
 out:
-qapi_free_BlockDirtyBitmap(bmp_param);
 visit_free(v);
-qobject_unref(bitmap_qdict);
+qdict_del(options, "driver");
 
-return ret;
+return opts;
 }
 
 static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
@@ -376,6 +365,15 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 BDRVCopyBeforeWriteState *s = bs->opaque;
 BdrvDirtyBitmap *bitmap = NULL;
 int64_t cluster_size;
+g_autoptr(BlockdevOptions) full_opts = NULL;
+BlockdevOptionsCbw *opts;
+
+full_opts = cbw_parse_options(options, errp);
+if (!full_opts) {
+return -EINVAL;
+}
+assert(full_opts->driver == BLOCKDEV_DRIVER_COPY_BEFORE_WRITE);
+opts = &full_opts->u.copy_before_write;
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -390,8 +388,12 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 return -EINVAL;
 }
 
-if (!cbw_parse_bitmap_option(options, &bitmap, errp)) {
-return -EINVAL;
+if (opts->has_bitmap) {
+bitmap = block_dirty_bitmap_lookup(opts->bitmap->node,
+   opts->bitmap->name, NULL, errp);
+if (!bitmap) {
+return -EINVAL;
+}
 }
 
 bs->total_sectors = bs->file->bs->total_sectors;
-- 
2.35.1




[PATCH v4 3/7] iotests: add copy-before-write: on-cbw-error tests

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
Add tests for the new option of the copy-before-write filter: on-cbw-error.

Note that we use QEMUMachine instead of the VM class, because in a later
commit we'll want to use throttling, which doesn't work with the -accel
qtest used by VM.

We also touch pylintrc so as not to break iotest 297.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/pylintrc   |   5 +
 tests/qemu-iotests/tests/copy-before-write| 132 ++
 .../qemu-iotests/tests/copy-before-write.out  |   5 +
 3 files changed, 142 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/copy-before-write
 create mode 100644 tests/qemu-iotests/tests/copy-before-write.out

diff --git a/tests/qemu-iotests/pylintrc b/tests/qemu-iotests/pylintrc
index 32ab77b8bb..f4f823a991 100644
--- a/tests/qemu-iotests/pylintrc
+++ b/tests/qemu-iotests/pylintrc
@@ -51,3 +51,8 @@ notes=FIXME,
 
 # Maximum number of characters on a single line.
 max-line-length=79
+
+
+[SIMILARITIES]
+
+min-similarity-lines=6
diff --git a/tests/qemu-iotests/tests/copy-before-write b/tests/qemu-iotests/tests/copy-before-write
new file mode 100755
index 00..6c7638965e
--- /dev/null
+++ b/tests/qemu-iotests/tests/copy-before-write
@@ -0,0 +1,132 @@
+#!/usr/bin/env python3
+# group: auto backup
+#
+# Copyright (c) 2022 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+import re
+
+from qemu.machine import QEMUMachine
+
+import iotests
+from iotests import qemu_img_create, qemu_io
+
+
+temp_img = os.path.join(iotests.test_dir, 'temp')
+source_img = os.path.join(iotests.test_dir, 'source')
+size = '1M'
+
+
+class TestCbwError(iotests.QMPTestCase):
+def tearDown(self):
+self.vm.shutdown()
+os.remove(temp_img)
+os.remove(source_img)
+
+def setUp(self):
+qemu_img_create('-f', iotests.imgfmt, source_img, size)
+qemu_img_create('-f', iotests.imgfmt, temp_img, size)
+qemu_io('-c', 'write 0 1M', source_img)
+
+self.vm = QEMUMachine(iotests.qemu_prog)
+self.vm.launch()
+
+def do_cbw_error(self, on_cbw_error):
+result = self.vm.qmp('blockdev-add', {
+'node-name': 'cbw',
+'driver': 'copy-before-write',
+'on-cbw-error': on_cbw_error,
+'file': {
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': source_img,
+}
+},
+'target': {
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'blkdebug',
+'image': {
+'driver': 'file',
+'filename': temp_img
+},
+'inject-error': [
+{
+'event': 'write_aio',
+'errno': 5,
+'immediately': False,
+'once': True
+}
+]
+}
+}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('blockdev-add', {
+'node-name': 'access',
+'driver': 'snapshot-access',
+'file': 'cbw'
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('human-monitor-command',
+ command_line='qemu-io cbw "write 0 1M"')
+self.assert_qmp(result, 'return', '')
+
+result = self.vm.qmp('human-monitor-command',
+ command_line='qemu-io access "read 0 1M"')
+self.assert_qmp(result, 'return', '')
+
+self.vm.shutdown()
+log = self.vm.get_log()
+log = re.sub(r'^\[I \d+\.\d+\] OPENED\n', '', log)
+log = re.sub(r'\[I \+\d+\.\d+\] CLOSED\n?$', '', log)
+log = iotests.filter_qemu_io(log)
+return log
+
+def test_break_snapshot_on_cbw_error(self):
+"""break-snapshot behavior:
+Guest write succeeds, but further snapshot-read fails, as snapshot is
+broken.
+"""
+log = self.do_cbw_error('break-snapshot')
+
+self.assertEqual(log, """\
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read failed: Permission denied

[PATCH v4 2/7] block/copy-before-write: add on-cbw-error open parameter

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
Currently, the behavior on copy-before-write operation failure is simple:
report the error to the guest.

Let's implement an alternative behavior: break the whole copy-before-write
process (and the corresponding backup job or NBD client) but keep the guest
working. It's needed if we consider guest stability more important.

The implementation is simple: on copy-before-write failure we set
s->snapshot_error and continue guest operations. Once s->snapshot_error is
set, all further snapshot-API requests fail. Note that all
in-flight snapshot-API requests may still succeed: we do wait for them
on the BREAK_SNAPSHOT failure path in cbw_do_copy_before_write().
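
The two policies can be summarised in a small state model. This is a sketch
rather than the driver code: the class and method names are invented, and the
-EACCES result for snapshot reads is an assumption based on the
"Permission denied" line expected by the iotest in this series.

```python
import errno


class CbwModel:
    """Toy model of the on-cbw-error policies; not the QEMU implementation."""

    def __init__(self, on_cbw_error):
        self.on_cbw_error = on_cbw_error
        self.snapshot_error = 0  # first copy error under break-snapshot

    def guest_write(self, copy_ret):
        """Return what the guest sees when the copy step returned copy_ret."""
        if self.snapshot_error:
            return 0  # snapshot is already broken, nothing left to protect
        if copy_ret < 0 and self.on_cbw_error == 'break-guest-write':
            return copy_ret  # the guest write fails, the snapshot stays valid
        if copy_ret < 0:
            self.snapshot_error = copy_ret  # remember only the first error
        return 0

    def snapshot_read(self):
        """Snapshot-API reads fail once the snapshot is broken."""
        return -errno.EACCES if self.snapshot_error else 0
```

With 'break-snapshot', a failed copy leaves the guest write successful and
makes later snapshot reads fail; with 'break-guest-write', the guest sees the
error and the snapshot stays readable.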

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 qapi/block-core.json  | 25 -
 block/copy-before-write.c | 32 ++--
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index beeb91952a..6b870b2f37 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4163,6 +4163,25 @@
   'base': 'BlockdevOptionsGenericFormat',
   'data': { '*bottom': 'str' } }
 
+##
+# @OnCbwError:
+#
+# An enumeration of possible behaviors for copy-before-write operation
+# failures.
+#
+# @break-guest-write: report the error to the guest. This way, the guest
+# will not be able to overwrite areas that cannot be
+# backed up, so the backup process remains valid.
+#
+# @break-snapshot: continue guest write. Doing so will make the provided
+#  snapshot state invalid and any backup or export
+#  process based on it will finally fail.
+#
+# Since: 7.1
+##
+{ 'enum': 'OnCbwError',
+  'data': [ 'break-guest-write', 'break-snapshot' ] }
+
 ##
 # @BlockdevOptionsCbw:
 #
@@ -4184,11 +4203,15 @@
 #  modifications (or removing) of specified bitmap doesn't
 #  influence the filter. (Since 7.0)
 #
+# @on-cbw-error: Behavior on failure of copy-before-write operation.
+#Default is @break-guest-write. (Since 7.1)
+#
 # Since: 6.2
 ##
 { 'struct': 'BlockdevOptionsCbw',
   'base': 'BlockdevOptionsGenericFormat',
-  'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap' } }
+  'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap',
+'*on-cbw-error': 'OnCbwError' } }
 
 ##
 # @BlockdevOptions:
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index e29c46cd7a..c8a11a09d2 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -41,6 +41,7 @@
 typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
+OnCbwError on_cbw_error;
 
 /*
  * @lock: protects access to @access_bitmap, @done_bitmap and
@@ -65,6 +66,14 @@ typedef struct BDRVCopyBeforeWriteState {
  * node. These areas must not be rewritten by guest.
  */
 BlockReqList frozen_read_reqs;
+
+/*
+ * @snapshot_error is normally zero. But on first copy-before-write failure
+ * when @on_cbw_error == ON_CBW_ERROR_BREAK_SNAPSHOT, @snapshot_error takes
+ * value of this error (<0). After that all in-flight and further
+ * snapshot-API requests will fail with that error.
+ */
+int snapshot_error;
 } BDRVCopyBeforeWriteState;
 
 static coroutine_fn int cbw_co_preadv(
@@ -95,16 +104,27 @@ static coroutine_fn int cbw_do_copy_before_write(BlockDriverState *bs,
 return 0;
 }
 
+if (s->snapshot_error) {
+return 0;
+}
+
 off = QEMU_ALIGN_DOWN(offset, cluster_size);
 end = QEMU_ALIGN_UP(offset + bytes, cluster_size);
 
 ret = block_copy(s->bcs, off, end - off, true);
-if (ret < 0) {
+if (ret < 0 && s->on_cbw_error == ON_CBW_ERROR_BREAK_GUEST_WRITE) {
 return ret;
 }
 
 WITH_QEMU_LOCK_GUARD(&s->lock) {
-bdrv_set_dirty_bitmap(s->done_bitmap, off, end - off);
+if (ret < 0) {
+assert(s->on_cbw_error == ON_CBW_ERROR_BREAK_SNAPSHOT);
+if (!s->snapshot_error) {
+s->snapshot_error = ret;
+}
+} else {
+bdrv_set_dirty_bitmap(s->done_bitmap, off, end - off);
+}
 reqlist_wait_all(&s->frozen_read_reqs, off, end - off, &s->lock);
 }
 
@@ -176,6 +196,11 @@ static BlockReq *cbw_snapshot_read_lock(BlockDriverState *bs,
 
 QEMU_LOCK_GUARD(&s->lock);
 
+if (s->snapshot_error) {
+g_free(req);
+return NULL;
+}
+
 if (bdrv_dirty_bitmap_next_zero(s->access_bitmap, offset, bytes) != -1) {
 g_free(req);
 return NULL;
@@ -351,6 +376,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp)
  * object for original options.
  */
 qdict_extract_subqdict(options, NULL, "bitmap");
+qdict_del(options, "on-cbw-error");
 
 out:
 visit_free(v);
@@ -395,6 +421,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 return -EINVAL;

[PATCH v4 6/7] block/copy-before-write: implement cbw-timeout option

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
In some scenarios, when a copy-before-write operation takes too long,
it's better to cancel it.

The new option is most useful together with
on-cbw-error=break-snapshot: this way, if a cbw operation takes too long,
we'll just cancel the backup process but not disturb the guest too
much.

Note the tricky point of the implementation: we keep an additional
reference in bs->in_flight during the block_copy operation even if it has
timed out. Background "cancelled" block_copy operations will finish at some
point and will want to access the state. We must take care not to free the
state in .bdrv_close() before that.
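
One detail worth checking is the unit conversion: the option is given in
seconds, while block_copy() takes nanoseconds. The arithmetic below (plain
Python, not QEMU code; the helper names are made up) shows why the nanosecond
value needs a 64-bit type.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000
UINT32_MAX = 2**32 - 1


def timeout_ns(timeout_s):
    """Convert a timeout in seconds to nanoseconds."""
    return timeout_s * NANOSECONDS_PER_SECOND


def truncate_u32(value):
    """Model storing a value into a 32-bit unsigned field."""
    return value & UINT32_MAX


# Largest whole number of seconds whose nanosecond value fits in 32 bits.
max_seconds_in_u32 = UINT32_MAX // NANOSECONDS_PER_SECOND

sixty_seconds_ns = timeout_ns(60)
sixty_seconds_in_u32 = truncate_u32(sixty_seconds_ns)
```

A timeout of 60 seconds, as in the cover letter example, does not survive a
32-bit field, so the state should hold the value in 64 bits.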

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 qapi/block-core.json  |  8 +++-
 block/copy-before-write.c | 23 ++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 6b870b2f37..682b599a4a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4206,12 +4206,18 @@
 # @on-cbw-error: Behavior on failure of copy-before-write operation.
 #Default is @break-guest-write. (Since 7.1)
 #
+# @cbw-timeout: Zero means no limit. Non-zero sets the timeout in seconds
+#   for copy-before-write operation. When a timeout occurs,
+#   the respective copy-before-write operation will fail, and
+#   the @on-cbw-error parameter will decide how this failure
+#   is handled. Default 0. (Since 7.1)
+#
 # Since: 6.2
 ##
 { 'struct': 'BlockdevOptionsCbw',
   'base': 'BlockdevOptionsGenericFormat',
   'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap',
-'*on-cbw-error': 'OnCbwError' } }
+'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } }
 
 ##
 # @BlockdevOptions:
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index fc13c7cd44..1bc2e7f9ba 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -42,6 +42,7 @@ typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
 OnCbwError on_cbw_error;
+uint64_t cbw_timeout_ns;
 
 /*
  * @lock: protects access to @access_bitmap, @done_bitmap and
@@ -83,6 +84,14 @@ static coroutine_fn int cbw_co_preadv(
 return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
+static void block_copy_cb(void *opaque)
+{
+BlockDriverState *bs = opaque;
+
+bs->in_flight--;
+aio_wait_kick();
+}
+
 /*
  * Do copy-before-write operation.
  *
@@ -111,7 +120,16 @@ static coroutine_fn int cbw_do_copy_before_write(BlockDriverState *bs,
 off = QEMU_ALIGN_DOWN(offset, cluster_size);
 end = QEMU_ALIGN_UP(offset + bytes, cluster_size);
 
-ret = block_copy(s->bcs, off, end - off, true, 0, NULL, NULL);
+/*
+ * Increase in_flight, so that in case of a timed-out block-copy, the
+ * remaining background block_copy() request (which can't be immediately
+ * cancelled by timeout) is accounted for in bs->in_flight. This way we are
+ * sure that on bs close() we'll first wait for all timed-out but still
+ * running block_copy calls.
+ */
+bs->in_flight++;
+ret = block_copy(s->bcs, off, end - off, true, s->cbw_timeout_ns,
+ block_copy_cb, bs);
 if (ret < 0 && s->on_cbw_error == ON_CBW_ERROR_BREAK_GUEST_WRITE) {
 return ret;
 }
@@ -377,6 +395,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp)
  */
 qdict_extract_subqdict(options, NULL, "bitmap");
 qdict_del(options, "on-cbw-error");
+qdict_del(options, "cbw-timeout");
 
 out:
 visit_free(v);
@@ -423,6 +442,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 }
 s->on_cbw_error = opts->has_on_cbw_error ? opts->on_cbw_error :
 ON_CBW_ERROR_BREAK_GUEST_WRITE;
+s->cbw_timeout_ns = opts->has_cbw_timeout ?
+opts->cbw_timeout * NANOSECONDS_PER_SECOND : 0;
 
 bs->total_sectors = bs->file->bs->total_sectors;
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
-- 
2.35.1




[PATCH v4 4/7] util: add qemu-co-timeout

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
Add a new API to make a time-limited call of a coroutine.
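
The ownership rule for the shared state is the subtle part: whichever side
finishes second frees it. The model below is a sketch of that handshake only
(plain Python, no coroutines; the class and function names are invented), with
`marker` playing the same role as the `marker` field in the patch.

```python
import errno


class TimeoutState:
    """Shared state between the caller and the entry coroutine (model)."""

    def __init__(self, clean=None):
        self.marker = False  # set by whichever side finishes first
        self.freed = False
        self.clean = clean


def entry_finished(state, opaque):
    """Model of what happens when the entry function returns."""
    if state.marker:
        # The caller already timed out: run the cleanup and free the state.
        if state.clean:
            state.clean(opaque)
        state.freed = True
    else:
        state.marker = True  # the caller gets woken and frees the state


def caller_resumed(state):
    """Model of the caller after its sleep ends (timer or early wake-up)."""
    if state.marker:
        state.freed = True   # entry finished in time
        return 0
    state.marker = True      # entry still running; it frees the state later
    return -errno.ETIMEDOUT
```

In the in-time order (`entry_finished` then `caller_resumed`) the caller frees
the state and the cleanup is never called; in the timeout order the caller
returns -ETIMEDOUT and the entry side cleans up when it eventually finishes.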

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 include/qemu/coroutine.h | 13 ++
 util/qemu-co-timeout.c   | 89 
 util/meson.build |  1 +
 3 files changed, 103 insertions(+)
 create mode 100644 util/qemu-co-timeout.c

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index c828a95ee0..8704b05da8 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -316,6 +316,19 @@ static inline void coroutine_fn qemu_co_sleep_ns(QEMUClockType type, int64_t ns)
 qemu_co_sleep_ns_wakeable(&w, type, ns);
 }
 
+typedef void CleanupFunc(void *opaque);
+/**
+ * Run entry in a coroutine and start a timer. Wait for entry to finish or
+ * for the timer to elapse, whichever happens first. If entry finished,
+ * return 0; if the timer elapsed earlier, return -ETIMEDOUT.
+ *
+ * Be careful, entry execution is not canceled, user should handle it somehow.
+ * If @clean is provided, it's called after coroutine finish if timeout
+ * happened.
+ */
+int coroutine_fn qemu_co_timeout(CoroutineEntry *entry, void *opaque,
+ uint64_t timeout_ns, CleanupFunc clean);
+
 /**
  * Wake a coroutine if it is sleeping in qemu_co_sleep_ns. The timer will be
  * deleted. @sleep_state must be the variable whose address was given to
diff --git a/util/qemu-co-timeout.c b/util/qemu-co-timeout.c
new file mode 100644
index 00..00cd335649
--- /dev/null
+++ b/util/qemu-co-timeout.c
@@ -0,0 +1,89 @@
+/*
+ * Helper functionality for making a time-limited call of a coroutine: run
+ * the entry point and return -ETIMEDOUT if it has not finished in time.
+ *
+ * Copyright (c) 2022 Virtuozzo International GmbH
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/coroutine.h"
+#include "block/aio.h"
+
+typedef struct QemuCoTimeoutState {
+CoroutineEntry *entry;
+void *opaque;
+QemuCoSleep sleep_state;
+bool marker;
+CleanupFunc *clean;
+} QemuCoTimeoutState;
+
+static void coroutine_fn qemu_co_timeout_entry(void *opaque)
+{
+QemuCoTimeoutState *s = opaque;
+
+s->entry(s->opaque);
+
+if (s->marker) {
+assert(!s->sleep_state.to_wake);
+/* .marker set by qemu_co_timeout(), which has already timed out */
+if (s->clean) {
+s->clean(s->opaque);
+}
+g_free(s);
+} else {
+s->marker = true;
+qemu_co_sleep_wake(&s->sleep_state);
+}
+}
+
+int coroutine_fn qemu_co_timeout(CoroutineEntry *entry, void *opaque,
+ uint64_t timeout_ns, CleanupFunc clean)
+{
+QemuCoTimeoutState *s;
+Coroutine *co;
+
+if (timeout_ns == 0) {
+entry(opaque);
+return 0;
+}
+
+s = g_new(QemuCoTimeoutState, 1);
+*s = (QemuCoTimeoutState) {
+.entry = entry,
+.opaque = opaque,
+.clean = clean
+};
+
+co = qemu_coroutine_create(qemu_co_timeout_entry, s);
+
+aio_co_enter(qemu_get_current_aio_context(), co);
+qemu_co_sleep_ns_wakeable(&s->sleep_state, QEMU_CLOCK_REALTIME, timeout_ns);
+
+if (s->marker) {
+/* .marker set by qemu_co_timeout_entry, success */
+g_free(s);
+return 0;
+}
+
+/* Don't free s, as we can't cancel qemu_co_timeout_entry execution */
+s->marker = true;
+return -ETIMEDOUT;
+}
diff --git a/util/meson.build b/util/meson.build
index f6ee74ad0c..249891db72 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -83,6 +83,7 @@ if have_block
   util_ss.add(files('block-helpers.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
+  util_ss.add(files('qemu-co-timeout.c'))
   util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
   util_ss.add(files('readline.c'))
   util_ss.add(files('throttle.c'))
-- 
2.35.1




[PATCH 2/3] libvhost-user: Fix extra vu_add/rem_mem_reg reply

2022-04-07 Thread Kevin Wolf
Outside of postcopy mode, neither VHOST_USER_ADD_MEM_REG nor
VHOST_USER_REM_MEM_REG is supposed to send a reply unless explicitly
requested with the need_reply flag. Their current implementation always
sends a reply, even if it isn't requested. This confuses the master
because it will interpret the reply as the reply to the next message for
which it actually expects one.

need_reply is already handled correctly by vu_dispatch(), so just don't
send a reply in the non-postcopy part of the message handler for these
two commands.
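
The desynchronisation can be reproduced with a toy message loop. This is a
simplified model, not libvhost-user code: handlers return a reply payload or
None instead of a bool, the message dictionaries are invented, and the
feature value is arbitrary.

```python
from collections import deque


def dispatch(handler, msg, wire):
    """Model of the dispatcher: send the handler's reply, or a zero ack if
    the handler prepared none but the message carries need_reply."""
    reply = handler(msg)
    if reply is None and msg.get('need_reply'):
        reply = 0
    if reply is not None:
        wire.append((msg['request'], reply))


def add_mem_reg_old(msg):
    return 0        # always replies, even when not asked to


def add_mem_reg_fixed(msg):
    return None     # leaves need_reply handling to dispatch()


def get_features(msg):
    return 0x1234   # arbitrary feature bits for the model


def master_get_features(add_mem_reg):
    """Send ADD_MEM_REG without need_reply, then read GET_FEATURES' reply."""
    wire = deque()
    dispatch(add_mem_reg, {'request': 'ADD_MEM_REG'}, wire)
    dispatch(get_features, {'request': 'GET_FEATURES'}, wire)
    # The master expects exactly one reply on the wire, for GET_FEATURES.
    return wire.popleft()
```

With the old handler the master reads the stray ADD_MEM_REG reply as if it
were the feature bits; with the fixed handler it reads the right message, and
a request that does set need_reply still gets its ack from the dispatcher.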

Signed-off-by: Kevin Wolf 
---
 subprojects/libvhost-user/libvhost-user.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 47d2efc60f..eccaff5168 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -800,8 +800,7 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 
 DPRINT("Successfully added new region\n");
 dev->nregions++;
-vmsg_set_reply_u64(vmsg, 0);
-return true;
+return false;
 }
 }
 
@@ -874,15 +873,13 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 }
 }
 
-if (found) {
-vmsg_set_reply_u64(vmsg, 0);
-} else {
+if (!found) {
 vu_panic(dev, "Specified region not found\n");
 }
 
 close(vmsg->fds[0]);
 
-return true;
+return false;
 }
 
 static bool
-- 
2.35.1




[PATCH v4 7/7] iotests: copy-before-write: add cases for cbw-timeout option

2022-04-07 Thread Vladimir Sementsov-Ogievskiy
Add two simple test-cases: timeout failure with
break-snapshot-on-cbw-error behavior and similar with
break-guest-write-on-cbw-error behavior.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/tests/copy-before-write| 81 +++
 .../qemu-iotests/tests/copy-before-write.out  |  4 +-
 2 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/tests/copy-before-write b/tests/qemu-iotests/tests/copy-before-write
index 6c7638965e..f01f26f01c 100755
--- a/tests/qemu-iotests/tests/copy-before-write
+++ b/tests/qemu-iotests/tests/copy-before-write
@@ -126,6 +126,87 @@ read 1048576/1048576 bytes at offset 0
 1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 """)
 
+def do_cbw_timeout(self, on_cbw_error):
+result = self.vm.qmp('object-add', {
+'qom-type': 'throttle-group',
+'id': 'group0',
+'limits': {'bps-write': 300 * 1024}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('blockdev-add', {
+'node-name': 'cbw',
+'driver': 'copy-before-write',
+'on-cbw-error': on_cbw_error,
+'cbw-timeout': 1,
+'file': {
+'driver': iotests.imgfmt,
+'file': {
+'driver': 'file',
+'filename': source_img,
+}
+},
+'target': {
+'driver': 'throttle',
+'throttle-group': 'group0',
+'file': {
+'driver': 'qcow2',
+'file': {
+'driver': 'file',
+'filename': temp_img
+}
+}
+}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('blockdev-add', {
+'node-name': 'access',
+'driver': 'snapshot-access',
+'file': 'cbw'
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.vm.qmp('human-monitor-command',
+ command_line='qemu-io cbw "write 0 512K"')
+self.assert_qmp(result, 'return', '')
+
+# We need second write to trigger throttling
+result = self.vm.qmp('human-monitor-command',
+ command_line='qemu-io cbw "write 512K 512K"')
+self.assert_qmp(result, 'return', '')
+
+result = self.vm.qmp('human-monitor-command',
+ command_line='qemu-io access "read 0 1M"')
+self.assert_qmp(result, 'return', '')
+
+self.vm.shutdown()
+log = self.vm.get_log()
+log = re.sub(r'^\[I \d+\.\d+\] OPENED\n', '', log)
+log = re.sub(r'\[I \+\d+\.\d+\] CLOSED\n?$', '', log)
+log = iotests.filter_qemu_io(log)
+return log
+
+def test_timeout_break_guest(self):
+log = self.do_cbw_timeout('break-guest-write')
+self.assertEqual(log, """\
+wrote 524288/524288 bytes at offset 0
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+write failed: Connection timed out
+read 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+""")
+
+def test_timeout_break_snapshot(self):
+log = self.do_cbw_timeout('break-snapshot')
+self.assertEqual(log, """\
+wrote 524288/524288 bytes at offset 0
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 524288/524288 bytes at offset 524288
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read failed: Permission denied
+""")
+
 
 if __name__ == '__main__':
 iotests.main(supported_fmts=['qcow2'],
diff --git a/tests/qemu-iotests/tests/copy-before-write.out b/tests/qemu-iotests/tests/copy-before-write.out
index fbc63e62f8..89968f35d7 100644
--- a/tests/qemu-iotests/tests/copy-before-write.out
+++ b/tests/qemu-iotests/tests/copy-before-write.out
@@ -1,5 +1,5 @@
-..
+....
 --
-Ran 2 tests
+Ran 4 tests
 
 OK
-- 
2.35.1




[PATCH 0/3] vhost-user: Fixes for VHOST_USER_ADD/REM_MEM_REG

2022-04-07 Thread Kevin Wolf
While implementing a vhost-user-blk driver for libblkio, I found some
problems with VHOST_USER_ADD/REM_MEM_REG both in the spec and in the
implementations in QEMU and libvhost-user that this series addresses.

I also noticed that you can use REM_MEM_REG or SET_MEM_TABLE to unmap a
memory region that is still in use (e.g. a block I/O request using
addresses from the region has been started, but not completed yet),
which is not great. I'm not sure how to fix this best, though.

We would have to wait for these requests to complete (maybe introduce a
refcount and wait for it to drop to zero), but waiting seems impossible
in libvhost-user because it doesn't have any main loop integration. Just
failing the memory region removal would be safe, but potentially a
rather awkward interface because clients would have to implement some
retry logic.
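
To make the "just fail the removal" option concrete, here is a rough sketch
of a refcounted region table. Nothing like this exists in libvhost-user
today; the class, its methods and the choice of -EBUSY are all assumptions
made for the sake of discussion.

```python
import errno


class RegionTable:
    """Hypothetical refcounted memory-region table (discussion sketch)."""

    def __init__(self):
        self.refs = {}  # region id -> number of in-flight requests

    def add(self, region):
        self.refs[region] = 0

    def start_io(self, region):
        self.refs[region] += 1

    def finish_io(self, region):
        self.refs[region] -= 1

    def remove(self, region):
        """Refuse to unmap a region that still has requests in flight."""
        if region not in self.refs:
            return -errno.ENOENT
        if self.refs[region] > 0:
            return -errno.EBUSY  # the client would have to retry later
        del self.refs[region]
        return 0
```

The retry burden on the client is visible here: remove() keeps returning
-EBUSY until every request that touched the region has completed.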

Kevin Wolf (3):
  docs/vhost-user: Clarifications for VHOST_USER_ADD/REM_MEM_REG
  libvhost-user: Fix extra vu_add/rem_mem_reg reply
  vhost-user: Don't pass file descriptor for VHOST_USER_REM_MEM_REG

 docs/interop/vhost-user.rst   | 17 +
 hw/virtio/vhost-user.c|  2 +-
 subprojects/libvhost-user/libvhost-user.c | 17 +++--
 3 files changed, 25 insertions(+), 11 deletions(-)

-- 
2.35.1




[PATCH 3/3] vhost-user: Don't pass file descriptor for VHOST_USER_REM_MEM_REG

2022-04-07 Thread Kevin Wolf
The spec clarifies now that QEMU should not send a file descriptor in a
request to remove a memory region. Change it accordingly.

For libvhost-user, this is a bug fix that makes it compatible with
rust-vmm's implementation that doesn't send a file descriptor. Keep
accepting, but ignoring a file descriptor for compatibility with older
QEMU versions.

Signed-off-by: Kevin Wolf 
---
 hw/virtio/vhost-user.c| 2 +-
 subprojects/libvhost-user/libvhost-user.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 6abbc9da32..82caf607e5 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -751,7 +751,7 @@ static int send_remove_regions(struct vhost_dev *dev,
 vhost_user_fill_msg_region(&region_buffer, shadow_reg, 0);
 msg->payload.mem_reg.region = region_buffer;
 
-ret = vhost_user_write(dev, msg, &fd, 1);
+ret = vhost_user_write(dev, msg, NULL, 0);
 if (ret < 0) {
 return ret;
 }
diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index eccaff5168..d0041c864b 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -822,15 +822,15 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 int i;
 bool found = false;
 
-if (vmsg->fd_num != 1) {
+if (vmsg->fd_num > 1) {
 vmsg_close_fds(vmsg);
-vu_panic(dev, "VHOST_USER_REM_MEM_REG received %d fds - only 1 fd "
+vu_panic(dev, "VHOST_USER_REM_MEM_REG received %d fds - at most 1 fd "
   "should be sent for this message type", vmsg->fd_num);
 return false;
 }
 
 if (vmsg->size < VHOST_USER_MEM_REG_SIZE) {
-close(vmsg->fds[0]);
+vmsg_close_fds(vmsg);
 vu_panic(dev, "VHOST_USER_REM_MEM_REG requires a message size of at "
   "least %d bytes and only %d bytes were received",
   VHOST_USER_MEM_REG_SIZE, vmsg->size);
@@ -877,7 +877,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 vu_panic(dev, "Specified region not found\n");
 }
 
-close(vmsg->fds[0]);
+vmsg_close_fds(vmsg);
 
 return false;
 }
-- 
2.35.1




Re: [PATCH v2 0/6] hw/riscv: Add TPM support to the virt board

2022-04-07 Thread Edgar E. Iglesias
On Thu, Apr 07, 2022 at 12:04:26PM +1000, Alistair Francis wrote:
> From: Alistair Francis 
> 
> This series adds support for connecting TPM devices to the RISC-V virt
> board. This is similar to how it works for the ARM virt board.
> 
> This was tested by first creating an emulated TPM device:
> 
> swtpm socket --tpm2 -t -d --tpmstate dir=/tmp/tpm \
> --ctrl type=unixio,path=swtpm-sock
> 
> Then launching QEMU with:
> 
> -chardev socket,id=chrtpm,path=swtpm-sock \
> -tpmdev emulator,id=tpm0,chardev=chrtpm \
> -device tpm-tis-device,tpmdev=tpm0
> 
> The TPM device can be seen in the memory tree and the generated device
> tree.

Hi Alistair!

You've got a typo in the subject of patch 4/6 "generating".

On the series:
Reviewed-by: Edgar E. Iglesias 

Cheers,
Edgar


> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/942
> 
> Alistair Francis (6):
>   hw/riscv: virt: Add a machine done notifier
>   hw/core: Move the ARM sysbus-fdt to core
>   hw/riscv: virt: Create a platform bus
>   hw/riscv: virt: Add support for generateing platform FDT entries
>   hw/riscv: virt: Add device plug support
>   hw/riscv: Enable TPM backends
> 
>  include/hw/{arm => core}/sysbus-fdt.h |   0
>  include/hw/riscv/virt.h   |   8 +-
>  hw/arm/virt.c |   2 +-
>  hw/arm/xlnx-versal-virt.c |   1 -
>  hw/{arm => core}/sysbus-fdt.c |   2 +-
>  hw/riscv/virt.c   | 312 +-
>  hw/arm/meson.build|   1 -
>  hw/core/meson.build   |   1 +
>  hw/riscv/Kconfig  |   2 +
>  9 files changed, 221 insertions(+), 108 deletions(-)
>  rename include/hw/{arm => core}/sysbus-fdt.h (100%)
>  rename hw/{arm => core}/sysbus-fdt.c (99%)
> 
> -- 
> 2.35.1
> 



[PATCH 1/3] docs/vhost-user: Clarifications for VHOST_USER_ADD/REM_MEM_REG

2022-04-07 Thread Kevin Wolf
The specification for VHOST_USER_ADD/REM_MEM_REG messages is unclear
in several points, which has led to clients having incompatible
implementations. This changes the specification to be more explicit
about them:

* VHOST_USER_ADD_MEM_REG is not specified as receiving a file
  descriptor, though it obviously does need to do so. All
  implementations agree on this one, fix the specification.

* VHOST_USER_REM_MEM_REG is not specified as receiving a file
  descriptor either, and it also has no reason to do so. rust-vmm does
  not send file descriptors for removing a memory region (in agreement
  with the specification), libvhost-user and QEMU do (which is a bug),
  though libvhost-user doesn't actually make any use of it.

  Change the specification so that for compatibility QEMU's behaviour
  becomes legal, even if discouraged, but rust-vmm's behaviour becomes
  the explicitly recommended mode of operation.

* VHOST_USER_ADD_MEM_REG doesn't have a documented return value, which
  is the desired behaviour in the non-postcopy case. It is also implemented
  like this in QEMU and rust-vmm, though libvhost-user is buggy and
  sometimes sends an unexpected reply. This will be fixed in a separate
  patch.

  However, in postcopy mode it does reply like VHOST_USER_SET_MEM_TABLE.
  This behaviour is shared between libvhost-user and QEMU; rust-vmm
  doesn't implement postcopy mode yet. Mention it explicitly in the
  spec.

* The specification doesn't mention how VHOST_USER_REM_MEM_REG
  identifies the memory region to be removed. Change it to describe the
  existing behaviour of libvhost-user (guest address, user address and
  size must match).
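
For illustration only, the matching rule from the last point could be
sketched as below. The struct and function names are hypothetical (the
field names are merely modelled on the region description in the
message payload), so treat this as a sketch of the rule rather than the
libvhost-user code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields carried in a region description. */
typedef struct DemoMemRegion {
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
    uint64_t mmap_offset;
} DemoMemRegion;

/*
 * Two descriptions name the same region if guest address, user address
 * and size agree. mmap_offset is deliberately not compared.
 */
static bool demo_region_matches(const DemoMemRegion *a,
                                const DemoMemRegion *b)
{
    return a->guest_phys_addr == b->guest_phys_addr &&
           a->userspace_addr == b->userspace_addr &&
           a->memory_size == b->memory_size;
}

/* Returns the index of the matching region, or -1 if none matches. */
static int demo_find_region(const DemoMemRegion *regions, size_t n,
                            const DemoMemRegion *key)
{
    assert(regions != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (demo_region_matches(&regions[i], key)) {
            return (int)i;
        }
    }
    return -1;
}
```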

Signed-off-by: Kevin Wolf 
---
 docs/interop/vhost-user.rst | 17 +
 1 file changed, 17 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 4dbc84fd00..f9e721ba5f 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -308,6 +308,7 @@ replies. Here is a list of the ones that do:
 There are several messages that the master sends with file descriptors passed
 in the ancillary data:
 
+* ``VHOST_USER_ADD_MEM_REG``
 * ``VHOST_USER_SET_MEM_TABLE``
 * ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
 * ``VHOST_USER_SET_LOG_FD``
@@ -1334,6 +1335,14 @@ Master message types
   ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
   update the memory tables of the slave device.
 
+  Exactly one file descriptor from which the memory is mapped is
+  passed in the ancillary data.
+
+  In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the slave
+  replies with the bases of the memory mapped region to the master.
+  For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
+  They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.
+
 ``VHOST_USER_REM_MEM_REG``
   :id: 38
   :equivalent ioctl: N/A
@@ -1349,6 +1358,14 @@ Master message types
   ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
   update the memory tables of the slave device.
 
+  The memory region to be removed is identified by its guest address,
+  user address and size. The mmap offset is ignored.
+
+  No file descriptors SHOULD be passed in the ancillary data. For
+  compatibility with existing incorrect implementations, the slave MAY
+  accept messages with one file descriptor. If a file descriptor is
+  passed, the slave MUST close it without using it otherwise.
+
 ``VHOST_USER_SET_STATUS``
   :id: 39
   :equivalent ioctl: VHOST_VDPA_SET_STATUS
-- 
2.35.1




Re: [libvirt RFC] virFile: new VIR_FILE_WRAPPER_BIG_PIPE to improve performance

2022-04-07 Thread Dr. David Alan Gilbert
* Claudio Fontana (cfont...@suse.de) wrote:
> On 4/5/22 10:35 AM, Dr. David Alan Gilbert wrote:
> > * Claudio Fontana (cfont...@suse.de) wrote:
> >> On 3/28/22 10:31 AM, Daniel P. Berrangé wrote:
> >>> On Sat, Mar 26, 2022 at 04:49:46PM +0100, Claudio Fontana wrote:
>  On 3/25/22 12:29 PM, Daniel P. Berrangé wrote:
> > On Fri, Mar 18, 2022 at 02:34:29PM +0100, Claudio Fontana wrote:
> >> On 3/17/22 4:03 PM, Dr. David Alan Gilbert wrote:
> >>> * Claudio Fontana (cfont...@suse.de) wrote:
>  On 3/17/22 2:41 PM, Claudio Fontana wrote:
> > On 3/17/22 11:25 AM, Daniel P. Berrangé wrote:
> >> On Thu, Mar 17, 2022 at 11:12:11AM +0100, Claudio Fontana wrote:
> >>> On 3/16/22 1:17 PM, Claudio Fontana wrote:
>  On 3/14/22 6:48 PM, Daniel P. Berrangé wrote:
> > On Mon, Mar 14, 2022 at 06:38:31PM +0100, Claudio Fontana wrote:
> >> On 3/14/22 6:17 PM, Daniel P. Berrangé wrote:
> >>> On Sat, Mar 12, 2022 at 05:30:01PM +0100, Claudio Fontana 
> >>> wrote:
>  the first user is the qemu driver,
> 
>  virsh save/resume would slow to a crawl with a default pipe 
>  size (64k).
> 
>  This improves the situation by 400%.
> 
>  Going through io_helper still seems to incur in some penalty 
>  (~15%-ish)
>  compared with direct qemu migration to a nc socket to a file.
> 
>  Signed-off-by: Claudio Fontana 
>  ---
>   src/qemu/qemu_driver.c|  6 +++---
>   src/qemu/qemu_saveimage.c | 11 ++-
>   src/util/virfile.c| 12 
>   src/util/virfile.h|  1 +
>   4 files changed, 22 insertions(+), 8 deletions(-)
> 
>  Hello, I initially thought this to be a qemu performance 
>  issue,
>  so you can find the discussion about this in qemu-devel:
> 
>  "Re: bad virsh save /dev/null performance (600 MiB/s max)"
> 
>  https://lists.gnu.org/archive/html/qemu-devel/2022-03/msg03142.html
> >>
> >>
> >>> Current results show these experimental averages maximum 
> >>> throughput
> >>> migrating to /dev/null per each FdWrapper Pipe Size (as per QEMU 
> >>> QMP
> >>> "query-migrate", tests repeated 5 times for each).
> >>> VM Size is 60G, most of the memory effectively touched before 
> >>> migration,
> >>> through user application allocating and touching all memory with
> >>> pseudorandom data.
> >>>
> >>> 64K: 5200 Mbps (current situation)
> >>> 128K:5800 Mbps
> >>> 256K:   20900 Mbps
> >>> 512K:   21600 Mbps
> >>> 1M: 22800 Mbps
> >>> 2M: 22800 Mbps
> >>> 4M: 22400 Mbps
> >>> 8M: 22500 Mbps
> >>> 16M:22800 Mbps
> >>> 32M:22900 Mbps
> >>> 64M:22900 Mbps
> >>> 128M:   22800 Mbps
> >>>
> >>> This above is the throughput out of patched libvirt with multiple 
> >>> Pipe Sizes for the FDWrapper.
> >>
> >> Ok, it's bouncing around with noise after 1 MB. So I'd suggest that
> >> libvirt attempt to raise the pipe limit to 1 MB by default, but
> >> not try to go higher.
> >>
> >>> As for the theoretical limit for the libvirt architecture,
> >>> I ran a qemu migration directly issuing the appropriate QMP
> >>> commands, setting the same migration parameters as per libvirt,
> >>> and then migrating to a socket netcatted to /dev/null via
> >>> {"execute": "migrate", "arguments": { "uri":
> >>> "unix:///tmp/netcat.sock" } } :
> >>>
> >>> QMP:37000 Mbps
> >>
> >>> So although the Pipe size improves things (in particular the
> >>> large jump is for the 256K size, although 1M seems a very good 
> >>> value),
> >>> there is still a second bottleneck in there somewhere that
> >>> accounts for a loss of ~14200 Mbps in throughput.
> 
> 
>  Interesting addition: I tested quickly on a system with faster cpus 
>  and larger VM sizes, up to 200GB,
>  and the difference in throughput libvirt vs qemu is basically the 
>  same ~14500 Mbps.
> 
>  ~5 mbps qemu to netcat socket to /dev/null
>  ~35500 mbps virsh save to /dev/null
> 
>  Seems it is not proportional to cpu speed by the looks of it (not a 
>  totally fair comparison because the VM sizes are different).
> >>>
> >>> It might be closer to RAM or cache bandwidth limit
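
For reference, the pipe-size tuning discussed in this thread can be
sketched as follows. This is a minimal illustration, assuming the
Linux-specific fcntl() commands F_SETPIPE_SZ and F_GETPIPE_SZ are the
mechanism; demo_grow_pipe is a made-up name, not a libvirt function:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* exposes F_SETPIPE_SZ / F_GETPIPE_SZ on Linux */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallback in case <fcntl.h> was already pulled in without _GNU_SOURCE. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#define F_GETPIPE_SZ 1032
#endif

/*
 * Try to grow the pipe behind fd to `wanted` bytes and return the
 * capacity actually in effect afterwards (or -1 if it cannot be
 * queried). A failed resize is ignored on purpose: the kernel may
 * refuse requests above /proc/sys/fs/pipe-max-size for unprivileged
 * processes, and the default size still works, just more slowly.
 */
static int demo_grow_pipe(int fd, int wanted)
{
    assert(fd >= 0 && wanted > 0);
    (void)fcntl(fd, F_SETPIPE_SZ, wanted);
    return fcntl(fd, F_GETPIPE_SZ);
}
```

Calling demo_grow_pipe(fd, 1024 * 1024) on the write end of the pipe
would correspond to the 1 MB default suggested above.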

Re: [libvirt RFC] virFile: new VIR_FILE_WRAPPER_BIG_PIPE to improve performance

2022-04-07 Thread Claudio Fontana
On 4/7/22 3:53 PM, Dr. David Alan Gilbert wrote:
> * Claudio Fontana (cfont...@suse.de) wrote:
>> On 4/5/22 10:35 AM, Dr. David Alan Gilbert wrote:
>>> * Claudio Fontana (cfont...@suse.de) wrote:
 On 3/28/22 10:31 AM, Daniel P. Berrangé wrote:
> On Sat, Mar 26, 2022 at 04:49:46PM +0100, Claudio Fontana wrote:
>> On 3/25/22 12:29 PM, Daniel P. Berrangé wrote:
>>> On Fri, Mar 18, 2022 at 02:34:29PM +0100, Claudio Fontana wrote:
 On 3/17/22 4:03 PM, Dr. David Alan Gilbert wrote:
> * Claudio Fontana (cfont...@suse.de) wrote:
>> On 3/17/22 2:41 PM, Claudio Fontana wrote:
>>> On 3/17/22 11:25 AM, Daniel P. Berrangé wrote:
 On Thu, Mar 17, 2022 at 11:12:11AM +0100, Claudio Fontana wrote:
> On 3/16/22 1:17 PM, Claudio Fontana wrote:
>> On 3/14/22 6:48 PM, Daniel P. Berrangé wrote:
>>> On Mon, Mar 14, 2022 at 06:38:31PM +0100, Claudio Fontana wrote:
 On 3/14/22 6:17 PM, Daniel P. Berrangé wrote:
> On Sat, Mar 12, 2022 at 05:30:01PM +0100, Claudio Fontana 
> wrote:
>> the first user is the qemu driver,
>>
>> virsh save/resume would slow to a crawl with a default pipe 
>> size (64k).
>>
>> This improves the situation by 400%.
>>
>> Going through io_helper still seems to incur in some penalty 
>> (~15%-ish)
>> compared with direct qemu migration to a nc socket to a file.
>>
>> Signed-off-by: Claudio Fontana 
>> ---
>>  src/qemu/qemu_driver.c|  6 +++---
>>  src/qemu/qemu_saveimage.c | 11 ++-
>>  src/util/virfile.c| 12 
>>  src/util/virfile.h|  1 +
>>  4 files changed, 22 insertions(+), 8 deletions(-)
>>
>> Hello, I initially thought this to be a qemu performance 
>> issue,
>> so you can find the discussion about this in qemu-devel:
>>
>> "Re: bad virsh save /dev/null performance (600 MiB/s max)"
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2022-03/msg03142.html


> Current results show these experimental averages maximum 
> throughput
> migrating to /dev/null per each FdWrapper Pipe Size (as per QEMU 
> QMP
> "query-migrate", tests repeated 5 times for each).
> VM Size is 60G, most of the memory effectively touched before 
> migration,
> through user application allocating and touching all memory with
> pseudorandom data.
>
> 64K: 5200 Mbps (current situation)
> 128K:5800 Mbps
> 256K:   20900 Mbps
> 512K:   21600 Mbps
> 1M: 22800 Mbps
> 2M: 22800 Mbps
> 4M: 22400 Mbps
> 8M: 22500 Mbps
> 16M:22800 Mbps
> 32M:22900 Mbps
> 64M:22900 Mbps
> 128M:   22800 Mbps
>
> This above is the throughput out of patched libvirt with multiple 
> Pipe Sizes for the FDWrapper.

 Ok, it's bouncing around with noise after 1 MB. So I'd suggest that
 libvirt attempt to raise the pipe limit to 1 MB by default, but
 not try to go higher.

> As for the theoretical limit for the libvirt architecture,
> I ran a qemu migration directly issuing the appropriate QMP
> commands, setting the same migration parameters as per libvirt,
> and then migrating to a socket netcatted to /dev/null via
> {"execute": "migrate", "arguments": { "uri":
> "unix:///tmp/netcat.sock" } } :
>
> QMP:37000 Mbps

> So although the Pipe size improves things (in particular the
> large jump is for the 256K size, although 1M seems a very good 
> value),
> there is still a second bottleneck in there somewhere that
> accounts for a loss of ~14200 Mbps in throughput.
>>
>>
>> Interesting addition: I tested quickly on a system with faster cpus 
>> and larger VM sizes, up to 200GB,
>> and the difference in throughput libvirt vs qemu is basically the 
>> same ~14500 Mbps.
>>
>> ~5 mbps qemu to netcat socket to /dev/null
>> ~35500 mbps virsh save to /dev/null
>>
>> Seems it is not proportional to cpu speed by the looks of it (not a 
>> totally fair comparison because the VM sizes are different).
>

Re: [PATCH 1/1] qemu-img: properly list formats which have consistency check implemented

2022-04-07 Thread Eric Blake
On Thu, Apr 07, 2022 at 11:39:32AM +0300, Denis V. Lunev wrote:
> Simple grep for the .bdrv_co_check callback presence gives the following
> list of block drivers
> * QED
> * VDI
> * VHDX
> * VMDK
> * Parallels
> which have this callback. The presence of the callback means that
> consistency check is supported.
> 
> The patch updates documentation accordingly.
> 
> Signed-off-by: Denis V. Lunev 
> CC: Kevin Wolf 
> CC: Hanna Reitz 
> ---
>  docs/tools/qemu-img.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




RE: [PATCH v1 1/4] hw/arm: versal: Create an APU CPU Cluster

2022-04-07 Thread Frederic Konrad



-Original Message-
From: Edgar E. Iglesias  
Sent: 06 April 2022 18:43
To: qemu-devel@nongnu.org
Cc: qemu-...@nongnu.org; peter.mayd...@linaro.org; 
richard.hender...@linaro.org; alist...@alistair23.me; l...@lmichel.fr; 
f4...@amsat.org; frasse.igles...@gmail.com; Francisco Eduardo Iglesias 
; Sai Pavan Boddu ; Frederic Konrad 
; Edgar Iglesias ; edgar.igles...@amd.com
Subject: [PATCH v1 1/4] hw/arm: versal: Create an APU CPU Cluster

From: "Edgar E. Iglesias" 

Create an APU CPU Cluster. This is in preparation to add the RPU.

Signed-off-by: Edgar E. Iglesias 
---
 hw/arm/xlnx-versal.c | 9 -
 include/hw/arm/xlnx-versal.h | 2 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
index 2551dfc22d..4415ee413f 100644
--- a/hw/arm/xlnx-versal.c
+++ b/hw/arm/xlnx-versal.c
@@ -34,10 +34,15 @@ static void versal_create_apu_cpus(Versal *s)
 {
 int i;
 
+object_initialize_child(OBJECT(s), "apu-cluster", &s->fpd.apu.cluster,
+TYPE_CPU_CLUSTER);
+qdev_prop_set_uint32(DEVICE(&s->fpd.apu.cluster), "cluster-id", 0);
+
 for (i = 0; i < ARRAY_SIZE(s->fpd.apu.cpu); i++) {
 Object *obj;
 
-object_initialize_child(OBJECT(s), "apu-cpu[*]", &s->fpd.apu.cpu[i],
+object_initialize_child(OBJECT(&s->fpd.apu.cluster),
+"apu-cpu[*]", &s->fpd.apu.cpu[i],
 XLNX_VERSAL_ACPU_TYPE);
 obj = OBJECT(&s->fpd.apu.cpu[i]);
 if (i) {
@@ -52,6 +57,8 @@ static void versal_create_apu_cpus(Versal *s)
  &error_abort);
 qdev_realize(DEVICE(obj), NULL, &error_fatal);
 }
+
+qdev_realize(DEVICE(&s->fpd.apu.cluster), NULL, &error_fatal);
 }
 
 static void versal_create_apu_gic(Versal *s, qemu_irq *pic)
diff --git a/include/hw/arm/xlnx-versal.h b/include/hw/arm/xlnx-versal.h
index 0728316ec7..d2d3028e18 100644
--- a/include/hw/arm/xlnx-versal.h
+++ b/include/hw/arm/xlnx-versal.h
@@ -14,6 +14,7 @@
 
 #include "hw/sysbus.h"
 #include "hw/arm/boot.h"
+#include "hw/cpu/cluster.h"
 #include "hw/or-irq.h"
 #include "hw/sd/sdhci.h"
 #include "hw/intc/arm_gicv3.h"
@@ -49,6 +50,7 @@ struct Versal {
 struct {
 struct {
 MemoryRegion mr;
+CPUClusterState cluster;
 ARMCPU cpu[XLNX_VERSAL_NR_ACPUS];
 GICv3State gic;
 } apu;
--
2.25.1

Reviewed-by: Frederic Konrad 



RE: [PATCH v1 2/4] hw/arm: versal: Add the Cortex-R5Fs

2022-04-07 Thread Frederic Konrad



-Original Message-
From: Edgar E. Iglesias  
Sent: 06 April 2022 18:43
To: qemu-devel@nongnu.org
Cc: qemu-...@nongnu.org; peter.mayd...@linaro.org; 
richard.hender...@linaro.org; alist...@alistair23.me; l...@lmichel.fr; 
f4...@amsat.org; frasse.igles...@gmail.com; Francisco Eduardo Iglesias 
; Sai Pavan Boddu ; Frederic Konrad 
; Edgar Iglesias ; edgar.igles...@amd.com
Subject: [PATCH v1 2/4] hw/arm: versal: Add the Cortex-R5Fs

From: "Edgar E. Iglesias" 

Add the Cortex-R5Fs of the Versal RPU (Real-time Processing Unit) subsystem.

Signed-off-by: Edgar E. Iglesias 
---
 hw/arm/xlnx-versal-virt.c|  6 +++---
 hw/arm/xlnx-versal.c | 36 
 include/hw/arm/xlnx-versal.h | 10 ++
 3 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/hw/arm/xlnx-versal-virt.c b/hw/arm/xlnx-versal-virt.c
index 7c7baff8b7..66a2de7e13 100644
--- a/hw/arm/xlnx-versal-virt.c
+++ b/hw/arm/xlnx-versal-virt.c
@@ -721,9 +721,9 @@ static void versal_virt_machine_class_init(ObjectClass *oc, void *data)
 
 mc->desc = "Xilinx Versal Virtual development board";
 mc->init = versal_virt_init;
-mc->min_cpus = XLNX_VERSAL_NR_ACPUS;
-mc->max_cpus = XLNX_VERSAL_NR_ACPUS;
-mc->default_cpus = XLNX_VERSAL_NR_ACPUS;
+mc->min_cpus = XLNX_VERSAL_NR_ACPUS + XLNX_VERSAL_NR_RCPUS;
+mc->max_cpus = XLNX_VERSAL_NR_ACPUS + XLNX_VERSAL_NR_RCPUS;
+mc->default_cpus = XLNX_VERSAL_NR_ACPUS + XLNX_VERSAL_NR_RCPUS;
 mc->no_cdrom = true;
 mc->default_ram_id = "ddr";
 }
diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
index 4415ee413f..ebad8dbb6d 100644
--- a/hw/arm/xlnx-versal.c
+++ b/hw/arm/xlnx-versal.c
@@ -25,6 +25,7 @@
 #include "hw/sysbus.h"
 
 #define XLNX_VERSAL_ACPU_TYPE ARM_CPU_TYPE_NAME("cortex-a72")
+#define XLNX_VERSAL_RCPU_TYPE ARM_CPU_TYPE_NAME("cortex-r5f")
 #define GEM_REVISION0x40070106
 
 #define VERSAL_NUM_PMC_APB_IRQS 3
@@ -130,6 +131,35 @@ static void versal_create_apu_gic(Versal *s, qemu_irq *pic)
 }
 }
 
+static void versal_create_rpu_cpus(Versal *s)
+{
+int i;
+
+object_initialize_child(OBJECT(s), "rpu-cluster", &s->lpd.rpu.cluster,
+TYPE_CPU_CLUSTER);
+qdev_prop_set_uint32(DEVICE(&s->lpd.rpu.cluster), "cluster-id", 1);
+
+for (i = 0; i < ARRAY_SIZE(s->lpd.rpu.cpu); i++) {
+Object *obj;
+
+object_initialize_child(OBJECT(&s->lpd.rpu.cluster),
+"rpu-cpu[*]", &s->lpd.rpu.cpu[i],
+XLNX_VERSAL_RCPU_TYPE);
+obj = OBJECT(&s->lpd.rpu.cpu[i]);
+object_property_set_bool(obj, "start-powered-off", true,
+ &error_abort);
+
+object_property_set_int(obj, "mp-affinity", 0x100 | i, &error_abort);
+object_property_set_int(obj, "core-count", ARRAY_SIZE(s->lpd.rpu.cpu),
+&error_abort);
+object_property_set_link(obj, "memory", OBJECT(&s->lpd.rpu.mr),
+ &error_abort);
+qdev_realize(DEVICE(obj), NULL, &error_fatal);
+}
+
+qdev_realize(DEVICE(&s->lpd.rpu.cluster), NULL, &error_fatal);
+}
+
 static void versal_create_uarts(Versal *s, qemu_irq *pic)
 {
 int i;
@@ -638,6 +668,7 @@ static void versal_realize(DeviceState *dev, Error **errp)
 
 versal_create_apu_cpus(s);
 versal_create_apu_gic(s, pic);
+versal_create_rpu_cpus(s);
 versal_create_uarts(s, pic);
 versal_create_usbs(s, pic);
 versal_create_gems(s, pic);
@@ -659,6 +690,8 @@ static void versal_realize(DeviceState *dev, Error **errp)
 
 memory_region_add_subregion_overlap(&s->mr_ps, MM_OCM, &s->lpd.mr_ocm, 0);
 memory_region_add_subregion_overlap(&s->fpd.apu.mr, 0, &s->mr_ps, 0);
+memory_region_add_subregion_overlap(&s->lpd.rpu.mr, 0,
+&s->lpd.rpu.mr_ps_alias, 0);
 }
 
 static void versal_init(Object *obj)
@@ -666,7 +699,10 @@ static void versal_init(Object *obj)
 Versal *s = XLNX_VERSAL(obj);
 
 memory_region_init(&s->fpd.apu.mr, obj, "mr-apu", UINT64_MAX);
+memory_region_init(&s->lpd.rpu.mr, obj, "mr-rpu", UINT64_MAX);
 memory_region_init(&s->mr_ps, obj, "mr-ps-switch", UINT64_MAX);
+memory_region_init_alias(&s->lpd.rpu.mr_ps_alias, OBJECT(s),
+ "mr-rpu-ps-alias", &s->mr_ps, 0, UINT64_MAX);
 }
 
 static Property versal_properties[] = {
diff --git a/include/hw/arm/xlnx-versal.h b/include/hw/arm/xlnx-versal.h
index d2d3028e18..155e8c4b8c 100644
--- a/include/hw/arm/xlnx-versal.h
+++ b/include/hw/arm/xlnx-versal.h
@@ -35,6 +35,7 @@
 OBJECT_DECLARE_SIMPLE_TYPE(Versal, XLNX_VERSAL)
 
 #define XLNX_VERSAL_NR_ACPUS   2
+#define XLNX_VERSAL_NR_RCPUS   2
 #define XLNX_VERSAL_NR_UARTS   2
 #define XLNX_VERSAL_NR_GEMS2
 #define XLNX_VERSAL_NR_ADMAS   8
@@ -73,6 +74,15 @@ struct Versal {
 VersalUsb2 usb;
 } iou;
 
+/* Real-time Processing Unit.  */
+s

RE: [PATCH v1 3/4] hw/misc: Add a model of the Xilinx Versal CRL

2022-04-07 Thread Frederic Konrad



> -Original Message-
> From: Edgar E. Iglesias 
> Sent: 06 April 2022 18:43
> To: qemu-devel@nongnu.org
> Cc: qemu-...@nongnu.org; peter.mayd...@linaro.org;
> richard.hender...@linaro.org; alist...@alistair23.me; l...@lmichel.fr;
> f4...@amsat.org; frasse.igles...@gmail.com; Francisco Eduardo Iglesias
> ; Sai Pavan Boddu ; Frederic
> Konrad ; Edgar Iglesias ;
> edgar.igles...@amd.com
> Subject: [PATCH v1 3/4] hw/misc: Add a model of the Xilinx Versal CRL
> 
> From: "Edgar E. Iglesias" 
> 
> Add a model of the Xilinx Versal CRL.
> 
> Signed-off-by: Edgar E. Iglesias 
> ---
>  hw/misc/meson.build   |   1 +
>  hw/misc/xlnx-versal-crl.c | 421 ++
>  include/hw/misc/xlnx-versal-crl.h | 235 +
>  3 files changed, 657 insertions(+)
>  create mode 100644 hw/misc/xlnx-versal-crl.c
>  create mode 100644 include/hw/misc/xlnx-versal-crl.h
> 
> diff --git a/hw/misc/meson.build b/hw/misc/meson.build
> index 6fb69612e0..2ff05c7afa 100644
> --- a/hw/misc/meson.build
> +++ b/hw/misc/meson.build
> @@ -86,6 +86,7 @@ softmmu_ss.add(when: 'CONFIG_SLAVIO', if_true:
> files('slavio_misc.c'))
>  softmmu_ss.add(when: 'CONFIG_ZYNQ', if_true: files('zynq_slcr.c'))
>  specific_ss.add(when: 'CONFIG_XLNX_ZYNQMP_ARM', if_true: files('xlnx-
> zynqmp-crf.c'))
>  specific_ss.add(when: 'CONFIG_XLNX_ZYNQMP_ARM', if_true: files('xlnx-
> zynqmp-apu-ctrl.c'))
> +specific_ss.add(when: 'CONFIG_XLNX_VERSAL', if_true: files('xlnx-versal-
> crl.c'))
>  softmmu_ss.add(when: 'CONFIG_XLNX_VERSAL', if_true: files(
>'xlnx-versal-xramc.c',
>'xlnx-versal-pmc-iou-slcr.c',
> diff --git a/hw/misc/xlnx-versal-crl.c b/hw/misc/xlnx-versal-crl.c
> new file mode 100644
> index 00..767106b7a3
> --- /dev/null
> +++ b/hw/misc/xlnx-versal-crl.c
> @@ -0,0 +1,421 @@
> +/*
> + * QEMU model of the Clock-Reset-LPD (CRL).
> + *
> + * Copyright (c) 2022 Advanced Micro Devices, Inc.
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * Written by Edgar E. Iglesias 
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/log.h"
> +#include "qemu/bitops.h"
> +#include "migration/vmstate.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/sysbus.h"
> +#include "hw/irq.h"
> +#include "hw/register.h"
> +#include "hw/resettable.h"
> +
> +#include "target/arm/arm-powerctl.h"
> +#include "hw/misc/xlnx-versal-crl.h"
> +
> +#ifndef XLNX_VERSAL_CRL_ERR_DEBUG
> +#define XLNX_VERSAL_CRL_ERR_DEBUG 0
> +#endif
> +
> +static void crl_update_irq(XlnxVersalCRL *s)
> +{
> +bool pending = s->regs[R_IR_STATUS] & ~s->regs[R_IR_MASK];
> +qemu_set_irq(s->irq, pending);
> +}
> +
> +static void crl_status_postw(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +crl_update_irq(s);
> +}
> +
> +static uint64_t crl_enable_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +uint32_t val = val64;
> +
> +s->regs[R_IR_MASK] &= ~val;
> +crl_update_irq(s);
> +return 0;
> +}
> +
> +static uint64_t crl_disable_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +uint32_t val = val64;
> +
> +s->regs[R_IR_MASK] |= val;
> +crl_update_irq(s);
> +return 0;
> +}
> +
> +static void crl_reset_dev(XlnxVersalCRL *s, DeviceState *dev,
> +  bool rst_old, bool rst_new)
> +{
> +device_cold_reset(dev);
> +}
> +
> +static void crl_reset_cpu(XlnxVersalCRL *s, ARMCPU *armcpu,
> +  bool rst_old, bool rst_new)
> +{
> +if (rst_new) {
> +arm_set_cpu_off(armcpu->mp_affinity);
> +} else {
> +arm_set_cpu_on_and_reset(armcpu->mp_affinity);
> +}
> +}
> +
> +#define REGFIELD_RESET(type, s, reg, f, new_val, dev) { \
> +bool old_f = ARRAY_FIELD_EX32((s)->regs, reg, f);   \
> +bool new_f = FIELD_EX32(new_val, reg, f);   \
> +\
> +/* Detect edges.  */\
> +if (dev && old_f != new_f) {\
> +crl_reset_ ## type(s, dev, old_f, new_f);   \
> +}   \
> +}
> +
> +static uint64_t crl_rst_r5_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +
> +REGFIELD_RESET(cpu, s, RST_CPU_R5, RESET_CPU0, val64, s-
> >cfg.cpu_r5[0]);
> +REGFIELD_RESET(cpu, s, RST_CPU_R5, RESET_CPU1, val64, s-
> >cfg.cpu_r5[1]);
> +return val64;
> +}
> +
> +static uint64_t crl_rst_adma_prew(RegisterInfo *reg, uint64_t val64)
> +{
> +XlnxVersalCRL *s = XLNX_VERSAL_CRL(reg->opaque);
> +int i;
> +
> +/* A single register fans out to all ADMA reset inputs.  */
> +for (i = 0; i < ARRAY_SIZE(s->cfg.adma); i++) {
> +REGFIELD_RESET(dev, s, RST_ADMA, RESET, val64, s->cfg.adma[i]);
> +}

RE: [PATCH v1 4/4] hw/arm: versal: Connect the CRL

2022-04-07 Thread Frederic Konrad



> -Original Message-
> From: Edgar E. Iglesias 
> Sent: 06 April 2022 18:43
> To: qemu-devel@nongnu.org
> Cc: qemu-...@nongnu.org; peter.mayd...@linaro.org;
> richard.hender...@linaro.org; alist...@alistair23.me; l...@lmichel.fr;
> f4...@amsat.org; frasse.igles...@gmail.com; Francisco Eduardo Iglesias
> ; Sai Pavan Boddu ; Frederic
> Konrad ; Edgar Iglesias ;
> edgar.igles...@amd.com
> Subject: [PATCH v1 4/4] hw/arm: versal: Connect the CRL
> 
> From: "Edgar E. Iglesias" 
> 
> Connect the CRL (Clock Reset LPD) to the Versal SoC.
> 
> Signed-off-by: Edgar E. Iglesias 
> ---
>  hw/arm/xlnx-versal.c | 54 ++--
>  include/hw/arm/xlnx-versal.h |  4 +++
>  2 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/arm/xlnx-versal.c b/hw/arm/xlnx-versal.c
> index ebad8dbb6d..57276e1506 100644
> --- a/hw/arm/xlnx-versal.c
> +++ b/hw/arm/xlnx-versal.c
> @@ -539,6 +539,57 @@ static void versal_create_ospi(Versal *s, qemu_irq
> *pic)
>  qdev_connect_gpio_out(orgate, 0, pic[VERSAL_OSPI_IRQ]);
>  }
> 
> +static void versal_create_crl(Versal *s, qemu_irq *pic)
> +{
> +SysBusDevice *sbd;
> +int i;
> +
> +object_initialize_child(OBJECT(s), "crl", &s->lpd.crl,
> +TYPE_XLNX_VERSAL_CRL);
> +sbd = SYS_BUS_DEVICE(&s->lpd.crl);
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.rpu.cpu); i++) {
> +g_autofree gchar *name = g_strdup_printf("cpu_r5[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.rpu.cpu[i]),
> + &error_abort);
> +}
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.iou.gem); i++) {
> +g_autofree gchar *name = g_strdup_printf("gem[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.iou.gem[i]),
> + &error_abort);
> +}
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.iou.adma); i++) {
> +g_autofree gchar *name = g_strdup_printf("adma[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.iou.adma[i]),
> + &error_abort);
> +}
> +
> +for (i = 0; i < ARRAY_SIZE(s->lpd.iou.uart); i++) {
> +g_autofree gchar *name = g_strdup_printf("uart[%d]", i);
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + name, OBJECT(&s->lpd.iou.uart[i]),
> + &error_abort);
> +}
> +
> +object_property_set_link(OBJECT(&s->lpd.crl),
> + "usb", OBJECT(&s->lpd.iou.usb),
> + &error_abort);
> +
> +sysbus_realize(sbd, &error_fatal);
> +memory_region_add_subregion(&s->mr_ps, MM_CRL,
> +sysbus_mmio_get_region(sbd, 0));
> +sysbus_connect_irq(sbd, 0, pic[VERSAL_CRL_IRQ]);
> +}
> +
>  /* This takes the board allocated linear DDR memory and creates aliases
>   * for each split DDR range/aperture on the Versal address map.
>   */
> @@ -622,8 +673,6 @@ static void versal_unimp(Versal *s)
> 
>  versal_unimp_area(s, "psm", &s->mr_ps,
>  MM_PSM_START, MM_PSM_END - MM_PSM_START);
> -versal_unimp_area(s, "crl", &s->mr_ps,
> -MM_CRL, MM_CRL_SIZE);
>  versal_unimp_area(s, "crf", &s->mr_ps,
>  MM_FPD_CRF, MM_FPD_CRF_SIZE);
>  versal_unimp_area(s, "apu", &s->mr_ps,
> @@ -681,6 +730,7 @@ static void versal_realize(DeviceState *dev, Error
> **errp)
>  versal_create_efuse(s, pic);
>  versal_create_pmc_iou_slcr(s, pic);
>  versal_create_ospi(s, pic);
> +versal_create_crl(s, pic);
>  versal_map_ddr(s);
>  versal_unimp(s);
> 
> diff --git a/include/hw/arm/xlnx-versal.h b/include/hw/arm/xlnx-versal.h
> index 155e8c4b8c..cbe8a19c10 100644
> --- a/include/hw/arm/xlnx-versal.h
> +++ b/include/hw/arm/xlnx-versal.h
> @@ -29,6 +29,7 @@
>  #include "hw/nvram/xlnx-versal-efuse.h"
>  #include "hw/ssi/xlnx-versal-ospi.h"
>  #include "hw/dma/xlnx_csu_dma.h"
> +#include "hw/misc/xlnx-versal-crl.h"
>  #include "hw/misc/xlnx-versal-pmc-iou-slcr.h"
> 
>  #define TYPE_XLNX_VERSAL "xlnx-versal"
> @@ -87,6 +88,8 @@ struct Versal {
>  qemu_or_irq irq_orgate;
>  XlnxXramCtrl ctrl[XLNX_VERSAL_NR_XRAM];
>  } xram;
> +
> +XlnxVersalCRL crl;
>  } lpd;
> 
>  /* The Platform Management Controller subsystem.  */
> @@ -127,6 +130,7 @@ struct Versal {
>  #define VERSAL_TIMER_NS_EL1_IRQ 14
>  #define VERSAL_TIMER_NS_EL2_IRQ 10
> 
> +#define VERSAL_CRL_IRQ 10
>  #define VERSAL_UART0_IRQ_0 18
>  #define VERSAL_UART1_IRQ_0 19
>  #define VERSAL_USB0_IRQ_0  22
> --
> 2.25.1

Reviewed-by: Frederic Konrad 




Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Richard Henderson

On 4/7/22 06:18, Kirill A. Shutemov wrote:

The new hook is incorrect, in that it doesn't apply to addresses along
the tlb fast path.


I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
the tag bits before tlb lookup.

Could you elaborate?


The fast path does not clear the bits, so you enter the slow path before you get to 
clearing the bits.  You've lost most of the advantage of the tlb already.



To be honest I don't fully understand how TBI emulation works.


In get_phys_addr_lpae:

addrsize = 64 - 8 * param.tbi;
...
target_ulong top_bits = sextract64(address, inputsize,
   addrsize - inputsize);
if (-top_bits != param.select) {
/* The gap between the two regions is a Translation fault */
fault_type = ARMFault_Translation;
goto do_fault;
}

which does not include TBI bits in the validation of the sign-extended address.
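
To make the excerpt above concrete, here is a minimal standalone sketch of that check. It is not the QEMU code: sextract64_sketch() is a stand-in for sextract64(), top_bits_valid() is a hypothetical helper name, and the early return for inputsize >= addrsize is added only so the sketch never extracts zero bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sign-extract 'length' bits of 'value' starting at bit 'start'.
 * Stand-in for QEMU's sextract64(); assumes the compiler implements >> on
 * negative signed values as an arithmetic shift (true for GCC and Clang).
 */
static int64_t sextract64_sketch(uint64_t value, int start, int length)
{
    return (int64_t)(value << (64 - length - start)) >> (64 - length);
}

/*
 * Hypothetical helper: true when the bits between inputsize and addrsize
 * all match 'select' (0 = lower region, 1 = upper region).
 * With tbi set, addrsize shrinks to 56, so the top byte is never examined.
 */
static bool top_bits_valid(uint64_t address, int inputsize, bool tbi, int select)
{
    int addrsize = 64 - 8 * (tbi ? 1 : 0);

    if (inputsize >= addrsize) {
        return true; /* nothing between inputsize and addrsize to check */
    }
    int64_t top_bits = sextract64_sketch(address, inputsize,
                                         addrsize - inputsize);
    return -top_bits == select;
}
```

With inputsize = 48, an address carrying a tag of 0xAB in the top byte passes when tbi is set and fails when it is clear, which is the sense in which the TBI bits are left out of the validation.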


Consider store_helper(). I failed to find where tag bits get stripped
before getting there for !CONFIG_USER_ONLY. clean_data_tbi() only covers
user-only case.

And if we get there with tags, I don't see how we will ever get to fast
path: tlb_hit() should never return true there if any bit in top byte is
set as cached tlb_addr has them stripped.

tlb_fill() will get it handled correctly, but it is wasteful to go through
pagewalk on every tagged pointer dereference.


We won't do a pagewalk for every tagged pointer dereference.  It'll be pointer 
dereferences with differing tags past the limit of the victim cache (CPU_VTLB_SIZE).  And 
one tag will get to use the fast path, e.g. on the store following a load.


I've just now had a browse through the Intel docs, and I see that you're not performing 
the required modified canonicality check.  While a proper tagged address will have the tag 
removed in CR2 during a page fault, an improper tagged address (with bit 63 != {47,56}) 
should have the original address reported to CR2.
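
As a rough illustration of the check being discussed, the sketch below separates the two steps: the modified canonicality test (bit 63 compared with bit 47 or 56) and the sign-extension that drops the tag. It is based only on the description in this thread, not on the patch or on the Intel text, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Modified canonicality check as described above: bit 63 must equal the
 * sign bit of the untagged address. sign_bit is assumed to be 47 for
 * LAM48 and 56 for LAM57.
 */
static bool lam_modified_canonical(uint64_t addr, int sign_bit)
{
    return ((addr >> 63) & 1) == ((addr >> sign_bit) & 1);
}

/*
 * Strip the tag by sign-extending from sign_bit. Assumes arithmetic right
 * shift of negative signed values (true for GCC and Clang).
 */
static uint64_t lam_untag(uint64_t addr, int sign_bit)
{
    int shift = 63 - sign_bit;

    return (uint64_t)((int64_t)(addr << shift) >> shift);
}
```

Under this reading, a caller would report lam_untag(addr, ...) in CR2 only when lam_modified_canonical() holds, and the original addr otherwise.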


I could imagine a hook that could aid the victim cache in ignoring the tag, so that we
need to go through tlb_fill fewer times.  But I wouldn't want to include that in the base
version of this feature, and I'd want to take more than a moment in the design so that it
could be used by ARM and RISC-V as well.



r~



Re: [PATCH for-7.0] virtio-iommu: use-after-free fix

2022-04-07 Thread Michael S. Tsirkin
On Thu, Apr 07, 2022 at 11:03:16AM +0100, Peter Maydell wrote:
> On Thu, 7 Apr 2022 at 10:52, Michael S. Tsirkin  wrote:
> >
> > From: Wentao Liang 
> >
> > A potential Use-after-free was reported in virtio_iommu_handle_command
> > when using virtio-iommu:
> >
> > > I find a potential Use-after-free in QEMU 6.2.0, which is in
> > > virtio_iommu_handle_command() (./hw/virtio/virtio-iommu.c).
> 
> So, this isn't a regression. Do you think it's critically necessary
> it goes in 7.0, or is it in the category "put it into 7.0 if we
> need an rc4 for some other reason anyway" ?
> 
> (I have a feeling we'll need an rc4, but we'll see.)
> 
> thanks
> -- PMM

I am concerned it can be used to trigger a CVE but I could not
find a way. So I would say if there's an rc4 pls include it
but if not then we can pick it up in stable.

-- 
MST




[RFC PATCH] tests/qtest: pass stdout/stderr down to subtests

2022-04-07 Thread Alex Bennée
When trying to work out what the virtio-net-tests were doing it was
hard because the g_test_trap_subprocess redirects all output to
/dev/null. Lift this restriction by using the appropriate flags so you
can see something similar to what the vhost-user-blk tests show when
running.

While we are at it remove the g_test_verbose() check so we always show
how the QEMU is run.

Signed-off-by: Alex Bennée 
---
 tests/qtest/qos-test.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/tests/qtest/qos-test.c b/tests/qtest/qos-test.c
index f97d0a08fd..c6c196cc95 100644
--- a/tests/qtest/qos-test.c
+++ b/tests/qtest/qos-test.c
@@ -89,9 +89,7 @@ static void qos_set_machines_devices_available(void)
 
 static void restart_qemu_or_continue(char *path)
 {
-if (g_test_verbose()) {
-qos_printf("Run QEMU with: '%s'\n", path);
-}
+qos_printf("Run QEMU with: '%s'\n", path);
 /* compares the current command line with the
  * one previously executed: if they are the same,
  * don't restart QEMU, if they differ, stop previous
@@ -185,7 +183,8 @@ static void run_one_test(const void *arg)
 static void subprocess_run_one_test(const void *arg)
 {
 const gchar *path = arg;
-g_test_trap_subprocess(path, 0, 0);
+g_test_trap_subprocess(path, 0,
+   G_TEST_SUBPROCESS_INHERIT_STDOUT | G_TEST_SUBPROCESS_INHERIT_STDERR);
 g_test_trap_assert_passed();
 }
 
-- 
2.30.2




Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Kirill A. Shutemov
On Thu, Apr 07, 2022 at 07:28:54AM -0700, Richard Henderson wrote:
> On 4/7/22 06:18, Kirill A. Shutemov wrote:
> > > The new hook is incorrect, in that it doesn't apply to addresses along
> > > the tlb fast path.
> > 
> > I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
> > the tag bits before tlb lookup.
> > 
> > Could you elaborate?
> 
> The fast path does not clear the bits, so you enter the slow path before you
> get to clearing the bits.  You've lost most of the advantage of the tlb
> already.

Sorry for my ignorance, but what do you mean by fast path here?

My understanding is that it is the case when tlb_hit() is true and you
don't need to get into tlb_fill(). Are we talking about the same scheme?

For store_helper() I clear the bits before doing the TLB lookup and fill. So
the TLB will always deal with clean addresses.

Hm?

> > To be honest I don't fully understand how TBI emulation works.
> 
> In get_phys_addr_lpae:
> 
> addrsize = 64 - 8 * param.tbi;
> ...
> target_ulong top_bits = sextract64(address, inputsize,
>addrsize - inputsize);
> if (-top_bits != param.select) {
> /* The gap between the two regions is a Translation fault */
> fault_type = ARMFault_Translation;
> goto do_fault;
> }
> 
> which does not include TBI bits in the validation of the sign-extended 
> address.
> 
> > Consider store_helper(). I failed to find where tag bits get stripped
> > before getting there for !CONFIG_USER_ONLY. clean_data_tbi() only covers
> > user-only case.
> > 
> > And if we get there with tags, I don't see how we will ever get to fast
> > path: tlb_hit() should never return true there if any bit in top byte is
> > set as cached tlb_addr has them stripped.
> > 
> > tlb_fill() will get it handled correctly, but it is wasteful to go through
> > pagewalk on every tagged pointer dereference.
> 
> We won't do a pagewalk for every tagged pointer dereference.  It'll be
> pointer dereferences with differing tags past the limit of the victim cache
> (CPU_VTLB_SIZE).  And one tag will get to use the fast path, e.g. on the
> store following a load.
> 
> I've just now had a browse through the Intel docs, and I see that you're not
> performing the required modified canonicality check.

The modified check is effectively done by clearing (and sign-extending) the
address before the check.

> While a proper tagged address will have the tag removed in CR2 during a
> page fault, an improper tagged address (with bit 63 != {47,56}) should
> have the original address reported to CR2.

Hm. I don't see it in the spec. It rather points in the other direction:

Page faults report the faulting linear address in CR2. Because LAM
masking (by sign-extension) applies before paging, the faulting
linear address recorded in CR2 does not contain the masked
metadata.

Yes, it talks about CR2 in case of page fault, not #GP due to canonicality
checking, but still.

> I could imagine a hook that could aid the victim cache in ignoring the tag,
> so that we need to go through tlb_fill fewer times.  But I wouldn't want to
> include that in the base version of this feature, and I'd want to take more
> than a moment in the design so that it could be used by ARM and RISC-V as
> well.

But what other options do you see? Clearing the bits before TLB lookup
matches the architectural spec and makes INVLPG match the described behaviour
without special handling.

-- 
 Kirill A. Shutemov



Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-07 Thread Sean Christopherson
On Thu, Mar 10, 2022, Chao Peng wrote:
> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> memory behave like longterm pinned pages and thus should be accounted to
> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> 
> Signed-off-by: Chao Peng 
> ---
>  mm/shmem.c | 25 -
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 7b43e274c9a2..ae46fb96494b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>  pgoff_t start, pgoff_t end)
>  {
> -#ifdef CONFIG_MEMFILE_NOTIFIER
>   struct shmem_inode_info *info = SHMEM_I(inode);
>  
> +#ifdef CONFIG_MEMFILE_NOTIFIER
>   start = max(start, folio->index);
>   end = min(end, folio->index + folio_nr_pages(folio));
>  
>   memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>  #endif
> +
> + if (info->xflags & SHM_F_INACCESSIBLE)
> + atomic64_sub(end - start, &current->mm->pinned_vm);

As Vishal's to-be-posted selftest discovered, this is broken as current->mm may
be NULL.  Or it may be a completely different mm, e.g. AFAICT there's nothing
that prevents a different process from punching a hole in the shmem backing.

I don't see a sane way of tracking this in the backing store unless the inode is
associated with a single mm when it's created, and that opens up a giant can of
worms, e.g. what happens with the accounting if the creating process goes away?

I think the correct approach is to not do the locking automatically for
SHM_F_INACCESSIBLE, and instead require userspace to do
shmctl(.., SHM_LOCK, ...) if userspace knows the consumers don't support
migrate/swap.  That'd require wrapping migrate_page() and then wiring up
notifier hooks for migrate/swap, but IMO that's a good thing to get sorted out
sooner than later.  KVM isn't planning on supporting migrate/swap for TDX or
SNP, but supporting at least migrate for a software-only implementation a la
pKVM should be relatively straightforward.  On the notifiee side, KVM can
terminate the VM if it gets an unexpected migrate/swap, e.g. so that TDX/SEV
VMs don't die later with exceptions and/or data corruption (pre-SNP SEV
guests) in the guest.

Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the
swap, so maybe it's just the page migration path that needs to be updated?



Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Paolo Bonzini

On 4/7/22 17:27, Kirill A. Shutemov wrote:

On Thu, Apr 07, 2022 at 07:28:54AM -0700, Richard Henderson wrote:

On 4/7/22 06:18, Kirill A. Shutemov wrote:

The new hook is incorrect, in that it doesn't apply to addresses along
the tlb fast path.


I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
the tag bits before tlb lookup.

Could you elaborate?


The fast path does not clear the bits, so you enter the slow path before you
get to clearing the bits.  You've lost most of the advantage of the tlb
already.


Sorry for my ignorance, but what do you mean by fast path here?


The fast path is the TLB lookup code that is generated by the JIT 
compiler.  If the TLB hits, the memory access doesn't go through any C 
code.  I think tagged addresses always fail the fast path in your patch.



While a proper tagged address will have the tag removed in CR2 during a
page fault, an improper tagged address (with bit 63 != {47,56}) should
have the original address reported to CR2.


Hm. I don't see it in the spec. It rather points in the other direction:

Page faults report the faulting linear address in CR2. Because LAM
masking (by sign-extension) applies before paging, the faulting
linear address recorded in CR2 does not contain the masked
metadata.

Yes, it talks about CR2 in case of page fault, not #GP due to canonicality
checking, but still.


I could imagine a hook that could aid the victim cache in ignoring the tag,
so that we need to go through tlb_fill fewer times.  But I wouldn't want to
include that in the base version of this feature, and I'd want to take more
than a moment in the design so that it could be used by ARM and RISC-V as
well.


But what other options do you see? Clearing the bits before TLB lookup
matches the architectural spec and makes INVLPG match the described behaviour
without special handling.


Ah, INVLPG handling is messy indeed.

Paolo




Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-07 Thread Andy Lutomirski



On Thu, Apr 7, 2022, at 9:05 AM, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
>> Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
>> memory behave like longterm pinned pages and thus should be accounted to
>> mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
>> 
>> Signed-off-by: Chao Peng 
>> ---
>>  mm/shmem.c | 25 -
>>  1 file changed, 24 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 7b43e274c9a2..ae46fb96494b 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end)
>>  static void notify_invalidate_page(struct inode *inode, struct folio *folio,
>> pgoff_t start, pgoff_t end)
>>  {
>> -#ifdef CONFIG_MEMFILE_NOTIFIER
>>  struct shmem_inode_info *info = SHMEM_I(inode);
>>  
>> +#ifdef CONFIG_MEMFILE_NOTIFIER
>>  start = max(start, folio->index);
>>  end = min(end, folio->index + folio_nr_pages(folio));
>>  
>>  memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
>>  #endif
>> +
>> +if (info->xflags & SHM_F_INACCESSIBLE)
>> +atomic64_sub(end - start, &current->mm->pinned_vm);
>
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm
> may be NULL.  Or it may be a completely different mm, e.g. AFAICT there's
> nothing that prevents a different process from punching hole in the shmem
> backing.
>

How about just not charging the mm in the first place?  There’s precedent: 
ramfs and hugetlbfs (at least sometimes — I’ve lost track of the current 
status).

In any case, for an administrator to try to assemble the various rlimits into a 
coherent policy is, and always has been, quite messy. ISTM cgroup limits, which 
can actually add across processes usefully, are much better.

So, aside from the fact that these fds aren’t in a filesystem and are thus 
available by default, I’m not convinced that this accounting is useful or 
necessary.

Maybe we could just have some switch required to enable creation of private
memory in the first place, and anyone who flips that switch without configuring
cgroups is subject to DoS.



Re: [PATCH v4 01/19] migration: Postpone releasing MigrationState.hostname

2022-04-07 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> We used to release it right after migrate_fd_connect().  That's not good
> enough when there're more than one socket pair required, because it'll be
> needed to establish TLS connection for the rest channels.
> 
> One example is multifd, where we copied over the hostname for each channel
> but that's actually not needed.
> 
> Keeping the hostname until the cleanup phase of migration.
> 
> Cc: Daniel P. Berrange 
> Signed-off-by: Peter Xu 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/channel.c   | 1 -
>  migration/migration.c | 5 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/channel.c b/migration/channel.c
> index c4fc000a1a..c6a8dcf1d7 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -96,6 +96,5 @@ void migration_channel_connect(MigrationState *s,
>  }
>  }
>  migrate_fd_connect(s, error);
> -g_free(s->hostname);
>  error_free(error);
>  }
> diff --git a/migration/migration.c b/migration/migration.c
> index 695f0f2900..281d33326b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1809,6 +1809,11 @@ static void migrate_fd_cleanup(MigrationState *s)
>  qemu_bh_delete(s->cleanup_bh);
>  s->cleanup_bh = NULL;
>  
> +if (s->hostname) {
> +g_free(s->hostname);
> +s->hostname = NULL;
> +}
> +
>  qemu_savevm_state_cleanup();
>  
>  if (s->to_dst_file) {
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH v4 02/19] migration: Drop multifd tls_hostname cache

2022-04-07 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> The hostname is cached N times, N equals to the multifd channels.
> 
> Drop that cache because after previous patch we've got s->hostname
> being alive for the whole lifecycle of migration procedure.
> 
> Cc: Juan Quintela 
> Cc: Daniel P. Berrange 
> Signed-off-by: Peter Xu 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/multifd.c | 10 +++---
>  migration/multifd.h |  2 --
>  2 files changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 76b57a7177..1be4ab5d17 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -542,8 +542,6 @@ void multifd_save_cleanup(void)
>  qemu_sem_destroy(&p->sem_sync);
>  g_free(p->name);
>  p->name = NULL;
> -g_free(p->tls_hostname);
> -p->tls_hostname = NULL;
>  multifd_pages_clear(p->pages);
>  p->pages = NULL;
>  p->packet_len = 0;
> @@ -763,7 +761,7 @@ static void multifd_tls_channel_connect(MultiFDSendParams *p,
>  Error **errp)
>  {
>  MigrationState *s = migrate_get_current();
> -const char *hostname = p->tls_hostname;
> +const char *hostname = s->hostname;
>  QIOChannelTLS *tioc;
>  
>  tioc = migration_tls_client_create(s, ioc, hostname, errp);
> @@ -787,7 +785,8 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>  MigrationState *s = migrate_get_current();
>  
>  trace_multifd_set_outgoing_channel(
> -ioc, object_get_typename(OBJECT(ioc)), p->tls_hostname, error);
> +ioc, object_get_typename(OBJECT(ioc)),
> +migrate_get_current()->hostname, error);
>  
>  if (!error) {
>  if (s->parameters.tls_creds &&
> @@ -874,7 +873,6 @@ int multifd_save_setup(Error **errp)
>  int thread_count;
>  uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
>  uint8_t i;
> -MigrationState *s;
>  
>  if (!migrate_use_multifd()) {
>  return 0;
> @@ -884,7 +882,6 @@ int multifd_save_setup(Error **errp)
>  return -1;
>  }
>  
> -s = migrate_get_current();
>  thread_count = migrate_multifd_channels();
>  multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
>  multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> @@ -909,7 +906,6 @@ int multifd_save_setup(Error **errp)
>  p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>  p->packet->version = cpu_to_be32(MULTIFD_VERSION);
>  p->name = g_strdup_printf("multifdsend_%d", i);
> -p->tls_hostname = g_strdup(s->hostname);
>  /* We need one extra place for the packet header */
>  p->iov = g_new0(struct iovec, page_count + 1);
>  p->normal = g_new0(ram_addr_t, page_count);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 4dda900a0b..3d577b98b7 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -72,8 +72,6 @@ typedef struct {
>  uint8_t id;
>  /* channel thread name */
>  char *name;
> -/* tls hostname */
> -char *tls_hostname;
>  /* channel thread id */
>  QemuThread thread;
>  /* communication channel */
> -- 
> 2.32.0
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-07 Thread Kirill A. Shutemov
On Thu, Apr 07, 2022 at 06:38:40PM +0200, Paolo Bonzini wrote:
> On 4/7/22 17:27, Kirill A. Shutemov wrote:
> > On Thu, Apr 07, 2022 at 07:28:54AM -0700, Richard Henderson wrote:
> > > On 4/7/22 06:18, Kirill A. Shutemov wrote:
> > > > > The new hook is incorrect, in that it doesn't apply to addresses along
> > > > > the tlb fast path.
> > > > 
> > > > I'm not sure what you mean by that. tlb_hit() mechanics works. We strip
> > > > the tag bits before tlb lookup.
> > > > 
> > > > Could you elaborate?
> > > 
> > > The fast path does not clear the bits, so you enter the slow path before 
> > > you
> > > get to clearing the bits.  You've lost most of the advantage of the tlb
> > > already.
> > 
> > Sorry for my ignorance, but what do you mean by fast path here?
> 
> The fast path is the TLB lookup code that is generated by the JIT compiler.
> If the TLB hits, the memory access doesn't go through any C code.  I think
> tagged addresses always fail the fast path in your patch.

Ah. Got it.

Could you point me to the key code area relevant to the topic? I'm not
familiar with the JIT side of QEMU.

-- 
 Kirill A. Shutemov



Re: [PATCH v3] ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

2022-04-07 Thread Peter Maydell
On Thu, 7 Apr 2022 at 10:21, Marc-André Lureau wrote:
>
>
>
> On Thu, Apr 7, 2022 at 12:23 PM Mauro Matteo Cascella wrote:
>>
>> Prevent potential integer overflow by limiting 'width' and 'height' to
>> 512x512. Also change 'datasize' type to size_t. Refer to security
>> advisory https://starlabs.sg/advisories/22-4206/ for more information.
>>
>> Fixes: CVE-2021-4206
>
>
> (the Starlabs advisory has 2022, I guess it's wrong then)
>
>> Signed-off-by: Mauro Matteo Cascella 
>
>
> Reviewed-by: Marc-André Lureau 

Does this fix (or any of the other cursor-related stuff I've seen
floating past) need to go into 7.0 ? (ie is it release-critical?)

thanks
-- PMM



Re: [PATCH v4 2/2] Added parameter to take screenshot with screendump as PNG

2022-04-07 Thread Dr. David Alan Gilbert
* Markus Armbruster (arm...@redhat.com) wrote:
> Dave, please have a look at the HMP compatibility issue in
> hmp-command.hx below.
> 
> Kshitij Suri  writes:
> 
> > Currently screendump only supports PPM format, which is un-compressed and
> > not standard.
> 
> If "standard" means "have to pay a standards organization $$$ to access
> the spec", PPM is not standard.  If it means "widely supported", it
> certainly is.  I'd drop "and not standard".  Suggestion, not demand.
> 
> >   Added a "format" parameter to qemu monitor screendump capabilities
> > to support PNG image capture using libpng. The param was added in QAPI
> > schema of screendump present in ui.json along with png_save() function
> > which converts pixman_image to PNG. HMP command equivalent was also
> > modified to support the feature.
> 
> Suggest to use imperative mood to describe the commit, and omit details
> that aren't necessary here:
> 
> Add a "format" parameter to QMP and HMP screendump command
>   to support PNG image capture using libpng.
> 
> >
> > Example usage:
> > { "execute": "screendump", "arguments": { "filename": "/tmp/image",
> > "format":"png" } }
> 
> Providing an example in the commit message is always nice, thanks!
> 
> >
> > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/718
> >
> > Signed-off-by: Kshitij Suri 
> >
> > Reviewed-by: Daniel P. Berrangé 
> > ---
> >  hmp-commands.hx|  11 ++---
> >  monitor/hmp-cmds.c |  12 +-
> >  qapi/ui.json   |  24 +--
> >  ui/console.c   | 101 +++--
> >  4 files changed, 136 insertions(+), 12 deletions(-)
> >
> > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > index 8476277aa9..19b7cab595 100644
> > --- a/hmp-commands.hx
> > +++ b/hmp-commands.hx
> > @@ -244,11 +244,12 @@ ERST
> >  
> >  {
> >  .name   = "screendump",
> > -.args_type  = "filename:F,device:s?,head:i?",
> > -.params = "filename [device [head]]",
> > -.help   = "save screen from head 'head' of display device 
> > 'device' "
> > -  "into PPM image 'filename'",
> > -.cmd= hmp_screendump,
> > +.args_type  = "filename:F,format:s?,device:s?,head:i?",
> 
> Incompatible change: meaning of "screendump ONE TWO" changes from
> filename=ONE, device=TWO to filename=ONE, format=TWO.
> 
> As HMP is not a stable interface, incompatible change is permissible.
> But is this one wise?
> 
> Could we add the new argument at the end instead?
> 
> .args_type  = "filename:F,device:s?,head:i?,format:s?",
> 
> Could we do *without* an argument, and derive the format from the
> filename extension?  .png means format=png, anything else format=ppm.
> Would be a bad idea for QMP.  Okay for HMP?

Could we use the new optional flag with value that Stefan Reiter
added in 26fcd76 ? (and used in 675fd3c)
In which case I think we'd have:
  "filename:F,format:-fs,device:s?,head:i?"

That would seem cleanest;  Extracting from the filename would be OKish
if it errored if the format wasn't obvious.

Dave


> > +.params = "filename [format] [device [head]]",
> 
> This tells us that parameter format can be omitted like so
> 
> screendump foo.ppm device-id
> 
> which isn't true.  Better: "filename [format [device [head]]]".
> 
> > +.help   = "save screen from head 'head' of display device 
> > 'device'"
> > +  "in specified format 'format' as image 'filename'."
> > +  "Currently only 'png' and 'ppm' formats are 
> > supported.",
> > + .cmd= hmp_screendump,
> >  .coroutine  = true,
> >  },
> >  
> > diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> > index 634968498b..2442bfa989 100644
> > --- a/monitor/hmp-cmds.c
> > +++ b/monitor/hmp-cmds.c
> > @@ -1720,9 +1720,19 @@ hmp_screendump(Monitor *mon, const QDict *qdict)
> >  const char *filename = qdict_get_str(qdict, "filename");
> >  const char *id = qdict_get_try_str(qdict, "device");
> >  int64_t head = qdict_get_try_int(qdict, "head", 0);
> > +const char *input_format  = qdict_get_try_str(qdict, "format");
> >  Error *err = NULL;
> > +ImageFormat format;
> >  
> > -qmp_screendump(filename, id != NULL, id, id != NULL, head, &err);
> > +format = qapi_enum_parse(&ImageFormat_lookup, input_format,
> > +  IMAGE_FORMAT_PPM, &err);
> > +if (err) {
> > +goto end;
> > +}
> > +
> > +qmp_screendump(filename, id != NULL, id, id != NULL, head,
> > +   input_format != NULL, format, &err);
> > +end:
> >  hmp_handle_error(mon, err);
> >  }
> >  
> > diff --git a/qapi/ui.json b/qapi/ui.json
> > index 664da9e462..24371fce05 100644
> > --- a/qapi/ui.json
> > +++ b/qapi/ui.json
> > @@ -157,12 +157,27 @@
> >  ##
> >  { 'command': 'expire_password', 'boxed': true, 'data': 
> > 'ExpirePasswordOptions' }
> >  
> > +##
> > +# @I

Re: [PATCH qemu] ppc/spapr/ddw: Add 2M pagesize

2022-04-07 Thread Daniel Henrique Barboza




On 3/21/22 04:19, Alexey Kardashevskiy wrote:

Recently the LoPAPR spec got a new 2MB pagesize to support in Dynamic DMA
Windows API (DDW), this adds the new flag.

Linux supports it since
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=38727311871

Signed-off-by: Alexey Kardashevskiy 
---
PHYP added support for it in development builds as well.
---



Reviewed-by: Daniel Henrique Barboza 


  include/hw/ppc/spapr.h  | 1 +
  hw/ppc/spapr_rtas_ddw.c | 1 +
  2 files changed, 2 insertions(+)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index f5c33dcc8616..14b01c3f5963 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -745,6 +745,7 @@ void push_sregs_to_kvm_pr(SpaprMachineState *spapr);
  #define RTAS_DDW_PGSIZE_128M 0x20
  #define RTAS_DDW_PGSIZE_256M 0x40
  #define RTAS_DDW_PGSIZE_16G  0x80
+#define RTAS_DDW_PGSIZE_2M   0x100
  
  /* RTAS tokens */

  #define RTAS_TOKEN_BASE  0x2000
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
index 3e826e1308c4..13d339c807c1 100644
--- a/hw/ppc/spapr_rtas_ddw.c
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -72,6 +72,7 @@ static uint32_t spapr_page_mask_to_query_mask(uint64_t page_mask)
  const struct { int shift; uint32_t mask; } masks[] = {
  { 12, RTAS_DDW_PGSIZE_4K },
  { 16, RTAS_DDW_PGSIZE_64K },
+{ 21, RTAS_DDW_PGSIZE_2M },
  { 24, RTAS_DDW_PGSIZE_16M },
  { 25, RTAS_DDW_PGSIZE_32M },
  { 26, RTAS_DDW_PGSIZE_64M },
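
For readers without the tree at hand, the table touched by this hunk translates a bitmask of supported page shifts into the RTAS query flags. The sketch below is a standalone approximation, not the QEMU function: the flag values for 4K through 64M and the table entries for 128M, 256M and 16G are assumptions (only the 128M, 256M, 16G and 2M values appear in the hunk above), and page_mask_to_query_mask_sketch() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Values marked "assumed" are not visible in the patch context. */
#define RTAS_DDW_PGSIZE_4K   0x01  /* assumed */
#define RTAS_DDW_PGSIZE_64K  0x02  /* assumed */
#define RTAS_DDW_PGSIZE_16M  0x04  /* assumed */
#define RTAS_DDW_PGSIZE_32M  0x08  /* assumed */
#define RTAS_DDW_PGSIZE_64M  0x10  /* assumed */
#define RTAS_DDW_PGSIZE_128M 0x20
#define RTAS_DDW_PGSIZE_256M 0x40
#define RTAS_DDW_PGSIZE_16G  0x80
#define RTAS_DDW_PGSIZE_2M   0x100

/* Bit N set in page_mask means pages of size 2^N are supported. */
static uint32_t page_mask_to_query_mask_sketch(uint64_t page_mask)
{
    static const struct { int shift; uint32_t mask; } masks[] = {
        { 12, RTAS_DDW_PGSIZE_4K },
        { 16, RTAS_DDW_PGSIZE_64K },
        { 21, RTAS_DDW_PGSIZE_2M },
        { 24, RTAS_DDW_PGSIZE_16M },
        { 25, RTAS_DDW_PGSIZE_32M },
        { 26, RTAS_DDW_PGSIZE_64M },
        { 27, RTAS_DDW_PGSIZE_128M }, /* assumed entry */
        { 28, RTAS_DDW_PGSIZE_256M }, /* assumed entry */
        { 34, RTAS_DDW_PGSIZE_16G },  /* assumed entry */
    };
    uint32_t mask = 0;

    for (size_t i = 0; i < sizeof(masks) / sizeof(masks[0]); i++) {
        if (page_mask & (1ULL << masks[i].shift)) {
            mask |= masks[i].mask;
        }
    }
    return mask;
}
```

The point of the patch is only the { 21, ... } row: without it, a 2M page size set in the page mask would be silently dropped from the query result.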




[PATCH] target/riscv/pmp: simplify NAPOT address range computation

2022-04-07 Thread Nicolas Pitre


No need for ctz64() nor special case for -1.

Signed-off-by: Nicolas Pitre 

diff --git a/target/riscv/pmp.c b/target/riscv/pmp.c
index 81b61bb65c..151da3fa08 100644
--- a/target/riscv/pmp.c
+++ b/target/riscv/pmp.c
@@ -141,17 +141,9 @@ static void pmp_decode_napot(target_ulong a, target_ulong *sa, target_ulong *ea)
0111...   2^(XLEN+2)-byte NAPOT range
...   Reserved
 */
-if (a == -1) {
-*sa = 0u;
-*ea = -1;
-return;
-} else {
-target_ulong t1 = ctz64(~a);
-target_ulong base = (a & ~(((target_ulong)1 << t1) - 1)) << 2;
-target_ulong range = ((target_ulong)1 << (t1 + 3)) - 1;
-*sa = base;
-*ea = base + range;
-}
+a = (a << 2) | 0x3;
+*sa = a & (a + 1);
+*ea = a | (a + 1);
 }
 
 void pmp_update_rule_addr(CPURISCVState *env, uint32_t pmp_index)
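
A quick way to convince oneself that the two computations agree is to run them side by side. The sketch below copies both versions into standalone functions; it assumes a 64-bit target_ulong, replaces ctz64() with a portable loop, and the names napot_old(), napot_new() and napot_agree() are made up for illustration. Shifting in the two low address bits as ones turns the encoding into "base followed by a run of ones", so a + 1 clears that run and carries into the next bit; ANDing then yields the start and ORing the end.

```c
#include <stdint.h>

/*
 * Old computation, as removed by the patch. In this sketch it is only
 * meaningful for fewer than 61 trailing one bits, since larger counts
 * would shift by 64 or more.
 */
static void napot_old(uint64_t a, uint64_t *sa, uint64_t *ea)
{
    if (a == UINT64_MAX) {
        *sa = 0;
        *ea = UINT64_MAX;
        return;
    }
    /* Count trailing one bits; portable stand-in for ctz64(~a). */
    unsigned t1 = 0;
    while ((a >> t1) & 1) {
        t1++;
    }
    uint64_t base = (a & ~(((uint64_t)1 << t1) - 1)) << 2;
    uint64_t range = ((uint64_t)1 << (t1 + 3)) - 1;
    *sa = base;
    *ea = base + range;
}

/* New computation, as added by the patch. */
static void napot_new(uint64_t a, uint64_t *sa, uint64_t *ea)
{
    a = (a << 2) | 0x3;
    *sa = a & (a + 1);
    *ea = a | (a + 1);
}

/* Returns 1 when both versions decode 'a' to the same range. */
static int napot_agree(uint64_t a)
{
    uint64_t sa1, ea1, sa2, ea2;

    napot_old(a, &sa1, &ea1);
    napot_new(a, &sa2, &ea2);
    return sa1 == sa2 && ea1 == ea2;
}
```

For example, a = 0x1007 has three trailing ones, so both versions give the 64-byte range 0x4000..0x403f; a = 0 gives 0..7, and the all-ones value wraps a + 1 to zero, giving the full address space without a special case.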



[PATCH for 7.1 1/1] block: add 'force' parameter to 'blockdev-change-medium' command

2022-04-07 Thread Denis V. Lunev
'blockdev-change-medium' is a convenient wrapper for the following
sequence of commands:
 * blockdev-open-tray
 * blockdev-remove-medium
 * blockdev-insert-medium
 * blockdev-close-tray
and should be used e.g. to change the ISO image inside the CD-ROM tray.
However, the guest can lock the tray, and some Linux guests like
CentOS 8.5 actually do that. In this case the execution of this
command results in an error like the following:
  Device 'scsi0-0-1-0' is locked and force was not specified,
  wait for tray to open and try again.

This situation can be resolved by passing the 'force' flag to
'blockdev-open-tray'. Thus it seems reasonable to add the same
capability to 'blockdev-change-medium' too.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Hanna Reitz 
CC: "Dr. David Alan Gilbert" 
CC: Eric Blake 
CC: Markus Armbruster 
---
 block/qapi-sysemu.c | 3 ++-
 monitor/hmp-cmds.c  | 4 +++-
 qapi/block.json | 6 ++
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 8498402ad4..5b4fb75787 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -318,6 +318,7 @@ void qmp_blockdev_change_medium(bool has_device, const char *device,
 bool has_id, const char *id,
 const char *filename,
 bool has_format, const char *format,
+bool has_force, bool force,
 bool has_read_only,
 BlockdevChangeReadOnlyMode read_only,
 Error **errp)
@@ -380,7 +381,7 @@ void qmp_blockdev_change_medium(bool has_device, const char *device,
 
 rc = do_open_tray(has_device ? device : NULL,
   has_id ? id : NULL,
-  false, &err);
+  has_force ? force : false, &err);
 if (rc && rc != -ENOSYS) {
 error_propagate(errp, err);
 goto fail;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 634968498b..d8b98bed6c 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1472,6 +1472,7 @@ void hmp_change(Monitor *mon, const QDict *qdict)
 const char *target = qdict_get_str(qdict, "target");
 const char *arg = qdict_get_try_str(qdict, "arg");
 const char *read_only = qdict_get_try_str(qdict, "read-only-mode");
+bool force = qdict_get_try_bool(qdict, "force", false);
 BlockdevChangeReadOnlyMode read_only_mode = 0;
 Error *err = NULL;
 
@@ -1508,7 +1509,8 @@ void hmp_change(Monitor *mon, const QDict *qdict)
 }
 
 qmp_blockdev_change_medium(true, device, false, NULL, target,
-   !!arg, arg, !!read_only, read_only_mode,
+   !!arg, arg, true, force,
+   !!read_only, read_only_mode,
&err);
 }
 
diff --git a/qapi/block.json b/qapi/block.json
index 82fcf2c914..3f100d4887 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -326,6 +326,11 @@
 # @read-only-mode: change the read-only mode of the device; defaults
 #  to 'retain'
 #
+# @force: if false (the default), an eject request through blockdev-open-tray
+# will be sent to the guest if it has locked the tray (and the tray
+# will not be opened immediately); if true, the tray will be opened
+# regardless of whether it is locked. (since 7.1)
+#
 # Features:
 # @deprecated: Member @device is deprecated.  Use @id instead.
 #
@@ -367,6 +372,7 @@
 '*id': 'str',
 'filename': 'str',
 '*format': 'str',
+'*force': 'bool',
 '*read-only-mode': 'BlockdevChangeReadOnlyMode' } }
 
 
-- 
2.32.0




Re: [PATCH for 7.1 1/1] block: add 'force' parameter to 'blockdev-change-medium' command

2022-04-07 Thread Vladimir Sementsov-Ogievskiy

07.04.2022 23:48, Denis V. Lunev wrote:

'blockdev-change-medium' is a convenient wrapper for the following
sequence of commands:
  * blockdev-open-tray
  * blockdev-remove-medium
  * blockdev-insert-medium
  * blockdev-close-tray
and should be used e.g. to change the ISO image inside the CD-ROM tray.
However, the guest can lock the tray, and some Linux guests such as
CentOS 8.5 actually do that. In this case the execution of this
command results in an error like the following:
   Device 'scsi0-0-1-0' is locked and force was not specified,
   wait for tray to open and try again.

This situation can be resolved by passing the 'force' flag to
'blockdev-open-tray'. Thus it seems reasonable to add the same
capability to 'blockdev-change-medium' too.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Hanna Reitz 
CC: "Dr. David Alan Gilbert" 
CC: Eric Blake 
CC: Markus Armbruster 
---
  block/qapi-sysemu.c | 3 ++-
  monitor/hmp-cmds.c  | 4 +++-
  qapi/block.json | 6 ++
  3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 8498402ad4..5b4fb75787 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -318,6 +318,7 @@ void qmp_blockdev_change_medium(bool has_device, const char 
*device,
  bool has_id, const char *id,
  const char *filename,
  bool has_format, const char *format,
+bool has_force, bool force,
  bool has_read_only,
  BlockdevChangeReadOnlyMode read_only,
  Error **errp)
@@ -380,7 +381,7 @@ void qmp_blockdev_change_medium(bool has_device, const char 
*device,
  
  rc = do_open_tray(has_device ? device : NULL,

has_id ? id : NULL,
-  false, &err);
+  has_force ? force : false, &err);


It's guaranteed that force is false when has_force is false (and similarly
for pointers), so that can be written as

  rc = do_open_tray(device, id, force, &err);
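
The equivalence can be sketched as follows. This is only a model of the
argument handling: it assumes, as stated above, that an optional argument
which was not supplied arrives as false (or NULL for pointers). The helper
names are hypothetical and are not part of QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Explicit form used in the patch: fall back to false when absent. */
static bool force_explicit(bool has_force, bool force)
{
    return has_force ? force : false;
}

/*
 * Simplified form: valid only under the assumption that
 * force == false whenever has_force == false.
 */
static bool force_simplified(bool has_force, bool force)
{
    (void)has_force;
    return force;
}

/* Same idea for optional strings: an absent string is assumed to be NULL. */
static const char *str_explicit(bool has_str, const char *str)
{
    return has_str ? str : NULL;
}
```

Under that assumption both forms agree for every reachable input, namely
(has_force, force) in {(false, false), (true, false), (true, true)}.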


  if (rc && rc != -ENOSYS) {
  error_propagate(errp, err);
  goto fail;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 634968498b..d8b98bed6c 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1472,6 +1472,7 @@ void hmp_change(Monitor *mon, const QDict *qdict)
  const char *target = qdict_get_str(qdict, "target");
  const char *arg = qdict_get_try_str(qdict, "arg");
  const char *read_only = qdict_get_try_str(qdict, "read-only-mode");
+bool force = qdict_get_try_bool(qdict, "force", false);
  BlockdevChangeReadOnlyMode read_only_mode = 0;
  Error *err = NULL;
  
@@ -1508,7 +1509,8 @@ void hmp_change(Monitor *mon, const QDict *qdict)

  }
  
  qmp_blockdev_change_medium(true, device, false, NULL, target,

-   !!arg, arg, !!read_only, read_only_mode,
+   !!arg, arg, true, force,
+   !!read_only, read_only_mode,
 &err);
  }
  


Should we also update hmp-commands.hx? Or you can just pass "false, false" if
you don't really need an HMP interface for the new feature.

Also, I don't know what ui/cocoa.m is, but it seems to have a call to
qmp_blockdev_change_medium(), which most probably should be updated too.


diff --git a/qapi/block.json b/qapi/block.json
index 82fcf2c914..3f100d4887 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -326,6 +326,11 @@
  # @read-only-mode: change the read-only mode of the device; defaults
  #  to 'retain'
  #
+# @force: if false (the default), an eject request through blockdev-open-tray
+# will be sent to the guest if it has locked the tray (and the tray
+# will not be opened immediately); if true, the tray will be opened
+# regardless of whether it is locked. (since 7.1)
+#
  # Features:
  # @deprecated: Member @device is deprecated.  Use @id instead.
  #
@@ -367,6 +372,7 @@
  '*id': 'str',
  'filename': 'str',
  '*format': 'str',
+'*force': 'bool',
  '*read-only-mode': 'BlockdevChangeReadOnlyMode' } }
  
  



--
Best regards,
Vladimir



Re: [PATCH for 7.1 1/1] block: add 'force' parameter to 'blockdev-change-medium' command

2022-04-07 Thread Denis V. Lunev

On 08.04.2022 00:51, 'Vladimir Sementsov-Ogievskiy' via den wrote:

07.04.2022 23:48, Denis V. Lunev wrote:

'blockdev-change-medium' is a convenient wrapper for the following
sequence of commands:
  * blockdev-open-tray
  * blockdev-remove-medium
  * blockdev-insert-medium
  * blockdev-close-tray
and should be used e.g. to change the ISO image inside the CD-ROM tray.
However, the guest can lock the tray, and some Linux guests such as
CentOS 8.5 actually do that. In this case the execution of this
command results in an error like the following:
   Device 'scsi0-0-1-0' is locked and force was not specified,
   wait for tray to open and try again.

This situation can be resolved by passing the 'force' flag to
'blockdev-open-tray'. Thus it seems reasonable to add the same
capability to 'blockdev-change-medium' too.

Signed-off-by: Denis V. Lunev 
CC: Kevin Wolf 
CC: Hanna Reitz 
CC: "Dr. David Alan Gilbert" 
CC: Eric Blake 
CC: Markus Armbruster 
---
  block/qapi-sysemu.c | 3 ++-
  monitor/hmp-cmds.c  | 4 +++-
  qapi/block.json | 6 ++
  3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/qapi-sysemu.c b/block/qapi-sysemu.c
index 8498402ad4..5b4fb75787 100644
--- a/block/qapi-sysemu.c
+++ b/block/qapi-sysemu.c
@@ -318,6 +318,7 @@ void qmp_blockdev_change_medium(bool has_device, 
const char *device,

  bool has_id, const char *id,
  const char *filename,
  bool has_format, const char *format,
+    bool has_force, bool force,
  bool has_read_only,
  BlockdevChangeReadOnlyMode read_only,
  Error **errp)
@@ -380,7 +381,7 @@ void qmp_blockdev_change_medium(bool has_device, 
const char *device,

    rc = do_open_tray(has_device ? device : NULL,
    has_id ? id : NULL,
-  false, &err);
+  has_force ? force : false, &err);


It's guaranteed that force is false when has_force is false (and
similarly for pointers), so that can be written as


  rc = do_open_tray(device, id, force, &err);

You are right :)




  if (rc && rc != -ENOSYS) {
  error_propagate(errp, err);
  goto fail;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 634968498b..d8b98bed6c 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1472,6 +1472,7 @@ void hmp_change(Monitor *mon, const QDict *qdict)
  const char *target = qdict_get_str(qdict, "target");
  const char *arg = qdict_get_try_str(qdict, "arg");
  const char *read_only = qdict_get_try_str(qdict, 
"read-only-mode");

+    bool force = qdict_get_try_bool(qdict, "force", false);
  BlockdevChangeReadOnlyMode read_only_mode = 0;
  Error *err = NULL;
  @@ -1508,7 +1509,8 @@ void hmp_change(Monitor *mon, const QDict 
*qdict)

  }
    qmp_blockdev_change_medium(true, device, false, NULL, 
target,
-   !!arg, arg, !!read_only, 
read_only_mode,

+   !!arg, arg, true, force,
+   !!read_only, read_only_mode,
 &err);
  }


Should we also update hmp-commands.hx? Or you can just pass "false,
false" if you don't really need an HMP interface for the new feature.


Good point. I have added the dictionary query, so it would be
logical to update the .hx file too.


Also, I don't know what ui/cocoa.m is, but it seems to have a call to
qmp_blockdev_change_medium(), which most probably should be updated too.



It is an Objective-C file; it looks like the macOS UI.


diff --git a/qapi/block.json b/qapi/block.json
index 82fcf2c914..3f100d4887 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -326,6 +326,11 @@
  # @read-only-mode: change the read-only mode of the device; defaults
  #  to 'retain'
  #
+# @force: if false (the default), an eject request through 
blockdev-open-tray
+# will be sent to the guest if it has locked the tray (and 
the tray
+# will not be opened immediately); if true, the tray will be 
opened

+# regardless of whether it is locked. (since 7.1)
+#
  # Features:
  # @deprecated: Member @device is deprecated.  Use @id instead.
  #
@@ -367,6 +372,7 @@
  '*id': 'str',
  'filename': 'str',
  '*format': 'str',
+    '*force': 'bool',
  '*read-only-mode': 'BlockdevChangeReadOnlyMode' } }








Re: [PATCH v9 33/45] cxl/cxl-host: Add memops for CFMWS region.

2022-04-07 Thread Tong Zhang

On 4/4/22 08:14, Jonathan Cameron wrote:
> From: Jonathan Cameron 
>
>
> +static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t *data,
> +  unsigned size, MemTxAttrs attrs)
> +{
> +CXLFixedWindow *fw = opaque;
> +PCIDevice *d;
> +
> +d = cxl_cfmws_find_device(fw, addr);
> +if (d == NULL) {
> +*data = 0;

I'm looking at this code and comparing it to the CXL 2.0 spec, section
8.2.5.12.2, CXL HDM Decoder Global Control Register (Offset 04h) table.
It seems that we should check the POISON_ON_ERR_EN bit: if this bit is
set, we return poison; otherwise we should return all 1's data.

Also, per the spec, this bit is implementation specific and hardwired
(RO) to either 1 or 0, but for the type 3 device it looks like we
currently allow it to be overwritten in the ct3d_reg_write() function.
We probably also need more sanitization in ct3d_reg_write() (also for
the HDM range/interleaving settings).
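
A minimal sketch of the behaviour suggested above, for a read that no HDM
decoder claims. The type and function names are hypothetical stand-ins
(this is not QEMU's MemTxResult or the actual CFMWS memops), and the enable
bit is passed in as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of an undecoded read; not a QEMU type. */
struct undecoded_read {
    bool poisoned;   /* true: the access is reported as poison */
    uint64_t data;   /* value handed back when not poisoned */
};

/*
 * poison_on_err_en models the "poison on decode error" enable bit.
 * size is the access width in bytes (1..8).
 */
static struct undecoded_read read_undecoded(bool poison_on_err_en,
                                            unsigned size)
{
    struct undecoded_read r = { false, 0 };

    if (poison_on_err_en) {
        r.poisoned = true;
        return r;
    }
    /* Otherwise return all 1s for the bytes that were read. */
    r.data = (size >= 8) ? UINT64_MAX
                         : ((UINT64_C(1) << (size * 8)) - 1);
    return r;
}
```

Whether the bit should be sampled from the host bridge or from the device
decoder is left open here; the sketch only shows the two outcomes.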

> +/* Reads to invalid address return poison */
> +return MEMTX_ERROR;
> +}
> +
> +return cxl_type3_read(d, addr + fw->base, data, size, attrs);
> +}
> +

- Tong



Re: [PATCH v7 00/12] Improve PMU support

2022-04-07 Thread Atish Patra
On Wed, Mar 30, 2022 at 5:01 PM Atish Patra  wrote:
>
> The latest version of the SBI specification includes a Performance Monitoring
> Unit(PMU) extension[1] which allows the supervisor to start/stop/configure
> various PMU events. The Sscofpmf ('Ss' for Privileged arch and 
> Supervisor-level
> extensions, and 'cofpmf' for Count OverFlow and Privilege Mode Filtering)
> extension[2] allows the perf like tool to handle overflow interrupts and
> filtering support.
>
> This series implements full PMU infrastructure to support
> PMU in virt machine. This will allow us to add any PMU events in future.
>
> Currently, this series enables the following PMU events.
> 1. cycle count
> 2. instruction count
> 3. DTLB load/store miss
> 4. ITLB prefetch miss
>
> The first two are computed using host ticks while last three are counted 
> during
> cpu_tlb_fill. We can do both sampling and count from guest userspace.
> This series has been tested on both RV64 and RV32. Both Linux[3] and 
> Opensbi[4]
> patches are required to get the perf working.
>
> Here is an output of perf stat/report while running hackbench with OpenSBI & 
> Linux
> kernel patches applied [3]. The kernel patches are available in upstream as 
> well.
>
> Perf stat:
> ==
> [root@fedora-riscv ~]# perf stat -e cycles -e instructions -e 
> dTLB-load-misses -e dTLB-store-misses -e iTLB-load-misses \
> > perf bench sched messaging -g 1 -l 10
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1 groups == 40 processes run
>
>  Total time: 0.265 [sec]
>
>  Performance counter stats for 'perf bench sched messaging -g 1 -l 10':
>
>  4,167,825,362  cycles
>  4,166,609,256  instructions  #1.00  insn per cycle
>  3,092,026  dTLB-load-misses
>258,280  dTLB-store-misses
>  2,068,966  iTLB-load-misses
>
>0.585791767 seconds time elapsed
>
>0.373802000 seconds user
>1.042359000 seconds sys
>
> Perf record:
> 
> [root@fedora-riscv ~]# perf record -e cycles -e instructions \
> > -e dTLB-load-misses -e dTLB-store-misses -e iTLB-load-misses -c 1 \
> > perf bench sched messaging -g 1 -l 10
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver processes per group
> # 1 groups == 40 processes run
>
>  Total time: 1.397 [sec]
> [ perf record: Woken up 10 times to write data ]
> Check IO/CPU overload!
> [ perf record: Captured and wrote 8.211 MB perf.data (214486 samples) ]
>
> [root@fedora-riscv riscv]# perf report
> Available samples
> 107K cycles
> 107K instructions
> 250 dTLB-load-misses
> 13 dTLB-store-misses
> 172 iTLB-load-misses
> ..
>
> Changes from v6->v7:
> 1. Fixed all the compilation errors for the usermode.
>
> Changes from v5->v6:
> 1. Fixed compilation issue with PATCH 1.
> 2. Addressed other comments.
>
> Changes from v4->v5:
> 1. Rebased on top of the -next with following patches.
>- isa extension
>- priv 1.12 spec
> 2. Addressed all the comments on v4
> 3. Removed additional isa-ext DT node in favor of riscv,isa string update
>
> Changes from v3->v4:
> 1. Removed the dummy events from pmu DT node.
> 2. Fixed pmu_avail_counters mask generation.
> 3. Added a patch to simplify the predicate function for counters.
>
> Changes from v2->v3:
> 1. Addressed all the comments on PATCH1-4.
> 2. Split patch1 into two separate patches.
> 3. Added explicit comments to explain the event types in DT node.
> 4. Rebased on latest Qemu.
>
> Changes from v1->v2:
> 1. Dropped the Acks from v1 as significant changes happened after v1.
> 2. sscofpmf support.
> 3. A generic counter management framework.
>
> [1] https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/riscv-sbi.adoc
> [2] https://drive.google.com/file/d/171j4jFjIkKdj5LWcExphq4xG_2sihbfd/edit
> [3] https://github.com/atishp04/linux/tree/riscv_pmu_v6
> [4] https://github.com/atishp04/qemu/tree/riscv_pmu_v7
>
> Atish Patra (12):
> target/riscv: Fix PMU CSR predicate function
> target/riscv: Implement PMU CSR predicate function for S-mode
> target/riscv: pmu: Rename the counters extension to pmu
> target/riscv: pmu: Make number of counters configurable
> target/riscv: Implement mcountinhibit CSR
> target/riscv: Add support for hpmcounters/hpmevents
> target/riscv: Support mcycle/minstret write operation
> target/riscv: Add sscofpmf extension support
> target/riscv: Simplify counter predicate function
> target/riscv: Add few cache related PMU events
> hw/riscv: virt: Add PMU DT node to the device tree
> target/riscv: Update the privilege field for sscofpmf CSRs
>
> hw/riscv/virt.c   |  28 ++
> target/riscv/cpu.c|  14 +-
> target/riscv/cpu.h|  49 ++-
>

Re: [PATCH 4/7] virtio: don't read pending event on host notifier if disabled

2022-04-07 Thread Si-Wei Liu




On 4/7/2022 12:05 AM, Jason Wang wrote:


在 2022/4/6 上午3:18, Si-Wei Liu 写道:



On 4/1/2022 7:00 PM, Jason Wang wrote:
On Sat, Apr 2, 2022 at 4:37 AM Si-Wei Liu  
wrote:



On 3/31/2022 1:36 AM, Jason Wang wrote:
On Thu, Mar 31, 2022 at 12:41 AM Si-Wei Liu 
 wrote:


On 3/30/2022 2:14 AM, Jason Wang wrote:
On Wed, Mar 30, 2022 at 2:33 PM Si-Wei Liu 
 wrote:

Previous commit prevents vhost-user and vhost-vdpa from using
userland vq handler via disable_ioeventfd_handler. The same
needs to be done for host notifier cleanup too, as the
virtio_queue_host_notifier_read handler still tends to read
pending event left behind on ioeventfd and attempts to handle
outstanding kicks from QEMU userland vq.

If vq handler is not disabled on cleanup, it may lead to sigsegv
with recursive virtio_net_set_status call on the control vq:

0  0x7f8ce3ff3387 in raise () at /lib64/libc.so.6
1  0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6
2  0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6
3  0x7f8ce3fec252 in  () at /lib64/libc.so.6
4  0x558f52d79421 in vhost_vdpa_get_vq_index
(dev=<optimized out>, idx=<optimized out>) at
../hw/virtio/vhost-vdpa.c:563
5  0x558f52d79421 in vhost_vdpa_get_vq_index
(dev=<optimized out>, idx=<optimized out>) at
../hw/virtio/vhost-vdpa.c:558
6  0x558f52d7329a in vhost_virtqueue_mask
(hdev=0x558f55c01800, vdev=0x558f568f91f0, n=2, mask=<optimized out>)
at ../hw/virtio/vhost.c:1557

I feel it's probably a bug elsewhere, e.g. when we fail to start
vhost-vDPA, it's the responsibility of QEMU to poll the host notifier
and we will fall back to the userspace vq handler.

Apologies, an incorrect stack trace was pasted, which actually came from
patch #1. I will post a v2 with the corresponding one as below:

0  0x55f800df1780 in qdev_get_parent_bus (dev=0x0) at
../hw/core/qdev.c:376
1  0x55f800c68ad8 in virtio_bus_device_iommu_enabled
(vdev=vdev@entry=0x0) at ../hw/virtio/virtio-bus.c:331
2  0x55f800d70d7f in vhost_memory_unmap (dev=<optimized out>) at
../hw/virtio/vhost.c:318
3  0x55f800d70d7f in vhost_memory_unmap (dev=<optimized out>,
buffer=0x7fc19bec5240, len=2052, is_write=1, access_len=2052) at
../hw/virtio/vhost.c:336
4  0x55f800d71867 in vhost_virtqueue_stop
(dev=dev@entry=0x55f8037ccc30, vdev=vdev@entry=0x55f8044ec590,
vq=0x55f8037cceb0, idx=0) at ../hw/virtio/vhost.c:1241
5  0x55f800d7406c in vhost_dev_stop 
(hdev=hdev@entry=0x55f8037ccc30,

vdev=vdev@entry=0x55f8044ec590) at ../hw/virtio/vhost.c:1839
6  0x55f800bf00a7 in vhost_net_stop_one (net=0x55f8037ccc30,
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:315
7  0x55f800bf0678 in vhost_net_stop 
(dev=dev@entry=0x55f8044ec590,

ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7,
cvq=cvq@entry=1)
  at ../hw/net/vhost_net.c:423
8  0x55f800d4e628 in virtio_net_set_status (status=<optimized out>,

n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
9  0x55f800d4e628 in virtio_net_set_status
(vdev=vdev@entry=0x55f8044ec590, status=15 '\017') at
../hw/net/virtio-net.c:370
I don't understand why virtio_net_handle_ctrl() calls
virtio_net_set_status()...

The pending request left over on the ctrl vq was a VIRTIO_NET_CTRL_MQ
command, i.e. in virtio_net_handle_mq():

I completely forgot that the code was actually written by me :\


1413 n->curr_queue_pairs = queue_pairs;
1414 /* stop the backend before changing the number of queue_pairs
to avoid handling a
1415  * disabled queue */
1416 virtio_net_set_status(vdev, vdev->status);
1417 virtio_net_set_queue_pairs(n);

Note that before the vDPA multiqueue support, there was never a vhost_dev
exposed for the ctrl_vq, i.e. there's no host notifier set up for the
ctrl_vq on vhost_kernel as it is emulated in QEMU software.


10 0x55f800d534d8 in virtio_net_handle_ctrl (iov_cnt=<optimized out>,
iov=<optimized out>, cmd=0 '\000', n=0x55f8044ec590) at
../hw/net/virtio-net.c:1408
11 0x55f800d534d8 in virtio_net_handle_ctrl 
(vdev=0x55f8044ec590,

vq=0x7fc1a7e888d0) at ../hw/net/virtio-net.c:1452
12 0x55f800d69f37 in virtio_queue_host_notifier_read
(vq=0x7fc1a7e888d0) at ../hw/virtio/virtio.c:2331
13 0x55f800d69f37 in virtio_queue_host_notifier_read
(n=n@entry=0x7fc1a7e8894c) at ../hw/virtio/virtio.c:3575
14 0x55f800c688e6 in virtio_bus_cleanup_host_notifier
(bus=<optimized out>, n=n@entry=14) at ../hw/virtio/virtio-bus.c:312
15 0x55f800d73106 in vhost_dev_disable_notifiers
(hdev=hdev@entry=0x55f8035b51b0, vdev=vdev@entry=0x55f8044ec590)
  at ../../../include/hw/virtio/virtio-bus.h:35
16 0x55f800bf00b2 in vhost_net_stop_one (net=0x55f8035b51b0,
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:316
17 0x55f800bf0678 in vhost_net_stop 
(dev=dev@entry=0x55f8044ec590,

ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7,
cvq=cvq@entry=1)
  at ../hw/net/vhost_net.c:423
18 0x55f800d4e628 in virtio_net_set_status (status=<optimized out>,

n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
19 0x55f800d4e628 in virtio_net_set_status (vdev=0x55f8044ec590,
status=15 '\017') at ../hw/net/virtio-net.c:370
20 0x55f800d6c4b2 in virtio_set_status (vdev=

Re: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd

2022-04-07 Thread Sean Christopherson
On Tue, Apr 05, 2022, Michael Roth wrote:
> On Thu, Mar 10, 2022 at 10:09:09PM +0800, Chao Peng wrote:
> >  static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 67349421eae3..52319f49d58a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> >  #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
> >  
> >  #ifdef CONFIG_MEMFILE_NOTIFIER
> > +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier,
> > +pgoff_t start, pgoff_t end)
> > +{
> > +   int idx;
> > +   struct kvm_memory_slot *slot = container_of(notifier,
> > +   struct kvm_memory_slot,
> > +   notifier);
> > +   struct kvm_gfn_range gfn_range = {
> > +   .slot   = slot,
> > +   .start  = start - (slot->private_offset >> PAGE_SHIFT),
> > +   .end= end - (slot->private_offset >> PAGE_SHIFT),
> > +   .may_block  = true,
> > +   };
> > +   struct kvm *kvm = slot->kvm;
> > +
> > +   gfn_range.start = max(gfn_range.start, slot->base_gfn);
> > +   gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > +
> > +   if (gfn_range.start >= gfn_range.end)
> > +   return;
> > +
> > +   idx = srcu_read_lock(&kvm->srcu);
> > +   KVM_MMU_LOCK(kvm);
> > +   kvm_unmap_gfn_range(kvm, &gfn_range);
> > +   kvm_flush_remote_tlbs(kvm);
> > +   KVM_MMU_UNLOCK(kvm);
> > +   srcu_read_unlock(&kvm->srcu, idx);
> 
> Should this also invalidate gfn_to_pfn_cache mappings? Otherwise it seems
> possible the kernel might end up inadvertently writing to now-private guest
> memory via a now-stale gfn_to_pfn_cache entry.

Yes.  Ideally we'd get these flows to share common code and avoid these goofs.
I tried very briefly but they're just different enough to make it ugly.



Re: [PATCH 1/2] target/riscv: Use cpu_loop_exit_restore directly from mmu faults

2022-04-07 Thread Alistair Francis
On Fri, Apr 1, 2022 at 11:01 PM Richard Henderson
 wrote:
>
> The riscv_raise_exception function stores its argument into
> exception_index and then exits to the main loop.  When we
> have already set exception_index, we can just exit directly.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu_helper.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index 1c60fb2e80..126251d5da 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -1150,7 +1150,7 @@ void riscv_cpu_do_transaction_failed(CPUState *cs, 
> hwaddr physaddr,
>  env->badaddr = addr;
>  env->two_stage_lookup = riscv_cpu_virt_enabled(env) ||
>  riscv_cpu_two_stage_lookup(mmu_idx);
> -riscv_raise_exception(&cpu->env, cs->exception_index, retaddr);
> +cpu_loop_exit_restore(cs, retaddr);
>  }
>
>  void riscv_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
> @@ -1175,7 +1175,7 @@ void riscv_cpu_do_unaligned_access(CPUState *cs, vaddr 
> addr,
>  env->badaddr = addr;
>  env->two_stage_lookup = riscv_cpu_virt_enabled(env) ||
>  riscv_cpu_two_stage_lookup(mmu_idx);
> -riscv_raise_exception(env, cs->exception_index, retaddr);
> +cpu_loop_exit_restore(cs, retaddr);
>  }
>
>  bool riscv_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
> @@ -1311,7 +1311,7 @@ bool riscv_cpu_tlb_fill(CPUState *cs, vaddr address, 
> int size,
>  first_stage_error,
>  riscv_cpu_virt_enabled(env) ||
>  riscv_cpu_two_stage_lookup(mmu_idx));
> -riscv_raise_exception(env, cs->exception_index, retaddr);
> +cpu_loop_exit_restore(cs, retaddr);
>  }
>
>  return true;
> --
> 2.25.1
>
>



[PULL 0/2] Fixes 20220408 patches

2022-04-07 Thread Gerd Hoffmann
The following changes since commit 95a3fcc7487e5bef262e1f937ed8636986764c4e:

  Update version for v7.0.0-rc3 release (2022-04-06 21:26:13 +0100)

are available in the Git repository at:

  git://git.kraxel.org/qemu tags/fixes-20220408-pull-request

for you to fetch changes up to fa892e9abb728e76afcf27323ab29c57fb0fe7aa:

  ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206) (2022-04-07 
12:30:54 +0200)


two cursor/qxl related security fixes.



Mauro Matteo Cascella (2):
  display/qxl-render: fix race condition in qxl_cursor (CVE-2021-4207)
  ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

 hw/display/qxl-render.c | 9 -
 hw/display/vmware_vga.c | 2 ++
 ui/cursor.c | 8 +++-
 3 files changed, 17 insertions(+), 2 deletions(-)

-- 
2.35.1





[PULL 1/2] display/qxl-render: fix race condition in qxl_cursor (CVE-2021-4207)

2022-04-07 Thread Gerd Hoffmann
From: Mauro Matteo Cascella 

Avoid fetching 'width' and 'height' a second time to prevent possible
race condition. Refer to security advisory
https://starlabs.sg/advisories/22-4207/ for more information.
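
The double-fetch pattern described above can be illustrated with a small
model. The structures and helpers below are hypothetical stand-ins, not the
QXL types: the point is only that dimensions living in guest-writable
memory are read once into a host-private copy, and every later size
computation uses that copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a header living in guest-writable memory. */
struct shared_header {
    volatile uint16_t width;
    volatile uint16_t height;
};

/* Host-private copy, filled exactly once. */
struct local_cursor {
    uint16_t width;
    uint16_t height;
};

static struct local_cursor snapshot_header(const struct shared_header *h)
{
    struct local_cursor c;

    c.width = h->width;     /* single fetch */
    c.height = h->height;   /* single fetch */
    return c;
}

/* Size of 32-bit pixel data, computed from the private copy only. */
static size_t alpha_size(const struct local_cursor *c)
{
    return sizeof(uint32_t) * c->width * c->height;
}

/* Models the guest rewriting the header after the snapshot was taken. */
static size_t size_after_guest_rewrite(void)
{
    struct shared_header h = { 16, 16 };
    struct local_cursor c = snapshot_header(&h);

    h.width = 4096;         /* racing guest write */
    h.height = 4096;
    return alpha_size(&c);  /* still based on the 16x16 snapshot */
}
```

In the patch itself the private copy is the already allocated cursor, which
is why the size is derived from c->width and c->height.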

Fixes: CVE-2021-4207
Signed-off-by: Mauro Matteo Cascella 
Reviewed-by: Marc-André Lureau 
Message-Id: <20220407081106.343235-1-mcasc...@redhat.com>
Signed-off-by: Gerd Hoffmann 
---
 hw/display/qxl-render.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
index d28849b12176..237ed293baae 100644
--- a/hw/display/qxl-render.c
+++ b/hw/display/qxl-render.c
@@ -266,7 +266,7 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl, QXLCursor 
*cursor,
 }
 break;
 case SPICE_CURSOR_TYPE_ALPHA:
-size = sizeof(uint32_t) * cursor->header.width * cursor->header.height;
+size = sizeof(uint32_t) * c->width * c->height;
 qxl_unpack_chunks(c->data, size, qxl, &cursor->chunk, group_id);
 if (qxl->debug > 2) {
 cursor_print_ascii_art(c, "qxl/alpha");
-- 
2.35.1




[PULL 2/2] ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

2022-04-07 Thread Gerd Hoffmann
From: Mauro Matteo Cascella 

Prevent potential integer overflow by limiting 'width' and 'height' to
512x512. Also change 'datasize' type to size_t. Refer to security
advisory https://starlabs.sg/advisories/22-4206/ for more information.
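
A minimal sketch of the bounds-then-multiply idea, using a hypothetical
helper rather than the real cursor_alloc(). The 512 limit comes from the
description above; rejecting non-positive dimensions and returning 0 on
rejection are assumptions of this sketch and go beyond what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CURSOR_MAX_DIM 512  /* limit taken from the patch description */

/*
 * Returns the number of bytes needed for width x height 32-bit pixels,
 * or 0 if the dimensions are rejected. The range check runs before any
 * multiplication, and the arithmetic is done in size_t.
 */
static size_t cursor_datasize(int width, int height)
{
    if (width <= 0 || height <= 0 ||
        width > CURSOR_MAX_DIM || height > CURSOR_MAX_DIM) {
        return 0;
    }
    return (size_t)width * (size_t)height * sizeof(uint32_t);
}
```

With the bound in place the largest accepted size is 512 * 512 * 4 =
1048576 bytes, far below anything that could wrap.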

Fixes: CVE-2021-4206
Signed-off-by: Mauro Matteo Cascella 
Reviewed-by: Marc-André Lureau 
Message-Id: <20220407081712.345609-1-mcasc...@redhat.com>
Signed-off-by: Gerd Hoffmann 
---
 hw/display/qxl-render.c | 7 +++
 hw/display/vmware_vga.c | 2 ++
 ui/cursor.c | 8 +++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/hw/display/qxl-render.c b/hw/display/qxl-render.c
index 237ed293baae..ca217004bf72 100644
--- a/hw/display/qxl-render.c
+++ b/hw/display/qxl-render.c
@@ -247,6 +247,13 @@ static QEMUCursor *qxl_cursor(PCIQXLDevice *qxl, QXLCursor 
*cursor,
 size_t size;
 
 c = cursor_alloc(cursor->header.width, cursor->header.height);
+
+if (!c) {
+qxl_set_guest_bug(qxl, "%s: cursor %ux%u alloc error", __func__,
+cursor->header.width, cursor->header.height);
+goto fail;
+}
+
 c->hot_x = cursor->header.hot_spot_x;
 c->hot_y = cursor->header.hot_spot_y;
 switch (cursor->header.type) {
diff --git a/hw/display/vmware_vga.c b/hw/display/vmware_vga.c
index 98c83474adb5..45d06cbe2544 100644
--- a/hw/display/vmware_vga.c
+++ b/hw/display/vmware_vga.c
@@ -515,6 +515,8 @@ static inline void vmsvga_cursor_define(struct 
vmsvga_state_s *s,
 int i, pixels;
 
 qc = cursor_alloc(c->width, c->height);
+assert(qc != NULL);
+
 qc->hot_x = c->hot_x;
 qc->hot_y = c->hot_y;
 switch (c->bpp) {
diff --git a/ui/cursor.c b/ui/cursor.c
index 1d62ddd4d072..835f0802f951 100644
--- a/ui/cursor.c
+++ b/ui/cursor.c
@@ -46,6 +46,8 @@ static QEMUCursor *cursor_parse_xpm(const char *xpm[])
 
 /* parse pixel data */
 c = cursor_alloc(width, height);
+assert(c != NULL);
+
 for (pixel = 0, y = 0; y < height; y++, line++) {
 for (x = 0; x < height; x++, pixel++) {
 idx = xpm[line][x];
@@ -91,7 +93,11 @@ QEMUCursor *cursor_builtin_left_ptr(void)
 QEMUCursor *cursor_alloc(int width, int height)
 {
 QEMUCursor *c;
-int datasize = width * height * sizeof(uint32_t);
+size_t datasize = width * height * sizeof(uint32_t);
+
+if (width > 512 || height > 512) {
+return NULL;
+}
 
 c = g_malloc0(sizeof(QEMUCursor) + datasize);
 c->width  = width;
-- 
2.35.1




Re: [PATCH v3] ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206)

2022-04-07 Thread Gerd Hoffmann
On Thu, Apr 07, 2022 at 06:46:00PM +0100, Peter Maydell wrote:
> On Thu, 7 Apr 2022 at 10:21, Marc-André Lureau
>  wrote:
> >
> >
> >
> > On Thu, Apr 7, 2022 at 12:23 PM Mauro Matteo Cascella  
> > wrote:
> >>
> >> Prevent potential integer overflow by limiting 'width' and 'height' to
> >> 512x512. Also change 'datasize' type to size_t. Refer to security
> >> advisory https://starlabs.sg/advisories/22-4206/ for more information.
> >>
> >> Fixes: CVE-2021-4206
> >
> >
> > (the Starlabs advisory has 2022, I guess it's wrong then)
> >
> >> Signed-off-by: Mauro Matteo Cascella 
> >
> >
> > Reviewed-by: Marc-André Lureau 
> 
> Does this fix (or any of the other cursor-related stuff I've seen
> floating past) need to go into 7.0 ? (ie is it release-critical?)

Yes.  The integer overflow can be triggered easily by guests.  Hitting
the double read race condition is harder but probably possible too.
Pull request sent minutes ago.

take care,
  Gerd




Re: [PATCH 1/2] gdbstub: Set current_cpu for memory read write

2022-04-07 Thread Bin Meng
On Sat, Apr 2, 2022 at 7:20 PM Bin Meng  wrote:
>
> On Tue, Mar 29, 2022 at 12:43 PM Bin Meng  wrote:
> >
> > On Mon, Mar 28, 2022 at 5:10 PM Peter Maydell  
> > wrote:
> > >
> > > On Mon, 28 Mar 2022 at 03:10, Bin Meng  wrote:
> > > > IMHO it's too bad to just ignore this bug forever.
> > > >
> > > > This is a valid use case. It's not about whether we intentionally want
> > > > to inspect the GIC register value from gdb. The case is that when
> > > > single stepping the source codes it triggers the core dump for no
> > > > reason if the instructions involved contain load/store to any of the
> > > > GIC registers.
> > >
> > > Huh? Single-stepping the instruction should execute it inside
> > > QEMU, which will do the load in the usual way. That should not
> > > be going via gdbstub reads and writes.
> >
> > Yes, single-stepping the instruction is executed in the vCPU context,
> > but a gdb client sends additional commands, more than just telling
> > QEMU to execute a single instruction.
> >
> > For example, the following is the sequence a gdb client sent when doing a 
> > "si":
> >
> > gdbstub_io_command Received: Z0,10,4
> > gdbstub_io_reply Sent: OK
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: m18c430,4
> > gdbstub_io_reply Sent: ff430091
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: vCont;s:p1.1;c:p1.-1
> > gdbstub_op_stepping Stepping CPU 0
> > gdbstub_op_continue_cpu Continuing CPU 1
> > gdbstub_op_continue_cpu Continuing CPU 2
> > gdbstub_op_continue_cpu Continuing CPU 3
> > gdbstub_hit_break RUN_STATE_DEBUG
> > gdbstub_io_reply Sent: T05thread:p01.01;
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: g
> > gdbstub_io_reply Sent:
> > 3848ed00f08fa6100300010001f930a5ec0034c41800c903
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: m18c434,4
> > gdbstub_io_reply Sent: 00e004d1
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: m18c430,4
> > gdbstub_io_reply Sent: ff430091
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: m18c434,4
> > gdbstub_io_reply Sent: 00e004d1
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: m18c400,40
> > gdbstub_io_reply Sent:
> > ff4300d1e00300f98037005840f900a0019140f900b0009140f900e004911e7800f9fe0340f91ef9ff43009100e004d174390094bb390094
> > gdbstub_io_got_ack Got ACK
> > gdbstub_io_command Received: mf901,4
> >
> > Here "mf901,4" triggers the bug where 0xf901 is the GIC register.
> >
> > This is not something QEMU can ignore or control. The logic is inside
> > the gdb client.
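
For reference, the memory reads in the trace above have the form
"m<addr>,<length>", with both fields in hexadecimal. The following is a
minimal sketch of decoding such a packet body; the struct and function
names are hypothetical and this is not QEMU's gdbstub parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical decoded form of an 'm' (read memory) packet. */
struct mem_read {
    bool ok;
    uint64_t addr;
    uint64_t len;
};

/* Parse "m<addr>,<len>"; both numbers are hexadecimal. */
static struct mem_read parse_m_packet(const char *pkt)
{
    struct mem_read r = { false, 0, 0 };
    const char *lenstr;
    char *end;

    if (pkt[0] != 'm') {
        return r;
    }
    r.addr = strtoull(pkt + 1, &end, 16);
    if (end == pkt + 1 || *end != ',') {
        return r;           /* missing address or separator */
    }
    lenstr = end + 1;
    r.len = strtoull(lenstr, &end, 16);
    r.ok = (end != lenstr && *end == '\0');
    return r;
}
```

For example, "m18c430,4" from the trace decodes to a 4-byte read at
0x18c430. Such a read is serviced by the gdbstub, not by the vCPU.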
> >
>
> Ping for this series?
>

Ping?



Re: [PATCH 2/2] target/riscv: Mark amo insns during translation

2022-04-07 Thread Alistair Francis
On Fri, Apr 1, 2022 at 11:04 PM Richard Henderson
 wrote:
>
> Atomic memory operations perform both reads and writes as part
> of their implementation, but always raise write faults.
>
> Use TARGET_INSN_START_EXTRA_WORDS to mark amo insns in the
> opcode stream, and force the access type to write at the
> point of raising the exception.
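
The effect of the override can be sketched with a small model. The enums
are hypothetical stand-ins for the MMU access types and RISC-V access-fault
causes, and the boolean plays the role of the unwind_amo flag recovered
from the instruction start data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for MMU_DATA_LOAD, MMU_DATA_STORE, MMU_INST_FETCH. */
typedef enum { ACCESS_LOAD, ACCESS_STORE, ACCESS_FETCH } access_type;

/* Hypothetical stand-ins for the load, store/AMO and instruction faults. */
typedef enum { FAULT_LOAD, FAULT_STORE_AMO, FAULT_INST } fault_kind;

static fault_kind access_fault(access_type type, bool unwind_amo)
{
    /* An AMO always reports a store/AMO fault, even for its read half. */
    if (unwind_amo) {
        type = ACCESS_STORE;
    }

    switch (type) {
    case ACCESS_STORE:
        return FAULT_STORE_AMO;
    case ACCESS_LOAD:
        return FAULT_LOAD;
    default:
        return FAULT_INST;
    }
}
```

A load that faults inside an AMO is thus reported as FAULT_STORE_AMO, while
the same load in an ordinary instruction stays FAULT_LOAD.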
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu.h  | 15 ++
>  target/riscv/cpu.c  |  3 ++
>  target/riscv/cpu_helper.c   | 62 +
>  target/riscv/translate.c|  9 
>  target/riscv/insn_trans/trans_rva.c.inc | 11 -
>  5 files changed, 79 insertions(+), 21 deletions(-)
>
> diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
> index c069fe85fa..3de4da3fa1 100644
> --- a/target/riscv/cpu.h
> +++ b/target/riscv/cpu.h
> @@ -290,6 +290,13 @@ struct CPUArchState {
>  /* True if in debugger mode.  */
>  bool debugger;
>
> +/*
> + * True if unwinding through an amo insn.  Used to transform a
> + * read fault into a store_amo fault; only valid immediately
> + * after cpu_restore_state().
> + */
> +bool unwind_amo;
> +
>  /*
>   * CSRs for PointerMasking extension
>   */
> @@ -517,6 +524,14 @@ FIELD(TB_FLAGS, XL, 20, 2)
>  FIELD(TB_FLAGS, PM_MASK_ENABLED, 22, 1)
>  FIELD(TB_FLAGS, PM_BASE_ENABLED, 23, 1)
>
> +#ifndef CONFIG_USER_ONLY
> +/*
> + * RISC-V-specific extra insn start words:
> + * 1: True if the instruction is AMO, false otherwise.
> + */
> +#define TARGET_INSN_START_EXTRA_WORDS 1
> +#endif
> +
>  #ifdef TARGET_RISCV32
>  #define riscv_cpu_mxl(env)  ((void)(env), MXL_RV32)
>  #else
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index ddda4906ff..3818d5ba80 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -396,6 +396,9 @@ void restore_state_to_opc(CPURISCVState *env, TranslationBlock *tb,
>  } else {
>  env->pc = data[0];
>  }
> +#ifndef CONFIG_USER_ONLY
> +env->unwind_amo = data[1];
> +#endif
>  }
>
>  static void riscv_cpu_reset(DeviceState *dev)
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index 126251d5da..b5bbe6fc39 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -1139,26 +1139,11 @@ void riscv_cpu_do_transaction_failed(CPUState *cs, hwaddr physaddr,
>  RISCVCPU *cpu = RISCV_CPU(cs);
>  CPURISCVState *env = &cpu->env;
>
> -if (access_type == MMU_DATA_STORE) {
> -cs->exception_index = RISCV_EXCP_STORE_AMO_ACCESS_FAULT;
> -} else if (access_type == MMU_DATA_LOAD) {
> -cs->exception_index = RISCV_EXCP_LOAD_ACCESS_FAULT;
> -} else {
> -cs->exception_index = RISCV_EXCP_INST_ACCESS_FAULT;
> +cpu_restore_state(cs, retaddr, true);
> +if (env->unwind_amo) {
> +access_type = MMU_DATA_STORE;
>  }
>
> -env->badaddr = addr;
> -env->two_stage_lookup = riscv_cpu_virt_enabled(env) ||
> -riscv_cpu_two_stage_lookup(mmu_idx);
> -cpu_loop_exit_restore(cs, retaddr);
> -}
> -
> -void riscv_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
> -   MMUAccessType access_type, int mmu_idx,
> -   uintptr_t retaddr)
> -{
> -RISCVCPU *cpu = RISCV_CPU(cs);
> -CPURISCVState *env = &cpu->env;
>  switch (access_type) {
>  case MMU_INST_FETCH:
>  cs->exception_index = RISCV_EXCP_INST_ADDR_MIS;
> @@ -1172,10 +1157,43 @@ void riscv_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
>  default:
>  g_assert_not_reached();
>  }
> +
>  env->badaddr = addr;
>  env->two_stage_lookup = riscv_cpu_virt_enabled(env) ||
>  riscv_cpu_two_stage_lookup(mmu_idx);
> -cpu_loop_exit_restore(cs, retaddr);
> +cpu_loop_exit(cs);
> +}
> +
> +void riscv_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
> +   MMUAccessType access_type, int mmu_idx,
> +   uintptr_t retaddr)
> +{
> +RISCVCPU *cpu = RISCV_CPU(cs);
> +CPURISCVState *env = &cpu->env;
> +
> +cpu_restore_state(cs, retaddr, true);
> +if (env->unwind_amo) {
> +access_type = MMU_DATA_STORE;
> +}
> +
> +switch (access_type) {
> +case MMU_INST_FETCH:
> +cs->exception_index = RISCV_EXCP_INST_ADDR_MIS;
> +break;
> +case MMU_DATA_LOAD:
> +cs->exception_index = RISCV_EXCP_LOAD_ADDR_MIS;
> +break;
> +case MMU_DATA_STORE:
> +cs->exception_index = RISCV_EXCP_STORE_AMO_ADDR_MIS;
> +break;
> +default:
> +g_assert_not_reached();
> +}
> +
> +env->badaddr = addr;
> +env->two_stage_lookup = riscv_cpu_virt_enabled(env) ||
> +riscv_cpu_two_stage_lookup(mmu_idx);
> +cpu_loop_exit(cs);
>  }
>
>  bool riscv_cpu_tlb_fill(CPUState