date:20220408

Re: [PATCH v7 11/12] hw/riscv: virt: Add PMU DT node to the device tree

2022-04-08 Thread Alistair Francis

On Thu, Mar 31, 2022 at 10:18 AM Atish Patra  wrote:
>
> Qemu virt machine can support few cache events and cycle/instret counters.
> It also supports counter overflow for these events.
>
> Add a DT node so that OpenSBI/Linux kernel is aware of the virt machine
> capabilities. There are some dummy nodes added for testing as well.
>
> Signed-off-by: Atish Patra 
> Signed-off-by: Atish Patra 

Acked-by: Alistair Francis 

Alistair

> ---
>  hw/riscv/virt.c| 28 +++
>  target/riscv/cpu.c |  1 +
>  target/riscv/pmu.c | 57 ++
>  target/riscv/pmu.h |  1 +
>  4 files changed, 87 insertions(+)
>
> diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> index da50cbed43ec..13d61bf476ff 100644
> --- a/hw/riscv/virt.c
> +++ b/hw/riscv/virt.c
> @@ -28,6 +28,7 @@
>  #include "hw/qdev-properties.h"
>  #include "hw/char/serial.h"
>  #include "target/riscv/cpu.h"
> +#include "target/riscv/pmu.h"
>  #include "hw/riscv/riscv_hart.h"
>  #include "hw/riscv/virt.h"
>  #include "hw/riscv/boot.h"
> @@ -687,6 +688,32 @@ static void create_fdt_socket_aplic(RISCVVirtState *s,
>  aplic_phandles[socket] = aplic_s_phandle;
>  }
>
> +static void create_fdt_socket_pmu(RISCVVirtState *s,
> +  int socket, uint32_t *phandle,
> +  uint32_t *intc_phandles)
> +{
> +int cpu;
> +char *pmu_name;
> +uint32_t *pmu_cells;
> +MachineState *mc = MACHINE(s);
> +RISCVCPU hart = s->soc[socket].harts[0];
> +
> +pmu_cells = g_new0(uint32_t, s->soc[socket].num_harts * 2);
> +
> +for (cpu = 0; cpu < s->soc[socket].num_harts; cpu++) {
> +pmu_cells[cpu * 2 + 0] = cpu_to_be32(intc_phandles[cpu]);
> +pmu_cells[cpu * 2 + 1] = cpu_to_be32(IRQ_PMU_OVF);
> +}
> +
> +pmu_name = g_strdup_printf("/soc/pmu");
> +qemu_fdt_add_subnode(mc->fdt, pmu_name);
> +qemu_fdt_setprop_string(mc->fdt, pmu_name, "compatible", "riscv,pmu");
> +riscv_pmu_generate_fdt_node(mc->fdt, hart.cfg.pmu_num, pmu_name);
> +
> +g_free(pmu_name);
> +g_free(pmu_cells);
> +}
> +
>  static void create_fdt_sockets(RISCVVirtState *s, const MemMapEntry *memmap,
> bool is_32_bit, uint32_t *phandle,
> uint32_t *irq_mmio_phandle,
> @@ -732,6 +759,7 @@ static void create_fdt_sockets(RISCVVirtState *s, const 
> MemMapEntry *memmap,
>  &intc_phandles[phandle_pos]);
>  }
>  }
> +create_fdt_socket_pmu(s, socket, phandle, intc_phandles);
>  }
>
>  if (s->aia_type == VIRT_AIA_TYPE_APLIC_IMSIC) {
> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
> index 9715eed2fc4e..d834e58a8bcd 100644
> --- a/target/riscv/cpu.c
> +++ b/target/riscv/cpu.c
> @@ -974,6 +974,7 @@ static void riscv_isa_string_ext(RISCVCPU *cpu, char 
> **isa_str, int max_str_len)
>  ISA_EDATA_ENTRY(zbs, ext_zbs),
>  ISA_EDATA_ENTRY(zve32f, ext_zve32f),
>  ISA_EDATA_ENTRY(zve64f, ext_zve64f),
> +ISA_EDATA_ENTRY(sscofpmf, ext_sscofpmf),
>  ISA_EDATA_ENTRY(svinval, ext_svinval),
>  ISA_EDATA_ENTRY(svnapot, ext_svnapot),
>  ISA_EDATA_ENTRY(svpbmt, ext_svpbmt),
> diff --git a/target/riscv/pmu.c b/target/riscv/pmu.c
> index 1c586770682b..f5e3e6d0281e 100644
> --- a/target/riscv/pmu.c
> +++ b/target/riscv/pmu.c
> @@ -20,11 +20,68 @@
>  #include "cpu.h"
>  #include "pmu.h"
>  #include "sysemu/cpu-timers.h"
> +#include "sysemu/device_tree.h"
>
>  #define RISCV_TIMEBASE_FREQ 10 /* 1Ghz */
>  #define MAKE_32BIT_MASK(shift, length) \
>  (((uint32_t)(~0UL) >> (32 - (length))) << (shift))
>
> +/**
> + * To keep it simple, any event can be mapped to any programmable counters in
> + * QEMU. The generic cycle & instruction count events can also be monitored
> + * using programmable counters. In that case, mcycle & minstret must continue
> + * to provide the correct value as well. Heterogeneous PMU per hart is not
> + * supported yet. Thus, number of counters are same across all harts.
> + */
> +void riscv_pmu_generate_fdt_node(void *fdt, int num_ctrs, char *pmu_name)
> +{
> +uint32_t fdt_event_ctr_map[20] = {};
> +uint32_t cmask;
> +
> +/* All the programmable counters can map to any event */
> +cmask = MAKE_32BIT_MASK(3, num_ctrs);
> +
> +   /**
> +* The event encoding is specified in the SBI specification
> +* Event idx is a 20bits wide number encoded as follows:
> +* event_idx[19:16] = type
> +* event_idx[15:0] = code
> +* The code field in cache events are encoded as follows:
> +* event_idx.code[15:3] = cache_id
> +* event_idx.code[2:1] = op_id
> +* event_idx.code[0:0] = result_id
> +*/
> +
> +   /* SBI_PMU_HW_CPU_CYCLES: 0x01 : type(0x00) */
> +   fdt_event_ctr_map[0] = cpu_to_be32(0x0001);
> +   fdt_event_ctr_map[1] = cpu_to_be32(0x0001);
> +   fdt_event_ctr_map[2] = cpu_to_be32(cmask | 1 << 0);
> +
> +   /* SBI_P

[PATCH v5 0/2] Option to take screenshot with screendump as PNG

2022-04-08 Thread Kshitij Suri

This patch series aims to add PNG support using libpng to screendump method.
Currently screendump only supports PPM format, which is uncompressed.

PATCH 1 phases out CONFIG_VNC_PNG parameter and replaces it with CONFIG_PNG
which detects libpng support.

PATCH 2 contains core logic for PNG creation from pixman using libpng. HMP
command equivalent is also implemented in this patch.

v4->v5
 - Modified format as a flag based optional parameter in HMP.

v3->v4
 - Added condition to check for libpng only in PNG option is allowed

v2->v3
 - HMP implementation fixes for png.
 - Used enum for image format.
 - Fixed description and updated QEMU support version.

v1->v2:
 - Removed repeated alpha conversion operation.
 - Modified logic to mirror png conversion in vnc-enc-tight.c file.
 - Added a new CONFIG_PNG parameter for libpng support.
 - Changed input format to enum instead of string.
 - Improved error handling.

Kshitij Suri (2):
  Replacing CONFIG_VNC_PNG with CONFIG_PNG
  Added parameter to take screenshot with screendump as PNG

 hmp-commands.hx|  11 ++---
 meson.build|  12 +++---
 meson_options.txt  |   4 +-
 monitor/hmp-cmds.c |  12 +-
 qapi/ui.json   |  24 +--
 ui/console.c   | 101 +++--
 ui/vnc-enc-tight.c |  18 
 ui/vnc.c   |   4 +-
 ui/vnc.h   |   2 +-
 9 files changed, 157 insertions(+), 31 deletions(-)

-- 
2.22.3

[PATCH v5 1/2] Replacing CONFIG_VNC_PNG with CONFIG_PNG

2022-04-08 Thread Kshitij Suri

Libpng is only detected if VNC is enabled currently. This patch adds a
generalised png option in the meson build which is aimed to replace use of
CONFIG_VNC_PNG with CONFIG_PNG.

Signed-off-by: Kshitij Suri 

Reviewed-by: Daniel P. Berrangé 
---
 meson.build| 12 +++-
 meson_options.txt  |  4 ++--
 ui/vnc-enc-tight.c | 18 +-
 ui/vnc.c   |  4 ++--
 ui/vnc.h   |  2 +-
 5 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/meson.build b/meson.build
index 282e7c4650..0790ccef99 100644
--- a/meson.build
+++ b/meson.build
@@ -1115,14 +1115,16 @@ if gtkx11.found()
   x11 = dependency('x11', method: 'pkg-config', required: gtkx11.found(),
kwargs: static_kwargs)
 endif
-vnc = not_found
 png = not_found
+if get_option('png').allowed() and have_system
+   png = dependency('libpng', required: get_option('png'),
+method: 'pkg-config', kwargs: static_kwargs)
+endif
+vnc = not_found
 jpeg = not_found
 sasl = not_found
 if get_option('vnc').allowed() and have_system
   vnc = declare_dependency() # dummy dependency
-  png = dependency('libpng', required: get_option('vnc_png'),
-   method: 'pkg-config', kwargs: static_kwargs)
   jpeg = dependency('libjpeg', required: get_option('vnc_jpeg'),
 method: 'pkg-config', kwargs: static_kwargs)
   sasl = cc.find_library('sasl2', has_headers: ['sasl/sasl.h'],
@@ -1554,9 +1556,9 @@ config_host_data.set('CONFIG_TPM', have_tpm)
 config_host_data.set('CONFIG_USB_LIBUSB', libusb.found())
 config_host_data.set('CONFIG_VDE', vde.found())
 config_host_data.set('CONFIG_VHOST_USER_BLK_SERVER', 
have_vhost_user_blk_server)
+config_host_data.set('CONFIG_PNG', png.found())
 config_host_data.set('CONFIG_VNC', vnc.found())
 config_host_data.set('CONFIG_VNC_JPEG', jpeg.found())
-config_host_data.set('CONFIG_VNC_PNG', png.found())
 config_host_data.set('CONFIG_VNC_SASL', sasl.found())
 config_host_data.set('CONFIG_VIRTFS', have_virtfs)
 config_host_data.set('CONFIG_VTE', vte.found())
@@ -3638,11 +3640,11 @@ summary_info += {'curses support':curses}
 summary_info += {'virgl support': virgl}
 summary_info += {'curl support':  curl}
 summary_info += {'Multipath support': mpathpersist}
+summary_info += {'PNG support':   png}
 summary_info += {'VNC support':   vnc}
 if vnc.found()
   summary_info += {'VNC SASL support':  sasl}
   summary_info += {'VNC JPEG support':  jpeg}
-  summary_info += {'VNC PNG support':   png}
 endif
 if targetos not in ['darwin', 'haiku', 'windows']
   summary_info += {'OSS support': oss}
diff --git a/meson_options.txt b/meson_options.txt
index 52b11cead4..d85734f8e6 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -177,12 +177,12 @@ option('vde', type : 'feature', value : 'auto',
description: 'vde network backend support')
 option('virglrenderer', type : 'feature', value : 'auto',
description: 'virgl rendering support')
+option('png', type : 'feature', value : 'auto',
+   description: 'PNG support with libpng')
 option('vnc', type : 'feature', value : 'auto',
description: 'VNC server')
 option('vnc_jpeg', type : 'feature', value : 'auto',
description: 'JPEG lossy compression for VNC server')
-option('vnc_png', type : 'feature', value : 'auto',
-   description: 'PNG compression for VNC server')
 option('vnc_sasl', type : 'feature', value : 'auto',
description: 'SASL authentication for VNC server')
 option('vte', type : 'feature', value : 'auto',
diff --git a/ui/vnc-enc-tight.c b/ui/vnc-enc-tight.c
index 7b86a4713d..e879cca7f5 100644
--- a/ui/vnc-enc-tight.c
+++ b/ui/vnc-enc-tight.c
@@ -32,7 +32,7 @@
INT32 definitions between jmorecfg.h (included by jpeglib.h) and
Win32 basetsd.h (included by windows.h). */
 
-#ifdef CONFIG_VNC_PNG
+#ifdef CONFIG_PNG
 /* The following define is needed by pngconf.h. Otherwise it won't compile,
because setjmp.h was already included by qemu-common.h. */
 #define PNG_SKIP_SETJMP_CHECK
@@ -95,7 +95,7 @@ static const struct {
 };
 #endif
 
-#ifdef CONFIG_VNC_PNG
+#ifdef CONFIG_PNG
 static const struct {
 int png_zlib_level, png_filters;
 } tight_png_conf[] = {
@@ -919,7 +919,7 @@ static int send_full_color_rect(VncState *vs, int x, int y, 
int w, int h)
 int stream = 0;
 ssize_t bytes;
 
-#ifdef CONFIG_VNC_PNG
+#ifdef CONFIG_PNG
 if (tight_can_send_png_rect(vs, w, h)) {
 return send_png_rect(vs, x, y, w, h, NULL);
 }
@@ -966,7 +966,7 @@ static int send_mono_rect(VncState *vs, int x, int y,
 int stream = 1;
 int level = tight_conf[vs->tight->compression].mono_zlib_level;
 
-#ifdef CONFIG_VNC_PNG
+#ifdef CONFIG_PNG
 if (tight_can_send_png_rect(vs, w, h)) {
 int ret;
 int bpp = vs->client_pf.bytes_per_pixel * 8;
@@ -1020,7 +1020,7 @@ static int send_mono_rect(VncState *vs, int x, int y,
 struct palette_cb_priv {
 VncState *vs;
 uint8_t *header;
-#ifdef CONFIG_VNC_PNG
+#i

[PATCH v5 2/2] Added parameter to take screenshot with screendump as PNG

2022-04-08 Thread Kshitij Suri

Currently screendump only supports PPM format, which is un-compressed. Added
a "format" parameter to QMP and HMP screendump command to support PNG image
capture using libpng.

QMP example usage:
{ "execute": "screendump", "arguments": { "filename": "/tmp/image",
"format":"png" } }

HMP example usage:
screendump /tmp/image -f png

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/718

Signed-off-by: Kshitij Suri 

Reviewed-by: Daniel P. Berrangé 
---
diff to v4:
  - Modified format to be an optional flag based parameter in HMP.

 hmp-commands.hx|  11 ++---
 monitor/hmp-cmds.c |  12 +-
 qapi/ui.json   |  24 +--
 ui/console.c   | 101 +++--
 4 files changed, 136 insertions(+), 12 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 8476277aa9..808020d005 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -244,11 +244,12 @@ ERST
 
 {
 .name   = "screendump",
-.args_type  = "filename:F,device:s?,head:i?",
-.params = "filename [device [head]]",
-.help   = "save screen from head 'head' of display device 'device' 
"
-  "into PPM image 'filename'",
-.cmd= hmp_screendump,
+.args_type  = "filename:F,format:-fs,device:s?,head:i?",
+.params = "filename [-f format] [device [head]]",
+.help   = "save screen from head 'head' of display device 'device'"
+  "in specified format 'format' as image 'filename'."
+  "Currently only 'png' and 'ppm' formats are supported.",
+ .cmd= hmp_screendump,
 .coroutine  = true,
 },
 
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 634968498b..2442bfa989 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1720,9 +1720,19 @@ hmp_screendump(Monitor *mon, const QDict *qdict)
 const char *filename = qdict_get_str(qdict, "filename");
 const char *id = qdict_get_try_str(qdict, "device");
 int64_t head = qdict_get_try_int(qdict, "head", 0);
+const char *input_format  = qdict_get_try_str(qdict, "format");
 Error *err = NULL;
+ImageFormat format;
 
-qmp_screendump(filename, id != NULL, id, id != NULL, head, &err);
+format = qapi_enum_parse(&ImageFormat_lookup, input_format,
+  IMAGE_FORMAT_PPM, &err);
+if (err) {
+goto end;
+}
+
+qmp_screendump(filename, id != NULL, id, id != NULL, head,
+   input_format != NULL, format, &err);
+end:
 hmp_handle_error(mon, err);
 }
 
diff --git a/qapi/ui.json b/qapi/ui.json
index 664da9e462..98f0126999 100644
--- a/qapi/ui.json
+++ b/qapi/ui.json
@@ -157,12 +157,27 @@
 ##
 { 'command': 'expire_password', 'boxed': true, 'data': 'ExpirePasswordOptions' 
}
 
+##
+# @ImageFormat:
+#
+# Supported image format types.
+#
+# @png: PNG format
+#
+# @ppm: PPM format
+#
+# Since: 7.1
+#
+##
+{ 'enum': 'ImageFormat',
+  'data': ['ppm', 'png'] }
+
 ##
 # @screendump:
 #
-# Write a PPM of the VGA screen to a file.
+# Capture the contents of a screen and write it to a file.
 #
-# @filename: the path of a new PPM file to store the image
+# @filename: the path of a new file to store the image
 #
 # @device: ID of the display device that should be dumped. If this parameter
 #  is missing, the primary display will be used. (Since 2.12)
@@ -171,6 +186,8 @@
 #parameter is missing, head #0 will be used. Also note that the head
 #can only be specified in conjunction with the device ID. (Since 2.12)
 #
+# @format: image format for screendump. (default: ppm) (Since 7.1)
+#
 # Returns: Nothing on success
 #
 # Since: 0.14
@@ -183,7 +200,8 @@
 #
 ##
 { 'command': 'screendump',
-  'data': {'filename': 'str', '*device': 'str', '*head': 'int'},
+  'data': {'filename': 'str', '*device': 'str', '*head': 'int',
+   '*format': 'ImageFormat'},
   'coroutine': true }
 
 ##
diff --git a/ui/console.c b/ui/console.c
index da434ce1b2..f42f64d556 100644
--- a/ui/console.c
+++ b/ui/console.c
@@ -37,6 +37,9 @@
 #include "exec/memory.h"
 #include "io/channel-file.h"
 #include "qom/object.h"
+#ifdef CONFIG_PNG
+#include 
+#endif
 
 #define DEFAULT_BACKSCROLL 512
 #define CONSOLE_CURSOR_PERIOD 500
@@ -291,6 +294,89 @@ void graphic_hw_invalidate(QemuConsole *con)
 }
 }
 
+#ifdef CONFIG_PNG
+/**
+ * png_save: Take a screenshot as PNG
+ *
+ * Saves screendump as a PNG file
+ *
+ * Returns true for success or false for error.
+ *
+ * @fd: File descriptor for PNG file.
+ * @image: Image data in pixman format.
+ * @errp: Pointer to an error.
+ */
+static bool png_save(int fd, pixman_image_t *image, Error **errp)
+{
+int width = pixman_image_get_width(image);
+int height = pixman_image_get_height(image);
+g_autofree png_struct *png_ptr = NULL;
+g_autofree png_info *info_ptr = NULL;
+g_autoptr(pixman_image_t) linebuf =
+qemu_pixman_linebuf_create(PIXMAN_a

Re: [PATCH 1/3] vhost: Refactor vhost_reset_device() in VhostOps

2022-04-08 Thread Michael Qiu





在 2022/4/7 15:35, Jason Wang 写道:


在 2022/4/2 下午1:14, Michael Qiu 写道:



On 2022/4/2 10:38, Jason Wang wrote:


在 2022/4/1 下午7:06, Michael Qiu 写道:

Currently in vhost framwork, vhost_reset_device() is misnamed.
Actually, it should be vhost_reset_owner().

In vhost user, it make compatible with reset device ops, but
vhost kernel does not compatible with it, for vhost vdpa, it
only implement reset device action.

So we need seperate the function into vhost_reset_owner() and
vhost_reset_device(). So that different backend could use the
correct function.



I see no reason when RESET_OWNER needs to be done for kernel backend.



In kernel vhost, RESET_OWNER  indeed do vhost device level reset: 
vhost_net_reset_owner()


static long vhost_net_reset_owner(struct vhost_net *n)
{
[...]
    err = vhost_dev_check_owner(&n->dev);
    if (err)
    goto done;
    umem = vhost_dev_reset_owner_prepare();
    if (!umem) {
    err = -ENOMEM;
    goto done;
    }
    vhost_net_stop(n, &tx_sock, &rx_sock);
    vhost_net_flush(n);
    vhost_dev_stop(&n->dev);
    vhost_dev_reset_owner(&n->dev, umem);
    vhost_net_vq_reset(n);
[...]

}

In the history of QEMU, There is a commit:
commit d1f8b30ec8dde0318fd1b98d24a64926feae9625
Author: Yuanhan Liu 
Date:   Wed Sep 23 12:19:57 2015 +0800

    vhost: rename VHOST_RESET_OWNER to VHOST_RESET_DEVICE

    Quote from Michael:

    We really should rename VHOST_RESET_OWNER to VHOST_RESET_DEVICE.

but finally, it has been reverted by the author:
commit 60915dc4691768c4dc62458bb3e16c843fab091d
Author: Yuanhan Liu 
Date:   Wed Nov 11 21:24:37 2015 +0800

    vhost: rename RESET_DEVICE backto RESET_OWNER

    This patch basically reverts commit d1f8b30e.

    It turned out that it breaks stuff, so revert it:

http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg00949.html

Seems kernel take RESET_OWNER for reset,but QEMU never call to this 
function to do a reset.



The question is, we manage to survive by not using RESET_OWNER for past 
10 years. Any reason that we want to use that now?


Note that the RESET_OWNER is only useful the process want to drop the 
its mm refcnt from vhost, it doesn't reset the device (e.g it does not 
even call vhost_vq_reset()).


(Especially, it was deprecated in by the vhost-user protocol since its 
semantics is ambiguous)





So, you prefer to directly remove RESET_OWNER support now?



And if I understand the code correctly, vhost-user "abuse" 
RESET_OWNER for reset. So the current code looks fine?





Signde-off-by: Michael Qiu 
---
  hw/scsi/vhost-user-scsi.c |  6 +-
  hw/virtio/vhost-backend.c |  4 ++--
  hw/virtio/vhost-user.c    | 22 ++
  include/hw/virtio/vhost-backend.h |  2 ++
  4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 1b2f7ee..f179626 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -80,8 +80,12 @@ static void vhost_user_scsi_reset(VirtIODevice 
*vdev)

  return;
  }
-    if (dev->vhost_ops->vhost_reset_device) {
+    if (virtio_has_feature(dev->protocol_features,
+ VHOST_USER_PROTOCOL_F_RESET_DEVICE) &&
+ dev->vhost_ops->vhost_reset_device) {
  dev->vhost_ops->vhost_reset_device(dev);
+    } else if (dev->vhost_ops->vhost_reset_owner) {
+    dev->vhost_ops->vhost_reset_owner(dev);



Actually, I fail to understand why we need an indirection via 
vhost_ops. It's guaranteed to be vhost_user_ops.




  }
  }
diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index e409a86..abbaa8b 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -191,7 +191,7 @@ static int vhost_kernel_set_owner(struct 
vhost_dev *dev)

  return vhost_kernel_call(dev, VHOST_SET_OWNER, NULL);
  }
-static int vhost_kernel_reset_device(struct vhost_dev *dev)
+static int vhost_kernel_reset_owner(struct vhost_dev *dev)
  {
  return vhost_kernel_call(dev, VHOST_RESET_OWNER, NULL);
  }
@@ -317,7 +317,7 @@ const VhostOps kernel_ops = {
  .vhost_get_features = vhost_kernel_get_features,
  .vhost_set_backend_cap = vhost_kernel_set_backend_cap,
  .vhost_set_owner = vhost_kernel_set_owner,
-    .vhost_reset_device = vhost_kernel_reset_device,
+    .vhost_reset_owner = vhost_kernel_reset_owner,



I think we can delete the current vhost_reset_device() since it not 
used in any code path.




I planned to use it for vDPA reset, 



For vhost-vDPA it can call vhost_vdpa_reset_device() directly.

As I mentioned before, the only user of vhost_reset_device config ops is 
vhost-user-scsi but it should directly call the vhost_user_reset_device().





Yes, but in the next patch I reuse it to reset backend device in vhost_net.



Thanks



and vhost-user-scsi also use device reset.

Thanks,
Michael


Thanks



  .vhost_get_vq_index = vhost_kern

Re: [PATCH 1/2] gdbstub: Set current_cpu for memory read write

2022-04-08 Thread Alex Bennée



Bin Meng  writes:

> On Sat, Apr 2, 2022 at 7:20 PM Bin Meng  wrote:
>>
>> On Tue, Mar 29, 2022 at 12:43 PM Bin Meng  wrote:
>> >
>> > On Mon, Mar 28, 2022 at 5:10 PM Peter Maydell  
>> > wrote:
>> > >
>> > > On Mon, 28 Mar 2022 at 03:10, Bin Meng  wrote:
>> > > > IMHO it's too bad to just ignore this bug forever.
>> > > >
>> > > > This is a valid use case. It's not about whether we intentionally want
>> > > > to inspect the GIC register value from gdb. The case is that when
>> > > > single stepping the source codes it triggers the core dump for no
>> > > > reason if the instructions involved contain load/store to any of the
>> > > > GIC registers.
>> > >
>> > > Huh? Single-stepping the instruction should execute it inside
>> > > QEMU, which will do the load in the usual way. That should not
>> > > be going via gdbstub reads and writes.
>> >
>> > Yes, single-stepping the instruction is executed in the vCPU context,
>> > but a gdb client sends additional commands, more than just telling
>> > QEMU to execute a single instruction.
>> >
>> > For example, the following is the sequence a gdb client sent when doing a 
>> > "si":
>> >
>> > gdbstub_io_command Received: Z0,10,4
>> > gdbstub_io_reply Sent: OK
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: m18c430,4
>> > gdbstub_io_reply Sent: ff430091
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: vCont;s:p1.1;c:p1.-1
>> > gdbstub_op_stepping Stepping CPU 0
>> > gdbstub_op_continue_cpu Continuing CPU 1
>> > gdbstub_op_continue_cpu Continuing CPU 2
>> > gdbstub_op_continue_cpu Continuing CPU 3
>> > gdbstub_hit_break RUN_STATE_DEBUG
>> > gdbstub_io_reply Sent: T05thread:p01.01;
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: g
>> > gdbstub_io_reply Sent:
>> > 3848ed00f08fa6100300010001f930a5ec0034c41800c903
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: m18c434,4
>> > gdbstub_io_reply Sent: 00e004d1
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: m18c430,4
>> > gdbstub_io_reply Sent: ff430091
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: m18c434,4
>> > gdbstub_io_reply Sent: 00e004d1
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: m18c400,40
>> > gdbstub_io_reply Sent:
>> > ff4300d1e00300f98037005840f900a0019140f900b0009140f900e004911e7800f9fe0340f91ef9ff43009100e004d174390094bb390094
>> > gdbstub_io_got_ack Got ACK
>> > gdbstub_io_command Received: mf901,4
>> >
>> > Here "mf901,4" triggers the bug where 0xf901 is the GIC register.
>> >
>> > This is not something QEMU can ignore or control. The logic is inside
>> > the gdb client.
>> >
>>
>> Ping for this series?
>>
>
> Ping?

Re-reading the thread we seem to have two problems:

  - gdbstub is not explicitly passing the explicit CPU for its access
  - some devices use current_cpu to work out what AS they should be
working in

But I've already said just fudging current_cpu isn't the correct
approach.

-- 
Alex Bennée

Re: Wiki: Update package name in build instructions

2022-04-08 Thread Thomas Huth


On 07/01/2022 18.10, Lucas Hecht wrote:

Hi there,

could someone please give me a wiki account or make this minor change themself:

In the wiki article "Host/Linux" under "Recommended additional packages" 
libvte-2.90-dev should be changed to libvte-2.91-dev since the former is not 
available anymore.


Thanks, finally updated!

 Thomas

[PATCH] Warn user if the vga flag is passed but no vga device is created

2022-04-08 Thread Gautam Agrawal

This patch is in regards to this 
issue:https://gitlab.com/qemu-project/qemu/-/issues/581#.
A global boolean variable "vga_interface_created"(declared in softmmu/globals.c)
has been used to track the creation of vga interface. If the vga flag is passed 
in the command
line "default_vga"(declared in softmmu/vl.c) variable is set to 0. To warn 
user, the condition
checks if vga_interface_created is false and default_vga is equal to 0.

The warning "No vga device is created" is logged if vga flag is passed
but no vga device is created. This patch has been tested for
x86_64, i386, sparc, sparc64 and arm boards.

Signed-off-by: Gautam Agrawal 
---
 hw/isa/isa-bus.c| 1 +
 hw/pci/pci.c| 1 +
 hw/sparc/sun4m.c| 2 ++
 hw/sparc64/sun4u.c  | 1 +
 include/sysemu/sysemu.h | 1 +
 softmmu/globals.c   | 1 +
 softmmu/vl.c| 3 +++
 7 files changed, 10 insertions(+)

diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
index 0ad1c5fd65..cd5ad3687d 100644
--- a/hw/isa/isa-bus.c
+++ b/hw/isa/isa-bus.c
@@ -166,6 +166,7 @@ bool isa_realize_and_unref(ISADevice *dev, ISABus *bus, 
Error **errp)
 
 ISADevice *isa_vga_init(ISABus *bus)
 {
+vga_interface_created = true;
 switch (vga_interface_type) {
 case VGA_CIRRUS:
 return isa_create_simple(bus, "isa-cirrus-vga");
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index dae9119bfe..fab9c80f8d 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2038,6 +2038,7 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus 
*rootbus,
 
 PCIDevice *pci_vga_init(PCIBus *bus)
 {
+vga_interface_created = true;
 switch (vga_interface_type) {
 case VGA_CIRRUS:
 return pci_create_simple(bus, -1, "cirrus-vga");
diff --git a/hw/sparc/sun4m.c b/hw/sparc/sun4m.c
index 7f3a7c0027..f45e29acc8 100644
--- a/hw/sparc/sun4m.c
+++ b/hw/sparc/sun4m.c
@@ -921,6 +921,7 @@ static void sun4m_hw_init(MachineState *machine)
 /* sbus irq 5 */
 cg3_init(hwdef->tcx_base, slavio_irq[11], 0x0010,
  graphic_width, graphic_height, graphic_depth);
+vga_interface_created = true;
 } else {
 /* If no display specified, default to TCX */
 if (graphic_depth != 8 && graphic_depth != 24) {
@@ -936,6 +937,7 @@ static void sun4m_hw_init(MachineState *machine)
 
 tcx_init(hwdef->tcx_base, slavio_irq[11], 0x0010,
  graphic_width, graphic_height, graphic_depth);
+vga_interface_created = true;
 }
 }
 
diff --git a/hw/sparc64/sun4u.c b/hw/sparc64/sun4u.c
index cda7df36e3..75334dba71 100644
--- a/hw/sparc64/sun4u.c
+++ b/hw/sparc64/sun4u.c
@@ -633,6 +633,7 @@ static void sun4uv_init(MemoryRegion *address_space_mem,
 switch (vga_interface_type) {
 case VGA_STD:
 pci_create_simple(pci_busA, PCI_DEVFN(2, 0), "VGA");
+vga_interface_created = true;
 break;
 case VGA_NONE:
 break;
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index b9421e03ff..a558b895e4 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -32,6 +32,7 @@ typedef enum {
 } VGAInterfaceType;
 
 extern int vga_interface_type;
+extern bool vga_interface_created;
 
 extern int graphic_width;
 extern int graphic_height;
diff --git a/softmmu/globals.c b/softmmu/globals.c
index 3ebd718e35..1a5f8d42ad 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -40,6 +40,7 @@ int nb_nics;
 NICInfo nd_table[MAX_NICS];
 int autostart = 1;
 int vga_interface_type = VGA_NONE;
+bool vga_interface_created = false;
 Chardev *parallel_hds[MAX_PARALLEL_PORTS];
 int win2k_install_hack;
 int singlestep;
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 6f646531a0..cb79fa1f42 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2734,6 +2734,9 @@ static void qemu_machine_creation_done(void)
 if (foreach_device_config(DEV_GDB, gdbserver_start) < 0) {
 exit(1);
 }
+if (!vga_interface_created && !default_vga) {
+warn_report("No vga device is created");
+}
 }
 
 void qmp_x_exit_preconfig(Error **errp)
-- 
2.34.1

Re: [PATCH v9 33/45] cxl/cxl-host: Add memops for CFMWS region.

2022-04-08 Thread Jonathan Cameron via

On Thu, 7 Apr 2022 21:07:06 +
Tong Zhang  wrote:

> On 4/4/22 08:14, Jonathan Cameron wrote:
> > From: Jonathan Cameron 
> >
> >
> > +static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t 
> > *data,
> > +  unsigned size, MemTxAttrs attrs)
> > +{
> > +CXLFixedWindow *fw = opaque;
> > +PCIDevice *d;
> > +
> > +d = cxl_cfmws_find_device(fw, addr);
> > +if (d == NULL) {
> > +*data = 0;  
> 
> I'm looking at this code and comparing it to CXL2.0 spec 8.2.5.12.2 CXL HDM
> 
> Decoder Global Control Register (Offset 04h) table. It seems that we should
> 
> check POSION_ON_ERR_EN bit, if this bit is set, we return poison, otherwise
> 
> should return all 1's data.

Good point.  Takes a bit of searching to find the statements on that, but
it should indeed by all 1s not all 0s. I'll fix that up.

> 
> Also, from the spec, this bit is implementation specific and hard 
> wired(RO) to either 1 or 0,

My temptation is to set that to 0 and not return poison, because the handling
of that in the host is horribly implementation specific.

> 
> but for type3 device looks like we are currently allowing it to be 
> overwritten in ct3d_reg_write()
> 
> function. We probably also need more sanitation in ct3d_reg_write. (Also 
> for HDM
> 
> range/interleaving settings.)

Absolutely agree. Generally my plan was to tighten up write restrictions
as a follow on series because it tends to require quite a lot of code and
makes it much harder to see the overall flow.

So far I've done most of the PCI config space santization (see the gitlab
tree) but not much yet on the memory mapped register space.

I'll add it to the todo list. If it turns out this particular case is
reasonably clean I might add it within this series.

Jonathan

> 
> > +/* Reads to invalid address return poison */
> > +return MEMTX_ERROR;
> > +}
> > +
> > +return cxl_type3_read(d, addr + fw->base, data, size, attrs);
> > +}
> > +  
> 
> - Tong
>

[PATCH 0/5] Vhost-user: add Virtio RSS support

2022-04-08 Thread Maxime Coquelin

The goal of this series is to add support for Virtio RSS
feature to the Vhost-user backend.

First patches are preliminary reworks to support variable
RSS key and indirection table length. eBPF change only adds
checks on whether the key length is 40B, it does not add
support for longer keys.

Vhost-user implementation supports up to 52B RSS key, in
order to match with the maximum supported by physical
NICs (Intel E810). Idea is that it could be used for
application like Virtio-forwarder, by programming the
Virtio device RSS key into the physical NIC and let the
physical NIC do the packets distribution.

DPDK Vhost-user backend PoC implementing the new requests
can be found here [0], it only implements the messages
handling, it does not perform any RSS for now.

[0]: https://gitlab.com/mcoquelin/dpdk-next-virtio/-/commits/vhost_user_rss_poc/

Maxime Coquelin (5):
  ebpf: pass and check RSS key length to the loader
  virtio-net: prepare for variable RSS key and indir table lengths
  virtio-net: add RSS support for Vhost backends
  docs: introduce RSS support in Vhost-user specification
  vhost-user: add RSS support

 docs/interop/vhost-user.rst   |  57 
 ebpf/ebpf_rss-stub.c  |   3 +-
 ebpf/ebpf_rss.c   |  17 ++--
 ebpf/ebpf_rss.h   |   3 +-
 hw/net/vhost_net-stub.c   |  10 ++
 hw/net/vhost_net.c|  22 +
 hw/net/virtio-net.c   |  87 +-
 hw/virtio/vhost-user.c| 146 +-
 include/hw/virtio/vhost-backend.h |   7 ++
 include/hw/virtio/virtio-net.h|  16 +++-
 include/migration/vmstate.h   |  10 ++
 include/net/vhost_net.h   |   4 +
 12 files changed, 344 insertions(+), 38 deletions(-)

-- 
2.35.1

[PATCH 4/5] docs: introduce RSS support in Vhost-user specification

2022-04-08 Thread Maxime Coquelin

This patch documents RSS feature in Vhost-user specification.
Two new requests are introduced backed by a dedicated
protocol feature.

First one is to query the Vhost-user slave RSS capabilities
such as supported hash types, maximum key length and
indirection table size.

The second one is to provide the slave with driver's RSS
configuration.

Signed-off-by: Maxime Coquelin 
---
 docs/interop/vhost-user.rst | 57 +
 1 file changed, 57 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 4dbc84fd00..9de6297568 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -258,6 +258,42 @@ Inflight description
 
 :queue size: a 16-bit size of virtqueues
 
+RSS capabilities description
+
+
++--+-+---+
+| supported hash types | max key len | max indir len |
++--+-+---+
+
+:supported hash types: a 32-bit bitfield of supported hash types as defined
+   in the Virtio specification
+
+:max key len: a 8-bit maximum size of the RSS key
+
+:max indir len: a 16-bits maximum size of the RSS indirection table
+
+RSS data description
+
+
+++-+-+---+-+---+
+| hash types | key len | key | indir len | indir table | default queue |
+++-+-+---+-+---+
+
+:hash types: a 32-bit bitfield of supported hash types as defined in the
+ Virtio specification
+
+:key len: 8-bit size of the RSS key
+
+:key: a 8-bit array of 52 elements containing the RSS key
+
+:indir len: a 16-bit size of the RSS indirection table
+
+:indir table: a 16-bit array of 512 elements containing the hash indirection
+  table
+
+:default queue: the default queue index for flows not matching requested hash
+types
+
 C structure
 ---
 
@@ -858,6 +894,7 @@ Protocol features
   #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
   #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
   #define VHOST_USER_PROTOCOL_F_STATUS   16
+  #define VHOST_USER_PROTOCOL_F_NET_RSS  17
 
 Master message types
 
@@ -1371,6 +1408,26 @@ Master message types
   query the backend for its device status as defined in the Virtio
   specification.
 
+``VHOST_USER_NET_GET_RSS``
+  :id: 41
+  :equivalent ioctl: N/A
+  :slave payload: RSS capabilities description
+  :master payload: N/A
+
+  When the ``VHOST_USER_PROTOCOL_F_NET_RSS`` protocol has been successfully
+  negotiated, this message is submitted by the master to get the RSS
+  capabilities of the slave.
+
+``VHOST_USER_NET_SET_RSS``
+  :id: 42
+  :equivalent ioctl: N/A
+  :slave payload: N/A
+  :master payload: RSS data description
+
+  When the ``VHOST_USER_PROTOCOL_F_NET_RSS`` protocol has been successfully
+  negotiated, this message is submitted by the master to set the RSS
+  configuration defined by the Virtio driver.
+
 
 Slave message types
 ---
-- 
2.35.1

[PATCH 1/5] ebpf: pass and check RSS key length to the loader

2022-04-08 Thread Maxime Coquelin

This patch is preliminary rework to support RSS with
Vhost-user backends. The Vhost-user implementation will
allow RSS hash key of 40 bytes or more as allowed by the
Virtio specification, whereas the eBPF-based Vhost-kernel
solution only supports 40 bytes keys.

This patch adds the RSS key length to the loader, and
validate it is 40 bytes before copying it.

Signed-off-by: Maxime Coquelin 
---
 ebpf/ebpf_rss-stub.c |  3 ++-
 ebpf/ebpf_rss.c  | 11 +++
 ebpf/ebpf_rss.h  |  3 ++-
 hw/net/virtio-net.c  |  3 ++-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/ebpf/ebpf_rss-stub.c b/ebpf/ebpf_rss-stub.c
index e71e229190..ffc5c5574f 100644
--- a/ebpf/ebpf_rss-stub.c
+++ b/ebpf/ebpf_rss-stub.c
@@ -29,7 +29,8 @@ bool ebpf_rss_load(struct EBPFRSSContext *ctx)
 }
 
 bool ebpf_rss_set_all(struct EBPFRSSContext *ctx, struct EBPFRSSConfig *config,
-  uint16_t *indirections_table, uint8_t *toeplitz_key)
+  uint16_t *indirections_table, uint8_t *toeplitz_key,
+  uint8_t key_len)
 {
 return false;
 }
diff --git a/ebpf/ebpf_rss.c b/ebpf/ebpf_rss.c
index 118c68da83..4a63854175 100644
--- a/ebpf/ebpf_rss.c
+++ b/ebpf/ebpf_rss.c
@@ -110,14 +110,16 @@ static bool ebpf_rss_set_indirections_table(struct 
EBPFRSSContext *ctx,
 }
 
 static bool ebpf_rss_set_toepliz_key(struct EBPFRSSContext *ctx,
- uint8_t *toeplitz_key)
+ uint8_t *toeplitz_key,
+ size_t len)
 {
 uint32_t map_key = 0;
 
 /* prepare toeplitz key */
 uint8_t toe[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {};
 
-if (!ebpf_rss_is_loaded(ctx) || toeplitz_key == NULL) {
+if (!ebpf_rss_is_loaded(ctx) || toeplitz_key == NULL ||
+len != VIRTIO_NET_RSS_MAX_KEY_SIZE) {
 return false;
 }
 memcpy(toe, toeplitz_key, VIRTIO_NET_RSS_MAX_KEY_SIZE);
@@ -131,7 +133,8 @@ static bool ebpf_rss_set_toepliz_key(struct EBPFRSSContext 
*ctx,
 }
 
 bool ebpf_rss_set_all(struct EBPFRSSContext *ctx, struct EBPFRSSConfig *config,
-  uint16_t *indirections_table, uint8_t *toeplitz_key)
+  uint16_t *indirections_table, uint8_t *toeplitz_key,
+  uint8_t key_len)
 {
 if (!ebpf_rss_is_loaded(ctx) || config == NULL ||
 indirections_table == NULL || toeplitz_key == NULL) {
@@ -147,7 +150,7 @@ bool ebpf_rss_set_all(struct EBPFRSSContext *ctx, struct 
EBPFRSSConfig *config,
 return false;
 }
 
-if (!ebpf_rss_set_toepliz_key(ctx, toeplitz_key)) {
+if (!ebpf_rss_set_toepliz_key(ctx, toeplitz_key, key_len)) {
 return false;
 }
 
diff --git a/ebpf/ebpf_rss.h b/ebpf/ebpf_rss.h
index bf3f2572c7..db23ccd25f 100644
--- a/ebpf/ebpf_rss.h
+++ b/ebpf/ebpf_rss.h
@@ -37,7 +37,8 @@ bool ebpf_rss_is_loaded(struct EBPFRSSContext *ctx);
 bool ebpf_rss_load(struct EBPFRSSContext *ctx);
 
 bool ebpf_rss_set_all(struct EBPFRSSContext *ctx, struct EBPFRSSConfig *config,
-  uint16_t *indirections_table, uint8_t *toeplitz_key);
+  uint16_t *indirections_table, uint8_t *toeplitz_key,
+  uint8_t key_len);
 
 void ebpf_rss_unload(struct EBPFRSSContext *ctx);
 
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1067e72b39..73145d6390 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1201,7 +1201,8 @@ static bool virtio_net_attach_epbf_rss(VirtIONet *n)
 rss_data_to_rss_config(&n->rss_data, &config);
 
 if (!ebpf_rss_set_all(&n->ebpf_rss, &config,
-  n->rss_data.indirections_table, n->rss_data.key)) {
+  n->rss_data.indirections_table, n->rss_data.key,
+  VIRTIO_NET_RSS_MAX_KEY_SIZE)) {
 return false;
 }
 
-- 
2.35.1

[PATCH 5/5] vhost-user: add RSS support

2022-04-08 Thread Maxime Coquelin

This patch implements the RSS feature to the
Vhost-user backend.

The implementation supports up to 52 bytes RSS key length,
and 512 indirection table entries.

Signed-off-by: Maxime Coquelin 
---
 hw/virtio/vhost-user.c | 146 -
 1 file changed, 145 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 6abbc9da32..d047da81ba 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -81,6 +81,8 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_RESET_DEVICE = 13,
 /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
 VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+/* Feature 16 reserved for VHOST_USER_PROTOCOL_F_STATUS. */
+VHOST_USER_PROTOCOL_F_NET_RSS = 17,
 VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -126,6 +128,10 @@ typedef enum VhostUserRequest {
 VHOST_USER_GET_MAX_MEM_SLOTS = 36,
 VHOST_USER_ADD_MEM_REG = 37,
 VHOST_USER_REM_MEM_REG = 38,
+/* Message number 39 reserved for VHOST_USER_SET_STATUS. */
+/* Message number 40 reserved for VHOST_USER_GET_STATUS. */
+VHOST_USER_NET_GET_RSS = 41,
+VHOST_USER_NET_SET_RSS = 42,
 VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -196,6 +202,24 @@ typedef struct VhostUserInflight {
 uint16_t queue_size;
 } VhostUserInflight;
 
+typedef struct VhostUserRSSCapa {
+uint32_t supported_hash_types;
+uint8_t max_key_len;
+uint16_t max_indir_len;
+} VhostUserRSSCapa;
+
+#define VHOST_USER_RSS_MAX_KEY_LEN52
+#define VHOST_USER_RSS_MAX_INDIR_LEN  512
+
+typedef struct VhostUserRSSData {
+uint32_t hash_types;
+uint8_t key_len;
+uint8_t key[VHOST_USER_RSS_MAX_KEY_LEN];
+uint16_t indir_len;
+uint16_t indir_table[VHOST_USER_RSS_MAX_INDIR_LEN];
+uint16_t default_queue;
+} VhostUserRSSData;
+
 typedef struct {
 VhostUserRequest request;
 
@@ -220,6 +244,8 @@ typedef union {
 VhostUserCryptoSession session;
 VhostUserVringArea area;
 VhostUserInflight inflight;
+VhostUserRSSCapa rss_capa;
+VhostUserRSSData rss_data;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -2178,7 +2204,123 @@ static int vhost_user_net_set_mtu(struct vhost_dev 
*dev, uint16_t mtu)
 return ret;
 }
 
-/* If reply_ack supported, slave has to ack specified MTU is valid */
+if (reply_supported) {
+return process_message_reply(dev, &msg);
+}
+
+return 0;
+}
+
+static int vhost_user_net_get_rss(struct vhost_dev *dev,
+  VirtioNetRssCapa *rss_capa)
+{
+int ret;
+VhostUserMsg msg = {
+.hdr.request = VHOST_USER_NET_GET_RSS,
+.hdr.flags = VHOST_USER_VERSION,
+};
+
+if (!(dev->protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_NET_RSS))) {
+return -EPROTO;
+}
+
+ret = vhost_user_write(dev, &msg, NULL, 0);
+if (ret < 0) {
+return ret;
+}
+
+ret = vhost_user_read(dev, &msg);
+if (ret < 0) {
+return ret;
+}
+
+if (msg.hdr.request != VHOST_USER_NET_GET_RSS) {
+error_report("Received unexpected msg type. Expected %d received %d",
+ VHOST_USER_NET_GET_RSS, msg.hdr.request);
+return -EPROTO;
+}
+
+if (msg.hdr.size != sizeof(msg.payload.rss_capa)) {
+error_report("Received bad msg size.");
+return -EPROTO;
+}
+
+if (msg.payload.rss_capa.max_key_len < VIRTIO_NET_RSS_MIN_KEY_SIZE) {
+error_report("Invalid max RSS key len (%uB, minimum %uB).",
+ msg.payload.rss_capa.max_key_len,
+ VIRTIO_NET_RSS_MIN_KEY_SIZE);
+return -EINVAL;
+}
+
+if (msg.payload.rss_capa.max_indir_len < VIRTIO_NET_RSS_MIN_TABLE_LEN) {
+error_report("Invalid max RSS indir table entries (%u, minimum %u).",
+ msg.payload.rss_capa.max_indir_len,
+ VIRTIO_NET_RSS_MIN_TABLE_LEN);
+return -EINVAL;
+}
+
+rss_capa->supported_hashes = msg.payload.rss_capa.supported_hash_types;
+rss_capa->max_key_size = MIN(msg.payload.rss_capa.max_key_len,
+ VHOST_USER_RSS_MAX_KEY_LEN);
+rss_capa->max_indirection_len = MIN(msg.payload.rss_capa.max_indir_len,
+VHOST_USER_RSS_MAX_INDIR_LEN);
+
+return 0;
+}
+
+static int vhost_user_net_set_rss(struct vhost_dev *dev,
+  VirtioNetRssData *rss_data)
+{
+VhostUserMsg msg;
+bool reply_supported = virtio_has_feature(dev->protocol_features,
+  VHOST_USER_PROTOCOL_F_REPLY_ACK);
+int ret;
+
+if (!(dev->protocol_features & (1ULL << VHOST_USER_PROTOCOL_F_NET_RSS))) {
+return -EPROTO;
+}
+
+msg.hdr.request = VHOST_USER_NET_SET_RSS;
+msg.hdr.size = sizeof(msg.payload.rss_data);
+msg.hdr.flags = VHOST_USER_VERSION;
+if (reply_supported) {
+msg.h

[PATCH 3/5] virtio-net: add RSS support for Vhost backends

2022-04-08 Thread Maxime Coquelin

This patch introduces new Vhost backend callbacks to
support RSS, and makes them called in Virtio-net
device.

It will be used by Vhost-user backend implementation to
support RSS feature.

Signed-off-by: Maxime Coquelin 
---
 hw/net/vhost_net-stub.c   | 10 ++
 hw/net/vhost_net.c| 22 +
 hw/net/virtio-net.c   | 53 +--
 include/hw/virtio/vhost-backend.h |  7 
 include/net/vhost_net.h   |  4 +++
 5 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/hw/net/vhost_net-stub.c b/hw/net/vhost_net-stub.c
index 89d71cfb8e..cc05e07c1f 100644
--- a/hw/net/vhost_net-stub.c
+++ b/hw/net/vhost_net-stub.c
@@ -101,3 +101,13 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu)
 {
 return 0;
 }
+
+int vhost_net_get_rss(struct vhost_net *net, VirtioNetRssCapa *rss_capa)
+{
+return 0;
+}
+
+int vhost_net_set_rss(struct vhost_net *net, VirtioNetRssData *rss_data)
+{
+return 0;
+}
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2ca4..aa2a1e8e5f 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -512,3 +512,25 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu)
 
 return vhost_ops->vhost_net_set_mtu(&net->dev, mtu);
 }
+
+int vhost_net_get_rss(struct vhost_net *net, VirtioNetRssCapa *rss_capa)
+{
+const VhostOps *vhost_ops = net->dev.vhost_ops;
+
+if (!vhost_ops->vhost_net_get_rss) {
+return 0;
+}
+
+return vhost_ops->vhost_net_get_rss(&net->dev, rss_capa);
+}
+
+int vhost_net_set_rss(struct vhost_net *net, VirtioNetRssData *rss_data)
+{
+const VhostOps *vhost_ops = net->dev.vhost_ops;
+
+if (!vhost_ops->vhost_net_set_rss) {
+return 0;
+}
+
+return vhost_ops->vhost_net_set_rss(&net->dev, rss_data);
+}
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38436e472b..237bbdb1b3 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -741,8 +741,10 @@ static uint64_t virtio_net_get_features(VirtIODevice 
*vdev, uint64_t features,
 return features;
 }
 
-if (!ebpf_rss_is_loaded(&n->ebpf_rss)) {
-virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
+if (nc->peer->info->type == NET_CLIENT_DRIVER_TAP) {
+if (!ebpf_rss_is_loaded(&n->ebpf_rss)) {
+virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
+}
 }
 features = vhost_net_get_features(get_vhost_net(nc->peer), features);
 vdev->backend_features = features;
@@ -1161,11 +1163,17 @@ static void virtio_net_detach_epbf_rss(VirtIONet *n);
 
 static void virtio_net_disable_rss(VirtIONet *n)
 {
+NetClientState *nc = qemu_get_queue(n->nic);
+
 if (n->rss_data.enabled) {
 trace_virtio_net_rss_disable();
 }
 n->rss_data.enabled = false;
 
+if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
+vhost_net_set_rss(get_vhost_net(nc->peer), &n->rss_data);
+}
+
 virtio_net_detach_epbf_rss(n);
 }
 
@@ -1239,6 +1247,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
   bool do_rss)
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
+NetClientState *nc = qemu_get_queue(n->nic);
 struct virtio_net_rss_config cfg;
 size_t s, offset = 0, size_get;
 uint16_t queue_pairs, i;
@@ -1354,22 +1363,29 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
 }
 n->rss_data.enabled = true;
 
-if (!n->rss_data.populate_hash) {
-if (!virtio_net_attach_epbf_rss(n)) {
-/* EBPF must be loaded for vhost */
-if (get_vhost_net(qemu_get_queue(n->nic)->peer)) {
-warn_report("Can't load eBPF RSS for vhost");
-goto error;
+if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
+if (vhost_net_set_rss(get_vhost_net(nc->peer), &n->rss_data)) {
+warn_report("Failed to configure RSS for vhost-user");
+goto error;
+}
+} else {
+if (!n->rss_data.populate_hash) {
+if (!virtio_net_attach_epbf_rss(n)) {
+/* EBPF must be loaded for vhost */
+if (get_vhost_net(nc->peer)) {
+warn_report("Can't load eBPF RSS for vhost");
+goto error;
+}
+/* fallback to software RSS */
+warn_report("Can't load eBPF RSS - fallback to software RSS");
+n->rss_data.enabled_software_rss = true;
 }
-/* fallback to software RSS */
-warn_report("Can't load eBPF RSS - fallback to software RSS");
+} else {
+/* use software RSS for hash populating */
+/* and detach eBPF if was loaded before */
+virtio_net_detach_epbf_rss(n);
 n->rss_data.enabled_software_rss = true;
 }
-} else {
-/* use software RSS for hash populating */
-/* and detach eBPF if was loaded before */

[PATCH 2/5] virtio-net: prepare for variable RSS key and indir table lengths

2022-04-08 Thread Maxime Coquelin

This patch is a preliminary rework to support RSS with
Vhost-user backends. It enables supporting different types
of hashes, key lengths and indirection table lengths.

This patch does not introduces behavioral changes.

Signed-off-by: Maxime Coquelin 
---
 ebpf/ebpf_rss.c|  8 
 hw/net/virtio-net.c| 35 +-
 include/hw/virtio/virtio-net.h | 16 +---
 include/migration/vmstate.h| 10 ++
 4 files changed, 53 insertions(+), 16 deletions(-)

diff --git a/ebpf/ebpf_rss.c b/ebpf/ebpf_rss.c
index 4a63854175..f03be5f919 100644
--- a/ebpf/ebpf_rss.c
+++ b/ebpf/ebpf_rss.c
@@ -96,7 +96,7 @@ static bool ebpf_rss_set_indirections_table(struct 
EBPFRSSContext *ctx,
 uint32_t i = 0;
 
 if (!ebpf_rss_is_loaded(ctx) || indirections_table == NULL ||
-   len > VIRTIO_NET_RSS_MAX_TABLE_LEN) {
+   len > VIRTIO_NET_RSS_DEFAULT_TABLE_LEN) {
 return false;
 }
 
@@ -116,13 +116,13 @@ static bool ebpf_rss_set_toepliz_key(struct 
EBPFRSSContext *ctx,
 uint32_t map_key = 0;
 
 /* prepare toeplitz key */
-uint8_t toe[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {};
+uint8_t toe[VIRTIO_NET_RSS_DEFAULT_KEY_SIZE] = {};
 
 if (!ebpf_rss_is_loaded(ctx) || toeplitz_key == NULL ||
-len != VIRTIO_NET_RSS_MAX_KEY_SIZE) {
+len != VIRTIO_NET_RSS_DEFAULT_KEY_SIZE) {
 return false;
 }
-memcpy(toe, toeplitz_key, VIRTIO_NET_RSS_MAX_KEY_SIZE);
+memcpy(toe, toeplitz_key, VIRTIO_NET_RSS_DEFAULT_KEY_SIZE);
 *(uint32_t *)toe = ntohl(*(uint32_t *)toe);
 
 if (bpf_map_update_elem(ctx->map_toeplitz_key, &map_key, toe,
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 73145d6390..38436e472b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -137,12 +137,11 @@ static void virtio_net_get_config(VirtIODevice *vdev, 
uint8_t *config)
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
 netcfg.duplex = n->net_conf.duplex;
-netcfg.rss_max_key_size = VIRTIO_NET_RSS_MAX_KEY_SIZE;
+netcfg.rss_max_key_size = n->rss_capa.max_key_size;
 virtio_stw_p(vdev, &netcfg.rss_max_indirection_table_length,
- virtio_host_has_feature(vdev, VIRTIO_NET_F_RSS) ?
- VIRTIO_NET_RSS_MAX_TABLE_LEN : 1);
+ n->rss_capa.max_indirection_len);
 virtio_stl_p(vdev, &netcfg.supported_hash_types,
- VIRTIO_NET_RSS_SUPPORTED_HASHES);
+ n->rss_capa.supported_hashes);
 memcpy(config, &netcfg, n->config_size);
 
 /*
@@ -1202,7 +1201,7 @@ static bool virtio_net_attach_epbf_rss(VirtIONet *n)
 
 if (!ebpf_rss_set_all(&n->ebpf_rss, &config,
   n->rss_data.indirections_table, n->rss_data.key,
-  VIRTIO_NET_RSS_MAX_KEY_SIZE)) {
+  n->rss_data.key_len)) {
 return false;
 }
 
@@ -1277,7 +1276,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
 err_value = n->rss_data.indirections_len;
 goto error;
 }
-if (n->rss_data.indirections_len > VIRTIO_NET_RSS_MAX_TABLE_LEN) {
+if (n->rss_data.indirections_len > n->rss_capa.max_indirection_len) {
 err_msg = "Too large indirection table";
 err_value = n->rss_data.indirections_len;
 goto error;
@@ -1323,7 +1322,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
 err_value = queue_pairs;
 goto error;
 }
-if (temp.b > VIRTIO_NET_RSS_MAX_KEY_SIZE) {
+if (temp.b > n->rss_capa.max_key_size) {
 err_msg = "Invalid key size";
 err_value = temp.b;
 goto error;
@@ -1339,6 +1338,14 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
 }
 offset += size_get;
 size_get = temp.b;
+n->rss_data.key_len = temp.b;
+g_free(n->rss_data.key);
+n->rss_data.key = g_malloc(size_get);
+if (!n->rss_data.key) {
+err_msg = "Can't allocate key";
+err_value = n->rss_data.key_len;
+goto error;
+}
 s = iov_to_buf(iov, iov_cnt, offset, n->rss_data.key, size_get);
 if (s != size_get) {
 err_msg = "Can get key buffer";
@@ -3093,8 +3100,9 @@ static const VMStateDescription vmstate_virtio_net_rss = {
 VMSTATE_UINT32(rss_data.hash_types, VirtIONet),
 VMSTATE_UINT16(rss_data.indirections_len, VirtIONet),
 VMSTATE_UINT16(rss_data.default_queue, VirtIONet),
-VMSTATE_UINT8_ARRAY(rss_data.key, VirtIONet,
-VIRTIO_NET_RSS_MAX_KEY_SIZE),
+VMSTATE_VARRAY_UINT8_ALLOC(rss_data.key, VirtIONet,
+   rss_data.key_len, 0,
+   vmstate_info_uint8, uint8_t),
 VMSTATE_VARRAY_UINT16_ALLOC(rss_data.indirections_table, VirtIONet,
 rss_data.indirections_len, 0,
 vmstate_info_uint16, uint16_t),
@@ -3523,8 +

Re: [PATCH v5 02/13] mm: Introduce memfile_notifier

2022-04-08 Thread Chao Peng

On Tue, Mar 29, 2022 at 06:45:16PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 70d4309c9ce3..f628256dce0d 100644
> > +void memfile_notifier_invalidate(struct memfile_notifier_list *list,
> > +pgoff_t start, pgoff_t end)
> > +{
> > +   struct memfile_notifier *notifier;
> > +   int id;
> > +
> > +   id = srcu_read_lock(&srcu);
> > +   list_for_each_entry_srcu(notifier, &list->head, list,
> > +srcu_read_lock_held(&srcu)) {
> > +   if (notifier->ops && notifier->ops->invalidate)
> 
> Any reason notifier->ops isn't mandatory?

Yes it's mandatory, will skip the check here.

> 
> > +   notifier->ops->invalidate(notifier, start, end);
> > +   }
> > +   srcu_read_unlock(&srcu, id);
> > +}
> > +
> > +void memfile_notifier_fallocate(struct memfile_notifier_list *list,
> > +   pgoff_t start, pgoff_t end)
> > +{
> > +   struct memfile_notifier *notifier;
> > +   int id;
> > +
> > +   id = srcu_read_lock(&srcu);
> > +   list_for_each_entry_srcu(notifier, &list->head, list,
> > +srcu_read_lock_held(&srcu)) {
> > +   if (notifier->ops && notifier->ops->fallocate)
> > +   notifier->ops->fallocate(notifier, start, end);
> > +   }
> > +   srcu_read_unlock(&srcu, id);
> > +}
> > +
> > +void memfile_register_backing_store(struct memfile_backing_store *bs)
> > +{
> > +   BUG_ON(!bs || !bs->get_notifier_list);
> > +
> > +   list_add_tail(&bs->list, &backing_store_list);
> > +}
> > +
> > +void memfile_unregister_backing_store(struct memfile_backing_store *bs)
> > +{
> > +   list_del(&bs->list);
> 
> Allowing unregistration of a backing store is broken.  Using the _safe() 
> variant
> is not sufficient to guard against concurrent modification.  I don't see any 
> reason
> to support this out of the gate, the only reason to support unregistering a 
> backing
> store is if the backing store is implemented as a module, and AFAIK none of 
> the
> backing stores we plan on supporting initially support being built as a 
> module.
> These aren't exported, so it's not like that's even possible.  Registration 
> would
> also be broken if modules are allowed, I'm pretty sure module init doesn't run
> under a global lock.
> 
> We can always add this complexity if it's needed in the future, but for now 
> the
> easiest thing would be to tag memfile_register_backing_store() with __init and
> make backing_store_list __ro_after_init.

The only currently supported backing store shmem does not need this so
can remove it for now.

> 
> > +}
> > +
> > +static int memfile_get_notifier_info(struct inode *inode,
> > +struct memfile_notifier_list **list,
> > +struct memfile_pfn_ops **ops)
> > +{
> > +   struct memfile_backing_store *bs, *iter;
> > +   struct memfile_notifier_list *tmp;
> > +
> > +   list_for_each_entry_safe(bs, iter, &backing_store_list, list) {
> > +   tmp = bs->get_notifier_list(inode);
> > +   if (tmp) {
> > +   *list = tmp;
> > +   if (ops)
> > +   *ops = &bs->pfn_ops;
> > +   return 0;
> > +   }
> > +   }
> > +   return -EOPNOTSUPP;
> > +}
> > +
> > +int memfile_register_notifier(struct inode *inode,
> 
> Taking an inode is a bit odd from a user perspective.  Any reason not to take 
> a
> "struct file *" and get the inode here?  That would give callers a hint that 
> they
> need to hold a reference to the file for the lifetime of the registration.

Yes, I can change.

> 
> > + struct memfile_notifier *notifier,
> > + struct memfile_pfn_ops **pfn_ops)
> > +{
> > +   struct memfile_notifier_list *list;
> > +   int ret;
> > +
> > +   if (!inode || !notifier | !pfn_ops)
> 
> Bitwise | instead of logical ||.  But IMO taking in a pfn_ops pointer is 
> silly.
> More below.
> 
> > +   return -EINVAL;
> > +
> > +   ret = memfile_get_notifier_info(inode, &list, pfn_ops);
> > +   if (ret)
> > +   return ret;
> > +
> > +   spin_lock(&list->lock);
> > +   list_add_rcu(¬ifier->list, &list->head);
> > +   spin_unlock(&list->lock);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(memfile_register_notifier);
> > +
> > +void memfile_unregister_notifier(struct inode *inode,
> > +struct memfile_notifier *notifier)
> > +{
> > +   struct memfile_notifier_list *list;
> > +
> > +   if (!inode || !notifier)
> > +   return;
> > +
> > +   BUG_ON(memfile_get_notifier_info(inode, &list, NULL));
> 
> Eww.  Rather than force the caller to provide the inode/file and the notifier,
> what about grabbing the backing store itself in the notifier?
> 
>   struct memfile_notifier {
>   struct list_head list;
>   struct memfile_notifier_ops *ops;

Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

2022-04-08 Thread Chao Peng

On Thu, Apr 07, 2022 at 04:05:36PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Since page migration / swapping is not supported yet, MFD_INACCESSIBLE
> > memory behave like longterm pinned pages and thus should be accounted to
> > mm->pinned_vm and be restricted by RLIMIT_MEMLOCK.
> > 
> > Signed-off-by: Chao Peng 
> > ---
> >  mm/shmem.c | 25 -
> >  1 file changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 7b43e274c9a2..ae46fb96494b 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, 
> > pgoff_t start, pgoff_t end)
> >  static void notify_invalidate_page(struct inode *inode, struct folio 
> > *folio,
> >pgoff_t start, pgoff_t end)
> >  {
> > -#ifdef CONFIG_MEMFILE_NOTIFIER
> > struct shmem_inode_info *info = SHMEM_I(inode);
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > start = max(start, folio->index);
> > end = min(end, folio->index + folio_nr_pages(folio));
> >  
> > memfile_notifier_invalidate(&info->memfile_notifiers, start, end);
> >  #endif
> > +
> > +   if (info->xflags & SHM_F_INACCESSIBLE)
> > +   atomic64_sub(end - start, ¤t->mm->pinned_vm);
> 
> As Vishal's to-be-posted selftest discovered, this is broken as current->mm 
> may
> be NULL.  Or it may be a completely different mm, e.g. AFAICT there's nothing 
> that
> prevents a different process from punching hole in the shmem backing.
> 
> I don't see a sane way of tracking this in the backing store unless the inode 
> is
> associated with a single mm when it's created, and that opens up a giant can 
> of
> worms, e.g. what happens with the accounting if the creating process goes 
> away?

Yes, I realized this.

> 
> I think the correct approach is to not do the locking automatically for 
> SHM_F_INACCESSIBLE,
> and instead require userspace to do shmctl(.., SHM_LOCK, ...) if userspace 
> knows the
> consumers don't support migrate/swap.  That'd require wrapping migrate_page() 
> and then
> wiring up notifier hooks for migrate/swap, but IMO that's a good thing to get 
> sorted
> out sooner than later.  KVM isn't planning on support migrate/swap for TDX or 
> SNP,
> but supporting at least migrate for a software-only implementation a la pKVM 
> should
> be relatively straightforward.  On the notifiee side, KVM can terminate the 
> VM if it
> gets an unexpected migrate/swap, e.g. so that TDX/SEV VMs don't die later with
> exceptions and/or data corruption (pre-SNP SEV guests) in the guest.

SHM_LOCK sounds like a good match.

Thanks,
Chao
> 
> Hmm, shmem_writepage() already handles SHM_F_INACCESSIBLE by rejecting the 
> swap, so
> maybe it's just the page migration path that needs to be updated?

[PATCH v2 0/2] Remove PCIE root bridge LSI on powernv

2022-04-08 Thread Frederic Barrat

The powernv8/powernv9/powernv10 machines allocate a LSI for their root
port bridge, which is not the case on real hardware. The default root
port implementation in qemu requests a LSI. Since the powernv
implementation derives from it, that's where the LSI is coming
from. This series fixes it, so that the model matches the hardware.

However, the code in hw/pci to handle AER and hotplug events assume a
LSI is defined. It tends to assert/deassert a LSI if MSI or MSIX is
not enabled. Since we have hardware where that is not true, this patch
also fixes a few code paths to check if a LSI is configured before
trying to trigger it.


Changes from v1:
 - addressed comments from Daniel


Frederic Barrat (2):
  pcie: Don't try triggering a LSI when not defined
  ppc/pnv: Remove LSI on the PCIE host bridge

 hw/pci-host/pnv_phb3.c | 1 +
 hw/pci-host/pnv_phb4.c | 1 +
 hw/pci/pcie.c  | 5 +++--
 hw/pci/pcie_aer.c  | 2 +-
 4 files changed, 6 insertions(+), 3 deletions(-)

-- 
2.35.1

[PATCH v2 1/2] pcie: Don't try triggering a LSI when not defined

2022-04-08 Thread Frederic Barrat

This patch skips [de]asserting a LSI interrupt if the device doesn't
have any LSI defined. Doing so would trigger an assert in
pci_irq_handler().

The PCIE root port implementation in qemu requests a LSI (INTA), but a
subclass may want to change that behavior since it's a valid
configuration. For example on the POWER8/POWER9/POWER10 systems, the
root bridge doesn't request any LSI.

Signed-off-by: Frederic Barrat 
---
 hw/pci/pcie.c | 5 +++--
 hw/pci/pcie_aer.c | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 67a5d67372..68a62da0b5 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -353,7 +353,7 @@ static void hotplug_event_notify(PCIDevice *dev)
 msix_notify(dev, pcie_cap_flags_get_vector(dev));
 } else if (msi_enabled(dev)) {
 msi_notify(dev, pcie_cap_flags_get_vector(dev));
-} else {
+} else if (pci_intx(dev) != -1) {
 pci_set_irq(dev, dev->exp.hpev_notified);
 }
 }
@@ -361,7 +361,8 @@ static void hotplug_event_notify(PCIDevice *dev)
 static void hotplug_event_clear(PCIDevice *dev)
 {
 hotplug_event_update_event_status(dev);
-if (!msix_enabled(dev) && !msi_enabled(dev) && !dev->exp.hpev_notified) {
+if (!msix_enabled(dev) && !msi_enabled(dev) && pci_intx(dev) != -1 &&
+!dev->exp.hpev_notified) {
 pci_irq_deassert(dev);
 }
 }
diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
index e1a8a88c8c..92bd0530dd 100644
--- a/hw/pci/pcie_aer.c
+++ b/hw/pci/pcie_aer.c
@@ -290,7 +290,7 @@ static void pcie_aer_root_notify(PCIDevice *dev)
 msix_notify(dev, pcie_aer_root_get_vector(dev));
 } else if (msi_enabled(dev)) {
 msi_notify(dev, pcie_aer_root_get_vector(dev));
-} else {
+} else if (pci_intx(dev) != -1) {
 pci_irq_assert(dev);
 }
 }
-- 
2.35.1

[PATCH v2 2/2] ppc/pnv: Remove LSI on the PCIE host bridge

2022-04-08 Thread Frederic Barrat

The phb3/phb4/phb5 root ports inherit from the default PCIE root port
implementation, which requests a LSI interrupt (#INTA). On real
hardware (POWER8/POWER9/POWER10), there is no such LSI. This patch
corrects it so that it matches the hardware.

As a consequence, the device tree previously generated was bogus, as
the root bridge LSI was not properly mapped. On some
implementation (powernv9), it was leading to inconsistent interrupt
controller (xive) data. With this patch, it is now clean.

Signed-off-by: Frederic Barrat 
---
 hw/pci-host/pnv_phb3.c | 1 +
 hw/pci-host/pnv_phb4.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/pci-host/pnv_phb3.c b/hw/pci-host/pnv_phb3.c
index 6e9aa9d6ac..6a884833a8 100644
--- a/hw/pci-host/pnv_phb3.c
+++ b/hw/pci-host/pnv_phb3.c
@@ -1162,6 +1162,7 @@ static void pnv_phb3_root_port_realize(DeviceState *dev, 
Error **errp)
 error_propagate(errp, local_err);
 return;
 }
+pci_config_set_interrupt_pin(pci->config, 0);
 }
 
 static void pnv_phb3_root_port_class_init(ObjectClass *klass, void *data)
diff --git a/hw/pci-host/pnv_phb4.c b/hw/pci-host/pnv_phb4.c
index 11c97e27eb..dd81e940b7 100644
--- a/hw/pci-host/pnv_phb4.c
+++ b/hw/pci-host/pnv_phb4.c
@@ -1772,6 +1772,7 @@ static void pnv_phb4_root_port_reset(DeviceState *dev)
 pci_set_word(conf + PCI_PREF_MEMORY_LIMIT, 0xfff1);
 pci_set_long(conf + PCI_PREF_BASE_UPPER32, 0x1); /* Hack */
 pci_set_long(conf + PCI_PREF_LIMIT_UPPER32, 0x);
+pci_config_set_interrupt_pin(conf, 0);
 }
 
 static void pnv_phb4_root_port_realize(DeviceState *dev, Error **errp)
-- 
2.35.1

Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory

2022-04-08 Thread Chao Peng

On Mon, Mar 28, 2022 at 09:27:32PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Extend the memslot definition to provide fd-based private memory support
> > by adding two new fields (private_fd/private_offset). The memslot then
> > can maintain memory for both shared pages and private pages in a single
> > memslot. Shared pages are provided by existing userspace_addr(hva) field
> > and private pages are provided through the new private_fd/private_offset
> > fields.
> > 
> > Since there is no 'hva' concept anymore for private memory so we cannot
> > rely on get_user_pages() to get a pfn, instead we use the newly added
> > memfile_notifier to complete the same job.
> > 
> > This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> > 
> > Signed-off-by: Yu Zhang 
> 
> Needs a Co-developed-by: for Yu, or a From: if Yu is the sole author.

Yes a Co-developed-by for Yu is needed, for all the patches throught the series.

> 
> > Signed-off-by: Chao Peng 
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++---
> >  include/linux/kvm_host.h   |  7 +++
> >  include/uapi/linux/kvm.h   |  8 
> >  3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 3acbf4d263a5..f76ac598606c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> >};
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +   struct kvm_userspace_memory_region region;
> > +   __u64 private_offset;
> > +   __u32 private_fd;
> > +   __u32 padding[5];
> 
> Uber nit, I'd prefer we pad u32 for private_fd separate from padding the size 
> of
> the structure for future expansion.
> 
> Regarding future expansion, any reason not to go crazy and pad like 128+ 
> bytes?
> It'd be rather embarassing if the next memslot extension needs 3 u64s and we 
> end
> up with region_ext2 :-)

OK, so maybe:
__u64 private_offset;
__u32 private_fd;
__u32 pad1;
__u32 pad2[28];
> 
> > +};
> > +
> >/* for kvm_memory_region::flags */
> >#define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
> >#define KVM_MEM_READONLY (1UL << 1)
> > +  #define KVM_MEM_PRIVATE  (1UL << 2)
> >  
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> 
> ...
> 
> > +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
> 
> I 100% think we should usurp the name "private" for these memslots, but as 
> prep
> work this series should first rename KVM_PRIVATE_MEM_SLOTS to avoid confusion.
> Maybe KVM_INTERNAL_MEM_SLOTS?

Oh, I didn't realized 'PRIVATE' is already taken.  KVM_INTERNAL_MEM_SLOTS
sounds good.

Thanks,
Chao

[RFC PATCH v5 01/23] vdpa: Add missing tracing to batch mapping functions

2022-04-08 Thread Eugenio Pérez

These functions were not traced properly.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 2 ++
 hw/virtio/trace-events | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 8adf7c0b92..9e5fe15d03 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -129,6 +129,7 @@ static void vhost_vdpa_listener_begin_batch(struct 
vhost_vdpa *v)
 .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
 };
 
+trace_vhost_vdpa_listener_begin_batch(v, fd, msg.type, msg.iotlb.type);
 if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
@@ -163,6 +164,7 @@ static void vhost_vdpa_listener_commit(MemoryListener 
*listener)
 msg.type = v->msg_type;
 msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
+trace_vhost_vdpa_listener_commit(v, fd, msg.type, msg.iotlb.type);
 if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index a5102eac9e..48d9d5 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -25,6 +25,8 @@ vhost_user_postcopy_waker_nomatch(const char *rb, uint64_t 
rb_offset) "%s + 0x%"
 # vhost-vdpa.c
 vhost_vdpa_dma_map(void *vdpa, int fd, uint32_t msg_type, uint64_t iova, 
uint64_t size, uint64_t uaddr, uint8_t perm, uint8_t type) "vdpa:%p fd: %d 
msg_type: %"PRIu32" iova: 0x%"PRIx64" size: 0x%"PRIx64" uaddr: 0x%"PRIx64" 
perm: 0x%"PRIx8" type: %"PRIu8
 vhost_vdpa_dma_unmap(void *vdpa, int fd, uint32_t msg_type, uint64_t iova, 
uint64_t size, uint8_t type) "vdpa:%p fd: %d msg_type: %"PRIu32" iova: 
0x%"PRIx64" size: 0x%"PRIx64" type: %"PRIu8
+vhost_vdpa_listener_begin_batch(void *v, int fd, uint32_t msg_type, uint8_t 
type)  "vdpa:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
+vhost_vdpa_listener_commit(void *v, int fd, uint32_t msg_type, uint8_t type)  
"vdpa:%p fd: %d msg_type: %"PRIu32" type: %"PRIu8
 vhost_vdpa_listener_region_add(void *vdpa, uint64_t iova, uint64_t llend, void 
*vaddr, bool readonly) "vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64" vaddr: %p 
read-only: %d"
 vhost_vdpa_listener_region_del(void *vdpa, uint64_t iova, uint64_t llend) 
"vdpa: %p iova 0x%"PRIx64" llend 0x%"PRIx64
 vhost_vdpa_add_status(void *dev, uint8_t status) "dev: %p status: 0x%"PRIx8
-- 
2.27.0

[RFC PATCH v5 04/23] hw/virtio: Replace g_memdup() by g_memdup2()

2022-04-08 Thread Eugenio Pérez

From: Philippe Mathieu-Daudé 

Per 
https://discourse.gnome.org/t/port-your-module-from-g-memdup-to-g-memdup2-now/5538

  The old API took the size of the memory to duplicate as a guint,
  whereas most memory functions take memory sizes as a gsize. This
  made it easy to accidentally pass a gsize to g_memdup(). For large
  values, that would lead to a silent truncation of the size from 64
  to 32 bits, and result in a heap area being returned which is
  significantly smaller than what the caller expects. This can likely
  be exploited in various modules to cause a heap buffer overflow.

Replace g_memdup() by the safer g_memdup2() wrapper.

Signed-off-by: Philippe Mathieu-Daudé 
---
 hw/net/virtio-net.c   | 3 ++-
 hw/virtio/virtio-crypto.c | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1067e72b39..e4748a7e6c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1443,7 +1443,8 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, 
VirtQueue *vq)
 }
 
 iov_cnt = elem->out_num;
-iov2 = iov = g_memdup(elem->out_sg, sizeof(struct iovec) * 
elem->out_num);
+iov2 = iov = g_memdup2(elem->out_sg,
+   sizeof(struct iovec) * elem->out_num);
 s = iov_to_buf(iov, iov_cnt, 0, &ctrl, sizeof(ctrl));
 iov_discard_front(&iov, &iov_cnt, sizeof(ctrl));
 if (s != sizeof(ctrl)) {
diff --git a/hw/virtio/virtio-crypto.c b/hw/virtio/virtio-crypto.c
index dcd80b904d..0e31e3cc04 100644
--- a/hw/virtio/virtio-crypto.c
+++ b/hw/virtio/virtio-crypto.c
@@ -242,7 +242,7 @@ static void virtio_crypto_handle_ctrl(VirtIODevice *vdev, 
VirtQueue *vq)
 }
 
 out_num = elem->out_num;
-out_iov_copy = g_memdup(elem->out_sg, sizeof(out_iov[0]) * out_num);
+out_iov_copy = g_memdup2(elem->out_sg, sizeof(out_iov[0]) * out_num);
 out_iov = out_iov_copy;
 
 in_num = elem->in_num;
@@ -605,11 +605,11 @@ virtio_crypto_handle_request(VirtIOCryptoReq *request)
 }
 
 out_num = elem->out_num;
-out_iov_copy = g_memdup(elem->out_sg, sizeof(out_iov[0]) * out_num);
+out_iov_copy = g_memdup2(elem->out_sg, sizeof(out_iov[0]) * out_num);
 out_iov = out_iov_copy;
 
 in_num = elem->in_num;
-in_iov_copy = g_memdup(elem->in_sg, sizeof(in_iov[0]) * in_num);
+in_iov_copy = g_memdup2(elem->in_sg, sizeof(in_iov[0]) * in_num);
 in_iov = in_iov_copy;
 
 if (unlikely(iov_to_buf(out_iov, out_num, 0, &req, sizeof(req))
-- 
2.27.0

[RFC PATCH v5 06/23] vdpa: Add x-svq to NetdevVhostVDPAOptions

2022-04-08 Thread Eugenio Pérez

Finally offering the possibility to enable SVQ from the command line.

Signed-off-by: Eugenio Pérez 
---
 qapi/net.json|  9 -
 net/vhost-vdpa.c | 48 
 2 files changed, 48 insertions(+), 9 deletions(-)

diff --git a/qapi/net.json b/qapi/net.json
index b92f3f5fb4..92848e4362 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -445,12 +445,19 @@
 # @queues: number of queues to be created for multiqueue vhost-vdpa
 #  (default: 1)
 #
+# @x-svq: Start device with (experimental) shadow virtqueue. (Since 7.1)
+# (default: false)
+#
+# Features:
+# @unstable: Member @x-svq is experimental.
+#
 # Since: 5.1
 ##
 { 'struct': 'NetdevVhostVDPAOptions',
   'data': {
 '*vhostdev': 'str',
-'*queues':   'int' } }
+'*queues':   'int',
+'*x-svq':{'type': 'bool', 'features' : [ 'unstable'] } } }
 
 ##
 # @NetClientDriver:
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1e9fe47c03..def738998b 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -127,7 +127,11 @@ err_init:
 static void vhost_vdpa_cleanup(NetClientState *nc)
 {
 VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_dev *dev = s->vhost_vdpa.dev;
 
+if (dev && dev->vq_index + dev->nvqs == dev->vq_index_end) {
+g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+}
 if (s->vhost_net) {
 vhost_net_cleanup(s->vhost_net);
 g_free(s->vhost_net);
@@ -187,13 +191,23 @@ static NetClientInfo net_vhost_vdpa_info = {
 .check_peer_type = vhost_vdpa_check_peer_type,
 };
 
+static int vhost_vdpa_get_iova_range(int fd,
+ struct vhost_vdpa_iova_range *iova_range)
+{
+int ret = ioctl(fd, VHOST_VDPA_GET_IOVA_RANGE, iova_range);
+
+return ret < 0 ? -errno : 0;
+}
+
 static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
-   const char *device,
-   const char *name,
-   int vdpa_device_fd,
-   int queue_pair_index,
-   int nvqs,
-   bool is_datapath)
+   const char *device,
+   const char *name,
+   int vdpa_device_fd,
+   int queue_pair_index,
+   int nvqs,
+   bool is_datapath,
+   bool svq,
+   VhostIOVATree *iova_tree)
 {
 NetClientState *nc = NULL;
 VhostVDPAState *s;
@@ -211,6 +225,8 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 
 s->vhost_vdpa.device_fd = vdpa_device_fd;
 s->vhost_vdpa.index = queue_pair_index;
+s->vhost_vdpa.shadow_vqs_enabled = svq;
+s->vhost_vdpa.iova_tree = iova_tree;
 ret = vhost_vdpa_add(nc, (void *)&s->vhost_vdpa, queue_pair_index, nvqs);
 if (ret) {
 qemu_del_net_client(nc);
@@ -266,6 +282,7 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 g_autofree NetClientState **ncs = NULL;
 NetClientState *nc;
 int queue_pairs, i, has_cvq = 0;
+g_autoptr(VhostIOVATree) iova_tree = NULL;
 
 assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 opts = &netdev->u.vhost_vdpa;
@@ -285,29 +302,44 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 qemu_close(vdpa_device_fd);
 return queue_pairs;
 }
+if (opts->x_svq) {
+struct vhost_vdpa_iova_range iova_range;
+
+if (has_cvq) {
+error_setg(errp, "vdpa svq does not work with cvq");
+goto err_svq;
+}
+vhost_vdpa_get_iova_range(vdpa_device_fd, &iova_range);
+iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
+}
 
 ncs = g_malloc0(sizeof(*ncs) * queue_pairs);
 
 for (i = 0; i < queue_pairs; i++) {
 ncs[i] = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
- vdpa_device_fd, i, 2, true);
+ vdpa_device_fd, i, 2, true, opts->x_svq,
+ iova_tree);
 if (!ncs[i])
 goto err;
 }
 
 if (has_cvq) {
 nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
- vdpa_device_fd, i, 1, false);
+ vdpa_device_fd, i, 1, false, opts->x_svq,
+ iova_tree);
 if (!nc)
 goto err;
 }
 
+iova_tree = NULL;
 return 0;
 
 err:
 if (i) {
 qemu_del_net_client(ncs[0]);
 }
+
+err_svq:
 qemu_close(vdpa_device_fd);
 
 return -1;
-- 
2.27.0

[RFC PATCH v5 00/23] Net Control VQ support with asid in vDPA SVQ

2022-04-08 Thread Eugenio Pérez

Control virtqueue is used by networking device for accepting various
commands from the driver. It's a must to support multiqueue and other
configurations.

Shadow VirtQueue (SVQ) already makes possible migration of virtqueue
states, effectively intercepting them so qemu can track what regions of memory
are dirty because device action and needs migration. However, this does not
solve networking device state seen by the driver because CVQ messages, like
changes on MAC addresses from the driver.

To solve that, this series uses SVQ infraestructure proposed at SVQ to
intercept networking control messages used by the device. This way, qemu is
able to update VirtIONet device model and to migrate it.

You can run qemu in two modes after applying this series: only intercepting
cvq with x-cvq-svq=on or intercept all the virtqueues adding cmdline x-svq=on:

-netdev 
type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-cvq-svq=on,x-svq=on

The most updated kernel part of ASID is proposed at [1].

Other modes without x-cvq-svq have been not tested with this series. Other vq
cmd commands than set mac are not tested. Some details like error control are
not 100% tested neither.

The firsts 5 patches will be or have already been proposed sepratedly. Patch 6
enables cmdline parameter to shadow all virtqueues. The rest of commits
introduce the actual functionality.

Comments are welcomed.

Changes from rfc v4:
* Add missing tracing
* Add multiqueue support
* Use already sent version for replacing g_memdup
* Care with memory management

Changes from rfc v3:
* Fix bad returning of descriptors to SVQ list.

Changes from rfc v2:
* Fix use-after-free.

Changes from rfc v1:
* Rebase to latest master.
* Configure ASID instead of assuming cvq asid != data vqs asid.
* Update device model so (MAC) state can be migrated too.

[1] https://lkml.kernel.org/kvm/20220224212314.1326-1-gda...@xilinx.com/

Eugenio Pérez (22):
  vdpa: Add missing tracing to batch mapping functions
  vdpa: Fix bad index calculus at vhost_vdpa_get_vring_base
  util: Return void on iova_tree_remove
  vhost: Fix bad return of descriptors to SVQ
  vdpa: Add x-svq to NetdevVhostVDPAOptions
  vhost: move descriptor translation to vhost_svq_vring_write_descs
  vdpa: Fix index calculus at vhost_vdpa_svqs_start
  virtio-net: Expose ctrl virtqueue logic
  vdpa: Extract get features part from vhost_vdpa_get_max_queue_pairs
  virtio: Make virtqueue_alloc_element non-static
  vhost: Add SVQElement
  vhost: Add custom used buffer callback
  vdpa: control virtqueue support on shadow virtqueue
  vhost: Add vhost_iova_tree_find
  vdpa: Add map/unmap operation callback to SVQ
  vhost: Add vhost_svq_inject
  vdpa: add NetClientState->start() callback
  vdpa: Add vhost_vdpa_start_control_svq
  vhost: Update kernel headers
  vhost: Make possible to check for device exclusive vq group
  vdpa: Add asid attribute to vdpa device
  vdpa: Add x-cvq-svq

Philippe Mathieu-Daudé (1):
  hw/virtio: Replace g_memdup() by g_memdup2()

 qapi/net.json|  13 +-
 hw/virtio/vhost-iova-tree.h  |   2 +
 hw/virtio/vhost-shadow-virtqueue.h   |  46 ++-
 include/hw/virtio/vhost-vdpa.h   |   4 +
 include/hw/virtio/vhost.h|   4 +
 include/hw/virtio/virtio-net.h   |   3 +
 include/hw/virtio/virtio.h   |   1 +
 include/net/net.h|   2 +
 include/qemu/iova-tree.h |   4 +-
 include/standard-headers/linux/vhost_types.h |  11 +-
 linux-headers/linux/vhost.h  |  25 +-
 hw/net/vhost_net.c   |   9 +-
 hw/net/virtio-net.c  |  82 +++--
 hw/virtio/vhost-iova-tree.c  |  14 +
 hw/virtio/vhost-shadow-virtqueue.c   | 255 ---
 hw/virtio/vhost-vdpa.c   | 201 ++--
 hw/virtio/virtio-crypto.c|   6 +-
 hw/virtio/virtio.c   |   2 +-
 net/vhost-vdpa.c | 308 +--
 util/iova-tree.c |   4 +-
 hw/virtio/trace-events   |   8 +-
 21 files changed, 864 insertions(+), 140 deletions(-)

-- 
2.27.0

[RFC PATCH v5 03/23] util: Return void on iova_tree_remove

2022-04-08 Thread Eugenio Pérez

It always returns IOVA_OK so nobody uses it.

Signed-off-by: Eugenio Pérez 
---
 include/qemu/iova-tree.h | 4 +---
 util/iova-tree.c | 4 +---
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index c938fb0793..16bbfdf5f8 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -72,10 +72,8 @@ int iova_tree_insert(IOVATree *tree, const DMAMap *map);
  * provided.  The range does not need to be exactly what has inserted,
  * all the mappings that are included in the provided range will be
  * removed from the tree.  Here map->translated_addr is meaningless.
- *
- * Return: 0 if succeeded, or <0 if error.
  */
-int iova_tree_remove(IOVATree *tree, const DMAMap *map);
+void iova_tree_remove(IOVATree *tree, const DMAMap *map);
 
 /**
  * iova_tree_find:
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 6dff29c1f6..fee530a579 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -164,15 +164,13 @@ void iova_tree_foreach(IOVATree *tree, iova_tree_iterator 
iterator)
 g_tree_foreach(tree->tree, iova_tree_traverse, iterator);
 }
 
-int iova_tree_remove(IOVATree *tree, const DMAMap *map)
+void iova_tree_remove(IOVATree *tree, const DMAMap *map)
 {
 const DMAMap *overlap;
 
 while ((overlap = iova_tree_find(tree, map))) {
 g_tree_remove(tree->tree, overlap);
 }
-
-return IOVA_OK;
 }
 
 /**
-- 
2.27.0

[RFC PATCH v5 07/23] vhost: move descriptor translation to vhost_svq_vring_write_descs

2022-04-08 Thread Eugenio Pérez

It's done for both in and out descriptors so it's better placed here.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-shadow-virtqueue.c | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index c17506df20..bcb5f3aae9 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -122,17 +122,23 @@ static bool vhost_svq_translate_addr(const 
VhostShadowVirtqueue *svq,
 return true;
 }
 
-static void vhost_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
-const struct iovec *iovec, size_t num,
-bool more_descs, bool write)
+static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
+const struct iovec *iovec, size_t num,
+bool more_descs, bool write)
 {
 uint16_t i = svq->free_head, last = svq->free_head;
 unsigned n;
 uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
 vring_desc_t *descs = svq->vring.desc;
+bool ok;
 
 if (num == 0) {
-return;
+return true;
+}
+
+ok = vhost_svq_translate_addr(svq, sg, iovec, num);
+if (unlikely(!ok)) {
+return false;
 }
 
 for (n = 0; n < num; n++) {
@@ -149,6 +155,7 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue 
*svq, hwaddr *sg,
 }
 
 svq->free_head = le16_to_cpu(descs[last].next);
+return true;
 }
 
 static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
@@ -168,21 +175,18 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
 return false;
 }
 
-ok = vhost_svq_translate_addr(svq, sgs, elem->out_sg, elem->out_num);
+ok = vhost_svq_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
+ elem->in_num > 0, false);
 if (unlikely(!ok)) {
 return false;
 }
-vhost_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
-elem->in_num > 0, false);
 
-
-ok = vhost_svq_translate_addr(svq, sgs, elem->in_sg, elem->in_num);
+ok = vhost_svq_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, 
false,
+ true);
 if (unlikely(!ok)) {
 return false;
 }
 
-vhost_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, false, true);
-
 /*
  * Put the entry in the available array (but don't update avail->idx until
  * they do sync).
-- 
2.27.0

[RFC PATCH v5 02/23] vdpa: Fix bad index calculus at vhost_vdpa_get_vring_base

2022-04-08 Thread Eugenio Pérez

Fixes: 6d0b222666 ("vdpa: Adapt vhost_vdpa_get_vring_base to SVQ")

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 9e5fe15d03..1f229ff4cb 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1172,11 +1172,11 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
struct vhost_vring_state *ring)
 {
 struct vhost_vdpa *v = dev->opaque;
+int vdpa_idx = ring->index - dev->vq_index;
 int ret;
 
 if (v->shadow_vqs_enabled) {
-VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs,
-  ring->index);
+VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
 
 /*
  * Setting base as last used idx, so destination will see as available
-- 
2.27.0

[RFC PATCH v5 05/23] vhost: Fix bad return of descriptors to SVQ

2022-04-08 Thread Eugenio Pérez

Only the first one of them were properly enqueued back.

Fixes: 100890f7ca ("vhost: Shadow virtqueue buffers forwarding")
Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-shadow-virtqueue.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index b232803d1b..c17506df20 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -333,13 +333,25 @@ static void 
vhost_svq_disable_notification(VhostShadowVirtqueue *svq)
 svq->vring.avail->flags |= cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
 }
 
+static uint16_t vhost_svq_last_desc_of_chain(VhostShadowVirtqueue *svq,
+ uint16_t i)
+{
+vring_desc_t *descs = svq->vring.desc;
+
+while (le16_to_cpu(descs[i].flags) & VRING_DESC_F_NEXT) {
+i = le16_to_cpu(descs[i].next);
+}
+
+return i;
+}
+
 static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
uint32_t *len)
 {
 vring_desc_t *descs = svq->vring.desc;
 const vring_used_t *used = svq->vring.used;
 vring_used_elem_t used_elem;
-uint16_t last_used;
+uint16_t last_used, last_used_chain;
 
 if (!vhost_svq_more_used(svq)) {
 return NULL;
@@ -365,7 +377,8 @@ static VirtQueueElement 
*vhost_svq_get_buf(VhostShadowVirtqueue *svq,
 return NULL;
 }
 
-descs[used_elem.id].next = svq->free_head;
+last_used_chain = vhost_svq_last_desc_of_chain(svq, used_elem.id);
+descs[last_used_chain].next = svq->free_head;
 svq->free_head = used_elem.id;
 
 *len = used_elem.len;
-- 
2.27.0

[RFC PATCH v5 11/23] virtio: Make virtqueue_alloc_element non-static

2022-04-08 Thread Eugenio Pérez

So SVQ can allocate elements using it

Signed-off-by: Eugenio Pérez 
---
 include/hw/virtio/virtio.h | 1 +
 hw/virtio/virtio.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b31c4507f5..1e85833897 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -195,6 +195,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement 
*elem,
 unsigned int len, unsigned int idx);
 
 void virtqueue_map(VirtIODevice *vdev, VirtQueueElement *elem);
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num);
 void *virtqueue_pop(VirtQueue *vq, size_t sz);
 unsigned int virtqueue_drop_all(VirtQueue *vq);
 void *qemu_get_virtqueue_element(VirtIODevice *vdev, QEMUFile *f, size_t sz);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 9d637e043e..17cbbb5fca 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1376,7 +1376,7 @@ void virtqueue_map(VirtIODevice *vdev, VirtQueueElement 
*elem)
 false);
 }
 
-static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned 
in_num)
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num)
 {
 VirtQueueElement *elem;
 size_t in_addr_ofs = QEMU_ALIGN_UP(sz, __alignof__(elem->in_addr[0]));
-- 
2.27.0

[RFC PATCH v5 12/23] vhost: Add SVQElement

2022-04-08 Thread Eugenio Pérez

This allows SVQ to add metadata to the different queue elements

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-shadow-virtqueue.h |  8 --
 hw/virtio/vhost-shadow-virtqueue.c | 42 --
 2 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
b/hw/virtio/vhost-shadow-virtqueue.h
index e5e24c536d..72aadb0aec 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -15,6 +15,10 @@
 #include "standard-headers/linux/vhost_types.h"
 #include "hw/virtio/vhost-iova-tree.h"
 
+typedef struct SVQElement {
+VirtQueueElement elem;
+} SVQElement;
+
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
 /* Shadow vring */
@@ -48,10 +52,10 @@ typedef struct VhostShadowVirtqueue {
 VhostIOVATree *iova_tree;
 
 /* Map for use the guest's descriptors */
-VirtQueueElement **ring_id_maps;
+SVQElement **ring_id_maps;
 
 /* Next VirtQueue element that guest made available */
-VirtQueueElement *next_guest_avail_elem;
+SVQElement *next_guest_avail_elem;
 
 /* Next head to expose to the device */
 uint16_t shadow_avail_idx;
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index bcb5f3aae9..cf701576d1 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -158,9 +158,10 @@ static bool 
vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
 return true;
 }
 
-static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
-VirtQueueElement *elem, unsigned *head)
+static bool vhost_svq_add_split(VhostShadowVirtqueue *svq, SVQElement 
*svq_elem,
+unsigned *head)
 {
+const VirtQueueElement *elem = &svq_elem->elem;
 unsigned avail_idx;
 vring_avail_t *avail = svq->vring.avail;
 bool ok;
@@ -202,7 +203,7 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
 return true;
 }
 
-static bool vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
+static bool vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
 {
 unsigned qemu_head;
 bool ok = vhost_svq_add_split(svq, elem, &qemu_head);
@@ -251,19 +252,21 @@ static void vhost_handle_guest_kick(VhostShadowVirtqueue 
*svq)
 virtio_queue_set_notification(svq->vq, false);
 
 while (true) {
+SVQElement *svq_elem;
 VirtQueueElement *elem;
 bool ok;
 
 if (svq->next_guest_avail_elem) {
-elem = g_steal_pointer(&svq->next_guest_avail_elem);
+svq_elem = g_steal_pointer(&svq->next_guest_avail_elem);
 } else {
-elem = virtqueue_pop(svq->vq, sizeof(*elem));
+svq_elem = virtqueue_pop(svq->vq, sizeof(*svq_elem));
 }
 
-if (!elem) {
+if (!svq_elem) {
 break;
 }
 
+elem = &svq_elem->elem;
 if (elem->out_num + elem->in_num > vhost_svq_available_slots(svq)) 
{
 /*
  * This condition is possible since a contiguous buffer in GPA
@@ -276,11 +279,11 @@ static void vhost_handle_guest_kick(VhostShadowVirtqueue 
*svq)
  * queue the current guest descriptor and ignore further kicks
  * until some elements are used.
  */
-svq->next_guest_avail_elem = elem;
+svq->next_guest_avail_elem = svq_elem;
 return;
 }
 
-ok = vhost_svq_add(svq, elem);
+ok = vhost_svq_add(svq, svq_elem);
 if (unlikely(!ok)) {
 /* VQ is broken, just return and ignore any other kicks */
 return;
@@ -349,8 +352,7 @@ static uint16_t 
vhost_svq_last_desc_of_chain(VhostShadowVirtqueue *svq,
 return i;
 }
 
-static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
-   uint32_t *len)
+static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq, uint32_t *len)
 {
 vring_desc_t *descs = svq->vring.desc;
 const vring_used_t *used = svq->vring.used;
@@ -401,11 +403,13 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
 vhost_svq_disable_notification(svq);
 while (true) {
 uint32_t len;
-g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq, &len);
-if (!elem) {
+g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq, &len);
+VirtQueueElement *elem;
+if (!svq_elem) {
 break;
 }
 
+elem = &svq_elem->elem;
 if (unlikely(i >= svq->vring.num)) {
 qemu_log_mask(LOG_GUEST_ERROR,
  "More than %u used buffers obtained in a %u size SVQ",
@@ -556,7 +560,7 @@ void vhost_svq_start(VhostShadowVirtqueue *

[RFC PATCH v5 16/23] vdpa: Add map/unmap operation callback to SVQ

2022-04-08 Thread Eugenio Pérez

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-shadow-virtqueue.h | 21 +++--
 hw/virtio/vhost-shadow-virtqueue.c |  8 +++-
 hw/virtio/vhost-vdpa.c | 20 +++-
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
b/hw/virtio/vhost-shadow-virtqueue.h
index 4ff6a0cda0..6e61d9bfef 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -26,6 +26,15 @@ typedef struct VhostShadowVirtqueueOps {
 VirtQueueElementCallback used_elem_handler;
 } VhostShadowVirtqueueOps;
 
+typedef int (*vhost_svq_map_op)(hwaddr iova, hwaddr size, void *vaddr,
+bool readonly, void *opaque);
+typedef int (*vhost_svq_unmap_op)(hwaddr iova, hwaddr size, void *opaque);
+
+typedef struct VhostShadowVirtqueueMapOps {
+vhost_svq_map_op map;
+vhost_svq_unmap_op unmap;
+} VhostShadowVirtqueueMapOps;
+
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
 /* Shadow vring */
@@ -67,6 +76,12 @@ typedef struct VhostShadowVirtqueue {
 /* Optional callbacks */
 const VhostShadowVirtqueueOps *ops;
 
+/* Device memory mapping callbacks */
+const VhostShadowVirtqueueMapOps *map_ops;
+
+/* Device memory mapping callbacks opaque */
+void *map_ops_opaque;
+
 /* Optional custom used virtqueue element handler */
 VirtQueueElementCallback used_elem_cb;
 
@@ -96,8 +111,10 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, 
VirtIODevice *vdev,
  VirtQueue *vq);
 void vhost_svq_stop(VhostShadowVirtqueue *svq);
 
-VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree,
-const VhostShadowVirtqueueOps *ops);
+VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_map,
+const VhostShadowVirtqueueOps *ops,
+const VhostShadowVirtqueueMapOps *map_ops,
+void *map_ops_opaque);
 
 void vhost_svq_free(gpointer vq);
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 208832a698..15e6cbc5cb 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -610,13 +610,17 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
  *
  * @iova_tree: Tree to perform descriptors translations
  * @ops: SVQ operations hooks
+ * @map_ops: SVQ mapping operation hooks
+ * @map_ops_opaque: Opaque data to pass to mapping operations
  *
  * Returns the new virtqueue or NULL.
  *
  * In case of error, reason is reported through error_report.
  */
 VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree,
-const VhostShadowVirtqueueOps *ops)
+const VhostShadowVirtqueueOps *ops,
+const VhostShadowVirtqueueMapOps *map_ops,
+void *map_ops_opaque)
 {
 g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
 int r;
@@ -639,6 +643,8 @@ VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree 
*iova_tree,
 event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
 svq->iova_tree = iova_tree;
 svq->ops = ops;
+svq->map_ops = map_ops;
+svq->map_ops_opaque = map_ops_opaque;
 return g_steal_pointer(&svq);
 
 err_init_hdev_call:
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 421eddf8ca..d09e06d212 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -385,6 +385,22 @@ static int vhost_vdpa_get_dev_features(struct vhost_dev 
*dev,
 return ret;
 }
 
+static int vhost_vdpa_svq_map(hwaddr iova, hwaddr size, void *vaddr,
+  bool readonly, void *opaque)
+{
+return vhost_vdpa_dma_map(opaque, iova, size, vaddr, readonly);
+}
+
+static int vhost_vdpa_svq_unmap(hwaddr iova, hwaddr size, void *opaque)
+{
+return vhost_vdpa_dma_unmap(opaque, iova, size);
+}
+
+static const VhostShadowVirtqueueMapOps vhost_vdpa_svq_map_ops = {
+.map = vhost_vdpa_svq_map,
+.unmap = vhost_vdpa_svq_unmap,
+};
+
 static int vhost_vdpa_init_svq(struct vhost_dev *hdev, struct vhost_vdpa *v,
Error **errp)
 {
@@ -412,7 +428,9 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, 
struct vhost_vdpa *v,
 shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
 for (unsigned n = 0; n < hdev->nvqs; ++n) {
 g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree,
-v->shadow_vq_ops);
+   v->shadow_vq_ops,
+   &vhost_vdpa_svq_map_ops,
+   v);

[RFC PATCH v5 08/23] vdpa: Fix index calculus at vhost_vdpa_svqs_start

2022-04-08 Thread Eugenio Pérez

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 1f229ff4cb..3f8fa66e8e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1018,7 +1018,7 @@ static bool vhost_vdpa_svqs_start(struct vhost_dev *dev)
 VirtQueue *vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
 VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, i);
 struct vhost_vring_addr addr = {
-.index = i,
+.index = dev->vq_index + i,
 };
 int r;
 bool ok = vhost_vdpa_svq_setup(dev, svq, i, &err);
-- 
2.27.0

[RFC PATCH v5 10/23] vdpa: Extract get features part from vhost_vdpa_get_max_queue_pairs

2022-04-08 Thread Eugenio Pérez

To know the device features is also needed for CVQ SVQ. Extract from
vhost_vdpa_get_max_queue_pairs so we can reuse it.

Report errno in case of failure getting them while we're at it.

Signed-off-by: Eugenio Pérez 
---
 net/vhost-vdpa.c | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index def738998b..290aa01e13 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -235,20 +235,24 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 return nc;
 }
 
-static int vhost_vdpa_get_max_queue_pairs(int fd, int *has_cvq, Error **errp)
+static int vhost_vdpa_get_features(int fd, uint64_t *features, Error **errp)
+{
+int ret = ioctl(fd, VHOST_GET_FEATURES, features);
+if (ret) {
+error_setg_errno(errp, errno,
+ "Fail to query features from vhost-vDPA device");
+}
+return ret;
+}
+
+static int vhost_vdpa_get_max_queue_pairs(int fd, uint64_t features,
+  int *has_cvq, Error **errp)
 {
 unsigned long config_size = offsetof(struct vhost_vdpa_config, buf);
 g_autofree struct vhost_vdpa_config *config = NULL;
 __virtio16 *max_queue_pairs;
-uint64_t features;
 int ret;
 
-ret = ioctl(fd, VHOST_GET_FEATURES, &features);
-if (ret) {
-error_setg(errp, "Fail to query features from vhost-vDPA device");
-return ret;
-}
-
 if (features & (1 << VIRTIO_NET_F_CTRL_VQ)) {
 *has_cvq = 1;
 } else {
@@ -278,10 +282,11 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 NetClientState *peer, Error **errp)
 {
 const NetdevVhostVDPAOptions *opts;
+uint64_t features;
 int vdpa_device_fd;
 g_autofree NetClientState **ncs = NULL;
 NetClientState *nc;
-int queue_pairs, i, has_cvq = 0;
+int queue_pairs, r, i, has_cvq = 0;
 g_autoptr(VhostIOVATree) iova_tree = NULL;
 
 assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
@@ -296,7 +301,12 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 return -errno;
 }
 
-queue_pairs = vhost_vdpa_get_max_queue_pairs(vdpa_device_fd,
+r = vhost_vdpa_get_features(vdpa_device_fd, &features, errp);
+if (r) {
+return r;
+}
+
+queue_pairs = vhost_vdpa_get_max_queue_pairs(vdpa_device_fd, features,
  &has_cvq, errp);
 if (queue_pairs < 0) {
 qemu_close(vdpa_device_fd);
-- 
2.27.0

[RFC PATCH v5 18/23] vdpa: add NetClientState->start() callback

2022-04-08 Thread Eugenio Pérez

It allows to inject custom code on device success start, right before
release lock.

Signed-off-by: Eugenio Pérez 
---
 include/net/net.h  | 2 ++
 hw/net/vhost_net.c | 4 
 2 files changed, 6 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 523136c7ac..2fc3002ab4 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -44,6 +44,7 @@ typedef struct NICConf {
 
 typedef void (NetPoll)(NetClientState *, bool enable);
 typedef bool (NetCanReceive)(NetClientState *);
+typedef void (NetStart)(NetClientState *);
 typedef ssize_t (NetReceive)(NetClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(NetClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (NetClientState *);
@@ -71,6 +72,7 @@ typedef struct NetClientInfo {
 NetReceive *receive_raw;
 NetReceiveIOV *receive_iov;
 NetCanReceive *can_receive;
+NetStart *start;
 NetCleanup *cleanup;
 LinkStatusChanged *link_status_changed;
 QueryRxFilter *query_rx_filter;
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2ca4..44a105ec29 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -274,6 +274,10 @@ static int vhost_net_start_one(struct vhost_net *net,
 }
 }
 }
+
+if (net->nc->info->start) {
+net->nc->info->start(net->nc);
+}
 return 0;
 fail:
 file.fd = -1;
-- 
2.27.0

[RFC PATCH v5 15/23] vhost: Add vhost_iova_tree_find

2022-04-08 Thread Eugenio Pérez

Just a simple wrapper so we can find DMAMap entries based on iova

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-iova-tree.h |  2 ++
 hw/virtio/vhost-iova-tree.c | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
index 6a4f24e0f9..1ffcdc5b57 100644
--- a/hw/virtio/vhost-iova-tree.h
+++ b/hw/virtio/vhost-iova-tree.h
@@ -19,6 +19,8 @@ VhostIOVATree *vhost_iova_tree_new(uint64_t iova_first, 
uint64_t iova_last);
 void vhost_iova_tree_delete(VhostIOVATree *iova_tree);
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_delete);
 
+const DMAMap *vhost_iova_tree_find(const VhostIOVATree *iova_tree,
+   const DMAMap *map);
 const DMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_tree,
 const DMAMap *map);
 int vhost_iova_tree_map_alloc(VhostIOVATree *iova_tree, DMAMap *map);
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index 55fed1fefb..7d4e8ac499 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -56,6 +56,20 @@ void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
 g_free(iova_tree);
 }
 
+/**
+ * Find a mapping in the tree that matches map
+ *
+ * @iova_tree  The iova tree
+ * @mapThe map
+ *
+ * Return a matching map that contains argument map or NULL
+ */
+const DMAMap *vhost_iova_tree_find(const VhostIOVATree *iova_tree,
+   const DMAMap *map)
+{
+return iova_tree_find(iova_tree->iova_taddr_map, map);
+}
+
 /**
  * Find the IOVA address stored from a memory address
  *
-- 
2.27.0

[RFC PATCH v5 09/23] virtio-net: Expose ctrl virtqueue logic

2022-04-08 Thread Eugenio Pérez

This allows external vhost-net devices to modify the state of the
VirtIO device model once vhost-vdpa device has acknowledge the control
commands.

Signed-off-by: Eugenio Pérez 
---
 include/hw/virtio/virtio-net.h |  3 ++
 hw/net/virtio-net.c| 83 --
 2 files changed, 51 insertions(+), 35 deletions(-)

diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index eb87032627..e62f9e227f 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -218,6 +218,9 @@ struct VirtIONet {
 struct EBPFRSSContext ebpf_rss;
 };
 
+unsigned virtio_net_handle_ctrl_iov(VirtIODevice *vdev,
+const struct iovec *in_sg, size_t in_num,
+struct iovec *out_sg, unsigned out_num);
 void virtio_net_set_netclient_name(VirtIONet *n, const char *name,
const char *type);
 
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index e4748a7e6c..5905a9285c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1419,57 +1419,70 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t 
cmd,
 return VIRTIO_NET_OK;
 }
 
-static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
+unsigned virtio_net_handle_ctrl_iov(VirtIODevice *vdev,
+const struct iovec *in_sg, size_t in_num,
+struct iovec *out_sg, unsigned out_num)
 {
 VirtIONet *n = VIRTIO_NET(vdev);
 struct virtio_net_ctrl_hdr ctrl;
 virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
-VirtQueueElement *elem;
 size_t s;
 struct iovec *iov, *iov2;
-unsigned int iov_cnt;
+
+if (iov_size(in_sg, in_num) < sizeof(status) ||
+iov_size(out_sg, out_num) < sizeof(ctrl)) {
+virtio_error(vdev, "virtio-net ctrl missing headers");
+return 0;
+}
+
+iov2 = iov = g_memdup2(out_sg, sizeof(struct iovec) * out_num);
+s = iov_to_buf(iov, out_num, 0, &ctrl, sizeof(ctrl));
+iov_discard_front(&iov, &out_num, sizeof(ctrl));
+if (s != sizeof(ctrl)) {
+status = VIRTIO_NET_ERR;
+} else if (ctrl.class == VIRTIO_NET_CTRL_RX) {
+status = virtio_net_handle_rx_mode(n, ctrl.cmd, iov, out_num);
+} else if (ctrl.class == VIRTIO_NET_CTRL_MAC) {
+status = virtio_net_handle_mac(n, ctrl.cmd, iov, out_num);
+} else if (ctrl.class == VIRTIO_NET_CTRL_VLAN) {
+status = virtio_net_handle_vlan_table(n, ctrl.cmd, iov, out_num);
+} else if (ctrl.class == VIRTIO_NET_CTRL_ANNOUNCE) {
+status = virtio_net_handle_announce(n, ctrl.cmd, iov, out_num);
+} else if (ctrl.class == VIRTIO_NET_CTRL_MQ) {
+status = virtio_net_handle_mq(n, ctrl.cmd, iov, out_num);
+} else if (ctrl.class == VIRTIO_NET_CTRL_GUEST_OFFLOADS) {
+status = virtio_net_handle_offloads(n, ctrl.cmd, iov, out_num);
+}
+
+s = iov_from_buf(in_sg, in_num, 0, &status, sizeof(status));
+assert(s == sizeof(status));
+
+g_free(iov2);
+return sizeof(status);
+}
+
+static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
+{
+VirtQueueElement *elem;
 
 for (;;) {
+unsigned written;
 elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
 if (!elem) {
 break;
 }
-if (iov_size(elem->in_sg, elem->in_num) < sizeof(status) ||
-iov_size(elem->out_sg, elem->out_num) < sizeof(ctrl)) {
-virtio_error(vdev, "virtio-net ctrl missing headers");
+
+written = virtio_net_handle_ctrl_iov(vdev, elem->in_sg, elem->in_num,
+ elem->out_sg, elem->out_num);
+if (written > 0) {
+virtqueue_push(vq, elem, written);
+virtio_notify(vdev, vq);
+g_free(elem);
+} else {
 virtqueue_detach_element(vq, elem, 0);
 g_free(elem);
 break;
 }
-
-iov_cnt = elem->out_num;
-iov2 = iov = g_memdup2(elem->out_sg,
-   sizeof(struct iovec) * elem->out_num);
-s = iov_to_buf(iov, iov_cnt, 0, &ctrl, sizeof(ctrl));
-iov_discard_front(&iov, &iov_cnt, sizeof(ctrl));
-if (s != sizeof(ctrl)) {
-status = VIRTIO_NET_ERR;
-} else if (ctrl.class == VIRTIO_NET_CTRL_RX) {
-status = virtio_net_handle_rx_mode(n, ctrl.cmd, iov, iov_cnt);
-} else if (ctrl.class == VIRTIO_NET_CTRL_MAC) {
-status = virtio_net_handle_mac(n, ctrl.cmd, iov, iov_cnt);
-} else if (ctrl.class == VIRTIO_NET_CTRL_VLAN) {
-status = virtio_net_handle_vlan_table(n, ctrl.cmd, iov, iov_cnt);
-} else if (ctrl.class == VIRTIO_NET_CTRL_ANNOUNCE) {
-status = virtio_net_handle_announce(n, ctrl.cmd, iov, iov_cnt);
-} else if (ctrl.class == VIRTIO_NET_CTRL_MQ) {
-status = virtio_net_handle_mq(n, ctrl.cmd, iov, iov_cn

[RFC PATCH v5 19/23] vdpa: Add vhost_vdpa_start_control_svq

2022-04-08 Thread Eugenio Pérez

This will send CVQ commands in the destination machine, seting up
everything o there is no guest-visible change.

Signed-off-by: Eugenio Pérez 
---
 net/vhost-vdpa.c | 63 
 1 file changed, 63 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index a83da4616c..09fcc4a88e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -206,10 +206,73 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 return 0;
 }
 
+static bool vhost_vdpa_start_control_svq(VhostShadowVirtqueue *svq,
+ VirtIODevice *vdev)
+{
+VirtIONet *n = VIRTIO_NET(vdev);
+uint64_t features = vdev->host_features;
+
+if (features & BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR)) {
+const struct virtio_net_ctrl_hdr ctrl = {
+.class = VIRTIO_NET_CTRL_MAC,
+.cmd = VIRTIO_NET_CTRL_MAC_ADDR_SET,
+};
+uint8_t mac[6];
+const struct iovec data[] = {
+{
+.iov_base = (void *)&ctrl,
+.iov_len = sizeof(ctrl),
+},{
+.iov_base = mac,
+.iov_len = sizeof(mac),
+},{
+.iov_base = NULL,
+.iov_len = sizeof(virtio_net_ctrl_ack),
+}
+};
+bool ret;
+
+/* TODO: Only best effort? */
+memcpy(mac, n->mac, sizeof(mac));
+ret = vhost_svq_inject(svq, data, 2, 1);
+if (!ret) {
+return false;
+}
+}
+
+return true;
+}
+
+static void vhost_vdpa_start(NetClientState *nc)
+{
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+struct vhost_dev *dev = &s->vhost_net->dev;
+VhostShadowVirtqueue *svq;
+
+if (nc->is_datapath) {
+/* This is not the cvq dev */
+return;
+}
+
+if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+return;
+}
+
+if (!v->shadow_vqs_enabled) {
+return;
+}
+
+svq = g_ptr_array_index(v->shadow_vqs, 0);
+vhost_vdpa_start_control_svq(svq, dev->vdev);
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
 .receive = vhost_vdpa_receive,
+.start = vhost_vdpa_start,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
 .has_ufo = vhost_vdpa_has_ufo,
-- 
2.27.0

[RFC PATCH v5 13/23] vhost: Add custom used buffer callback

2022-04-08 Thread Eugenio Pérez

The callback allows SVQ users to know the VirtQueue requests and
responses. QEMU can use this to synchronize virtio device model state,
allowing to migrate it with minimum changes to the migration code.

In the case of networking, this will be used to inspect control
virtqueue messages.

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-shadow-virtqueue.h | 16 +++-
 include/hw/virtio/vhost-vdpa.h |  2 ++
 hw/virtio/vhost-shadow-virtqueue.c |  9 -
 hw/virtio/vhost-vdpa.c |  3 ++-
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
b/hw/virtio/vhost-shadow-virtqueue.h
index 72aadb0aec..4ff6a0cda0 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -19,6 +19,13 @@ typedef struct SVQElement {
 VirtQueueElement elem;
 } SVQElement;
 
+typedef void (*VirtQueueElementCallback)(VirtIODevice *vdev,
+ const VirtQueueElement *elem);
+
+typedef struct VhostShadowVirtqueueOps {
+VirtQueueElementCallback used_elem_handler;
+} VhostShadowVirtqueueOps;
+
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
 /* Shadow vring */
@@ -57,6 +64,12 @@ typedef struct VhostShadowVirtqueue {
 /* Next VirtQueue element that guest made available */
 SVQElement *next_guest_avail_elem;
 
+/* Optional callbacks */
+const VhostShadowVirtqueueOps *ops;
+
+/* Optional custom used virtqueue element handler */
+VirtQueueElementCallback used_elem_cb;
+
 /* Next head to expose to the device */
 uint16_t shadow_avail_idx;
 
@@ -83,7 +96,8 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice 
*vdev,
  VirtQueue *vq);
 void vhost_svq_stop(VhostShadowVirtqueue *svq);
 
-VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree);
+VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree,
+const VhostShadowVirtqueueOps *ops);
 
 void vhost_svq_free(gpointer vq);
 G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostShadowVirtqueue, vhost_svq_free);
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index a29dbb3f53..f1ba46a860 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -17,6 +17,7 @@
 #include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/virtio.h"
 #include "standard-headers/linux/vhost_types.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
 
 typedef struct VhostVDPAHostNotifier {
 MemoryRegion mr;
@@ -35,6 +36,7 @@ typedef struct vhost_vdpa {
 /* IOVA mapping used by the Shadow Virtqueue */
 VhostIOVATree *iova_tree;
 GPtrArray *shadow_vqs;
+const VhostShadowVirtqueueOps *shadow_vq_ops;
 struct vhost_dev *dev;
 VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 } VhostVDPA;
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index cf701576d1..208832a698 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -419,6 +419,10 @@ static void vhost_svq_flush(VhostShadowVirtqueue *svq,
 return;
 }
 virtqueue_fill(vq, elem, len, i++);
+
+if (svq->ops && svq->ops->used_elem_handler) {
+svq->ops->used_elem_handler(svq->vdev, elem);
+}
 }
 
 virtqueue_flush(vq, i);
@@ -605,12 +609,14 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
  * shadow methods and file descriptors.
  *
  * @iova_tree: Tree to perform descriptors translations
+ * @ops: SVQ operations hooks
  *
  * Returns the new virtqueue or NULL.
  *
  * In case of error, reason is reported through error_report.
  */
-VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree)
+VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree *iova_tree,
+const VhostShadowVirtqueueOps *ops)
 {
 g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
 int r;
@@ -632,6 +638,7 @@ VhostShadowVirtqueue *vhost_svq_new(VhostIOVATree 
*iova_tree)
 event_notifier_init_fd(&svq->svq_kick, VHOST_FILE_UNBIND);
 event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
 svq->iova_tree = iova_tree;
+svq->ops = ops;
 return g_steal_pointer(&svq);
 
 err_init_hdev_call:
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 3f8fa66e8e..421eddf8ca 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -411,7 +411,8 @@ static int vhost_vdpa_init_svq(struct vhost_dev *hdev, 
struct vhost_vdpa *v,
 
 shadow_vqs = g_ptr_array_new_full(hdev->nvqs, vhost_svq_free);
 for (unsigned n = 0; n < hdev->nvqs; ++n) {
-g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree);
+g_autoptr(VhostShadowVirtqueue) svq = vhost_svq_new(v->iova_tree,
+v->shadow_vq_ops);
 
 if (unlikely

[RFC PATCH v5 20/23] vhost: Update kernel headers

2022-04-08 Thread Eugenio Pérez

Signed-off-by: Eugenio Pérez 
---
 include/standard-headers/linux/vhost_types.h | 11 -
 linux-headers/linux/vhost.h  | 25 
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/standard-headers/linux/vhost_types.h 
b/include/standard-headers/linux/vhost_types.h
index 0bd2684a2a..ce78551b0f 100644
--- a/include/standard-headers/linux/vhost_types.h
+++ b/include/standard-headers/linux/vhost_types.h
@@ -87,7 +87,7 @@ struct vhost_msg {
 
 struct vhost_msg_v2 {
uint32_t type;
-   uint32_t reserved;
+   uint32_t asid;
union {
struct vhost_iotlb_msg iotlb;
uint8_t padding[64];
@@ -153,4 +153,13 @@ struct vhost_vdpa_iova_range {
 /* vhost-net should add virtio_net_hdr for RX, and strip for TX packets. */
 #define VHOST_NET_F_VIRTIO_NET_HDR 27
 
+/* Use message type V2 */
+#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1
+/* IOTLB can accept batching hints */
+#define VHOST_BACKEND_F_IOTLB_BATCH  0x2
+/* IOTLB can accept address space identifier through V2 type of IOTLB
+ * message
+ */
+#define VHOST_BACKEND_F_IOTLB_ASID  0x3
+
 #endif
diff --git a/linux-headers/linux/vhost.h b/linux-headers/linux/vhost.h
index c998860d7b..5e083490f1 100644
--- a/linux-headers/linux/vhost.h
+++ b/linux-headers/linux/vhost.h
@@ -89,11 +89,6 @@
 
 /* Set or get vhost backend capability */
 
-/* Use message type V2 */
-#define VHOST_BACKEND_F_IOTLB_MSG_V2 0x1
-/* IOTLB can accept batching hints */
-#define VHOST_BACKEND_F_IOTLB_BATCH  0x2
-
 #define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
 #define VHOST_GET_BACKEND_FEATURES _IOR(VHOST_VIRTIO, 0x26, __u64)
 
@@ -150,4 +145,24 @@
 /* Get the valid iova range */
 #define VHOST_VDPA_GET_IOVA_RANGE  _IOR(VHOST_VIRTIO, 0x78, \
 struct vhost_vdpa_iova_range)
+/* Get the number of virtqueue groups. */
+#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x79, unsigned int)
+
+/* Get the number of address spaces. */
+#define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
+
+/* Get the group for a virtqueue: read index, write group in num,
+ * The virtqueue index is stored in the index field of
+ * vhost_vring_state. The group for this specific virtqueue is
+ * returned via num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_GET_VRING_GROUP _IOWR(VHOST_VIRTIO, 0x7B,   \
+ struct vhost_vring_state)
+/* Set the ASID for a virtqueue group. The group index is stored in
+ * the index field of vhost_vring_state, the ASID associated with this
+ * group is stored at num field of vhost_vring_state.
+ */
+#define VHOST_VDPA_SET_GROUP_ASID  _IOW(VHOST_VIRTIO, 0x7C, \
+struct vhost_vring_state)
+
 #endif
-- 
2.27.0

[RFC PATCH v5 21/23] vhost: Make possible to check for device exclusive vq group

2022-04-08 Thread Eugenio Pérez

CVQ needs to be in its own group, not shared with any data vq. Enable
the checking of it here, before introducing address space id concepts.

Signed-off-by: Eugenio Pérez 
---
 include/hw/virtio/vhost.h |  2 +
 hw/net/vhost_net.c|  4 +-
 hw/virtio/vhost-vdpa.c| 79 ++-
 hw/virtio/trace-events|  1 +
 4 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 58a73e7b7a..034868fa9e 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -78,6 +78,8 @@ struct vhost_dev {
 int vq_index_end;
 /* if non-zero, minimum required value for max_queues */
 int num_queues;
+/* Must be a vq group different than any other vhost dev */
+bool independent_vq_group;
 uint64_t features;
 uint64_t acked_features;
 uint64_t backend_features;
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 44a105ec29..10480e19e5 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -343,14 +343,16 @@ int vhost_net_start(VirtIODevice *dev, NetClientState 
*ncs,
 }
 
 for (i = 0; i < nvhosts; i++) {
+bool cvq_idx = i >= data_queue_pairs;
 
-if (i < data_queue_pairs) {
+if (!cvq_idx) {
 peer = qemu_get_peer(ncs, i);
 } else { /* Control Virtqueue */
 peer = qemu_get_peer(ncs, n->max_queue_pairs);
 }
 
 net = get_vhost_net(peer);
+net->dev.independent_vq_group = !!cvq_idx;
 vhost_net_set_vq_index(net, i * 2, index_end);
 
 /* Suppress the masking guest notifiers on vhost user
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index d09e06d212..dfff94d46f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -677,7 +677,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 {
 uint64_t features;
 uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
-0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH;
+0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
+0x1ULL << VHOST_BACKEND_F_IOTLB_ASID;
 int r;
 
 if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
@@ -1097,6 +1098,78 @@ static bool vhost_vdpa_svqs_stop(struct vhost_dev *dev)
 return true;
 }
 
+static int vhost_vdpa_get_vring_group(struct vhost_dev *dev,
+  struct vhost_vring_state *state)
+{
+int ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_VRING_GROUP, state);
+trace_vhost_vdpa_get_vring_group(dev, state->index, state->num);
+return ret;
+}
+
+static bool vhost_dev_is_independent_group(struct vhost_dev *dev)
+{
+struct vhost_vdpa *v = dev->opaque;
+struct vhost_vring_state this_vq_group = {
+.index = dev->vq_index,
+};
+int ret;
+
+if (!(dev->backend_cap & VHOST_BACKEND_F_IOTLB_ASID)) {
+return true;
+}
+
+if (!v->shadow_vqs_enabled) {
+return true;
+}
+
+ret = vhost_vdpa_get_vring_group(dev, &this_vq_group);
+if (unlikely(ret)) {
+goto call_err;
+}
+
+for (int i = 1; i < dev->nvqs; ++i) {
+struct vhost_vring_state vq_group = {
+.index = dev->vq_index + i,
+};
+
+ret = vhost_vdpa_get_vring_group(dev, &vq_group);
+if (unlikely(ret)) {
+goto call_err;
+}
+if (unlikely(vq_group.num != this_vq_group.num)) {
+error_report("VQ %d group is different than VQ %d one",
+ this_vq_group.index, vq_group.index);
+return false;
+}
+}
+
+for (int i = 0; i < dev->vq_index_end; ++i) {
+struct vhost_vring_state vq_group = {
+.index = i,
+};
+
+if (dev->vq_index <= i && i < dev->vq_index + dev->nvqs) {
+continue;
+}
+
+ret = vhost_vdpa_get_vring_group(dev, &vq_group);
+if (unlikely(ret)) {
+goto call_err;
+}
+if (unlikely(vq_group.num == this_vq_group.num)) {
+error_report("VQ %d group is the same as VQ %d one",
+ this_vq_group.index, vq_group.index);
+return false;
+}
+}
+
+return true;
+
+call_err:
+error_report("Can't read vq group, errno=%d (%s)", ret, g_strerror(-ret));
+return false;
+}
+
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
 struct vhost_vdpa *v = dev->opaque;
@@ -1105,6 +1178,10 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 
 if (started) {
 vhost_vdpa_host_notifiers_init(dev);
+if (dev->independent_vq_group &&
+!vhost_dev_is_independent_group(dev)) {
+return -1;
+}
 ok = vhost_vdpa_svqs_start(dev);
 if (unlikely(!ok)) {
 return -1;
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 48d9d5..e6fdc03514 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events

[RFC PATCH v5 22/23] vdpa: Add asid attribute to vdpa device

2022-04-08 Thread Eugenio Pérez

We can configure ASID per group, but we still use asid 0 for every vdpa
device. Multiple asid support for cvq will be introduced in next
patches

Signed-off-by: Eugenio Pérez 
---
 include/hw/virtio/vhost-vdpa.h |  2 +
 include/hw/virtio/vhost.h  |  2 +
 hw/net/vhost_net.c |  1 +
 hw/virtio/vhost-vdpa.c | 97 --
 net/vhost-vdpa.c   | 15 +++---
 hw/virtio/trace-events |  9 ++--
 6 files changed, 99 insertions(+), 27 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index f1ba46a860..aa572d1acc 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -32,6 +32,8 @@ typedef struct vhost_vdpa {
 MemoryListener listener;
 struct vhost_vdpa_iova_range iova_range;
 uint64_t acked_features;
+/* one past the last vq index of this virtqueue group */
+int vq_group_index_end;
 bool shadow_vqs_enabled;
 /* IOVA mapping used by the Shadow Virtqueue */
 VhostIOVATree *iova_tree;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 034868fa9e..2a6819dc2e 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -78,6 +78,8 @@ struct vhost_dev {
 int vq_index_end;
 /* if non-zero, minimum required value for max_queues */
 int num_queues;
+/* address space id */
+uint32_t address_space_id;
 /* Must be a vq group different than any other vhost dev */
 bool independent_vq_group;
 uint64_t features;
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 10480e19e5..e8a99c8605 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -352,6 +352,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
 }
 
 net = get_vhost_net(peer);
+net->dev.address_space_id = !!cvq_idx;
 net->dev.independent_vq_group = !!cvq_idx;
 vhost_net_set_vq_index(net, i * 2, index_end);
 
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index dfff94d46f..1b4e03c658 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -79,14 +79,18 @@ static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr 
iova, hwaddr size,
 int ret = 0;
 
 msg.type = v->msg_type;
+if (v->dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) {
+msg.asid = v->dev->address_space_id;
+}
 msg.iotlb.iova = iova;
 msg.iotlb.size = size;
 msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr;
 msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
 msg.iotlb.type = VHOST_IOTLB_UPDATE;
 
-   trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.iotlb.iova, msg.iotlb.size,
-msg.iotlb.uaddr, msg.iotlb.perm, msg.iotlb.type);
+trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.asid, msg.iotlb.iova,
+ msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
+ msg.iotlb.type);
 
 if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -104,12 +108,15 @@ static int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, 
hwaddr iova,
 int fd = v->device_fd;
 int ret = 0;
 
+if (v->dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) {
+msg.asid = v->dev->address_space_id;
+}
 msg.type = v->msg_type;
 msg.iotlb.iova = iova;
 msg.iotlb.size = size;
 msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
 
-trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.iotlb.iova,
+trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.asid, msg.iotlb.iova,
msg.iotlb.size, msg.iotlb.type);
 
 if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
@@ -129,7 +136,12 @@ static void vhost_vdpa_listener_begin_batch(struct 
vhost_vdpa *v)
 .iotlb.type = VHOST_IOTLB_BATCH_BEGIN,
 };
 
-trace_vhost_vdpa_listener_begin_batch(v, fd, msg.type, msg.iotlb.type);
+if (v->dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_IOTLB_ASID)) {
+msg.asid = v->dev->address_space_id;
+}
+
+trace_vhost_vdpa_listener_begin_batch(v, fd, msg.type, msg.asid,
+  msg.iotlb.type);
 if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)",
  fd, errno, strerror(errno));
@@ -162,9 +174,13 @@ static void vhost_vdpa_listener_commit(MemoryListener 
*listener)
 }
 
 msg.type = v->msg_type;
+if (dev->backend_cap & (0x1ULL << VHOST_BACKEND_F_IOTLB_ASID)) {
+msg.asid = v->dev->address_space_id;
+}
 msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
-trace_vhost_vdpa_listener_commit(v, fd, msg.type, msg.iotlb.type);
+trace_vhost_vdpa_listener_commit(v, fd, msg.type, msg.asid,
+ msg.iotlb.type);
 if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
 error_report("failed to write, fd=%d, errno=%d (%s)

Re: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory

2022-04-08 Thread Chao Peng

On Mon, Mar 28, 2022 at 09:56:33PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > Extend the memslot definition to provide fd-based private memory support
> > by adding two new fields (private_fd/private_offset). The memslot then
> > can maintain memory for both shared pages and private pages in a single
> > memslot. Shared pages are provided by existing userspace_addr(hva) field
> > and private pages are provided through the new private_fd/private_offset
> > fields.
> > 
> > Since there is no 'hva' concept anymore for private memory so we cannot
> > rely on get_user_pages() to get a pfn, instead we use the newly added
> > memfile_notifier to complete the same job.
> > 
> > This new extension is indicated by a new flag KVM_MEM_PRIVATE.
> > 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >  Documentation/virt/kvm/api.rst | 37 +++---
> >  include/linux/kvm_host.h   |  7 +++
> >  include/uapi/linux/kvm.h   |  8 
> >  3 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 3acbf4d263a5..f76ac598606c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >  
> >  ::
> > @@ -1320,9 +1320,17 @@ yet and must be cleared on entry.
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> >};
> >  
> > +  struct kvm_userspace_memory_region_ext {
> > +   struct kvm_userspace_memory_region region;
> 
> Peeking ahead, the partial switch to the _ext variant is rather gross.  I 
> would
> prefer that KVM use an entirely different, but binary compatible, struct 
> internally.
> And once the kernel supports C11[*], I'm pretty sure we can make the "region" 
> in
> _ext an anonymous struct, and make KVM's internal struct a #define of _ext.  
> That
> should minimize the churn (no need to get the embedded "region" field), reduce
> line lengths, and avoid confusion due to some flows taking the _ext but others
> dealing with only the "base" struct.

Will try that.

> 
> Maybe kvm_user_memory_region or kvm_user_mem_region?  Though it's tempting to 
> be
> evil and usurp the old kvm_memory_region :-)
> 
> E.g. pre-C11 do
> 
> struct kvm_userspace_memory_region_ext {
>   struct kvm_userspace_memory_region region;
>   __u64 private_offset;
>   __u32 private_fd;
>   __u32 padding[5];
> };
> 
> #ifdef __KERNEL__
> struct kvm_user_mem_region {
>   __u32 slot;
>   __u32 flags;
>   __u64 guest_phys_addr;
>   __u64 memory_size; /* bytes */
>   __u64 userspace_addr; /* start of the userspace allocated memory */
>   __u64 private_offset;
>   __u32 private_fd;
>   __u32 padding[5];
> };
> #endif
> 
> and then post-C11 do
> 
> struct kvm_userspace_memory_region_ext {
> #ifdef __KERNEL__

Is this #ifndef? As I think anonymous struct is only for kernel?

Thanks,
Chao

>   struct kvm_userspace_memory_region region;
> #else
>   struct kvm_userspace_memory_region;
> #endif
>   __u64 private_offset;
>   __u32 private_fd;
>   __u32 padding[5];
> };
> 
> #ifdef __KERNEL__
> #define kvm_user_mem_region kvm_userspace_memory_region_ext
> #endif
> 
> [*] https://lore.kernel.org/all/20220301145233.3689119-1-a...@kernel.org
> 
> > +   __u64 private_offset;
> > +   __u32 private_fd;
> > +   __u32 padding[5];
> > +};

[RFC PATCH v5 14/23] vdpa: control virtqueue support on shadow virtqueue

2022-04-08 Thread Eugenio Pérez

Introduce the control virtqueue support for vDPA shadow virtqueue. This
is needed for advanced networking features like multiqueue.

To demonstrate command handling, VIRTIO_NET_F_CTRL_MACADDR and
VIRTIO_NET_CTRL_MQ are implemented. If vDPA device is started with SVQ
support and virtio-net driver changes MAC or the number of queues
virtio-net device model will be updated with the new one.

Others cvq commands could be added here straightforwardly but they have
been not tested.

Signed-off-by: Eugenio Pérez 
---
 net/vhost-vdpa.c | 80 ++--
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 290aa01e13..a83da4616c 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -11,6 +11,7 @@
 
 #include "qemu/osdep.h"
 #include "clients.h"
+#include "hw/virtio/virtio-net.h"
 #include "net/vhost_net.h"
 #include "net/vhost-vdpa.h"
 #include "hw/virtio/vhost-vdpa.h"
@@ -69,6 +70,30 @@ const int vdpa_feature_bits[] = {
 VHOST_INVALID_FEATURE_BIT
 };
 
+/** Supported device specific feature bits with SVQ */
+static const uint64_t vdpa_svq_device_features =
+BIT_ULL(VIRTIO_NET_F_CSUM) |
+BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) |
+BIT_ULL(VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) |
+BIT_ULL(VIRTIO_NET_F_MTU) |
+BIT_ULL(VIRTIO_NET_F_MAC) |
+BIT_ULL(VIRTIO_NET_F_GUEST_TSO4) |
+BIT_ULL(VIRTIO_NET_F_GUEST_TSO6) |
+BIT_ULL(VIRTIO_NET_F_GUEST_ECN) |
+BIT_ULL(VIRTIO_NET_F_GUEST_UFO) |
+BIT_ULL(VIRTIO_NET_F_HOST_TSO4) |
+BIT_ULL(VIRTIO_NET_F_HOST_TSO6) |
+BIT_ULL(VIRTIO_NET_F_HOST_ECN) |
+BIT_ULL(VIRTIO_NET_F_HOST_UFO) |
+BIT_ULL(VIRTIO_NET_F_MRG_RXBUF) |
+BIT_ULL(VIRTIO_NET_F_STATUS) |
+BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |
+BIT_ULL(VIRTIO_NET_F_MQ) |
+BIT_ULL(VIRTIO_F_ANY_LAYOUT) |
+BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |
+BIT_ULL(VIRTIO_NET_F_RSC_EXT) |
+BIT_ULL(VIRTIO_NET_F_STANDBY);
+
 VHostNetState *vhost_vdpa_get_vhost_net(NetClientState *nc)
 {
 VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
@@ -199,6 +224,46 @@ static int vhost_vdpa_get_iova_range(int fd,
 return ret < 0 ? -errno : 0;
 }
 
+static void vhost_vdpa_net_handle_ctrl(VirtIODevice *vdev,
+   const VirtQueueElement *elem)
+{
+struct virtio_net_ctrl_hdr ctrl;
+virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
+size_t s;
+struct iovec in = {
+.iov_base = &status,
+.iov_len = sizeof(status),
+};
+
+s = iov_to_buf(elem->out_sg, elem->out_num, 0, &ctrl, sizeof(ctrl.class));
+if (s != sizeof(ctrl.class)) {
+return;
+}
+
+switch (ctrl.class) {
+case VIRTIO_NET_CTRL_MAC_ADDR_SET:
+case VIRTIO_NET_CTRL_MQ:
+break;
+default:
+return;
+};
+
+s = iov_to_buf(elem->in_sg, elem->in_num, 0, &status, sizeof(status));
+if (s != sizeof(status) || status != VIRTIO_NET_OK) {
+return;
+}
+
+status = VIRTIO_NET_ERR;
+virtio_net_handle_ctrl_iov(vdev, &in, 1, elem->out_sg, elem->out_num);
+if (status != VIRTIO_NET_OK) {
+error_report("Bad CVQ processing in model");
+}
+}
+
+static const VhostShadowVirtqueueOps vhost_vdpa_net_svq_ops = {
+.used_elem_handler = vhost_vdpa_net_handle_ctrl,
+};
+
 static NetClientState *net_vhost_vdpa_init(NetClientState *peer,
const char *device,
const char *name,
@@ -226,6 +291,9 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 s->vhost_vdpa.device_fd = vdpa_device_fd;
 s->vhost_vdpa.index = queue_pair_index;
 s->vhost_vdpa.shadow_vqs_enabled = svq;
+if (!is_datapath) {
+s->vhost_vdpa.shadow_vq_ops = &vhost_vdpa_net_svq_ops;
+}
 s->vhost_vdpa.iova_tree = iova_tree;
 ret = vhost_vdpa_add(nc, (void *)&s->vhost_vdpa, queue_pair_index, nvqs);
 if (ret) {
@@ -314,9 +382,15 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 }
 if (opts->x_svq) {
 struct vhost_vdpa_iova_range iova_range;
-
-if (has_cvq) {
-error_setg(errp, "vdpa svq does not work with cvq");
+uint64_t invalid_dev_features =
+features & ~vdpa_svq_device_features &
+/* Transport are all accepted at this point */
+~MAKE_64BIT_MASK(VIRTIO_TRANSPORT_F_START,
+ VIRTIO_TRANSPORT_F_END - 
VIRTIO_TRANSPORT_F_START);
+
+if (invalid_dev_features) {
+error_setg(errp, "vdpa svq does not work with features 0x%" PRIx64,
+   invalid_dev_features);
 goto err_svq;
 }
 vhost_vdpa_get_iova_range(vdpa_device_fd, &iova_range);
-- 
2.27.0

Re: [PATCH v9 09/11] 9p: darwin: Implement compatibility for mknodat

2022-04-08 Thread Christian Schoenebeck

On Sonntag, 27. Februar 2022 23:35:20 CEST Will Cohen wrote:
> From: Keno Fischer 
> 
> Darwin does not support mknodat. However, to avoid race conditions
> with later setting the permissions, we must avoid using mknod on
> the full path instead. We could try to fchdir, but that would cause
> problems if multiple threads try to call mknodat at the same time.
> However, luckily there is a solution: Darwin includes a function
> that sets the cwd for the current thread only.
> This should suffice to use mknod safely.
[...]
> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> index cdb4c9e24c..bec0253474 100644
> --- a/hw/9pfs/9p-util-darwin.c
> +++ b/hw/9pfs/9p-util-darwin.c
> @@ -7,6 +7,8 @@
> 
>  #include "qemu/osdep.h"
>  #include "qemu/xattr.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
>  #include "9p-util.h"
> 
>  ssize_t fgetxattrat_nofollow(int dirfd, const char *filename, const char
> *name, @@ -62,3 +64,34 @@ int fsetxattrat_nofollow(int dirfd, const char
> *filename, const char *name, close_preserve_errno(fd);
>  return ret;
>  }
> +
> +/*
> + * As long as mknodat is not available on macOS, this workaround
> + * using pthread_fchdir_np is needed.
> + *
> + * Radar filed with Apple for implementing mknodat:
> + * rdar://FB9862426 (https://openradar.appspot.com/FB9862426)
> + */
> +#if defined CONFIG_PTHREAD_FCHDIR_NP
> +
> +int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
> +{
> +int preserved_errno, err;
> +if (!pthread_fchdir_np) {
> +error_report_once("pthread_fchdir_np() not available on this
> version of macOS"); +return -ENOTSUP;
> +}
> +if (pthread_fchdir_np(dirfd) < 0) {
> +return -1;
> +}
> +err = mknod(filename, mode, dev);

I just tested this on macOS Monterey and realized mknod() seems to require 
admin privileges on macOS to work. So if you run QEMU as ordinary user on 
macOS then mknod() would fail with errno=1 (Operation not permitted).

That means a lot of stuff would simply not work on macOS, unless you really 
want to run QEMU with super user privileges, which does not sound appealing to 
me. :/

Should we introduce another fake behaviour here, i.e. remapping this on macOS 
hosts as regular file and make guest believe it would create a device, similar 
as we already do for mapped links?

> +preserved_errno = errno;
> +/* Stop using the thread-local cwd */
> +pthread_fchdir_np(-1);
> +if (err < 0) {
> +errno = preserved_errno;
> +}
> +return err;
> +}
> +
> +#endif

[RFC PATCH v5 17/23] vhost: Add vhost_svq_inject

2022-04-08 Thread Eugenio Pérez

This allows qemu to inject packets to the device without guest's notice.

This will be use to inject net CVQ messages to restore status in the destination

Signed-off-by: Eugenio Pérez 
---
 hw/virtio/vhost-shadow-virtqueue.h |   5 +
 hw/virtio/vhost-shadow-virtqueue.c | 179 +
 2 files changed, 160 insertions(+), 24 deletions(-)

diff --git a/hw/virtio/vhost-shadow-virtqueue.h 
b/hw/virtio/vhost-shadow-virtqueue.h
index 6e61d9bfef..d82a64d566 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -17,6 +17,9 @@
 
 typedef struct SVQElement {
 VirtQueueElement elem;
+hwaddr in_iova;
+hwaddr out_iova;
+bool not_from_guest;
 } SVQElement;
 
 typedef void (*VirtQueueElementCallback)(VirtIODevice *vdev,
@@ -100,6 +103,8 @@ typedef struct VhostShadowVirtqueue {
 
 bool vhost_svq_valid_features(uint64_t features, Error **errp);
 
+bool vhost_svq_inject(VhostShadowVirtqueue *svq, const struct iovec *iov,
+  size_t out_num, size_t in_num);
 void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
 void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd);
 void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
diff --git a/hw/virtio/vhost-shadow-virtqueue.c 
b/hw/virtio/vhost-shadow-virtqueue.c
index 15e6cbc5cb..26f40dda9e 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -16,6 +16,7 @@
 #include "qemu/log.h"
 #include "qemu/memalign.h"
 #include "linux-headers/linux/vhost.h"
+#include "qemu/iov.h"
 
 /**
  * Validate the transport device features that both guests can use with the SVQ
@@ -122,7 +123,8 @@ static bool vhost_svq_translate_addr(const 
VhostShadowVirtqueue *svq,
 return true;
 }
 
-static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
+static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq,
+SVQElement *svq_elem, hwaddr *sg,
 const struct iovec *iovec, size_t num,
 bool more_descs, bool write)
 {
@@ -130,15 +132,39 @@ static bool 
vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
 unsigned n;
 uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
 vring_desc_t *descs = svq->vring.desc;
-bool ok;
 
 if (num == 0) {
 return true;
 }
 
-ok = vhost_svq_translate_addr(svq, sg, iovec, num);
-if (unlikely(!ok)) {
-return false;
+if (svq_elem->not_from_guest) {
+DMAMap map = {
+.translated_addr = (hwaddr)iovec->iov_base,
+.size = ROUND_UP(iovec->iov_len, 4096) - 1,
+.perm = write ? IOMMU_RW : IOMMU_RO,
+};
+int r;
+
+if (unlikely(num != 1)) {
+error_report("Unexpected chain of element injected");
+return false;
+}
+r = vhost_iova_tree_map_alloc(svq->iova_tree, &map);
+if (unlikely(r != IOVA_OK)) {
+error_report("Cannot map injected element");
+return false;
+}
+
+r = svq->map_ops->map(map.iova, map.size + 1,
+  (void *)map.translated_addr, !write,
+  svq->map_ops_opaque);
+assert(r == 0);
+sg[0] = map.iova;
+} else {
+bool ok = vhost_svq_translate_addr(svq, sg, iovec, num);
+if (unlikely(!ok)) {
+return false;
+}
 }
 
 for (n = 0; n < num; n++) {
@@ -165,7 +191,8 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq, 
SVQElement *svq_elem,
 unsigned avail_idx;
 vring_avail_t *avail = svq->vring.avail;
 bool ok;
-g_autofree hwaddr *sgs = g_new(hwaddr, MAX(elem->out_num, elem->in_num));
+g_autofree hwaddr *sgs = NULL;
+hwaddr *in_sgs, *out_sgs;
 
 *head = svq->free_head;
 
@@ -176,15 +203,23 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue 
*svq, SVQElement *svq_elem,
 return false;
 }
 
-ok = vhost_svq_vring_write_descs(svq, sgs, elem->out_sg, elem->out_num,
- elem->in_num > 0, false);
+if (!svq_elem->not_from_guest) {
+sgs = g_new(hwaddr, MAX(elem->out_num, elem->in_num));
+in_sgs = out_sgs = sgs;
+} else {
+in_sgs = &svq_elem->in_iova;
+out_sgs = &svq_elem->out_iova;
+}
+ok = vhost_svq_vring_write_descs(svq, svq_elem, out_sgs, elem->out_sg,
+ elem->out_num, elem->in_num > 0, false);
 if (unlikely(!ok)) {
 return false;
 }
 
-ok = vhost_svq_vring_write_descs(svq, sgs, elem->in_sg, elem->in_num, 
false,
- true);
+ok = vhost_svq_vring_write_descs(svq, svq_elem, in_sgs, elem->in_sg,
+ elem->in_num, false, true);
 if (unlikely(!ok)) {
+/* TODO unwind o

[RFC PATCH v5 23/23] vdpa: Add x-cvq-svq

2022-04-08 Thread Eugenio Pérez

This isolates shadow cvq in its own group.

Signed-off-by: Eugenio Pérez 
---
 qapi/net.json|  8 +++-
 net/vhost-vdpa.c | 98 ++--
 2 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/qapi/net.json b/qapi/net.json
index 92848e4362..39c245e6cd 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -447,9 +447,12 @@
 #
 # @x-svq: Start device with (experimental) shadow virtqueue. (Since 7.1)
 # (default: false)
+# @x-cvq-svq: Start device with (experimental) shadow virtqueue in its own
+# virtqueue group. (Since 7.1)
+# (default: false)
 #
 # Features:
-# @unstable: Member @x-svq is experimental.
+# @unstable: Members @x-svq and x-cvq-svq are experimental.
 #
 # Since: 5.1
 ##
@@ -457,7 +460,8 @@
   'data': {
 '*vhostdev': 'str',
 '*queues':   'int',
-'*x-svq':{'type': 'bool', 'features' : [ 'unstable'] } } }
+'*x-svq':{'type': 'bool', 'features' : [ 'unstable'] },
+'*x-cvq-svq':{'type': 'bool', 'features' : [ 'unstable'] } } }
 
 ##
 # @NetClientDriver:
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 6207ead884..e907ef1618 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -379,6 +379,17 @@ static int vhost_vdpa_get_features(int fd, uint64_t 
*features, Error **errp)
 return ret;
 }
 
+static int vhost_vdpa_get_backend_features(int fd, uint64_t *features,
+   Error **errp)
+{
+int ret = ioctl(fd, VHOST_GET_BACKEND_FEATURES, features);
+if (ret) {
+error_setg_errno(errp, errno,
+"Fail to query backend features from vhost-vDPA device");
+}
+return ret;
+}
+
 static int vhost_vdpa_get_max_queue_pairs(int fd, uint64_t features,
   int *has_cvq, Error **errp)
 {
@@ -412,16 +423,56 @@ static int vhost_vdpa_get_max_queue_pairs(int fd, 
uint64_t features,
 return 1;
 }
 
+/**
+ * Check vdpa device to support CVQ group asid 1
+ *
+ * @vdpa_device_fd: Vdpa device fd
+ * @queue_pairs: Queue pairs
+ * @errp: Error
+ */
+static int vhost_vdpa_check_cvq_svq(int vdpa_device_fd, int queue_pairs,
+Error **errp)
+{
+uint64_t backend_features;
+unsigned num_as;
+int r;
+
+r = vhost_vdpa_get_backend_features(vdpa_device_fd, &backend_features,
+errp);
+if (unlikely(r)) {
+return -1;
+}
+
+if (unlikely(!(backend_features & VHOST_BACKEND_F_IOTLB_ASID))) {
+error_setg(errp, "Device without IOTLB_ASID feature");
+return -1;
+}
+
+r = ioctl(vdpa_device_fd, VHOST_VDPA_GET_AS_NUM, &num_as);
+if (unlikely(r)) {
+error_setg_errno(errp, errno,
+ "Cannot retrieve number of supported ASs");
+return -1;
+}
+if (unlikely(num_as < 2)) {
+error_setg(errp, "Insufficient number of ASs (%u, min: 2)", num_as);
+}
+
+return 0;
+}
+
 int net_init_vhost_vdpa(const Netdev *netdev, const char *name,
 NetClientState *peer, Error **errp)
 {
 const NetdevVhostVDPAOptions *opts;
+struct vhost_vdpa_iova_range iova_range;
 uint64_t features;
 int vdpa_device_fd;
 g_autofree NetClientState **ncs = NULL;
 NetClientState *nc;
 int queue_pairs, r, i, has_cvq = 0;
 g_autoptr(VhostIOVATree) iova_tree = NULL;
+ERRP_GUARD();
 
 assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 opts = &netdev->u.vhost_vdpa;
@@ -446,8 +497,9 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 qemu_close(vdpa_device_fd);
 return queue_pairs;
 }
-if (opts->x_svq) {
-struct vhost_vdpa_iova_range iova_range;
+if (opts->x_cvq_svq || opts->x_svq) {
+vhost_vdpa_get_iova_range(vdpa_device_fd, &iova_range);
+
 uint64_t invalid_dev_features =
 features & ~vdpa_svq_device_features &
 /* Transport are all accepted at this point */
@@ -459,7 +511,21 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
invalid_dev_features);
 goto err_svq;
 }
-vhost_vdpa_get_iova_range(vdpa_device_fd, &iova_range);
+}
+
+if (opts->x_cvq_svq) {
+if (!has_cvq) {
+error_setg(errp, "Cannot use x-cvq-svq with a device without cvq");
+goto err_svq;
+}
+
+r = vhost_vdpa_check_cvq_svq(vdpa_device_fd, queue_pairs, errp);
+if (unlikely(r)) {
+error_prepend(errp, "Cannot configure CVQ SVQ: ");
+goto err_svq;
+}
+}
+if (opts->x_svq) {
 iova_tree = vhost_iova_tree_new(iova_range.first, iova_range.last);
 }
 
@@ -474,11 +540,35 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 }
 
 if (has_cvq) {
+g_autoptr(VhostIOVATree) cvq_iova_tree = NULL;
+
+if (opts->x_cvq_svq) {
+

Re: [PULL 0/2] Fixes 20220408 patches

2022-04-08 Thread Peter Maydell

On Fri, 8 Apr 2022 at 05:55, Gerd Hoffmann  wrote:
>
> The following changes since commit 95a3fcc7487e5bef262e1f937ed8636986764c4e:
>
>   Update version for v7.0.0-rc3 release (2022-04-06 21:26:13 +0100)
>
> are available in the Git repository at:
>
>   git://git.kraxel.org/qemu tags/fixes-20220408-pull-request
>
> for you to fetch changes up to fa892e9abb728e76afcf27323ab29c57fb0fe7aa:
>
>   ui/cursor: fix integer overflow in cursor_alloc (CVE-2021-4206) (2022-04-07 
> 12:30:54 +0200)
>
> 
> two cursor/qxl related security fixes.
>

Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/7.0
for any user-visible changes.

-- PMM

Re: [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit

2022-04-08 Thread Chao Peng

On Mon, Mar 28, 2022 at 10:33:37PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> > 
> > After private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared memory <-> private memory conversion in
> > memory encryption usage.
> > 
> > In such usage, typically there are two kind of memory conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM to
> > map a range (as private or shared), KVM then exits to userspace to
> > do the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler.
> > * if the fault is due to a private memory access then causes a
> >   userspace exit for a shared->private conversion request when the
> >   page has not been allocated in the private memory backend.
> > * If the fault is due to a shared memory access then causes a
> >   userspace exit for a private->shared conversion request when the
> >   page has already been allocated in the private memory backend.
> > 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >  Documentation/virt/kvm/api.rst | 22 ++
> >  include/uapi/linux/kvm.h   |  9 +
> >  2 files changed, 31 insertions(+)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f76ac598606c..bad550c2212b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6216,6 +6216,28 @@ array field represents return values. The userspace 
> > should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >  
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_ERROR */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +   __u32 flags;
> > +   __u32 padding;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +If exit reason is KVM_EXIT_MEMORY_ERROR then it indicates that the VCPU has
> 
> Doh, I'm pretty sure I suggested KVM_EXIT_MEMORY_ERROR.  Any objection to 
> using
> KVM_EXIT_MEMORY_FAULT instead of KVM_EXIT_MEMORY_ERROR?  "ERROR" makes me 
> think
> of ECC errors, i.e. uncorrected #MC in x86 land, not more generic "faults".  
> That
> would align nicely with -EFAULT.

Sure.

> 
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.

Re: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages

2022-04-08 Thread Chao Peng

On Mon, Mar 28, 2022 at 11:56:06PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > @@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct 
> > kvm_vcpu *vcpu)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +#ifdef CONFIG_MEMFILE_NOTIFIER
> > +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t 
> > gfn,
> > +  int *order)
> > +{
> > +   pgoff_t index = gfn - slot->base_gfn +
> > +   (slot->private_offset >> PAGE_SHIFT);
> 
> This is broken for 32-bit kernels, where gfn_t is a 64-bit value but pgoff_t 
> is a
> 32-bit value.  There's no reason to support this for 32-bit kernels, so...
> 
> The easiest fix, and likely most maintainable for other code too, would be to
> add a dedicated CONFIG for private memory, and then have KVM check that for 
> all
> the memfile stuff.  x86 can then select it only for 64-bit kernels, and in 
> turn
> select MEMFILE_NOTIFIER iff private memory is supported.

Looks good.

> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ca7b2a6a452a..ee9c8c155300 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -48,7 +48,9 @@ config KVM
> select SRCU
> select INTERVAL_TREE
> select HAVE_KVM_PM_NOTIFIER if PM
> -   select MEMFILE_NOTIFIER
> +   select HAVE_KVM_PRIVATE_MEM if X86_64
> +   select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM
> +
> help
>   Support hosting fully virtualized guest machines using hardware
>   virtualization extensions.  You will need a fairly recent
> 
> And in addition to replacing checks on CONFIG_MEMFILE_NOTIFIER, the probing of
> whether or not KVM_MEM_PRIVATE is allowed can be:
> 
> @@ -1499,23 +1499,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
> }
>  }
> 
> -bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
> -{
> -   return false;
> -}
> -
>  static int check_memory_region_flags(struct kvm *kvm,
> const struct kvm_userspace_memory_region *mem)
>  {
> u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> 
> -   if (kvm_arch_private_memory_supported(kvm))
> -   valid_flags |= KVM_MEM_PRIVATE;
> -
>  #ifdef __KVM_HAVE_READONLY_MEM
> valid_flags |= KVM_MEM_READONLY;
>  #endif
> 
> +#ifdef CONFIG_KVM_HAVE_PRIVATE_MEM
> +   valid_flags |= KVM_MEM_PRIVATE;
> +#endif
> +
> if (mem->flags & ~valid_flags)
> return -EINVAL;
> 
> > +
> > +   return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
> > +  index, order);
> 
> In a similar vein, get_locK_pfn() shouldn't return a "long".  KVM likely 
> won't use
> these APIs on 32-bit kernels, but that may not hold true for other 
> subsystems, and
> this code is confusing and technically wrong.  The pfns for struct page 
> squeeze
> into an unsigned long because PAE support is capped at 64gb, but casting to a
> signed long could result in a pfn with bit 31 set being misinterpreted as an 
> error.
> 
> Even returning an "unsigned long" for the pfn is wrong.  It "works" for the 
> shmem
> code because shmem deals only with struct page, but it's technically wrong, 
> especially
> since one of the selling points of this approach is that it can work without 
> struct
> page.

Hmmm, that's correct.

> 
> OUT params suck, but I don't see a better option than having the return value 
> be
> 0/-errno, with "pfn_t *pfn" for the resolved pfn.
> 
> > +}
> > +
> > +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot,
> > +  kvm_pfn_t pfn)
> > +{
> > +   slot->pfn_ops->put_unlock_pfn(pfn);
> > +}
> > +
> > +#else
> > +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t 
> > gfn,
> > +  int *order)
> > +{
> 
> This should be a WARN_ON() as its usage should be guarded by a KVM_PRIVATE_MEM
> check, and private memslots should be disallowed in this case.
> 
> Alternatively, it might be a good idea to #ifdef these out entirely and not 
> provide
> stubs.  That'd likely require a stub or two in arch code, but overall it 
> might be
> less painful in the long run, e.g. would force us to more carefully consider 
> the
> touch points for private memory.  Definitely not a requirement, just an idea.

Make sense, let me try.

Thanks,
Chao

Re: [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext

2022-04-08 Thread Chao Peng

On Mon, Mar 28, 2022 at 10:26:55PM +, Sean Christopherson wrote:
> On Thu, Mar 10, 2022, Chao Peng wrote:
> > @@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp,
> > break;
> > }
> > case KVM_SET_USER_MEMORY_REGION: {
> > -   struct kvm_userspace_memory_region kvm_userspace_mem;
> > +   struct kvm_userspace_memory_region_ext region_ext;
> 
> It's probably a good idea to zero initialize the full region to avoid 
> consuming
> garbage stack data if there's a bug and an _ext field is accessed without 
> first
> checking KVM_MEM_PRIVATE.  I'm usually opposed to unnecessary initialization, 
> but
> this seems like something we could screw up quite easily.
> 
> > r = -EFAULT;
> > -   if (copy_from_user(&kvm_userspace_mem, argp,
> > -   sizeof(kvm_userspace_mem)))
> > +   if (copy_from_user(®ion_ext, argp,
> > +   sizeof(struct kvm_userspace_memory_region)))
> > goto out;
> > +   if (region_ext.region.flags & KVM_MEM_PRIVATE) {
> > +   int offset = offsetof(
> > +   struct kvm_userspace_memory_region_ext,
> > +   private_offset);
> > +   if (copy_from_user(®ion_ext.private_offset,
> > +  argp + offset,
> > +  sizeof(region_ext) - offset))
> 
> In this patch, KVM_MEM_PRIVATE should result in an -EINVAL as it's not yet
> supported.  Copying the _ext on KVM_MEM_PRIVATE belongs in the "Expose 
> KVM_MEM_PRIVATE"
> patch.

Agreed.

> 
> Mechnically, what about first reading flags via get_user(), and then doing a 
> single
> copy_from_user()?  It's technically more work in the common case, and 
> requires an
> extra check to guard against TOCTOU attacks, but this isn't a fast path by 
> any means
> and IMO the end result makes it easier to understand the relationship between
> KVM_MEM_PRIVATE and the two different structs.

Will use this code, thanks for typing.

Chao
> 
> E.g.
> 
>   case KVM_SET_USER_MEMORY_REGION: {
>   struct kvm_user_mem_region region;
>   unsigned long size;
>   u32 flags;
> 
>   memset(®ion, 0, sizeof(region));
> 
>   r = -EFAULT;
>   if (get_user(flags, (u32 __user *)(argp + 
> offsetof(typeof(region), flags
>   goto out;
> 
>   if (flags & KVM_MEM_PRIVATE)
>   size = sizeof(struct kvm_userspace_memory_region_ext);
>   else
>   size = sizeof(struct kvm_userspace_memory_region);
>   if (copy_from_user(®ion, argp, size))
>   goto out;
> 
>   r = -EINVAL;
>   if ((flags ^ region.flags) & KVM_MEM_PRIVATE)
>   goto out;
> 
>   r = kvm_vm_ioctl_set_memory_region(kvm, ®ion);
>   break;
>   }
> 
> > +   goto out;
> > +   }
> >  
> > -   r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
> > +   r = kvm_vm_ioctl_set_memory_region(kvm, ®ion_ext);
> > break;
> > }
> > case KVM_GET_DIRTY_LOG: {
> > -- 
> > 2.17.1
> >

[PATCH 00/41] arm: Implement GICv4

2022-04-08 Thread Peter Maydell

This patchset implements emulation of GICv4 in our TCG GIC and ITS
models, and makes the virt board use it where appropriate.

The GICv4 provides a single new feature: direct injection of virtual
interrupts from the ITS to a VM. In QEMU terms this means that if you
have an outer QEMU which is emulating a CPU with EL2, and the outer
guest passes through a PCI device (probably one emulated by the outer
QEMU) to an inner guest, interrupts from that device can go directly
to the inner guest, rather than having to go to the outer guest and
the outer guest then synthesizing virtual interrupts to the inner
guest. (If you aren't configuring the inner guest with a passthrough
PCI device then this new feature is of no interest.)

The basic structure of the patchset is as follows:

(1) There are a handful of preliminary patches fixing some minor
existing nits.

(2) The v4 ITS has some new in-guest-memory data structures and new
ITS commands that let the guest set them up. The next sequence of
patches implement all those commands. Where the command needs to
actually do something (eg "deliver a vLPI"), these patches call
functions in the redistributor which are left as unimplemented stubs
to be filled in in subsequent patches. This first chunk of patches
sticks to the data-structure handling and all the command argument
unpacking and error checking.

(3) The redistributor has a new redistributor frame (ie the amount of
guest memory used by redistributor registers is larger) with a two new
registers in it. We implement these initially as reads-as-written.

(4) The CPU interface needs relatively minor changes: as well as
looking at the list registers to determine the highest priority
pending virtual interrupt, we must also look at the highest priority
pending vLPI. We implement these changes, again leaving the interfaces
from this code into the redistributor as stubs for the moment.

(5) Now we can fill in all the stub code in the redistributor.  This
is almost all working with the pending and config tables for virtual
LPIs. (Side note: in real hardware some of this work is done by the
ITS rather than the redistributor, but in our implementation we split
between the two source files slightly differently. I've made the vLPI
handling follow the pattern of the existing LPI handling.)

(6) Finally, we can update the ID registers which tell the guest about
the presence of v4 features, allow the GIC device to accept 4 as a
value for its QOM revision property, and make the virt board set that
when appropriate.

General notes:

Since the only useful thing in GICv4 is direct virtual interrupt
injection, it isn't expected that you would have a system with a GICv4
and a CPU without EL2. So I've made this an error, and the virt board
will only use GICv4 if the user also enables emulation of
virtualization.

Because the redistributor frame is twice the size in GICv4, the
number of redistributors we can fit into a given area of memory
is reduced. This means that when using GICv4 the maximum number
of CPUs supported on the virt board drops from 512 to 317. (No,
I'm not sure why this is 317 and not 256 :-))

I have not particularly considered performance in this initial
implementation. In particular, we will do a complete re-scan of a
virtual LPI pending table every time the outer guest reschedules a
vCPU (and writes GICR_VPENDBASER). The spec provides scope for
optimisation here, by allowing part of the LPI table to have IMPDEF
contents, which we could in principle use to cache information like
the current highest priority pending vLPI. Given that emulating
nested guests with PCI passthrough is a fairly niche activity,
I propose that we not do this unless the three people doing that
complain about this all being too slow :-)

Tested with a Linux kernel passing through a virtio-blk device
to an inner Linux VM with KVM/QEMU. (NB that to get the outer
Linux kernel to actually use the new GICv4 functionality you
need to pass it "kvm-arm.vgic_v4_enable=1", as the kernel
will not use it by default.)

thanks
-- PMM

Peter Maydell (41):
  hw/intc/arm_gicv3_its: Add missing blank line
  hw/intc/arm_gicv3: Sanity-check num-cpu property
  hw/intc/arm_gicv3: Insist that redist region capacity matches CPU count
  hw/intc/arm_gicv3: Report correct PIDR0 values for ID registers
  target/arm/cpu.c: ignore VIRQ and VFIQ if no EL2
  hw/intc/arm_gicv3_its: Factor out "is intid a valid LPI ID?"
  hw/intc/arm_gicv3_its: Implement GITS_BASER2 for GICv4
  hw/intc/arm_gicv3_its: Implement VMAPI and VMAPTI
  hw/intc/arm_gicv3_its: Implement VMAPP
  hw/intc/arm_gicv3_its: Distinguish success and error cases of CMD_CONTINUE
  hw/intc/arm_gicv3_its: Factor out "find ITE given devid, eventid"
  hw/intc/arm_gicv3_its: Factor out CTE lookup sequence
  hw/intc/arm_gicv3_its: Split out process_its_cmd() physical interrupt code
  hw/intc/arm_gicv3_its: Handle virtual interrupts in process_its_cmd()
  hw/intc/arm_gicv3: Keep pointers to every connected ITS
  hw/intc/arm

[PATCH 05/41] target/arm/cpu.c: ignore VIRQ and VFIQ if no EL2

2022-04-08 Thread Peter Maydell

In a GICv3, it is impossible for the GIC to deliver a VIRQ or VFIQ to
the CPU unless the CPU has EL2, because VIRQ and VFIQ are only
configurable via EL2-only system registers.  Moreover, in our
implementation we were only calculating and updating the state of the
VIRQ and VFIQ lines in gicv3_cpuif_virt_irq_fiq_update() when those
EL2 system registers changed.  We were therefore able to assert in
arm_cpu_set_irq() that we didn't see a VIRQ or VFIQ line update if
EL2 wasn't present.

This assumption no longer holds with GICv4:
 * even if the CPU does not have EL2 the guest is able to cause the
   GIC to deliver a virtual LPI by programming the ITS (which is a
   silly thing for it to do, but possible)
 * because we now need to recalculate the state of the VIRQ and VFIQ
   lines in more cases than just "some EL2 GIC sysreg was written",
   we will see calls to arm_cpu_set_irq() for "VIRQ is 0, VFIQ is 0"
   even if the guest is not using the virtual LPI parts of the ITS

Remove the assertions, and instead simply ignore the state of the
VIRQ and VFIQ lines if the CPU does not have EL2.

Signed-off-by: Peter Maydell 
---
 target/arm/cpu.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 5d4ca7a2270..1140ce5829e 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -694,6 +694,16 @@ static void arm_cpu_set_irq(void *opaque, int irq, int 
level)
 [ARM_CPU_VFIQ] = CPU_INTERRUPT_VFIQ
 };
 
+if (!arm_feature(env, ARM_FEATURE_EL2) &&
+(irq == ARM_CPU_VIRQ || irq == ARM_CPU_VFIQ)) {
+/*
+ * The GIC might tell us about VIRQ and VFIQ state, but if we don't
+ * have EL2 support we don't care. (Unless the guest is doing something
+ * silly this will only be calls saying "level is still 0".)
+ */
+return;
+}
+
 if (level) {
 env->irq_line_state |= mask[irq];
 } else {
@@ -702,11 +712,9 @@ static void arm_cpu_set_irq(void *opaque, int irq, int 
level)
 
 switch (irq) {
 case ARM_CPU_VIRQ:
-assert(arm_feature(env, ARM_FEATURE_EL2));
 arm_cpu_update_virq(cpu);
 break;
 case ARM_CPU_VFIQ:
-assert(arm_feature(env, ARM_FEATURE_EL2));
 arm_cpu_update_vfiq(cpu);
 break;
 case ARM_CPU_IRQ:
-- 
2.25.1

[PATCH 04/41] hw/intc/arm_gicv3: Report correct PIDR0 values for ID registers

2022-04-08 Thread Peter Maydell

We use the common function gicv3_idreg() to supply the CoreSight ID
register values for the GICv3 for the copies of these ID registers in
the distributor, redistributor and ITS register frames.  This isn't
quite correct, because while most of the register values are the
same, the PIDR0 value should vary to indicate which of these three
frames it is.  (You can see this and also the correct values of these
PIDR0 registers by looking at the GIC-600 or GIC-700 TRMs, for
example.)

Make gicv3_idreg() take an extra argument for the PIDR0 value.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 15 +--
 hw/intc/arm_gicv3_dist.c   |  2 +-
 hw/intc/arm_gicv3_its.c|  2 +-
 hw/intc/arm_gicv3_redist.c |  2 +-
 4 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 2bf1baef047..dec413f7cfa 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -555,7 +555,12 @@ static inline uint32_t gicv3_iidr(void)
 return 0x43b;
 }
 
-static inline uint32_t gicv3_idreg(int regoffset)
+/* CoreSight PIDR0 values for ARM GICv3 implementations */
+#define GICV3_PIDR0_DIST 0x92
+#define GICV3_PIDR0_REDIST 0x93
+#define GICV3_PIDR0_ITS 0x94
+
+static inline uint32_t gicv3_idreg(int regoffset, uint8_t pidr0)
 {
 /* Return the value of the CoreSight ID register at the specified
  * offset from the first ID register (as found in the distributor
@@ -565,7 +570,13 @@ static inline uint32_t gicv3_idreg(int regoffset)
 static const uint8_t gicd_ids[] = {
 0x44, 0x00, 0x00, 0x00, 0x92, 0xB4, 0x3B, 0x00, 0x0D, 0xF0, 0x05, 0xB1
 };
-return gicd_ids[regoffset / 4];
+
+regoffset /= 4;
+
+if (regoffset == 4) {
+return pidr0;
+}
+return gicd_ids[regoffset];
 }
 
 /**
diff --git a/hw/intc/arm_gicv3_dist.c b/hw/intc/arm_gicv3_dist.c
index 28d913b2114..7f6275363ea 100644
--- a/hw/intc/arm_gicv3_dist.c
+++ b/hw/intc/arm_gicv3_dist.c
@@ -557,7 +557,7 @@ static bool gicd_readl(GICv3State *s, hwaddr offset,
 }
 case GICD_IDREGS ... GICD_IDREGS + 0x2f:
 /* ID registers */
-*data = gicv3_idreg(offset - GICD_IDREGS);
+*data = gicv3_idreg(offset - GICD_IDREGS, GICV3_PIDR0_DIST);
 return true;
 case GICD_SGIR:
 /* WO registers, return unknown value */
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 44914f25780..f8467b61ec5 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1161,7 +1161,7 @@ static bool its_readl(GICv3ITSState *s, hwaddr offset,
 break;
 case GITS_IDREGS ... GITS_IDREGS + 0x2f:
 /* ID registers */
-*data = gicv3_idreg(offset - GITS_IDREGS);
+*data = gicv3_idreg(offset - GITS_IDREGS, GICV3_PIDR0_ITS);
 break;
 case GITS_TYPER:
 *data = extract64(s->typer, 0, 32);
diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 412a04f59cf..dc9729e8395 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -234,7 +234,7 @@ static MemTxResult gicr_readl(GICv3CPUState *cs, hwaddr 
offset,
 *data = cs->gicr_nsacr;
 return MEMTX_OK;
 case GICR_IDREGS ... GICR_IDREGS + 0x2f:
-*data = gicv3_idreg(offset - GICR_IDREGS);
+*data = gicv3_idreg(offset - GICR_IDREGS, GICV3_PIDR0_REDIST);
 return MEMTX_OK;
 default:
 return MEMTX_ERROR;
-- 
2.25.1

[PATCH 03/41] hw/intc/arm_gicv3: Insist that redist region capacity matches CPU count

2022-04-08 Thread Peter Maydell

Boards using the GICv3 need to configure it with both the total
number of CPUs and also the sizes of all the memory regions which
contain redistributors (one redistributor per CPU).  At the moment
the GICv3 checks that the number of CPUs specified is not too many to
fit in the defined redistributor regions, but in fact the code
assumes that the two match exactly.  For instance when we set the
GICR_TYPER.Last bit on the final redistributor in each region, we
assume that we don't need to consider the possibility of a region
being only half full of redistributors or even completely empty.  We
also assume in gicv3_redist_read() and gicv3_redist_write() that we
can calculate the CPU index from the offset within the MemoryRegion
and that this will always be in range.

Fortunately all the board code sets the redistributor region sizes to
exactly match the CPU count, so this isn't a visible bug.  We could
in theory make the GIC code handle non-full redistributor regions, or
have it automatically reduce the provided region sizes to match the
CPU count, but the simplest thing is just to strengthen the error
check and insist that the CPU count and redistributor region size
settings match exactly, since all the board code does that anyway.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index 90204be25b6..c797c82786b 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -354,9 +354,9 @@ static void arm_gicv3_common_realize(DeviceState *dev, 
Error **errp)
 for (i = 0; i < s->nb_redist_regions; i++) {
 rdist_capacity += s->redist_region_count[i];
 }
-if (rdist_capacity < s->num_cpu) {
+if (rdist_capacity != s->num_cpu) {
 error_setg(errp, "Capacity of the redist regions(%d) "
-   "is less than number of vcpus(%d)",
+   "does not match the number of vcpus(%d)",
rdist_capacity, s->num_cpu);
 return;
 }
-- 
2.25.1

[PATCH 06/41] hw/intc/arm_gicv3_its: Factor out "is intid a valid LPI ID?"

2022-04-08 Thread Peter Maydell

In process_mapti() we check interrupt IDs to see whether they are
in the valid LPI range. Factor this out into its own utility
function, as we're going to want it elsewhere too for GICv4.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_its.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index f8467b61ec5..400cdf83794 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -79,6 +79,12 @@ typedef enum ItsCmdResult {
 CMD_CONTINUE = 1,
 } ItsCmdResult;
 
+static inline bool intid_in_lpi_range(uint32_t id)
+{
+return id >= GICV3_LPI_INTID_START &&
+id < (1ULL << (GICD_TYPER_IDBITS + 1));
+}
+
 static uint64_t baser_base_addr(uint64_t value, uint32_t page_sz)
 {
 uint64_t result = 0;
@@ -410,7 +416,6 @@ static ItsCmdResult process_mapti(GICv3ITSState *s, const 
uint64_t *cmdpkt,
 uint32_t devid, eventid;
 uint32_t pIntid = 0;
 uint64_t num_eventids;
-uint32_t num_intids;
 uint16_t icid = 0;
 DTEntry dte;
 ITEntry ite;
@@ -438,7 +443,6 @@ static ItsCmdResult process_mapti(GICv3ITSState *s, const 
uint64_t *cmdpkt,
 return CMD_STALL;
 }
 num_eventids = 1ULL << (dte.size + 1);
-num_intids = 1ULL << (GICD_TYPER_IDBITS + 1);
 
 if (icid >= s->ct.num_entries) {
 qemu_log_mask(LOG_GUEST_ERROR,
@@ -460,7 +464,7 @@ static ItsCmdResult process_mapti(GICv3ITSState *s, const 
uint64_t *cmdpkt,
 return CMD_CONTINUE;
 }
 
-if (pIntid < GICV3_LPI_INTID_START || pIntid >= num_intids) {
+if (!intid_in_lpi_range(pIntid)) {
 qemu_log_mask(LOG_GUEST_ERROR,
   "%s: invalid interrupt ID 0x%x\n", __func__, pIntid);
 return CMD_CONTINUE;
-- 
2.25.1

[PATCH 02/41] hw/intc/arm_gicv3: Sanity-check num-cpu property

2022-04-08 Thread Peter Maydell

In the GICv3 code we implicitly rely on there being at least one CPU
and thus at least one redistributor and CPU interface.  Sanity-check
that the property the board code sets is not zero.

Signed-off-by: Peter Maydell 
---
Doing this would be a board code error, but we might as well
get a clean diagnostic for it and not have to think about
num_cpu == 0 as a special case later.
---
 hw/intc/arm_gicv3_common.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index 4ca5ae9bc56..90204be25b6 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -328,6 +328,10 @@ static void arm_gicv3_common_realize(DeviceState *dev, 
Error **errp)
s->num_irq, GIC_INTERNAL);
 return;
 }
+if (s->num_cpu == 0) {
+error_setg(errp, "num-cpu must be at least 1");
+return;
+}
 
 /* ITLinesNumber is represented as (N / 32) - 1, so this is an
  * implementation imposed restriction, not an architectural one,
-- 
2.25.1

[PATCH 07/41] hw/intc/arm_gicv3_its: Implement GITS_BASER2 for GICv4

2022-04-08 Thread Peter Maydell

The GICv4 defines a new in-guest-memory table for the ITS: this is
the vPE table.  Implement the new GITS_BASER2 register which the
guest uses to tell the ITS where the vPE table is located, including
the decode of the register fields into the TableDesc structure which
we do for the GITS_BASER when the guest enables the ITS.

We guard provision of the new register with the its_feature_virtual()
function, which does a check of the GITS_TYPER.Virtual bit which
indicates presence of ITS support for virtual LPIs.  Since this bit
is currently always zero, GICv4-specific features will not be
accessible to the guest yet.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 16 
 include/hw/intc/arm_gicv3_its_common.h |  1 +
 hw/intc/arm_gicv3_its.c| 25 +
 3 files changed, 42 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index dec413f7cfa..4613b9e59ba 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -280,6 +280,7 @@ FIELD(GITS_CTLR, ENABLED, 0, 1)
 FIELD(GITS_CTLR, QUIESCENT, 31, 1)
 
 FIELD(GITS_TYPER, PHYSICAL, 0, 1)
+FIELD(GITS_TYPER, VIRTUAL, 1, 1)
 FIELD(GITS_TYPER, ITT_ENTRY_SIZE, 4, 4)
 FIELD(GITS_TYPER, IDBITS, 8, 5)
 FIELD(GITS_TYPER, DEVBITS, 13, 5)
@@ -298,6 +299,7 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_BASER_PAGESIZE_64K   2
 
 #define GITS_BASER_TYPE_DEVICE   1ULL
+#define GITS_BASER_TYPE_VPE  2ULL
 #define GITS_BASER_TYPE_COLLECTION   4ULL
 
 #define GITS_PAGE_SIZE_4K   0x1000
@@ -419,6 +421,20 @@ FIELD(DTE, ITTADDR, 6, 44)
 FIELD(CTE, VALID, 0, 1)
 FIELD(CTE, RDBASE, 1, RDBASE_PROCNUM_LENGTH)
 
+/*
+ * 8 bytes VPE table entry size:
+ * Valid = 1 bit, VPTsize = 5 bits, VPTaddr = 36 bits, RDbase = 16 bits
+ *
+ * Field sizes for Valid and size are mandated; field sizes for RDbase
+ * and VPT_addr are IMPDEF.
+ */
+#define GITS_VPE_SIZE 0x8ULL
+
+FIELD(VTE, VALID, 0, 1)
+FIELD(VTE, VPTSIZE, 1, 5)
+FIELD(VTE, VPTADDR, 6, 36)
+FIELD(VTE, RDBASE, 42, RDBASE_PROCNUM_LENGTH)
+
 /* Special interrupt IDs */
 #define INTID_SECURE 1020
 #define INTID_NONSECURE 1021
diff --git a/include/hw/intc/arm_gicv3_its_common.h 
b/include/hw/intc/arm_gicv3_its_common.h
index 0f130494dd3..7d1cc0f7177 100644
--- a/include/hw/intc/arm_gicv3_its_common.h
+++ b/include/hw/intc/arm_gicv3_its_common.h
@@ -78,6 +78,7 @@ struct GICv3ITSState {
 
 TableDesc  dt;
 TableDesc  ct;
+TableDesc  vpet;
 CmdQDesc   cq;
 
 Error *migration_blocker;
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 400cdf83794..609c29496a0 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -79,6 +79,12 @@ typedef enum ItsCmdResult {
 CMD_CONTINUE = 1,
 } ItsCmdResult;
 
+/* True if the ITS supports the GICv4 virtual LPI feature */
+static bool its_feature_virtual(GICv3ITSState *s)
+{
+return s->typer & R_GITS_TYPER_VIRTUAL_MASK;
+}
+
 static inline bool intid_in_lpi_range(uint32_t id)
 {
 return id >= GICV3_LPI_INTID_START &&
@@ -946,6 +952,15 @@ static void extract_table_params(GICv3ITSState *s)
 idbits = 16;
 }
 break;
+case GITS_BASER_TYPE_VPE:
+td = &s->vpet;
+/*
+ * For QEMU vPEIDs are always 16 bits. (GICv4.1 allows an
+ * implementation to implement fewer bits and report this
+ * via GICD_TYPER2.)
+ */
+idbits = 16;
+break;
 default:
 /*
  * GITS_BASER.TYPE is read-only, so GITS_BASER_RO_MASK
@@ -1425,6 +1440,7 @@ static void gicv3_its_reset(DeviceState *dev)
 /*
  * setting GITS_BASER0.Type = 0b001 (Device)
  * GITS_BASER1.Type = 0b100 (Collection Table)
+ * GITS_BASER2.Type = 0b010 (vPE) for GICv4 and later
  * GITS_BASER.Type,where n = 3 to 7 are 0b00 (Unimplemented)
  * GITS_BASER<0,1>.Page_Size = 64KB
  * and default translation table entry size to 16 bytes
@@ -1442,6 +1458,15 @@ static void gicv3_its_reset(DeviceState *dev)
  GITS_BASER_PAGESIZE_64K);
 s->baser[1] = FIELD_DP64(s->baser[1], GITS_BASER, ENTRYSIZE,
  GITS_CTE_SIZE - 1);
+
+if (its_feature_virtual(s)) {
+s->baser[2] = FIELD_DP64(s->baser[2], GITS_BASER, TYPE,
+ GITS_BASER_TYPE_VPE);
+s->baser[2] = FIELD_DP64(s->baser[2], GITS_BASER, PAGESIZE,
+ GITS_BASER_PAGESIZE_64K);
+s->baser[2] = FIELD_DP64(s->baser[2], GITS_BASER, ENTRYSIZE,
+ GITS_VPE_SIZE - 1);
+}
 }
 
 static void gicv3_its_post_load(GICv3ITSState *s)
-- 
2.25.1

[PATCH 08/41] hw/intc/arm_gicv3_its: Implement VMAPI and VMAPTI

2022-04-08 Thread Peter Maydell

Implement the GICv4 VMAPI and VMAPTI commands. These write
an interrupt translation table entry that maps (DeviceID,EventID)
to (vPEID,vINTID,doorbell). The only difference between VMAPI
and VMAPTI is that VMAPI assumes vINTID == EventID rather than
both being specified in the command packet.

(This code won't be reachable until we allow the GIC version to be
set to 4.  Support for reading this new virtual-interrupt DTE and
handling it correctly will be implemented in a later commit.)

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h |  9 
 hw/intc/arm_gicv3_its.c  | 91 
 hw/intc/trace-events |  2 +
 3 files changed, 102 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 4613b9e59ba..d3670a8894e 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -329,6 +329,8 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_CMD_INVALL   0x0D
 #define GITS_CMD_MOVALL   0x0E
 #define GITS_CMD_DISCARD  0x0F
+#define GITS_CMD_VMAPTI   0x2A
+#define GITS_CMD_VMAPI0x2B
 
 /* MAPC command fields */
 #define ICID_LENGTH  16
@@ -368,6 +370,13 @@ FIELD(MOVI_0, DEVICEID, 32, 32)
 FIELD(MOVI_1, EVENTID, 0, 32)
 FIELD(MOVI_2, ICID, 0, 16)
 
+/* VMAPI, VMAPTI command fields */
+FIELD(VMAPTI_0, DEVICEID, 32, 32)
+FIELD(VMAPTI_1, EVENTID, 0, 32)
+FIELD(VMAPTI_1, VPEID, 32, 16)
+FIELD(VMAPTI_2, VINTID, 0, 32) /* VMAPTI only */
+FIELD(VMAPTI_2, DOORBELL, 32, 32)
+
 /*
  * 12 bytes Interrupt translation Table Entry size
  * as per Table 5.3 in GICv3 spec
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 609c29496a0..e1f26a205e4 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -91,6 +91,12 @@ static inline bool intid_in_lpi_range(uint32_t id)
 id < (1ULL << (GICD_TYPER_IDBITS + 1));
 }
 
+static inline bool valid_doorbell(uint32_t id)
+{
+/* Doorbell fields may be an LPI, or 1023 to mean "no doorbell" */
+return id == INTID_SPURIOUS || intid_in_lpi_range(id);
+}
+
 static uint64_t baser_base_addr(uint64_t value, uint32_t page_sz)
 {
 uint64_t result = 0;
@@ -486,6 +492,85 @@ static ItsCmdResult process_mapti(GICv3ITSState *s, const 
uint64_t *cmdpkt,
 return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE : CMD_STALL;
 }
 
+static ItsCmdResult process_vmapti(GICv3ITSState *s, const uint64_t *cmdpkt,
+   bool ignore_vintid)
+{
+uint32_t devid, eventid, vintid, doorbell, vpeid;
+uint32_t num_eventids;
+DTEntry dte;
+ITEntry ite;
+
+if (!its_feature_virtual(s)) {
+return CMD_CONTINUE;
+}
+
+devid = FIELD_EX64(cmdpkt[0], VMAPTI_0, DEVICEID);
+eventid = FIELD_EX64(cmdpkt[1], VMAPTI_1, EVENTID);
+vpeid = FIELD_EX64(cmdpkt[1], VMAPTI_1, VPEID);
+doorbell = FIELD_EX64(cmdpkt[2], VMAPTI_2, DOORBELL);
+if (ignore_vintid) {
+vintid = eventid;
+trace_gicv3_its_cmd_vmapi(devid, eventid, vpeid, doorbell);
+} else {
+vintid = FIELD_EX64(cmdpkt[2], VMAPTI_2, VINTID);
+trace_gicv3_its_cmd_vmapti(devid, eventid, vpeid, vintid, doorbell);
+}
+
+if (devid >= s->dt.num_entries) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid DeviceID 0x%x (must be less than 0x%x)\n",
+  __func__, devid, s->dt.num_entries);
+return CMD_CONTINUE;
+}
+
+if (get_dte(s, devid, &dte) != MEMTX_OK) {
+return CMD_STALL;
+}
+
+if (!dte.valid) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: no entry in device table for DeviceID 0x%x\n",
+  __func__, devid);
+return CMD_CONTINUE;
+}
+
+num_eventids = 1ULL << (dte.size + 1);
+
+if (eventid >= num_eventids) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: EventID 0x%x too large for DeviceID 0x%x "
+  "(must be less than 0x%x)\n",
+  __func__, eventid, devid, num_eventids);
+return CMD_CONTINUE;
+}
+if (!intid_in_lpi_range(vintid)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: VIntID 0x%x not a valid LPI\n",
+  __func__, vintid);
+return CMD_CONTINUE;
+}
+if (!valid_doorbell(doorbell)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Doorbell 0x%x not 1023 and not a valid LPI\n",
+  __func__, doorbell);
+return CMD_CONTINUE;
+}
+if (vpeid >= s->vpet.num_entries) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: VPEID 0x%x out of range (must be less than 0x%x)\n",
+  __func__, vpeid, s->vpet.num_entries);
+return CMD_CONTINUE;
+}
+/* add ite entry to interrupt translation table */
+ite.valid = true;
+ite.inttype = ITE_INTTYPE_VIRTUAL;
+ite.intid = vintid;
+ite.icid = 0;
+ite.door

[PATCH 01/41] hw/intc/arm_gicv3_its: Add missing blank line

2022-04-08 Thread Peter Maydell

In commit b6f96009acc we split do_process_its_cmd() from
process_its_cmd(), but forgot the usual blank line between function
definitions.  Add it.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_its.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 87466732139..44914f25780 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -380,6 +380,7 @@ static ItsCmdResult do_process_its_cmd(GICv3ITSState *s, 
uint32_t devid,
 }
 return CMD_CONTINUE;
 }
+
 static ItsCmdResult process_its_cmd(GICv3ITSState *s, const uint64_t *cmdpkt,
 ItsCmdType cmd)
 {
-- 
2.25.1

[PATCH 10/41] hw/intc/arm_gicv3_its: Distinguish success and error cases of CMD_CONTINUE

2022-04-08 Thread Peter Maydell

In the ItsCmdResult enum, we currently distinguish only CMD_STALL
(failure, stall processing of the command queue) and CMD_CONTINUE
(keep processing the queue), and we use the latter both for "there
was a parameter error, go on to the next command" and "the command
succeeded, go on to the next command".  Sometimes we would like to
distinguish those two cases, so add CMD_CONTINUE_OK to the enum to
represent the success situation, and use it in the relevant places.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_its.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index ea2b4b9e5a7..ba1893c072b 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -78,11 +78,13 @@ typedef struct VTEntry {
  * and continue processing.
  * The process_* functions which handle individual ITS commands all
  * return an ItsCmdResult which tells process_cmdq() whether it should
- * stall or keep going.
+ * stall, keep going because of an error, or keep going because the
+ * command was a success.
  */
 typedef enum ItsCmdResult {
 CMD_STALL = 0,
 CMD_CONTINUE = 1,
+CMD_CONTINUE_OK = 2,
 } ItsCmdResult;
 
 /* True if the ITS supports the GICv4 virtual LPI feature */
@@ -400,9 +402,9 @@ static ItsCmdResult do_process_its_cmd(GICv3ITSState *s, 
uint32_t devid,
 ITEntry ite = {};
 /* remove mapping from interrupt translation table */
 ite.valid = false;
-return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE : CMD_STALL;
+return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE_OK : 
CMD_STALL;
 }
-return CMD_CONTINUE;
+return CMD_CONTINUE_OK;
 }
 
 static ItsCmdResult process_its_cmd(GICv3ITSState *s, const uint64_t *cmdpkt,
@@ -495,7 +497,7 @@ static ItsCmdResult process_mapti(GICv3ITSState *s, const 
uint64_t *cmdpkt,
 ite.icid = icid;
 ite.doorbell = INTID_SPURIOUS;
 ite.vpeid = 0;
-return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE : CMD_STALL;
+return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
 static ItsCmdResult process_vmapti(GICv3ITSState *s, const uint64_t *cmdpkt,
@@ -574,7 +576,7 @@ static ItsCmdResult process_vmapti(GICv3ITSState *s, const 
uint64_t *cmdpkt,
 ite.icid = 0;
 ite.doorbell = doorbell;
 ite.vpeid = vpeid;
-return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE : CMD_STALL;
+return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
 /*
@@ -635,7 +637,7 @@ static ItsCmdResult process_mapc(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 return CMD_CONTINUE;
 }
 
-return update_cte(s, icid, &cte) ? CMD_CONTINUE : CMD_STALL;
+return update_cte(s, icid, &cte) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
 /*
@@ -696,7 +698,7 @@ static ItsCmdResult process_mapd(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 return CMD_CONTINUE;
 }
 
-return update_dte(s, devid, &dte) ? CMD_CONTINUE : CMD_STALL;
+return update_dte(s, devid, &dte) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
 static ItsCmdResult process_movall(GICv3ITSState *s, const uint64_t *cmdpkt)
@@ -725,13 +727,13 @@ static ItsCmdResult process_movall(GICv3ITSState *s, 
const uint64_t *cmdpkt)
 
 if (rd1 == rd2) {
 /* Move to same target must succeed as a no-op */
-return CMD_CONTINUE;
+return CMD_CONTINUE_OK;
 }
 
 /* Move all pending LPIs from redistributor 1 to redistributor 2 */
 gicv3_redist_movall_lpis(&s->gicv3->cpu[rd1], &s->gicv3->cpu[rd2]);
 
-return CMD_CONTINUE;
+return CMD_CONTINUE_OK;
 }
 
 static ItsCmdResult process_movi(GICv3ITSState *s, const uint64_t *cmdpkt)
@@ -845,7 +847,7 @@ static ItsCmdResult process_movi(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 
 /* Update the ICID field in the interrupt translation table entry */
 old_ite.icid = new_icid;
-return update_ite(s, eventid, &dte, &old_ite) ? CMD_CONTINUE : CMD_STALL;
+return update_ite(s, eventid, &dte, &old_ite) ? CMD_CONTINUE_OK : 
CMD_STALL;
 }
 
 /*
@@ -924,7 +926,7 @@ static ItsCmdResult process_vmapp(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 return CMD_CONTINUE;
 }
 
-return update_vte(s, vpeid, &vte) ? CMD_CONTINUE : CMD_STALL;
+return update_vte(s, vpeid, &vte) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
 /*
@@ -963,7 +965,7 @@ static void process_cmdq(GICv3ITSState *s)
 }
 
 while (wr_offset != rd_offset) {
-ItsCmdResult result = CMD_CONTINUE;
+ItsCmdResult result = CMD_CONTINUE_OK;
 void *hostmem;
 hwaddr buflen;
 uint64_t cmdpkt[GITS_CMDQ_ENTRY_WORDS];
@@ -1055,7 +1057,8 @@ static void process_cmdq(GICv3ITSState *s)
 trace_gicv3_its_cmd_unknown(cmd);
 break;
 }
-if (result == CMD_CONTINUE) {
+if (result != CMD_STALL) {
+/* CMD_CONTINUE or CMD_CONTINUE_OK */
 rd_offset++;
 rd

[PATCH 11/41] hw/intc/arm_gicv3_its: Factor out "find ITE given devid, eventid"

2022-04-08 Thread Peter Maydell

The operation of finding an interrupt table entry given a (DeviceID,
EventID) pair is necessary in multiple different ITS commands.  The
process requires first using the DeviceID as an index into the device
table to find the DTE, and then useng the EventID as an index into
the interrupt table specified by that DTE to find the ITE.  We also
need to handle all the possible error cases: indexes out of range,
table memory not readable, table entries not valid.

Factor this out into a separate lookup_ite() function which we
can then call from the places where we were previously open-coding
this sequence. We'll also need this for some of the new GICv4.0
commands.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_its.c | 124 +---
 1 file changed, 64 insertions(+), 60 deletions(-)

diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index ba1893c072b..fe1bea2dd81 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -314,6 +314,60 @@ out:
 return res;
 }
 
+/*
+ * Given a (DeviceID, EventID), look up the corresponding ITE, including
+ * checking for the various invalid-value cases. If we find a valid ITE,
+ * fill in @ite and @dte and return CMD_CONTINUE_OK. Otherwise return
+ * CMD_STALL or CMD_CONTINUE as appropriate (and the contents of @ite
+ * should not be relied on).
+ *
+ * The string @who is purely for the LOG_GUEST_ERROR messages,
+ * and should indicate the name of the calling function or similar.
+ */
+static ItsCmdResult lookup_ite(GICv3ITSState *s, const char *who,
+   uint32_t devid, uint32_t eventid, ITEntry *ite,
+   DTEntry *dte)
+{
+uint64_t num_eventids;
+
+if (devid >= s->dt.num_entries) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid command attributes: devid %d>=%d",
+  who, devid, s->dt.num_entries);
+return CMD_CONTINUE;
+}
+
+if (get_dte(s, devid, dte) != MEMTX_OK) {
+return CMD_STALL;
+}
+if (!dte->valid) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid command attributes: "
+  "invalid dte for %d\n", who, devid);
+return CMD_CONTINUE;
+}
+
+num_eventids = 1ULL << (dte->size + 1);
+if (eventid >= num_eventids) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid command attributes: eventid %d >= %"
+  PRId64 "\n", who, eventid, num_eventids);
+return CMD_CONTINUE;
+}
+
+if (get_ite(s, eventid, dte, ite) != MEMTX_OK) {
+return CMD_STALL;
+}
+
+if (!ite->valid) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid command attributes: invalid ITE\n", who);
+return CMD_CONTINUE;
+}
+
+return CMD_CONTINUE_OK;
+}
+
 /*
  * This function handles the processing of following commands based on
  * the ItsCmdType parameter passed:-
@@ -325,42 +379,17 @@ out:
 static ItsCmdResult do_process_its_cmd(GICv3ITSState *s, uint32_t devid,
uint32_t eventid, ItsCmdType cmd)
 {
-uint64_t num_eventids;
 DTEntry dte;
 CTEntry cte;
 ITEntry ite;
+ItsCmdResult cmdres;
 
-if (devid >= s->dt.num_entries) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: devid %d>=%d",
-  __func__, devid, s->dt.num_entries);
-return CMD_CONTINUE;
+cmdres = lookup_ite(s, __func__, devid, eventid, &ite, &dte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
 }
 
-if (get_dte(s, devid, &dte) != MEMTX_OK) {
-return CMD_STALL;
-}
-if (!dte.valid) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: "
-  "invalid dte for %d\n", __func__, devid);
-return CMD_CONTINUE;
-}
-
-num_eventids = 1ULL << (dte.size + 1);
-if (eventid >= num_eventids) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: eventid %d >= %"
-  PRId64 "\n",
-  __func__, eventid, num_eventids);
-return CMD_CONTINUE;
-}
-
-if (get_ite(s, eventid, &dte, &ite) != MEMTX_OK) {
-return CMD_STALL;
-}
-
-if (!ite.valid || ite.inttype != ITE_INTTYPE_PHYSICAL) {
+if (ite.inttype != ITE_INTTYPE_PHYSICAL) {
 qemu_log_mask(LOG_GUEST_ERROR,
   "%s: invalid command attributes: invalid ITE\n",
   __func__);
@@ -740,10 +769,10 @@ static ItsCmdResult process_movi(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 {
 uint32_t devid, eventid;
 uint16_t new_icid;
-uint64_t num_eventids;
 DTEntry dte;
 CTEntry old_cte, new_cte;
 ITEntry old_ite;
+ItsCmdResult cmdres;
 
 devid = FIELD_EX64(cmdpkt[0], MOVI_0, DEVICEID);
 eventid = FIELD_EX64(cm

[PATCH 20/41] hw/intc/arm_gicv3_its: Implement VMOVI

2022-04-08 Thread Peter Maydell

Implement the GICv4 VMOVI command, which moves the pending state
of a virtual interrupt from one redistributor to another. As with
MOVI, we handle the "parse and validate command arguments and
table lookups" part in the ITS source file, and pass the final
results to a function in the redistributor which will do the
actual operation. As with the "make a VLPI pending" change,
for the moment we leave that redistributor function as a stub,
to be implemented in a later commit.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 23 +++
 hw/intc/arm_gicv3_its.c| 82 ++
 hw/intc/arm_gicv3_redist.c | 10 +
 hw/intc/trace-events   |  1 +
 4 files changed, 116 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 2f653a9b917..050e19d133b 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -329,6 +329,7 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_CMD_INVALL   0x0D
 #define GITS_CMD_MOVALL   0x0E
 #define GITS_CMD_DISCARD  0x0F
+#define GITS_CMD_VMOVI0x21
 #define GITS_CMD_VMOVP0x22
 #define GITS_CMD_VSYNC0x25
 #define GITS_CMD_VMAPP0x29
@@ -403,6 +404,13 @@ FIELD(VMOVP_2, RDBASE, 16, 36)
 FIELD(VMOVP_2, DB, 63, 1) /* GICv4.1 only */
 FIELD(VMOVP_3, DEFAULT_DOORBELL, 0, 32) /* GICv4.1 only */
 
+/* VMOVI command fields */
+FIELD(VMOVI_0, DEVICEID, 32, 32)
+FIELD(VMOVI_1, EVENTID, 0, 32)
+FIELD(VMOVI_1, VPEID, 32, 16)
+FIELD(VMOVI_2, D, 0, 1)
+FIELD(VMOVI_2, DOORBELL, 32, 32)
+
 /*
  * 12 bytes Interrupt translation Table Entry size
  * as per Table 5.3 in GICv3 spec
@@ -614,6 +622,21 @@ void gicv3_redist_mov_lpi(GICv3CPUState *src, 
GICv3CPUState *dest, int irq);
  * by the ITS MOVALL command.
  */
 void gicv3_redist_movall_lpis(GICv3CPUState *src, GICv3CPUState *dest);
+/**
+ * gicv3_redist_mov_vlpi:
+ * @src: source redistributor
+ * @src_vptaddr: (guest) address of source VLPI table
+ * @dest: destination redistributor
+ * @dest_vptaddr: (guest) address of destination VLPI table
+ * @irq: VLPI to update
+ * @doorbell: doorbell for destination (1023 for "no doorbell")
+ *
+ * Move the pending state of the specified VLPI from @src to @dest,
+ * as required by the ITS VMOVI command.
+ */
+void gicv3_redist_mov_vlpi(GICv3CPUState *src, uint64_t src_vptaddr,
+   GICv3CPUState *dest, uint64_t dest_vptaddr,
+   int irq, int doorbell);
 
 void gicv3_redist_send_sgi(GICv3CPUState *cs, int grp, int irq, bool ns);
 void gicv3_init_cpuif(GICv3State *s);
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index c8b90e6b0d9..aef024009b2 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1084,6 +1084,85 @@ static ItsCmdResult process_vmovp(GICv3ITSState *s, 
const uint64_t *cmdpkt)
 return cbdata.result;
 }
 
+static ItsCmdResult process_vmovi(GICv3ITSState *s, const uint64_t *cmdpkt)
+{
+uint32_t devid, eventid, vpeid, doorbell;
+bool doorbell_valid;
+DTEntry dte;
+ITEntry ite;
+VTEntry old_vte, new_vte;
+ItsCmdResult cmdres;
+
+if (!its_feature_virtual(s)) {
+return CMD_CONTINUE;
+}
+
+devid = FIELD_EX64(cmdpkt[0], VMOVI_0, DEVICEID);
+eventid = FIELD_EX64(cmdpkt[1], VMOVI_1, EVENTID);
+vpeid = FIELD_EX64(cmdpkt[1], VMOVI_1, VPEID);
+doorbell_valid = FIELD_EX64(cmdpkt[2], VMOVI_2, D);
+doorbell = FIELD_EX64(cmdpkt[2], VMOVI_2, DOORBELL);
+
+trace_gicv3_its_cmd_vmovi(devid, eventid, vpeid, doorbell_valid, doorbell);
+
+if (doorbell_valid && !valid_doorbell(doorbell)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid doorbell 0x%x\n", __func__, doorbell);
+return CMD_CONTINUE;
+}
+
+cmdres = lookup_ite(s, __func__, devid, eventid, &ite, &dte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+
+if (ite.inttype != ITE_INTTYPE_VIRTUAL) {
+qemu_log_mask(LOG_GUEST_ERROR, "%s: ITE is not for virtual 
interrupt\n",
+  __func__);
+return CMD_CONTINUE;
+}
+
+cmdres = lookup_vte(s, __func__, ite.vpeid, &old_vte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+cmdres = lookup_vte(s, __func__, vpeid, &new_vte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+
+if (!intid_in_lpi_range(ite.intid) ||
+ite.intid >= (1ULL << (old_vte.vptsize + 1)) ||
+ite.intid >= (1ULL << (new_vte.vptsize + 1))) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: ITE intid 0x%x out of range\n",
+  __func__, ite.intid);
+return CMD_CONTINUE;
+}
+
+ite.vpeid = vpeid;
+if (doorbell_valid) {
+ite.doorbell = doorbell;
+}
+
+/*
+ * Move the LPI from the old redistributor to the new one. We don't
+ * need to do anything if the guest somehow specified the
+ * same pendi

[PATCH 09/41] hw/intc/arm_gicv3_its: Implement VMAPP

2022-04-08 Thread Peter Maydell

Implement the GICv4 VMAPP command, which writes an entry to the vPE
table.

For GICv4.1 this command has extra fields in the command packet
and additional behaviour. We define the 4.1-only fields with the
FIELD macro, but only implement the GICv4.0 version of the command.

Signed-off-by: Peter Maydell 
---
GICv4.1 emulation is on my todo list, but I'm starting with
just the 4.0 parts first.
---
 hw/intc/gicv3_internal.h | 12 ++
 hw/intc/arm_gicv3_its.c  | 88 
 hw/intc/trace-events |  2 +
 3 files changed, 102 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index d3670a8894e..bbb8a20ce61 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -329,6 +329,7 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_CMD_INVALL   0x0D
 #define GITS_CMD_MOVALL   0x0E
 #define GITS_CMD_DISCARD  0x0F
+#define GITS_CMD_VMAPP0x29
 #define GITS_CMD_VMAPTI   0x2A
 #define GITS_CMD_VMAPI0x2B
 
@@ -377,6 +378,17 @@ FIELD(VMAPTI_1, VPEID, 32, 16)
 FIELD(VMAPTI_2, VINTID, 0, 32) /* VMAPTI only */
 FIELD(VMAPTI_2, DOORBELL, 32, 32)
 
+/* VMAPP command fields */
+FIELD(VMAPP_0, ALLOC, 8, 1) /* GICv4.1 only */
+FIELD(VMAPP_0, PTZ, 9, 1) /* GICv4.1 only */
+FIELD(VMAPP_0, VCONFADDR, 16, 36) /* GICv4.1 only */
+FIELD(VMAPP_1, DEFAULT_DOORBELL, 0, 32) /* GICv4.1 only */
+FIELD(VMAPP_1, VPEID, 32, 16)
+FIELD(VMAPP_2, RDBASE, 16, 36)
+FIELD(VMAPP_2, V, 63, 1)
+FIELD(VMAPP_3, VPTSIZE, 0, 8) /* For GICv4.0, bits [7:6] are RES0 */
+FIELD(VMAPP_3, VPTADDR, 16, 36)
+
 /*
  * 12 bytes Interrupt translation Table Entry size
  * as per Table 5.3 in GICv3 spec
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index e1f26a205e4..ea2b4b9e5a7 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -61,6 +61,12 @@ typedef struct ITEntry {
 uint32_t vpeid;
 } ITEntry;
 
+typedef struct VTEntry {
+bool valid;
+unsigned vptsize;
+uint32_t rdbase;
+uint64_t vptaddr;
+} VTEntry;
 
 /*
  * The ITS spec permits a range of CONSTRAINED UNPREDICTABLE options
@@ -842,6 +848,85 @@ static ItsCmdResult process_movi(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 return update_ite(s, eventid, &dte, &old_ite) ? CMD_CONTINUE : CMD_STALL;
 }
 
+/*
+ * Update the vPE Table entry at index @vpeid with the entry @vte.
+ * Returns true on success, false if there was a memory access error.
+ */
+static bool update_vte(GICv3ITSState *s, uint32_t vpeid, const VTEntry *vte)
+{
+AddressSpace *as = &s->gicv3->dma_as;
+uint64_t entry_addr;
+uint64_t vteval = 0;
+MemTxResult res = MEMTX_OK;
+
+trace_gicv3_its_vte_write(vpeid, vte->valid, vte->vptsize, vte->vptaddr,
+  vte->rdbase);
+
+if (vte->valid) {
+vteval = FIELD_DP64(vteval, VTE, VALID, 1);
+vteval = FIELD_DP64(vteval, VTE, VPTSIZE, vte->vptsize);
+vteval = FIELD_DP64(vteval, VTE, VPTADDR, vte->vptaddr);
+vteval = FIELD_DP64(vteval, VTE, RDBASE, vte->rdbase);
+}
+
+entry_addr = table_entry_addr(s, &s->vpet, vpeid, &res);
+if (res != MEMTX_OK) {
+return false;
+}
+if (entry_addr == -1) {
+/* No L2 table for this index: discard write and continue */
+return true;
+}
+address_space_stq_le(as, entry_addr, vteval, MEMTXATTRS_UNSPECIFIED, &res);
+return res == MEMTX_OK;
+}
+
+static ItsCmdResult process_vmapp(GICv3ITSState *s, const uint64_t *cmdpkt)
+{
+VTEntry vte;
+uint32_t vpeid;
+
+if (!its_feature_virtual(s)) {
+return CMD_CONTINUE;
+}
+
+vpeid = FIELD_EX64(cmdpkt[1], VMAPP_1, VPEID);
+vte.rdbase = FIELD_EX64(cmdpkt[2], VMAPP_2, RDBASE);
+vte.valid = FIELD_EX64(cmdpkt[2], VMAPP_2, V);
+vte.vptsize = FIELD_EX64(cmdpkt[3], VMAPP_3, VPTSIZE);
+vte.vptaddr = FIELD_EX64(cmdpkt[3], VMAPP_3, VPTADDR);
+
+trace_gicv3_its_cmd_vmapp(vpeid, vte.rdbase, vte.valid,
+  vte.vptaddr, vte.vptsize);
+
+/*
+ * For GICv4.0 the VPT_size field is only 5 bits, whereas we
+ * define our field macros to include the full GICv4.1 8 bits.
+ * The range check on VPT_size will catch the cases where
+ * the guest set the RES0-in-GICv4.0 bits [7:6].
+ */
+if (vte.vptsize > FIELD_EX64(s->typer, GITS_TYPER, IDBITS)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid VPT_size 0x%x\n", __func__, vte.vptsize);
+return CMD_CONTINUE;
+}
+
+if (vte.valid && vte.rdbase >= s->gicv3->num_cpu) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid rdbase 0x%x\n", __func__, vte.rdbase);
+return CMD_CONTINUE;
+}
+
+if (vpeid >= s->vpet.num_entries) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: VPEID 0x%x out of range (must be less than 0x%x)\n",
+  __func__, vpeid, s->vpet.num_entries);
+return CMD_CONTINUE

[PATCH 13/41] hw/intc/arm_gicv3_its: Split out process_its_cmd() physical interrupt code

2022-04-08 Thread Peter Maydell

Split the part of process_its_cmd() which is specific to physical
interrupts into its own function.  This is the part which starts by
taking the ICID and looking it up in the collection table.  The
handling of virtual interrupts is significantly different (involving
a lookup in the vPE table) so structuring the code with one
sub-function for the physical interrupt case and one for the virtual
interrupt case will be clearer than putting both cases in one large
function.

The code for handling the "remove mapping from ITE" for the DISCARD
command remains in process_its_cmd() because it is common to both
virtual and physical interrupts.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_its.c | 51 ++---
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 2cbac76256d..8ea1fc366d3 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -397,6 +397,19 @@ static ItsCmdResult lookup_cte(GICv3ITSState *s, const 
char *who,
 return CMD_CONTINUE_OK;
 }
 
+static ItsCmdResult process_its_cmd_phys(GICv3ITSState *s, const ITEntry *ite,
+ int irqlevel)
+{
+CTEntry cte;
+ItsCmdResult cmdres;
+
+cmdres = lookup_cte(s, __func__, ite->icid, &cte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+gicv3_redist_process_lpi(&s->gicv3->cpu[cte.rdbase], ite->intid, irqlevel);
+return CMD_CONTINUE_OK;
+}
 
 /*
  * This function handles the processing of following commands based on
@@ -410,34 +423,36 @@ static ItsCmdResult do_process_its_cmd(GICv3ITSState *s, 
uint32_t devid,
uint32_t eventid, ItsCmdType cmd)
 {
 DTEntry dte;
-CTEntry cte;
 ITEntry ite;
 ItsCmdResult cmdres;
+int irqlevel;
 
 cmdres = lookup_ite(s, __func__, devid, eventid, &ite, &dte);
 if (cmdres != CMD_CONTINUE_OK) {
 return cmdres;
 }
 
-if (ite.inttype != ITE_INTTYPE_PHYSICAL) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: invalid ITE\n",
-  __func__);
-return CMD_CONTINUE;
+irqlevel = (cmd == CLEAR || cmd == DISCARD) ? 0 : 1;
+
+switch (ite.inttype) {
+case ITE_INTTYPE_PHYSICAL:
+cmdres = process_its_cmd_phys(s, &ite, irqlevel);
+break;
+case ITE_INTTYPE_VIRTUAL:
+if (!its_feature_virtual(s)) {
+/* Can't happen unless guest is illegally writing to table memory 
*/
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid type %d in ITE (table corrupted?)\n",
+  __func__, ite.inttype);
+return CMD_CONTINUE;
+}
+/* The GICv4 virtual interrupt handling will go here */
+g_assert_not_reached();
+default:
+g_assert_not_reached();
 }
 
-cmdres = lookup_cte(s, __func__, ite.icid, &cte);
-if (cmdres != CMD_CONTINUE_OK) {
-return cmdres;
-}
-
-if ((cmd == CLEAR) || (cmd == DISCARD)) {
-gicv3_redist_process_lpi(&s->gicv3->cpu[cte.rdbase], ite.intid, 0);
-} else {
-gicv3_redist_process_lpi(&s->gicv3->cpu[cte.rdbase], ite.intid, 1);
-}
-
-if (cmd == DISCARD) {
+if (cmdres == CMD_CONTINUE_OK && cmd == DISCARD) {
 ITEntry ite = {};
 /* remove mapping from interrupt translation table */
 ite.valid = false;
-- 
2.25.1

[PATCH 18/41] hw/intc/arm_gicv3_its: Implement INV command properly

2022-04-08 Thread Peter Maydell

We were previously implementing INV (like INVALL) to just blow away
cached highest-priority-pending-LPI information on all connected
redistributors.  For GICv4.0, this isn't going to be sufficient,
because the LPI we are invalidating cached information for might be
either physical or virtual, and the required action is different for
those two cases.  So we need to do the full process of looking up the
ITE from the devid and eventid.  This also means we can do the error
checks that the spec lists for this command.

Split out INV handling into a process_inv() function like our other
command-processing functions.  For the moment, stick to handling only
physical LPIs; we will add the vLPI parts later.

Signed-off-by: Peter Maydell 
---
We could also improve INVALL to only prod the one redistributor
specified by the ICID in the command packet, but since INVALL
is only for physical LPIs I am leaving it as it is.
---
 hw/intc/gicv3_internal.h   | 12 +
 hw/intc/arm_gicv3_its.c| 50 +-
 hw/intc/arm_gicv3_redist.c | 11 +
 hw/intc/trace-events   |  3 ++-
 4 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index ef1d75b3cf4..25ea19de385 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -373,6 +373,10 @@ FIELD(MOVI_0, DEVICEID, 32, 32)
 FIELD(MOVI_1, EVENTID, 0, 32)
 FIELD(MOVI_2, ICID, 0, 16)
 
+/* INV command fields */
+FIELD(INV_0, DEVICEID, 32, 32)
+FIELD(INV_1, EVENTID, 0, 32)
+
 /* VMAPI, VMAPTI command fields */
 FIELD(VMAPTI_0, DEVICEID, 32, 32)
 FIELD(VMAPTI_1, EVENTID, 0, 32)
@@ -573,6 +577,14 @@ void gicv3_redist_update_lpi(GICv3CPUState *cs);
  * an incoming migration has loaded new state.
  */
 void gicv3_redist_update_lpi_only(GICv3CPUState *cs);
+/**
+ * gicv3_redist_inv_lpi:
+ * @cs: GICv3CPUState
+ * @irq: LPI to invalidate cached information for
+ *
+ * Forget or update any cached information associated with this LPI.
+ */
+void gicv3_redist_inv_lpi(GICv3CPUState *cs, int irq);
 /**
  * gicv3_redist_mov_lpi:
  * @src: source redistributor
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 05d64630450..6ba554c16ea 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1084,6 +1084,50 @@ static ItsCmdResult process_vmovp(GICv3ITSState *s, 
const uint64_t *cmdpkt)
 return cbdata.result;
 }
 
+static ItsCmdResult process_inv(GICv3ITSState *s, const uint64_t *cmdpkt)
+{
+uint32_t devid, eventid;
+ITEntry ite;
+DTEntry dte;
+CTEntry cte;
+ItsCmdResult cmdres;
+
+devid = FIELD_EX64(cmdpkt[0], INV_0, DEVICEID);
+eventid = FIELD_EX64(cmdpkt[1], INV_1, EVENTID);
+
+trace_gicv3_its_cmd_inv(devid, eventid);
+
+cmdres = lookup_ite(s, __func__, devid, eventid, &ite, &dte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+
+switch (ite.inttype) {
+case ITE_INTTYPE_PHYSICAL:
+cmdres = lookup_cte(s, __func__, ite.icid, &cte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+gicv3_redist_inv_lpi(&s->gicv3->cpu[cte.rdbase], ite.intid);
+break;
+case ITE_INTTYPE_VIRTUAL:
+if (!its_feature_virtual(s)) {
+/* Can't happen unless guest is illegally writing to table memory 
*/
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid type %d in ITE (table corrupted?)\n",
+  __func__, ite.inttype);
+return CMD_CONTINUE;
+}
+/* We will implement the vLPI invalidation in a later commit */
+g_assert_not_reached();
+break;
+default:
+g_assert_not_reached();
+}
+
+return CMD_CONTINUE_OK;
+}
+
 /*
  * Current implementation blocks until all
  * commands are processed
@@ -1192,14 +1236,18 @@ static void process_cmdq(GICv3ITSState *s)
 result = process_its_cmd(s, cmdpkt, DISCARD);
 break;
 case GITS_CMD_INV:
+result = process_inv(s, cmdpkt);
+break;
 case GITS_CMD_INVALL:
 /*
  * Current implementation doesn't cache any ITS tables,
  * but the calculated lpi priority information. We only
  * need to trigger lpi priority re-calculation to be in
  * sync with LPI config table or pending table changes.
+ * INVALL operates on a collection specified by ICID so
+ * it only affects physical LPIs.
  */
-trace_gicv3_its_cmd_inv();
+trace_gicv3_its_cmd_invall();
 for (i = 0; i < s->gicv3->num_cpu; i++) {
 gicv3_redist_update_lpi(&s->gicv3->cpu[i]);
 }
diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index b08b599c887..78650a3bb4c 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -681,6 +681,17 @@ void gicv3_redist_process_lpi(GICv3CPUState *cs,

[PATCH 12/41] hw/intc/arm_gicv3_its: Factor out CTE lookup sequence

2022-04-08 Thread Peter Maydell

Factor out the sequence of looking up a CTE from an ICID including
the validity and error checks.

Signed-off-by: Peter Maydell 
---
I think process_movi() in particular is now a lot cleaner
to read with all the error-checking factored out.
---
 hw/intc/arm_gicv3_its.c | 109 ++--
 1 file changed, 39 insertions(+), 70 deletions(-)

diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index fe1bea2dd81..2cbac76256d 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -368,6 +368,36 @@ static ItsCmdResult lookup_ite(GICv3ITSState *s, const 
char *who,
 return CMD_CONTINUE_OK;
 }
 
+/*
+ * Given an ICID, look up the corresponding CTE, including checking for various
+ * invalid-value cases. If we find a valid CTE, fill in @cte and return
+ * CMD_CONTINUE_OK; otherwise return CMD_STALL or CMD_CONTINUE (and the
+ * contents of @cte should not be relied on).
+ *
+ * The string @who is purely for the LOG_GUEST_ERROR messages,
+ * and should indicate the name of the calling function or similar.
+ */
+static ItsCmdResult lookup_cte(GICv3ITSState *s, const char *who,
+   uint32_t icid, CTEntry *cte)
+{
+if (icid >= s->ct.num_entries) {
+qemu_log_mask(LOG_GUEST_ERROR, "%s: invalid ICID 0x%x\n", who, icid);
+return CMD_CONTINUE;
+}
+if (get_cte(s, icid, cte) != MEMTX_OK) {
+return CMD_STALL;
+}
+if (!cte->valid) {
+qemu_log_mask(LOG_GUEST_ERROR, "%s: invalid CTE\n", who);
+return CMD_CONTINUE;
+}
+if (cte->rdbase >= s->gicv3->num_cpu) {
+return CMD_CONTINUE;
+}
+return CMD_CONTINUE_OK;
+}
+
+
 /*
  * This function handles the processing of following commands based on
  * the ItsCmdType parameter passed:-
@@ -396,29 +426,9 @@ static ItsCmdResult do_process_its_cmd(GICv3ITSState *s, 
uint32_t devid,
 return CMD_CONTINUE;
 }
 
-if (ite.icid >= s->ct.num_entries) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid ICID 0x%x in ITE (table corrupted?)\n",
-  __func__, ite.icid);
-return CMD_CONTINUE;
-}
-
-if (get_cte(s, ite.icid, &cte) != MEMTX_OK) {
-return CMD_STALL;
-}
-if (!cte.valid) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: invalid CTE\n",
-  __func__);
-return CMD_CONTINUE;
-}
-
-/*
- * Current implementation only supports rdbase == procnum
- * Hence rdbase physical address is ignored
- */
-if (cte.rdbase >= s->gicv3->num_cpu) {
-return CMD_CONTINUE;
+cmdres = lookup_cte(s, __func__, ite.icid, &cte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
 }
 
 if ((cmd == CLEAR) || (cmd == DISCARD)) {
@@ -792,54 +802,13 @@ static ItsCmdResult process_movi(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 return CMD_CONTINUE;
 }
 
-if (old_ite.icid >= s->ct.num_entries) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid ICID 0x%x in ITE (table corrupted?)\n",
-  __func__, old_ite.icid);
-return CMD_CONTINUE;
+cmdres = lookup_cte(s, __func__, old_ite.icid, &old_cte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
 }
-
-if (new_icid >= s->ct.num_entries) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: ICID 0x%x\n",
-  __func__, new_icid);
-return CMD_CONTINUE;
-}
-
-if (get_cte(s, old_ite.icid, &old_cte) != MEMTX_OK) {
-return CMD_STALL;
-}
-if (!old_cte.valid) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: "
-  "invalid CTE for old ICID 0x%x\n",
-  __func__, old_ite.icid);
-return CMD_CONTINUE;
-}
-
-if (get_cte(s, new_icid, &new_cte) != MEMTX_OK) {
-return CMD_STALL;
-}
-if (!new_cte.valid) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: invalid command attributes: "
-  "invalid CTE for new ICID 0x%x\n",
-  __func__, new_icid);
-return CMD_CONTINUE;
-}
-
-if (old_cte.rdbase >= s->gicv3->num_cpu) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: CTE has invalid rdbase 0x%x\n",
-  __func__, old_cte.rdbase);
-return CMD_CONTINUE;
-}
-
-if (new_cte.rdbase >= s->gicv3->num_cpu) {
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: CTE has invalid rdbase 0x%x\n",
-  __func__, new_cte.rdbase);
-return CMD_CONTINUE;
+cmdres = lookup_cte(s, __func__, new_icid, &new_cte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
 }
 
 if (old_cte.rdbase != new_cte.rdbase) {
-- 
2.25.1

[PATCH 23/41] hw/intc/arm_gicv3: Implement new GICv4 redistributor registers

2022-04-08 Thread Peter Maydell

Implement the new GICv4 redistributor registers: GICR_VPROPBASER
and GICR_VPENDBASER; for the moment we implement these as simple
reads-as-written stubs, together with the necessary migration
and reset handling.

We don't put ID-register checks on the handling of these registers,
because they are all in the only-in-v4 extra register frames, so
they're not accessible in a GICv3.

Signed-off-by: Peter Maydell 
---
GICv4.1 adds two further registers in the new VLPI frame,
as well as changing the layout of VPROPBASER and VPENDBASER,
but we aren't implementing v4.1 yet, just v4.
---
 hw/intc/gicv3_internal.h   | 21 +++
 include/hw/intc/arm_gicv3_common.h |  3 ++
 hw/intc/arm_gicv3_common.c | 22 
 hw/intc/arm_gicv3_redist.c | 56 ++
 4 files changed, 102 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 9720ccf7507..795bf57d2b3 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -77,6 +77,7 @@
  * Redistributor frame offsets from RD_base
  */
 #define GICR_SGI_OFFSET 0x1
+#define GICR_VLPI_OFFSET 0x2
 
 /*
  * Redistributor registers, offsets from RD_base
@@ -109,6 +110,10 @@
 #define GICR_IGRPMODR0(GICR_SGI_OFFSET + 0x0D00)
 #define GICR_NSACR(GICR_SGI_OFFSET + 0x0E00)
 
+/* VLPI redistributor registers, offsets from VLPI_base */
+#define GICR_VPROPBASER   (GICR_VLPI_OFFSET + 0x70)
+#define GICR_VPENDBASER   (GICR_VLPI_OFFSET + 0x78)
+
 #define GICR_CTLR_ENABLE_LPIS(1U << 0)
 #define GICR_CTLR_CES(1U << 1)
 #define GICR_CTLR_RWP(1U << 3)
@@ -143,6 +148,22 @@ FIELD(GICR_PENDBASER, PTZ, 62, 1)
 
 #define GICR_PROPBASER_IDBITS_THRESHOLD  0xd
 
+/* These are the GICv4 VPROPBASER and VPENDBASER layouts; v4.1 is different */
+FIELD(GICR_VPROPBASER, IDBITS, 0, 5)
+FIELD(GICR_VPROPBASER, INNERCACHE, 7, 3)
+FIELD(GICR_VPROPBASER, SHAREABILITY, 10, 2)
+FIELD(GICR_VPROPBASER, PHYADDR, 12, 40)
+FIELD(GICR_VPROPBASER, OUTERCACHE, 56, 3)
+
+FIELD(GICR_VPENDBASER, INNERCACHE, 7, 3)
+FIELD(GICR_VPENDBASER, SHAREABILITY, 10, 2)
+FIELD(GICR_VPENDBASER, PHYADDR, 16, 36)
+FIELD(GICR_VPENDBASER, OUTERCACHE, 56, 3)
+FIELD(GICR_VPENDBASER, DIRTY, 60, 1)
+FIELD(GICR_VPENDBASER, PENDINGLAST, 61, 1)
+FIELD(GICR_VPENDBASER, IDAI, 62, 1)
+FIELD(GICR_VPENDBASER, VALID, 63, 1)
+
 #define ICC_CTLR_EL1_CBPR   (1U << 0)
 #define ICC_CTLR_EL1_EOIMODE(1U << 1)
 #define ICC_CTLR_EL1_PMHE   (1U << 6)
diff --git a/include/hw/intc/arm_gicv3_common.h 
b/include/hw/intc/arm_gicv3_common.h
index 40bc404a652..7ff5a1aa5fc 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -179,6 +179,9 @@ struct GICv3CPUState {
 uint32_t gicr_igrpmodr0;
 uint32_t gicr_nsacr;
 uint8_t gicr_ipriorityr[GIC_INTERNAL];
+/* VLPI_base page registers */
+uint64_t gicr_vpropbaser;
+uint64_t gicr_vpendbaser;
 
 /* CPU interface */
 uint64_t icc_sre_el1;
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index 18999e3c8bb..14d76d74840 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -144,6 +144,25 @@ const VMStateDescription vmstate_gicv3_cpu_sre_el1 = {
 }
 };
 
+static bool gicv4_needed(void *opaque)
+{
+GICv3CPUState *cs = opaque;
+
+return cs->gic->revision > 3;
+}
+
+const VMStateDescription vmstate_gicv3_gicv4 = {
+.name = "arm_gicv3_cpu/gicv4",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = gicv4_needed,
+.fields = (VMStateField[]) {
+VMSTATE_UINT64(gicr_vpropbaser, GICv3CPUState),
+VMSTATE_UINT64(gicr_vpendbaser, GICv3CPUState),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_gicv3_cpu = {
 .name = "arm_gicv3_cpu",
 .version_id = 1,
@@ -175,6 +194,7 @@ static const VMStateDescription vmstate_gicv3_cpu = {
 .subsections = (const VMStateDescription * []) {
 &vmstate_gicv3_cpu_virt,
 &vmstate_gicv3_cpu_sre_el1,
+&vmstate_gicv3_gicv4,
 NULL
 }
 };
@@ -444,6 +464,8 @@ static void arm_gicv3_common_reset(DeviceState *dev)
 cs->gicr_waker = GICR_WAKER_ProcessorSleep | GICR_WAKER_ChildrenAsleep;
 cs->gicr_propbaser = 0;
 cs->gicr_pendbaser = 0;
+cs->gicr_vpropbaser = 0;
+cs->gicr_vpendbaser = 0;
 /* If we're resetting a TZ-aware GIC as if secure firmware
  * had set it up ready to start a kernel in non-secure, we
  * need to set interrupts to group 1 so the kernel can use them.
diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 9f1fe09a78e..c310d7f8ff2 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -236,6 +236,23 @@ static MemTxResult gicr_readl(GICv3CPUState *cs, hwaddr 
offset,
 case GICR_IDREGS ... GICR_IDREGS + 0x2f:
 *data = gicv3_idreg(offset - GICR_IDREGS, GICV3_PI

[PATCH 17/41] hw/intc/arm_gicv3_its: Implement VSYNC

2022-04-08 Thread Peter Maydell

The VSYNC command forces the ITS to synchronize all outstanding ITS
operations for the specified vPEID, so that subsequent writse to
GITS_TRANSLATER honour them.  The QEMU implementation is always in
sync, so for us this is a nop, like the existing SYNC command.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h |  1 +
 hw/intc/arm_gicv3_its.c  | 11 +++
 hw/intc/trace-events |  1 +
 3 files changed, 13 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index c1467ce7263..ef1d75b3cf4 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -330,6 +330,7 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_CMD_MOVALL   0x0E
 #define GITS_CMD_DISCARD  0x0F
 #define GITS_CMD_VMOVP0x22
+#define GITS_CMD_VSYNC0x25
 #define GITS_CMD_VMAPP0x29
 #define GITS_CMD_VMAPTI   0x2A
 #define GITS_CMD_VMAPI0x2B
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index bd82c84b46d..05d64630450 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1165,6 +1165,17 @@ static void process_cmdq(GICv3ITSState *s)
  */
 trace_gicv3_its_cmd_sync();
 break;
+case GITS_CMD_VSYNC:
+/*
+ * VSYNC also is a nop, because our implementation is always
+ * in sync.
+ */
+if (!its_feature_virtual(s)) {
+result = CMD_CONTINUE;
+break;
+}
+trace_gicv3_its_cmd_vsync();
+break;
 case GITS_CMD_MAPD:
 result = process_mapd(s, cmdpkt);
 break;
diff --git a/hw/intc/trace-events b/hw/intc/trace-events
index a2dd1bdb6c3..b9efe14c690 100644
--- a/hw/intc/trace-events
+++ b/hw/intc/trace-events
@@ -191,6 +191,7 @@ gicv3_its_cmd_vmapi(uint32_t devid, uint32_t eventid, 
uint32_t vpeid, uint32_t d
 gicv3_its_cmd_vmapti(uint32_t devid, uint32_t eventid, uint32_t vpeid, 
uint32_t vintid, uint32_t doorbell) "GICv3 ITS: command VMAPI DeviceID 0x%x 
EventID 0x%x vPEID 0x%x vINTID 0x%x Dbell_pINTID 0x%x"
 gicv3_its_cmd_vmapp(uint32_t vpeid, uint64_t rdbase, int valid, uint64_t 
vptaddr, uint32_t vptsize) "GICv3 ITS: command VMAPP vPEID 0x%x RDbase 0x%" 
PRIx64 " V %d VPT_addr 0x%" PRIx64 " VPT_size 0x%x"
 gicv3_its_cmd_vmovp(uint32_t vpeid, uint64_t rdbase) "GICv3 ITS: command VMOVP 
vPEID 0x%x RDbase 0x%" PRIx64
+gicv3_its_cmd_vsync(void) "GICv3 ITS: command VSYNC"
 gicv3_its_cmd_unknown(unsigned cmd) "GICv3 ITS: unknown command 0x%x"
 gicv3_its_cte_read(uint32_t icid, int valid, uint32_t rdbase) "GICv3 ITS: 
Collection Table read for ICID 0x%x: valid %d RDBase 0x%x"
 gicv3_its_cte_write(uint32_t icid, int valid, uint32_t rdbase) "GICv3 ITS: 
Collection Table write for ICID 0x%x: valid %d RDBase 0x%x"
-- 
2.25.1

[PATCH 21/41] hw/intc/arm_gicv3_its: Implement VINVALL

2022-04-08 Thread Peter Maydell

The VINVALL command should cause any cached information in the
ITS or redistributor for the specified vCPU to be dropped or
otherwise made consistent with the in-memory LPI configuration
tables.

Here we implement the command and table parsing, leaving the
redistributor part as a stub for the moment, as usual.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 13 +
 hw/intc/arm_gicv3_its.c| 26 ++
 hw/intc/arm_gicv3_redist.c |  5 +
 hw/intc/trace-events   |  1 +
 4 files changed, 45 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 050e19d133b..8d58d38836f 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -335,6 +335,7 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_CMD_VMAPP0x29
 #define GITS_CMD_VMAPTI   0x2A
 #define GITS_CMD_VMAPI0x2B
+#define GITS_CMD_VINVALL  0x2D
 
 /* MAPC command fields */
 #define ICID_LENGTH  16
@@ -411,6 +412,9 @@ FIELD(VMOVI_1, VPEID, 32, 16)
 FIELD(VMOVI_2, D, 0, 1)
 FIELD(VMOVI_2, DOORBELL, 32, 32)
 
+/* VINVALL command fields */
+FIELD(VINVALL_1, VPEID, 32, 16)
+
 /*
  * 12 bytes Interrupt translation Table Entry size
  * as per Table 5.3 in GICv3 spec
@@ -637,6 +641,15 @@ void gicv3_redist_movall_lpis(GICv3CPUState *src, 
GICv3CPUState *dest);
 void gicv3_redist_mov_vlpi(GICv3CPUState *src, uint64_t src_vptaddr,
GICv3CPUState *dest, uint64_t dest_vptaddr,
int irq, int doorbell);
+/**
+ * gicv3_redist_vinvall:
+ * @cs: GICv3CPUState
+ * @vptaddr: address of VLPI pending table
+ *
+ * On redistributor @cs, invalidate all cached information associated
+ * with the vCPU defined by @vptaddr.
+ */
+void gicv3_redist_vinvall(GICv3CPUState *cs, uint64_t vptaddr);
 
 void gicv3_redist_send_sgi(GICv3CPUState *cs, int grp, int irq, bool ns);
 void gicv3_init_cpuif(GICv3State *s);
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index aef024009b2..6c44cccd369 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1163,6 +1163,29 @@ static ItsCmdResult process_vmovi(GICv3ITSState *s, 
const uint64_t *cmdpkt)
 return update_ite(s, eventid, &dte, &ite) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
+static ItsCmdResult process_vinvall(GICv3ITSState *s, const uint64_t *cmdpkt)
+{
+VTEntry vte;
+uint32_t vpeid;
+ItsCmdResult cmdres;
+
+if (!its_feature_virtual(s)) {
+return CMD_CONTINUE;
+}
+
+vpeid = FIELD_EX64(cmdpkt[1], VINVALL_1, VPEID);
+
+trace_gicv3_its_cmd_vinvall(vpeid);
+
+cmdres = lookup_vte(s, __func__, vpeid, &vte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+
+gicv3_redist_vinvall(&s->gicv3->cpu[vte.rdbase], vte.vptaddr << 16);
+return CMD_CONTINUE_OK;
+}
+
 static ItsCmdResult process_inv(GICv3ITSState *s, const uint64_t *cmdpkt)
 {
 uint32_t devid, eventid;
@@ -1364,6 +1387,9 @@ static void process_cmdq(GICv3ITSState *s)
 case GITS_CMD_VMOVI:
 result = process_vmovi(s, cmdpkt);
 break;
+case GITS_CMD_VINVALL:
+result = process_vinvall(s, cmdpkt);
+break;
 default:
 trace_gicv3_its_cmd_unknown(cmd);
 break;
diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index dc25997d1f9..7c75dd6f072 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -818,6 +818,11 @@ void gicv3_redist_mov_vlpi(GICv3CPUState *src, uint64_t 
src_vptaddr,
  */
 }
 
+void gicv3_redist_vinvall(GICv3CPUState *cs, uint64_t vptaddr)
+{
+/* The redistributor handling will be added in a subsequent commit */
+}
+
 void gicv3_redist_inv_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr)
 {
 /*
diff --git a/hw/intc/trace-events b/hw/intc/trace-events
index 9894756e55a..004a1006fb8 100644
--- a/hw/intc/trace-events
+++ b/hw/intc/trace-events
@@ -194,6 +194,7 @@ gicv3_its_cmd_vmapp(uint32_t vpeid, uint64_t rdbase, int 
valid, uint64_t vptaddr
 gicv3_its_cmd_vmovp(uint32_t vpeid, uint64_t rdbase) "GICv3 ITS: command VMOVP 
vPEID 0x%x RDbase 0x%" PRIx64
 gicv3_its_cmd_vsync(void) "GICv3 ITS: command VSYNC"
 gicv3_its_cmd_vmovi(uint32_t devid,  uint32_t eventid, uint32_t vpeid, int 
dbvalid, uint32_t doorbell) "GICv3 ITS: command VMOVI DeviceID 0x%x EventID 
0x%x vPEID 0x%x D %d Dbell_pINTID 0x%x"
+gicv3_its_cmd_vinvall(uint32_t vpeid) "GICv3 ITS: command VINVALL vPEID 0x%x"
 gicv3_its_cmd_unknown(unsigned cmd) "GICv3 ITS: unknown command 0x%x"
 gicv3_its_cte_read(uint32_t icid, int valid, uint32_t rdbase) "GICv3 ITS: 
Collection Table read for ICID 0x%x: valid %d RDBase 0x%x"
 gicv3_its_cte_write(uint32_t icid, int valid, uint32_t rdbase) "GICv3 ITS: 
Collection Table write for ICID 0x%x: valid %d RDBase 0x%x"
-- 
2.25.1

[PATCH 15/41] hw/intc/arm_gicv3: Keep pointers to every connected ITS

2022-04-08 Thread Peter Maydell

The GICv4 ITS VMOVP command's semantics require it to perform the
operation on every ITS connected to the same GIC that the ITS that
received the command is attached to.  This means that the GIC object
needs to keep a pointer to every ITS that is connected to it
(previously it was sufficient for the ITS to have a pointer to its
GIC).

Add a glib ptrarray to the GICv3 object which holds pointers to every
connected ITS, and make the ITS add itself to the array for the GIC
it is connected to when it is realized.

Note that currently all QEMU machine types with an ITS have exactly
one ITS in the system, so typically the length of this ptrarray will
be 1.  Multiple ITSes are typically used to improve performance on
real hardware, so we wouldn't need to have more than one unless we
were modelling a real machine type that had multile ITSes.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 9 +
 include/hw/intc/arm_gicv3_common.h | 2 ++
 hw/intc/arm_gicv3_common.c | 2 ++
 hw/intc/arm_gicv3_its.c| 2 ++
 hw/intc/arm_gicv3_its_kvm.c| 2 ++
 5 files changed, 17 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 6e22c8072e9..69a59daf867 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -709,4 +709,13 @@ static inline void 
gicv3_cache_all_target_cpustates(GICv3State *s)
 
 void gicv3_set_gicv3state(CPUState *cpu, GICv3CPUState *s);
 
+/*
+ * The ITS should call this when it is realized to add itself
+ * to its GIC's list of connected ITSes.
+ */
+static inline void gicv3_add_its(GICv3State *s, DeviceState *its)
+{
+g_ptr_array_add(s->itslist, its);
+}
+
 #endif /* QEMU_ARM_GICV3_INTERNAL_H */
diff --git a/include/hw/intc/arm_gicv3_common.h 
b/include/hw/intc/arm_gicv3_common.h
index fc38e4b7dca..08b27789385 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -272,6 +272,8 @@ struct GICv3State {
 uint32_t gicd_nsacr[DIV_ROUND_UP(GICV3_MAXIRQ, 16)];
 
 GICv3CPUState *cpu;
+/* List of all ITSes connected to this GIC */
+GPtrArray *itslist;
 };
 
 #define GICV3_BITMAP_ACCESSORS(BMP) \
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index c797c82786b..dcc5ce28c6a 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -414,6 +414,8 @@ static void arm_gicv3_common_realize(DeviceState *dev, 
Error **errp)
 cpuidx += s->redist_region_count[i];
 s->cpu[cpuidx - 1].gicr_typer |= GICR_TYPER_LAST;
 }
+
+s->itslist = g_ptr_array_new();
 }
 
 static void arm_gicv3_finalize(Object *obj)
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 21bc1a6c18b..6ff3c3b0348 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1680,6 +1680,8 @@ static void gicv3_arm_its_realize(DeviceState *dev, Error 
**errp)
 }
 }
 
+gicv3_add_its(s->gicv3, dev);
+
 gicv3_its_init_mmio(s, &gicv3_its_control_ops, &gicv3_its_translation_ops);
 
 /* set the ITS default features supported */
diff --git a/hw/intc/arm_gicv3_its_kvm.c b/hw/intc/arm_gicv3_its_kvm.c
index 0b4cbed28b3..529c7bd4946 100644
--- a/hw/intc/arm_gicv3_its_kvm.c
+++ b/hw/intc/arm_gicv3_its_kvm.c
@@ -106,6 +106,8 @@ static void kvm_arm_its_realize(DeviceState *dev, Error 
**errp)
 kvm_arm_register_device(&s->iomem_its_cntrl, -1, KVM_DEV_ARM_VGIC_GRP_ADDR,
 KVM_VGIC_ITS_ADDR_TYPE, s->dev_fd, 0);
 
+gicv3_add_its(s->gicv3, dev);
+
 gicv3_its_init_mmio(s, NULL, NULL);
 
 if (!kvm_device_check_attr(s->dev_fd, KVM_DEV_ARM_VGIC_GRP_ITS_REGS,
-- 
2.25.1

[PATCH 25/41] hw/intc/arm_gicv3_cpuif: Support vLPIs

2022-04-08 Thread Peter Maydell

The CPU interface changes to support vLPIs are fairly minor:
in the parts of the code that currently look at the list registers
to determine the highest priority pending virtual interrupt, we
must also look at the highest priority pending vLPI. To do this
we change hppvi_index() to check the vLPI and return a special-case
value if that is the right virtual interrupt to take. The callsites
(which handle HPPIR and IAR registers and the "raise vIRQ and vFIQ
lines" code) then have to handle this special-case value.

This commit includes two interfaces with the as-yet-unwritten
redistributor code:
 * the new GICv3CPUState::hppvlpi will be set by the redistributor
   (in the same way as the existing hpplpi does for physical LPIs)
 * when the CPU interface acknowledges a vLPI it needs to set it
   to non-pending; the new gicv3_redist_vlpi_pending() function
   (which matches the existing gicv3_redist_lpi_pending() used
   for physical LPIs) is a stub that will be filled in later

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   |  13 
 include/hw/intc/arm_gicv3_common.h |   3 +
 hw/intc/arm_gicv3_common.c |   1 +
 hw/intc/arm_gicv3_cpuif.c  | 119 +++--
 hw/intc/arm_gicv3_redist.c |   8 ++
 hw/intc/trace-events   |   2 +-
 6 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index f25ddeca579..07644b2be6f 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -612,6 +612,19 @@ void gicv3_redist_process_lpi(GICv3CPUState *cs, int irq, 
int level);
  */
 void gicv3_redist_process_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr,
int doorbell, int level);
+/**
+ * gicv3_redist_vlpi_pending:
+ * @cs: GICv3CPUState
+ * @irq: (virtual) interrupt number
+ * @level: level to set @irq to
+ *
+ * Set/clear the pending status of a virtual LPI in the vLPI table
+ * that this redistributor is currently using. (The difference between
+ * this and gicv3_redist_process_vlpi() is that this is called from
+ * the cpuif and does not need to do the not-running-on-this-vcpu checks.)
+ */
+void gicv3_redist_vlpi_pending(GICv3CPUState *cs, int irq, int level);
+
 void gicv3_redist_lpi_pending(GICv3CPUState *cs, int irq, int level);
 /**
  * gicv3_redist_update_lpi:
diff --git a/include/hw/intc/arm_gicv3_common.h 
b/include/hw/intc/arm_gicv3_common.h
index 7ff5a1aa5fc..4e416100559 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -219,6 +219,9 @@ struct GICv3CPUState {
  */
 PendingIrq hpplpi;
 
+/* Cached information recalculated from vLPI tables in guest memory */
+PendingIrq hppvlpi;
+
 /* This is temporary working state, to avoid a malloc in gicv3_update() */
 bool seenbetter;
 };
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index 14d76d74840..3f47b3501fe 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -487,6 +487,7 @@ static void arm_gicv3_common_reset(DeviceState *dev)
 
 cs->hppi.prio = 0xff;
 cs->hpplpi.prio = 0xff;
+cs->hppvlpi.prio = 0xff;
 
 /* State in the CPU interface must *not* be reset here, because it
  * is part of the CPU's reset domain, not the GIC device's.
diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index 5fb64d4663c..f11863ff613 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -21,6 +21,12 @@
 #include "hw/irq.h"
 #include "cpu.h"
 
+/*
+ * Special case return value from hppvi_index(); must be larger than
+ * the architecturally maximum possible list register index (which is 15)
+ */
+#define HPPVI_INDEX_VLPI 16
+
 static GICv3CPUState *icc_cs_from_env(CPUARMState *env)
 {
 return env->gicv3state;
@@ -157,10 +163,18 @@ static int ich_highest_active_virt_prio(GICv3CPUState *cs)
 
 static int hppvi_index(GICv3CPUState *cs)
 {
-/* Return the list register index of the highest priority pending
+/*
+ * Return the list register index of the highest priority pending
  * virtual interrupt, as per the HighestPriorityVirtualInterrupt
  * pseudocode. If no pending virtual interrupts, return -1.
+ * If the highest priority pending virtual interrupt is a vLPI,
+ * return HPPVI_INDEX_VLPI.
+ * (The pseudocode handles checking whether the vLPI is higher
+ * priority than the highest priority list register at every
+ * callsite of HighestPriorityVirtualInterrupt; we check it here.)
  */
+ARMCPU *cpu = ARM_CPU(cs->cpu);
+CPUARMState *env = &cpu->env;
 int idx = -1;
 int i;
 /* Note that a list register entry with a priority of 0xff will
@@ -202,6 +216,23 @@ static int hppvi_index(GICv3CPUState *cs)
 }
 }
 
+/*
+ * "no pending vLPI" is indicated with prio = 0xff, which always
+ * fails the priority check here. vLPIs are only considered
+ * w

[PATCH 16/41] hw/intc/arm_gicv3_its: Implement VMOVP

2022-04-08 Thread Peter Maydell

Implement the GICv4 VMOVP command, which updates an entry in the vPE
table to change its rdbase field. This command is unique in the ITS
command set because its effects must be propagated to all the other
ITSes connected to the same GIC as the ITS which executes the VMOVP
command.

The GICv4 spec allows two implementation choices for handling the
propagation to other ITSes:
 * If GITS_TYPER.VMOVP is 1, the guest only needs to issue the command
   on one ITS, and the implementation handles the propagation to
   all ITSes
 * If GITS_TYPER.VMOVP is 0, the guest must issue the command on
   every ITS, and arrange for the ITSes to synchronize the updates
   with each other by setting ITSList and Sequence Number fields
   in the command packets

We choose the GITS_TYPER.VMOVP = 1 approach, and synchronously
execute the update on every ITS.

For GICv4.1 this command has extra fields in the command packet and
additional behaviour.  We define the 4.1-only fields with the FIELD
macro, but only implement the GICv4.0 version of the command.

Note that we don't update the reported GITS_TYPER value here;
we'll do that later in a commit which updates all the reported
feature bit and ID register values for GICv4.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h | 18 ++
 hw/intc/arm_gicv3_its.c  | 75 
 hw/intc/trace-events |  1 +
 3 files changed, 94 insertions(+)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 69a59daf867..c1467ce7263 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -329,6 +329,7 @@ FIELD(GITS_TYPER, CIL, 36, 1)
 #define GITS_CMD_INVALL   0x0D
 #define GITS_CMD_MOVALL   0x0E
 #define GITS_CMD_DISCARD  0x0F
+#define GITS_CMD_VMOVP0x22
 #define GITS_CMD_VMAPP0x29
 #define GITS_CMD_VMAPTI   0x2A
 #define GITS_CMD_VMAPI0x2B
@@ -389,6 +390,14 @@ FIELD(VMAPP_2, V, 63, 1)
 FIELD(VMAPP_3, VPTSIZE, 0, 8) /* For GICv4.0, bits [7:6] are RES0 */
 FIELD(VMAPP_3, VPTADDR, 16, 36)
 
+/* VMOVP command fields */
+FIELD(VMOVP_0, SEQNUM, 32, 16) /* not used for GITS_TYPER.VMOVP == 1 */
+FIELD(VMOVP_1, ITSLIST, 0, 16) /* not used for GITS_TYPER.VMOVP == 1 */
+FIELD(VMOVP_1, VPEID, 32, 16)
+FIELD(VMOVP_2, RDBASE, 16, 36)
+FIELD(VMOVP_2, DB, 63, 1) /* GICv4.1 only */
+FIELD(VMOVP_3, DEFAULT_DOORBELL, 0, 32) /* GICv4.1 only */
+
 /*
  * 12 bytes Interrupt translation Table Entry size
  * as per Table 5.3 in GICv3 spec
@@ -718,4 +727,13 @@ static inline void gicv3_add_its(GICv3State *s, 
DeviceState *its)
 g_ptr_array_add(s->itslist, its);
 }
 
+/*
+ * The ITS can use this for operations that must be performed on
+ * every ITS connected to the same GIC that it is
+ */
+static inline void gicv3_foreach_its(GICv3State *s, GFunc func, void *opaque)
+{
+g_ptr_array_foreach(s->itslist, func, opaque);
+}
+
 #endif /* QEMU_ARM_GICV3_INTERNAL_H */
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 6ff3c3b0348..bd82c84b46d 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1012,6 +1012,78 @@ static ItsCmdResult process_vmapp(GICv3ITSState *s, 
const uint64_t *cmdpkt)
 return update_vte(s, vpeid, &vte) ? CMD_CONTINUE_OK : CMD_STALL;
 }
 
+typedef struct VmovpCallbackData {
+uint64_t rdbase;
+uint32_t vpeid;
+/*
+ * Overall command result. If more than one callback finds an
+ * error, STALL beats CONTINUE.
+ */
+ItsCmdResult result;
+} VmovpCallbackData;
+
+static void vmovp_callback(gpointer data, gpointer opaque)
+{
+/*
+ * This function is called to update the VPEID field in a VPE
+ * table entry for this ITS. This might be because of a VMOVP
+ * command executed on any ITS that is connected to the same GIC
+ * as this ITS.  We need to read the VPE table entry for the VPEID
+ * and update its RDBASE field.
+ */
+GICv3ITSState *s = data;
+VmovpCallbackData *cbdata = opaque;
+VTEntry vte;
+ItsCmdResult cmdres;
+
+cmdres = lookup_vte(s, __func__, cbdata->vpeid, &vte);
+switch (cmdres) {
+case CMD_STALL:
+cbdata->result = CMD_STALL;
+return;
+case CMD_CONTINUE:
+if (cbdata->result != CMD_STALL) {
+cbdata->result = CMD_CONTINUE;
+}
+return;
+case CMD_CONTINUE_OK:
+break;
+}
+
+vte.rdbase = cbdata->rdbase;
+if (!update_vte(s, cbdata->vpeid, &vte)) {
+cbdata->result = CMD_STALL;
+}
+}
+
+static ItsCmdResult process_vmovp(GICv3ITSState *s, const uint64_t *cmdpkt)
+{
+VmovpCallbackData cbdata;
+
+if (!its_feature_virtual(s)) {
+return CMD_CONTINUE;
+}
+
+cbdata.vpeid = FIELD_EX64(cmdpkt[1], VMOVP_1, VPEID);
+cbdata.rdbase = FIELD_EX64(cmdpkt[2], VMOVP_2, RDBASE);
+
+trace_gicv3_its_cmd_vmovp(cbdata.vpeid, cbdata.rdbase);
+
+if (cbdata.rdbase >= s->gicv3->num_cpu) {
+return CMD_CONTINUE;
+}
+
+/*
+

[PATCH 26/41] hw/intc/arm_gicv3_cpuif: Don't recalculate maintenance irq unnecessarily

2022-04-08 Thread Peter Maydell

The maintenance interrupt state depends only on:
 * ICH_HCR_EL2
 * ICH_LR_EL2
 * ICH_VMCR_EL2 fields VENG0 and VENG1

Now we have a separate function that updates only the vIRQ and vFIQ
lines, use that in places that only change state that affects vIRQ
and vFIQ but not the maintenance interrupt.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_cpuif.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index f11863ff613..d627ddac90f 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -543,7 +543,7 @@ static void icv_ap_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 
 cs->ich_apr[grp][regno] = value & 0xU;
 
-gicv3_cpuif_virt_update(cs);
+gicv3_cpuif_virt_irq_fiq_update(cs);
 return;
 }
 
@@ -588,7 +588,7 @@ static void icv_bpr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 
 write_vbpr(cs, grp, value);
 
-gicv3_cpuif_virt_update(cs);
+gicv3_cpuif_virt_irq_fiq_update(cs);
 }
 
 static uint64_t icv_pmr_read(CPUARMState *env, const ARMCPRegInfo *ri)
@@ -615,7 +615,7 @@ static void icv_pmr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 cs->ich_vmcr_el2 = deposit64(cs->ich_vmcr_el2, ICH_VMCR_EL2_VPMR_SHIFT,
  ICH_VMCR_EL2_VPMR_LENGTH, value);
 
-gicv3_cpuif_virt_update(cs);
+gicv3_cpuif_virt_irq_fiq_update(cs);
 }
 
 static uint64_t icv_igrpen_read(CPUARMState *env, const ARMCPRegInfo *ri)
@@ -682,7 +682,7 @@ static void icv_ctlr_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 cs->ich_vmcr_el2 = deposit64(cs->ich_vmcr_el2, ICH_VMCR_EL2_VEOIM_SHIFT,
  1, value & ICC_CTLR_EL1_EOIMODE ? 1 : 0);
 
-gicv3_cpuif_virt_update(cs);
+gicv3_cpuif_virt_irq_fiq_update(cs);
 }
 
 static uint64_t icv_rpr_read(CPUARMState *env, const ARMCPRegInfo *ri)
@@ -2452,7 +2452,7 @@ static void ich_ap_write(CPUARMState *env, const 
ARMCPRegInfo *ri,
 trace_gicv3_ich_ap_write(ri->crm & 1, regno, gicv3_redist_affid(cs), 
value);
 
 cs->ich_apr[grp][regno] = value & 0xU;
-gicv3_cpuif_virt_update(cs);
+gicv3_cpuif_virt_irq_fiq_update(cs);
 }
 
 static uint64_t ich_hcr_read(CPUARMState *env, const ARMCPRegInfo *ri)
-- 
2.25.1

[PATCH 14/41] hw/intc/arm_gicv3_its: Handle virtual interrupts in process_its_cmd()

2022-04-08 Thread Peter Maydell

For GICv4, interrupt table entries read by process_its_cmd() may
indicate virtual LPIs which are to be directly injected into a VM.
Implement the ITS side of the code for handling this.  This is
similar to the existing handling of physical LPIs, but instead of
looking up a collection ID in a collection table, we look up a vPEID
in a vPE table.  As with the physical LPIs, we leave the rest of the
work to code in the redistributor device.

The redistributor half will be implemented in a later commit;
for now we just provide a stub function which does nothing.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 17 +++
 hw/intc/arm_gicv3_its.c| 99 +-
 hw/intc/arm_gicv3_redist.c |  9 
 hw/intc/trace-events   |  2 +
 4 files changed, 125 insertions(+), 2 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index bbb8a20ce61..6e22c8072e9 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -527,6 +527,23 @@ MemTxResult gicv3_redist_write(void *opaque, hwaddr 
offset, uint64_t data,
 void gicv3_dist_set_irq(GICv3State *s, int irq, int level);
 void gicv3_redist_set_irq(GICv3CPUState *cs, int irq, int level);
 void gicv3_redist_process_lpi(GICv3CPUState *cs, int irq, int level);
+/**
+ * gicv3_redist_process_vlpi:
+ * @cs: GICv3CPUState
+ * @irq: (virtual) interrupt number
+ * @vptaddr: (guest) address of VLPI table
+ * @doorbell: doorbell (physical) interrupt number (1023 for "no doorbell")
+ * @level: level to set @irq to
+ *
+ * Process a virtual LPI being directly injected by the ITS. This function
+ * will update the VLPI table specified by @vptaddr and @vptsize. If the
+ * vCPU corresponding to that VLPI table is currently running on
+ * the CPU associated with this redistributor, directly inject the VLPI
+ * @irq. If the vCPU is not running on this CPU, raise the doorbell
+ * interrupt instead.
+ */
+void gicv3_redist_process_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr,
+   int doorbell, int level);
 void gicv3_redist_lpi_pending(GICv3CPUState *cs, int irq, int level);
 /**
  * gicv3_redist_update_lpi:
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 8ea1fc366d3..21bc1a6c18b 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -314,6 +314,42 @@ out:
 return res;
 }
 
+/*
+ * Read the vPE Table entry at index @vpeid. On success (including
+ * successfully determining that there is no valid entry for this index),
+ * we return MEMTX_OK and populate the VTEntry struct accordingly.
+ * If there is an error reading memory then we return the error code.
+ */
+static MemTxResult get_vte(GICv3ITSState *s, uint32_t vpeid, VTEntry *vte)
+{
+MemTxResult res = MEMTX_OK;
+AddressSpace *as = &s->gicv3->dma_as;
+uint64_t entry_addr = table_entry_addr(s, &s->vpet, vpeid, &res);
+uint64_t vteval;
+
+if (entry_addr == -1) {
+/* No L2 table entry, i.e. no valid VTE, or a memory error */
+vte->valid = false;
+goto out;
+}
+vteval = address_space_ldq_le(as, entry_addr, MEMTXATTRS_UNSPECIFIED, 
&res);
+if (res != MEMTX_OK) {
+goto out;
+}
+vte->valid = FIELD_EX64(vteval, VTE, VALID);
+vte->vptsize = FIELD_EX64(vteval, VTE, VPTSIZE);
+vte->vptaddr = FIELD_EX64(vteval, VTE, VPTADDR);
+vte->rdbase = FIELD_EX64(vteval, VTE, RDBASE);
+out:
+if (res != MEMTX_OK) {
+trace_gicv3_its_vte_read_fault(vpeid);
+} else {
+trace_gicv3_its_vte_read(vpeid, vte->valid, vte->vptsize,
+ vte->vptaddr, vte->rdbase);
+}
+return res;
+}
+
 /*
  * Given a (DeviceID, EventID), look up the corresponding ITE, including
  * checking for the various invalid-value cases. If we find a valid ITE,
@@ -397,6 +433,38 @@ static ItsCmdResult lookup_cte(GICv3ITSState *s, const 
char *who,
 return CMD_CONTINUE_OK;
 }
 
+/*
+ * Given a VPEID, look up the corresponding VTE, including checking
+ * for various invalid-value cases. if we find a valid VTE, fill in @vte
+ * and return CMD_CONTINUE_OK; otherwise return CMD_STALL or CMD_CONTINUE
+ * (and the contents of @vte should not be relied on).
+ *
+ * The string @who is purely for the LOG_GUEST_ERROR messages,
+ * and should indicate the name of the calling function or similar.
+ */
+static ItsCmdResult lookup_vte(GICv3ITSState *s, const char *who,
+   uint32_t vpeid, VTEntry *vte)
+{
+if (vpeid >= s->vpet.num_entries) {
+qemu_log_mask(LOG_GUEST_ERROR, "%s: invalid VPEID 0x%x\n", who, vpeid);
+return CMD_CONTINUE;
+}
+
+if (get_vte(s, vpeid, vte) != MEMTX_OK) {
+return CMD_STALL;
+}
+if (!vte->valid) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid VTE for VPEID 0x%x\n", who, vpeid);
+return CMD_CONTINUE;
+}
+
+if (vte->rdbase >= s->gicv3->num_cpu) {
+return CMD_CONTI

[PATCH 19/41] hw/intc/arm_gicv3_its: Implement INV for virtual interrupts

2022-04-08 Thread Peter Maydell

Implement the ITS side of the handling of the INV command for
virtual interrupts; as usual this calls into a redistributor
function which we leave as a stub to fill in later.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   |  9 +
 hw/intc/arm_gicv3_its.c| 16 ++--
 hw/intc/arm_gicv3_redist.c |  8 
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 25ea19de385..2f653a9b917 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -585,6 +585,15 @@ void gicv3_redist_update_lpi_only(GICv3CPUState *cs);
  * Forget or update any cached information associated with this LPI.
  */
 void gicv3_redist_inv_lpi(GICv3CPUState *cs, int irq);
+/**
+ * gicv3_redist_inv_vlpi:
+ * @cs: GICv3CPUState
+ * @irq: vLPI to invalidate cached information for
+ * @vptaddr: (guest) address of vLPI table
+ *
+ * Forget or update any cached information associated with this vLPI.
+ */
+void gicv3_redist_inv_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr);
 /**
  * gicv3_redist_mov_lpi:
  * @src: source redistributor
diff --git a/hw/intc/arm_gicv3_its.c b/hw/intc/arm_gicv3_its.c
index 6ba554c16ea..c8b90e6b0d9 100644
--- a/hw/intc/arm_gicv3_its.c
+++ b/hw/intc/arm_gicv3_its.c
@@ -1090,6 +1090,7 @@ static ItsCmdResult process_inv(GICv3ITSState *s, const 
uint64_t *cmdpkt)
 ITEntry ite;
 DTEntry dte;
 CTEntry cte;
+VTEntry vte;
 ItsCmdResult cmdres;
 
 devid = FIELD_EX64(cmdpkt[0], INV_0, DEVICEID);
@@ -1118,8 +1119,19 @@ static ItsCmdResult process_inv(GICv3ITSState *s, const 
uint64_t *cmdpkt)
   __func__, ite.inttype);
 return CMD_CONTINUE;
 }
-/* We will implement the vLPI invalidation in a later commit */
-g_assert_not_reached();
+
+cmdres = lookup_vte(s, __func__, ite.vpeid, &vte);
+if (cmdres != CMD_CONTINUE_OK) {
+return cmdres;
+}
+if (!intid_in_lpi_range(ite.intid) ||
+ite.intid >= (1ULL << (vte.vptsize + 1))) {
+qemu_log_mask(LOG_GUEST_ERROR, "%s: intid 0x%x out of range\n",
+  __func__, ite.intid);
+return CMD_CONTINUE;
+}
+gicv3_redist_inv_vlpi(&s->gicv3->cpu[vte.rdbase], ite.intid,
+  vte.vptaddr << 16);
 break;
 default:
 g_assert_not_reached();
diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 78650a3bb4c..856494b4e8f 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -808,6 +808,14 @@ void gicv3_redist_process_vlpi(GICv3CPUState *cs, int irq, 
uint64_t vptaddr,
  */
 }
 
+void gicv3_redist_inv_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr)
+{
+/*
+ * The redistributor handling for invalidating cached information
+ * about a VLPI will be added in a subsequent commit.
+ */
+}
+
 void gicv3_redist_set_irq(GICv3CPUState *cs, int irq, int level)
 {
 /* Update redistributor state for a change in an external PPI input line */
-- 
2.25.1

[PATCH 27/41] hw/intc/arm_gicv3_redist: Factor out "update hpplpi for one LPI" logic

2022-04-08 Thread Peter Maydell

Currently the functions which update the highest priority pending LPI
information by looking at the LPI Pending and Configuration tables
are hard-coded to use the physical LPI tables addressed by
GICR_PENDBASER and GICR_PROPBASER.  To support virtual LPIs we will
need to do essentially the same job, but looking at the current
virtual LPI Pending and Configuration tables and updating cs->hppvlpi
instead of cs->hpplpi.

Factor out the common part of the gicv3_redist_check_lpi_priority()
function into a new update_for_one_lpi() function, which updates
a PendingIrq struct if the specified LPI is higher priority than
what is currently recorded there.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 74 --
 1 file changed, 47 insertions(+), 27 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 3464972c139..571e0fa8309 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -60,6 +60,49 @@ static uint32_t gicr_read_bitmap_reg(GICv3CPUState *cs, 
MemTxAttrs attrs,
 return reg;
 }
 
+/**
+ * update_for_one_lpi: Update pending information if this LPI is better
+ *
+ * @cs: GICv3CPUState
+ * @irq: interrupt to look up in the LPI Configuration table
+ * @ctbase: physical address of the LPI Configuration table to use
+ * @ds: true if priority value should not be shifted
+ * @hpp: points to pending information to update
+ *
+ * Look up @irq in the Configuration table specified by @ctbase
+ * to see if it is enabled and what its priority is. If it is an
+ * enabled interrupt with a higher priority than that currently
+ * recorded in @hpp, update @hpp.
+ */
+static void update_for_one_lpi(GICv3CPUState *cs, int irq,
+   uint64_t ctbase, bool ds, PendingIrq *hpp)
+{
+uint8_t lpite;
+uint8_t prio;
+
+address_space_read(&cs->gic->dma_as,
+   ctbase + ((irq - GICV3_LPI_INTID_START) * 
sizeof(lpite)),
+   MEMTXATTRS_UNSPECIFIED, &lpite, sizeof(lpite));
+
+if (!(lpite & LPI_CTE_ENABLED)) {
+return;
+}
+
+if (ds) {
+prio = lpite & LPI_PRIORITY_MASK;
+} else {
+prio = ((lpite & LPI_PRIORITY_MASK) >> 1) | 0x80;
+}
+
+if ((prio < hpp->prio) ||
+((prio == hpp->prio) && (irq <= hpp->irq))) {
+hpp->irq = irq;
+hpp->prio = prio;
+/* LPIs and vLPIs are always non-secure Grp1 interrupts */
+hpp->grp = GICV3_G1NS;
+}
+}
+
 static uint8_t gicr_read_ipriorityr(GICv3CPUState *cs, MemTxAttrs attrs,
 int irq)
 {
@@ -598,34 +641,11 @@ MemTxResult gicv3_redist_write(void *opaque, hwaddr 
offset, uint64_t data,
 
 static void gicv3_redist_check_lpi_priority(GICv3CPUState *cs, int irq)
 {
-AddressSpace *as = &cs->gic->dma_as;
-uint64_t lpict_baddr;
-uint8_t lpite;
-uint8_t prio;
+uint64_t lpict_baddr = cs->gicr_propbaser & R_GICR_PROPBASER_PHYADDR_MASK;
 
-lpict_baddr = cs->gicr_propbaser & R_GICR_PROPBASER_PHYADDR_MASK;
-
-address_space_read(as, lpict_baddr + ((irq - GICV3_LPI_INTID_START) *
-   sizeof(lpite)), MEMTXATTRS_UNSPECIFIED, &lpite,
-   sizeof(lpite));
-
-if (!(lpite & LPI_CTE_ENABLED)) {
-return;
-}
-
-if (cs->gic->gicd_ctlr & GICD_CTLR_DS) {
-prio = lpite & LPI_PRIORITY_MASK;
-} else {
-prio = ((lpite & LPI_PRIORITY_MASK) >> 1) | 0x80;
-}
-
-if ((prio < cs->hpplpi.prio) ||
-((prio == cs->hpplpi.prio) && (irq <= cs->hpplpi.irq))) {
-cs->hpplpi.irq = irq;
-cs->hpplpi.prio = prio;
-/* LPIs are always non-secure Grp1 interrupts */
-cs->hpplpi.grp = GICV3_G1NS;
-}
+update_for_one_lpi(cs, irq, lpict_baddr,
+   cs->gic->gicd_ctlr & GICD_CTLR_DS,
+   &cs->hpplpi);
 }
 
 void gicv3_redist_update_lpi_only(GICv3CPUState *cs)
-- 
2.25.1

[PATCH 22/41] hw/intc/arm_gicv3: Implement GICv4's new redistributor frame

2022-04-08 Thread Peter Maydell

The GICv4 extends the redistributor register map -- where GICv3
had two 64KB frames per CPU, GICv4 has four frames. Add support
for the extra frame by using a new gicv3_redist_size() function
in the places in the GIC implementation which currently use
a fixed constant size for the redistributor register block.
(Until we implement the extra registers they will RAZ/WI.)

Any board that wants to use a GICv4 will need to also adjust
to handle the different sized redistributor register block;
that will be done separately.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 21 +
 include/hw/intc/arm_gicv3_common.h |  5 +
 hw/intc/arm_gicv3_common.c |  2 +-
 hw/intc/arm_gicv3_redist.c |  8 
 4 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 8d58d38836f..9720ccf7507 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -489,6 +489,27 @@ FIELD(VTE, RDBASE, 42, RDBASE_PROCNUM_LENGTH)
 
 /* Functions internal to the emulated GICv3 */
 
+/**
+ * gicv3_redist_size:
+ * @s: GICv3State
+ *
+ * Return the size of the redistributor register frame in bytes
+ * (which depends on what GIC version this is)
+ */
+static inline int gicv3_redist_size(GICv3State *s)
+{
+/*
+ * Redistributor size is controlled by the redistributor GICR_TYPER.VLPIS.
+ * It's the same for every redistributor in the GIC, so arbitrarily
+ * use the register field in the first one.
+ */
+if (s->cpu[0].gicr_typer & GICR_TYPER_VLPIS) {
+return GICV4_REDIST_SIZE;
+} else {
+return GICV3_REDIST_SIZE;
+}
+}
+
 /**
  * gicv3_intid_is_special:
  * @intid: interrupt ID
diff --git a/include/hw/intc/arm_gicv3_common.h 
b/include/hw/intc/arm_gicv3_common.h
index 08b27789385..40bc404a652 100644
--- a/include/hw/intc/arm_gicv3_common.h
+++ b/include/hw/intc/arm_gicv3_common.h
@@ -38,7 +38,12 @@
 
 #define GICV3_LPI_INTID_START 8192
 
+/*
+ * The redistributor in GICv3 has two 64KB frames per CPU; in
+ * GICv4 it has four 64KB frames per CPU.
+ */
 #define GICV3_REDIST_SIZE 0x2
+#define GICV4_REDIST_SIZE 0x4
 
 /* Number of SGI target-list bits */
 #define GICV3_TARGETLIST_BITS 16
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index dcc5ce28c6a..18999e3c8bb 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -295,7 +295,7 @@ void gicv3_init_irqs_and_mmio(GICv3State *s, 
qemu_irq_handler handler,
 
 memory_region_init_io(®ion->iomem, OBJECT(s),
   ops ? &ops[1] : NULL, region, name,
-  s->redist_region_count[i] * GICV3_REDIST_SIZE);
+  s->redist_region_count[i] * 
gicv3_redist_size(s));
 sysbus_init_mmio(sbd, ®ion->iomem);
 g_free(name);
 }
diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 7c75dd6f072..9f1fe09a78e 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -442,8 +442,8 @@ MemTxResult gicv3_redist_read(void *opaque, hwaddr offset, 
uint64_t *data,
  * in the memory map); if so then the GIC has multiple MemoryRegions
  * for the redistributors.
  */
-cpuidx = region->cpuidx + offset / GICV3_REDIST_SIZE;
-offset %= GICV3_REDIST_SIZE;
+cpuidx = region->cpuidx + offset / gicv3_redist_size(s);
+offset %= gicv3_redist_size(s);
 
 cs = &s->cpu[cpuidx];
 
@@ -501,8 +501,8 @@ MemTxResult gicv3_redist_write(void *opaque, hwaddr offset, 
uint64_t data,
  * in the memory map); if so then the GIC has multiple MemoryRegions
  * for the redistributors.
  */
-cpuidx = region->cpuidx + offset / GICV3_REDIST_SIZE;
-offset %= GICV3_REDIST_SIZE;
+cpuidx = region->cpuidx + offset / gicv3_redist_size(s);
+offset %= gicv3_redist_size(s);
 
 cs = &s->cpu[cpuidx];
 
-- 
2.25.1

[PATCH 33/41] hw/intc/arm_gicv3_redist: Use set_pending_table_bit() in mov handling

2022-04-08 Thread Peter Maydell

We can use our new set_pending_table_bit() utility function
in gicv3_redist_mov_lpi() to clear the bit in the source
pending table, rather than doing the "load, clear bit, store"
ourselves.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index eadf5e8265e..3127af3e2ca 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -909,11 +909,9 @@ void gicv3_redist_mov_lpi(GICv3CPUState *src, 
GICv3CPUState *dest, int irq)
  * we choose to NOP. If LPIs are disabled on source there's nothing
  * to be transferred anyway.
  */
-AddressSpace *as = &src->gic->dma_as;
 uint64_t idbits;
 uint32_t pendt_size;
 uint64_t src_baddr;
-uint8_t src_pend;
 
 if (!(src->gicr_ctlr & GICR_CTLR_ENABLE_LPIS) ||
 !(dest->gicr_ctlr & GICR_CTLR_ENABLE_LPIS)) {
@@ -932,15 +930,10 @@ void gicv3_redist_mov_lpi(GICv3CPUState *src, 
GICv3CPUState *dest, int irq)
 
 src_baddr = src->gicr_pendbaser & R_GICR_PENDBASER_PHYADDR_MASK;
 
-address_space_read(as, src_baddr + (irq / 8),
-   MEMTXATTRS_UNSPECIFIED, &src_pend, sizeof(src_pend));
-if (!extract32(src_pend, irq % 8, 1)) {
+if (!set_pending_table_bit(src, src_baddr, irq, 0)) {
 /* Not pending on source, nothing to do */
 return;
 }
-src_pend &= ~(1 << (irq % 8));
-address_space_write(as, src_baddr + (irq / 8),
-MEMTXATTRS_UNSPECIFIED, &src_pend, sizeof(src_pend));
 if (irq == src->hpplpi.irq) {
 /*
  * We just made this LPI not-pending so only need to update
-- 
2.25.1

[PATCH 30/41] hw/intc/arm_gicv3_redist: Factor out "update bit in pending table" code

2022-04-08 Thread Peter Maydell

Factor out the code which sets a single bit in an LPI pending table.
We're going to need this for handling vLPI tables, not just the
physical LPI table.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 49 +++---
 1 file changed, 30 insertions(+), 19 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index bfdde36a206..64e5d96ac36 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -145,6 +145,34 @@ static void update_for_all_lpis(GICv3CPUState *cs, 
uint64_t ptbase,
 }
 }
 
+/**
+ * set_lpi_pending_bit: Set or clear pending bit for an LPI
+ *
+ * @cs: GICv3CPUState
+ * @ptbase: physical address of LPI Pending table
+ * @irq: LPI to change pending state for
+ * @level: 0 to clear pending state, 1 to set
+ *
+ * Returns true if we needed to do something, false if the pending bit
+ * was already at @level.
+ */
+static bool set_pending_table_bit(GICv3CPUState *cs, uint64_t ptbase,
+  int irq, int level)
+{
+AddressSpace *as = &cs->gic->dma_as;
+uint64_t addr = ptbase + irq / 8;
+uint8_t pend;
+
+address_space_read(as, addr, MEMTXATTRS_UNSPECIFIED, &pend, 1);
+if (extract32(pend, irq % 8, 1) == level) {
+/* Bit already at requested state, no action required */
+return false;
+}
+pend = deposit32(pend, irq % 8, 1, level ? 1 : 0);
+address_space_write(as, addr, MEMTXATTRS_UNSPECIFIED, &pend, 1);
+return true;
+}
+
 static uint8_t gicr_read_ipriorityr(GICv3CPUState *cs, MemTxAttrs attrs,
 int irq)
 {
@@ -809,30 +837,13 @@ void gicv3_redist_lpi_pending(GICv3CPUState *cs, int irq, 
int level)
  * This function updates the pending bit in lpi pending table for
  * the irq being activated or deactivated.
  */
-AddressSpace *as = &cs->gic->dma_as;
 uint64_t lpipt_baddr;
-bool ispend = false;
-uint8_t pend;
 
-/*
- * get the bit value corresponding to this irq in the
- * lpi pending table
- */
 lpipt_baddr = cs->gicr_pendbaser & R_GICR_PENDBASER_PHYADDR_MASK;
-
-address_space_read(as, lpipt_baddr + ((irq / 8) * sizeof(pend)),
-   MEMTXATTRS_UNSPECIFIED, &pend, sizeof(pend));
-
-ispend = extract32(pend, irq % 8, 1);
-
-/* no change in the value of pending bit, return */
-if (ispend == level) {
+if (!set_pending_table_bit(cs, lpipt_baddr, irq, level)) {
+/* no change in the value of pending bit, return */
 return;
 }
-pend = deposit32(pend, irq % 8, 1, level ? 1 : 0);
-
-address_space_write(as, lpipt_baddr + ((irq / 8) * sizeof(pend)),
-MEMTXATTRS_UNSPECIFIED, &pend, sizeof(pend));
 
 /*
  * check if this LPI is better than the current hpplpi, if yes
-- 
2.25.1

[PATCH 24/41] hw/intc/arm_gicv3_cpuif: Split "update vIRQ/vFIQ" from gicv3_cpuif_virt_update()

2022-04-08 Thread Peter Maydell

The function gicv3_cpuif_virt_update() currently sets all of vIRQ,
vFIQ and the maintenance interrupt.  This implies that it has to be
used quite carefully -- as the comment notes, setting the maintenance
interrupt will typically cause the GIC code to be re-entered
recursively.  For handling vLPIs, we need the redistributor to be
able to tell the cpuif to update the vIRQ and vFIQ lines when the
highest priority pending vLPI changes.  Since that change can't cause
the maintenance interrupt state to change, we can pull the "update
vIRQ/vFIQ" parts of gicv3_cpuif_virt_update() out into a separate
function, which the redistributor can then call without having to
worry about the reentrancy issue.

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h  | 11 +++
 hw/intc/arm_gicv3_cpuif.c | 64 ---
 hw/intc/trace-events  |  3 +-
 3 files changed, 53 insertions(+), 25 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 795bf57d2b3..f25ddeca579 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -707,6 +707,17 @@ void gicv3_init_cpuif(GICv3State *s);
  */
 void gicv3_cpuif_update(GICv3CPUState *cs);
 
+/*
+ * gicv3_cpuif_virt_irq_fiq_update:
+ * @cs: GICv3CPUState for the CPU to update
+ *
+ * Recalculate whether to assert the virtual IRQ or FIQ lines after
+ * a change to the current highest priority pending virtual interrupt.
+ * Note that this does not recalculate and change the maintenance
+ * interrupt status (for that, see gicv3_cpuif_virt_update()).
+ */
+void gicv3_cpuif_virt_irq_fiq_update(GICv3CPUState *cs);
+
 static inline uint32_t gicv3_iidr(void)
 {
 /* Return the Implementer Identification Register value
diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index 1a3d440a54b..5fb64d4663c 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -370,30 +370,20 @@ static uint32_t maintenance_interrupt_state(GICv3CPUState 
*cs)
 return value;
 }
 
-static void gicv3_cpuif_virt_update(GICv3CPUState *cs)
+void gicv3_cpuif_virt_irq_fiq_update(GICv3CPUState *cs)
 {
-/* Tell the CPU about any pending virtual interrupts or
- * maintenance interrupts, following a change to the state
- * of the CPU interface relevant to virtual interrupts.
- *
- * CAUTION: this function will call qemu_set_irq() on the
- * CPU maintenance IRQ line, which is typically wired up
- * to the GIC as a per-CPU interrupt. This means that it
- * will recursively call back into the GIC code via
- * gicv3_redist_set_irq() and thus into the CPU interface code's
- * gicv3_cpuif_update(). It is therefore important that this
- * function is only called as the final action of a CPU interface
- * register write implementation, after all the GIC state
- * fields have been updated. gicv3_cpuif_update() also must
- * not cause this function to be called, but that happens
- * naturally as a result of there being no architectural
- * linkage between the physical and virtual GIC logic.
+/*
+ * Tell the CPU about any pending virtual interrupts.
+ * This should only be called for changes that affect the
+ * vIRQ and vFIQ status and do not change the maintenance
+ * interrupt status. This means that unlike gicv3_cpuif_virt_update()
+ * this function won't recursively call back into the GIC code.
+ * The main use of this is when the redistributor has changed the
+ * highest priority pending virtual LPI.
  */
 int idx;
 int irqlevel = 0;
 int fiqlevel = 0;
-int maintlevel = 0;
-ARMCPU *cpu = ARM_CPU(cs->cpu);
 
 idx = hppvi_index(cs);
 trace_gicv3_cpuif_virt_update(gicv3_redist_affid(cs), idx);
@@ -410,16 +400,42 @@ static void gicv3_cpuif_virt_update(GICv3CPUState *cs)
 }
 }
 
+trace_gicv3_cpuif_virt_set_irqs(gicv3_redist_affid(cs), fiqlevel, 
irqlevel);
+qemu_set_irq(cs->parent_vfiq, fiqlevel);
+qemu_set_irq(cs->parent_virq, irqlevel);
+}
+
+static void gicv3_cpuif_virt_update(GICv3CPUState *cs)
+{
+/*
+ * Tell the CPU about any pending virtual interrupts or
+ * maintenance interrupts, following a change to the state
+ * of the CPU interface relevant to virtual interrupts.
+ *
+ * CAUTION: this function will call qemu_set_irq() on the
+ * CPU maintenance IRQ line, which is typically wired up
+ * to the GIC as a per-CPU interrupt. This means that it
+ * will recursively call back into the GIC code via
+ * gicv3_redist_set_irq() and thus into the CPU interface code's
+ * gicv3_cpuif_update(). It is therefore important that this
+ * function is only called as the final action of a CPU interface
+ * register write implementation, after all the GIC state
+ * fields have been updated. gicv3_cpuif_update() also must
+ * not cause this function to be called, but that happens
+ * naturally as a result of there bei

[PATCH 32/41] hw/intc/arm_gicv3_redist: Implement gicv3_redist_vlpi_pending()

2022-04-08 Thread Peter Maydell

Implement the function gicv3_redist_vlpi_pending(), which was
previously left as a stub.  This is the function that is called by
the CPU interface when it changes the state of a vLPI.  It's similar
to gicv3_redist_process_vlpi(), but we know that the vCPU is
definitely resident on the redistributor and the irq is in range, so
it is a bit simpler.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index be36978b45c..eadf5e8265e 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -1009,9 +1009,28 @@ void gicv3_redist_movall_lpis(GICv3CPUState *src, 
GICv3CPUState *dest)
 void gicv3_redist_vlpi_pending(GICv3CPUState *cs, int irq, int level)
 {
 /*
- * The redistributor handling for changing the pending state
- * of a vLPI will be added in a subsequent commit.
+ * Change the pending state of the specified vLPI.
+ * Unlike gicv3_redist_process_vlpi(), we know here that the
+ * vCPU is definitely resident on this redistributor, and that
+ * the irq is in range.
  */
+uint64_t vptbase, ctbase;
+
+vptbase = FIELD_EX64(cs->gicr_vpendbaser, GICR_VPENDBASER, PHYADDR) << 16;
+
+if (set_pending_table_bit(cs, vptbase, irq, level)) {
+if (level) {
+/* Check whether this vLPI is now the best */
+ctbase = cs->gicr_vpropbaser & R_GICR_VPROPBASER_PHYADDR_MASK;
+update_for_one_lpi(cs, irq, ctbase, true, &cs->hppvlpi);
+gicv3_cpuif_virt_irq_fiq_update(cs);
+} else {
+/* Only need to recalculate if this was previously the best vLPI */
+if (irq == cs->hppvlpi.irq) {
+gicv3_redist_update_vlpi(cs);
+}
+}
+}
 }
 
 void gicv3_redist_process_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr,
-- 
2.25.1

[PATCH 37/41] hw/intc/arm_gicv3: Update ID and feature registers for GICv4

2022-04-08 Thread Peter Maydell

Update the various GIC ID and feature registers for GICv4:
 * PIDR2 [7:4] is the GIC architecture revision
 * GICD_TYPER.DVIS is 1 to indicate direct vLPI injection support
 * GICR_TYPER.VLPIS is 1 to indicate redistributor support for vLPIs
 * GITS_TYPER.VIRTUAL is 1 to indicate vLPI support
 * GITS_TYPER.VMOVP is 1 to indicate that our VMOVP implementation
   handles cross-ITS synchronization for the guest
 * ICH_VTR_EL2.nV4 is 0 to indicate direct vLPI injection support

Signed-off-by: Peter Maydell 
---
 hw/intc/gicv3_internal.h   | 15 +++
 hw/intc/arm_gicv3_common.c |  7 +--
 hw/intc/arm_gicv3_cpuif.c  |  6 +-
 hw/intc/arm_gicv3_dist.c   |  7 ---
 hw/intc/arm_gicv3_its.c|  7 ++-
 hw/intc/arm_gicv3_redist.c |  2 +-
 6 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/hw/intc/gicv3_internal.h b/hw/intc/gicv3_internal.h
index 07644b2be6f..0bf68452395 100644
--- a/hw/intc/gicv3_internal.h
+++ b/hw/intc/gicv3_internal.h
@@ -309,6 +309,7 @@ FIELD(GITS_TYPER, SEIS, 18, 1)
 FIELD(GITS_TYPER, PTA, 19, 1)
 FIELD(GITS_TYPER, CIDBITS, 32, 4)
 FIELD(GITS_TYPER, CIL, 36, 1)
+FIELD(GITS_TYPER, VMOVP, 37, 1)
 
 #define GITS_IDREGS   0xFFD0
 
@@ -747,23 +748,29 @@ static inline uint32_t gicv3_iidr(void)
 #define GICV3_PIDR0_REDIST 0x93
 #define GICV3_PIDR0_ITS 0x94
 
-static inline uint32_t gicv3_idreg(int regoffset, uint8_t pidr0)
+static inline uint32_t gicv3_idreg(GICv3State *s, int regoffset, uint8_t pidr0)
 {
 /* Return the value of the CoreSight ID register at the specified
  * offset from the first ID register (as found in the distributor
  * and redistributor register banks).
- * These values indicate an ARM implementation of a GICv3.
+ * These values indicate an ARM implementation of a GICv3 or v4.
  */
 static const uint8_t gicd_ids[] = {
-0x44, 0x00, 0x00, 0x00, 0x92, 0xB4, 0x3B, 0x00, 0x0D, 0xF0, 0x05, 0xB1
+0x44, 0x00, 0x00, 0x00, 0x92, 0xB4, 0x0B, 0x00, 0x0D, 0xF0, 0x05, 0xB1
 };
+uint32_t id;
 
 regoffset /= 4;
 
 if (regoffset == 4) {
 return pidr0;
 }
-return gicd_ids[regoffset];
+id = gicd_ids[regoffset];
+if (regoffset == 6) {
+/* PIDR2 bits [7:4] are the GIC architecture revision */
+id |= s->revision << 4;
+}
+return id;
 }
 
 /**
diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index 3f47b3501fe..181f342f32c 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -406,8 +406,8 @@ static void arm_gicv3_common_realize(DeviceState *dev, 
Error **errp)
  *  Last == 1 if this is the last redistributor in a series of
  *contiguous redistributor pages
  *  DirectLPI == 0 (direct injection of LPIs not supported)
- *  VLPIS == 0 (virtual LPIs not supported)
- *  PLPIS == 0 (physical LPIs not supported)
+ *  VLPIS == 1 if vLPIs supported (GICv4 and up)
+ *  PLPIS == 1 if LPIs supported
  */
 cpu_affid = object_property_get_uint(OBJECT(cpu), "mp-affinity", NULL);
 
@@ -422,6 +422,9 @@ static void arm_gicv3_common_realize(DeviceState *dev, 
Error **errp)
 
 if (s->lpi_enable) {
 s->cpu[i].gicr_typer |= GICR_TYPER_PLPIS;
+if (s->revision > 3) {
+s->cpu[i].gicr_typer |= GICR_TYPER_VLPIS;
+}
 }
 }
 
diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index d627ddac90f..8404f46ee0b 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -2578,11 +2578,15 @@ static uint64_t ich_vtr_read(CPUARMState *env, const 
ARMCPRegInfo *ri)
 uint64_t value;
 
 value = ((cs->num_list_regs - 1) << ICH_VTR_EL2_LISTREGS_SHIFT)
-| ICH_VTR_EL2_TDS | ICH_VTR_EL2_NV4 | ICH_VTR_EL2_A3V
+| ICH_VTR_EL2_TDS | ICH_VTR_EL2_A3V
 | (1 << ICH_VTR_EL2_IDBITS_SHIFT)
 | ((cs->vprebits - 1) << ICH_VTR_EL2_PREBITS_SHIFT)
 | ((cs->vpribits - 1) << ICH_VTR_EL2_PRIBITS_SHIFT);
 
+if (cs->gic->revision < 4) {
+value |= ICH_VTR_EL2_NV4;
+}
+
 trace_gicv3_ich_vtr_read(gicv3_redist_affid(cs), value);
 return value;
 }
diff --git a/hw/intc/arm_gicv3_dist.c b/hw/intc/arm_gicv3_dist.c
index 7f6275363ea..b9ed955e36b 100644
--- a/hw/intc/arm_gicv3_dist.c
+++ b/hw/intc/arm_gicv3_dist.c
@@ -383,7 +383,7 @@ static bool gicd_readl(GICv3State *s, hwaddr offset,
  * No1N == 1 (1-of-N SPI interrupts not supported)
  * A3V == 1 (non-zero values of Affinity level 3 supported)
  * IDbits == 0xf (we support 16-bit interrupt identifiers)
- * DVIS == 0 (Direct virtual LPI injection not supported)
+ * DVIS == 1 (Direct virtual LPI injection supported) if GICv4
  * LPIS == 1 (LPIs are supported if affinity routing is enabled)
  * num_LPIs == 0b0 (bits [15:11],Number of LPIs as indicated
  *  by GICD_TYPER.IDbits)
@@ -399,8 +399,9 @@

[PATCH 28/41] hw/intc/arm_gicv3_redist: Factor out "update hpplpi for all LPIs" logic

2022-04-08 Thread Peter Maydell

Factor out the common part of gicv3_redist_update_lpi_only() into
a new function update_for_all_lpis(), which does a full rescan
of an LPI Pending table and sets the specified PendingIrq struct
with the highest priority pending enabled LPI it finds.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 66 ++
 1 file changed, 46 insertions(+), 20 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 571e0fa8309..2379389d14e 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -103,6 +103,48 @@ static void update_for_one_lpi(GICv3CPUState *cs, int irq,
 }
 }
 
+/**
+ * update_for_all_lpis: Fully scan LPI tables and find best pending LPI
+ *
+ * @cs: GICv3CPUState
+ * @ptbase: physical address of LPI Pending table
+ * @ctbase: physical address of LPI Configuration table
+ * @ptsizebits: size of tables, specified as number of interrupt ID bits minus 
1
+ * @ds: true if priority value should not be shifted
+ * @hpp: points to pending information to set
+ *
+ * Recalculate the highest priority pending enabled LPI from scratch,
+ * and set @hpp accordingly.
+ *
+ * We scan the LPI pending table @ptbase; for each pending LPI, we read the
+ * corresponding entry in the LPI configuration table @ctbase to extract
+ * the priority and enabled information.
+ *
+ * We take @ptsizebits in the form idbits-1 because this is the way that
+ * LPI table sizes are architecturally specified in GICR_PROPBASER.IDBits
+ * and in the VMAPP command's VPT_size field.
+ */
+static void update_for_all_lpis(GICv3CPUState *cs, uint64_t ptbase,
+uint64_t ctbase, unsigned ptsizebits,
+bool ds, PendingIrq *hpp)
+{
+AddressSpace *as = &cs->gic->dma_as;
+uint8_t pend;
+uint32_t pendt_size = (1ULL << (ptsizebits + 1));
+int i, bit;
+
+hpp->prio = 0xff;
+
+for (i = GICV3_LPI_INTID_START / 8; i < pendt_size / 8; i++) {
+address_space_read(as, ptbase + i, MEMTXATTRS_UNSPECIFIED, &pend, 1);
+while (pend) {
+bit = ctz32(pend);
+update_for_one_lpi(cs, i * 8 + bit, ctbase, ds, hpp);
+pend &= ~(1 << bit);
+}
+}
+}
+
 static uint8_t gicr_read_ipriorityr(GICv3CPUState *cs, MemTxAttrs attrs,
 int irq)
 {
@@ -657,11 +699,7 @@ void gicv3_redist_update_lpi_only(GICv3CPUState *cs)
  * priority is lower than the last computed high priority lpi interrupt.
  * If yes, replace current LPI as the new high priority lpi interrupt.
  */
-AddressSpace *as = &cs->gic->dma_as;
-uint64_t lpipt_baddr;
-uint32_t pendt_size = 0;
-uint8_t pend;
-int i, bit;
+uint64_t lpipt_baddr, lpict_baddr;
 uint64_t idbits;
 
 idbits = MIN(FIELD_EX64(cs->gicr_propbaser, GICR_PROPBASER, IDBITS),
@@ -671,23 +709,11 @@ void gicv3_redist_update_lpi_only(GICv3CPUState *cs)
 return;
 }
 
-cs->hpplpi.prio = 0xff;
-
 lpipt_baddr = cs->gicr_pendbaser & R_GICR_PENDBASER_PHYADDR_MASK;
+lpict_baddr = cs->gicr_propbaser & R_GICR_PROPBASER_PHYADDR_MASK;
 
-/* Determine the highest priority pending interrupt among LPIs */
-pendt_size = (1ULL << (idbits + 1));
-
-for (i = GICV3_LPI_INTID_START / 8; i < pendt_size / 8; i++) {
-address_space_read(as, lpipt_baddr + i, MEMTXATTRS_UNSPECIFIED, &pend,
-   sizeof(pend));
-
-while (pend) {
-bit = ctz32(pend);
-gicv3_redist_check_lpi_priority(cs, i * 8 + bit);
-pend &= ~(1 << bit);
-}
-}
+update_for_all_lpis(cs, lpipt_baddr, lpict_baddr, idbits,
+cs->gic->gicd_ctlr & GICD_CTLR_DS, &cs->hpplpi);
 }
 
 void gicv3_redist_update_lpi(GICv3CPUState *cs)
-- 
2.25.1

[PATCH 31/41] hw/intc/arm_gicv3_redist: Implement gicv3_redist_process_vlpi()

2022-04-08 Thread Peter Maydell

Implement the function gicv3_redist_process_vlpi(), which was left as
just a stub earlier.  This function deals with being handed a VLPI by
the ITS.  It must set the bit in the pending table.  If the vCPU is
currently resident we must recalculate the highest priority pending
vLPI; otherwise we may need to ring a "doorbell" interrupt to let the
hypervisor know it might want to reschedule the vCPU.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 48 ++
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 64e5d96ac36..be36978b45c 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -60,6 +60,19 @@ static uint32_t gicr_read_bitmap_reg(GICv3CPUState *cs, 
MemTxAttrs attrs,
 return reg;
 }
 
+static bool vcpu_resident(GICv3CPUState *cs, uint64_t vptaddr)
+{
+/*
+ * Return true if a vCPU is resident, which is defined by
+ * whether the GICR_VPENDBASER register is marked VALID and
+ * has the right virtual pending table address.
+ */
+if (!FIELD_EX64(cs->gicr_vpendbaser, GICR_VPENDBASER, VALID)) {
+return false;
+}
+return vptaddr == (cs->gicr_vpendbaser & R_GICR_VPENDBASER_PHYADDR_MASK);
+}
+
 /**
  * update_for_one_lpi: Update pending information if this LPI is better
  *
@@ -1004,10 +1017,37 @@ void gicv3_redist_vlpi_pending(GICv3CPUState *cs, int 
irq, int level)
 void gicv3_redist_process_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr,
int doorbell, int level)
 {
-/*
- * The redistributor handling for being handed a VLPI by the ITS
- * will be added in a subsequent commit.
- */
+bool bit_changed;
+bool resident = vcpu_resident(cs, vptaddr);
+uint64_t ctbase;
+
+if (resident) {
+uint32_t idbits = FIELD_EX64(cs->gicr_vpropbaser, GICR_VPROPBASER, 
IDBITS);
+if (irq >= (1ULL << (idbits + 1))) {
+return;
+}
+}
+
+bit_changed = set_pending_table_bit(cs, vptaddr, irq, level);
+if (resident && bit_changed) {
+if (level) {
+/* Check whether this vLPI is now the best */
+ctbase = cs->gicr_vpropbaser & R_GICR_VPROPBASER_PHYADDR_MASK;
+update_for_one_lpi(cs, irq, ctbase, true, &cs->hppvlpi);
+gicv3_cpuif_virt_irq_fiq_update(cs);
+} else {
+/* Only need to recalculate if this was previously the best vLPI */
+if (irq == cs->hppvlpi.irq) {
+gicv3_redist_update_vlpi(cs);
+}
+}
+}
+
+if (!resident && level && doorbell != INTID_SPURIOUS &&
+(cs->gicr_ctlr & GICR_CTLR_ENABLE_LPIS)) {
+/* vCPU is not currently resident: ring the doorbell */
+gicv3_redist_process_lpi(cs, doorbell, 1);
+}
 }
 
 void gicv3_redist_mov_vlpi(GICv3CPUState *src, uint64_t src_vptaddr,
-- 
2.25.1

[PATCH 38/41] hw/intc/arm_gicv3: Allow 'revision' property to be set to 4

2022-04-08 Thread Peter Maydell

Now that we have implemented all the GICv4 requirements, relax the
error-checking on the GIC object's 'revision' property to allow a TCG
GIC to be a GICv4, whilst still constraining the KVM GIC to GICv3.

Our 'revision' property doesn't consider the possibility of wanting
to specify the minor version of the GIC -- for instance there is a
GICv3.1 which adds support for extended SPI and PPI ranges, among
other things, and also GICv4.1.  But since the QOM property is
internal to QEMU, not user-facing, we can cross that bridge when we
come to it. Within the GIC implementation itself code generally
checks against the appropriate ID register feature bits, and the
only use of s->revision is for setting those ID register bits.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_common.c | 12 +++-
 hw/intc/arm_gicv3_kvm.c|  5 +
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/intc/arm_gicv3_common.c b/hw/intc/arm_gicv3_common.c
index 181f342f32c..5634c6fc788 100644
--- a/hw/intc/arm_gicv3_common.c
+++ b/hw/intc/arm_gicv3_common.c
@@ -326,12 +326,14 @@ static void arm_gicv3_common_realize(DeviceState *dev, 
Error **errp)
 GICv3State *s = ARM_GICV3_COMMON(dev);
 int i, rdist_capacity, cpuidx;
 
-/* revision property is actually reserved and currently used only in order
- * to keep the interface compatible with GICv2 code, avoiding extra
- * conditions. However, in future it could be used, for example, if we
- * implement GICv4.
+/*
+ * This GIC device supports only revisions 3 and 4. The GICv1/v2
+ * is a separate device.
+ * Note that subclasses of this device may impose further restrictions
+ * on the GIC revision: notably, the in-kernel KVM GIC doesn't
+ * support GICv4.
  */
-if (s->revision != 3) {
+if (s->revision != 3 && s->revision != 4) {
 error_setg(errp, "unsupported GIC revision %d", s->revision);
 return;
 }
diff --git a/hw/intc/arm_gicv3_kvm.c b/hw/intc/arm_gicv3_kvm.c
index 5ec5ff9ef6e..06f5aceee52 100644
--- a/hw/intc/arm_gicv3_kvm.c
+++ b/hw/intc/arm_gicv3_kvm.c
@@ -781,6 +781,11 @@ static void kvm_arm_gicv3_realize(DeviceState *dev, Error 
**errp)
 return;
 }
 
+if (s->revision != 3) {
+error_setg(errp, "unsupported GIC revision %d for in-kernel GIC",
+   s->revision);
+}
+
 if (s->security_extn) {
 error_setg(errp, "the in-kernel VGICv3 does not implement the "
"security extensions");
-- 
2.25.1

[PATCH 39/41] hw/arm/virt: Use VIRT_GIC_VERSION_* enum values in create_gic()

2022-04-08 Thread Peter Maydell

Everywhere we need to check which GIC version we're using, we look at
vms->gic_version and use the VIRT_GIC_VERSION_* enum values, except
in create_gic(), which copies vms->gic_version into a local 'int'
variable and makes direct comparisons against values 2 and 3.

For consistency, change this function to check the GIC version
the same way we do elsewhere. This includes not implicitly relying
on the enumeration type values happening to match the integer
'revision' values the GIC device object wants.

Signed-off-by: Peter Maydell 
---
 hw/arm/virt.c | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index d2e5ecd234a..594a3d0660a 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -690,14 +690,29 @@ static void create_gic(VirtMachineState *vms, 
MemoryRegion *mem)
 /* We create a standalone GIC */
 SysBusDevice *gicbusdev;
 const char *gictype;
-int type = vms->gic_version, i;
+int i;
 unsigned int smp_cpus = ms->smp.cpus;
 uint32_t nb_redist_regions = 0;
+int revision;
 
-gictype = (type == 3) ? gicv3_class_name() : gic_class_name();
+if (vms->gic_version == VIRT_GIC_VERSION_2) {
+gictype = gic_class_name();
+} else {
+gictype = gicv3_class_name();
+}
 
+switch (vms->gic_version) {
+case VIRT_GIC_VERSION_2:
+revision = 2;
+break;
+case VIRT_GIC_VERSION_3:
+revision = 3;
+break;
+default:
+g_assert_not_reached();
+}
 vms->gic = qdev_new(gictype);
-qdev_prop_set_uint32(vms->gic, "revision", type);
+qdev_prop_set_uint32(vms->gic, "revision", revision);
 qdev_prop_set_uint32(vms->gic, "num-cpu", smp_cpus);
 /* Note that the num-irq property counts both internal and external
  * interrupts; there are always 32 of the former (mandated by GIC spec).
@@ -707,7 +722,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 qdev_prop_set_bit(vms->gic, "has-security-extensions", vms->secure);
 }
 
-if (type == 3) {
+if (vms->gic_version == VIRT_GIC_VERSION_3) {
 uint32_t redist0_capacity =
 vms->memmap[VIRT_GIC_REDIST].size / GICV3_REDIST_SIZE;
 uint32_t redist0_count = MIN(smp_cpus, redist0_capacity);
@@ -742,7 +757,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 gicbusdev = SYS_BUS_DEVICE(vms->gic);
 sysbus_realize_and_unref(gicbusdev, &error_fatal);
 sysbus_mmio_map(gicbusdev, 0, vms->memmap[VIRT_GIC_DIST].base);
-if (type == 3) {
+if (vms->gic_version == VIRT_GIC_VERSION_3) {
 sysbus_mmio_map(gicbusdev, 1, vms->memmap[VIRT_GIC_REDIST].base);
 if (nb_redist_regions == 2) {
 sysbus_mmio_map(gicbusdev, 2,
@@ -780,7 +795,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
ppibase + timer_irq[irq]));
 }
 
-if (type == 3) {
+if (vms->gic_version == VIRT_GIC_VERSION_3) {
 qemu_irq irq = qdev_get_gpio_in(vms->gic,
 ppibase + ARCH_GIC_MAINT_IRQ);
 qdev_connect_gpio_out_named(cpudev, "gicv3-maintenance-interrupt",
@@ -806,9 +821,9 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 
 fdt_add_gic_node(vms);
 
-if (type == 3 && vms->its) {
+if (vms->gic_version == VIRT_GIC_VERSION_3 && vms->its) {
 create_its(vms);
-} else if (type == 2) {
+} else if (vms->gic_version == VIRT_GIC_VERSION_2) {
 create_v2m(vms);
 }
 }
-- 
2.25.1

[PATCH 34/41] hw/intc/arm_gicv3_redist: Implement gicv3_redist_mov_vlpi()

2022-04-08 Thread Peter Maydell

Implement the gicv3_redist_mov_vlpi() function (previously left as a
stub).  This function handles the work of a VMOVI command: it marks
the vLPI not-pending on the source and pending on the destination.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 3127af3e2ca..9866dd94c60 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -1067,9 +1067,25 @@ void gicv3_redist_mov_vlpi(GICv3CPUState *src, uint64_t 
src_vptaddr,
int irq, int doorbell)
 {
 /*
- * The redistributor handling for moving a VLPI will be added
- * in a subsequent commit.
+ * Move the specified vLPI's pending state from the source redistributor
+ * to the destination.
  */
+if (!set_pending_table_bit(src, src_vptaddr, irq, 0)) {
+/* Not pending on source, nothing to do */
+return;
+}
+if (vcpu_resident(src, src_vptaddr) && irq == src->hppvlpi.irq) {
+/*
+ * Update src's cached highest-priority pending vLPI if we just made
+ * it not-pending
+ */
+gicv3_redist_update_vlpi(src);
+}
+/*
+ * Mark the vLPI pending on the destination (ringing the doorbell
+ * if the vCPU isn't resident)
+ */
+gicv3_redist_process_vlpi(dest, irq, dest_vptaddr, doorbell, irq);
 }
 
 void gicv3_redist_vinvall(GICv3CPUState *cs, uint64_t vptaddr)
-- 
2.25.1

[PATCH 29/41] hw/intc/arm_gicv3_redist: Recalculate hppvlpi on VPENDBASER writes

2022-04-08 Thread Peter Maydell

The guest uses GICR_VPENDBASER to tell the redistributor when it is
scheduling or descheduling a vCPU.  When it writes and changes the
VALID bit from 0 to 1, it is scheduling a vCPU, and we must update
our view of the current highest priority pending vLPI from the new
Pending and Configuration tables.  When it writes and changes the
VALID bit from 1 to 0, it is descheduling, which means that there is
no longer a highest priority pending vLPI.

The specification allows the implementation to use part of the vLPI
Pending table as an IMPDEF area where it can cache information when a
vCPU is descheduled, so that it can avoid having to do a full rescan
of the tables when the vCPU is scheduled again.  For now, we don't
take advantage of this, and simply do a complete rescan.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 87 --
 1 file changed, 84 insertions(+), 3 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 2379389d14e..bfdde36a206 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -185,6 +185,87 @@ static void gicr_write_ipriorityr(GICv3CPUState *cs, 
MemTxAttrs attrs, int irq,
 cs->gicr_ipriorityr[irq] = value;
 }
 
+static void gicv3_redist_update_vlpi_only(GICv3CPUState *cs)
+{
+uint64_t ptbase, ctbase, idbits;
+
+if (!FIELD_EX64(cs->gicr_vpendbaser, GICR_VPENDBASER, VALID)) {
+cs->hppvlpi.prio = 0xff;
+return;
+}
+
+ptbase = cs->gicr_vpendbaser & R_GICR_VPENDBASER_PHYADDR_MASK;
+ctbase = cs->gicr_vpropbaser & R_GICR_VPROPBASER_PHYADDR_MASK;
+idbits = FIELD_EX64(cs->gicr_vpropbaser, GICR_VPROPBASER, IDBITS);
+
+update_for_all_lpis(cs, ptbase, ctbase, idbits, true, &cs->hppvlpi);
+}
+
+static void gicv3_redist_update_vlpi(GICv3CPUState *cs)
+{
+gicv3_redist_update_vlpi_only(cs);
+gicv3_cpuif_virt_irq_fiq_update(cs);
+}
+
+static void gicr_write_vpendbaser(GICv3CPUState *cs, uint64_t newval)
+{
+/* Write @newval to GICR_VPENDBASER, handling its effects */
+bool oldvalid = FIELD_EX64(cs->gicr_vpendbaser, GICR_VPENDBASER, VALID);
+bool newvalid = FIELD_EX64(newval, GICR_VPENDBASER, VALID);
+bool pendinglast;
+
+/*
+ * The DIRTY bit is read-only and for us is always zero;
+ * other fields are writeable.
+ */
+newval &= R_GICR_VPENDBASER_INNERCACHE_MASK |
+R_GICR_VPENDBASER_SHAREABILITY_MASK |
+R_GICR_VPENDBASER_PHYADDR_MASK |
+R_GICR_VPENDBASER_OUTERCACHE_MASK |
+R_GICR_VPENDBASER_PENDINGLAST_MASK |
+R_GICR_VPENDBASER_IDAI_MASK |
+R_GICR_VPENDBASER_VALID_MASK;
+
+if (oldvalid && newvalid) {
+/*
+ * Changing other fields while VALID is 1 is UNPREDICTABLE;
+ * we choose to log and ignore the write.
+ */
+if (cs->gicr_vpendbaser ^ newval) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Changing GICR_VPENDBASER when VALID=1 "
+  "is UNPREDICTABLE\n", __func__);
+}
+return;
+}
+if (!oldvalid && !newvalid) {
+cs->gicr_vpendbaser = newval;
+return;
+}
+
+if (newvalid) {
+/*
+ * Valid going from 0 to 1: update hppvlpi from tables.
+ * If IDAI is 0 we are allowed to use the info we cached in
+ * the IMPDEF area of the table.
+ * PendingLast is RES1 when we make this transition.
+ */
+pendinglast = true;
+} else {
+/*
+ * Valid going from 1 to 0:
+ * Set PendingLast if there was a pending enabled interrupt
+ * for the vPE that was just descheduled.
+ * If we cache info in the IMPDEF area, write it out here.
+ */
+pendinglast = cs->hppvlpi.prio != 0xff;
+}
+
+newval = FIELD_DP64(newval, GICR_VPENDBASER, PENDINGLAST, pendinglast);
+cs->gicr_vpendbaser = newval;
+gicv3_redist_update_vlpi(cs);
+}
+
 static MemTxResult gicr_readb(GICv3CPUState *cs, hwaddr offset,
   uint64_t *data, MemTxAttrs attrs)
 {
@@ -493,10 +574,10 @@ static MemTxResult gicr_writel(GICv3CPUState *cs, hwaddr 
offset,
 cs->gicr_vpropbaser = deposit64(cs->gicr_vpropbaser, 32, 32, value);
 return MEMTX_OK;
 case GICR_VPENDBASER:
-cs->gicr_vpendbaser = deposit64(cs->gicr_vpendbaser, 0, 32, value);
+gicr_write_vpendbaser(cs, deposit64(cs->gicr_vpendbaser, 0, 32, 
value));
 return MEMTX_OK;
 case GICR_VPENDBASER + 4:
-cs->gicr_vpendbaser = deposit64(cs->gicr_vpendbaser, 32, 32, value);
+gicr_write_vpendbaser(cs, deposit64(cs->gicr_vpendbaser, 32, 32, 
value));
 return MEMTX_OK;
 default:
 return MEMTX_ERROR;
@@ -557,7 +638,7 @@ static MemTxResult gicr_writell(GICv3CPUState *cs, hwaddr 
offset,
 cs->gicr_vpropbaser = value;
 return MEMTX_OK;
 case GICR_VPENDBASER:
-cs->gicr_vpendbaser = value;
+

[PATCH 40/41] hw/arm/virt: Abstract out calculation of redistributor region capacity

2022-04-08 Thread Peter Maydell

In several places in virt.c we calculate the number of redistributors that
fit in a region of our memory map, which is the size of the region
divided by the size of a single redistributor frame. For GICv4, the
redistributor frame is a different size from that for GICv3. Abstract
out the calculation of redistributor region capacity so that we have
one place we need to change to handle GICv4 rather than several.

Signed-off-by: Peter Maydell 
---
 include/hw/arm/virt.h |  9 +++--
 hw/arm/virt.c | 11 ---
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 7e76ee26198..360463e6bfb 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -185,11 +185,16 @@ OBJECT_DECLARE_TYPE(VirtMachineState, VirtMachineClass, 
VIRT_MACHINE)
 void virt_acpi_setup(VirtMachineState *vms);
 bool virt_is_acpi_enabled(VirtMachineState *vms);
 
+/* Return number of redistributors that fit in the specified region */
+static uint32_t virt_redist_capacity(VirtMachineState *vms, int region)
+{
+return vms->memmap[region].size / GICV3_REDIST_SIZE;
+}
+
 /* Return the number of used redistributor regions  */
 static inline int virt_gicv3_redist_region_count(VirtMachineState *vms)
 {
-uint32_t redist0_capacity =
-vms->memmap[VIRT_GIC_REDIST].size / GICV3_REDIST_SIZE;
+uint32_t redist0_capacity = virt_redist_capacity(vms, VIRT_GIC_REDIST);
 
 assert(vms->gic_version == VIRT_GIC_VERSION_3);
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 594a3d0660a..577c1e65188 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -723,8 +723,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 }
 
 if (vms->gic_version == VIRT_GIC_VERSION_3) {
-uint32_t redist0_capacity =
-vms->memmap[VIRT_GIC_REDIST].size / GICV3_REDIST_SIZE;
+uint32_t redist0_capacity = virt_redist_capacity(vms, VIRT_GIC_REDIST);
 uint32_t redist0_count = MIN(smp_cpus, redist0_capacity);
 
 nb_redist_regions = virt_gicv3_redist_region_count(vms);
@@ -743,7 +742,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 
 if (nb_redist_regions == 2) {
 uint32_t redist1_capacity =
-vms->memmap[VIRT_HIGH_GIC_REDIST2].size / 
GICV3_REDIST_SIZE;
+virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
 
 qdev_prop_set_uint32(vms->gic, "redist-region-count[1]",
 MIN(smp_cpus - redist0_count, redist1_capacity));
@@ -2048,10 +2047,8 @@ static void machvirt_init(MachineState *machine)
  * many redistributors we can fit into the memory map.
  */
 if (vms->gic_version == VIRT_GIC_VERSION_3) {
-virt_max_cpus =
-vms->memmap[VIRT_GIC_REDIST].size / GICV3_REDIST_SIZE;
-virt_max_cpus +=
-vms->memmap[VIRT_HIGH_GIC_REDIST2].size / GICV3_REDIST_SIZE;
+virt_max_cpus = virt_redist_capacity(vms, VIRT_GIC_REDIST) +
+virt_redist_capacity(vms, VIRT_HIGH_GIC_REDIST2);
 } else {
 virt_max_cpus = GIC_NCPU;
 }
-- 
2.25.1

[PATCH 36/41] hw/intc/arm_gicv3_redist: Implement gicv3_redist_inv_vlpi()

2022-04-08 Thread Peter Maydell

Implement the function gicv3_redist_inv_vlpi(), which was previously
left as a stub.  This is the function that does the work of the INV
command for a virtual interrupt.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index a586c9ef498..0738d3822d1 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -1102,9 +1102,12 @@ void gicv3_redist_vinvall(GICv3CPUState *cs, uint64_t 
vptaddr)
 void gicv3_redist_inv_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr)
 {
 /*
- * The redistributor handling for invalidating cached information
- * about a VLPI will be added in a subsequent commit.
+ * The only cached information for LPIs we have is the HPPLPI.
+ * We could be cleverer about identifying when we don't need
+ * to do a full rescan of the pending table, but until we find
+ * this is a performance issue, just always recalculate.
  */
+gicv3_redist_vinvall(cs, vptaddr);
 }
 
 void gicv3_redist_set_irq(GICv3CPUState *cs, int irq, int level)
-- 
2.25.1

[PATCH 35/41] hw/intc/arm_gicv3_redist: Implement gicv3_redist_vinvall()

2022-04-08 Thread Peter Maydell

Implement the gicv3_redist_vinvall() function (previously left as a
stub).  This function handles the work of a VINVALL command: it must
invalidate any cached information associated with a specific vCPU.

Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_redist.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/intc/arm_gicv3_redist.c b/hw/intc/arm_gicv3_redist.c
index 9866dd94c60..a586c9ef498 100644
--- a/hw/intc/arm_gicv3_redist.c
+++ b/hw/intc/arm_gicv3_redist.c
@@ -1090,7 +1090,13 @@ void gicv3_redist_mov_vlpi(GICv3CPUState *src, uint64_t 
src_vptaddr,
 
 void gicv3_redist_vinvall(GICv3CPUState *cs, uint64_t vptaddr)
 {
-/* The redistributor handling will be added in a subsequent commit */
+if (!vcpu_resident(cs, vptaddr)) {
+/* We don't have anything cached if the vCPU isn't resident */
+return;
+}
+
+/* Otherwise, our only cached information is the HPPVLPI info */
+gicv3_redist_update_vlpi(cs);
 }
 
 void gicv3_redist_inv_vlpi(GICv3CPUState *cs, int irq, uint64_t vptaddr)
-- 
2.25.1

Re: [PATCH 00/41] arm: Implement GICv4

2022-04-08 Thread Peter Maydell

On Fri, 8 Apr 2022 at 15:15, Peter Maydell  wrote:
>
> This patchset implements emulation of GICv4 in our TCG GIC and ITS
> models, and makes the virt board use it where appropriate.

> Tested with a Linux kernel passing through a virtio-blk device
> to an inner Linux VM with KVM/QEMU. (NB that to get the outer
> Linux kernel to actually use the new GICv4 functionality you
> need to pass it "kvm-arm.vgic_v4_enable=1", as the kernel
> will not use it by default.)

I guess I might as well post my notes here about how I set up
that test environment. These are a bit too scrappy (and rather
specific about a niche thing) to be proper documentation, but
having them in the list archives might be helpful in future...

===nested-setup.txt===
How to set up an environment to test QEMU's emulation of virtualization,
with PCI passthrough of a virtio-blk-pci device to the L2 guest

(1) Set up a Debian aarch64 guest (the instructions in the old
blog post
https://translatedcode.wordpress.com/2017/07/24/installing-debian-on-qemus-64-bit-arm-virt-board/
still work; I used Debian bullseye for my testing).

(2) Copy the hda.qcow2 to hda-for-inner.qcow2; run the L1 guest
using the 'runme' script.

Caution: the virtio devices need to be in this order (hda.qcow2,
network,hda-for-inner.qcow2), because systemd in the guest names
the ethernet interface based on which PCI slot it goes in.

(3) In the L1 guest, first we need to fix up the hda-for-inner.qcow2
so that it has different UUIDs and partition UUIDs from hda.qcow2.
You'll need to make sure you have the blkid, gdisk, tune2fs, swaplabel
utilities installed in the guest.

 swapoff -a   # L1 guest might have swapped onto /dev/vdb2 by accident
 # print current partition IDs; you'll see that vda and vdb currently
 # share IDs for their partitions, and we must change those for vdb
 blkid
 # first change the PARTUUIDs with gdisk; this is the answer from
 # https://askubuntu.com/questions/1250224/how-to-change-partuuid
 gdisk /dev/vdb
 x   # change to experts menu
 c   # change partition ID
 1   # for partition 1
 R   # pick a random ID
 c   # ditto for partitions 2, 3
 2
 R
 c
 3
 R
 m   # back to main menu
 w   # write partition table
 q   # quit
 # change UUIDs; from
https://unix.stackexchange.com/questions/12858/how-to-change-filesystem-uuid-2-same-uuid
 tune2fs -U random /dev/vdb1
 tune2fs -U random /dev/vdb2
 swaplabel -U $(uuidgen) /dev/vdb3
 # Check the UUIDs and PARTUUIDs are now all changed:
 blkid
 # Now update the fstab in the L2 filesystem:
 mount /dev/vdb2 /mnt
 # Finally, edit /mnt/etc/fstab to set the UUID values for /, /boot and swap to
 # the new ones for /dev/vdb's partitions
 vi /mnt/etc/fstab # or editor of your choice
 umount /mnt
 # shutdown the L1 guest now, to ensure that all the changes to that
 # qcow2 file are committed
 shutdown -h now

(4) Copy necessary files into the L1 guest's filesystem;
you can run the L1 guest and run scp there to copy from your host machine,
or any other method you like. You'll need:
 - the vmlinuz (same one being used for L1)
 - the initrd
 - some scripts [runme-inner, runme-inner-nopassthru, reassign-vdb]
 - a copy of hda-for-inner.qcow2 (probably best to copy it to a temporary
   file while the L1 guest is not running, then copy that into the guest)
 - the qemu-system-aarch64 you want to use as the L2 QEMU
   (I cross-compiled this on my x86-64 host. The packaged Debian bullseye
   qemu-system-aarch64 will also work if you don't need to use a custom
   QEMU for L2.)

(5) Now you can run the L2 guest without using PCI passthrough like this:
 ./runme-inner-nopassthru ./qemu-system-aarch64

(6) And you can run the L2 guest with PCI passthrough like this:
 # you only need to run reassign-vdb once for any given run of the
 # L1 guest, to give the PCI device to vfio-pci rather than to the
 # L1 virtio driver. After that you can run the L2 QEMU multiple times.
 ./reassign-vdb
 ./runme-inner ./qemu-system-aarch64

Notes:

I have set up the various 'runme' scripts so that L1 has a mux of
stdio and the monitor, which means that you can kill it with ^A-x,
and ^C will be delivered to the L1 guest. The L2 guest has plain
'-serial stdio', which means that ^C will kill the L2 guest.

The 'runme' scripts expect their first argument to be the path to
the QEMU you want to run; any further arguments are extra arguments
to that QEMU. So you can do things like:

   # pass more arguments to QEMU, here disabling the ITS
   ./runme ~/qemu-system-aarch64 -machine its=off
   # run gdb, and run QEMU under gdb
   ./runme gdb --args ~/qemu-system-aarch64 -machine its=off

The 'runme' scripts should be in the same directory as the
kernel etc files they go with; but you don't need to be
in that directory to run them.
===endit===

===runme===
#!/bin/sh -e
TESTDIR="$(cd "$(dirname "$0")"; pwd)"
QEMU="$@"

# Run with GICv3 and the disk image with a nested copy in it
# (for testing EL2/GICv3-virt emulation)

: ${KERNEL:=$TESTDIR/vmlinuz-5.10.0-9-arm64}
: ${INITRD

[PATCH 41/41] hw/arm/virt: Support TCG GICv4

2022-04-08 Thread Peter Maydell

Add support for the TCG GICv4 to the virt board. For the board,
the GICv4 is very similar to the GICv3, with the only difference
being the size of the redistributor frame. The changes here are thus:
 * calculating virt_redist_capacity correctly for GICv4
 * changing various places which were "if GICv3" to be "if not GICv2"
 * the commandline option handling

Note that using GICv4 reduces the maximum possible number of CPUs on
the virt board from 512 to 317, because we can now only fit half as
many redistributors into the redistributor regions we have defined.

Signed-off-by: Peter Maydell 
---
 docs/system/arm/virt.rst |  5 ++-
 include/hw/arm/virt.h| 12 +--
 hw/arm/virt.c| 70 ++--
 3 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
index 1544632b674..5d13ec2798a 100644
--- a/docs/system/arm/virt.rst
+++ b/docs/system/arm/virt.rst
@@ -99,11 +99,14 @@ gic-version
 GICv2
   ``3``
 GICv3
+  ``4``
+GICv4 (requires ``virtualization`` to be ``on``)
   ``host``
 Use the same GIC version the host provides, when using KVM
   ``max``
 Use the best GIC version possible (same as host when using KVM;
-currently same as ``3``` for TCG, but this may change in future)
+with TCG this is currently ``3`` if ``virtualization`` is ``off`` and
+``4`` if ``virtualization`` is ``on``, but this may change in future)
 
 its
   Set ``on``/``off`` to enable/disable ITS instantiation. The default is ``on``
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 360463e6bfb..15feabac63d 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -113,6 +113,7 @@ typedef enum VirtGICType {
 VIRT_GIC_VERSION_HOST,
 VIRT_GIC_VERSION_2,
 VIRT_GIC_VERSION_3,
+VIRT_GIC_VERSION_4,
 VIRT_GIC_VERSION_NOSEL,
 } VirtGICType;
 
@@ -188,7 +189,14 @@ bool virt_is_acpi_enabled(VirtMachineState *vms);
 /* Return number of redistributors that fit in the specified region */
 static uint32_t virt_redist_capacity(VirtMachineState *vms, int region)
 {
-return vms->memmap[region].size / GICV3_REDIST_SIZE;
+uint32_t redist_size;
+
+if (vms->gic_version == VIRT_GIC_VERSION_3) {
+redist_size = GICV3_REDIST_SIZE;
+} else {
+redist_size = GICV4_REDIST_SIZE;
+}
+return vms->memmap[region].size / redist_size;
 }
 
 /* Return the number of used redistributor regions  */
@@ -196,7 +204,7 @@ static inline int 
virt_gicv3_redist_region_count(VirtMachineState *vms)
 {
 uint32_t redist0_capacity = virt_redist_capacity(vms, VIRT_GIC_REDIST);
 
-assert(vms->gic_version == VIRT_GIC_VERSION_3);
+assert(vms->gic_version != VIRT_GIC_VERSION_2);
 
 return (MACHINE(vms)->smp.cpus > redist0_capacity &&
 vms->highmem_redists) ? 2 : 1;
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 577c1e65188..dfedc6b22ee 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -522,7 +522,7 @@ static void fdt_add_gic_node(VirtMachineState *vms)
 qemu_fdt_setprop_cell(ms->fdt, nodename, "#address-cells", 0x2);
 qemu_fdt_setprop_cell(ms->fdt, nodename, "#size-cells", 0x2);
 qemu_fdt_setprop(ms->fdt, nodename, "ranges", NULL, 0);
-if (vms->gic_version == VIRT_GIC_VERSION_3) {
+if (vms->gic_version != VIRT_GIC_VERSION_2) {
 int nb_redist_regions = virt_gicv3_redist_region_count(vms);
 
 qemu_fdt_setprop_string(ms->fdt, nodename, "compatible",
@@ -708,6 +708,9 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 case VIRT_GIC_VERSION_3:
 revision = 3;
 break;
+case VIRT_GIC_VERSION_4:
+revision = 4;
+break;
 default:
 g_assert_not_reached();
 }
@@ -722,7 +725,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 qdev_prop_set_bit(vms->gic, "has-security-extensions", vms->secure);
 }
 
-if (vms->gic_version == VIRT_GIC_VERSION_3) {
+if (vms->gic_version != VIRT_GIC_VERSION_2) {
 uint32_t redist0_capacity = virt_redist_capacity(vms, VIRT_GIC_REDIST);
 uint32_t redist0_count = MIN(smp_cpus, redist0_capacity);
 
@@ -756,7 +759,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
 gicbusdev = SYS_BUS_DEVICE(vms->gic);
 sysbus_realize_and_unref(gicbusdev, &error_fatal);
 sysbus_mmio_map(gicbusdev, 0, vms->memmap[VIRT_GIC_DIST].base);
-if (vms->gic_version == VIRT_GIC_VERSION_3) {
+if (vms->gic_version != VIRT_GIC_VERSION_2) {
 sysbus_mmio_map(gicbusdev, 1, vms->memmap[VIRT_GIC_REDIST].base);
 if (nb_redist_regions == 2) {
 sysbus_mmio_map(gicbusdev, 2,
@@ -794,7 +797,7 @@ static void create_gic(VirtMachineState *vms, MemoryRegion 
*mem)
ppibase + timer_irq[irq]));
 }
 
-if (vms->gic_version == VIRT_GIC_VERSION_3) {
+if (vms->gic_version != VIRT_GIC_VERSION_2)

Re: [PATCH] x86: Implement Linear Address Masking support

2022-04-08 Thread Richard Henderson


On 4/7/22 08:27, Kirill A. Shutemov wrote:

The fast path does not clear the bits, so you enter the slow path before you
get to clearing the bits.  You've lost most of the advantage of the tlb
already.


Sorry for my ignorance, but what do you mean by fast path here?

My understanding is that it is the case when tlb_hit() is true and you
don't need to get into tlb_fill(). Are we talking about the same scheme?


We are not.  Paulo already mentioned the JIT.  One example is tcg_out_tlb_load in 
tcg/i386/tcg-target.c.inc.  Obviously, there's an implementation of that for each host 
architecture in the other tcg/arch/ subdirectories.



I've just now had a browse through the Intel docs, and I see that you're not
performing the required modified canonicality check.


Modified is effectively done by clearing (and sign-extending) the address
before the check.


While a proper tagged address will have the tag removed in CR2 during a
page fault, an improper tagged address (with bit 63 != {47,56}) should
have the original address reported to CR2.


Hm. I don't see it in spec. It rather points to other direction:

Page faults report the faulting linear address in CR2. Because LAM
masking (by sign-extension) applies before paging, the faulting
linear address recorded in CR2 does not contain the masked
metadata.



# Regardless of the paging mode, the processor performs a modified
# canonicality check that enforces that bit 47 of the pointer matches
# bit 63. As illustrated in Figure 14-1, bits 62:48 are not checked
# and are thus available for software metadata. After this modified
# canonicality check is performed, bits 62:48 are masked by
# sign-extending the value of bit 47

Note especially that the sign-extension happens after canonicality check.


But what other options do you see. Clering the bits before TLB look up
matches the architectural spec and makes INVLPG match described behaviour
without special handling.


We have special handling for INVLPG: tlb_flush_page_bits_by_mmuidx.  That's how we handle 
TBI for ARM.  You'd supply 48 or 57 here.



r~

Procedures adding new CPUs in sbsa-ref

2022-04-08 Thread Itaru Kitayama

Hi,
I'd like to add a64fx cpu to the sbsa-ref board, if there's a quick and dirty
way of completing that, advice from the  maintainers is greatly appreciated.

Thanks,
Itaru.

Re: [PATCH v9 09/11] 9p: darwin: Implement compatibility for mknodat

2022-04-08 Thread Greg Kurz

On Fri, 08 Apr 2022 15:52:25 +0200
Christian Schoenebeck  wrote:

> On Sonntag, 27. Februar 2022 23:35:20 CEST Will Cohen wrote:
> > From: Keno Fischer 
> > 
> > Darwin does not support mknodat. However, to avoid race conditions
> > with later setting the permissions, we must avoid using mknod on
> > the full path instead. We could try to fchdir, but that would cause
> > problems if multiple threads try to call mknodat at the same time.
> > However, luckily there is a solution: Darwin includes a function
> > that sets the cwd for the current thread only.
> > This should suffice to use mknod safely.
> [...]
> > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> > index cdb4c9e24c..bec0253474 100644
> > --- a/hw/9pfs/9p-util-darwin.c
> > +++ b/hw/9pfs/9p-util-darwin.c
> > @@ -7,6 +7,8 @@
> > 
> >  #include "qemu/osdep.h"
> >  #include "qemu/xattr.h"
> > +#include "qapi/error.h"
> > +#include "qemu/error-report.h"
> >  #include "9p-util.h"
> > 
> >  ssize_t fgetxattrat_nofollow(int dirfd, const char *filename, const char
> > *name, @@ -62,3 +64,34 @@ int fsetxattrat_nofollow(int dirfd, const char
> > *filename, const char *name, close_preserve_errno(fd);
> >  return ret;
> >  }
> > +
> > +/*
> > + * As long as mknodat is not available on macOS, this workaround
> > + * using pthread_fchdir_np is needed.
> > + *
> > + * Radar filed with Apple for implementing mknodat:
> > + * rdar://FB9862426 (https://openradar.appspot.com/FB9862426)
> > + */
> > +#if defined CONFIG_PTHREAD_FCHDIR_NP
> > +
> > +int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
> > +{
> > +int preserved_errno, err;
> > +if (!pthread_fchdir_np) {
> > +error_report_once("pthread_fchdir_np() not available on this
> > version of macOS"); +return -ENOTSUP;
> > +}
> > +if (pthread_fchdir_np(dirfd) < 0) {
> > +return -1;
> > +}
> > +err = mknod(filename, mode, dev);
> 
> I just tested this on macOS Monterey and realized mknod() seems to require 
> admin privileges on macOS to work. So if you run QEMU as ordinary user on 
> macOS then mknod() would fail with errno=1 (Operation not permitted).
> 
> That means a lot of stuff would simply not work on macOS, unless you really 
> want to run QEMU with super user privileges, which does not sound appealing 
> to 
> me. :/
> 
> Should we introduce another fake behaviour here, i.e. remapping this on macOS 
> hosts as regular file and make guest believe it would create a device, 
> similar 
> as we already do for mapped links?
> 

I'd rather keep that for the mapped security mode only to avoid
confusion, but qemu_mknodat() is also used in passthrough mode.

Anyway, it seems that macOS's mknod() only creates device files,
unlike linux (POSIX) which is also used to create FIFOs, sockets
and regular files. And it also requires elevated privileges,
CAP_MKNOD, in order to create device files.

It seems that this implementation of qemu_mknodat() is just missing
some features that can be implemented with unprivileged syscalls like
mkfifo(), socket() and open().

> > +preserved_errno = errno;
> > +/* Stop using the thread-local cwd */
> > +pthread_fchdir_np(-1);
> > +if (err < 0) {
> > +errno = preserved_errno;
> > +}
> > +return err;
> > +}
> > +
> > +#endif
> 
>

Support for x86_64 on aarch64 emulation

2022-04-08 Thread Redha Gouicem

We are working on support for x86_64 emulation on aarch64, mainly
related to memory ordering issues. We first wanted to know what the
community thinks about our proposal, and its chance of getting merged
one day.

Note that we worked with qemu-user, so there may be issues in system
mode that we missed.

# Problem

When generating the TCG instructions for memory accesses, fences are
always inserted *before* the access, following this translation rule:

x86   --> TCG -->aarch64
-
RMOV  -->  Fm_ld; ld  -->  DMBLD; LDR
WMOV  -->  Fm_st; st  -->  DMBFF; STR

Here, Fm_ld is a fence that orders any preceding memory access with
the subsequent load. F_m_st is a fence that orders any preceding
memory access with the subsequent store. This means that, in TCG, all
memory accesses are ordered by fences. Thus, no memory accesses can be
re-ordered in TCG. This is a problem, because it is *stricter than
x86*. Consider when a program contains:

WMOV; RMOV


x86 allows re-ordering independent store-load pairs, so the above pair
can safely re-order on an x86 host. However, with QEMU's current
translation, it becomes:

DMBFF; STR; DMBLD; LDR

In this target aarch64 code, no re-ordering is possible. Hence, QEMU
enforces a stronger model than x86. While that is correct, it harms
performance.

# Solution

We propose an alternative scheme, which we formally proved correct
(paper under review):

x86   -->  TCG-->aarch64
-
RMOV  -->  ld; Fld_m  -->  LDR; DMBLD
WMOV  -->  Fst_st; st -->  DMBST; STR

This new scheme precisely captures the observable behaviors of the
input program (in x86's memory model). This behavior is preserved in
the resulting TCG and aarch64 programs. Which the inserted fences
enforce (formally verified). Note that this scheme enforces fewer
ordering than the previous (unnecessarily strong) mapping scheme. This
new scheme benefits performance. We evaluated this on benchmarks
(PARSEC) and got up to 19.7% improvement, 6.7% on average.

# Implementation Considerations
 
Different (source and host) architectures may demand different such
mapping schemes. Some schemes may place fences before an instruction,
while others place them after. The implementation of fence placement
should thus be sufficiently flexible that either is possible. Though,
note that write-read pairs are unordered in almost all architectures.
 
We see two ways of doing this:
- extracting the placement of the fence from the 
  tcg_gen_qemu_ld/st_i32/i64 functions, and have each architecture
  explicitly generate the fence at the correct place
- adding two parameters to these functions specifying the strength of
  the "before" and "after" fences. The function would then generate
  both fences in the IR (one of them may be a NOP fence), which in
  turn will be translated back to the host


We are eager to see what you think about this change in TCG.
Cheers!


-- 
Redha Gouicem
Post doctoral researcher
Chair of Decentralized Systems Engineering
Department of Informatics, Technical University of Munich (TUM)

Re: [PATCH v4 3/7] iotests: add copy-before-write: on-cbw-error tests

2022-04-08 Thread Hanna Reitz


On 07.04.22 15:27, Vladimir Sementsov-Ogievskiy wrote:

Add tests for new option of copy-before-write filter: on-cbw-error.

Note that we use QEMUMachine instead of VM class, because in further
commit we'll want to use throttling which doesn't work with -accel
qtest used by VM.

We also touch pylintrc to not break iotest 297.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/pylintrc   |   5 +
  tests/qemu-iotests/tests/copy-before-write| 132 ++
  .../qemu-iotests/tests/copy-before-write.out  |   5 +
  3 files changed, 142 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/copy-before-write
  create mode 100644 tests/qemu-iotests/tests/copy-before-write.out


Reviewed-by: Hanna Reitz

Re: [PATCH v4 7/7] iotests: copy-before-write: add cases for cbw-timeout option

2022-04-08 Thread Hanna Reitz


On 07.04.22 15:27, Vladimir Sementsov-Ogievskiy wrote:

Add two simple test-cases: timeout failure with
break-snapshot-on-cbw-error behavior and similar with
break-guest-write-on-cbw-error behavior.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  tests/qemu-iotests/tests/copy-before-write| 81 +++
  .../qemu-iotests/tests/copy-before-write.out  |  4 +-
  2 files changed, 83 insertions(+), 2 deletions(-)


Reviewed-by: Hanna Reitz

1 2 >

1 - 100 of 171 matches

Mail list logo