Re: Reducing vdpa migration downtime because of memory pin / maps

2023-07-19 Thread Si-Wei Liu




On 7/19/2023 3:40 AM, Eugenio Perez Martin wrote:

On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu  wrote:

Hey,

I am now back from the break. Sorry for the delayed response, please see
in line.

On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote:

On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu  wrote:


On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote:

On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu  wrote:

On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote:

On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu  wrote:

On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote:

On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu  wrote:

On 6/7/23 01:08, Eugenio Perez Martin wrote:

On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu  wrote:

Sorry for reviving this old thread; I missed the best timing to follow up
on this while I was on vacation. I have been working on this and found
some discrepancies, please see below.

On 4/5/23 04:37, Eugenio Perez Martin wrote:

Hi!

As mentioned in the last upstream virtio-networking meeting, one of
the factors that adds more downtime to migration is the handling of
the guest memory (pin, map, etc). At this moment this handling is
bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the
destination device waits until all the guest memory / state is
migrated to start pinning all the memory.

The proposal is to bind it to the char device life cycle (open vs
close),

Hmmm, really? If it's bound to the char device life cycle, that won't work
for the next guest / qemu launch on the same vhost-vdpa device node.


Maybe my sentence was not accurate, but I think we're on the same page here.

Two qemu instances opening the same char device at the same time are
not allowed, and vhost_vdpa_release cleans all the maps. So the next
qemu that opens the char device should see a clean device anyway.

I mean the pinning can't be done at the time of char device open, where the
user address space is not known/bound yet. The earliest point possible
for pinning would be after the vhost_attach_mm() call from SET_OWNER is
done.

Maybe we are deviating, let me start again.

Using QEMU code, what I'm proposing is to modify the lifecycle of the
.listener member of struct vhost_vdpa.

At this moment, the memory listener is registered at
vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev,
and is unregistered in both vhost_vdpa_reset_status and
vhost_vdpa_cleanup.

My original proposal was just to move the memory listener registration
to the last vhost_vdpa_init, and remove the unregister from
vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would
be the same, the device should not realize this change.
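
As an aside, a minimal standalone model of the lifecycle difference may
help visualize where the expensive pin/map work lands in each scheme.
This is not QEMU code; all names below are illustrative only:

#include <stdbool.h>
#include <stdio.h>

struct listener { bool registered; };

/* The expensive step: pinning guest memory and programming device maps. */
static void pin_and_map(struct listener *l)     { l->registered = true;  puts("pin + map"); }
static void unpin_and_unmap(struct listener *l) { l->registered = false; puts("unpin + unmap"); }

/* Today: bound to the virtio life cycle, so every DRIVER_OK after a reset
 * (including the one on the migration destination) pays the full cost. */
static void driver_ok_today(struct listener *l) { pin_and_map(l); }
static void reset_today(struct listener *l)     { unpin_and_unmap(l); }

/* Proposal: bound to the vhost_vdpa init/cleanup path, so the maps survive
 * virtio resets and only the first start pays the cost. */
static void vdpa_init_proposed(struct listener *l)    { pin_and_map(l); }
static void virtio_reset_proposed(struct listener *l) { (void)l; /* maps kept */ }
static void vdpa_cleanup_proposed(struct listener *l) { unpin_and_unmap(l); }

int main(void)
{
    struct listener today = { false }, proposed = { false };

    driver_ok_today(&today);
    reset_today(&today);
    driver_ok_today(&today);          /* re-pins: shows up as downtime */

    vdpa_init_proposed(&proposed);
    virtio_reset_proposed(&proposed); /* nothing to redo on restart */
    vdpa_cleanup_proposed(&proposed);
    return 0;
}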

This can address LM downtime latency for sure, but it won't help
downtime during a dynamic SVQ switch - which still needs to go through the
full unmap/map cycle (including the slow pinning part) when moving from
passthrough to SVQ mode. Note that not every device can work with a
separate ASID for SVQ descriptors. The fix should be expected to work on
normal vDPA vendor devices without a separate descriptor ASID, with a
platform IOMMU underneath or with an on-chip IOMMU.


At this moment the SVQ switch is very inefficient mapping-wise, as it
unmaps all the GPA->HVA maps and overrides them. In particular, SVQ is
allocated in the low regions of the IOVA space, and then the guest memory
is allocated in this new IOVA region incrementally.

Yep. The key to build this fast path for SVQ switching I think is to
maintain the identity mapping for the passthrough queues so that QEMU
can reuse the old mappings for guest memory (e.g. GIOVA identity mapped
to GPA) while incrementally adding new mappings for SVQ vrings.


We can optimize that if we place SVQ in a free GPA area instead.

Here's a question though: it might not be hard to find a free GPA range
for the non-vIOMMU case (allocate IOVA beyond the 48-bit or 52-bit
ranges), but I'm not sure if it's easy to find a free GIOVA range for the
vIOMMU case - particularly, this has to work within the same entire 64-bit
IOVA address range, and (for now) QEMU won't be able to "reserve" a
specific IOVA range for SVQ from the vIOMMU. Do you foresee this can be
done for every QEMU-emulated vIOMMU (intel-iommu, amd-iommu, arm smmu and
virtio-iommu) so that we can call it out as a generic means for SVQ
switching optimization?


In the case the vIOMMU allocates a new block, we will use the same algorithm as now:
* Find a new free IOVA chunk of the same size
* Map the block at this new SVQ IOVA, which may or may not be the same as before

Since we must go through the translation phase to sanitize the guest's
available descriptors anyway, it has zero added cost.
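
For illustration, here is a standalone first-fit sketch of that allocation
step. It is not the QEMU IOVA-tree API; names and the first-fit policy are
just assumptions for the example:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct iova_range { uint64_t start, size; };   /* [start, start + size) */

/*
 * First-fit search over 'used' (sorted by start, non-overlapping) within
 * [0, limit). Returns the new IOVA start, or UINT64_MAX if nothing fits.
 * The result may or may not equal the IOVA the range had before; SVQ
 * translates every descriptor anyway, so either outcome is fine.
 */
static uint64_t iova_find_hole(const struct iova_range *used, size_t n,
                               uint64_t limit, uint64_t size)
{
    uint64_t hole_start = 0;

    for (size_t i = 0; i < n; i++) {
        if (used[i].start - hole_start >= size) {
            return hole_start;
        }
        hole_start = used[i].start + used[i].size;
    }
    return limit - hole_start >= size ? hole_start : UINT64_MAX;
}

int main(void)
{
    struct iova_range used[] = { { 0x0, 0x1000 }, { 0x2000, 0x1000 } };

    printf("new SVQ IOVA: 0x%llx\n",
           (unsigned long long)iova_find_hole(used, 2, 1ULL << 48, 0x1000));
    return 0;   /* prints 0x1000: the hole between the two used ranges */
}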

Not sure I followed; this can work, but it doesn't seem able to reuse the
old host kernel mappings for guest memory, hence it still requires a remap of
the entire host IOVA range when the SVQ IOVA comes along. I think by
maintaining a 1:1 identity map on guest memory, we don't have to bother
tearing down existing HVA-

Re: [PATCH 1/2] Reduce vdpa initialization / startup overhead

2023-07-21 Thread Si-Wei Liu




On 7/21/2023 3:39 AM, Eugenio Perez Martin wrote:

On Tue, Jul 18, 2023 at 12:55 PM Michael S. Tsirkin  wrote:

On Thu, Apr 20, 2023 at 10:59:56AM +0200, Eugenio Perez Martin wrote:

On Thu, Apr 20, 2023 at 7:25 AM Pei Li  wrote:

Hi all,

My bad, I just submitted the kernel patch. If we are passing some generic
command, we still have to add an additional field in the structure to indicate
what the unbatched version of this command is, and struct vhost_ioctls
would be some command-specific structure. In summary, the structure would be
something like
struct vhost_cmd_batch {
 int ncmds;
 int cmd;

The unbatched version should go in each vhost_ioctls. That allows us
to send many different commands in one ioctl instead of having to
resort to many ioctls, each one for a different task.

The problem with that is choosing the size of that struct vhost_ioctl so
we can build an array. I think the biggest of them (vhost_vring_addr?)
should be enough for a long time, but I would like to know if
anybody finds a drawback here. We could always resort to pointers if
we find we need more space, or start with them from the beginning.
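
A hedged sketch of how such a batching UAPI could be laid out (nothing
here is merged; the entry type and all names are illustrative), with each
entry carrying its unbatched ioctl number plus a payload sized for the
largest per-vq argument, so the entries form a plain array:

#include <stdint.h>
#include <linux/vhost.h>   /* struct vhost_vring_state / vhost_vring_addr */

struct vhost_batched_cmd {
    uint32_t ioctl_code;                 /* unbatched ioctl, e.g. VHOST_VDPA_GET_VRING_GROUP */
    union {
        struct vhost_vring_state state;  /* vq index in, group/base out */
        struct vhost_vring_addr  addr;   /* currently the largest payload */
    } u;
};

struct vhost_cmd_batch {
    uint32_t ncmds;
    struct vhost_batched_cmd cmds[];     /* ncmds entries, one ioctl() round trip */
};

Userspace would fill cmds[0..ncmds-1], one entry per vq, and issue a
single hypothetical VHOST_CMD_BATCH ioctl instead of ncmds separate ones.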

CCing Si-Wei here too, as he is also interested in reducing the startup time.

Thanks!

And copying my response too:
This is all very exciting, but what exactly is the benefit?
No optimization patches are going to be merged without
numbers showing performance gains.
In this case, can you show gains in process startup time?
Are they significant enough to warrant adding new UAPI?


This should have been marked as RFC in that regard.

When this was sent, it was one of the planned actions to reduce
overhead. After Si-Wei's benchmarks, all the effort will focus on
reducing the pinning / maps for the moment. It is unlikely that this
will be moved forward soon.
Right, this work has comparatively lower priority in terms of its
impact on migration downtime (for vdpa h/w devices that do DMA), but after
getting the pinning/map latency removed from the performance path, it'd
be easier to see the same scalability effect with respect to vq count as
software vp_vdpa shows today. I think in order to profile the vq
scalability effect with a large queue count, we first would need a proper
implementation of CVQ replay and multiqueue LM in place - I'm not sure if
x-svq=on could be a good approximation, but maybe it can be used to collect
some initial profiling data. Would this be sufficient to move this forward
in parallel?


Regards,
-Siwei



Thanks!





 struct vhost_ioctls[];
};

This is doable. Also, this is my first time submitting patches to open source, 
sorry about the mess in advance. That being said, feel free to throw questions 
/ comments!

Thanks and best regards,
Pei

On Wed, Apr 19, 2023 at 9:19 PM Jason Wang  wrote:

On Wed, Apr 19, 2023 at 11:33 PM Eugenio Perez Martin
 wrote:

On Wed, Apr 19, 2023 at 12:56 AM  wrote:

From: Pei Li 

Currently, part of the vdpa initialization / startup process
needs to trigger many ioctls per vq, which is very inefficient
and causes unnecessary context switches between user mode and
kernel mode.

This patch creates an additional ioctl() command, namely
VHOST_VDPA_GET_VRING_GROUP_BATCH, that batches
VHOST_VDPA_GET_VRING_GROUP commands into a single
ioctl() call.

I'd expect there's a kernel patch but I didn't see that?

If we want to go this way, why not simply have a more generic way, that is,
introducing something like:

VHOST_CMD_BATCH which did something like

struct vhost_cmd_batch {
 int ncmds;
 struct vhost_ioctls[];
};

Then you can batch other ioctls other than GET_VRING_GROUP?

Thanks


It seems to me you forgot to send the 0/2 cover letter :).

Please include the time we save thanks to avoiding the repetitive
ioctls in each patch.

CCing Jason and Michael.


Signed-off-by: Pei Li 
---
  hw/virtio/vhost-vdpa.c   | 31 +++-
  include/standard-headers/linux/vhost_types.h |  3 ++
  linux-headers/linux/vhost.h  |  7 +
  3 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index bc6bad23d5..6d45ff8539 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -679,7 +679,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
  uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 |
  0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH |
  0x1ULL << VHOST_BACKEND_F_IOTLB_ASID |
-0x1ULL << VHOST_BACKEND_F_SUSPEND;
+0x1ULL << VHOST_BACKEND_F_SUSPEND |
+0x1ULL << VHOST_BACKEND_F_IOCTL_BATCH;
  int r;

  if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) {
@@ -731,14 +732,28 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, 
int idx)

  static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev)
  {
-int i;
+int i, nvqs = dev->nvqs;
+uint64_t backend_features = dev->backend_cap;
+
  trac

Re: [RFC PATCH 07/12] vdpa: add vhost_vdpa_reset_queue

2023-07-21 Thread Si-Wei Liu




On 7/20/2023 11:14 AM, Eugenio Pérez wrote:

Split out the vq reset operation into its own function, as it may be called
with ring reset.

Signed-off-by: Eugenio Pérez 
---
  hw/virtio/vhost-vdpa.c | 16 
  1 file changed, 16 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 6ae276ccde..df2515a247 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -547,6 +547,21 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, 
unsigned idx)
  return vhost_vdpa_set_vring_ready_internal(v, idx, true);
  }
  
+/* TODO: Properly reorder static functions */

+static void vhost_vdpa_svq_stop(struct vhost_dev *dev, unsigned idx);
+static void vhost_vdpa_reset_queue(struct vhost_dev *dev, int idx)
+{
+struct vhost_vdpa *v = dev->opaque;
+
+if (dev->features & VIRTIO_F_RING_RESET) {
+vhost_vdpa_set_vring_ready_internal(v, idx, false);
I'm not sure I understand this patch - this is NOT the spec-defined way
to initiate RING_RESET, is it? Quoting the spec diff from the original
RING_RESET tex doc:


+The device MUST reset the queue when 1 is written to \field{queue_reset}, and
+present a 1 in \field{queue_reset} after the queue has been reset, until the
+driver re-enables the queue via \field{queue_enable} or the device is reset.
+The device MUST present consistent default values after queue reset.
+(see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue 
Reset}).

Or you intend to rewrite it to be spec conforming later on?
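
For reference, the driver-side handshake that spec text describes is
roughly the following toy sketch (a fake register model standing in for
the common-config queue_reset / queue_enable fields; it is not the
vhost/QEMU API):

#include <stdint.h>

static uint16_t queue_reset_reg;
static uint16_t queue_enable_reg;

static void write_queue_reset(uint16_t v)  { queue_reset_reg = v; /* toy device resets at once */ }
static uint16_t read_queue_reset(void)     { return queue_reset_reg; }
static void write_queue_enable(uint16_t v) { queue_enable_reg = v; queue_reset_reg = 0; }

static void ring_reset_then_restart(void)
{
    write_queue_reset(1);                 /* driver requests the reset */
    while (read_queue_reset() != 1) {     /* device presents 1 once the queue is reset */
        /* spin / relax */
    }
    /* ...re-program the ring addresses here if they are to change... */
    write_queue_enable(1);                /* re-enabling clears queue_reset */
}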

-Siwei

+}
+
+if (v->shadow_vqs_enabled) {
+vhost_vdpa_svq_stop(dev, idx - dev->vq_index);
+}
+}
+
  /*
   * The use of this function is for requests that only need to be
   * applied once. Typically such request occurs at the beginning
@@ -1543,4 +1558,5 @@ const VhostOps vdpa_ops = {
  .vhost_force_iommu = vhost_vdpa_force_iommu,
  .vhost_set_config_call = vhost_vdpa_set_config_call,
  .vhost_reset_status = vhost_vdpa_reset_status,
+.vhost_reset_queue = vhost_vdpa_reset_queue,
  };





Re: [RFC PATCH 11/12] vdpa: use SVQ to stall dataplane while NIC state is being restored

2023-07-21 Thread Si-Wei Liu




On 7/20/2023 11:14 AM, Eugenio Pérez wrote:

Some dynamic state of virtio-net vDPA devices is restored from CVQ in
the event of a live migration.  However, the dataplane needs to be disabled
so the NIC does not receive buffers in the invalid ring.

As the default method to achieve it, let's offer a shadow vring with 0
avail idx.  As a fallback method, we will enable the dataplane vqs later, as
proposed previously.
Let's not jump to conclusions too early on what will be the default vs.
fallback [1] - as this is on a latency-sensitive path, I'm not fully
convinced ring reset could perform better than, or as well as, the
deferred dataplane enablement approach on hardware. At this stage I
think RING_RESET has no adoption in vendor devices, while deferred
dataplane enabling is definitely easier and has lower hardware overhead
for vendors to implement. If at some point a vendor's device has to
support RING_RESET for use cases other than live migration (MTU change
propagation for example, a prerequisite for HW GRO), defaulting to
RING_RESET on this SVQ path has no real benefit but needlessly adds
complications to the vendor's device.


[1] 
https://lore.kernel.org/virtualization/bf2164a9-1dfd-14d9-be2a-8bb7620a0...@oracle.com/T/#m15caca6fbb00ca9c00e2b33391297a2d8282ff89


Thanks,
-Siwei



Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 49 +++-
  1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index af83de92f8..e14ae48f23 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -338,10 +338,25 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
  {
  VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
  struct vhost_vdpa *v = &s->vhost_vdpa;
+bool has_cvq = v->dev->vq_index_end % 2;
  
  assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
  
-if (s->always_svq ||

+if (has_cvq && (v->dev->features & VIRTIO_F_RING_RESET)) {
+/*
+ * Offer a fake vring to the device while the state is restored
+ * through CVQ.  That way, the guest will not see packets in unexpected
+ * queues.
+ *
+ * This will be undone after loading all state through CVQ, at
+ * vhost_vdpa_net_load.
+ *
+ * TODO: Future optimizations may skip some SVQ setup and teardown,
+ * like set the right kick and call fd or doorbell maps directly, and
+ * the iova tree.
+ */
+v->shadow_vqs_enabled = true;
+} else if (s->always_svq ||
  migration_is_setup_or_active(migrate_get_current()->state)) {
  v->shadow_vqs_enabled = true;
  v->shadow_data = true;
@@ -738,10 +753,34 @@ static int vhost_vdpa_net_load(NetClientState *nc)
  return r;
  }
  
-for (int i = 0; i < v->dev->vq_index; ++i) {

-r = vhost_vdpa_set_vring_ready(v, i);
-if (unlikely(r)) {
-return r;
+if (v->dev->features & VIRTIO_F_RING_RESET && !s->always_svq &&
+!migration_is_setup_or_active(migrate_get_current()->state)) {
+NICState *nic = qemu_get_nic(s->nc.peer);
+int queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+
+for (int i = 0; i < queue_pairs; ++i) {
+NetClientState *ncs = qemu_get_peer(nic->ncs, i);
+VhostVDPAState *s_i = DO_UPCAST(VhostVDPAState, nc, ncs);
+
+for (int j = 0; j < 2; ++j) {
+vhost_net_virtqueue_reset(v->dev->vdev, ncs->peer, j);
+}
+
+s_i->vhost_vdpa.shadow_vqs_enabled = false;
+
+for (int j = 0; j < 2; ++j) {
+r = vhost_net_virtqueue_restart(v->dev->vdev, ncs->peer, j);
+if (unlikely(r < 0)) {
+return r;
+}
+}
+}
+} else {
+for (int i = 0; i < v->dev->vq_index; ++i) {
+r = vhost_vdpa_set_vring_ready(v, i);
+if (unlikely(r)) {
+return r;
+}
  }
  }
  





[PATCH 00/12] Preparatory patches for live migration downtime improvement

2024-02-14 Thread Si-Wei Liu
This small series is a spin-off from [1], so that the patches
already acked in that large patchset may get merged earlier
without having to wait for those that are still in review.

The last 3 patches (10 - 12) are bug fixes for an issue where
cancellation of an ongoing migration may lead to broken network
connectivity. These are the only outstanding patches in this patchset
with no acknowledgement received as yet. Please try to review
them at the earliest opportunity. Thanks!

Regards,
-Siwei

[1] [PATCH 00/40] vdpa-net: improve migration downtime through descriptor ASID 
and persistent IOTLB
https://lore.kernel.org/qemu-devel/1701970793-6865-1-git-send-email-si-wei@oracle.com/

---

Si-Wei Liu (12):
  vdpa: add back vhost_vdpa_net_first_nc_vdpa
  vdpa: no repeat setting shadow_data
  vdpa: factor out vhost_vdpa_last_dev
  vdpa: factor out vhost_vdpa_net_get_nc_vdpa
  vdpa: add vhost_vdpa_set_address_space_id trace
  vdpa: add vhost_vdpa_get_vring_base trace for svq mode
  vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode
  vdpa: add trace events for vhost_vdpa_net_load_cmd
  vdpa: add trace event for vhost_vdpa_net_load_mq
  vdpa: define SVQ transitioning state for mode switching
  vdpa: indicate transitional state for SVQ switching
  vdpa: fix network breakage after cancelling migration

 hw/virtio/trace-events |  4 ++--
 hw/virtio/vhost-vdpa.c | 27 ++-
 include/hw/virtio/vhost-vdpa.h |  9 +
 net/trace-events   |  6 ++
 net/vhost-vdpa.c   | 33 +
 5 files changed, 68 insertions(+), 11 deletions(-)

-- 
1.8.3.1




[PATCH 02/12] vdpa: no repeat setting shadow_data

2024-02-14 Thread Si-Wei Liu
Since shadow_data is now shared in the parent data struct, it
only needs to be set once, by the first vq. This change
makes shadow_data independent of the SVQ enabled state, which
can be optionally turned off when SVQ descriptors and device
driver areas are all isolated to a separate address space.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4479ffa..06c83b4 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -354,13 +354,12 @@ static int vhost_vdpa_net_data_start(NetClientState *nc)
 if (s->always_svq ||
 migration_is_setup_or_active(migrate_get_current()->state)) {
 v->shadow_vqs_enabled = true;
-v->shared->shadow_data = true;
 } else {
 v->shadow_vqs_enabled = false;
-v->shared->shadow_data = false;
 }
 
 if (v->index == 0) {
+v->shared->shadow_data = v->shadow_vqs_enabled;
 vhost_vdpa_net_data_start_first(s);
 return 0;
 }
-- 
1.8.3.1




[PATCH 12/12] vdpa: fix network breakage after cancelling migration

2024-02-14 Thread Si-Wei Liu
Fix an issue where cancellation of an ongoing migration ends up
with no network connectivity.

When canceling migration, SVQ will be switched back to
passthrough mode, but the right call fd is not programmed to
the device and the SVQ's own call fd is still used. During
this transition period, shadow_vqs_enabled hasn't been set back
to false yet, causing the installation of the call fd to be
inadvertently bypassed.

Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding capabilities")
Cc: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 004110f..dfeca8b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1468,7 +1468,15 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev 
*dev,
 
 /* Remember last call fd because we can switch to SVQ anytime. */
 vhost_svq_set_svq_call_fd(svq, file->fd);
-if (v->shadow_vqs_enabled) {
+/*
+ * When SVQ is transitioning to off, shadow_vqs_enabled has
+ * not been set back to false yet, but the underlying call fd
+ * will have to switch back to the guest notifier to signal the
+ * passthrough virtqueues. In other situations, SVQ's own call
+ * fd shall be used to signal the device model.
+ */
+if (v->shadow_vqs_enabled &&
+v->shared->svq_switching != SVQ_TSTATE_DISABLING) {
 return 0;
 }
 
-- 
1.8.3.1




[PATCH 11/12] vdpa: indicate transitional state for SVQ switching

2024-02-14 Thread Si-Wei Liu
svq_switching indicates whether SVQ mode switching is in
progress, and in which direction. Add the necessary state
updates around where the switching takes place.

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 9f25221..96d95b9 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -317,6 +317,8 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState 
*s, bool enable)
 data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
 cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
   n->max_ncs - n->max_queue_pairs : 0;
+v->shared->svq_switching = enable ?
+SVQ_TSTATE_ENABLING : SVQ_TSTATE_DISABLING;
 /*
  * TODO: vhost_net_stop does suspend, get_base and reset. We can be smarter
  * in the future and resume the device if read-only operations between
@@ -329,6 +331,7 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState 
*s, bool enable)
 if (unlikely(r < 0)) {
 error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
 }
+v->shared->svq_switching = SVQ_TSTATE_DONE;
 }
 
 static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
-- 
1.8.3.1




[PATCH 05/12] vdpa: add vhost_vdpa_set_address_space_id trace

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 3 +++
 net/vhost-vdpa.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index 823a071..aab666a 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -23,3 +23,6 @@ colo_compare_tcp_info(const char *pkt, uint32_t seq, uint32_t 
ack, int hdlen, in
 # filter-rewriter.c
 colo_filter_rewriter_pkt_info(const char *func, const char *src, const char 
*dst, uint32_t seq, uint32_t ack, uint32_t flag) "%s: src/dst: %s/%s p: 
seq/ack=%u/%u  flags=0x%x"
 colo_filter_rewriter_conn_offset(uint32_t offset) ": offset=%u"
+
+# vhost-vdpa.c
+vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4168cad..48a5608 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -29,6 +29,7 @@
 #include "migration/migration.h"
 #include "migration/misc.h"
 #include "hw/virtio/vhost.h"
+#include "trace.h"
 
 /* Todo:need to add the multiqueue support here */
 typedef struct VhostVDPAState {
@@ -440,6 +441,8 @@ static int vhost_vdpa_set_address_space_id(struct 
vhost_vdpa *v,
 };
 int r;
 
+trace_vhost_vdpa_set_address_space_id(v, vq_group, asid_num);
+
 r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid);
 if (unlikely(r < 0)) {
 error_report("Can't set vq group %u asid %u, errno=%d (%s)",
-- 
1.8.3.1




[PATCH 03/12] vdpa: factor out vhost_vdpa_last_dev

2024-02-14 Thread Si-Wei Liu
Generalize the duplicated condition check for the last vq of a vdpa
device into a common function.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index f7162da..1d3154a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -551,6 +551,11 @@ static bool vhost_vdpa_first_dev(struct vhost_dev *dev)
 return v->index == 0;
 }
 
+static bool vhost_vdpa_last_dev(struct vhost_dev *dev)
+{
+return dev->vq_index + dev->nvqs == dev->vq_index_end;
+}
+
 static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
uint64_t *features)
 {
@@ -1317,7 +1322,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
 vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
 }
 
-if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+if (!vhost_vdpa_last_dev(dev)) {
 return 0;
 }
 
@@ -1347,7 +1352,7 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
 
-if (dev->vq_index + dev->nvqs != dev->vq_index_end) {
+if (!vhost_vdpa_last_dev(dev)) {
 return;
 }
 
-- 
1.8.3.1




[PATCH 06/12] vdpa: add vhost_vdpa_get_vring_base trace for svq mode

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events | 2 +-
 hw/virtio/vhost-vdpa.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 77905d1..28d6d78 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -58,7 +58,7 @@ vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned 
long long size, int r
 vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, 
uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, 
uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" 
log_guest_addr: 0x%"PRIx64
 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
 vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
-vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
+vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 
0x%"PRIx64
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 1d3154a..0de7bdf 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1424,6 +1424,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
 
 if (v->shadow_vqs_enabled) {
 ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
+trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, true);
 return 0;
 }
 
@@ -1436,7 +1437,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
*dev,
 }
 
 ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring);
-trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num);
+trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, false);
 return ret;
 }
 
-- 
1.8.3.1




[PATCH 09/12] vdpa: add trace event for vhost_vdpa_net_load_mq

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 1 +
 net/vhost-vdpa.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index 88f56f2..cda960f 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -28,3 +28,4 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": 
offset=%u"
 vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
 vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int 
data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d"
 vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) 
"vdpa state: %p class: %u cmd: %u retval: %d"
+vhost_vdpa_net_load_mq(void *s, int ncurqps) "vdpa state: %p current_qpairs: 
%d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 6ee438f..9f25221 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -901,6 +901,8 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s,
 return 0;
 }
 
+trace_vhost_vdpa_net_load_mq(s, n->curr_queue_pairs);
+
 mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs);
 const struct iovec data = {
 .iov_base = &mq,
-- 
1.8.3.1




[PATCH 01/12] vdpa: add back vhost_vdpa_net_first_nc_vdpa

2024-02-14 Thread Si-Wei Liu
Previous commits removed this function. Add it back because
it will be needed by future patches.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 46e350a..4479ffa 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -280,6 +280,16 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 return size;
 }
 
+
+/** From any vdpa net client, get the netclient of the first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
 {
 struct vhost_vdpa *v = &s->vhost_vdpa;
@@ -492,7 +502,7 @@ dma_map_err:
 
 static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
 struct vhost_vdpa *v;
 int64_t cvq_group;
 int r;
@@ -503,7 +513,8 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 s = DO_UPCAST(VhostVDPAState, nc, nc);
 v = &s->vhost_vdpa;
 
-v->shadow_vqs_enabled = v->shared->shadow_data;
+s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled;
 s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID;
 
 if (v->shared->shadow_data) {
-- 
1.8.3.1




[PATCH 10/12] vdpa: define SVQ transitioning state for mode switching

2024-02-14 Thread Si-Wei Liu
Will be used in following patches.

DISABLING(-1) means SVQ is being switched off to passthrough
mode.

ENABLING(1) means passthrough VQs are being switched to SVQ.

DONE(0) means SVQ switching is completed.

Signed-off-by: Si-Wei Liu 
---
 include/hw/virtio/vhost-vdpa.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index ad754eb..449bf5c 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -30,6 +30,12 @@ typedef struct VhostVDPAHostNotifier {
 void *addr;
 } VhostVDPAHostNotifier;
 
+typedef enum SVQTransitionState {
+SVQ_TSTATE_DISABLING = -1,
+SVQ_TSTATE_DONE,
+SVQ_TSTATE_ENABLING
+} SVQTransitionState;
+
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
 int device_fd;
@@ -67,6 +73,9 @@ typedef struct vhost_vdpa_shared {
 
 /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
 bool shadow_data;
+
+/* SVQ switching is in progress, or already completed? */
+SVQTransitionState svq_switching;
 } VhostVDPAShared;
 
 typedef struct vhost_vdpa {
-- 
1.8.3.1




[PATCH 08/12] vdpa: add trace events for vhost_vdpa_net_load_cmd

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Signed-off-by: Si-Wei Liu 
---
 net/trace-events | 2 ++
 net/vhost-vdpa.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/trace-events b/net/trace-events
index aab666a..88f56f2 100644
--- a/net/trace-events
+++ b/net/trace-events
@@ -26,3 +26,5 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": 
offset=%u"
 
 # vhost-vdpa.c
 vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) 
"vhost_vdpa: %p vq_group: %u asid: %u"
+vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int 
data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d"
+vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) 
"vdpa state: %p class: %u cmd: %u retval: %d"
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 48a5608..6ee438f 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -677,6 +677,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
 
 assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl));
 cmd_size = sizeof(ctrl) + data_size;
+trace_vhost_vdpa_net_load_cmd(s, class, cmd, data_num, data_size);
 if (vhost_svq_available_slots(svq) < 2 ||
 iov_size(out_cursor, 1) < cmd_size) {
 /*
@@ -708,6 +709,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s,
 
 r = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1);
 if (unlikely(r < 0)) {
+trace_vhost_vdpa_net_load_cmd_retval(s, class, cmd, r);
 return r;
 }
 
-- 
1.8.3.1




[PATCH 04/12] vdpa: factor out vhost_vdpa_net_get_nc_vdpa

2024-02-14 Thread Si-Wei Liu
Introduce a new API. No functional change to the existing API.

Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 06c83b4..4168cad 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -281,13 +281,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 }
 
 
-/** From any vdpa net client, get the netclient of the first queue pair */
-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+/** From any vdpa net client, get the netclient of the i-th queue pair */
+static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i)
 {
 NICState *nic = qemu_get_nic(s->nc.peer);
-NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+NetClientState *nc_i = qemu_get_peer(nic->ncs, i);
+
+return DO_UPCAST(VhostVDPAState, nc, nc_i);
+}
 
-return DO_UPCAST(VhostVDPAState, nc, nc0);
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+return vhost_vdpa_net_get_nc_vdpa(s, 0);
 }
 
 static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
-- 
1.8.3.1




[PATCH 07/12] vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode

2024-02-14 Thread Si-Wei Liu
For better debuggability and observability.

Reviewed-by: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/trace-events | 2 +-
 hw/virtio/vhost-vdpa.c | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 28d6d78..20577aa 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -57,7 +57,7 @@ vhost_vdpa_dev_start(void *dev, bool started) "dev: %p 
started: %d"
 vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int 
refcnt, int fd, void *log) "dev: %p base: 0x%"PRIx64" size: %llu refcnt: %d fd: 
%d log: %p"
 vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, 
uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, 
uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" 
log_guest_addr: 0x%"PRIx64
 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
-vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) 
"dev: %p index: %u num: %u"
+vhost_vdpa_set_dev_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, 
bool svq) "dev: %p index: %u num: %u svq: %d"
 vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
 vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p 
index: %u fd: %d"
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 0de7bdf..004110f 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -972,7 +972,10 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, 
uint8_t *config,
 static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev,
  struct vhost_vring_state *ring)
 {
-trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num);
+struct vhost_vdpa *v = dev->opaque;
+
+trace_vhost_vdpa_set_dev_vring_base(dev, ring->index, ring->num,
+v->shadow_vqs_enabled);
 return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring);
 }
 
-- 
1.8.3.1




[PATCH v2 1/2] vhost: dirty log should be per backend type

2024-02-14 Thread Si-Wei Liu
There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where a separate vhost logger has to be
used for each specific vhost type. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost.c | 49 +
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..ef6d9b5 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
 do { } while (0)
 #endif
 
-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -287,6 +287,8 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
 r = -1;
 }
 
+assert(dev->vhost_ops->backend_type == backend_type || r < 0);
+
 return r;
 }
 
@@ -319,16 +321,23 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
 return log;
 }
 
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
 {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX)
+return NULL;
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
 
 if (!log || log->size != size) {
 log = vhost_log_alloc(size, share);
 if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
 } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
 }
 } else {
 ++log->refcnt;
@@ -340,11 +349,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
 static void vhost_log_put(struct vhost_dev *dev, bool sync)
 {
 struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
 
 if (!log) {
 return;
 }
 
+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
 --log->refcnt;
 if (log->refcnt == 0) {
 /* Sync only the range covered by the old log */
@@ -352,13 +370,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
 vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
 }
 
-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
 g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
 qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
 log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
 }
 
 g_free(log);
@@ -376,7 +394,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
 
 static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
 {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
 uint64_t log_base = (uintptr_t)log->log;
 int r;
 
@@ -2037,8 +2056,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
 uint64_t log_base;
 
 hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
   vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {
+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
 log_base = (uintptr_t)hdev->log->log;
 r = hdev->vhost_ops->vhost_set_log_base(hdev,
 hdev->log_size ? log_base : 0,
-- 
1.8.3.1




[PATCH v2 2/2] vhost: Perform memory section dirty scans once per iteration

2024-02-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration increases with the
number of queues set up, e.g. on idle guest migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory update rates the symptom is lack of convergence as soon
as there is a vhost device with a sufficiently high number of queues,
or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly queried once per queue configured with vhost
that exists in the guest. For the virtqueue data this is necessary,
but not for the memory sections, which are the same. So
essentially we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost.c | 75 +++
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index ef6d9b5..997d560 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,9 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_dev *vhost_mem_logger[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_mlog_devices =
+QLIST_HEAD_INITIALIZER(vhost_mlog_devices);
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +152,53 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static bool vhost_log_dev_enabled(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == vhost_mem_logger[dev->vhost_ops->backend_type];
+}
+
+static void vhost_mlog_set_dev(struct vhost_dev *hdev, bool enable)
+{
+struct vhost_dev *logdev = NULL;
+VhostBackendType backend_type;
+bool reelect = false;
+
+assert(hdev->vhost_ops);
+assert(hdev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(hdev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+backend_type = hdev->vhost_ops->backend_type;
+
+if (enable && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+reelect = !vhost_mem_logger[backend_type];
+QLIST_INSERT_HEAD(&vhost_mlog_devices, hdev, logdev_entry);
+} else if (!enable && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+reelect = vhost_mem_logger[backend_type] == hdev;
+QLIST_REMOVE(hdev, logdev_entry);
+}
+
+if (!reelect)
+return;
+
+QLIST_FOREACH(hdev, &vhost_mlog_devices, logdev_entry) {
+if (!hdev->vhost_ops ||
+hdev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_NONE ||
+hdev->vhost_ops->backend_type >= VHOST_BACKEND_TYPE_MAX)
+continue;
+
+if (hdev->vhost_ops->backend_type == backend_type) {
+logdev = hdev;
+break;
+}
+}
+
+vhost_mem_logger[backend_type] = logdev;
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +216,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vhost_memory_region *reg = dev->mem->regions + i;
-vhost_dev_sync_region(dev, section, start_addr, end_addr,
-  reg->guest_phys_addr,
-  range_get_last(reg->guest_phys_addr,
- reg->memory_size));
+if (vhost_log_dev_enabled(dev)) {
+for (i = 0; i < dev->mem->nregions; ++i) {
+struct vhost_memory_region *reg = dev->mem->regions + i;
+vhost_dev_sync_region(dev, section, start_addr, end_addr,
+  reg->guest_phys_addr,
+  range_get_last(reg->guest_phys_addr,
+   

Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-02-14 Thread Si-Wei Liu

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the detail yet, thinking you would need to be
aware of ahead.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this?
Didn't get a chance to look into the details yet in the past week, but
thought it may have something to do with the (internals of) iova tree
range allocation and the lookup routine. It started to fall apart at the
first vhost_vdpa_dma_unmap call showing up in the trace events, where it
should've gotten IOVA=0x201000, but an incorrect IOVA address
0x1000 ended up being returned from the iova tree lookup routine.


HVA                             GPA                     IOVA
--------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)        [0x0, 0x8000)           [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)        [0x1, 0x208000)         [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)        [0xfeda, 0xfedc)        [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)        [0xfeda, 0xfedc)        [0x1000, 0x2) ???
                                shouldn't it be [0x201000, 0x221000) ???


PS, I will be taking off from today and for the next two weeks. Will try 
to help out looking more closely after I get back.


-Siwei

  Can't merge patches which are known to break things ...




Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-02-14 Thread Si-Wei Liu

Hi Eugenio,

To answer the question you had in the sync meeting: I've just tried, and
the issue is also reproducible even with the VGA device and VNC display
removed, and also reproducible with 8G mem size. As you already know, I
can only repro it with x-svq=on.


Regards,
-Siwei

On 2/13/2024 8:26 AM, Eugenio Perez Martin wrote:

On Tue, Feb 13, 2024 at 11:22 AM Michael S. Tsirkin  wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the detail yet, thinking you would need to be
aware of ahead.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this? Can't merge patches which are known to break things ...


Sorry for the lack of news, I'll try to reproduce this week. Meanwhile
this patch should not be merged, as you mention.

Thanks!






Re: [PATCH v2 1/2] vhost: dirty log should be per backend type

2024-02-14 Thread Si-Wei Liu

Hi Michael,

I'm taking off for 2+ weeks, but please feel free to provide comments and
feedback while I'm off. I'll still be checking emails, and will
address any open items as soon as I am back.


Thanks,
-Siwei

On 2/14/2024 3:50 AM, Si-Wei Liu wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
  hw/virtio/vhost.c | 49 +
  1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..ef6d9b5 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
  do { } while (0)
  #endif
  
-static struct vhost_log *vhost_log;

-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
  
  /* Memslots used by backends that support private memslots (without an fd). */

  static unsigned int used_memslots;
@@ -287,6 +287,8 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
  r = -1;
  }
  
+assert(dev->vhost_ops->backend_type == backend_type || r < 0);

+
  return r;
  }
  
@@ -319,16 +321,23 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share)

  return log;
  }
  
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)

+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
  {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX)
+return NULL;
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
  
  if (!log || log->size != size) {

  log = vhost_log_alloc(size, share);
  if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
  } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
  }
  } else {
  ++log->refcnt;
@@ -340,11 +349,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
  static void vhost_log_put(struct vhost_dev *dev, bool sync)
  {
  struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
  
  if (!log) {

  return;
  }
  
+assert(dev->vhost_ops);

+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
  --log->refcnt;
  if (log->refcnt == 0) {
  /* Sync only the range covered by the old log */
@@ -352,13 +370,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
  vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
  }
  
-if (vhost_log == log) {

+if (vhost_log[backend_type] == log) {
  g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
  qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
  log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
  }
  
  g_free(log);

@@ -376,7 +394,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
  
  static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)

  {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
  uint64_t log_base = (uintptr_t)log->log;
  int r;
  
@@ -2037,8 +2056,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)

  uint64_t log_base;
  
  hdev->log_size = vhost_get_log_size(hdev);

-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {
+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
  log_base = (uintptr_t)hdev->log->log;
  r = hdev->vhost_ops->vhost_set_log_base(hdev,
  hdev->log_size ? log_base : 0,





Re: [PATCH 04/12] vdpa: factor out vhost_vdpa_net_get_nc_vdpa

2024-02-14 Thread Si-Wei Liu




On 2/14/2024 10:54 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 1:39 PM Si-Wei Liu  wrote:

Introduce new API. No functional change on existing API.

Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 

I'm ok with the new function, but doesn't the compiler complain
because the added static function is not used?
Hmmm, which one? vhost_vdpa_net_get_nc_vdpa is used by 
vhost_vdpa_net_first_nc_vdpa internally, and 
vhost_vdpa_net_first_nc_vdpa is used by vhost_vdpa_net_cvq_start (Patch 
01). I think we should be fine?


-Siwei



---
  net/vhost-vdpa.c | 13 +
  1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 06c83b4..4168cad 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -281,13 +281,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
  }


-/** From any vdpa net client, get the netclient of the first queue pair */
-static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+/** From any vdpa net client, get the netclient of the i-th queue pair */
+static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i)
  {
  NICState *nic = qemu_get_nic(s->nc.peer);
-NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+NetClientState *nc_i = qemu_get_peer(nic->ncs, i);
+
+return DO_UPCAST(VhostVDPAState, nc, nc_i);
+}

-return DO_UPCAST(VhostVDPAState, nc, nc0);
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+return vhost_vdpa_net_get_nc_vdpa(s, 0);
  }

  static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
--
1.8.3.1






Re: [PATCH 1/6] vdpa: check for iova tree initialized at net_client_start

2024-01-31 Thread Si-Wei Liu

Hi Eugenio,

Maybe there's some patch missing, but I saw this core dump when x-svq=on
is specified while waiting for the incoming migration on the destination host:


(gdb) bt
#0  0x5643b24cc13c in vhost_iova_tree_map_alloc (tree=0x0, 
map=map@entry=0x7ffd58c54830) at ../hw/virtio/vhost-iova-tree.c:89
#1  0x5643b234f193 in vhost_vdpa_listener_region_add 
(listener=0x5643b4403fd8, section=0x7ffd58c548d0) at 
/home/opc/qemu-upstream/include/qemu/int128.h:34
#2  0x5643b24e6a61 in address_space_update_topology_pass 
(as=as@entry=0x5643b35a3840 , 
old_view=old_view@entry=0x5643b442b5f0, 
new_view=new_view@entry=0x5643b44a2130, adding=adding@entry=true) at 
../system/memory.c:1004
#3  0x5643b24e6e60 in address_space_set_flatview (as=0x5643b35a3840 
) at ../system/memory.c:1080
#4  0x5643b24ea750 in memory_region_transaction_commit () at 
../system/memory.c:1132
#5  0x5643b24ea750 in memory_region_transaction_commit () at 
../system/memory.c:1117
#6  0x5643b241f4c1 in pc_memory_init 
(pcms=pcms@entry=0x5643b43c8400, 
system_memory=system_memory@entry=0x5643b43d18b0, 
rom_memory=rom_memory@entry=0x5643b449a960, pci_hole64_size=<optimized out>) at ../hw/i386/pc.c:954
#7  0x5643b240d088 in pc_q35_init (machine=0x5643b43c8400) at 
../hw/i386/pc_q35.c:222
#8  0x5643b21e1da8 in machine_run_board_init (machine=out>, mem_path=, errp=, 
errp@entry=0x5643b35b7958 )

    at ../hw/core/machine.c:1509
#9  0x5643b237c0f6 in qmp_x_exit_preconfig () at ../system/vl.c:2613
#10 0x5643b237c0f6 in qmp_x_exit_preconfig (errp=<optimized out>) at 
../system/vl.c:2704
#11 0x5643b237fcdd in qemu_init (errp=<optimized out>) at 
../system/vl.c:3753
#12 0x5643b237fcdd in qemu_init (argc=<optimized out>, 
argv=<optimized out>) at ../system/vl.c:3753
#13 0x5643b2158249 in main (argc=<optimized out>, argv=<optimized out>) at ../system/main.c:47


Shall we create the iova tree early during vdpa dev init for the x-svq=on
case?


+    if (s->always_svq) {
+    /* iova tree is needed because of SVQ */
+    shared->iova_tree = vhost_iova_tree_new(shared->iova_range.first,
+ shared->iova_range.last);
+    }
+

Regards,
-Siwei

On 1/11/2024 11:02 AM, Eugenio Pérez wrote:

To map the guest memory while it is migrating we need to create the
iova_tree, as long as the destination uses x-svq=on. Checking to not
override it.

The function vhost_vdpa_net_client_stop clears it if the device is
stopped. If the guest starts the device again, the iova tree is
recreated by vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start
if needed, so the old behavior is kept.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 3726ee5d67..e11b390466 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -341,7 +341,9 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState 
*s)
  
  migration_add_notifier(&s->migration_state,

 vdpa_net_migration_state_notifier);
-if (v->shadow_vqs_enabled) {
+
+/* iova_tree may be initialized by vhost_vdpa_net_load_setup */
+if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
  v->shared->iova_tree = 
vhost_iova_tree_new(v->shared->iova_range.first,
 
v->shared->iova_range.last);
  }





Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-02-05 Thread Si-Wei Liu

Hi Eugenio,

This new code looks good to me and the original issue I saw
with x-svq=on should be gone. However, after rebasing my tree on top of
this, I found a new failure around setting up guest mappings at
early boot; please see attached the specific QEMU config and
corresponding event traces. I haven't checked into the details yet, but
thought you would need to be aware of it ahead of time.


Regards,
-Siwei

On 2/1/2024 10:09 AM, Eugenio Pérez wrote:

As we are moving to keep the mapping through the whole vdpa device life
instead of resetting it at VirtIO reset, we need to move all its
dependencies to the initialization too.  In particular, devices with
x-svq=on need a valid iova_tree from the beginning.

Simplify the code by also consolidating the two creation points: the first
data vq in case SVQ is active, and CVQ start in case only CVQ uses it.

Suggested-by: Si-Wei Liu 
Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h | 16 ++-
  net/vhost-vdpa.c   | 36 +++---
  2 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 03ed2f2be3..ad754eb803 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -37,7 +37,21 @@ typedef struct vhost_vdpa_shared {
  struct vhost_vdpa_iova_range iova_range;
  QLIST_HEAD(, vdpa_iommu) iommu_list;
  
-/* IOVA mapping used by the Shadow Virtqueue */

+/*
+ * IOVA mapping used by the Shadow Virtqueue
+ *
+ * It is shared among all ASID for simplicity, whether CVQ shares ASID with
+ * guest or not:
+ * - Memory listener need access to guest's memory addresses allocated in
+ *   the IOVA tree.
+ * - There should be plenty of IOVA address space for both ASID not to
+ *   worry about collisions between them.  Guest's translations are still
+ *   validated with virtio virtqueue_pop so there is no risk for the guest
+ *   to access memory that it shouldn't.
+ *
+ * To allocate a iova tree per ASID is doable but it complicates the code
+ * and it is not worth it for the moment.
+ */
  VhostIOVATree *iova_tree;
  
  /* Copy of backend features */

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index cc589dd148..57edcf34d0 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -232,6 +232,7 @@ static void vhost_vdpa_cleanup(NetClientState *nc)
  return;
  }
  qemu_close(s->vhost_vdpa.shared->device_fd);
+g_clear_pointer(&s->vhost_vdpa.shared->iova_tree, vhost_iova_tree_delete);
  g_free(s->vhost_vdpa.shared);
  }
  
@@ -329,16 +330,8 @@ static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
  
  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)

  {
-struct vhost_vdpa *v = &s->vhost_vdpa;
-
  migration_add_notifier(&s->migration_state,
 vdpa_net_migration_state_notifier);
-
-/* iova_tree may be initialized by vhost_vdpa_net_load_setup */
-if (v->shadow_vqs_enabled && !v->shared->iova_tree) {
-v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first,
-   v->shared->iova_range.last);
-}
  }
  
  static int vhost_vdpa_net_data_start(NetClientState *nc)

@@ -383,19 +376,12 @@ static int vhost_vdpa_net_data_load(NetClientState *nc)
  static void vhost_vdpa_net_client_stop(NetClientState *nc)
  {
  VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev;
  
  assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
  
  if (s->vhost_vdpa.index == 0) {

  migration_remove_notifier(&s->migration_state);
  }
-
-dev = s->vhost_vdpa.dev;
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.shared->iova_tree,
-vhost_iova_tree_delete);
-}
  }
  
  static NetClientInfo net_vhost_vdpa_info = {

@@ -557,24 +543,6 @@ out:
  return 0;
  }
  
-/*

- * If other vhost_vdpa already have an iova_tree, reuse it for simplicity,
- * whether CVQ shares ASID with guest or not, because:
- * - Memory listener need access to guest's memory addresses allocated in
- *   the IOVA tree.
- * - There should be plenty of IOVA address space for both ASID not to
- *   worry about collisions between them.  Guest's translations are still
- *   validated with virtio virtqueue_pop so there is no risk for the guest
- *   to access memory that it shouldn't.
- *
- * To allocate a iova tree per ASID is doable but it complicates the code
- * and it is not worth it for the moment.
- */
-if (!v->shared->iova_tree) {
-v->shared->

Re: [PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-20 Thread Si-Wei Liu




On 3/19/2024 8:25 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:06 AM Si-Wei Liu  wrote:



On 3/17/2024 8:20 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:33 AM Si-Wei Liu  wrote:


On 3/14/2024 8:50 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

It's better to describe what's the advantage of doing this.

Yes, I can add that to the log. Although it's a niche use case, it was
actually a long-standing limitation / bug that vhost-user and
vhost-kernel loggers can't co-exist in the same QEMU process; today it
just ends up as a silent failure. This bug fix removes that
implicit limitation in the code.

Ok.


Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
 - remove checking NULL return value from vhost_log_get

v2->v3:
 - remove non-effective assertion that never be reached
 - do not return NULL from vhost_log_get()
 - add neccessary assertions to vhost_log_get()
---
hw/virtio/vhost.c | 45 +
1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
do { } while (0)
#endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

/* Memslots used by backends that support private memslots (without an fd). 
*/
static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
r = -1;
}

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+

Under which condition could we hit this?

Just in case some other function inadvertently corrupted this earlier,
we have to capture the discrepancy in the first place... On the other hand,
it will be helpful for other vhost backend writers to diagnose a day-one
bug in the code. I feel a code comment alone here would not be
sufficient/helpful.

See below.


It doesn't seem good to assert on local logic.

It seems to me quite a few local asserts are in the same file already,
vhost_save_backend_state,

For example it has assert for

assert(!dev->started);

which is not the logic of the function itself but requires
vhost_dev_start() not to have been called before.

But it looks like in this patch you assert code that is just a few lines
above the assert itself?

Yes, that was the intent - e.g. xxx_ops may already contain a corrupted
xxx_ops.backend_type before coming to this
vhost_set_backend_type() function. And we may capture this corrupted
state by asserting the expected xxx_ops.backend_type (to be consistent
with the backend_type passed in),

This can happen for all variables. Not sure why backend_ops is special.
The assert only checks the backend_type field. The other op 
fields in backend_ops have a similar assert within the op function itself 
as well. For example, vhost_user_requires_shm_log() and a lot of other 
vhost_user ops have the following:


    assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);

vhost_vdpa_vq_get_addr() and a lot of other vhost_vdpa ops have:

    assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA);

vhost_kernel ops has similar assertions as well.

The reason it has to be checked here is that the callers of 
vhost_log_get() now pass dev->vhost_ops->backend_type to the API and 
are unable to verify the validity of the backend_type by 
themselves. vhost_log_get() has the necessary asserts to bound-check 
the vhost_log[] and vhost_log_shm[] arrays, but a specific assert 
against the exact backend type in vhost_set_backend_type() will further 
harden the implementation in vhost_log_get() and other backend ops.
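
To make that concrete, this is the call pattern being hardened (trimmed from patch 1/2 in this series):

    hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
                              hdev->log_size,
                              vhost_dev_log_is_shared(hdev));

vhost_log_get() then asserts that backend_type lies strictly between VHOST_BACKEND_TYPE_NONE and VHOST_BACKEND_TYPE_MAX before indexing vhost_log[] / vhost_log_shm[], but it cannot know whether that value matches the backend that was actually set.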





which needs to be done in the first place
when this discrepancy is detected. In practice I think there should be
no harm in adding this assert, and it will add a warranted guarantee to the
current code.

For example, such corruption can happen after the assert(), so it's a TOCTOU issue.
Sure, it's best effort only. As pointed out earlier, I think that together 
with this, the other similar asserts already present in various backend 
ops could be helpful to nail down the earliest point, or at least a 
specific range, where things may go wrong in the first place.


Thanks,
-Siwei



Thanks


Regards,
-Siwei


dev->vhost_ops = &xxx_ops;

...

assert(dev->vhost_ops->backend_type == backend_type)

?

Thanks


vhost_load_backend_state,
vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why is a local
assert a problem?

Thanks,
-Siwei


Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-20 Thread Si-Wei Liu




On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:



On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:


On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
 - add comment to clarify effect on cache locality and
   performance

v2 -> v3:
 - add after-fix benchmark to commit log
 - rename vhost_log_dev_enabled to vhost_dev_should_log
 - remove unneeded comparisons for backend_type
 - use QLIST array instead of single flat list to store vhost
   logger devices
 - simplify logger election logic
---
hw/virtio/vhost.c | 67 
++-
include/hw/virtio/vhost.h |  1 +
2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

/* Memslots used by backends that support private memslots (without an fd). 
*/
static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
}
}

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. We don't want to complicate the check here by calling into
vhost_dev_log_is_shared() every time .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead depends on the specific
backend's implementation of .vhost_requires_shm_log(), which the common
vhost layer should not make assumptions about, nor rely on the current
implementation of.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
  return dev->vhost_ops->vhost_requires_shm_log &&
 dev->vhost_ops->vhost_requires_shm_log(dev);
}
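
For illustration only, the alternative check under discussion would look roughly like the sketch below (the helper name is made up, and this is not what the patch does); note that it has to go through the backend op on every .log_sync() invocation:

static bool vhost_dev_should_log_alt(struct vhost_dev *dev)
{
    VhostBackendType backend_type = dev->vhost_ops->backend_type;
    /* Consult the backend each time to pick the right global logger array. */
    struct vhost_log *log = vhost_dev_log_is_shared(dev) ?
        vhost_log_shm[backend_type] : vhost_log[backend_type];

    return dev->log == log;
}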

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can end up with a boolean to record that
instead of querying the ops?
Right now the log type won't change during runtime, but I am not sure if 
this may prohibit a future revisit to allow changing it at runtime; then 
there'll be complex code involved to maintain the state.


Other than this, I think it's insufficient to just check the shm log 
vs. the normal log. The logger selection requires identifying a leading 
logger device that gets elected in vhost_dev_elect_mem_logger(); as all 
the dev->log pointers point to the same reference-counted logger, we 
would have to add an extra field and complex logic to maintain the election

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-21 Thread Si-Wei Liu




On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:



On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:


On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
  - add comment to clarify effect on cache locality and
performance

v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic
---
 hw/virtio/vhost.c | 67 
++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

 /* Memslots used by backends that support private memslots (without an 
fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() everytime when the .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead will have to depend on the specific
backend's implementation for .vhost_requires_shm_log(), which the common
vhost layer should not assume upon or rely on the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
   return dev->vhost_ops->vhost_requires_shm_log &&
  dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can endup with a boolean to record that
instead of a query ops?

Right now the log type won't change during runtime, but I am not sure if
this may prohibit future revisit to allow change at the runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involvled to maintain the state.

Other than this, I think it's insufficient to just check the shm log
v.s. normal log. The logger device requires to identify a leading logger
device that gets elected in vhost_dev_elect_mem_logger(

Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-22 Thread Si-Wei Liu




On 3/21/2024 10:08 PM, Jason Wang wrote:

On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:



On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:


On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:

On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
   - add comment to clarify effect on cache locality and
 performance

v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic
---
  hw/virtio/vhost.c | 67 
++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an 
fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() everytime when the .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead will have to depend on the specific
backend's implementation for .vhost_requires_shm_log(), which the common
vhost layer should not assume upon or rely on the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
return dev->vhost_ops->vhost_requires_shm_log &&
   dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can endup with a boolean to record that
instead of a query ops?

Right now the log type won't change during runtime, but I am not sure if
this may prohibit future revisit to allow change at the runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involvled to maintain the state.

Other than this, I think it's insufficient to just check the shm log
v.s. normal lo

Re: [External] : Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-25 Thread Si-Wei Liu




On 3/24/2024 11:13 PM, Jason Wang wrote:

On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu  wrote:



On 3/21/2024 10:08 PM, Jason Wang wrote:

On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu  wrote:


On 3/20/2024 8:56 PM, Jason Wang wrote:

On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu  wrote:

On 3/19/2024 8:27 PM, Jason Wang wrote:

On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu  wrote:

On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:

On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
- add comment to clarify effect on cache locality and
  performance

v2 -> v3:
- add after-fix benchmark to commit log
- rename vhost_log_dev_enabled to vhost_dev_should_log
- remove unneeded comparisons for backend_type
- use QLIST array instead of single flat list to store vhost
  logger devices
- simplify logger election logic
---
   hw/virtio/vhost.c | 67 
++-
   include/hw/virtio/vhost.h |  1 +
   2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an 
fd). */
   static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
   }
   }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() everytime when the .log_sync() is called.

It has very low overhead, isn't it?

Whether this has low overhead will have to depend on the specific
backend's implementation for .vhost_requires_shm_log(), which the common
vhost layer should not assume upon or rely on the current implementation.


static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
 return dev->vhost_ops->vhost_requires_shm_log &&
dev->vhost_ops->vhost_requires_shm_log(dev);
}

For example, if I understand the code correctly, the log type won't be
changed during runtime, so we can endup with a boolean to record that
instead of a query ops?

Right now the log type won't change during runtime, but I am not sure if
this may prohibit future revisit to allow change at the runtime,

We can be bothered when we have such a request then.


then
there'll be complex code involvled

Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-01 Thread Si-Wei Liu




On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the detail yet, thinking you would need to be
aware of ahead.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this?

I didn't get a chance to look into the details in the past week, but
thought it may have something to do with the (internals of) iova tree
range allocation and the lookup routine. It started to fall apart at the
first vhost_vdpa_dma_unmap call showing up in the trace events, where it
should've gotten IOVA=0x201000, but an incorrect IOVA address
0x1000 ended up being returned from the iova tree lookup routine.

HVA                                GPA                      IOVA
----------------------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)           [0x0, 0x8000)            [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)           [0x1, 0x208000)          [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)           [0xfeda, 0xfedc)         [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)           [0xfeda, 0xfedc)         [0x1000, 0x2) ???
                                                            shouldn't it be [0x201000, 0x221000) ???

It looks like the SVQ iova tree lookup routine vhost_iova_tree_find_iova(), 
which is called from vhost_vdpa_listener_region_del(), can't properly 
deal with overlapped regions. Specifically, q35's mch_realize() has the 
following:


579     memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high",
580                              mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
581                              MCH_HOST_BRIDGE_SMRAM_C_SIZE);
582     memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
583                                         &mch->open_high_smram, 1);
584     memory_region_set_enabled(&mch->open_high_smram, false);

#0  0x564c30bf6980 in iova_tree_find_address_iterator (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at ../util/iova-tree.c:96
#1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
#2  0x564c30bf6b53 in iova_tree_find_iova (tree=<optimized out>, map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
#3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=<optimized out>, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
#4  0x564c3085e49d in vhost_vdpa_listener_region_del (listener=0x564c331024c8, section=0x7fffb6d74aa0) at ../hw/virtio/vhost-vdpa.c:444
#5  0x564c309f4931 in address_space_update_topology_pass (as=as@entry=0x564c31ab1840, old_view=old_view@entry=0x564c33364cc0, new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at ../system/memory.c:977
#6  0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840) at ../system/memory.c:1079
#7  0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1132
#8  0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1117
#9  0x564c307cce64 in mch_realize (d=<optimized out>, errp=<optimized out>) at ../hw/pci-host/q35.c:584


However, it looks like iova_tree_find_address_iterator() only checks whether 
the translated address (HVA) falls into the range when trying to locate 
the desired IOVA, causing the first DMAMap that happens to overlap in 
the translated address (HVA) space to be returned prematurely:


 89 static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
 90                                                 gpointer data)
 91 {
  :
  :
 99     if (map->translated_addr + map->size < needle->translated_addr ||
100         needle->translated_addr + needle->size < map->translated_addr) {
101         return false;
102     }
103
104     args->result = map;
105     return true;
106 }

The QEMU trace file reveals that the first DMAMap below gets returned 
incorrectly instead of the second, the latter of which is what the 
actual IOVA corresponds to:


HVA                                GPA                      IOVA
[0x7f7903e0, 0x7f7983e0)           [0x0, 0x8000)            [0x1000, 0x80001000)
[0x7f7903ea, 0x7f7903ec)           [0xfeda, 0xfedc)         [0x201000, 0x221000)


Maybe in addition to checking the HVA range, we should also match the GPA, 
or at least require the size to match exactly?
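
For illustration, a minimal sketch of the size-match variant of that check (matching on GPA would additionally require recording the GPA in the tree, which DMAMap does not carry today):

    /* Sketch only: besides the HVA overlap test, require an exact size
     * match, so an aliased sub-region such as open_high_smram no longer
     * shadows the larger map it points into. */
    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr ||
        map->size != needle->size) {
        return false;
    }

    args->result = map;
    return true;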



Yes, 

Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init

2024-04-02 Thread Si-Wei Liu



On 4/2/2024 5:01 AM, Eugenio Perez Martin wrote:

On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu  wrote:



On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote:

On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu  wrote:

Hi Michael,

On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote:

On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote:

Hi Eugenio,

I thought this new code looks good to me and the original issue I saw with
x-svq=on should be gone. However, after rebase my tree on top of this,
there's a new failure I found around setting up guest mappings at early
boot, please see attached the specific QEMU config and corresponding event
traces. Haven't checked into the detail yet, thinking you would need to be
aware of ahead.

Regards,
-Siwei

Eugenio were you able to reproduce? Siwei did you have time to
look into this?

Didn't get a chance to look into the detail yet in the past week, but
thought it may have something to do with the (internals of) iova tree
range allocation and the lookup routine. It started to fall apart at the
first vhost_vdpa_dma_unmap call showing up in the trace events, where it
should've gotten IOVA=0x201000,  but an incorrect IOVA address
0x1000 was ended up returning from the iova tree lookup routine.

HVAGPAIOVA
-
Map
[0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)[0x1, 0x208000)
[0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
[0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x1000,
0x2) ???
   shouldn't it be [0x201000,
0x221000) ???


It looks the SVQ iova tree lookup routine vhost_iova_tree_find_iova(),
which is called from vhost_vdpa_listener_region_del(), can't properly
deal with overlapped region. Specifically, q35's mch_realize() has the
following:

579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch),
"smram-open-high",
580  mch->ram_memory,
MCH_HOST_BRIDGE_SMRAM_C_BASE,
581  MCH_HOST_BRIDGE_SMRAM_C_SIZE);
582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
583 &mch->open_high_smram, 1);
584 memory_region_set_enabled(&mch->open_high_smram, false);

#0  0x564c30bf6980 in iova_tree_find_address_iterator
(key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at
../util/iova-tree.c:96
#1  0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0
#2  0x564c30bf6b53 in iova_tree_find_iova (tree=,
map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114
#3  0x564c309da0a9 in vhost_iova_tree_find_iova (tree=, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70
#4  0x564c3085e49d in vhost_vdpa_listener_region_del
(listener=0x564c331024c8, section=0x7fffb6d74aa0) at
../hw/virtio/vhost-vdpa.c:444
#5  0x564c309f4931 in address_space_update_topology_pass
(as=as@entry=0x564c31ab1840 ,
old_view=old_view@entry=0x564c33364cc0,
new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at
../system/memory.c:977
#6  0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840
) at ../system/memory.c:1079
#7  0x564c309f86d0 in memory_region_transaction_commit () at
../system/memory.c:1132
#8  0x564c309f86d0 in memory_region_transaction_commit () at
../system/memory.c:1117
#9  0x564c307cce64 in mch_realize (d=,
errp=) at ../hw/pci-host/q35.c:584

However, it looks like iova_tree_find_address_iterator() only check if
the translated address (HVA) falls in to the range when trying to locate
the desired IOVA, causing the first DMAMap that happens to overlap in
the translated address (HVA) space to be returned prematurely:

   89 static gboolean iova_tree_find_address_iterator(gpointer key,
gpointer value,
   90 gpointer data)
   91 {
   :
   :
   99 if (map->translated_addr + map->size < needle->translated_addr ||
100 needle->translated_addr + needle->size < map->translated_addr) {
101 return false;
102 }
103
104 args->result = map;
105 return true;
106 }

In the QEMU trace file, it reveals that the first DMAMap as below gets
returned incorrectly instead the second, the latter of which is what the
actual IOVA corresponds to:

HVA GPA 
IOVA
[0x7f7903e0, 0x7f7983e0)[0x0, 0x8000)   
[0x1000, 0x80001000)
[0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc)
[0x201000, 0x221000)


I think the analysis is totally accurat

Re: [PATCH 12/12] vdpa: fix network breakage after cancelling migration

2024-03-13 Thread Si-Wei Liu




On 3/13/2024 11:12 AM, Michael Tokarev wrote:

14.02.2024 14:28, Si-Wei Liu wrote:

Fix an issue where cancellation of an ongoing migration ends up
with no network connectivity.

When canceling migration, SVQ will be switched back to the
passthrough mode, but the right call fd is not programmed to
the device and the svq's own call fd is still used. During this
transitioning period, shadow_vqs_enabled hasn't been set back to
false yet, causing the installation of the call fd to be
inadvertently bypassed.

Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding 
capabilities")

Cc: Eugenio Pérez 
Acked-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
  hw/virtio/vhost-vdpa.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)


Is this -stable material?
Probably yes; the prerequisites of this patch are PATCH #10 and #11 
from this series (where SVQ_TSTATE_DISABLING gets defined and set).




If yes, is it also applicable for stable-7.2 (mentioned commit is in 
7.2.0),

which lacks v7.2.0-2327-gb276524386 "vdpa: Remember last call fd set",
or should this one also be picked up?
Eugenio can judge, but it seems to me the relevant code path cannot be 
effectively exercised, as the dynamic SVQ feature (switching over to SVQ 
dynamically when migration is started) is not supported in 7.2. It's 
probably not worth cherry-picking this one to 7.2. Cherry-picking to 
stable-8.0 and above should be applicable though (it needs some tweaks on 
patch #10 to move svq_switching from @struct VhostVDPAShared to @struct 
vhost_vdpa).


Regards,
-Siwei



Thanks,

/mjt


diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 004110f..dfeca8b 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1468,7 +1468,15 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
         /* Remember last call fd because we can switch to SVQ anytime. */
         vhost_svq_set_svq_call_fd(svq, file->fd);
-        if (v->shadow_vqs_enabled) {
+        /*
+         * When SVQ is transitioning to off, shadow_vqs_enabled has
+         * not been set back to false yet, but the underlying call fd
+         * will have to switch back to the guest notifier to signal the
+         * passthrough virtqueues. In other situations, SVQ's own call
+         * fd shall be used to signal the device model.
+         */
+        if (v->shadow_vqs_enabled &&
+            v->shared->svq_switching != SVQ_TSTATE_DISABLING) {
             return 0;
         }







Re: [PATCH v2 1/2] vhost: dirty log should be per backend type

2024-03-13 Thread Si-Wei Liu




On 3/12/2024 8:07 AM, Michael S. Tsirkin wrote:

On Wed, Feb 14, 2024 at 10:42:29AM -0800, Si-Wei Liu wrote:

Hi Michael,

I'm taking off for 2+ weeks, but please feel free to provide comments and
feedback while I'm off. I'll still be checking emails, and will address
any open items as soon as I am back.

Thanks,
-Siwei

Eugenio sent some comments. I don't have more, just address these
please. Thanks!


Thanks Michael, good to know you don't have more comments other than the 
ones from Eugenio. I will post a v3 shortly to address his comments.


-Siwei



[PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on, the cost of
each dirty tracking iteration increases with the number of queues that
are set up; e.g. on idle guest migration the following is observed with
virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence as soon
as there is a vhost device with a sufficiently high number of queues, or
a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) it will redundantly
query the *shared log* as many times as there are queues configured with
vhost in the guest. For the virtqueue data this is necessary,
but not for the memory sections, which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making the mem-section logger a singleton instance, a constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic

---
 hw/virtio/vhost.c | 63 ++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index efe2f74..d91858b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ */
+QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +204,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vhost_memory_region *reg = dev->mem->regions + i;
-vhost_dev_sync_region(dev, section, start_addr, end_addr,
-  reg->guest_phys_addr,
-  range_get_last(reg->guest_phys_addr,
- reg->memory_size));
+if (vhost_dev_should_log(dev)) {
+for (i = 0; i < dev->mem->nregio

[PATCH v3 1/2] vhost: dirty log should be per backend type

2024-03-14 Thread Si-Wei Liu
There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
v2->v3: 
  - remove non-effective assertion that can never be reached
  - do not return NULL from vhost_log_get()
  - add necessary assertions to vhost_log_get()

---
 hw/virtio/vhost.c | 50 ++
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..efe2f74 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
 do { } while (0)
 #endif
 
-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
 r = -1;
 }
 
+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+
 return r;
 }
 
@@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
 return log;
 }
 
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
 {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
 
 if (!log || log->size != size) {
 log = vhost_log_alloc(size, share);
 if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
 } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
 }
 } else {
 ++log->refcnt;
@@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
 static void vhost_log_put(struct vhost_dev *dev, bool sync)
 {
 struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
 
 if (!log) {
 return;
 }
 
+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
 --log->refcnt;
 if (log->refcnt == 0) {
 /* Sync only the range covered by the old log */
@@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
 vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
 }
 
-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
 g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
 qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
 log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
 }
 
 g_free(log);
@@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
 
 static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
 {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
 uint64_t log_base = (uintptr_t)log->log;
 int r;
 
@@ -2037,8 +2057,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
 uint64_t log_base;
 
 hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
   vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {
+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
 log_base = (uintptr_t)hdev->log->log;
 r = hdev->vhost_ops->vhost_set_log_base(hdev,
 hdev->log_size ? log_base : 0,
-- 
1.8.3.1




Re: [PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu




On 3/14/2024 8:34 AM, Eugenio Perez Martin wrote:

On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 
---
v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic

---
  hw/virtio/vhost.c | 63 ++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index efe2f74..d91858b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ */

Why is changing the logger device a problem? All the code paths are
either changing the QLIST or logging, aren't they?
Changing the logger device doesn't affect functionality for sure, but it may 
have an inadvertent effect on cache locality, which is particularly relevant 
to the log scanning process in the hot path. The code makes sure there is no 
churn in the leading logger selection as a result of adding a new vhost 
device, unless the selected logger device goes away and a re-election 
of another logger is needed.
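
To illustrate the invariant (a hypothetical sketch reusing the helpers from the patch; devA/devB/devC stand for already-initialized vhost_dev instances of the same backend type):

    vhost_dev_elect_mem_logger(&devA, true);   /* list: devA             -> devA elected    */
    vhost_dev_elect_mem_logger(&devB, true);   /* list: devA, devB       -> devA unchanged  */
    vhost_dev_elect_mem_logger(&devC, true);   /* list: devA, devC, devB -> devA unchanged  */
    vhost_dev_elect_mem_logger(&devA, false);  /* devA goes away         -> devC re-elected */
    assert(vhost_dev_should_log(&devC));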


-Siwei




+QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
  static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 MemoryRegionSection *section,
  

Re: [PATCH v3 1/2] vhost: dirty log should be per backend type

2024-03-14 Thread Si-Wei Liu




On 3/14/2024 8:25 AM, Eugenio Perez Martin wrote:

On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 
---
v2->v3:
   - remove non-effective assertion that never be reached
   - do not return NULL from vhost_log_get()
   - add neccessary assertions to vhost_log_get()

---
  hw/virtio/vhost.c | 50 ++
  1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..efe2f74 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
  do { } while (0)
  #endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
  r = -1;
  }

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+
  return r;
  }

@@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
  return log;
  }

-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
  {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];

  if (!log || log->size != size) {
  log = vhost_log_alloc(size, share);
  if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
  } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
  }
  } else {
  ++log->refcnt;
@@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
  static void vhost_log_put(struct vhost_dev *dev, bool sync)
  {
  struct vhost_log *log = dev->log;
+VhostBackendType backend_type;

  if (!log) {
  return;
  }

+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
  --log->refcnt;
  if (log->refcnt == 0) {
  /* Sync only the range covered by the old log */
@@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
  vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
  }

-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
  g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
  qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
  log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
  }

  g_free(log);
@@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)

  static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
  {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
  uint64_t log_base = (uintptr_t)log->log;
  int r;

@@ -2037,8 +2057,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
  uint64_t log_base;

  hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
vhost_dev_log_is_shared(hdev));
+if (!hdev->log) {

I thought vhost_log_get couldn't return NULL :).

Sure, missed that. Will post a revised v4.

-Siwei


Other than that,

Acked-by: Eugenio Pérez 


+VHOST_OPS_DEBUG(r, "vhost_log_get failed");
+goto fail_vq;
+}
+
  log_base = (uintptr_t)hdev->log->log;
  r = hde

[PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-14 Thread Si-Wei Liu
There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
  - remove checking NULL return value from vhost_log_get

v2->v3:
  - remove non-effective assertion that can never be reached
  - do not return NULL from vhost_log_get()
  - add necessary assertions to vhost_log_get()
---
 hw/virtio/vhost.c | 45 +
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
 do { } while (0)
 #endif
 
-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
 r = -1;
 }
 
+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+
 return r;
 }
 
@@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, 
bool share)
 return log;
 }
 
-static struct vhost_log *vhost_log_get(uint64_t size, bool share)
+static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
+   uint64_t size, bool share)
 {
-struct vhost_log *log = share ? vhost_log_shm : vhost_log;
+struct vhost_log *log;
+
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];
 
 if (!log || log->size != size) {
 log = vhost_log_alloc(size, share);
 if (share) {
-vhost_log_shm = log;
+vhost_log_shm[backend_type] = log;
 } else {
-vhost_log = log;
+vhost_log[backend_type] = log;
 }
 } else {
 ++log->refcnt;
@@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, 
bool share)
 static void vhost_log_put(struct vhost_dev *dev, bool sync)
 {
 struct vhost_log *log = dev->log;
+VhostBackendType backend_type;
 
 if (!log) {
 return;
 }
 
+assert(dev->vhost_ops);
+backend_type = dev->vhost_ops->backend_type;
+
+if (backend_type == VHOST_BACKEND_TYPE_NONE ||
+backend_type >= VHOST_BACKEND_TYPE_MAX) {
+return;
+}
+
 --log->refcnt;
 if (log->refcnt == 0) {
 /* Sync only the range covered by the old log */
@@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool 
sync)
 vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1);
 }
 
-if (vhost_log == log) {
+if (vhost_log[backend_type] == log) {
 g_free(log->log);
-vhost_log = NULL;
-} else if (vhost_log_shm == log) {
+vhost_log[backend_type] = NULL;
+} else if (vhost_log_shm[backend_type] == log) {
 qemu_memfd_free(log->log, log->size * sizeof(*(log->log)),
 log->fd);
-vhost_log_shm = NULL;
+vhost_log_shm[backend_type] = NULL;
 }
 
 g_free(log);
@@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
 
 static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size)
 {
-struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev));
+struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type,
+  size, vhost_dev_log_is_shared(dev));
 uint64_t log_base = (uintptr_t)log->log;
 int r;
 
@@ -2037,7 +2057,8 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev, bool vrings)
 uint64_t log_base;
 
 hdev->log_size = vhost_get_log_size(hdev);
-hdev->log = vhost_log_get(hdev->log_size,
+hdev->log = vhost_log_get(hdev->vhost_ops->backend_type,
+  hdev->log_size,
   vhost_dev_log_is_shared(hdev));
 log_base = (uintptr_t)hdev->log->log;
 r = hdev->vhost_ops->vhost_set_log_base(hdev,
-- 
1.8.3.1
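
For illustration only, a rough sketch (not part of the patch) of how the
per-backend arrays keep the two logger types apart; the variable names
below are hypothetical, while the enum values and the vhost_log_get()
signature are the ones used by this patch:

    /* vhost-kernel device: non-shareable log, kept in
     * vhost_log[VHOST_BACKEND_TYPE_KERNEL] */
    struct vhost_log *kern_log =
        vhost_log_get(VHOST_BACKEND_TYPE_KERNEL, kern_log_size, false);

    /* vhost-user device: memfd-backed shareable log, kept in
     * vhost_log_shm[VHOST_BACKEND_TYPE_USER] */
    struct vhost_log *user_log =
        vhost_log_get(VHOST_BACKEND_TYPE_USER, user_log_size, true);

    /* The two allocations no longer alias a single global, so their
     * refcounts and resizes cannot interfere with each other. */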




[PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-14 Thread Si-Wei Liu
On setups with one or more virtio-net devices with vhost on,
the cost of each dirty tracking iteration increases with the
number of queues that are set up, e.g. on idle guest migration
the following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory dirtying rates the symptom is lack of convergence
as soon as there is a vhost device with a sufficiently high number
of queues, or a sufficient number of vhost devices.

On every migration iteration (every 100 msecs) the *shared log* is
redundantly scanned as many times as there are queues configured
with vhost in the guest. For the virtqueue data this is necessary,
but not for the memory sections, which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making the mem-section logger a singleton instance, a constant
cost of 7%-9% (like the 1 queue report) is seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
  - add comment to clarify effect on cache locality and
performance

v2 -> v3:
  - add after-fix benchmark to commit log
  - rename vhost_log_dev_enabled to vhost_dev_should_log
  - remove unneeded comparisons for backend_type
  - use QLIST array instead of single flat list to store vhost
logger devices
  - simplify logger election logic
---
 hw/virtio/vhost.c | 67 ++-
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@
 
 static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
 static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];
 
 /* Memslots used by backends that support private memslots (without an fd). */
 static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
 }
 }
 
+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);
+}
+
+static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add)
+{
+VhostBackendType backend_type;
+
+assert(hdev->vhost_ops);
+
+backend_type = hdev->vhost_ops->backend_type;
+assert(backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(backend_type < VHOST_BACKEND_TYPE_MAX);
+
+if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) {
+if (QLIST_EMPTY(&vhost_log_devs[backend_type])) {
+QLIST_INSERT_HEAD(&vhost_log_devs[backend_type],
+  hdev, logdev_entry);
+} else {
+/*
+ * The first vhost_device in the list is selected as the shared
+ * logger to scan memory sections. Put new entry next to the head
+ * to avoid inadvertent change to the underlying logger device.
+ * This is done in order to get better cache locality and to avoid
+ * performance churn on the hot path for log scanning. Even when
+ * new devices come and go quickly, it wouldn't end up changing
+ * the active leading logger device at all.
+ */
+QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]),
+   hdev, logdev_entry);
+}
+} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) {
+QLIST_REMOVE(hdev, logdev_entry);
+}
+}
+
 static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
MemoryRegionSection *section,
hwaddr first,
@@ -166,12 +208,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
 start_addr = MAX(first, start_addr);
 end_addr = MIN(last, end_addr);
 
-for (i = 0; i < dev->mem->nregions; ++i) {
-struct vho

Re: [PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-15 Thread Si-Wei Liu




On 3/14/2024 8:50 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

It's better to describe what's the advantage of doing this.
Yes, I can add that to the log. Although it's a niche use case, it was 
actually a long-standing limitation / bug that vhost-user and 
vhost-kernel loggers can't co-exist per QEMU process, but today it 
just ends up as a silent failure. This bug fix removes that implicit 
limitation in the code.



Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
   - remove checking NULL return value from vhost_log_get

v2->v3:
   - remove non-effective assertion that never be reached
   - do not return NULL from vhost_log_get()
   - add neccessary assertions to vhost_log_get()
---
  hw/virtio/vhost.c | 45 +
  1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
  do { } while (0)
  #endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
  r = -1;
  }

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+

Under which condition could we hit this?
Just in case some other function inadvertently corrupted this earlier, 
we have to capture the discrepancy in the first place... On the other 
hand, it will be helpful for other vhost backend writers to diagnose 
day-one bugs in the code. I feel just a code comment here would not be 
sufficient/helpful.



  It seems not good to assert a local logic.
It seems to me quite a few local asserts are in the same file already, 
vhost_save_backend_state, vhost_load_backend_state, 
vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why is a 
local assert a problem?


Thanks,
-Siwei


Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-15 Thread Si-Wei Liu




On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
   - add comment to clarify effect on cache locality and
 performance

v2 -> v3:
   - add after-fix benchmark to commit log
   - rename vhost_log_dev_enabled to vhost_dev_should_log
   - remove unneeded comparisons for backend_type
   - use QLIST array instead of single flat list to store vhost
 logger devices
   - simplify logger election logic
---
  hw/virtio/vhost.c | 67 ++-
  include/hw/virtio/vhost.h |  1 +
  2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

  static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
  static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

  /* Memslots used by backends that support private memslots (without an fd). */
  static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
  }
  }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]
Because we are not sure if the logger comes from vhost_log_shm[] or 
vhost_log[]. We don't want to complicate the check here by calling into 
vhost_dev_log_is_shared() every time .log_sync() is called.


-Siwei

?

Thanks






Re: [PATCH v4 1/2] vhost: dirty log should be per backend type

2024-03-18 Thread Si-Wei Liu




On 3/17/2024 8:20 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:33 AM Si-Wei Liu  wrote:



On 3/14/2024 8:50 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

There could be a mix of both vhost-user and vhost-kernel clients
in the same QEMU process, where separate vhost loggers for the
specific vhost type have to be used. Make the vhost logger per
backend type, and have them properly reference counted.

It's better to describe what's the advantage of doing this.

Yes, I can add that to the log. Although it's a niche use case, it was
actually a long standing limitation / bug that vhost-user and
vhost-kernel loggers can't co-exist per QEMU process, but today it's
just silent failure that may be ended up with. This bug fix removes that
implicit limitation in the code.

Ok.


Suggested-by: Michael S. Tsirkin 
Signed-off-by: Si-Wei Liu 

---
v3->v4:
- remove checking NULL return value from vhost_log_get

v2->v3:
- remove non-effective assertion that never be reached
- do not return NULL from vhost_log_get()
- add neccessary assertions to vhost_log_get()
---
   hw/virtio/vhost.c | 45 +
   1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 2c9ac79..612f4db 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -43,8 +43,8 @@
   do { } while (0)
   #endif

-static struct vhost_log *vhost_log;
-static struct vhost_log *vhost_log_shm;
+static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
+static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an fd). 
*/
   static unsigned int used_memslots;
@@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev,
   r = -1;
   }

+if (r == 0) {
+assert(dev->vhost_ops->backend_type == backend_type);
+}
+

Under which condition could we hit this?

Just in case some other function inadvertently corrupted this earlier,
we have to capture discrepancy in the first place... On the other hand,
it will be helpful for other vhost backend writers to diagnose day-one
bug in the code. I feel just code comment here will not be
sufficient/helpful.

See below.


   It seems not good to assert a local logic.

It seems to me quite a few local asserts are in the same file already,
vhost_save_backend_state,

For example it has assert for

assert(!dev->started);

which is not the logic of the function itself but require
vhost_dev_start() not to be called before.

But it looks like this patch you assert the code just a few lines
above the assert itself?
Yes, that was the intent - e.g. xxx_ops may already contain a corrupted 
xxx_ops.backend_type before coming to this vhost_set_backend_type() 
function. We can capture this corrupted state by asserting the expected 
xxx_ops.backend_type (to be consistent with the backend_type passed 
in), which needs to be done in the first place when this discrepancy is 
detected. In practice I think there should be no harm in adding this 
assert, and it adds a warranted guarantee to the current code.


Regards,
-Siwei



dev->vhost_ops = &xxx_ops;

...

assert(dev->vhost_ops->backend_type == backend_type)

?

Thanks


vhost_load_backend_state,
vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why local
assert a problem?

Thanks,
-Siwei


Thanks






Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration

2024-03-18 Thread Si-Wei Liu




On 3/17/2024 8:22 PM, Jason Wang wrote:

On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu  wrote:



On 3/14/2024 9:03 PM, Jason Wang wrote:

On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu  wrote:

On setups with one or more virtio-net devices with vhost on,
dirty tracking iteration increases cost the bigger the number
amount of queues are set up e.g. on idle guests migration the
following is observed with virtio-net with vhost=on:

48 queues -> 78.11%  [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50%   [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60%  [.] vhost_dev_sync_region.isra.14

With high memory rates the symptom is lack of convergence as soon
as it has a vhost device with a sufficiently high number of queues,
the sufficient number of vhost devices.

On every migration iteration (every 100msecs) it will redundantly
query the *shared log* the number of queues configured with vhost
that exist in the guest. For the virtqueue data, this is necessary,
but not for the memory sections which are the same. So essentially
we end up scanning the dirty log too often.

To fix that, select a vhost device responsible for scanning the
log with regards to memory sections dirty tracking. It is selected
when we enable the logger (during migration) and cleared when we
disable the logger. If the vhost logger device goes away for some
reason, the logger will be re-selected from the rest of vhost
devices.

After making mem-section logger a singleton instance, constant cost
of 7%-9% (like the 1 queue report) will be seen, no matter how many
queues or how many vhost devices are configured:

48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13
2 devices, 8 queues -> 7.97%   [.] vhost_dev_sync_region.isra.14

Co-developed-by: Joao Martins 
Signed-off-by: Joao Martins 
Signed-off-by: Si-Wei Liu 

---
v3 -> v4:
- add comment to clarify effect on cache locality and
  performance

v2 -> v3:
- add after-fix benchmark to commit log
- rename vhost_log_dev_enabled to vhost_dev_should_log
- remove unneeded comparisons for backend_type
- use QLIST array instead of single flat list to store vhost
  logger devices
- simplify logger election logic
---
   hw/virtio/vhost.c | 67 
++-
   include/hw/virtio/vhost.h |  1 +
   2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 612f4db..58522f1 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -45,6 +45,7 @@

   static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX];
   static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX];
+static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX];

   /* Memslots used by backends that support private memslots (without an fd). 
*/
   static unsigned int used_memslots;
@@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev)
   }
   }

+static inline bool vhost_dev_should_log(struct vhost_dev *dev)
+{
+assert(dev->vhost_ops);
+assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE);
+assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX);
+
+return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]);

A dumb question, why not simple check

dev->log == vhost_log_shm[dev->vhost_ops->backend_type]

Because we are not sure if the logger comes from vhost_log_shm[] or
vhost_log[]. Don't want to complicate the check here by calling into
vhost_dev_log_is_shared() everytime when the .log_sync() is called.

It has very low overhead, isn't it?
Whether this has low overhead depends on the specific backend's 
implementation of .vhost_requires_shm_log(), which the common vhost 
layer should not make assumptions about or rely on in its current 
implementation.




static bool vhost_dev_log_is_shared(struct vhost_dev *dev)
{
 return dev->vhost_ops->vhost_requires_shm_log &&
dev->vhost_ops->vhost_requires_shm_log(dev);
}

And it helps to simplify the logic.
Generally yes, but when it comes to hot path operations the performance 
consideration could override this principle. I think there's no harm in 
checking against the logger device cached in the vhost layer itself, 
and the current patch does not create a lot of complexity or 
performance side effects (actually I think the conditional should be 
very straightforward to turn into just a couple of compare-and-branch 
instructions rather than an indirection through another call).


-Siwei
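
To make the trade-off concrete, a rough sketch of the two checks being
compared (illustrative only; the local variables are hypothetical):

    /* Comparing against the global log pointer needs to pick the right
     * array first, i.e. an indirect call into the backend on every
     * log_sync(): */
    backend = dev->vhost_ops->backend_type;
    is_logger = dev->log == (vhost_dev_log_is_shared(dev) ?
                             vhost_log_shm[backend] : vhost_log[backend]);

    /* What the patch does instead: compare against the cached list head,
     * a plain load plus compare on the hot path: */
    is_logger = (dev == QLIST_FIRST(&vhost_log_devs[backend]));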



Thanks


-Siwei

?

Thanks






Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-18 Thread Si-Wei Liu




On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
  include/qemu/iova-tree.h | 5 +++--
  util/iova-tree.c | 3 ++-
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
  hwaddr iova;
  hwaddr translated_addr;
  hwaddr size;/* Inclusive */
+uint64_t id;
  IOMMUAccessFlags perm;
  } QEMU_PACKED DMAMap;
  typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
   * @map: the mapping to search
   *
   * Search for a mapping in the iova tree that translated_addr overlaps with 
the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
   *
   * Return: DMAMap pointer if found, or NULL if not found.  Note that
   * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,
  
  needle = args->needle;

  if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {


It looks like this iterator can also be invoked by SVQ from 
vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest 
GPA space will be searched without passing in the ID (GPA), and an 
exact match on the same GPA range is not actually needed, unlike the 
mapping removal case. Could we create an API variant for the SVQ 
lookup case specifically? Or alternatively, add a special flag, say 
skip_id_match, to DMAMap, and the id match check may look like below:


(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or 
pass a DMAMap with skip_id_match set to true to svq_iova_tree_find_iova().


Thanks,
-Siwei

  return false;
  }
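
For illustration, a rough sketch of the flag-based option suggested
above; the skip_id_match field and its use below are hypothetical, not
part of this RFC:

    /* In DMAMap: a lookup-only hint set by the caller; maps stored in
     * the tree leave it false. */
    bool skip_id_match;

    /* In iova_tree_find_address_iterator(): */
    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr ||
        (!needle->skip_id_match && needle->id != map->id)) {
        return false;
    }

    /* In vhost_svq_translate_addr(): look up by HVA range only, without
     * requiring a GPA id match. */
    DMAMap needle = {
        .translated_addr = (hwaddr)(uintptr_t)vaddr,
        .size = iov_len,
        .skip_id_match = true,
    };
    map = vhost_iova_tree_find_iova(svq->iova_tree, &needle);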
  





Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-19 Thread Si-Wei Liu




On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:



On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
   include/qemu/iova-tree.h | 5 +++--
   util/iova-tree.c | 3 ++-
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
   hwaddr iova;
   hwaddr translated_addr;
   hwaddr size;/* Inclusive */
+uint64_t id;
   IOMMUAccessFlags perm;
   } QEMU_PACKED DMAMap;
   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
* @map: the mapping to search
*
* Search for a mapping in the iova tree that translated_addr overlaps with 
the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
*
* Return: DMAMap pointer if found, or NULL if not found.  Note that
* the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

   needle = args->needle;
   if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...
Yeah, that would be another means of doing the translation without 
having to complicate the API around iova_tree. I wonder how the lookup 
through memory_region_from_host() may perform compared to the iova tree 
one; the former looks to be an O(N) linear search over a list while the 
latter would be roughly O(log N) on an AVL tree? Of course, 
memory_region_from_host() won't search outside of the guest memory 
space for sure. As this could be on the hot data path I have a little 
hesitance about the potential cost or performance regression this 
change could bring in, but maybe I'm overthinking it too much...


Thanks,
-Siwei




Thanks,
-Siwei

   return false;
   }






Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-23 Thread Si-Wei Liu




On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:



On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:


On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
include/qemu/iova-tree.h | 5 +++--
util/iova-tree.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
hwaddr iova;
hwaddr translated_addr;
hwaddr size;/* Inclusive */
+uint64_t id;
IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;
typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
 * @map: the mapping to search
 *
 * Search for a mapping in the iova tree that translated_addr overlaps with 
the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
 *
 * Return: DMAMap pointer if found, or NULL if not found.  Note that
 * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

needle = args->needle;
if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.
Oh sorry, I misread the code; I should have looked at g_tree_foreach() 
instead of g_tree_search_node(). So the former is indeed a linear 
iteration, but it looks to be ordered?


https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.
Agreed, yeap we can use memory_region_from_host for now.  Any reason why 
reverse IOVATree was dropped, lack of users? But now we have one!


Thanks,
-Siwei


Thanks!


Of course,
memory_region_from_host() won't search out of the guest memory space for
sure. As this could be on the hot data path I have a little bit
hesitance over the potential cost or performance regression this change
could bring in, but maybe I'm overthinking it too much...

Thanks,
-Siwei


Thanks,
-Siwei

return false;
}
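
As a rough illustration of the reverse IOVATree idea mentioned above
(all names are hypothetical, and this is not necessarily how the early
SVQ RFCs did it): a second IOVATree can be kept with the roles of iova
and translated_addr swapped, so the HVA->IOVA lookup becomes an ordered
tree search too, at the cost of updating both trees together on every
map and unmap:

    typedef struct VhostIOVATreeSketch {
        IOVATree *iova_taddr_map;    /* IOVA -> HVA, as today */
        IOVATree *taddr_iova_map;    /* HVA -> IOVA, the reverse tree */
    } VhostIOVATreeSketch;

    static int sketch_map(VhostIOVATreeSketch *t, const DMAMap *map)
    {
        DMAMap rev = {
            .iova = map->translated_addr,    /* keyed by HVA */
            .translated_addr = map->iova,
            .size = map->size,
            .perm = map->perm,
        };
        int r = iova_tree_insert(t->iova_taddr_map, map);

        /* HVA -> IOVA lookups can then use iova_tree_find() on
         * taddr_iova_map, an ordered search on the tree key, instead of
         * iterating every node by translated_addr. */
        return r < 0 ? r : iova_tree_insert(t->taddr_iova_map, &rev);
    }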






Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-25 Thread Si-Wei Liu




On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:



On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:


On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
 include/qemu/iova-tree.h | 5 +++--
 util/iova-tree.c | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
 hwaddr iova;
 hwaddr translated_addr;
 hwaddr size;/* Inclusive */
+uint64_t id;
 IOMMUAccessFlags perm;
 } QEMU_PACKED DMAMap;
 typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
  * @map: the mapping to search
  *
  * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
  *
  * Return: DMAMap pointer if found, or NULL if not found.  Note that
  * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

 needle = args->needle;
 if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.




But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Agreed, yeap we can use memory_region_from_host for now.  Any reason why
reverse IOVATree was dropped, lack of users? But now we have one!


No, it is just simplicity. We already have an user in the hot patch in
the master branch, vhost_svq_vring_write_descs. But I never profiled
enough to find if it is a bottleneck or not to

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-04-29 Thread Si-Wei Liu




On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:



On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:


On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
  include/qemu/iova-tree.h | 5 +++--
  util/iova-tree.c | 3 ++-
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
  hwaddr iova;
  hwaddr translated_addr;
  hwaddr size;/* Inclusive */
+uint64_t id;
  IOMMUAccessFlags perm;
  } QEMU_PACKED DMAMap;
  typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
   * @map: the mapping to search
   *
   * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
   *
   * Return: DMAMap pointer if found, or NULL if not found.  Note that
   * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

  needle = args->needle;
  if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Agreed, yeap we can use memory_region_from_host for now.  Any reason why
reverse IOVATree was dropped, lack of users? But now we have one!


No, it is just simplicity. We already have an user in the hot patch in
the ma

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-01 Thread Si-Wei Liu




On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:

On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer  wrote:



On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:



On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:


On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
   include/qemu/iova-tree.h | 5 +++--
   util/iova-tree.c | 3 ++-
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
   hwaddr iova;
   hwaddr translated_addr;
   hwaddr size;/* Inclusive */
+uint64_t id;
   IOMMUAccessFlags perm;
   } QEMU_PACKED DMAMap;
   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
* @map: the mapping to search
*
* Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
*
* Return: DMAMap pointer if found, or NULL if not found.  Note that
* the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

   needle = args->needle;
   if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree to

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-01 Thread Si-Wei Liu




On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:

On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:



On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:


On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:

On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
   include/qemu/iova-tree.h | 5 +++--
   util/iova-tree.c | 3 ++-
   2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
   hwaddr iova;
   hwaddr translated_addr;
   hwaddr size;/* Inclusive */
+uint64_t id;
   IOMMUAccessFlags perm;
   } QEMU_PACKED DMAMap;
   typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
* @map: the mapping to search
*
* Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
*
* Return: DMAMap pointer if found, or NULL if not found.  Note that
* the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

   needle = args->needle;
   if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Agreed, yeap we can use memory_region_from_host for now.  Any reason why
reverse IOVATree was dr

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-07 Thread Si-Wei Liu




On 5/1/2024 11:18 PM, Eugenio Perez Martin wrote:

On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu  wrote:



On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote:

On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer  wrote:


On 4/29/24 4:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:


On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:

On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  This mappings may not match with the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exists in the tree, as looking them by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
include/qemu/iova-tree.h | 5 +++--
util/iova-tree.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
hwaddr iova;
hwaddr translated_addr;
hwaddr size;/* Inclusive */
+uint64_t id;
IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;
typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
 * @map: the mapping to search
 *
 * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
 *
 * Return: DMAMap pointer if found, or NULL if not found.  Note that
 * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

needle = args->needle;
if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
space will be searched on without passing in the ID (GPA), and exact
match for the same GPA range is not actually needed unlike the mapping
removal case. Could we create an API variant, for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.
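
To make the cost concrete: since the tree is keyed and ordered by iova, a
reverse (HVA -> mapping) lookup can only be a full walk. A minimal sketch,
assuming the GLib GTree that backs the iova tree:

/* Illustrative only: a lookup by translated_addr cannot prune the search,
 * so it is O(N) in the number of mappings in the worst case. */
typedef struct {
    hwaddr translated_addr;
    const DMAMap *result;
} ReverseLookupArgs;

static gboolean reverse_lookup_iterator(gpointer key, gpointer value,
                                        gpointer data)
{
    const DMAMap *map = key;
    ReverseLookupArgs *args = data;

    if (args->translated_addr >= map->translated_addr &&
        args->translated_addr <= map->translated_addr + map->size) {
        args->result = map;
        return true;    /* stop the traversal, match found */
    }
    return false;       /* iova order gives no shortcut here, keep walking */
}

/* Caller side: g_tree_foreach(iova_tree, reverse_lookup_iterator, &args); */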

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Re: [RFC 1/2] iova_tree: add an id member to DMAMap

2024-05-07 Thread Si-Wei Liu




On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:

On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu  wrote:



On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:

On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu  wrote:


On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:

On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu  wrote:

On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:

On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu  wrote:

On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:

On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu  wrote:

On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:

On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu  wrote:

On 4/10/2024 3:03 AM, Eugenio Pérez wrote:

IOVA tree is also used to track the mappings of virtio-net shadow
virtqueue.  These mappings may not match the GPA->HVA ones.

This causes a problem when overlapped regions (different GPA but same
translated HVA) exist in the tree, as looking them up by HVA will return
them twice.  To solve this, create an id member so we can assign unique
identifiers (GPA) to the maps.

Signed-off-by: Eugenio Pérez 
---
include/qemu/iova-tree.h | 5 +++--
util/iova-tree.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
index 2a10a7052e..34ee230e7d 100644
--- a/include/qemu/iova-tree.h
+++ b/include/qemu/iova-tree.h
@@ -36,6 +36,7 @@ typedef struct DMAMap {
hwaddr iova;
hwaddr translated_addr;
hwaddr size;/* Inclusive */
+uint64_t id;
IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;
typedef gboolean (*iova_tree_iterator)(DMAMap *map);
@@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const 
DMAMap *map);
 * @map: the mapping to search
 *
 * Search for a mapping in the iova tree that translated_addr overlaps 
with the
- * mapping range specified.  Only the first found mapping will be
- * returned.
+ * mapping range specified and map->id is equal.  Only the first found
+ * mapping will be returned.
 *
 * Return: DMAMap pointer if found, or NULL if not found.  Note that
 * the returned DMAMap pointer is maintained internally.  User should
diff --git a/util/iova-tree.c b/util/iova-tree.c
index 536789797e..0863e0a3b8 100644
--- a/util/iova-tree.c
+++ b/util/iova-tree.c
@@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value,

needle = args->needle;
if (map->translated_addr + map->size < needle->translated_addr ||
-needle->translated_addr + needle->size < map->translated_addr) {
+needle->translated_addr + needle->size < map->translated_addr ||
+needle->id != map->id) {

It looks like this iterator can also be invoked by SVQ from
vhost_svq_translate_addr() -> iova_tree_find_iova(), where the guest GPA
space will be searched without passing in the ID (GPA), and an exact
match for the same GPA range is not actually needed, unlike the mapping
removal case. Could we create an API variant for the SVQ lookup case
specifically? Or alternatively, add a special flag, say skip_id_match, to
DMAMap, and the id match check may look like below:

(!needle->skip_id_match && needle->id != map->id)

I think vhost_svq_translate_addr() could just call the API variant or
pass a DMAMap with skip_id_match set to true to svq_iova_tree_find_iova().


I think you're totally right. But I'd really like to not complicate
the API of the iova_tree more.

I think we can look for the hwaddr using memory_region_from_host and
then get the hwaddr. It is another lookup though...

Yeah, that will be another means of doing translation without having to
complicate the API around iova_tree. I wonder how the lookup through
memory_region_from_host() may perform compared to the iova tree one, the
former looks to be an O(N) linear search on a linked list while the
latter would be roughly O(log N) on an AVL tree?

Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
linear too. It is not even ordered.

Oh Sorry, I misread the code and I should look for g_tree_foreach ()
instead of g_tree_search_node(). So the former is indeed linear
iteration, but it looks to be ordered?

https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115

The GPA / IOVA are ordered but we're looking by QEMU's vaddr.

If we have these translations:
[0x1000, 0x2000] -> [0x1, 0x11000]
[0x2000, 0x3000] -> [0x6000, 0x7000]

We will see them in this order, so we cannot stop the search at the first node.

Yeah, reverse lookup is unordered indeed, anyway.


But apart from this detail you're right, I have the same concerns with
this solution too. If we see a hard performance regression we could go
to more complicated solutions, like maintaining a reverse IOVATree in
vhost-iova-tree too. First RFCs of SVQ did that actually.

Re: [RFC v2 12/13] vdpa: preemptive kick at enable

2023-02-01 Thread Si-Wei Liu




On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote:

On Fri, Jan 13, 2023 at 4:39 AM Jason Wang  wrote:

On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan  wrote:



On 1/13/2023 10:31 AM, Jason Wang wrote:

On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez  wrote:

Spuriously kick the destination device's queue so it knows in case there
are new descriptors.

RFC: This is somehow a gray area. The guest may have placed descriptors
in a virtqueue but not kicked it, so it might be surprised if the device
starts processing it.

So I think this is kind of the work of the vDPA parent. For the parent
that needs this trick, we should do it in the parent driver.

Agree, it looks easier to implement this in the parent driver;
I can implement it in ifcvf set_vq_ready right now

Great, but please check whether or not it is really needed.

Some device implementations could check the available descriptors
after DRIVER_OK without waiting for a kick.


So IIUC we can entirely drop this from the series (and I hope we can).
But then, what about the devices that do *not* check for them?
I wonder how the kick can be missed in the first place. Supposedly, by the 
time vhost_dev_stop() calls .suspend() into the vdpa driver, the vcpus have 
already stopped running (vm_running = false) and all pending kicks have 
already been delivered through vhost-vdpa's host notifiers or the mapped 
doorbell, so the device won't get new ones. If the device purposely ignores 
pending kicks during .suspend() (note: this could be a device bug), then it 
should check the available descriptors after reaching DRIVER_OK and process 
the outstanding ones, making up for the missing kick. If the vdpa driver 
doesn't support .suspend(), then it should flush the work before .reset() - 
vhost-scsi does it this way. Otherwise, I think the norm (the right thing 
to do) is that the device processes pending kicks before guest memory is 
unmapped late in vhost_dev_stop(). Is there any case where kicks may be missed?


-Siwei




If we drop it, it seems to me we must mandate that devices check for
descriptors at queue_enable. The queue could stall otherwise, couldn't it?

Thanks!


Thanks


Thanks
Zhu Lingshan

Thanks


However, that information is not in the migration stream and it should
be an edge case anyhow, being resilient to parallel notifications from
the guest.

Signed-off-by: Eugenio Pérez 
---
   hw/virtio/vhost-vdpa.c | 5 +
   1 file changed, 5 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 40b7e8706a..dff94355dd 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev 
*dev, int ready)
   }
   trace_vhost_vdpa_set_vring_ready(dev);
   for (i = 0; i < dev->nvqs; ++i) {
+VirtQueue *vq;
   struct vhost_vring_state state = {
   .index = dev->vq_index + i,
   .num = 1,
   };
   vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state);
+
+/* Preemptive kick */
+vq = virtio_get_queue(dev->vdev, dev->vq_index + i);
+event_notifier_set(virtio_queue_get_host_notifier(vq));
   }
   return 0;
   }
--
2.31.1






Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration

2023-02-01 Thread Si-Wei Liu




On 1/12/2023 9:24 AM, Eugenio Pérez wrote:

It's possible to migrate vdpa net devices if they are shadowed from the
start.  But always shadowing the dataplane effectively breaks its host
passthrough, so it's not convenient in vDPA scenarios.

This series enables dynamically switching to shadow mode only at
migration time.  This allows full data virtqueue passthrough all the
time QEMU is not migrating.



Successfully tested with vdpa_sim_net (but it needs some patches, I
will send them soon) and qemu emulated device with vp_vdpa with
some restrictions:

* No CVQ.
* VIRTIO_RING_F_STATE patches.
What are these patches? (I'm not sure I follow VIRTIO_RING_F_STATE - is it 
a new feature that other vdpa drivers would need for live migration?)


-Siwei



* Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like
  DPDK.

Comments are welcome, especially in the patches with RFC in the message.



v2:
- Use a migration listener instead of a memory listener to know when
  the migration starts.
- Add stuff not picked with ASID patches, like enable rings after
  driver_ok
- Add rewinding on the migration src, not in dst
- v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html



Eugenio Pérez (13):
   vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check
   vdpa net: move iova tree creation from init to start
   vdpa: copy cvq shadow_data from data vqs, not from x-svq
   vdpa: rewind at get_base, not set_base
   vdpa net: add migration blocker if cannot migrate cvq
   vhost: delay set_vring_ready after DRIVER_OK
   vdpa: delay set_vring_ready after DRIVER_OK
   vdpa: Negotiate _F_SUSPEND feature
   vdpa: add feature_log parameter to vhost_vdpa
   vdpa net: allow VHOST_F_LOG_ALL
   vdpa: add vdpa net migration state notifier
   vdpa: preemptive kick at enable
   vdpa: Conditionally expose _F_LOG in vhost_net devices



  include/hw/virtio/vhost-backend.h |   4 +
  include/hw/virtio/vhost-vdpa.h    |   1 +
  hw/net/vhost_net.c                |  25 ++-
  hw/virtio/vhost-vdpa.c            |  64 +---
  hw/virtio/vhost.c                 |   3 +
  net/vhost-vdpa.c                  | 247 +-
  6 files changed, 278 insertions(+), 66 deletions(-)








Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier

2023-02-01 Thread Si-Wei Liu




On 1/12/2023 9:24 AM, Eugenio Pérez wrote:

This allows net to restart the device backend to configure SVQ on it.

Ideally, these changes should not be net specific. However, the vdpa net
backend is the one with enough knowledge to configure everything, for a
few reasons:
* Queues might need to be shadowed or not depending on their kind (control
  vs data).
* Queues need to share the same map translations (iova tree).

Because of that it is cleaner to restart the whole net backend and
configure again as expected, similar to how vhost-kernel moves between
userspace and passthrough.

If more kinds of devices need dynamic switching to SVQ we can create a
callback struct like VhostOps and move most of the code there.
VhostOps cannot be reused since all vdpa backend share them, and to
personalize just for networking would be too heavy.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 84 
  1 file changed, 84 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 5d7ad6e4d7..f38532b1df 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,6 +26,8 @@
  #include 
  #include "standard-headers/linux/virtio_net.h"
  #include "monitor/monitor.h"
+#include "migration/migration.h"
+#include "migration/misc.h"
  #include "migration/blocker.h"
  #include "hw/virtio/vhost.h"
  
@@ -33,6 +35,7 @@

  typedef struct VhostVDPAState {
  NetClientState nc;
  struct vhost_vdpa vhost_vdpa;
+Notifier migration_state;
  Error *migration_blocker;
  VHostNetState *vhost_net;
  
@@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)

  return DO_UPCAST(VhostVDPAState, nc, nc0);
  }
  
+static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)

+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+VirtIONet *n;
+VirtIODevice *vdev;
+int data_queue_pairs, cvq, r;
+NetClientState *peer;
+
+/* We are only called on the first data vqs and only if x-svq is not set */
+if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
+return;
+}
+
+vdev = v->dev->vdev;
+n = VIRTIO_NET(vdev);
+if (!n->vhost_started) {
+return;
+}
+
+if (enable) {
+ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+}
+data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+  n->max_ncs - n->max_queue_pairs : 0;
+vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
+
+peer = s->nc.peer;
+for (int i = 0; i < data_queue_pairs + cvq; i++) {
+VhostVDPAState *vdpa_state;
+NetClientState *nc;
+
+if (i < data_queue_pairs) {
+nc = qemu_get_peer(peer, i);
+} else {
+nc = qemu_get_peer(peer, n->max_queue_pairs);
+}
+
+vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
+vdpa_state->vhost_vdpa.shadow_data = enable;
+
+if (i < data_queue_pairs) {
+/* Do not override CVQ shadow_vqs_enabled */
+vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
+}
+}
+
+r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
As a first revision, this method (vhost_net_stop followed by 
vhost_net_start) should be fine for software vhost-vdpa backends, e.g. 
vp_vdpa and vdpa_sim_net. However, I would like to draw your attention to 
the fact that this method implies substantial blackout time for mode 
switching on real hardware - a full device reset cycle, with the memory 
mappings torn down, the same set of pages unpinned and re-pinned, and the 
new mappings set up, would take a very significant amount of time, 
especially for a large VM. Maybe we can do:


1) replace reset with the RESUME feature that was just added to the 
vhost-vdpa ioctls in kernel
2) add new vdpa ioctls to allow iova range rebound to new virtual 
address for QEMU's shadow vq or back to device's vq
3) use a lightweight sequence of suspend+rebind+resume to switch mode 
on the fly instead of going through the whole reset+restart cycle (a rough 
sketch follows below)


I suspect the same idea could even be used to address the high live 
migration downtime seen on hardware vdpa devices. What do you think?
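
For illustration only, the lightweight switch could look roughly like the
sketch below. VHOST_VDPA_SUSPEND and VHOST_VDPA_RESUME are existing
vhost-vdpa ioctls, while VHOST_VDPA_REBIND_IOVA and its argument are purely
hypothetical placeholders for the proposed rebind operation (the usual
<sys/ioctl.h>/<linux/vhost.h> plumbing and cleanup paths are omitted):

/* Rough sketch of the proposed suspend + rebind + resume mode switch,
 * avoiding the full reset + unpin + repin + remap cycle. */
static int vdpa_switch_mode_light(int device_fd, void *rebind_args)
{
    if (ioctl(device_fd, VHOST_VDPA_SUSPEND) < 0) {
        return -errno;
    }

    /* Hypothetical ioctl: re-point the already-pinned IOVA ranges at the
     * shadow virtqueue addresses (or back at the guest vq addresses). */
    if (ioctl(device_fd, VHOST_VDPA_REBIND_IOVA, rebind_args) < 0) {
        return -errno;
    }

    return ioctl(device_fd, VHOST_VDPA_RESUME) < 0 ? -errno : 0;
}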


Thanks,
-Siwei


+if (unlikely(r < 0)) {
+error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
+}
+}
+
+static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
+{
+MigrationState *migration = data;
+VhostVDPAState *s = container_of(notifier, VhostVDPAState,
+ migration_state);
+
+switch (migration->state) {
+case MIGRATION_STATUS_SETUP:
+vhost_vdpa_net_log_global_enable(s, true);
+return;
+
+case MIGRATION_STATUS_CANCELLING:
+case MIGRATION_STATUS_CANCELLED:
+case MIGRATION_STATUS_FAILED:
+vhost_vdpa_net_log_global_enable(s, false);
+return;
+};
+}
+
  static void vhost

Re: [PATCH v2 2/5] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa

2022-04-29 Thread Si-Wei Liu




On 4/28/2022 7:23 PM, Jason Wang wrote:


On 2022/4/27 16:30, Si-Wei Liu wrote:

With MQ enabled vdpa device and non-MQ supporting guest e.g.
booting vdpa with mq=on over OVMF of single vqp, below assert
failure is seen:

../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion 
`idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs' failed.


0  0x7f8ce3ff3387 in raise () at /lib64/libc.so.6
1  0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6
2  0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6
3  0x7f8ce3fec252 in  () at /lib64/libc.so.6
4  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=<optimized out>, idx=<optimized out>) at ../hw/virtio/vhost-vdpa.c:563
5  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=<optimized out>, idx=<optimized out>) at ../hw/virtio/vhost-vdpa.c:558
6  0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, 
vdev=0x558f568f91f0, n=2, mask=) at 
../hw/virtio/vhost.c:1557
7  0x558f52c6b89a in virtio_pci_set_guest_notifier 
(d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, 
with_irqfd=with_irqfd@entry=false)

    at ../hw/virtio/virtio-pci.c:974
8  0x558f52c6c0d8 in virtio_pci_set_guest_notifiers 
(d=0x558f568f0f60, nvqs=3, assign=true) at 
../hw/virtio/virtio-pci.c:1019
9  0x558f52bf091d in vhost_net_start 
(dev=dev@entry=0x558f568f91f0, ncs=0x558f56937cd0, 
data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1)

    at ../hw/net/vhost_net.c:361
10 0x558f52d4e5e7 in virtio_net_set_status (status=<optimized out>, n=0x558f568f91f0) at ../hw/net/virtio-net.c:289
11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, 
status=15 '\017') at ../hw/net/virtio-net.c:370
12 0x558f52d6c4b2 in virtio_set_status 
(vdev=vdev@entry=0x558f568f91f0, val=val@entry=15 '\017') at 
../hw/virtio/virtio.c:1945
13 0x558f52c69eff in virtio_pci_common_write 
(opaque=0x558f568f0f60, addr=, val=, 
size=) at ../hw/virtio/virtio-pci.c:1292
14 0x558f52d15d6e in memory_region_write_accessor 
(mr=0x558f568f19d0, addr=20, value=, size=1, 
shift=, mask=, attrs=...)

    at ../softmmu/memory.c:492
15 0x558f52d127de in access_with_adjusted_size 
(addr=addr@entry=20, value=value@entry=0x7f8cdbffe748, 
size=size@entry=1, access_size_min=, 
access_size_max=, access_fn=0x558f52d15cf0 
, mr=0x558f568f19d0, attrs=...) at 
../softmmu/memory.c:554
16 0x558f52d157ef in memory_region_dispatch_write 
(mr=mr@entry=0x558f568f19d0, addr=20, data=, 
op=, attrs=attrs@entry=...)

    at ../softmmu/memory.c:1504
17 0x558f52d078e7 in flatview_write_continue 
(fv=fv@entry=0x7f8accbc3b90, addr=addr@entry=103079215124, attrs=..., 
ptr=ptr@entry=0x7f8ce6300028, len=len@entry=1, addr1=, 
l=, mr=0x558f568f19d0) at 
/home/opc/qemu-upstream/include/qemu/host-utils.h:165
18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, 
addr=103079215124, attrs=..., buf=0x7f8ce6300028, len=1) at 
../softmmu/physmem.c:2822
19 0x558f52d0b36b in address_space_write (as=, 
addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, 
len=)

    at ../softmmu/physmem.c:2914
20 0x558f52d0b3da in address_space_rw (as=, 
addr=, attrs=...,
    attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=<optimized out>, is_write=<optimized out>) at ../softmmu/physmem.c:2924
21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) 
at ../accel/kvm/kvm-all.c:2903
22 0x558f52dcfabd in kvm_vcpu_thread_fn 
(arg=arg@entry=0x558f55c2da60) at ../accel/kvm/kvm-accel-ops.c:49
23 0x558f52f9f04a in qemu_thread_start (args=) at 
../util/qemu-thread-posix.c:556

24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0
25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6

The cause of the assert failure is that the vhost_dev index
for the ctrl vq was not aligned with the actual one in use by the guest.
Upon multiqueue feature negotiation in virtio_net_set_multiqueue(),
if guest doesn't support multiqueue, the guest vq layout would shrink
to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl).
This results in ctrl_vq taking a different vhost_dev group index than
the default. We can map vq to the correct vhost_dev group by checking
if MQ is supported by guest and successfully negotiated. Since the
MQ feature is only present along with CTRL_VQ, we make sure the index
2 is only meant for the control vq while MQ is not supported by guest.

Fixes: 22288fe ("virtio-net: vhost control virtqueue support")
Suggested-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
  hw/net/virtio-net.c | 22 --
  1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ffb3475..8ca0b80 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3171,8 +3171,17 @@ static NetClientInfo net_virtio_info = {
  static bool virtio_net_guest_notifier_pending(VirtIODevice *vdev, 
int idx)

  {
  VirtIONet *n = VIRTIO_NET(vdev);
-    NetClientState *nc = qemu_get_subqueue(n->nic

Re: [PATCH v2 2/5] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa

2022-04-29 Thread Si-Wei Liu




On 4/28/2022 7:24 PM, Jason Wang wrote:

On Fri, Apr 29, 2022 at 10:24 AM Jason Wang  wrote:


On 2022/4/27 16:30, Si-Wei Liu wrote:

With MQ enabled vdpa device and non-MQ supporting guest e.g.
booting vdpa with mq=on over OVMF of single vqp, below assert
failure is seen:

../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= dev->vq_index 
&& idx < dev->vq_index + dev->nvqs' failed.

0  0x7f8ce3ff3387 in raise () at /lib64/libc.so.6
1  0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6
2  0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6
3  0x7f8ce3fec252 in  () at /lib64/libc.so.6
4  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, 
idx=) at ../hw/virtio/vhost-vdpa.c:563
5  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, 
idx=) at ../hw/virtio/vhost-vdpa.c:558
6  0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, 
vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557
7  0x558f52c6b89a in virtio_pci_set_guest_notifier 
(d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, 
with_irqfd=with_irqfd@entry=false)
 at ../hw/virtio/virtio-pci.c:974
8  0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, 
nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019
9  0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, 
ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1)
 at ../hw/net/vhost_net.c:361
10 0x558f52d4e5e7 in virtio_net_set_status (status=, 
n=0x558f568f91f0) at ../hw/net/virtio-net.c:289
11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 
'\017') at ../hw/net/virtio-net.c:370
12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, 
val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945
13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, addr=, val=, size=) at ../hw/virtio/virtio-pci.c:1292
14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, addr=20, 
value=, size=1, shift=, mask=, 
attrs=...)
 at ../softmmu/memory.c:492
15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, 
access_size_max=, access_fn=0x558f52d15cf0 
, mr=0x558f568f19d0, attrs=...) at ../softmmu/memory.c:554
16 0x558f52d157ef in memory_region_dispatch_write (mr=mr@entry=0x558f568f19d0, addr=20, 
data=, op=, attrs=attrs@entry=...)
 at ../softmmu/memory.c:1504
17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, 
addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, len=len@entry=1, 
addr1=, l=, mr=0x558f568f19d0) at 
/home/opc/qemu-upstream/include/qemu/host-utils.h:165
18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, 
attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822
19 0x558f52d0b36b in address_space_write (as=, addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=)
 at ../softmmu/physmem.c:2914
20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=...,
 attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=, 
is_write=) at ../softmmu/physmem.c:2924
21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at 
../accel/kvm/kvm-all.c:2903
22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at 
../accel/kvm/kvm-accel-ops.c:49
23 0x558f52f9f04a in qemu_thread_start (args=) at 
../util/qemu-thread-posix.c:556
24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0
25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6

The cause of the assert failure is that the vhost_dev index
for the ctrl vq was not aligned with the actual one in use by the guest.
Upon multiqueue feature negotiation in virtio_net_set_multiqueue(),
if guest doesn't support multiqueue, the guest vq layout would shrink
to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl).
This results in ctrl_vq taking a different vhost_dev group index than
the default. We can map vq to the correct vhost_dev group by checking
if MQ is supported by guest and successfully negotiated. Since the
MQ feature is only present along with CTRL_VQ, we make sure the index
2 is only meant for the control vq while MQ is not supported by guest.

Fixes: 22288fe ("virtio-net: vhost control virtqueue support")
Suggested-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
   hw/net/virtio-net.c | 22 --
   1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ffb3475..8ca0b80 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -3171,8 +3171,17 @@ static NetClientInfo net_virtio_info = {
   static bool virtio_net_guest_notifier_pending(VirtIODevice *vdev, int idx)
   {
   VirtIONet *n = VIRTIO_NET(vdev);
-NetClientState *nc = qemu_get_subque

Re: [PATCH 0/7] vhost-vdpa multiqueue fixes

2022-04-29 Thread Si-Wei Liu




On 4/28/2022 7:30 PM, Jason Wang wrote:

On Wed, Apr 27, 2022 at 5:09 PM Si-Wei Liu  wrote:



On 4/27/2022 1:38 AM, Jason Wang wrote:

On Wed, Apr 27, 2022 at 4:30 PM Si-Wei Liu  wrote:


On 4/26/2022 9:28 PM, Jason Wang wrote:

On 2022/3/30 14:33, Si-Wei Liu wrote:

Hi,

This patch series attempts to fix a few issues in vhost-vdpa
multiqueue functionality.

Patch #1 is the formal submission for RFC patch in:
https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/

Patch #2 and #3 were taken from a previous patchset posted on
qemu-devel:
https://lore.kernel.org/qemu-devel/2027192851.65529-1-epere...@redhat.com/

albeit abandoned, two patches in that set turn out to be useful for
patch #4, which is to fix a QEMU crash due to race condition.

Patch #5 through #7 are obviously small bug fixes. Please find the
description of each in the commit log.

Thanks,
-Siwei

Hi Si Wei:

I think we need another version of this series?

Hi Jason,

Apologies for the long delay. I was in the middle of reworking the patch
"virtio: don't read pending event on host notifier if disabled", but
found out that it would need quite some code change for the userspace
fallback handler to work properly (for the queue no. change case
specifically).

We probably need this fix for -stable, so I wonder if we can have a
workaround first and do refactoring on top?

Hmmm, a nasty fix that may well address the segfault would be something like this:

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 8ca0b80..3ac93a4 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -368,6 +368,10 @@ static void virtio_net_set_status(struct
VirtIODevice *vdev, uint8_t status)
   int i;
   uint8_t queue_status;

+if (n->status_pending)
+return;
+
+n->status_pending = true;
   virtio_net_vnet_endian_status(n, status);
   virtio_net_vhost_status(n, status);

@@ -416,6 +420,7 @@ static void virtio_net_set_status(struct
VirtIODevice *vdev, uint8_t status)
   }
   }
   }
+n->status_pending = false;
   }

   static void virtio_net_set_link_status(NetClientState *nc)
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index eb87032..95efea8 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -216,6 +216,7 @@ struct VirtIONet {
   VirtioNetRssData rss_data;
   struct NetRxPkt *rx_pkt;
   struct EBPFRSSContext ebpf_rss;
+bool status_pending;
   };

   void virtio_net_set_netclient_name(VirtIONet *n, const char *name,

To be honest, I am not sure if this is worth a full-blown fix to make it
completely work. Without applying the vq suspend patch (the one I posted in
https://lore.kernel.org/qemu-devel/df7c9a87-b2bd-7758-a6b6-bd834a733...@oracle.com/),
it's very hard for me to effectively verify my code change - it's very
easy for the guest vq index to get out of sync if the vq is not stopped
once the vhost is up and running (I tested it with repeatedly setting
set_link off and on).

Can we test via vmstop?
Yes, of course, that's where the segfault happened. The tight loop of 
set_link on/off doesn't even work for the single queue case, which is 
why I doubted it ever worked for vhost-vdpa.





I am not sure if there's a real chance we can run into this issue
in practice due to the incomplete fix, if we don't fix the vq
stop/suspend issue first. Anyway I will try, as other use cases, e.g. live
migration, are likely to stumble on it, too.

Ok, so I think we probably don't need the "nasty" fix above. Let's fix
it together with the stop/resume issue.
Ok, then does the tentative code change below suffice? I.e. it 
would fail the request to change the number of queue pairs when the 
vhost-vdpa backend falls back to the userspace handler, but it's 
probably the easiest way to fix the vmstop segfault.


diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ed231f9..8ba9f09 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1177,6 +1177,7 @@ static int virtio_net_handle_mq(VirtIONet *n, 
uint8_t cmd,

 struct virtio_net_ctrl_mq mq;
 size_t s;
 uint16_t queue_pairs;
+    NetClientState *nc = qemu_get_queue(n->nic);

 s = iov_to_buf(iov, iov_cnt, 0, &mq, sizeof(mq));
 if (s != sizeof(mq)) {
@@ -1196,6 +1197,13 @@ static int virtio_net_handle_mq(VirtIONet *n, 
uint8_t cmd,

 return VIRTIO_NET_ERR;
 }

+    /* avoid changing the number of queue_pairs for vdpa device in
+ * userspace handler.
+ * TO
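
The tentative diff above is truncated in this archive. A hedged guess at the
guard it sketches - refusing the queue-pair change from the userspace
fallback when the peer is a vhost-vdpa backend - could be:

    /* Sketch only: in virtio_net_handle_mq(), bail out before applying the
     * queue-pair change when the peer is a vhost-vdpa backend, so the change
     * is not handled from the userspace fallback path. */
    if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
        return VIRTIO_NET_ERR;
    }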

[PATCH v3 5/6] vhost-vdpa: backend feature should set only once

2022-05-05 Thread Si-Wei Liu
The vhost_vdpa_one_time_request() branch in
vhost_vdpa_set_backend_cap() incorrectly sends down
ioctls on vhost_dev with non-zero index. This may
end up with multiple VHOST_SET_BACKEND_FEATURES
ioctl calls sent down on the vhost-vdpa fd that is
shared between all these vhost_dev's.

To fix it, send down the ioctl only once via the first
vhost_dev with index 0. For better readability of the
code, vhost_vdpa_one_time_request() is renamed to
vhost_vdpa_first_dev() with its polarity flipped.
This call is only applicable to requests that perform
an operation before setting up the queues, usually at
the beginning of operation. Document the requirement
for it in place.

Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
Acked-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 8adf7c0..fd1268e 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -366,11 +366,18 @@ static void vhost_vdpa_get_iova_range(struct vhost_vdpa 
*v)
 v->iova_range.last);
 }
 
-static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
+/*
+ * The use of this function is for requests that only need to be
+ * applied once. Typically such request occurs at the beginning 
+ * of operation, and before setting up queues. It should not be
+ * used for request that performs operation until all queues are
+ * set, which would need to check dev->vq_index_end instead.
+ */
+static bool vhost_vdpa_first_dev(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
 
-return v->index != 0;
+return v->index == 0;
 }
 
 static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
@@ -451,7 +458,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
 
 vhost_vdpa_get_iova_range(v);
 
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -594,7 +601,7 @@ static int vhost_vdpa_memslots_limit(struct vhost_dev *dev)
 static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
 struct vhost_memory *mem)
 {
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -623,7 +630,7 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
 struct vhost_vdpa *v = dev->opaque;
 int ret;
 
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -665,7 +672,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 
 features &= f;
 
-if (vhost_vdpa_one_time_request(dev)) {
+if (vhost_vdpa_first_dev(dev)) {
 r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features);
 if (r) {
 return -EFAULT;
@@ -1118,7 +1125,7 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, 
uint64_t base,
  struct vhost_log *log)
 {
 struct vhost_vdpa *v = dev->opaque;
-if (v->shadow_vqs_enabled || vhost_vdpa_one_time_request(dev)) {
+if (v->shadow_vqs_enabled || !vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -1240,7 +1247,7 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
 
 static int vhost_vdpa_set_owner(struct vhost_dev *dev)
 {
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
-- 
1.8.3.1




[PATCH v3 3/6] vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa

2022-05-05 Thread Si-Wei Liu
... such that no memory leaks on dangling net clients in case of
error.

Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 net/vhost-vdpa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1e9fe47..df1e69e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -306,7 +306,9 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 
 err:
 if (i) {
-qemu_del_net_client(ncs[0]);
+for (i--; i >= 0; i--) {
+qemu_del_net_client(ncs[i]);
+}
 }
 qemu_close(vdpa_device_fd);
 
-- 
1.8.3.1




[PATCH v3 2/6] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa

2022-05-05 Thread Si-Wei Liu
With MQ enabled vdpa device and non-MQ supporting guest e.g.
booting vdpa with mq=on over OVMF of single vqp, below assert
failure is seen:

../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= 
dev->vq_index && idx < dev->vq_index + dev->nvqs' failed.

0  0x7f8ce3ff3387 in raise () at /lib64/libc.so.6
1  0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6
2  0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6
3  0x7f8ce3fec252 in  () at /lib64/libc.so.6
4  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, 
idx=) at ../hw/virtio/vhost-vdpa.c:563
5  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, 
idx=) at ../hw/virtio/vhost-vdpa.c:558
6  0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, 
vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557
7  0x558f52c6b89a in virtio_pci_set_guest_notifier 
(d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, 
with_irqfd=with_irqfd@entry=false)
   at ../hw/virtio/virtio-pci.c:974
8  0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, 
nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019
9  0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, 
ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1)
   at ../hw/net/vhost_net.c:361
10 0x558f52d4e5e7 in virtio_net_set_status (status=, 
n=0x558f568f91f0) at ../hw/net/virtio-net.c:289
11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 
'\017') at ../hw/net/virtio-net.c:370
12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, 
val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945
13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, 
addr=, val=, size=) at 
../hw/virtio/virtio-pci.c:1292
14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, 
addr=20, value=, size=1, shift=, mask=, attrs=...)
   at ../softmmu/memory.c:492
15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x558f52d15cf0 
, mr=0x558f568f19d0, attrs=...) at 
../softmmu/memory.c:554
16 0x558f52d157ef in memory_region_dispatch_write 
(mr=mr@entry=0x558f568f19d0, addr=20, data=, op=, 
attrs=attrs@entry=...)
   at ../softmmu/memory.c:1504
17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, 
addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, 
len=len@entry=1, addr1=, l=, mr=0x558f568f19d0) 
at /home/opc/qemu-upstream/include/qemu/host-utils.h:165
18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, 
attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822
19 0x558f52d0b36b in address_space_write (as=, 
addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=)
   at ../softmmu/physmem.c:2914
20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=...,
   attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=, 
is_write=) at ../softmmu/physmem.c:2924
21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at 
../accel/kvm/kvm-all.c:2903
22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at 
../accel/kvm/kvm-accel-ops.c:49
23 0x558f52f9f04a in qemu_thread_start (args=) at 
../util/qemu-thread-posix.c:556
24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0
25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6

The cause of the assert failure is that the vhost_dev index
for the ctrl vq was not aligned with the actual one in use by the guest.
Upon multiqueue feature negotiation in virtio_net_set_multiqueue(),
if guest doesn't support multiqueue, the guest vq layout would shrink
to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl).
This results in ctrl_vq taking a different vhost_dev group index than
the default. We can map vq to the correct vhost_dev group by checking
if MQ is supported by guest and successfully negotiated. Since the
MQ feature is only present along with CTRL_VQ, we ensure the index
2 is only meant for the control vq while MQ is not supported by guest.

Fixes: 22288fe ("virtio-net: vhost control virtqueue support")
Suggested-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
---
 hw/net/virtio-net.c | 33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ffb3475..f0bb29c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/atomic.h"
 #include "qemu/iov.h"
+#include "qemu/log.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
 #include "hw/virtio/virtio.h"
@@ -3171,8 +3172,22 @@ static NetClientInfo net_virtio_info = {
 static bool virtio_net_guest
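
The diff above is also cut off in this archive. A hedged sketch of the
index-to-queue mapping the commit message describes (not necessarily the
exact upstream hunk):

/* Sketch: pick the NetClientState backing a given vq index.  Without
 * VIRTIO_NET_F_MQ the guest layout is rx/tx/ctrl, so index 2 is the
 * control vq and must map to the last vhost_dev group. */
static NetClientState *sketch_vq_index_to_nc(VirtIONet *n, VirtIODevice *vdev,
                                             int idx)
{
    if (!virtio_vdev_has_feature(vdev, VIRTIO_NET_F_MQ) && idx == 2) {
        return qemu_get_subqueue(n->nic, n->max_queue_pairs);
    }
    return qemu_get_subqueue(n->nic, vq2q(idx));  /* vq2q(): idx / 2 */
}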

[PATCH v3 0/6] vhost-vdpa multiqueue fixes

2022-05-05 Thread Si-Wei Liu
Hi,

This patch series attempts to fix a few issues in vhost-vdpa multiqueue 
functionality.

Patch #1 and #2 are the formal submission for RFC patch in:
https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/

Patches #3 through #5 are obviously small bug fixes. Please find the description of
each in the commit log.

Patch #6 is a workaround fix for the QEMU segfault described in:
https://lore.kernel.org/qemu-devel/4f2acb7a-d436-9d97-80b1-3308c1b39...@oracle.com/


Thanks,
-Siwei

---
v3:
  - switch to LOG_GUEST_ERROR for guest trigger-able error
  - add temporary band-aid fix for QEMU crash due to recursive call
v2:
  - split off vhost_dev notifier patch from "align ctrl_vq index for non-mq
guest for vhost_vdpa"
  - change assert to error message
  - rename vhost_vdpa_one_time_request to vhost_vdpa_first_dev for clarity

Si-Wei Liu (6):
  virtio-net: setup vhost_dev and notifiers for cvq only when feature is
negotiated
  virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
  vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa
  vhost-net: fix improper cleanup in vhost_net_start
  vhost-vdpa: backend feature should set only once
  virtio-net: don't handle mq request in userspace handler for
vhost-vdpa

 hw/net/vhost_net.c |  4 +++-
 hw/net/virtio-net.c| 49 ++---
 hw/virtio/vhost-vdpa.c | 23 +++
 net/vhost-vdpa.c   |  4 +++-
 4 files changed, 67 insertions(+), 13 deletions(-)

-- 
1.8.3.1




[PATCH v3 6/6] virtio-net: don't handle mq request in userspace handler for vhost-vdpa

2022-05-05 Thread Si-Wei Liu
virtio_queue_host_notifier_read() tends to read the pending event
left behind on the ioeventfd in the vhost_net_stop() path, and
attempts to handle outstanding kicks from the userspace vq handler.
However, in the ctrl_vq handler, virtio_net_handle_mq() has a
recursive call into virtio_net_set_status(), which may lead to a
segmentation fault as shown in the stack trace below:

0  0x55f800df1780 in qdev_get_parent_bus (dev=0x0) at ../hw/core/qdev.c:376
1  0x55f800c68ad8 in virtio_bus_device_iommu_enabled (vdev=vdev@entry=0x0) 
at ../hw/virtio/virtio-bus.c:331
2  0x55f800d70d7f in vhost_memory_unmap (dev=) at 
../hw/virtio/vhost.c:318
3  0x55f800d70d7f in vhost_memory_unmap (dev=, 
buffer=0x7fc19bec5240, len=2052, is_write=1, access_len=2052) at 
../hw/virtio/vhost.c:336
4  0x55f800d71867 in vhost_virtqueue_stop (dev=dev@entry=0x55f8037ccc30, 
vdev=vdev@entry=0x55f8044ec590, vq=0x55f8037cceb0, idx=0) at 
../hw/virtio/vhost.c:1241
5  0x55f800d7406c in vhost_dev_stop (hdev=hdev@entry=0x55f8037ccc30, 
vdev=vdev@entry=0x55f8044ec590) at ../hw/virtio/vhost.c:1839
6  0x55f800bf00a7 in vhost_net_stop_one (net=0x55f8037ccc30, 
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:315
7  0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, 
ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1)
   at ../hw/net/vhost_net.c:423
8  0x55f800d4e628 in virtio_net_set_status (status=, 
n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
9  0x55f800d4e628 in virtio_net_set_status (vdev=vdev@entry=0x55f8044ec590, 
status=15 '\017') at ../hw/net/virtio-net.c:370
10 0x55f800d534d8 in virtio_net_handle_ctrl (iov_cnt=, 
iov=, cmd=0 '\000', n=0x55f8044ec590) at 
../hw/net/virtio-net.c:1408
11 0x55f800d534d8 in virtio_net_handle_ctrl (vdev=0x55f8044ec590, 
vq=0x7fc1a7e888d0) at ../hw/net/virtio-net.c:1452
12 0x55f800d69f37 in virtio_queue_host_notifier_read (vq=0x7fc1a7e888d0) at 
../hw/virtio/virtio.c:2331
13 0x55f800d69f37 in virtio_queue_host_notifier_read 
(n=n@entry=0x7fc1a7e8894c) at ../hw/virtio/virtio.c:3575
14 0x55f800c688e6 in virtio_bus_cleanup_host_notifier (bus=, 
n=n@entry=14) at ../hw/virtio/virtio-bus.c:312
15 0x55f800d73106 in vhost_dev_disable_notifiers 
(hdev=hdev@entry=0x55f8035b51b0, vdev=vdev@entry=0x55f8044ec590)
   at ../../../include/hw/virtio/virtio-bus.h:35
16 0x55f800bf00b2 in vhost_net_stop_one (net=0x55f8035b51b0, 
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:316
17 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, 
ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1)
   at ../hw/net/vhost_net.c:423
18 0x55f800d4e628 in virtio_net_set_status (status=, 
n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
19 0x55f800d4e628 in virtio_net_set_status (vdev=0x55f8044ec590, status=15 
'\017') at ../hw/net/virtio-net.c:370
20 0x55f800d6c4b2 in virtio_set_status (vdev=0x55f8044ec590, val=) at ../hw/virtio/virtio.c:1945
21 0x55f800d11d9d in vm_state_notify (running=running@entry=false, 
state=state@entry=RUN_STATE_SHUTDOWN) at ../softmmu/runstate.c:333
22 0x55f800d04e7a in do_vm_stop (state=state@entry=RUN_STATE_SHUTDOWN, 
send_stop=send_stop@entry=false) at ../softmmu/cpus.c:262
23 0x55f800d04e99 in vm_shutdown () at ../softmmu/cpus.c:280
24 0x55f800d126af in qemu_cleanup () at ../softmmu/runstate.c:812
25 0x55f800ad5b13 in main (argc=, argv=, 
envp=) at ../softmmu/main.c:51

For now, temporarily disable handling the MQ request from the ctrl_vq
userspace handler to avoid the recursive virtio_net_set_status()
call. Some rework is needed to allow changing the number of
queues without going through a full virtio_net_set_status cycle,
particularly for the vhost-vdpa backend.

This patch will need to be reverted as soon as future patches of
having the change of #queues handled in userspace is merged.

Fixes: 402378407db ("vhost-vdpa: multiqueue support")
Signed-off-by: Si-Wei Liu 
---
 hw/net/virtio-net.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index f0bb29c..e263116 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1381,6 +1381,7 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd,
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 uint16_t queue_pairs;
+NetClientState *nc = qemu_get_queue(n->nic);
 
 virtio_net_disable_rss(n);
 if (cmd == VIRTIO_NET_CTRL_MQ_HASH_CONFIG) {
@@ -1412,6 +1413,18 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t 
cmd,
 return VIRTIO_NET_ERR;
 }
 
+/* Avoid changing the number of queue_pairs for vdpa device in
+ * userspace handler. A future fix is needed to handle the mq
+ * change in userspace handler with vhost-vdpa. Let's disable
+ * the mq handling from userspace for now and only allow get
+ * done through the kernel. Ripples may be seen when falling
+   

[PATCH v3 1/6] virtio-net: setup vhost_dev and notifiers for cvq only when feature is negotiated

2022-05-05 Thread Si-Wei Liu
When the control virtqueue feature is absent or not negotiated,
vhost_net_start() still tries to set up vhost_dev and install
vhost notifiers for the control virtqueue, which results in
erroneous ioctl calls with an incorrect queue index being sent
down to the driver. Do that only when needed.

Fixes: 22288fe ("virtio-net: vhost control virtqueue support")
Signed-off-by: Si-Wei Liu 
---
 hw/net/virtio-net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1067e72..ffb3475 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -245,7 +245,8 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t 
status)
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 NetClientState *nc = qemu_get_queue(n->nic);
 int queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
-int cvq = n->max_ncs - n->max_queue_pairs;
+int cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+  n->max_ncs - n->max_queue_pairs : 0;
 
 if (!get_vhost_net(nc->peer)) {
 return;
-- 
1.8.3.1




[PATCH v3 4/6] vhost-net: fix improper cleanup in vhost_net_start

2022-05-05 Thread Si-Wei Liu
vhost_net_start() missed a corresponding stop_one() upon error from
vhost_set_vring_enable(). While at it, make the error handling for
err_start more robust. No real issue was found due to this though.

Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 hw/net/vhost_net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2..d6d7c51 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -381,6 +381,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
 r = vhost_set_vring_enable(peer, peer->vring_enable);
 
 if (r < 0) {
+vhost_net_stop_one(get_vhost_net(peer), dev);
 goto err_start;
 }
 }
@@ -390,7 +391,8 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
 
 err_start:
 while (--i >= 0) {
-peer = qemu_get_peer(ncs , i);
+peer = qemu_get_peer(ncs, i < data_queue_pairs ?
+  i : n->max_queue_pairs);
 vhost_net_stop_one(get_vhost_net(peer), dev);
 }
 e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
-- 
1.8.3.1




[PATCH v4 1/7] virtio-net: setup vhost_dev and notifiers for cvq only when feature is negotiated

2022-05-06 Thread Si-Wei Liu
When the control virtqueue feature is absent or not negotiated,
vhost_net_start() still tries to set up vhost_dev and install
vhost notifiers for the control virtqueue, which results in
erroneous ioctl calls with an incorrect queue index being sent
down to the driver. Do that only when needed.

Fixes: 22288fe ("virtio-net: vhost control virtqueue support")
Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 hw/net/virtio-net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 1067e72..ffb3475 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -245,7 +245,8 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t 
status)
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 NetClientState *nc = qemu_get_queue(n->nic);
 int queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
-int cvq = n->max_ncs - n->max_queue_pairs;
+int cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+  n->max_ncs - n->max_queue_pairs : 0;
 
 if (!get_vhost_net(nc->peer)) {
 return;
-- 
1.8.3.1




[PATCH v4 6/7] vhost-vdpa: change name and polarity for vhost_vdpa_one_time_request()

2022-05-06 Thread Si-Wei Liu
The name vhost_vdpa_one_time_request() was confusing. Whatever
it returns, its typical occurrence had always been at requests
that only need to be applied once. And the name didn't suggest
what it actually checks for. Change it to vhost_vdpa_first_dev()
with its polarity flipped for better readability of the code.
That way it is able to reflect what the check is really about.

This call is applicable to requests that perform an operation
only once, before queues are set up, usually at the beginning
of the caller function. Document the requirement for it in place.

Signed-off-by: Si-Wei Liu 
---
 hw/virtio/vhost-vdpa.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 6e3dbd9..33dcaa1 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -366,11 +366,18 @@ static void vhost_vdpa_get_iova_range(struct vhost_vdpa 
*v)
 v->iova_range.last);
 }
 
-static bool vhost_vdpa_one_time_request(struct vhost_dev *dev)
+/*
+ * The use of this function is for requests that only need to be
+ * applied once. Typically such request occurs at the beginning
+ * of operation, and before setting up queues. It should not be
+ * used for request that performs operation until all queues are
+ * set, which would need to check dev->vq_index_end instead.
+ */
+static bool vhost_vdpa_first_dev(struct vhost_dev *dev)
 {
 struct vhost_vdpa *v = dev->opaque;
 
-return v->index != 0;
+return v->index == 0;
 }
 
 static int vhost_vdpa_get_dev_features(struct vhost_dev *dev,
@@ -451,7 +458,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void 
*opaque, Error **errp)
 
 vhost_vdpa_get_iova_range(v);
 
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -594,7 +601,7 @@ static int vhost_vdpa_memslots_limit(struct vhost_dev *dev)
 static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
 struct vhost_memory *mem)
 {
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -623,7 +630,7 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
 struct vhost_vdpa *v = dev->opaque;
 int ret;
 
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -665,7 +672,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 
 features &= f;
 
-if (!vhost_vdpa_one_time_request(dev)) {
+if (vhost_vdpa_first_dev(dev)) {
 r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features);
 if (r) {
 return -EFAULT;
@@ -1118,7 +1125,7 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, 
uint64_t base,
  struct vhost_log *log)
 {
 struct vhost_vdpa *v = dev->opaque;
-if (v->shadow_vqs_enabled || vhost_vdpa_one_time_request(dev)) {
+if (v->shadow_vqs_enabled || !vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
@@ -1240,7 +1247,7 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev,
 
 static int vhost_vdpa_set_owner(struct vhost_dev *dev)
 {
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_first_dev(dev)) {
 return 0;
 }
 
-- 
1.8.3.1




[PATCH v4 0/7] vhost-vdpa multiqueue fixes

2022-05-06 Thread Si-Wei Liu
Hi,

This patch series attempts to fix a few issues in vhost-vdpa multiqueue 
functionality.

Patch #1 and #2 are the formal submission for RFC patch as in:
https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/

Patches #3 through #6 are obviously small bug fixes. Please find the description of
each in the commit log.

Patch #7 is a workaround fix for the QEMU segfault described in:
https://lore.kernel.org/qemu-devel/4f2acb7a-d436-9d97-80b1-3308c1b39...@oracle.com/


Thanks,
-Siwei

---
v4:
  - split off the vhost_vdpa_set_backend_cap patch

v3:
  - switch to LOG_GUEST_ERROR for guest trigger-able error
  - add temporary band-aid fix for QEMU crash due to recursive call

v2:
  - split off vhost_dev notifier patch from "align ctrl_vq index for non-mq
guest for vhost_vdpa"
  - change assert to error message
  - rename vhost_vdpa_one_time_request to vhost_vdpa_first_dev for clarity

---
Si-Wei Liu (7):
  virtio-net: setup vhost_dev and notifiers for cvq only when feature is
negotiated
  virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
  vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa
  vhost-net: fix improper cleanup in vhost_net_start
  vhost-vdpa: backend feature should set only once
  vhost-vdpa: change name and polarity for vhost_vdpa_one_time_request()
  virtio-net: don't handle mq request in userspace handler for
vhost-vdpa

 hw/net/vhost_net.c |  4 +++-
 hw/net/virtio-net.c| 49 ++---
 hw/virtio/vhost-vdpa.c | 23 +++
 net/vhost-vdpa.c   |  4 +++-
 4 files changed, 67 insertions(+), 13 deletions(-)

-- 
1.8.3.1




[PATCH v4 3/7] vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa

2022-05-06 Thread Si-Wei Liu
... such that no memory leaks on dangling net clients in case of
error.

Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 net/vhost-vdpa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1e9fe47..df1e69e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -306,7 +306,9 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 
 err:
 if (i) {
-qemu_del_net_client(ncs[0]);
+for (i--; i >= 0; i--) {
+qemu_del_net_client(ncs[i]);
+}
 }
 qemu_close(vdpa_device_fd);
 
-- 
1.8.3.1




[PATCH v4 2/7] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa

2022-05-06 Thread Si-Wei Liu
With MQ enabled vdpa device and non-MQ supporting guest e.g.
booting vdpa with mq=on over OVMF of single vqp, below assert
failure is seen:

../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= 
dev->vq_index && idx < dev->vq_index + dev->nvqs' failed.

0  0x7f8ce3ff3387 in raise () at /lib64/libc.so.6
1  0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6
2  0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6
3  0x7f8ce3fec252 in  () at /lib64/libc.so.6
4  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, 
idx=) at ../hw/virtio/vhost-vdpa.c:563
5  0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, 
idx=) at ../hw/virtio/vhost-vdpa.c:558
6  0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, 
vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557
7  0x558f52c6b89a in virtio_pci_set_guest_notifier 
(d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, 
with_irqfd=with_irqfd@entry=false)
   at ../hw/virtio/virtio-pci.c:974
8  0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, 
nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019
9  0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, 
ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1)
   at ../hw/net/vhost_net.c:361
10 0x558f52d4e5e7 in virtio_net_set_status (status=, 
n=0x558f568f91f0) at ../hw/net/virtio-net.c:289
11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 
'\017') at ../hw/net/virtio-net.c:370
12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, 
val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945
13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, 
addr=, val=, size=) at 
../hw/virtio/virtio-pci.c:1292
14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, 
addr=20, value=, size=1, shift=, mask=, attrs=...)
   at ../softmmu/memory.c:492
15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x558f52d15cf0 
, mr=0x558f568f19d0, attrs=...) at 
../softmmu/memory.c:554
16 0x558f52d157ef in memory_region_dispatch_write 
(mr=mr@entry=0x558f568f19d0, addr=20, data=, op=, 
attrs=attrs@entry=...)
   at ../softmmu/memory.c:1504
17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, 
addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, 
len=len@entry=1, addr1=, l=, mr=0x558f568f19d0) 
at /home/opc/qemu-upstream/include/qemu/host-utils.h:165
18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, 
attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822
19 0x558f52d0b36b in address_space_write (as=, 
addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=)
   at ../softmmu/physmem.c:2914
20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=...,
   attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=, 
is_write=) at ../softmmu/physmem.c:2924
21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at 
../accel/kvm/kvm-all.c:2903
22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at 
../accel/kvm/kvm-accel-ops.c:49
23 0x558f52f9f04a in qemu_thread_start (args=) at 
../util/qemu-thread-posix.c:556
24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0
25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6

The assert failure is caused by the vhost_dev index for the ctrl vq
not being aligned with the actual one in use by the guest.
Upon multiqueue feature negotiation in virtio_net_set_multiqueue(),
if the guest doesn't support multiqueue, the guest vq layout shrinks
to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl).
This results in ctrl_vq taking a different vhost_dev group index than
the default. We can map the vq to the correct vhost_dev group by
checking whether MQ is supported by the guest and successfully
negotiated. Since the MQ feature is only present along with CTRL_VQ,
we ensure that index 2 is only meant for the control vq while MQ is
not supported by the guest.
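
As an illustration of the index math involved, here is a minimal
sketch (the helper name is made up for illustration; this is not code
taken from the patch):

/* Index of the control vq in the guest-visible layout. With MQ
 * negotiated the ctrl vq follows all rx/tx pairs; without MQ the
 * guest only sees one queue pair, so the ctrl vq sits at index 2. */
static int ctrl_vq_index(bool mq_negotiated, int max_queue_pairs)
{
    return mq_negotiated ? max_queue_pairs * 2 : 2;
}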

Fixes: 22288fe ("virtio-net: vhost control virtqueue support")
Suggested-by: Jason Wang 
Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 hw/net/virtio-net.c | 33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index ffb3475..f0bb29c 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/atomic.h"
 #include "qemu/iov.h"
+#include "qemu/log.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
 #include "hw/virtio/virtio.h"
@@ -3171,8 +3172,22 @@ static NetClientInfo net_virtio_info = {
 static

[PATCH v4 4/7] vhost-net: fix improper cleanup in vhost_net_start

2022-05-06 Thread Si-Wei Liu
vhost_net_start() missed a corresponding stop_one() upon error from
vhost_set_vring_enable(). While at it, make the error handling for
err_start more robust. No real issue was found due to this though.
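
For context, an illustrative note (not part of the patch): in
virtio-net the NetClientState array keeps the data queue pairs first
and the ctrl vq client last, which is why the unwind path has to look
up the peer at n->max_queue_pairs once i goes past data_queue_pairs:

    /* data vqs live at indexes [0, data_queue_pairs); the ctrl vq
     * client sits at n->max_queue_pairs, after all data pairs */
    peer = qemu_get_peer(ncs, i < data_queue_pairs ?
                              i : n->max_queue_pairs);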

Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 hw/net/vhost_net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2..d6d7c51 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -381,6 +381,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
 r = vhost_set_vring_enable(peer, peer->vring_enable);
 
 if (r < 0) {
+vhost_net_stop_one(get_vhost_net(peer), dev);
 goto err_start;
 }
 }
@@ -390,7 +391,8 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs,
 
 err_start:
 while (--i >= 0) {
-peer = qemu_get_peer(ncs , i);
+peer = qemu_get_peer(ncs, i < data_queue_pairs ?
+  i : n->max_queue_pairs);
 vhost_net_stop_one(get_vhost_net(peer), dev);
 }
 e = k->set_guest_notifiers(qbus->parent, total_notifiers, false);
-- 
1.8.3.1




[PATCH v4 5/7] vhost-vdpa: backend feature should set only once

2022-05-06 Thread Si-Wei Liu
The vhost_vdpa_one_time_request() branch in
vhost_vdpa_set_backend_cap() incorrectly sends down
ioctls on vhost_dev with non-zero index. This may
end up with multiple VHOST_SET_BACKEND_FEATURES
ioctl calls sent down on the vhost-vdpa fd that is
shared between all these vhost_dev's.

To fix it, send down the ioctl only once, via the first
vhost_dev with index 0. Toggling the polarity of the
vhost_vdpa_one_time_request() test does the trick.

Fixes: 4d191cfdc7de ("vhost-vdpa: classify one time request")
Signed-off-by: Si-Wei Liu 
Reviewed-by: Stefano Garzarella 
Acked-by: Jason Wang 
Acked-by: Eugenio Pérez 
---
 hw/virtio/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 8adf7c0..6e3dbd9 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -665,7 +665,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 
 features &= f;
 
-if (vhost_vdpa_one_time_request(dev)) {
+if (!vhost_vdpa_one_time_request(dev)) {
 r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features);
 if (r) {
 return -EFAULT;
-- 
1.8.3.1




[PATCH v4 7/7] virtio-net: don't handle mq request in userspace handler for vhost-vdpa

2022-05-06 Thread Si-Wei Liu
virtio_queue_host_notifier_read() tends to read the pending event
left behind on the ioeventfd in the vhost_net_stop() path, and
attempts to handle outstanding kicks from the userspace vq handler.
However, in the ctrl_vq handler, virtio_net_handle_mq() has a
recursive call into virtio_net_set_status(), which may lead to a
segmentation fault as shown in the stack trace below:

0  0x55f800df1780 in qdev_get_parent_bus (dev=0x0) at ../hw/core/qdev.c:376
1  0x55f800c68ad8 in virtio_bus_device_iommu_enabled (vdev=vdev@entry=0x0) 
at ../hw/virtio/virtio-bus.c:331
2  0x55f800d70d7f in vhost_memory_unmap (dev=) at 
../hw/virtio/vhost.c:318
3  0x55f800d70d7f in vhost_memory_unmap (dev=, 
buffer=0x7fc19bec5240, len=2052, is_write=1, access_len=2052) at 
../hw/virtio/vhost.c:336
4  0x55f800d71867 in vhost_virtqueue_stop (dev=dev@entry=0x55f8037ccc30, 
vdev=vdev@entry=0x55f8044ec590, vq=0x55f8037cceb0, idx=0) at 
../hw/virtio/vhost.c:1241
5  0x55f800d7406c in vhost_dev_stop (hdev=hdev@entry=0x55f8037ccc30, 
vdev=vdev@entry=0x55f8044ec590) at ../hw/virtio/vhost.c:1839
6  0x55f800bf00a7 in vhost_net_stop_one (net=0x55f8037ccc30, 
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:315
7  0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, 
ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1)
   at ../hw/net/vhost_net.c:423
8  0x55f800d4e628 in virtio_net_set_status (status=, 
n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
9  0x55f800d4e628 in virtio_net_set_status (vdev=vdev@entry=0x55f8044ec590, 
status=15 '\017') at ../hw/net/virtio-net.c:370
10 0x55f800d534d8 in virtio_net_handle_ctrl (iov_cnt=, 
iov=, cmd=0 '\000', n=0x55f8044ec590) at 
../hw/net/virtio-net.c:1408
11 0x55f800d534d8 in virtio_net_handle_ctrl (vdev=0x55f8044ec590, 
vq=0x7fc1a7e888d0) at ../hw/net/virtio-net.c:1452
12 0x55f800d69f37 in virtio_queue_host_notifier_read (vq=0x7fc1a7e888d0) at 
../hw/virtio/virtio.c:2331
13 0x55f800d69f37 in virtio_queue_host_notifier_read 
(n=n@entry=0x7fc1a7e8894c) at ../hw/virtio/virtio.c:3575
14 0x55f800c688e6 in virtio_bus_cleanup_host_notifier (bus=, 
n=n@entry=14) at ../hw/virtio/virtio-bus.c:312
15 0x55f800d73106 in vhost_dev_disable_notifiers 
(hdev=hdev@entry=0x55f8035b51b0, vdev=vdev@entry=0x55f8044ec590)
   at ../../../include/hw/virtio/virtio-bus.h:35
16 0x55f800bf00b2 in vhost_net_stop_one (net=0x55f8035b51b0, 
dev=0x55f8044ec590) at ../hw/net/vhost_net.c:316
17 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, 
ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1)
   at ../hw/net/vhost_net.c:423
18 0x55f800d4e628 in virtio_net_set_status (status=, 
n=0x55f8044ec590) at ../hw/net/virtio-net.c:296
19 0x55f800d4e628 in virtio_net_set_status (vdev=0x55f8044ec590, status=15 
'\017') at ../hw/net/virtio-net.c:370
20 0x55f800d6c4b2 in virtio_set_status (vdev=0x55f8044ec590, val=) at ../hw/virtio/virtio.c:1945
21 0x55f800d11d9d in vm_state_notify (running=running@entry=false, 
state=state@entry=RUN_STATE_SHUTDOWN) at ../softmmu/runstate.c:333
22 0x55f800d04e7a in do_vm_stop (state=state@entry=RUN_STATE_SHUTDOWN, 
send_stop=send_stop@entry=false) at ../softmmu/cpus.c:262
23 0x55f800d04e99 in vm_shutdown () at ../softmmu/cpus.c:280
24 0x55f800d126af in qemu_cleanup () at ../softmmu/runstate.c:812
25 0x55f800ad5b13 in main (argc=, argv=, 
envp=) at ../softmmu/main.c:51

For now, temporarily disable handling the MQ request from the ctrl_vq
userspace handler to avoid the recursive virtio_net_set_status()
call. Some rework is needed to allow changing the number of
queues without going through a full virtio_net_set_status cycle,
particularly for the vhost-vdpa backend.

This patch will need to be reverted as soon as future patches that
handle the change of #queues in the userspace handler are merged.

Fixes: 402378407db ("vhost-vdpa: multiqueue support")
Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 hw/net/virtio-net.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index f0bb29c..099e650 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1381,6 +1381,7 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd,
 {
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 uint16_t queue_pairs;
+NetClientState *nc = qemu_get_queue(n->nic);
 
 virtio_net_disable_rss(n);
 if (cmd == VIRTIO_NET_CTRL_MQ_HASH_CONFIG) {
@@ -1412,6 +1413,18 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t 
cmd,
 return VIRTIO_NET_ERR;
 }
 
+/* Avoid changing the number of queue_pairs for vdpa device in
+ * userspace handler. A future fix is needed to handle the mq
+ * change in userspace handler with vhost-vdpa. Let's disable
+ * the mq handling from userspace for now and only allow get
+ * done through the kernel. Ripples may be seen when falling
+ * back to userspace, but without doing it qemu process would
+ * crash on a recursive entry to virtio_net_set_status().
+ */
+if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
+return VIRTIO_NET_ERR;
+}
+

Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends

2022-08-24 Thread Si-Wei Liu




On 8/23/2022 9:27 PM, Jason Wang wrote:


在 2022/8/20 01:13, Eugenio Pérez 写道:

It was returned as error before. Instead of it, simply update the
corresponding field so qemu can send it in the migration data.

Signed-off-by: Eugenio Pérez 
---



Looks correct.

Adding Si Wei for double check.
Hmmm, I understand why this change is needed for live migration, but 
this would easily cause userspace out of sync with the kernel for other 
use cases, such as link down or userspace fallback due to vdpa ioctl 
error. Yes, these are edge cases. Not completely against it, but I 
wonder if there's a way we can limit the change scope to live migration 
case only?


-Siwei



Thanks



  hw/net/virtio-net.c | 17 ++---
  1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index dd0d056fde..63a8332cd0 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n, 
uint8_t cmd,

  return VIRTIO_NET_ERR;
  }
  -    /* Avoid changing the number of queue_pairs for vdpa device in
- * userspace handler. A future fix is needed to handle the mq
- * change in userspace handler with vhost-vdpa. Let's disable
- * the mq handling from userspace for now and only allow get
- * done through the kernel. Ripples may be seen when falling
- * back to userspace, but without doing it qemu process would
- * crash on a recursive entry to virtio_net_set_status().
- */
+    n->curr_queue_pairs = queue_pairs;
  if (nc->peer && nc->peer->info->type == 
NET_CLIENT_DRIVER_VHOST_VDPA) {

-    return VIRTIO_NET_ERR;
+    /*
+ * Avoid updating the backend for a vdpa device: We're only 
interested

+ * in updating the device model queues.
+ */
+    return VIRTIO_NET_OK;
  }
-
-    n->curr_queue_pairs = queue_pairs;
  /* stop the backend before changing the number of queue_pairs 
to avoid handling a

   * disabled queue */
  virtio_net_set_status(vdev, vdev->status);







Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends

2022-08-25 Thread Si-Wei Liu

Hi Jason,

On 8/24/2022 7:53 PM, Jason Wang wrote:

On Thu, Aug 25, 2022 at 8:38 AM Si-Wei Liu  wrote:



On 8/23/2022 9:27 PM, Jason Wang wrote:

在 2022/8/20 01:13, Eugenio Pérez 写道:

It was returned as error before. Instead of it, simply update the
corresponding field so qemu can send it in the migration data.

Signed-off-by: Eugenio Pérez 
---


Looks correct.

Adding Si Wei for double check.

Hmmm, I understand why this change is needed for live migration, but
this would easily cause userspace out of sync with the kernel for other
use cases, such as link down or userspace fallback due to vdpa ioctl
error. Yes, these are edge cases.

Considering 7.2 will start, maybe it's time to fix the root cause
instead of having a workaround like this?
The fix for the immediate cause is not hard, though what is missing from
my WIP series for a full-blown fix is something similar to Shadow CVQ
for all general cases, not just live migration: QEMU will have to apply
the curr_queue_pairs to the kernel once switched back from the userspace
virtqueues. I think Shadow CVQ won't work if ASID support is missing
from the kernel. Do you think it is worth building another similar
facility, or should we reuse the Shadow CVQ code to make it work without
ASID support?


I have been a bit busy with an internal project at the moment, but I
hope I can post my series next week. Here's what I have for the
relevant patches from the WIP series.


diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index dd0d056..16edfa3 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -361,16 +361,13 @@ static void 
virtio_net_drop_tx_queue_data(VirtIODevice *vdev, VirtQueue *vq)

 }
 }

-static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t 
status)

+static void virtio_net_queue_status(struct VirtIONet *n, uint8_t status)
 {
-    VirtIONet *n = VIRTIO_NET(vdev);
+    VirtIODevice *vdev = VIRTIO_DEVICE(n);
 VirtIONetQueue *q;
 int i;
 uint8_t queue_status;

-    virtio_net_vnet_endian_status(n, status);
-    virtio_net_vhost_status(n, status);
-
 for (i = 0; i < n->max_queue_pairs; i++) {
 NetClientState *ncs = qemu_get_subqueue(n->nic, i);
 bool queue_started;
@@ -418,6 +415,15 @@ static void virtio_net_set_status(struct 
VirtIODevice *vdev, uint8_t status)

 }
 }

+static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t 
status)

+{
+    VirtIONet *n = VIRTIO_NET(vdev);
+
+    virtio_net_vnet_endian_status(n, status);
+    virtio_net_vhost_status(n, status);
+    virtio_net_queue_status(n, status);
+}
+
 static void virtio_net_set_link_status(NetClientState *nc)
 {
 VirtIONet *n = qemu_get_nic_opaque(nc);
@@ -1380,7 +1386,6 @@ static int virtio_net_handle_mq(VirtIONet *n, 
uint8_t cmd,

 {
 VirtIODevice *vdev = VIRTIO_DEVICE(n);
 uint16_t queue_pairs;
-    NetClientState *nc = qemu_get_queue(n->nic);

 virtio_net_disable_rss(n);
 if (cmd == VIRTIO_NET_CTRL_MQ_HASH_CONFIG) {
@@ -1412,22 +1417,10 @@ static int virtio_net_handle_mq(VirtIONet *n, 
uint8_t cmd,

 return VIRTIO_NET_ERR;
 }

-    /* Avoid changing the number of queue_pairs for vdpa device in
- * userspace handler. A future fix is needed to handle the mq
- * change in userspace handler with vhost-vdpa. Let's disable
- * the mq handling from userspace for now and only allow get
- * done through the kernel. Ripples may be seen when falling
- * back to userspace, but without doing it qemu process would
- * crash on a recursive entry to virtio_net_set_status().
- */
-    if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-    return VIRTIO_NET_ERR;
-    }
-
 n->curr_queue_pairs = queue_pairs;
 /* stop the backend before changing the number of queue_pairs to 
avoid handling a

  * disabled queue */
-    virtio_net_set_status(vdev, vdev->status);
+    virtio_net_queue_status(n, vdev->status);
 virtio_net_set_queue_pairs(n);

 return VIRTIO_NET_OK;


Regards,
-Siwei


THanks


Not completely against it, but I
wonder if there's a way we can limit the change scope to live migration
case only?

-Siwei


Thanks



   hw/net/virtio-net.c | 17 ++---
   1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index dd0d056fde..63a8332cd0 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n,
uint8_t cmd,
   return VIRTIO_NET_ERR;
   }
   -/* Avoid changing the number of queue_pairs for vdpa device in
- * userspace handler. A future fix is needed to handle the mq
- * change in userspace handler with vhost-vdpa. Let's disable
- * the mq handling from userspace for now and only allow get
- * done through the kernel. Ripples may be seen when falling
- * back to userspace, but wi

Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends

2022-08-25 Thread Si-Wei Liu




On 8/24/2022 8:05 PM, Jason Wang wrote:

On Thu, Aug 25, 2022 at 10:53 AM Jason Wang  wrote:

On Thu, Aug 25, 2022 at 8:38 AM Si-Wei Liu  wrote:



On 8/23/2022 9:27 PM, Jason Wang wrote:

在 2022/8/20 01:13, Eugenio Pérez 写道:

It was returned as error before. Instead of it, simply update the
corresponding field so qemu can send it in the migration data.

Signed-off-by: Eugenio Pérez 
---


Looks correct.

Adding Si Wei for double check.

Hmmm, I understand why this change is needed for live migration, but
this would easily cause userspace out of sync with the kernel for other
use cases, such as link down or userspace fallback due to vdpa ioctl
error. Yes, these are edge cases.

Considering 7.2 will start, maybe it's time to fix the root cause
instead of having a workaround like this?

Btw, the patch actually tries its best to limit the behaviour, e.g it
doesn't do the following set_status() stuff. So I think it won't
trigger the issue you mentioned here?
Well, we can claim we don't support the link down+up case while changing
queue numbers in between. On the other hand, error recovery from the
userspace fallback is another story, which would need more attention and
care on the error path. Yes, if we see it from that perspective the
change is fine. For completeness, please refer to the patch in the other
email.


-Siwei



Thanks


THanks


Not completely against it, but I
wonder if there's a way we can limit the change scope to live migration
case only?

-Siwei


Thanks



   hw/net/virtio-net.c | 17 ++---
   1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index dd0d056fde..63a8332cd0 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n,
uint8_t cmd,
   return VIRTIO_NET_ERR;
   }
   -/* Avoid changing the number of queue_pairs for vdpa device in
- * userspace handler. A future fix is needed to handle the mq
- * change in userspace handler with vhost-vdpa. Let's disable
- * the mq handling from userspace for now and only allow get
- * done through the kernel. Ripples may be seen when falling
- * back to userspace, but without doing it qemu process would
- * crash on a recursive entry to virtio_net_set_status().
- */
+n->curr_queue_pairs = queue_pairs;
   if (nc->peer && nc->peer->info->type ==
NET_CLIENT_DRIVER_VHOST_VDPA) {
-return VIRTIO_NET_ERR;
+/*
+ * Avoid updating the backend for a vdpa device: We're only
interested
+ * in updating the device model queues.
+ */
+return VIRTIO_NET_OK;
   }
-
-n->curr_queue_pairs = queue_pairs;
   /* stop the backend before changing the number of queue_pairs
to avoid handling a
* disabled queue */
   virtio_net_set_status(vdev, vdev->status);





Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends

2022-08-25 Thread Si-Wei Liu




On 8/24/2022 11:19 PM, Eugenio Perez Martin wrote:

On Thu, Aug 25, 2022 at 2:38 AM Si-Wei Liu  wrote:



On 8/23/2022 9:27 PM, Jason Wang wrote:

在 2022/8/20 01:13, Eugenio Pérez 写道:

It was returned as error before. Instead of it, simply update the
corresponding field so qemu can send it in the migration data.

Signed-off-by: Eugenio Pérez 
---


Looks correct.

Adding Si Wei for double check.

Hmmm, I understand why this change is needed for live migration, but
this would easily cause userspace out of sync with the kernel for other
use cases, such as link down or userspace fallback due to vdpa ioctl
error. Yes, these are edge cases.

The link down case is not possible at this moment because that cvq
command does not call virtio_net_handle_ctrl_iov.
Right. Though shadow cvq would need to rely on extra ASID support from 
kernel. For the case without shadow cvq we still need to look for an 
alternative mechanism.



A similar treatment
than mq would be needed when supported, and the call to
virtio_net_set_status will be avoided.
So, maybe the seemingly "right" fix for the moment is to prohibit manual
set_link at all (for vDPA only)? In the longer term we'd need to come up
with appropriate support for applying the mq config regardless of asid or
shadow cvq support.
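
A minimal sketch of what such a prohibition could look like (purely
hypothetical placement in the set_link path, not an actual patch):

    if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
        error_setg(errp, "link state change is not supported for vhost-vdpa");
        return;
    }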




I'll double check device initialization ioctl failure with
n->curr_queue_pairs > 1 in the destination, but I think we should be
safe.


Not completely against it, but I
wonder if there's a way we can limit the change scope to live migration
case only?


The reason to update the device model is to send the curr_queue_pairs
to the destination in a backend agnostic way. To send it otherwise
would limit the live migration possibilities, but sure we can explore
another way.
A hacky workaround that came off the top of my head was to allow sending
curr_queue_pairs for the !vm_running case for vdpa. It doesn't look like
it would affect other backends, I think. But I agree with Jason that this
doesn't look decent, so I gave up on the idea. Hence for this patch,


Acked-by: Si-Wei Liu 



Thanks!






Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier

2023-02-13 Thread Si-Wei Liu




On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote:

On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu  wrote:



On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:

On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu  wrote:


On 1/12/2023 9:24 AM, Eugenio Pérez wrote:

This allows net to restart the device backend to configure SVQ on it.

Ideally, these changes should not be net specific. However, the vdpa net
backend is the one with enough knowledge to configure everything because
of some reasons:
* Queues might need to be shadowed or not depending on its kind (control
 vs data).
* Queues need to share the same map translations (iova tree).

Because of that it is cleaner to restart the whole net backend and
configure again as expected, similar to how vhost-kernel moves between
userspace and passthrough.

If more kinds of devices need dynamic switching to SVQ we can create a
callback struct like VhostOps and move most of the code there.
VhostOps cannot be reused since all vdpa backend share them, and to
personalize just for networking would be too heavy.

Signed-off-by: Eugenio Pérez 
---
net/vhost-vdpa.c | 84 
1 file changed, 84 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 5d7ad6e4d7..f38532b1df 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,6 +26,8 @@
#include 
#include "standard-headers/linux/virtio_net.h"
#include "monitor/monitor.h"
+#include "migration/migration.h"
+#include "migration/misc.h"
#include "migration/blocker.h"
#include "hw/virtio/vhost.h"

@@ -33,6 +35,7 @@
typedef struct VhostVDPAState {
NetClientState nc;
struct vhost_vdpa vhost_vdpa;
+Notifier migration_state;
Error *migration_blocker;
VHostNetState *vhost_net;

@@ -243,10 +246,86 @@ static VhostVDPAState 
*vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
return DO_UPCAST(VhostVDPAState, nc, nc0);
}

+static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+VirtIONet *n;
+VirtIODevice *vdev;
+int data_queue_pairs, cvq, r;
+NetClientState *peer;
+
+/* We are only called on the first data vqs and only if x-svq is not set */
+if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
+return;
+}
+
+vdev = v->dev->vdev;
+n = VIRTIO_NET(vdev);
+if (!n->vhost_started) {
+return;
+}
+
+if (enable) {
+ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+}
+data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+  n->max_ncs - n->max_queue_pairs : 0;
+vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
+
+peer = s->nc.peer;
+for (int i = 0; i < data_queue_pairs + cvq; i++) {
+VhostVDPAState *vdpa_state;
+NetClientState *nc;
+
+if (i < data_queue_pairs) {
+nc = qemu_get_peer(peer, i);
+} else {
+nc = qemu_get_peer(peer, n->max_queue_pairs);
+}
+
+vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
+vdpa_state->vhost_vdpa.shadow_data = enable;
+
+if (i < data_queue_pairs) {
+/* Do not override CVQ shadow_vqs_enabled */
+vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
+}
+}
+
+r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);

As the first revision, this method (vhost_net_stop followed by
vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
vp_vdpa and vdpa_sim_net. However, I would like to get your attention
that this method implies substantial blackout time for mode switching on
real hardware - a full cycle of device reset, where memory mappings are
torn down, the same set of pages is unpinned and re-pinned, and new
mappings are set up, would take a very significant amount of time,
especially for a large VM. Maybe we can do:


Right, I think this is something that deserves optimization in the future.

Note that we must replace the mappings anyway, with all passthrough
queues stopped.

Yes, unmap and remap are indeed needed. I haven't checked: does the
shadow vq keep the mapping to the same GPA that the passthrough data
virtqueues were associated with across the switch (so that the mode
switch is transparent to the guest)?

I don't get this question, SVQ switching is already transparent to the guest.
Never mind, you seem to have answered the question in the reply here and
below. I was thinking of the possibility of doing an incremental in-place
update for a given IOVA range with one single call (for the on-chip IOMMU
case), instead of separate unmap() and map() calls. Something like
.set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps), as I
mentioned before.
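
For reference, a rough sketch of the kind of op being suggested here
(purely hypothetical; no such callback exists in the vdpa kernel API
today):

struct vdpa_config_ops {
    /* ... existing ops ... */
    /* Replace all mappings within [iova_start, iova_start + size) of
     * the given ASID in one call, instead of a full unmap()/map()
     * cycle with pages unpinned and re-pinned in between. */
    int (*set_map_replace)(struct vdpa_device *vdev, unsigned int asid,
                           u64 iova_start, u64 size,
                           struct vhost_iotlb *new_maps);
};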





For platform IOMMU the ma

Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start

2023-02-14 Thread Si-Wei Liu




On 2/13/2023 3:14 AM, Eugenio Perez Martin wrote:

On Mon, Feb 13, 2023 at 7:51 AM Si-Wei Liu  wrote:



On 2/8/2023 1:42 AM, Eugenio Pérez wrote:

Only create iova_tree if and when it is needed.

The cleanup keeps being responsible of last VQ but this change allows it
to merge both cleanup functions.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
   net/vhost-vdpa.c | 99 ++--
   1 file changed, 71 insertions(+), 28 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index de5ed8ff22..a9e6c8f28e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -178,13 +178,9 @@ err_init:
   static void vhost_vdpa_cleanup(NetClientState *nc)
   {
   VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev = &s->vhost_net->dev;

   qemu_vfree(s->cvq_cmd_out_buffer);
   qemu_vfree(s->status);
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-}
   if (s->vhost_net) {
   vhost_net_cleanup(s->vhost_net);
   g_free(s->vhost_net);
@@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
   return size;
   }

+/** From any vdpa net client, get the netclient of first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
+static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+if (v->shadow_vqs_enabled) {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);
+}
+}
+
+static int vhost_vdpa_net_data_start(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (v->index == 0) {
+vhost_vdpa_net_data_start_first(s);
+return 0;
+}
+
+if (v->shadow_vqs_enabled) {
+VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+}
+
+return 0;
+}
+
+static void vhost_vdpa_net_client_stop(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_dev *dev;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+dev = s->vhost_vdpa.dev;
+if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
+g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+}
+}
+
   static NetClientInfo net_vhost_vdpa_info = {
   .type = NET_CLIENT_DRIVER_VHOST_VDPA,
   .size = sizeof(VhostVDPAState),
   .receive = vhost_vdpa_receive,
+.start = vhost_vdpa_net_data_start,
+.stop = vhost_vdpa_net_client_stop,
   .cleanup = vhost_vdpa_cleanup,
   .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
   .has_ufo = vhost_vdpa_has_ufo,
@@ -351,7 +401,7 @@ dma_map_err:

   static int vhost_vdpa_net_cvq_start(NetClientState *nc)
   {
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
   struct vhost_vdpa *v;
   uint64_t backend_features;
   int64_t cvq_group;
@@ -425,6 +475,15 @@ out:
   return 0;
   }

+s0 = vhost_vdpa_net_first_nc_vdpa(s);
+if (s0->vhost_vdpa.iova_tree) {
+/* SVQ is already configured for all virtqueues */
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+} else {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);

I wonder how this case could happen, vhost_vdpa_net_data_start_first()
should've allocated an iova tree on the first data vq. Is zero data vq
ever possible on net vhost-vdpa?


It's the case of the current qemu master when only CVQ is being
shadowed. It's not that "there are no data vq": If that case were
possible, CVQ vhost-vdpa state would be s0.

The case is that since only CVQ vhost-vdpa is the one being migrated,
only CVQ has an iova tree.
OK, so this corresponds to the case where live migration is not started 
and CVQ starts in its own address space of VHOST_VDPA_NET_CVQ_ASID. 
Thanks for explaining it!




With this series applied and with no migration running, the case is
the same as before: only SVQ gets shadowed. When migration starts, all
vqs are migrated, and share iova tree.
I wonder what the reason is to share the iova tree when migration
starts; I think CVQ may still stay in its own VHOST_VDPA_NET_CVQ_ASID?


Actually there's discrepancy in vhost_vdpa_net_log_global_enable(), I 
don't see explicit c

Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start

2023-02-15 Thread Si-Wei Liu




On 2/14/2023 11:07 AM, Eugenio Perez Martin wrote:

On Tue, Feb 14, 2023 at 2:45 AM Si-Wei Liu  wrote:



On 2/13/2023 3:14 AM, Eugenio Perez Martin wrote:

On Mon, Feb 13, 2023 at 7:51 AM Si-Wei Liu  wrote:


On 2/8/2023 1:42 AM, Eugenio Pérez wrote:

Only create iova_tree if and when it is needed.

The cleanup keeps being responsible of last VQ but this change allows it
to merge both cleanup functions.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
net/vhost-vdpa.c | 99 ++--
1 file changed, 71 insertions(+), 28 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index de5ed8ff22..a9e6c8f28e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -178,13 +178,9 @@ err_init:
static void vhost_vdpa_cleanup(NetClientState *nc)
{
VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev = &s->vhost_net->dev;

qemu_vfree(s->cvq_cmd_out_buffer);
qemu_vfree(s->status);
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-}
if (s->vhost_net) {
vhost_net_cleanup(s->vhost_net);
g_free(s->vhost_net);
@@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
return size;
}

+/** From any vdpa net client, get the netclient of first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
+static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+if (v->shadow_vqs_enabled) {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);
+}
+}
+
+static int vhost_vdpa_net_data_start(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (v->index == 0) {
+vhost_vdpa_net_data_start_first(s);
+return 0;
+}
+
+if (v->shadow_vqs_enabled) {
+VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+}
+
+return 0;
+}
+
+static void vhost_vdpa_net_client_stop(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_dev *dev;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+dev = s->vhost_vdpa.dev;
+if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
+g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+}
+}
+
static NetClientInfo net_vhost_vdpa_info = {
.type = NET_CLIENT_DRIVER_VHOST_VDPA,
.size = sizeof(VhostVDPAState),
.receive = vhost_vdpa_receive,
+.start = vhost_vdpa_net_data_start,
+.stop = vhost_vdpa_net_client_stop,
.cleanup = vhost_vdpa_cleanup,
.has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
.has_ufo = vhost_vdpa_has_ufo,
@@ -351,7 +401,7 @@ dma_map_err:

static int vhost_vdpa_net_cvq_start(NetClientState *nc)
{
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
struct vhost_vdpa *v;
uint64_t backend_features;
int64_t cvq_group;
@@ -425,6 +475,15 @@ out:
return 0;
}

+s0 = vhost_vdpa_net_first_nc_vdpa(s);
+if (s0->vhost_vdpa.iova_tree) {
+/* SVQ is already configured for all virtqueues */
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+} else {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);

I wonder how this case could happen, vhost_vdpa_net_data_start_first()
should've allocated an iova tree on the first data vq. Is zero data vq
ever possible on net vhost-vdpa?


It's the case of the current qemu master when only CVQ is being
shadowed. It's not that "there are no data vq": If that case were
possible, CVQ vhost-vdpa state would be s0.

The case is that since only CVQ vhost-vdpa is the one being migrated,
only CVQ has an iova tree.

OK, so this corresponds to the case where live migration is not started
and CVQ starts in its own address space of VHOST_VDPA_NET_CVQ_ASID.
Thanks for explaining it!


With this series applied and with no migration running, the case is
the same as before: only SVQ gets shadowed. When migration starts, all
vqs are migrated, and share iova tree.

I wonder what is the reason to share the iova tree when migration
starts, I think CVQ may stay on its own VHOST_VD

Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start

2023-02-16 Thread Si-Wei Liu




On 2/15/2023 11:35 PM, Eugenio Perez Martin wrote:

On Thu, Feb 16, 2023 at 3:15 AM Si-Wei Liu  wrote:



On 2/14/2023 11:07 AM, Eugenio Perez Martin wrote:

On Tue, Feb 14, 2023 at 2:45 AM Si-Wei Liu  wrote:


On 2/13/2023 3:14 AM, Eugenio Perez Martin wrote:

On Mon, Feb 13, 2023 at 7:51 AM Si-Wei Liu  wrote:

On 2/8/2023 1:42 AM, Eugenio Pérez wrote:

Only create iova_tree if and when it is needed.

The cleanup keeps being responsible of last VQ but this change allows it
to merge both cleanup functions.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 net/vhost-vdpa.c | 99 ++--
 1 file changed, 71 insertions(+), 28 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index de5ed8ff22..a9e6c8f28e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -178,13 +178,9 @@ err_init:
 static void vhost_vdpa_cleanup(NetClientState *nc)
 {
 VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev = &s->vhost_net->dev;

 qemu_vfree(s->cvq_cmd_out_buffer);
 qemu_vfree(s->status);
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-}
 if (s->vhost_net) {
 vhost_net_cleanup(s->vhost_net);
 g_free(s->vhost_net);
@@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
 return size;
 }

+/** From any vdpa net client, get the netclient of first queue pair */
+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
+static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+if (v->shadow_vqs_enabled) {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);
+}
+}
+
+static int vhost_vdpa_net_data_start(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (v->index == 0) {
+vhost_vdpa_net_data_start_first(s);
+return 0;
+}
+
+if (v->shadow_vqs_enabled) {
+VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+}
+
+return 0;
+}
+
+static void vhost_vdpa_net_client_stop(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_dev *dev;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+dev = s->vhost_vdpa.dev;
+if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
+g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+}
+}
+
 static NetClientInfo net_vhost_vdpa_info = {
 .type = NET_CLIENT_DRIVER_VHOST_VDPA,
 .size = sizeof(VhostVDPAState),
 .receive = vhost_vdpa_receive,
+.start = vhost_vdpa_net_data_start,
+.stop = vhost_vdpa_net_client_stop,
 .cleanup = vhost_vdpa_cleanup,
 .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
 .has_ufo = vhost_vdpa_has_ufo,
@@ -351,7 +401,7 @@ dma_map_err:

 static int vhost_vdpa_net_cvq_start(NetClientState *nc)
 {
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
 struct vhost_vdpa *v;
 uint64_t backend_features;
 int64_t cvq_group;
@@ -425,6 +475,15 @@ out:
 return 0;
 }

+s0 = vhost_vdpa_net_first_nc_vdpa(s);
+if (s0->vhost_vdpa.iova_tree) {
+/* SVQ is already configured for all virtqueues */
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+} else {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);

I wonder how this case could happen, vhost_vdpa_net_data_start_first()
should've allocated an iova tree on the first data vq. Is zero data vq
ever possible on net vhost-vdpa?


It's the case of the current qemu master when only CVQ is being
shadowed. It's not that "there are no data vq": If that case were
possible, CVQ vhost-vdpa state would be s0.

The case is that since only CVQ vhost-vdpa is the one being migrated,
only CVQ has an iova tree.

OK, so this corresponds to the case where live migration is not started
and CVQ starts in its own address space of VHOST_VDPA_NET_CVQ_ASID.
Thanks for explaining it!


With this series applied and with no migration running, the case is
the same as before: only SVQ gets shadowed. When migration starts, all
vqs are migrated

[PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset

2022-10-04 Thread Si-Wei Liu
The citing commit has incorrect code in vhost_vdpa_receive() that returns
zero instead of the full packet size to the caller. This renders pending
packets unable to be freed, so they get clogged in the tx queue forever.
When the device is reset later on, the assertion failure below ensues:

0  0x7f86d53bb387 in raise () from /lib64/libc.so.6
1  0x7f86d53bca78 in abort () from /lib64/libc.so.6
2  0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6
3  0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6
4  0x55b8f6ff6fcc in virtio_net_reset (vdev=) at 
/usr/src/debug/qemu/hw/net/virtio-net.c:563
5  0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at 
/usr/src/debug/qemu/hw/virtio/virtio.c:1993
6  0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at 
/usr/src/debug/qemu/hw/virtio/virtio-bus.c:102
7  0x55b8f71f1620 in virtio_pci_reset (qdev=) at 
/usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845
8  0x55b8f6fafc6c in memory_region_write_accessor (mr=, 
addr=, value=,
   size=, shift=, mask=, 
attrs=...) at /usr/src/debug/qemu/memory.c:483
9  0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f867e7fb7e8, size=size@entry=1,
   access_size_min=, access_size_max=, 
access_fn=0x55b8f6fafc20 ,
   mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544
10 0x55b8f6fb1d0b in memory_region_dispatch_write 
(mr=mr@entry=0x55b8faf80a50, addr=addr@entry=20, data=0, op=,
   attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470
11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, 
addr=addr@entry=549755813908, attrs=...,
   attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1,
   mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266
12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, 
attrs=...,
   buf=0x7f86d0223028 , len=1) at 
/usr/src/debug/qemu/exec.c:3306
13 0x55b8f6f674cb in address_space_write (as=, 
addr=, attrs=..., buf=,
   len=) at /usr/src/debug/qemu/exec.c:3396
14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=...,
   buf=buf@entry=0x7f86d0223028 , 
len=, is_write=)
   at /usr/src/debug/qemu/exec.c:3406
15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/accel/kvm/kvm-all.c:2410
16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/cpus.c:1318
17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at 
/usr/src/debug/qemu/util/qemu-thread-posix.c:519
18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0
19 0x7f86d5483b2d in clone () from /lib64/libc.so.6

Make vhost_vdpa_receive() return the size passed in as is, so that the
caller qemu_deliver_packet_iov() eventually propagates it back to
virtio_net_flush_tx() to release pending packets from the async_tx queue.
This corresponds to the drop path where qemu_sendv_packet_async() returns
non-zero in virtio_net_flush_tx().
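
For reference, the caller-side logic this relies on, roughly as it
appears in virtio_net_flush_tx() (simplified excerpt for illustration):

    ret = qemu_sendv_packet_async(qemu_get_subqueue(n->nic, queue_index),
                                  out_sg, out_num, virtio_net_tx_complete);
    if (ret == 0) {
        /* packet queued for async completion; with a dummy receive
         * callback returning 0 the completion never fires, so the
         * element stays pending in async_tx */
        virtio_queue_set_notification(q->tx_vq, 0);
        q->async_tx.elem = elem;
        return -EBUSY;
    }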

Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback")
Cc: Eugenio Perez Martin 
Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4bc3fd0..182b3a1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, 
ObjectClass *oc,
 static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
   size_t size)
 {
-return 0;
+return size;
 }
 
 static NetClientInfo net_vhost_vdpa_info = {
-- 
1.8.3.1




[PATCH] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa

2022-10-04 Thread Si-Wei Liu
Similar to other vhost backends, vhostfd can be passed to the vhost-vdpa
backend as another parameter to instantiate a vhost-vdpa net client.
This would benefit the use case where only open file descriptors, as
opposed to raw vhost-vdpa device paths, are accessible from the QEMU
process.

(qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1

Signed-off-by: Si-Wei Liu 
---
 net/vhost-vdpa.c | 25 -
 qapi/net.json|  3 +++
 qemu-options.hx  |  6 --
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 182b3a1..366b070 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -683,14 +683,29 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
 
 assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA);
 opts = &netdev->u.vhost_vdpa;
-if (!opts->vhostdev) {
-error_setg(errp, "vdpa character device not specified with vhostdev");
+if (!opts->has_vhostdev && !opts->has_vhostfd) {
+error_setg(errp,
+   "vhost-vdpa: neither vhostdev= nor vhostfd= was specified");
 return -1;
 }
 
-vdpa_device_fd = qemu_open(opts->vhostdev, O_RDWR, errp);
-if (vdpa_device_fd == -1) {
-return -errno;
+if (opts->has_vhostdev && opts->has_vhostfd) {
+error_setg(errp,
+   "vhost-vdpa: vhostdev= and vhostfd= are mutually 
exclusive");
+return -1;
+}
+
+if (opts->has_vhostdev) {
+vdpa_device_fd = qemu_open(opts->vhostdev, O_RDWR, errp);
+if (vdpa_device_fd == -1) {
+return -errno;
+}
+} else if (opts->has_vhostfd) {
+vdpa_device_fd = monitor_fd_param(monitor_cur(), opts->vhostfd, errp);
+if (vdpa_device_fd == -1) {
+error_prepend(errp, "vhost-vdpa: unable to parse vhostfd: ");
+return -1;
+}
 }
 
 r = vhost_vdpa_get_features(vdpa_device_fd, &features, errp);
diff --git a/qapi/net.json b/qapi/net.json
index dd088c0..926ecc8 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -442,6 +442,8 @@
 # @vhostdev: path of vhost-vdpa device
 #(default:'/dev/vhost-vdpa-0')
 #
+# @vhostfd: file descriptor of an already opened vhost vdpa device
+#
 # @queues: number of queues to be created for multiqueue vhost-vdpa
 #  (default: 1)
 #
@@ -456,6 +458,7 @@
 { 'struct': 'NetdevVhostVDPAOptions',
   'data': {
 '*vhostdev': 'str',
+'*vhostfd':  'str',
 '*queues':   'int',
 '*x-svq':{'type': 'bool', 'features' : [ 'unstable'] } } }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 913c71e..c040f74 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2774,8 +2774,10 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
 "configure a vhost-user network, backed by a chardev 
'dev'\n"
 #endif
 #ifdef __linux__
-"-netdev vhost-vdpa,id=str,vhostdev=/path/to/dev\n"
+"-netdev vhost-vdpa,id=str[,vhostdev=/path/to/dev][,vhostfd=h]\n"
 "configure a vhost-vdpa network,Establish a vhost-vdpa 
netdev\n"
+"use 'vhostdev=/path/to/dev' to open a vhost vdpa device\n"
+"use 'vhostfd=h' to connect to an already opened vhost 
vdpa device\n"
 #endif
 #ifdef CONFIG_VMNET
 "-netdev vmnet-host,id=str[,isolated=on|off][,net-uuid=uuid]\n"
@@ -3280,7 +3282,7 @@ SRST
  -netdev type=vhost-user,id=net0,chardev=chr0 \
  -device virtio-net-pci,netdev=net0
 
-``-netdev vhost-vdpa,vhostdev=/path/to/dev``
+``-netdev vhost-vdpa[,vhostdev=/path/to/dev][,vhostfd=h]``
 Establish a vhost-vdpa netdev.
 
 vDPA device is a device that uses a datapath which complies with
-- 
1.8.3.1





Re: [PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset

2022-10-04 Thread Si-Wei Liu
Apologies, please disregard this email. It was sent to the wrong target
audience, although the content of the patch is correct. For those who
want to review the patch, please reply to this thread:


Message-Id: <1664913563-3351-1-git-send-email-si-wei@oracle.com>

Thanks,
-Siwei

On 10/4/2022 12:58 PM, Si-Wei Liu wrote:

The citing commit has incorrect code in vhost_vdpa_receive() that returns
zero instead of full packet size to the caller. This renders pending packets
unable to be freed so then get clogged in the tx queue forever. When device
is being reset later on, below assertion failure ensues:

0  0x7f86d53bb387 in raise () from /lib64/libc.so.6
1  0x7f86d53bca78 in abort () from /lib64/libc.so.6
2  0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6
3  0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6
4  0x55b8f6ff6fcc in virtio_net_reset (vdev=) at 
/usr/src/debug/qemu/hw/net/virtio-net.c:563
5  0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at 
/usr/src/debug/qemu/hw/virtio/virtio.c:1993
6  0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at 
/usr/src/debug/qemu/hw/virtio/virtio-bus.c:102
7  0x55b8f71f1620 in virtio_pci_reset (qdev=) at 
/usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845
8  0x55b8f6fafc6c in memory_region_write_accessor (mr=, 
addr=, value=,
size=, shift=, mask=, 
attrs=...) at /usr/src/debug/qemu/memory.c:483
9  0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f867e7fb7e8, size=size@entry=1,
access_size_min=, access_size_max=, 
access_fn=0x55b8f6fafc20 ,
mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544
10 0x55b8f6fb1d0b in memory_region_dispatch_write (mr=mr@entry=0x55b8faf80a50, 
addr=addr@entry=20, data=0, op=,
attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470
11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, 
addr=addr@entry=549755813908, attrs=...,
attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1,
mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266
12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, 
attrs=...,
buf=0x7f86d0223028 , len=1) at 
/usr/src/debug/qemu/exec.c:3306
13 0x55b8f6f674cb in address_space_write (as=, addr=, attrs=..., buf=,
len=) at /usr/src/debug/qemu/exec.c:3396
14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=...,
buf=buf@entry=0x7f86d0223028 , len=, is_write=)
at /usr/src/debug/qemu/exec.c:3406
15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/accel/kvm/kvm-all.c:2410
16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/cpus.c:1318
17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at 
/usr/src/debug/qemu/util/qemu-thread-posix.c:519
18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0
19 0x7f86d5483b2d in clone () from /lib64/libc.so.6

Make vhost_vdpa_receive() return the size passed in as is, so that the
caller qemu_deliver_packet_iov() would eventually propagate it back to
virtio_net_flush_tx() to release pending packets from the async_tx queue.
Which corresponds to the drop path where qemu_sendv_packet_async() returns
non-zero in virtio_net_flush_tx().

Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback")
Cc: Eugenio Perez Martin
Signed-off-by: Si-Wei Liu
---
  net/vhost-vdpa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4bc3fd0..182b3a1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, 
ObjectClass *oc,
  static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
size_t size)
  {
-return 0;
+return size;
  }
  
  static NetClientInfo net_vhost_vdpa_info = {


Re: [PATCH 2/3] vdpa: load vlan configuration at NIC startup

2022-10-04 Thread Si-Wei Liu



On 9/29/2022 12:13 AM, Michael S. Tsirkin wrote:

On Wed, Sep 21, 2022 at 04:00:58PM -0700, Si-Wei Liu wrote:

The spec doesn't explicitly say anything about that
as far as I see.

Here the spec is totally ruled by the (software artifact of)
implementation rather than what a real device is expected to work with
VLAN rx filters. Are we sure we'd stick to this flawed device
implementation? The guest driver seems to be agnostic with this broken
spec behavior so far, and I am afraid it's an overkill to add another
feature bit or ctrl command to VLAN filter in clean way.


I agree with all of the above. So, double checking, all vlan should be
allowed by default at device start?

That is true only when VIRTIO_NET_F_CTRL_VLAN is not negotiated. If the
guest already negotiated VIRTIO_NET_F_CTRL_VLAN before being migrated,
the device should resume with all VLANs filtered/disallowed.
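
A sketch of these semantics in terms of the virtio-net device model's
vlan bitmap (illustrative only, assuming the existing n->vlans filter
table; this is not code from the patch under discussion):

    if (virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VLAN)) {
        /* driver controls the filter: start (and resume after
         * migration) with all VLANs disallowed, and let the guest add
         * them back via the ctrl vq */
        memset(n->vlans, 0, MAX_VLAN >> 3);
    } else {
        /* no VLAN filtering control negotiated: allow all VLANs */
        memset(n->vlans, 0xff, MAX_VLAN >> 3);
    }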


   Maybe the spec needs to be more
clear in that regard?

Yes, I think this is crucial. Otherwise we can't get consistent behavior,
either from software to vDPA, or cross various vDPA vendors.

OK. Can you open a github issue for the spec? We'll try to address.

Thanks, ticket filed at:
https://github.com/oasis-tcs/virtio-spec/issues/147

Also, is it ok if we make it a SHOULD, i.e. best effort filtering?


Yes, that's fine.

-Siwei

Re: [PATCH v2] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa

2022-10-27 Thread Si-Wei Liu

Hi Jason,

Sorry for top posting, but are you going to queue this patch? It looks
like the discussion has been settled and I have received no further
comments on this patch for 2 weeks.


Thanks,
-Siwei

On 10/13/2022 4:12 PM, Si-Wei Liu wrote:

Jason,

On 10/12/2022 10:02 PM, Jason Wang wrote:


在 2022/10/12 13:59, Si-Wei Liu 写道:



On 10/11/2022 8:09 PM, Jason Wang wrote:
On Tue, Oct 11, 2022 at 1:18 AM Si-Wei Liu 
wrote:

On 10/8/2022 10:43 PM, Jason Wang wrote:

On Sat, Oct 8, 2022 at 5:04 PM Si-Wei Liu 
wrote:


Similar to other vhost backends, vhostfd can be passed to vhost-vdpa
backend as another parameter to instantiate vhost-vdpa net client.
This would benefit the use case where only open file descriptors, as
opposed to raw vhost-vdpa device paths, are accessible from the QEMU
process.

(qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1

Adding Cindy.

This has been discussed before, we've already had
vhostdev=/dev/fdset/$fd which should be functional equivalent to what
has been proposed here. (And this is how libvirt works if I 
understand

correctly).

Yes, I was aware of that discussion. However, our implementation 
of the management software is a bit different from libvirt, in 
which the paths in /dev/fdset/NNN can't be dynamically passed to 
the container where QEMU is running. By using a specific vhostfd 
property with existing code, it would allow our mgmt software 
smooth adaptation without having to add too much infra code to 
support the /dev/fdset/NNN trick.
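
For reference, the two flavors under discussion would look roughly as below;
the fd number, fdset id and netdev id are made up for illustration.

    # fdset route: the fd is passed over the QMP socket via SCM_RIGHTS,
    # then referenced through the /dev/fdset/<id> pseudo-path
    { "execute": "add-fd", "arguments": { "fdset-id": 1 } }
    (qemu) netdev_add type=vhost-vdpa,vhostdev=/dev/fdset/1,id=vhost-vdpa1

    # direct vhostfd route (this patch): the already-received fd is named directly
    (qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1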

I think fdset has extra flexibility in e.g hot-plug to allow the file
descriptor to be passed with SCM_RIGHTS.
Yes, that's exactly the use case we'd like to support. Though the 
difference in our mgmt software stack from libvirt is that any 
dynamic path in /dev (like /dev/fdset/ABC or /dev/vhost-vdpa-XYZ) 
can't be allowed to get passed through to the container running QEMU 
on the fly for security reasons. fd passing is allowed, though, with 
very strict security checks.



Interesting, any reason for disallowing fd passing?
For our mgmt software stack, QEMU is running in a secured container 
with its own namespace(s) with minimally well known and trusted 
devices from root ns exposed (only) at the time when QEMU is being 
started.  Direct fd passing via SCM_RIGHTS is allowed, but fdset 
device node exposure is not allowed and not even considered useful to 
us, as it adds an unwarranted attack surface to the QEMU's secured 
container unnecessarily. This has been the case and our security model 
for a while now w.r.t hot plugging vhost-net/tap and vhost-scsi 
devices, so will do for vhost-vdpa with vhostfd. It's not an open 
source project, though what I can share is that it's not a simple 
script that can be easily changed, and allow passing extra devices 
e.g. fdset especially on the fly is not even in consideration per 
suggested security guideline. I think we don't do anything special 
here as with other secured containers that disallow dynamic device 
injection on the fly.


I'm asking since it's the way that libvirt works, and it seems to me we 
didn't get any complaints in the past.
I guess it was because libvirt doesn't run QEMU in a container with 
very limited device exposure, otherwise this sort of constraints would 
pop up. Anyway the point and the way I see it is that passing vhostfd 
is proved to be working well and secure with other vhost devices, I 
don't see why vhost-vdpa is treated special here that would need to 
enforce the fdset usage. It's an edge case for libvirt maybe, but 
supporting QEMU's vhost-vdpa device to run in a securely contained 
environment with no dynamic device injection shouldn't be an odd or 
bizarre use case.



Thanks,
-Siwei




That's the main motivation for this direct vhostfd passing support 
(noted fdset doesn't need to be used along with /dev/fdset node).


Having it said, I found there's also nuance in the 
vhostdev=/dev/fdset/XyZ interface besides the /dev node limitation: 
the fd to open has to be dup'ed from the original one passed via 
SCM_RIGHTS. This also has implication on security that any ioctl 
call from QEMU can't be audited through the original fd.



I'm not sure I get this, but the management layer can enforce an ioctl 
whitelist for safety.


Thanks


With this regard, I think vhostfd offers more flexibility than work 
around those qemu_open() specifics. Would these justify the use case 
of concern?


Thanks,
-Siwei


  It would still be good to add
the support.

On the other hand, the other vhost backends, e.g. tap (via 
vhost-net), vhost-scsi and vhost-vsock all accept vhostfd as 
parameter to instantiate device, although the /dev/fdset trick 
also works there. I think vhost-vdpa is not  unprecedented in this 
case?

Yes.

Thanks


Thanks,
-Siwei



Thanks

Signed-off-by: Si-Wei Liu
Acked-by: Eugenio Pérez

---
v2:
   - fixed typo in commit message
   

Re: [PATCH v2] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa

2022-10-28 Thread Si-Wei Liu



On 10/27/2022 6:50 PM, Jason Wang wrote:

On Fri, Oct 28, 2022 at 5:56 AM Si-Wei Liu  wrote:

Hi Jason,

Sorry for top posting, but are you going to queue this patch? It looks
like the discussion has been settled and no further comment I got for 2
weeks for this patch.

Yes, I've queued this.

Excellent, thanks Jason. I see it gets pulled.

-Siwei


Thanks


Thanks,
-Siwei

On 10/13/2022 4:12 PM, Si-Wei Liu wrote:

Jason,

On 10/12/2022 10:02 PM, Jason Wang wrote:

在 2022/10/12 13:59, Si-Wei Liu 写道:


On 10/11/2022 8:09 PM, Jason Wang wrote:

On Tue, Oct 11, 2022 at 1:18 AM Si-Wei Liu
wrote:

On 10/8/2022 10:43 PM, Jason Wang wrote:

On Sat, Oct 8, 2022 at 5:04 PM Si-Wei Liu
wrote:

Similar to other vhost backends, vhostfd can be passed to vhost-vdpa
backend as another parameter to instantiate vhost-vdpa net client.
This would benefit the use case where only open file descriptors, as
opposed to raw vhost-vdpa device paths, are accessible from the QEMU
process.

(qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1

Adding Cindy.

This has been discussed before, we've already had
vhostdev=/dev/fdset/$fd which should be functional equivalent to what
has been proposed here. (And this is how libvirt works if I
understand
correctly).

Yes, I was aware of that discussion. However, our implementation
of the management software is a bit different from libvirt, in
which the paths in /dev/fdset/NNN can't be dynamically passed to
the container where QEMU is running. By using a specific vhostfd
property with existing code, it would allow our mgmt software
smooth adaptation without having to add too much infra code to
support the /dev/fdset/NNN trick.

I think fdset has extra flexibility in e.g hot-plug to allow the file
descriptor to be passed with SCM_RIGHTS.

Yes, that's exactly the use case we'd like to support. Though the
difference in our mgmt software stack from libvirt is that any
dynamic path in /dev (like /dev/fdset/ABC or /dev/vhost-vdpa-XYZ)
can't be allowed to get passed through to the container running QEMU
on the fly for security reasons. fd passing is allowed, though, with
very strict security checks.


Interesting, any reason for disallowing fd passing?

For our mgmt software stack, QEMU is running in a secured container
with its own namespace(s) with minimally well known and trusted
devices from root ns exposed (only) at the time when QEMU is being
started.  Direct fd passing via SCM_RIGHTS is allowed, but fdset
device node exposure is not allowed and not even considered useful to
us, as it adds an unwarranted attack surface to the QEMU's secured
container unnecessarily. This has been the case and our security model
for a while now w.r.t hot plugging vhost-net/tap and vhost-scsi
devices, so will do for vhost-vdpa with vhostfd. It's not an open
source project, though what I can share is that it's not a simple
script that can be easily changed, and allow passing extra devices
e.g. fdset especially on the fly is not even in consideration per
suggested security guideline. I think we don't do anything special
here as with other secured containers that disallow dynamic device
injection on the fly.


I'm asking since it's the way that libvirt works, and it seems to me we
didn't get any complaints in the past.

I guess it was because libvirt doesn't run QEMU in a container with
very limited device exposure, otherwise this sort of constraints would
pop up. Anyway the point and the way I see it is that passing vhostfd
is proved to be working well and secure with other vhost devices, I
don't see why vhost-vdpa is treated special here that would need to
enforce the fdset usage. It's an edge case for libvirt maybe, but
supporting QEMU's vhost-vdpa device to run in a securely contained
environment with no dynamic device injection shouldn't be an odd or
bizarre use case.


Thanks,
-Siwei




That's the main motivation for this direct vhostfd passing support
(noted fdset doesn't need to be used along with /dev/fdset node).

Having it said, I found there's also nuance in the
vhostdev=/dev/fdset/XyZ interface besides the /dev node limitation:
the fd to open has to be dup'ed from the original one passed via
SCM_RIGHTS. This also has implication on security that any ioctl
call from QEMU can't be audited through the original fd.


I'm not sure I get this, but the management layer can enforce an ioctl
whitelist for safety.

Thanks



With this regard, I think vhostfd offers more flexibility than work
around those qemu_open() specifics. Would these justify the use case
of concern?

Thanks,
-Siwei


   It would still be good to add
the support.


On the other hand, the other vhost backends, e.g. tap (via
vhost-net), vhost-scsi and vhost-vsock all accept vhostfd as
parameter to instantiate device, although the /dev/fdset trick
also works there. I think vhost-vdpa is not  unprecedented in this
case?

Yes.


Re: [PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset

2022-10-28 Thread Si-Wei Liu

Hi Jason,

This one is a one-line simple bug fix but seems to be missed from the 
pull request. If there's a v2 for the PULL, would appreciate if you can 
piggyback. Thanks in advance!


Regards,
-Siwei

On 10/7/2022 8:42 AM, Eugenio Perez Martin wrote:

On Tue, Oct 4, 2022 at 11:05 PM Si-Wei Liu  wrote:

The citing commit has incorrect code in vhost_vdpa_receive() that returns
zero instead of full packet size to the caller. This renders pending packets
unable to be freed so then get clogged in the tx queue forever. When device
is being reset later on, below assertion failure ensues:

0  0x7f86d53bb387 in raise () from /lib64/libc.so.6
1  0x7f86d53bca78 in abort () from /lib64/libc.so.6
2  0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6
3  0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6
4  0x55b8f6ff6fcc in virtio_net_reset (vdev=) at 
/usr/src/debug/qemu/hw/net/virtio-net.c:563
5  0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at 
/usr/src/debug/qemu/hw/virtio/virtio.c:1993
6  0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at 
/usr/src/debug/qemu/hw/virtio/virtio-bus.c:102
7  0x55b8f71f1620 in virtio_pci_reset (qdev=) at 
/usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845
8  0x55b8f6fafc6c in memory_region_write_accessor (mr=, 
addr=, value=,
size=, shift=, mask=, 
attrs=...) at /usr/src/debug/qemu/memory.c:483
9  0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, 
value=value@entry=0x7f867e7fb7e8, size=size@entry=1,
access_size_min=, access_size_max=, 
access_fn=0x55b8f6fafc20 ,
mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544
10 0x55b8f6fb1d0b in memory_region_dispatch_write (mr=mr@entry=0x55b8faf80a50, 
addr=addr@entry=20, data=0, op=,
attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470
11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, 
addr=addr@entry=549755813908, attrs=...,
attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1,
mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266
12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, 
attrs=...,
buf=0x7f86d0223028 , len=1) at 
/usr/src/debug/qemu/exec.c:3306
13 0x55b8f6f674cb in address_space_write (as=, addr=, attrs=..., buf=,
len=) at /usr/src/debug/qemu/exec.c:3396
14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=...,
buf=buf@entry=0x7f86d0223028 , len=, is_write=)
at /usr/src/debug/qemu/exec.c:3406
15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/accel/kvm/kvm-all.c:2410
16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at 
/usr/src/debug/qemu/cpus.c:1318
17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at 
/usr/src/debug/qemu/util/qemu-thread-posix.c:519
18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0
19 0x7f86d5483b2d in clone () from /lib64/libc.so.6

Make vhost_vdpa_receive() return the size passed in as is, so that the
caller qemu_deliver_packet_iov() would eventually propagate it back to
virtio_net_flush_tx() to release pending packets from the async_tx queue.
Which corresponds to the drop path where qemu_sendv_packet_async() returns
non-zero in virtio_net_flush_tx().


Acked-by: Eugenio Pérez



Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback")
Cc: Eugenio Perez Martin
Signed-off-by: Si-Wei Liu
---
  net/vhost-vdpa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 4bc3fd0..182b3a1 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, 
ObjectClass *oc,
  static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
size_t size)
  {
-return 0;
+return size;
  }

  static NetClientInfo net_vhost_vdpa_info = {
--
1.8.3.1



Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration

2023-02-02 Thread Si-Wei Liu




On 2/2/2023 3:27 AM, Eugenio Perez Martin wrote:

On Thu, Feb 2, 2023 at 2:00 AM Si-Wei Liu  wrote:



On 1/12/2023 9:24 AM, Eugenio Pérez wrote:

It's possible to migrate vdpa net devices if they are shadowed from the

start.  But always shadowing the dataplane effectively breaks its host

passthrough, so it's not convenient in vDPA scenarios.



This series enables dynamically switching to shadow mode only at

migration time.  This allow full data virtqueues passthrough all the

time qemu is not migrating.



Successfully tested with vdpa_sim_net (but it needs some patches, I

will send them soon) and qemu emulated device with vp_vdpa with

some restrictions:

* No CVQ.

* VIRTIO_RING_F_STATE patches.

What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it
a new feature that other vdpa driver would need for live migration)?


Not really,

Since vp_vdpa wraps a virtio-net-pci driver to give it vdpa
capabilities it needs a virtio in-band method to set and fetch the
virtqueue state. Jason sent a proposal some time ago [1], and I
implemented it in qemu's virtio emulated device.

I can send them as a RFC but I didn't worry about making it pretty,
nor I think they should be merged at the moment. vdpa parent drivers
should follow vdpa_sim changes.
Got it. No need to bother sending an RFC for now; I think it's limited to 
virtio-backed vdpa providers only. Thanks for the clarifications.


-Siwei



Thanks!

[1] https://lists.oasis-open.org/archives/virtio-comment/202103/msg00036.html


-Siwei


* Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like

DPDK.



Comments are welcome, especially in the patches with RFC in the message.



v2:

- Use a migration listener instead of a memory listener to know when

the migration starts.

- Add stuff not picked with ASID patches, like enable rings after

driver_ok

- Add rewinding on the migration src, not in dst

- v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html



Eugenio Pérez (13):

vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check

vdpa net: move iova tree creation from init to start

vdpa: copy cvq shadow_data from data vqs, not from x-svq

vdpa: rewind at get_base, not set_base

vdpa net: add migration blocker if cannot migrate cvq

vhost: delay set_vring_ready after DRIVER_OK

vdpa: delay set_vring_ready after DRIVER_OK

vdpa: Negotiate _F_SUSPEND feature

vdpa: add feature_log parameter to vhost_vdpa

vdpa net: allow VHOST_F_LOG_ALL

vdpa: add vdpa net migration state notifier

vdpa: preemptive kick at enable

vdpa: Conditionally expose _F_LOG in vhost_net devices



   include/hw/virtio/vhost-backend.h |   4 +

   include/hw/virtio/vhost-vdpa.h|   1 +

   hw/net/vhost_net.c|  25 ++-

   hw/virtio/vhost-vdpa.c|  64 +---

   hw/virtio/vhost.c |   3 +

   net/vhost-vdpa.c  | 247 +-

   6 files changed, 278 insertions(+), 66 deletions(-)








Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier

2023-02-03 Thread Si-Wei Liu




On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote:

On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu  wrote:



On 1/12/2023 9:24 AM, Eugenio Pérez wrote:

This allows net to restart the device backend to configure SVQ on it.

Ideally, these changes should not be net specific. However, the vdpa net
backend is the one with enough knowledge to configure everything because
of some reasons:
* Queues might need to be shadowed or not depending on its kind (control
vs data).
* Queues need to share the same map translations (iova tree).

Because of that it is cleaner to restart the whole net backend and
configure again as expected, similar to how vhost-kernel moves between
userspace and passthrough.

If more kinds of devices need dynamic switching to SVQ we can create a
callback struct like VhostOps and move most of the code there.
VhostOps cannot be reused since all vdpa backend share them, and to
personalize just for networking would be too heavy.

Signed-off-by: Eugenio Pérez 
---
   net/vhost-vdpa.c | 84 
   1 file changed, 84 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 5d7ad6e4d7..f38532b1df 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,6 +26,8 @@
   #include 
   #include "standard-headers/linux/virtio_net.h"
   #include "monitor/monitor.h"
+#include "migration/migration.h"
+#include "migration/misc.h"
   #include "migration/blocker.h"
   #include "hw/virtio/vhost.h"

@@ -33,6 +35,7 @@
   typedef struct VhostVDPAState {
   NetClientState nc;
   struct vhost_vdpa vhost_vdpa;
+Notifier migration_state;
   Error *migration_blocker;
   VHostNetState *vhost_net;

@@ -243,10 +246,86 @@ static VhostVDPAState 
*vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
   return DO_UPCAST(VhostVDPAState, nc, nc0);
   }

+static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)
+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+VirtIONet *n;
+VirtIODevice *vdev;
+int data_queue_pairs, cvq, r;
+NetClientState *peer;
+
+/* We are only called on the first data vqs and only if x-svq is not set */
+if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
+return;
+}
+
+vdev = v->dev->vdev;
+n = VIRTIO_NET(vdev);
+if (!n->vhost_started) {
+return;
+}
+
+if (enable) {
+ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+}
+data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+  n->max_ncs - n->max_queue_pairs : 0;
+vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
+
+peer = s->nc.peer;
+for (int i = 0; i < data_queue_pairs + cvq; i++) {
+VhostVDPAState *vdpa_state;
+NetClientState *nc;
+
+if (i < data_queue_pairs) {
+nc = qemu_get_peer(peer, i);
+} else {
+nc = qemu_get_peer(peer, n->max_queue_pairs);
+}
+
+vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
+vdpa_state->vhost_vdpa.shadow_data = enable;
+
+if (i < data_queue_pairs) {
+/* Do not override CVQ shadow_vqs_enabled */
+vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
+}
+}
+
+r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);

As the first revision, this method (vhost_net_stop followed by
vhost_net_start) should be fine for software vhost-vdpa backend for e.g.
vp_vdpa and vdpa_sim_net. However, I would like to get your attention
that this method implies substantial blackout time for mode switching on
real hardware - get a full cycle of device reset of getting memory
mappings torn down, unpin & repin same set of pages, and set up new
mapping would take very significant amount of time, especially for a
large VM. Maybe we can do:


Right, I think this is something that deserves optimization in the future.

Note that we must replace the mappings anyway, with all passthrough
queues stopped.
Yes, unmap and remap is indeed needed. I haven't checked: does the shadow vq 
keep mapping to the same GPA that the passthrough data virtqueues were 
associated with across the switch (so that the mode switch is transparent to 
the guest)? For a platform IOMMU the mapping and remapping cost is 
inevitable, though I wonder whether for the on-chip IOMMU case it could take 
some fast path to just replace the IOVA in place, without destroying and 
setting up all mapping entries, if the same GPA is going to be used for 
the data rings (copying Eli for his input).



  This is because SVQ vrings are not in the guest space.
The pin can be skipped though, I think that's a low hand fruit here.
Yes, that's right. For a large VM pinning overhead usually outweighs the 
mapping cost. It would be a great amount of time saving if pinning can be 
skipp

Re: [RFC v2 12/13] vdpa: preemptive kick at enable

2023-02-04 Thread Si-Wei Liu




On 2/2/2023 8:53 AM, Eugenio Perez Martin wrote:

On Thu, Feb 2, 2023 at 1:57 AM Si-Wei Liu  wrote:



On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote:

On Fri, Jan 13, 2023 at 4:39 AM Jason Wang  wrote:

On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan  wrote:


On 1/13/2023 10:31 AM, Jason Wang wrote:

On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez  wrote:

Spuriously kick the destination device's queue so it knows in case there
are new descriptors.

RFC: This is somehow a gray area. The guest may have placed descriptors
in a virtqueue but not kicked it, so it might be surprised if the device
starts processing it.

So I think this is kind of the work of the vDPA parent. For the parent
that needs this trick, we should do it in the parent driver.

Agree, it looks easier implementing this in parent driver,
I can implement it in ifcvf set_vq_ready right now

Great, but please check whether or not it is really needed.

Some device implementation could check the available descriptions
after DRIVER_OK without waiting for a kick.


So IIUC we can entirely drop this from the series (and I hope we can).
But then, what with the devices that does *not* check for them?

I wonder how the kick can be missed in the first place. Supposedly, by the
moment vhost_dev_stop() calls .suspend() into the vdpa driver, the vcpus
have already stopped running (vm_running = false) and all pending kicks
have already been delivered through vhost-vdpa's host notifiers or the
mapped doorbell, so the device won't get new ones.

I'm thinking now in cases like the net rx queue.

When the guest starts it fills it and kicks the device. Let's say
avail_idx is 255.

Following the qemu emulated virtio net,
hw/virtio/virtio.c:virtqueue_split_pop will stash shadow_avail_idx =
255, and it will not check it again until it is out of rx descriptors.

Now the NIC fills N < 255 receive buffers, and VMM migrates. Will the
destination device check rx avail idx even if it has not received any
kick? (here could be at startup or when it needs to receive a packet).
- If the answer is yes, and it will be a bug not to check it, then we
can drop this patch. We're covered even if there is a possibility of
losing a kick in the source.
So this is not an issue of missing delivery of kicks, but more of how the 
device is expected to handle pending kicks during suspend? For a network 
device, it's not required to process up to avail_idx during suspend, but 
this doesn't mean it should ignore the kick for new descriptors; rather, 
I would say the device shouldn't specifically rely on the kick, either at 
suspend or at startup. If at suspend the device doesn't process up to 
avail_idx, then correspondingly its implementation should sync avail_idx 
from memory at startup. Even if the device implementation does process up 
to avail_idx at suspend, from an interoperability point of view (i.e. the 
source device didn't sync at suspend) it still needs to check avail_idx at 
startup (resume) time and go on to process any pending requests, right? So 
in any case, it seems to me the "implicit" kick at startup is needed for 
any device implementation anyway. I wouldn't say mandatory, but that's the 
way it's supposed to work, I feel.
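
As a rough illustration of that "implicit kick at startup", something along
the lines of the sketch below is what a device implementation would do at
resume / DRIVER_OK. All names here are hypothetical and not taken from any
existing driver.

    #include <stdint.h>

    struct vq_state {
        volatile uint16_t *avail_idx;   /* points at vring avail->idx in guest memory */
        uint16_t last_avail_idx;        /* where the device left off at suspend */
        void (*process)(struct vq_state *vq, uint16_t from, uint16_t to);
    };

    /* Called once at resume/DRIVER_OK: recover pending work from the ring
     * itself instead of waiting for a kick that may never arrive again. */
    static void vq_resume_catch_up(struct vq_state *vq)
    {
        uint16_t avail = *vq->avail_idx;

        if (avail != vq->last_avail_idx) {
            /* Descriptors were posted before/around suspend; handle them now. */
            vq->process(vq, vq->last_avail_idx, avail);
            vq->last_avail_idx = avail;
        }
    }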



- If the answer is that it is not mandatory, we need to solve it
somehow. To me, the best way is to spuriously kick as we don't need
changes in the device, all we need is here. A new feature flag
_F_CHECK_AVAIL_ON_STARTUP or equivalent would work the same, but I
think it complicates everything more.

For tx the device should suspend "immediately", so it may receive a
kick, fetch avail_idx with M pending descriptors, transmit P < M and
then receive the suspend. If we don't want to wait indefinitely, the
device should stop processing so there are still pending requests in
the queue for the destination to send. So the case now is the same as
rx, even if the source device actually receives the kick.

Having said that, I didn't check if any code drains the vhost host
notifier. Or, as mentioned in the meeting, check that HW cannot
reorder kick and suspend call.
Not sure how order matters here, though I thought device suspend/resume 
doesn't tie in with kick order?





If the device intends to
purposely ignore (note: this could be a device bug) pending kicks during
.suspend(), then consequently it should check available descriptors
after reaching driver_ok to process outstanding descriptors, making up
for the missing kick. If the vdpa driver doesn't support .suspend(),
then it should flush the work before .reset() - vhost-scsi does it this
way.  Otherwise, I think it's the norm (the right thing to do) that the
device should process pending kicks before guest memory is unmapped late
in vhost_dev_stop(). Is there any case where kicks may be missing?


So process pending kicks means to drain all tx and rx descriptors?
No it doesn't have to. What I sai

Re: [RFC v2 12/13] vdpa: preemptive kick at enable

2023-02-05 Thread Si-Wei Liu




On 2/5/2023 2:00 AM, Michael S. Tsirkin wrote:

On Sat, Feb 04, 2023 at 03:04:02AM -0800, Si-Wei Liu wrote:

For network hardware device, I thought suspend
just needs to wait until the completion of ongoing Tx/Rx DMA transaction
already in the flight, rather than to drain all the upcoming packets until
avail_idx.

It depends I guess but if device expects to recover all state from just
ring state in memory then at least it has to drain until some index
value.
Yes, that's the general requirement for devices other than networking 
devices. For example, if a storage device had posted requests before 
suspending and there's no way to replay those requests from the destination, 
it needs to drain until all posted requests are completed. For a network 
device, this requirement can be relaxed somewhat, as networking (Ethernet) 
is usually tolerant to packet drops. Jason and I once had a long 
discussion about the expectation for the {get,set}_vq_state() driver API and 
we came to the conclusion that this is something a networking device can 
stand up to:


https://lore.kernel.org/lkml/b2d18964-8cd6-6bb1-1995-5b9662070...@redhat.com/

-Siwei



Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start

2023-02-12 Thread Si-Wei Liu




On 2/8/2023 1:42 AM, Eugenio Pérez wrote:

Only create iova_tree if and when it is needed.

The cleanup keeps being responsible of last VQ but this change allows it
to merge both cleanup functions.

Signed-off-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
  net/vhost-vdpa.c | 99 ++--
  1 file changed, 71 insertions(+), 28 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index de5ed8ff22..a9e6c8f28e 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -178,13 +178,9 @@ err_init:
  static void vhost_vdpa_cleanup(NetClientState *nc)
  {
  VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
-struct vhost_dev *dev = &s->vhost_net->dev;
  
  qemu_vfree(s->cvq_cmd_out_buffer);

  qemu_vfree(s->status);
-if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
-g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-}
  if (s->vhost_net) {
  vhost_net_cleanup(s->vhost_net);
  g_free(s->vhost_net);
@@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, 
const uint8_t *buf,
  return size;
  }
  
+/** From any vdpa net client, get the netclient of first queue pair */

+static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
+{
+NICState *nic = qemu_get_nic(s->nc.peer);
+NetClientState *nc0 = qemu_get_peer(nic->ncs, 0);
+
+return DO_UPCAST(VhostVDPAState, nc, nc0);
+}
+
+static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+if (v->shadow_vqs_enabled) {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);
+}
+}
+
+static int vhost_vdpa_net_data_start(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (v->index == 0) {
+vhost_vdpa_net_data_start_first(s);
+return 0;
+}
+
+if (v->shadow_vqs_enabled) {
+VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s);
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+}
+
+return 0;
+}
+
+static void vhost_vdpa_net_client_stop(NetClientState *nc)
+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_dev *dev;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+dev = s->vhost_vdpa.dev;
+if (dev->vq_index + dev->nvqs == dev->vq_index_end) {
+g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
+}
+}
+
  static NetClientInfo net_vhost_vdpa_info = {
  .type = NET_CLIENT_DRIVER_VHOST_VDPA,
  .size = sizeof(VhostVDPAState),
  .receive = vhost_vdpa_receive,
+.start = vhost_vdpa_net_data_start,
+.stop = vhost_vdpa_net_client_stop,
  .cleanup = vhost_vdpa_cleanup,
  .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
  .has_ufo = vhost_vdpa_has_ufo,
@@ -351,7 +401,7 @@ dma_map_err:
  
  static int vhost_vdpa_net_cvq_start(NetClientState *nc)

  {
-VhostVDPAState *s;
+VhostVDPAState *s, *s0;
  struct vhost_vdpa *v;
  uint64_t backend_features;
  int64_t cvq_group;
@@ -425,6 +475,15 @@ out:
  return 0;
  }
  
+s0 = vhost_vdpa_net_first_nc_vdpa(s);

+if (s0->vhost_vdpa.iova_tree) {
+/* SVQ is already configured for all virtqueues */
+v->iova_tree = s0->vhost_vdpa.iova_tree;
+} else {
+v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
+   v->iova_range.last);
I wonder how this case could happen; vhost_vdpa_net_data_start_first() 
should've allocated an iova tree on the first data vq. Are zero data vqs 
ever possible on net vhost-vdpa?


Thanks,
-Siwei

+}
+
  r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer,
 vhost_vdpa_net_cvq_cmd_page_len(), false);
  if (unlikely(r < 0)) {
@@ -449,15 +508,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc)
  if (s->vhost_vdpa.shadow_vqs_enabled) {
  vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
  vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status);
-if (!s->always_svq) {
-/*
- * If only the CVQ is shadowed we can delete this safely.
- * If all the VQs are shadows this will be needed by the time the
- * device is started again to register SVQ vrings and similar.
- */
-g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete);
-}
  }
+
+vhost_vdpa_net_client_stop(nc);
  }
  
  static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len,

@@ -667,8 +720,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState 
*peer,
 int nvqs,

Re: [PATCH v2 09/13] vdpa net: block migration if the device has CVQ

2023-02-12 Thread Si-Wei Liu




On 2/8/2023 1:42 AM, Eugenio Pérez wrote:

Devices with CVQ needs to migrate state beyond vq state.  Leaving this
to future series.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index bca13f97fd..309861e56c 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -955,11 +955,17 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char 
*name,
  }
  
  if (has_cvq) {

+VhostVDPAState *s;
+
  nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name,
   vdpa_device_fd, i, 1, false,
   opts->x_svq, iova_range);
  if (!nc)
  goto err;
+
+s = DO_UPCAST(VhostVDPAState, nc, nc);
+error_setg(&s->vhost_vdpa.dev->migration_blocker,
+   "net vdpa cannot migrate with MQ feature");
Not sure how this can work: migration_blocker is only checked and gets 
added from vhost_dev_init(), which is already done through 
net_vhost_vdpa_init() above. Same question applies to the next patch of 
this series.


Thanks,
-Siwei
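
For reference, the check in question lives in vhost_dev_init(); the shape is
roughly the following (sketched from memory, not a verbatim quote), so a
blocker installed after net_vhost_vdpa_init() has returned is never
registered:

    /* hw/virtio/vhost.c, vhost_dev_init() - approximate shape */
    if (hdev->migration_blocker != NULL) {
        r = migrate_add_blocker(hdev->migration_blocker, errp);
        if (r < 0) {
            /* init fails; nothing re-checks migration_blocker afterwards */
            goto fail;
        }
    }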


  }
  
  return 0;





Re: [PATCH v2 07/13] vdpa: add vdpa net migration state notifier

2023-02-12 Thread Si-Wei Liu



On 2/8/2023 1:42 AM, Eugenio Pérez wrote:

This allows net to restart the device backend to configure SVQ on it.

Ideally, these changes should not be net specific. However, the vdpa net
backend is the one with enough knowledge to configure everything because
of some reasons:
* Queues might need to be shadowed or not depending on its kind (control
   vs data).
* Queues need to share the same map translations (iova tree).

Because of that it is cleaner to restart the whole net backend and
configure again as expected, similar to how vhost-kernel moves between
userspace and passthrough.

If more kinds of devices need dynamic switching to SVQ we can create a
callback struct like VhostOps and move most of the code there.
VhostOps cannot be reused since all vdpa backend share them, and to
personalize just for networking would be too heavy.

Signed-off-by: Eugenio Pérez 
---
v3:
* Add TODO to use the resume operation in the future.
* Use migration_in_setup and migration_has_failed instead of a
   complicated switch case.
---
  net/vhost-vdpa.c | 76 
  1 file changed, 76 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index dd686b4514..bca13f97fd 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -26,12 +26,14 @@
  #include 
  #include "standard-headers/linux/virtio_net.h"
  #include "monitor/monitor.h"
+#include "migration/misc.h"
  #include "hw/virtio/vhost.h"
  
  /* Todo:need to add the multiqueue support here */

  typedef struct VhostVDPAState {
  NetClientState nc;
  struct vhost_vdpa vhost_vdpa;
+Notifier migration_state;
  VHostNetState *vhost_net;
  
  /* Control commands shadow buffers */

@@ -241,10 +243,79 @@ static VhostVDPAState 
*vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s)
  return DO_UPCAST(VhostVDPAState, nc, nc0);
  }
  
+static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable)

+{
+struct vhost_vdpa *v = &s->vhost_vdpa;
+VirtIONet *n;
+VirtIODevice *vdev;
+int data_queue_pairs, cvq, r;
+NetClientState *peer;
+
+/* We are only called on the first data vqs and only if x-svq is not set */
+if (s->vhost_vdpa.shadow_vqs_enabled == enable) {
+return;
+}
+
+vdev = v->dev->vdev;
+n = VIRTIO_NET(vdev);
+if (!n->vhost_started) {
+return;
What if vhost gets started after migration has started? Will SVQ still be 
(dynamically) enabled during vhost_dev_start()? I don't see relevant 
code to deal with it.



+}
+
+data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1;
+cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ?
+  n->max_ncs - n->max_queue_pairs : 0;
+/*
+ * TODO: vhost_net_stop does suspend, get_base and reset. We can be smarter
+ * in the future and resume the device if read-only operations between
+ * suspend and reset goes wrong.
+ */
+vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq);
+
+peer = s->nc.peer;
+for (int i = 0; i < data_queue_pairs + cvq; i++) {
+VhostVDPAState *vdpa_state;
+NetClientState *nc;
+
+if (i < data_queue_pairs) {
+nc = qemu_get_peer(peer, i);
+} else {
+nc = qemu_get_peer(peer, n->max_queue_pairs);
+}
+
+vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc);
+vdpa_state->vhost_vdpa.shadow_data = enable;
I don't get why shadow_data is set on cvq's vhost_vdpa. This may result in 
an address space collision: the data vq's iova getting improperly allocated 
in cvq's address space in vhost_vdpa_listener_region_{add,del}(). Note that 
currently there's an issue where the guest VM's memory listener registration 
is always hooked to the last vq, which could be the cvq in a 
different iova address space, VHOST_VDPA_NET_CVQ_ASID.


Thanks,
-Siwei


+
+if (i < data_queue_pairs) {
+/* Do not override CVQ shadow_vqs_enabled */
+vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable;
+}
+}
+
+r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq);
+if (unlikely(r < 0)) {
+error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r);
+}
+}
+
+static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data)
+{
+MigrationState *migration = data;
+VhostVDPAState *s = container_of(notifier, VhostVDPAState,
+ migration_state);
+
+if (migration_in_setup(migration)) {
+vhost_vdpa_net_log_global_enable(s, true);
+} else if (migration_has_failed(migration)) {
+vhost_vdpa_net_log_global_enable(s, false);
+}
+}
+
  static void vhost_vdpa_net_data_start_first(VhostVDPAState *s)
  {
  struct vhost_vdpa *v = &s->vhost_vdpa;
  
+add_migration_state_change_notifier(&s->migration_state);

  if (v->shadow_vqs_enabled) {
  v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
   

Re: [PATCH v4 07/15] vdpa: add vhost_vdpa_suspend

2023-02-28 Thread Si-Wei Liu




On 2/24/2023 7:54 AM, Eugenio Pérez wrote:

The function vhost.c:vhost_dev_stop fetches the vring base so the vq
state can be migrated to other devices.  However, this is unreliable in
vdpa, since we didn't signal the device to suspend the queues, making
the value fetched useless.

Suspend the device if possible before fetching first and subsequent
vring bases.

Moreover, vdpa totally reset and wipes the device at the last device
before fetch its vrings base, making that operation useless in the last
device. This will be fixed in later patches of this series.

Signed-off-by: Eugenio Pérez 
---
v4:
* Look for _F_SUSPEND at vhost_dev->backend_cap, not backend_features
* Fall back on reset & fetch used idx from guest's memory
---
  hw/virtio/vhost-vdpa.c | 25 +
  hw/virtio/trace-events |  1 +
  2 files changed, 26 insertions(+)

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 228677895a..f542960a64 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -712,6 +712,7 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev)
  
  ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status);

  trace_vhost_vdpa_reset_device(dev, status);
+v->suspended = false;
  return ret;
  }
  
@@ -1109,6 +1110,29 @@ static void vhost_vdpa_svqs_stop(struct vhost_dev *dev)

  }
  }
  
+static void vhost_vdpa_suspend(struct vhost_dev *dev)

+{
+struct vhost_vdpa *v = dev->opaque;
+int r;
+
+if (!vhost_vdpa_first_dev(dev)) {
+return;
+}
+
+if (!(dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_SUSPEND))) {
Polarity reversed. This ends up with the device always getting reset, even if 
the backend offers _F_SUSPEND.


-Siwei
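
In other words, the check presumably should read

    if (dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) {

so that the suspend path is taken when the backend does offer _F_SUSPEND, and
the fall-through to vhost_vdpa_reset_device() only happens when the feature is
absent or the SUSPEND ioctl fails.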


+trace_vhost_vdpa_suspend(dev);
+r = ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
+if (unlikely(r)) {
+error_report("Cannot suspend: %s(%d)", g_strerror(errno), errno);
+} else {
+v->suspended = true;
+return;
+}
+}
+
+vhost_vdpa_reset_device(dev);
+}
+
  static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
  {
  struct vhost_vdpa *v = dev->opaque;
@@ -1123,6 +1147,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, 
bool started)
  }
  vhost_vdpa_set_vring_ready(dev);
  } else {
+vhost_vdpa_suspend(dev);
  vhost_vdpa_svqs_stop(dev);
  vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
  }
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index a87c5f39a2..8f8d05cf9b 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -50,6 +50,7 @@ vhost_vdpa_set_vring_ready(void *dev) "dev: %p"
  vhost_vdpa_dump_config(void *dev, const char *line) "dev: %p %s"
  vhost_vdpa_set_config(void *dev, uint32_t offset, uint32_t size, uint32_t flags) "dev: %p offset: 
%"PRIu32" size: %"PRIu32" flags: 0x%"PRIx32
  vhost_vdpa_get_config(void *dev, void *config, uint32_t config_len) "dev: %p 
config: %p config_len: %"PRIu32
+vhost_vdpa_suspend(void *dev) "dev: %p"
  vhost_vdpa_dev_start(void *dev, bool started) "dev: %p started: %d"
  vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int refcnt, int fd, 
void *log) "dev: %p base: 0x%"PRIx64" size: %llu refcnt: %d fd: %d log: %p"
  vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, uint64_t desc_user_addr, uint64_t 
used_user_addr, uint64_t avail_user_addr, uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 
0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" log_guest_addr: 
0x%"PRIx64





Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-08-29 Thread Si-Wei Liu




On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:

On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  wrote:

Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
will hold all IOVA ranges that have been allocated (e.g. in the
IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.

A new API function vhost_iova_tree_insert() is also created to add a
IOVA->HVA mapping into the IOVA->HVA tree.


I think this is a good first iteration but we can take steps to
simplify it. Also, it is great to be able to make points on real code
instead of designs on the air :).

I expected a split of vhost_iova_tree_map_alloc between the current
vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
similar. Similarly, a vhost_iova_tree_remove and
vhost_iova_tree_remove_gpa would be needed.

The first one is used for regions that don't exist in the guest, like
SVQ vrings or CVQ buffers. The second one is the one used by the
memory listener to map the guest regions into the vdpa device.

Implementation wise, only two trees are actually needed:
* Current iova_taddr_map that contains all IOVA->vaddr translations as
seen by the device, so both allocation functions can work on a single
tree. The function iova_tree_find_iova keeps using this one, so the
I thought we had a thorough discussion about this and agreed upon the 
decoupled IOVA allocator solution. But maybe I missed something earlier; 
I am not clear how this iova_tree_find_iova function could still 
work with the full IOVA->HVA tree when it comes to aliased memory or 
overlapped HVAs. Granted, for the memory map removal in the 
.region_del() path we could rely on the GPA tree to locate the 
corresponding IOVA, but how could the translation path figure out 
which IOVA range to return when the vaddr happens to fall in an 
overlapped HVA range? Do we still assume some overlapping order so we 
always return the first match from the tree? Or do we expect every current 
user of iova_tree_find_iova to pass in a GPA rather than an HVA and use 
the vhost_iova_xxx_gpa API variant to look up the IOVA?


Thanks,
-Siwei


user does not need to know if the address is from the guest or only
exists in QEMU by using RAMBlock etc. All insert and remove functions
use this tree.
* A new tree that relates IOVA to GPA, that only
vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses.

The ideal case is that the key in this new tree is the GPA and the
value is the IOVA. But IOVATree's DMA is named the reverse: iova is
the key and translated_addr is the vaddr. We can create a new tree
struct for that, use GTree directly, or translate the reverse
linearly. As memory add / remove should not be frequent, I think the
simpler is the last one, but I'd be ok with creating a new tree.

vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree
also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in
this tree, and only remove the associated DMAMap in iova_taddr_map
that matches the IOVA.

Does it make sense to you?
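
A minimal sketch of what that split could look like, leaning on the existing
iova_tree_alloc_map()/iova_tree_insert() helpers. The iova_gpa_map field, the
_gpa signature and the choice of keying the new tree by IOVA (with GPA as the
value) are placeholders for the open questions above, not the final API.

    /* Allocation for QEMU-only buffers (SVQ vrings, CVQ shadow buffers):
     * only the IOVA allocator and the IOVA->HVA tree are involved. */
    int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map)
    {
        return iova_tree_alloc_map(tree->iova_taddr_map, map,
                                   tree->iova_first, tree->iova_last);
    }

    /* Allocation for guest memory coming from the memory listener:
     * additionally record the GPA<->IOVA relation so .region_del()
     * can find the exact range to remove later. */
    int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
    {
        int r = vhost_iova_tree_map_alloc(tree, map);
        if (r != IOVA_OK) {
            return r;
        }

        DMAMap gpa_map = {
            .iova = map->iova,
            .translated_addr = gpa,
            .size = map->size,
        };
        return iova_tree_insert(tree->iova_gpa_map, &gpa_map);
    }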


Signed-off-by: Jonah Palmer 
---
  hw/virtio/vhost-iova-tree.c | 38 -
  hw/virtio/vhost-iova-tree.h |  1 +
  hw/virtio/vhost-vdpa.c  | 31 --
  net/vhost-vdpa.c| 13 +++--
  4 files changed, 70 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
index 3d03395a77..32c03db2f5 100644
--- a/hw/virtio/vhost-iova-tree.c
+++ b/hw/virtio/vhost-iova-tree.c
@@ -28,12 +28,17 @@ struct VhostIOVATree {

  /* IOVA address to qemu memory maps. */
  IOVATree *iova_taddr_map;
+
+/* IOVA tree (IOVA allocator) */
+IOVATree *iova_map;
  };

  /**
- * Create a new IOVA tree
+ * Create a new VhostIOVATree with a new set of IOVATree's:

s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an
implementation detail.


+ * - IOVA allocator (iova_map)
+ * - IOVA->HVA tree (iova_taddr_map)
   *
- * Returns the new IOVA tree
+ * Returns the new VhostIOVATree
   */
  VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last)
  {
@@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr 
iova_last)
  tree->iova_last = iova_last;

  tree->iova_taddr_map = iova_tree_new();
+tree->iova_map = iova_tree_new();
  return tree;
  }

@@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr 
iova_last)
  void vhost_iova_tree_delete(VhostIOVATree *iova_tree)
  {
  iova_tree_destroy(iova_tree->iova_taddr_map);
+iova_tree_destroy(iova_tree->iova_map);
  g_free(iova_tree);
  }

@@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap 
*map)
  /* Some vhost devices do not like addr 0. Skip first page */
  hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size();

-if (map->translated_addr + map->size < m

Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator

2024-08-30 Thread Si-Wei Liu




On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote:

On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu  wrote:



On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote:

On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer  wrote:

Decouples the IOVA allocator from the IOVA->HVA tree and instead adds
the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree
will hold all IOVA ranges that have been allocated (e.g. in the
IOVA->HVA tree) and are removed when any IOVA ranges are deallocated.

A new API function vhost_iova_tree_insert() is also created to add a
IOVA->HVA mapping into the IOVA->HVA tree.


I think this is a good first iteration but we can take steps to
simplify it. Also, it is great to be able to make points on real code
instead of designs on the air :).

I expected a split of vhost_iova_tree_map_alloc between the current
vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or
similar. Similarly, a vhost_iova_tree_remove and
vhost_iova_tree_remove_gpa would be needed.

The first one is used for regions that don't exist in the guest, like
SVQ vrings or CVQ buffers. The second one is the one used by the
memory listener to map the guest regions into the vdpa device.

Implementation wise, only two trees are actually needed:
* Current iova_taddr_map that contains all IOVA->vaddr translations as
seen by the device, so both allocation functions can work on a single
tree. The function iova_tree_find_iova keeps using this one, so the

I thought we had thorough discussion about this and agreed upon the
decoupled IOVA allocator solution.

My interpretation of it is to leave the allocator as the current one,
and create a new tree with GPA which is guaranteed to be unique. But
we can talk over it of course.


But maybe I missed something earlier,
I am not clear how come this iova_tree_find_iova function could still
work with the full IOVA-> HVA tree when it comes to aliased memory or
overlapped HVAs? Granted, for the memory map removal in the
.region_del() path, we could rely on the GPA tree to locate the
corresponding IOVA, but how come the translation path could figure out
which IOVA range to return when the vaddr happens to fall in an
overlapped HVA range?

That is not a problem, as they both translate to the same address at the device.
Not sure I followed; it might return a wrong IOVA (range) for which the host 
kernel has a conflicting or mismatched attribute, i.e. permission, size, 
etc., in the map.




The most complicated situation is where we have a region contained in
another region, and the requested buffer crosses them. If the IOVA
tree returns the inner region, it will return the buffer chained with
the rest of the content in the outer region. Not optimal, but solved
either way.
I don't quite understand what that means... So in this overlapping case, 
speaking of the expectation of the translation API, you would like 
all IOVA ranges that match the overlapped HVA to be returned, and 
then rely on the user (caller) to figure out which one is correct? 
Wouldn't it be easier for the user (SVQ) to use the memory system API 
directly to figure that out?


Since we are talking about an API, we may want to build it generically 
enough to address all possible needs (in line with what the memory 
subsystem is capable of), rather than just looking at the current usage, 
which has a rather narrow scope. Although the virtio-net device doesn't 
work with aliased regions now, some other virtio device may, or maybe some 
day virtio-net will need to use aliased regions; then the API and its 
users (SVQ) would have to go through another round of significant 
refactoring due to the iova-tree internals. I feel it's just too early, or 
too tight, to abstract the iova-tree layer and customize the API for the 
current use case with a lot of limitations on how users should expect to 
use it. We need more flexibility and extensibility if we want to take the 
chance to get it rewritten, given it is not a lot of code that Jonah has 
shown here.



The only problem that comes to my mind is the case where the inner
region is RO
Yes, this is one of examples around the permission or size I mentioned 
above, which may have a conflict view with the memory system or the kernel.


Thanks,
-Siwei


and it is a write command, but I don't think we have this
case in a sane guest. A malicious guest cannot do any harm this way
anyway.


Do we still assume some overlapping order so we
always return the first match from the tree? Or we expect every current
user of iova_tree_find_iova should pass in GPA rather than HVA and use
the vhost_iova_xxx_gpa API variant to look up IOVA?


No, iova_tree_find_iova should keep asking for vaddr, as the result is
guaranteed to be there. Users of VhostIOVATree only need to modify how
they add or remove regions, knowing if they come from the guest or
not. As shown by this series, it is easier to do in that place than in
translation.


Th
