Re: Reducing vdpa migration downtime because of memory pin / maps
On 7/19/2023 3:40 AM, Eugenio Perez Martin wrote: On Mon, Jul 17, 2023 at 9:57 PM Si-Wei Liu wrote: Hey, I am now back from the break. Sorry for the delayed response, please see in line. On 7/9/2023 11:04 PM, Eugenio Perez Martin wrote: On Sat, Jul 8, 2023 at 11:14 AM Si-Wei Liu wrote: On 7/5/2023 10:46 PM, Eugenio Perez Martin wrote: On Thu, Jul 6, 2023 at 2:13 AM Si-Wei Liu wrote: On 7/5/2023 11:03 AM, Eugenio Perez Martin wrote: On Tue, Jun 27, 2023 at 8:36 AM Si-Wei Liu wrote: On 6/9/2023 7:32 AM, Eugenio Perez Martin wrote: On Fri, Jun 9, 2023 at 12:39 AM Si-Wei Liu wrote: On 6/7/23 01:08, Eugenio Perez Martin wrote: On Wed, Jun 7, 2023 at 12:43 AM Si-Wei Liu wrote: Sorry for reviving this old thread, I lost the best timing to follow up on this while I was on vacation. I have been working on this and found out some discrepancy, please see below. On 4/5/23 04:37, Eugenio Perez Martin wrote: Hi! As mentioned in the last upstream virtio-networking meeting, one of the factors that adds more downtime to migration is the handling of the guest memory (pin, map, etc). At this moment this handling is bound to the virtio life cycle (DRIVER_OK, RESET). In that sense, the destination device waits until all the guest memory / state is migrated to start pinning all the memory. The proposal is to bind it to the char device life cycle (open vs close), Hmmm, really? If it's the life cycle for char device, the next guest / qemu launch on the same vhost-vdpa device node won't make it work. Maybe my sentence was not accurate, but I think we're on the same page here. Two qemu instances opening the same char device at the same time are not allowed, and vhost_vdpa_release clean all the maps. So the next qemu that opens the char device should see a clean device anyway. I mean the pin can't be done at the time of char device open, where the user address space is not known/bound yet. The earliest point possible for pinning would be until the vhost_attach_mm() call from SET_OWNER is done. Maybe we are deviating, let me start again. Using QEMU code, what I'm proposing is to modify the lifecycle of the .listener member of struct vhost_vdpa. At this moment, the memory listener is registered at vhost_vdpa_dev_start(dev, started=true) call for the last vhost_dev, and is unregistered in both vhost_vdpa_reset_status and vhost_vdpa_cleanup. My original proposal was just to move the memory listener registration to the last vhost_vdpa_init, and remove the unregister from vhost_vdpa_reset_status. The calls to vhost_vdpa_dma_map/unmap would be the same, the device should not realize this change. This can address LM downtime latency for sure, but it won't help downtime during dynamic SVQ switch - which still needs to go through the full unmap/map cycle (that includes the slow part for pinning) from passthrough to SVQ mode. Be noted not every device could work with a separate ASID for SVQ descriptors. The fix should expect to work on normal vDPA vendor devices without a separate descriptor ASID, with platform IOMMU underneath or with on-chip IOMMU. At this moment the SVQ switch is very inefficient mapping-wise, as it unmap all the GPA->HVA maps and overrides it. In particular, SVQ is allocated in low regions of the iova space, and then the guest memory is allocated in this new IOVA region incrementally. Yep. The key to build this fast path for SVQ switching I think is to maintain the identity mapping for the passthrough queues so that QEMU can reuse the old mappings for guest memory (e.g. 
GIOVA identity mapped to GPA) while incrementally adding new mappings for SVQ vrings. We can optimize that if we place SVQ in a free GPA area instead. Here's a question though: it might not be hard to find a free GPA range for the non-vIOMMU case (allocate iova from beyond the 48bit or 52bit ranges), but I'm not sure if it's easy to find a free GIOVA range for the vIOMMU case - particularly this has to work within the same entire 64bit IOVA address range, and (for now) QEMU won't be able to "reserve" a specific IOVA range for SVQ from the vIOMMU. Do you foresee this can be done for every QEMU emulated vIOMMU (intel-iommu, amd-iommu, arm smmu and virtio-iommu) so that we can call it out as a generic means for SVQ switching optimization? In the case vIOMMU allocates a new block we will use the same algorithm as now: * Find a new free IOVA chunk of the same size * Map this new SVQ IOVA, that may or may not be the same as SVQ Since we must go through the translation phase to sanitize guest's available descriptors anyway, it has zero added cost. Not sure I followed, this can work but doesn't seem able to reuse the old host kernel mappings for guest memory, hence still requires remapping the entire host IOVA range when SVQ IOVA comes along. I think by maintaining a 1:1 identity map on guest memory, we don't have to bother tearing down existing HVA-
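To make the idea above concrete, here is a minimal, self-contained sketch (in C) of the allocation scheme being discussed: guest memory keeps its GPA-identity mappings, and SVQ vring IOVA is carved out of a free window above the highest guest address, so the passthrough-to-SVQ switch only adds a few new maps instead of unmapping and re-pinning everything. The allocator and all names below are illustrative assumptions, not QEMU's actual VhostIOVATree code.

#include <stdint.h>

/* Illustrative allocator state: a free IOVA window above all guest memory
 * (hypothetical names, not QEMU code). */
struct svq_iova_window {
    uint64_t next_free;   /* first IOVA above the highest guest GPA */
    uint64_t last;        /* last usable IOVA reported by the device */
};

/* Carve an SVQ vring mapping out of the free window. Guest memory itself is
 * left untouched (GPA == IOVA), so switching to SVQ does not require
 * unmapping or re-pinning it. Returns UINT64_MAX when the window is
 * exhausted, in which case the caller would fall back to the current
 * full-remap path. */
static uint64_t svq_iova_alloc(struct svq_iova_window *w, uint64_t size)
{
    if (size == 0 || size > w->last - w->next_free + 1) {
        return UINT64_MAX;
    }
    uint64_t iova = w->next_free;
    w->next_free += size;
    return iova;
}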
Re: [PATCH 1/2] Reduce vdpa initialization / startup overhead
On 7/21/2023 3:39 AM, Eugenio Perez Martin wrote: On Tue, Jul 18, 2023 at 12:55 PM Michael S. Tsirkin wrote: On Thu, Apr 20, 2023 at 10:59:56AM +0200, Eugenio Perez Martin wrote: On Thu, Apr 20, 2023 at 7:25 AM Pei Li wrote: Hi all, My bad, I just submitted the kernel patch. If we are passing some generic command, still we have to add an additional field in the structure to indicate what is the unbatched version of this command, and the struct vhost_ioctls would be some command specific structure. In summary, the structure would be something like struct vhost_cmd_batch { int ncmds; int cmd; The unbatched version should go in each vhost_ioctls. That allows us to send many different commands in one ioctl instead of having to resort to many ioctls, each one for a different task. The problem with that is the size of that struct vhost_ioctl, so we can build an array. I think it should be enough with the biggest of them (vhost_vring_addr ?) for a long time, but I would like to know if anybody finds a drawback here. We could always resort to pointers if we find we need more space, or start with them from the beginning. CCing Si-Wei here too, as he is also interested in reducing the startup time. Thanks! And copying my response too: This is all very exciting, but what exactly is the benefit? No optimization patches are going to be merged without numbers showing performance gains. In this case, can you show gains in process startup time? Are they significant enough to warrant adding new UAPI? This should have been marked as RFC in that regard. When this was sent it was one of the planned actions to reduce overhead. After Si-Wei benchmarks, all the efforts will focus on reducing the pinning / maps for the moment. It is unlikely that this will be moved forward soon. Right, this work has comparatively lower priority in terms of significance of impact to migration downtime (to vdpa h/w device that does DMA), but after getting the pinning/map latency effect removed from the performance path, it'd be easier to see same scalability effect subjected to vq count as how software vp_vdpa performs today. I think in order to profile the vq scalability effect with large queue count, we first would need to have proper implementation of CVQ replay and multiqueue LM in place - I'm not sure if x-svq=on could be a good approximate, but maybe that can be used to collect some initial profiling data. Would this be sufficient to move this forward in parallel? Regards, -Siwei Thanks! struct vhost_ioctls[]; }; This is doable. Also, this is my first time submitting patches to open source, sorry about the mess in advance. That being said, feel free to throw questions / comments! Thanks and best regards, Pei On Wed, Apr 19, 2023 at 9:19 PM Jason Wang wrote: On Wed, Apr 19, 2023 at 11:33 PM Eugenio Perez Martin wrote: On Wed, Apr 19, 2023 at 12:56 AM wrote: From: Pei Li Currently, part of the vdpa initialization / startup process needs to trigger many ioctls per vq, which is very inefficient and causing unnecessary context switch between user mode and kernel mode. This patch creates an additional ioctl() command, namely VHOST_VDPA_GET_VRING_GROUP_BATCH, that will batching commands of VHOST_VDPA_GET_VRING_GROUP into a single ioctl() call. I'd expect there's a kernel patch but I didn't see that? If we want to go this way. 
Why simply have a more generic way, that is introducing something like: VHOST_CMD_BATCH which did something like struct vhost_cmd_batch { int ncmds; struct vhost_ioctls[]; }; Then you can batch other ioctls other than GET_VRING_GROUP? Thanks It seems to me you forgot to send the 0/2 cover letter :). Please include the time we save thanks to avoiding the repetitive ioctls in each patch. CCing Jason and Michael. Signed-off-by: Pei Li --- hw/virtio/vhost-vdpa.c | 31 +++- include/standard-headers/linux/vhost_types.h | 3 ++ linux-headers/linux/vhost.h | 7 + 3 files changed, 33 insertions(+), 8 deletions(-) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index bc6bad23d5..6d45ff8539 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -679,7 +679,8 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev) uint64_t f = 0x1ULL << VHOST_BACKEND_F_IOTLB_MSG_V2 | 0x1ULL << VHOST_BACKEND_F_IOTLB_BATCH | 0x1ULL << VHOST_BACKEND_F_IOTLB_ASID | -0x1ULL << VHOST_BACKEND_F_SUSPEND; +0x1ULL << VHOST_BACKEND_F_SUSPEND | +0x1ULL << VHOST_BACKEND_F_IOCTL_BATCH; int r; if (vhost_vdpa_call(dev, VHOST_GET_BACKEND_FEATURES, &features)) { @@ -731,14 +732,28 @@ static int vhost_vdpa_get_vq_index(struct vhost_dev *dev, int idx) static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev) { -int i; +int i, nvqs = dev->nvqs; +uint64_t backend_features = dev->backend_cap; + trac
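For illustration, below is a rough, self-contained sketch (in C) of what the generic batched-ioctl UAPI discussed in this thread could look like. The command name VHOST_CMD_BATCH, the struct names, and the fixed payload size are assumptions made purely for this example; they are not part of any merged kernel UAPI.

#include <stdint.h>

/* Stand-in size for the largest per-command payload (e.g. something like
 * struct vhost_vring_addr); a real UAPI would size this from the actual
 * vhost structures. */
#define VHOST_CMD_PAYLOAD_MAX 40

struct vhost_cmd_entry {
    uint32_t cmd;                             /* unbatched ioctl number, e.g. VHOST_VDPA_GET_VRING_GROUP */
    uint8_t  payload[VHOST_CMD_PAYLOAD_MAX];  /* in/out data for that command */
};

struct vhost_cmd_batch {
    uint32_t ncmds;                           /* number of entries that follow */
    struct vhost_cmd_entry cmds[];            /* flexible array, one entry per command */
};

A single hypothetical VHOST_CMD_BATCH ioctl carrying ncmds such entries would then replace ncmds separate user/kernel round trips, which is where the startup-time saving is expected to come from.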
Re: [RFC PATCH 07/12] vdpa: add vhost_vdpa_reset_queue
On 7/20/2023 11:14 AM, Eugenio Pérez wrote: Split out vq reset operation in its own function, as it may be called with ring reset. Signed-off-by: Eugenio Pérez --- hw/virtio/vhost-vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 6ae276ccde..df2515a247 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -547,6 +547,21 @@ int vhost_vdpa_set_vring_ready(struct vhost_vdpa *v, unsigned idx) return vhost_vdpa_set_vring_ready_internal(v, idx, true); } +/* TODO: Properly reorder static functions */ +static void vhost_vdpa_svq_stop(struct vhost_dev *dev, unsigned idx); +static void vhost_vdpa_reset_queue(struct vhost_dev *dev, int idx) +{ +struct vhost_vdpa *v = dev->opaque; + +if (dev->features & VIRTIO_F_RING_RESET) { +vhost_vdpa_set_vring_ready_internal(v, idx, false); I'm not sure I understand this patch - this is NOT the spec defined way to initiate RING_RESET? Quoting the spec diff from the original RING_RESET tex doc: +The device MUST reset the queue when 1 is written to \field{queue_reset}, and +present a 1 in \field{queue_reset} after the queue has been reset, until the +driver re-enables the queue via \field{queue_enable} or the device is reset. +The device MUST present consistent default values after queue reset. +(see \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset}). Or you intend to rewrite it to be spec conforming later on? -Siwei +} + +if (v->shadow_vqs_enabled) { +vhost_vdpa_svq_stop(dev, idx - dev->vq_index); +} +} + /* * The use of this function is for requests that only need to be * applied once. Typically such request occurs at the beginning @@ -1543,4 +1558,5 @@ const VhostOps vdpa_ops = { .vhost_force_iommu = vhost_vdpa_force_iommu, .vhost_set_config_call = vhost_vdpa_set_config_call, .vhost_reset_status = vhost_vdpa_reset_status, +.vhost_reset_queue = vhost_vdpa_reset_queue, };
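For reference, here is a minimal sketch (in C) of the spec handshake quoted above, seen from the driver side: write 1 to queue_reset, wait until the device presents 1 there, then re-enable the queue via queue_enable. The register block is a simplified stand-in for the virtio-pci common configuration space, not QEMU or kernel code.

#include <stdint.h>

/* Simplified stand-in for the per-queue fields of the common config space. */
struct vq_common_cfg {
    uint16_t queue_reset;
    uint16_t queue_enable;
};

static void vq_reset_handshake(volatile struct vq_common_cfg *cfg)
{
    cfg->queue_reset = 1;              /* driver requests the reset */
    while (cfg->queue_reset != 1) {    /* device presents 1 once the queue is reset */
        /* poll until the device acknowledges */
    }
    /* ...the driver may now repopulate ring addresses, size, etc... */
    cfg->queue_enable = 1;             /* re-enabling the queue ends the reset state */
}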
Re: [RFC PATCH 11/12] vdpa: use SVQ to stall dataplane while NIC state is being restored
On 7/20/2023 11:14 AM, Eugenio Pérez wrote: Some dynamic state of a virtio-net vDPA devices is restored from CVQ in the event of a live migration. However, dataplane needs to be disabled so the NIC does not receive buffers in the invalid ring. As a default method to achieve it, let's offer a shadow vring with 0 avail idx. As a fallback method, we will enable dataplane vqs later, as proposed previously. Let's not jump to conclusion too early what will be the default v.s. fallback [1] - as this is on a latency sensitive path, I'm not fully convinced ring reset could perform better than or equally same as the deferred dataplane enablement approach on hardware. At this stage I think ring_reset has no adoption on vendors device, while it's definitely easier with lower hardware overhead for vendor to implement deferred dataplane enabling. If at some point vendor's device has to support RING_RESET for other use cases (MTU change propagation for ex., a prerequisite for GRO HW) than live migration, defaulting to RING_RESET on this SVQ path has no real benefit but adds complications needlessly to vendor's device. [1] https://lore.kernel.org/virtualization/bf2164a9-1dfd-14d9-be2a-8bb7620a0...@oracle.com/T/#m15caca6fbb00ca9c00e2b33391297a2d8282ff89 Thanks, -Siwei Signed-off-by: Eugenio Pérez --- net/vhost-vdpa.c | 49 +++- 1 file changed, 44 insertions(+), 5 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index af83de92f8..e14ae48f23 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -338,10 +338,25 @@ static int vhost_vdpa_net_data_start(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); struct vhost_vdpa *v = &s->vhost_vdpa; +bool has_cvq = v->dev->vq_index_end % 2; assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); -if (s->always_svq || +if (has_cvq && (v->dev->features & VIRTIO_F_RING_RESET)) { +/* + * Offer a fake vring to the device while the state is restored + * through CVQ. That way, the guest will not see packets in unexpected + * queues. + * + * This will be undone after loading all state through CVQ, at + * vhost_vdpa_net_load. + * + * TODO: Future optimizations may skip some SVQ setup and teardown, + * like set the right kick and call fd or doorbell maps directly, and + * the iova tree. + */ +v->shadow_vqs_enabled = true; +} else if (s->always_svq || migration_is_setup_or_active(migrate_get_current()->state)) { v->shadow_vqs_enabled = true; v->shadow_data = true; @@ -738,10 +753,34 @@ static int vhost_vdpa_net_load(NetClientState *nc) return r; } -for (int i = 0; i < v->dev->vq_index; ++i) { -r = vhost_vdpa_set_vring_ready(v, i); -if (unlikely(r)) { -return r; +if (v->dev->features & VIRTIO_F_RING_RESET && !s->always_svq && +!migration_is_setup_or_active(migrate_get_current()->state)) { +NICState *nic = qemu_get_nic(s->nc.peer); +int queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; + +for (int i = 0; i < queue_pairs; ++i) { +NetClientState *ncs = qemu_get_peer(nic->ncs, i); +VhostVDPAState *s_i = DO_UPCAST(VhostVDPAState, nc, ncs); + +for (int j = 0; j < 2; ++j) { +vhost_net_virtqueue_reset(v->dev->vdev, ncs->peer, j); +} + +s_i->vhost_vdpa.shadow_vqs_enabled = false; + +for (int j = 0; j < 2; ++j) { +r = vhost_net_virtqueue_restart(v->dev->vdev, ncs->peer, j); +if (unlikely(r < 0)) { +return r; +} +} +} +} else { +for (int i = 0; i < v->dev->vq_index; ++i) { +r = vhost_vdpa_set_vring_ready(v, i); +if (unlikely(r)) { +return r; +} } }
[PATCH 00/12] Preparatory patches for live migration downtime improvement
This small series is a spin-off from [1], where the patches already acked from that large patchset may get merged earlier without having to wait for those that are still in review. The last 3 patches (10 - 12) are a bug fix for an issue where cancellation of ongoing migration may lead to a busted network. These are the only outstanding patches in this patchset with no acknowledgement received as yet. Please try to review them at the earliest opportunity. Thanks!

Regards,
-Siwei

[1] [PATCH 00/40] vdpa-net: improve migration downtime through descriptor ASID and persistent IOTLB
https://lore.kernel.org/qemu-devel/1701970793-6865-1-git-send-email-si-wei@oracle.com/

---
Si-Wei Liu (12):
  vdpa: add back vhost_vdpa_net_first_nc_vdpa
  vdpa: no repeat setting shadow_data
  vdpa: factor out vhost_vdpa_last_dev
  vdpa: factor out vhost_vdpa_net_get_nc_vdpa
  vdpa: add vhost_vdpa_set_address_space_id trace
  vdpa: add vhost_vdpa_get_vring_base trace for svq mode
  vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode
  vdpa: add trace events for vhost_vdpa_net_load_cmd
  vdpa: add trace event for vhost_vdpa_net_load_mq
  vdpa: define SVQ transitioning state for mode switching
  vdpa: indicate transitional state for SVQ switching
  vdpa: fix network breakage after cancelling migration

 hw/virtio/trace-events         | 4 ++--
 hw/virtio/vhost-vdpa.c         | 27 ++-
 include/hw/virtio/vhost-vdpa.h | 9 +
 net/trace-events               | 6 ++
 net/vhost-vdpa.c               | 33 +
 5 files changed, 68 insertions(+), 11 deletions(-)

-- 
1.8.3.1
[PATCH 02/12] vdpa: no repeat setting shadow_data
Since shadow_data is now shared in the parent data struct, it only needs to be set once by the first vq. This change will make shadow_data independent of the SVQ enabled state, which can be optionally turned off when SVQ descriptors and device driver areas are all isolated to a separate address space. Reviewed-by: Eugenio Pérez Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 4479ffa..06c83b4 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -354,13 +354,12 @@ static int vhost_vdpa_net_data_start(NetClientState *nc) if (s->always_svq || migration_is_setup_or_active(migrate_get_current()->state)) { v->shadow_vqs_enabled = true; -v->shared->shadow_data = true; } else { v->shadow_vqs_enabled = false; -v->shared->shadow_data = false; } if (v->index == 0) { +v->shared->shadow_data = v->shadow_vqs_enabled; vhost_vdpa_net_data_start_first(s); return 0; } -- 1.8.3.1
[PATCH 12/12] vdpa: fix network breakage after cancelling migration
Fix an issue where cancellation of ongoing migration ends up with no network connectivity. When canceling migration, SVQ will be switched back to the passthrough mode, but the right call fd is not programmed to the device and the SVQ's own call fd is still used. During this transitioning period, shadow_vqs_enabled hasn't been set back to false yet, causing the installation of the call fd to be inadvertently bypassed. Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding capabilities") Cc: Eugenio Pérez Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/virtio/vhost-vdpa.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 004110f..dfeca8b 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -1468,7 +1468,15 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev, /* Remember last call fd because we can switch to SVQ anytime. */ vhost_svq_set_svq_call_fd(svq, file->fd); -if (v->shadow_vqs_enabled) { +/* + * When SVQ is transitioning to off, shadow_vqs_enabled has + * not been set back to false yet, but the underlying call fd + * will have to switch back to the guest notifier to signal the + * passthrough virtqueues. In other situations, SVQ's own call + * fd shall be used to signal the device model. + */ +if (v->shadow_vqs_enabled && +v->shared->svq_switching != SVQ_TSTATE_DISABLING) { return 0; } -- 1.8.3.1
[PATCH 11/12] vdpa: indicate transitional state for SVQ switching
svq_switching indicates the transitional state: whether SVQ mode switching is in progress, and in which direction. Add the necessary state around where the switching takes place. Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 9f25221..96d95b9 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -317,6 +317,8 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? n->max_ncs - n->max_queue_pairs : 0; +v->shared->svq_switching = enable ? +SVQ_TSTATE_ENABLING : SVQ_TSTATE_DISABLING; /* * TODO: vhost_net_stop does suspend, get_base and reset. We can be smarter * in the future and resume the device if read-only operations between @@ -329,6 +331,7 @@ static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) if (unlikely(r < 0)) { error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); } +v->shared->svq_switching = SVQ_TSTATE_DONE; } static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) -- 1.8.3.1
[PATCH 05/12] vdpa: add vhost_vdpa_set_address_space_id trace
For better debuggability and observability. Reviewed-by: Eugenio Pérez Signed-off-by: Si-Wei Liu --- net/trace-events | 3 +++ net/vhost-vdpa.c | 3 +++ 2 files changed, 6 insertions(+) diff --git a/net/trace-events b/net/trace-events index 823a071..aab666a 100644 --- a/net/trace-events +++ b/net/trace-events @@ -23,3 +23,6 @@ colo_compare_tcp_info(const char *pkt, uint32_t seq, uint32_t ack, int hdlen, in # filter-rewriter.c colo_filter_rewriter_pkt_info(const char *func, const char *src, const char *dst, uint32_t seq, uint32_t ack, uint32_t flag) "%s: src/dst: %s/%s p: seq/ack=%u/%u flags=0x%x" colo_filter_rewriter_conn_offset(uint32_t offset) ": offset=%u" + +# vhost-vdpa.c +vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) "vhost_vdpa: %p vq_group: %u asid: %u" diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 4168cad..48a5608 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -29,6 +29,7 @@ #include "migration/migration.h" #include "migration/misc.h" #include "hw/virtio/vhost.h" +#include "trace.h" /* Todo:need to add the multiqueue support here */ typedef struct VhostVDPAState { @@ -440,6 +441,8 @@ static int vhost_vdpa_set_address_space_id(struct vhost_vdpa *v, }; int r; +trace_vhost_vdpa_set_address_space_id(v, vq_group, asid_num); + r = ioctl(v->shared->device_fd, VHOST_VDPA_SET_GROUP_ASID, &asid); if (unlikely(r < 0)) { error_report("Can't set vq group %u asid %u, errno=%d (%s)", -- 1.8.3.1
[PATCH 03/12] vdpa: factor out vhost_vdpa_last_dev
Generalize duplicated condition check for the last vq of vdpa device to a common function. Reviewed-by: Eugenio Pérez Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/virtio/vhost-vdpa.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index f7162da..1d3154a 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -551,6 +551,11 @@ static bool vhost_vdpa_first_dev(struct vhost_dev *dev) return v->index == 0; } +static bool vhost_vdpa_last_dev(struct vhost_dev *dev) +{ +return dev->vq_index + dev->nvqs == dev->vq_index_end; +} + static int vhost_vdpa_get_dev_features(struct vhost_dev *dev, uint64_t *features) { @@ -1317,7 +1322,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started) vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs); } -if (dev->vq_index + dev->nvqs != dev->vq_index_end) { +if (!vhost_vdpa_last_dev(dev)) { return 0; } @@ -1347,7 +1352,7 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev) { struct vhost_vdpa *v = dev->opaque; -if (dev->vq_index + dev->nvqs != dev->vq_index_end) { +if (!vhost_vdpa_last_dev(dev)) { return; } -- 1.8.3.1
[PATCH 06/12] vdpa: add vhost_vdpa_get_vring_base trace for svq mode
For better debuggability and observability. Reviewed-by: Eugenio Pérez Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/virtio/trace-events | 2 +- hw/virtio/vhost-vdpa.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events index 77905d1..28d6d78 100644 --- a/hw/virtio/trace-events +++ b/hw/virtio/trace-events @@ -58,7 +58,7 @@ vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int r vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" log_guest_addr: 0x%"PRIx64 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) "dev: %p index: %u num: %u" vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) "dev: %p index: %u num: %u" -vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num) "dev: %p index: %u num: %u" +vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, bool svq) "dev: %p index: %u num: %u svq: %d" vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p index: %u fd: %d" vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index: %u fd: %d" vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64 diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 1d3154a..0de7bdf 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -1424,6 +1424,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev, if (v->shadow_vqs_enabled) { ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index); +trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, true); return 0; } @@ -1436,7 +1437,7 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev *dev, } ret = vhost_vdpa_call(dev, VHOST_GET_VRING_BASE, ring); -trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num); +trace_vhost_vdpa_get_vring_base(dev, ring->index, ring->num, false); return ret; } -- 1.8.3.1
[PATCH 09/12] vdpa: add trace event for vhost_vdpa_net_load_mq
For better debuggability and observability. Reviewed-by: Eugenio Pérez Signed-off-by: Si-Wei Liu --- net/trace-events | 1 + net/vhost-vdpa.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/net/trace-events b/net/trace-events index 88f56f2..cda960f 100644 --- a/net/trace-events +++ b/net/trace-events @@ -28,3 +28,4 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": offset=%u" vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) "vhost_vdpa: %p vq_group: %u asid: %u" vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d" vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) "vdpa state: %p class: %u cmd: %u retval: %d" +vhost_vdpa_net_load_mq(void *s, int ncurqps) "vdpa state: %p current_qpairs: %d" diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 6ee438f..9f25221 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -901,6 +901,8 @@ static int vhost_vdpa_net_load_mq(VhostVDPAState *s, return 0; } +trace_vhost_vdpa_net_load_mq(s, n->curr_queue_pairs); + mq.virtqueue_pairs = cpu_to_le16(n->curr_queue_pairs); const struct iovec data = { .iov_base = &mq, -- 1.8.3.1
[PATCH 01/12] vdpa: add back vhost_vdpa_net_first_nc_vdpa
Previous commits had it removed. Now adding it back because this function will be needed by future patches. Reviewed-by: Eugenio Pérez Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 46e350a..4479ffa 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -280,6 +280,16 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, return size; } + +/** From any vdpa net client, get the netclient of the first queue pair */ +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +NICState *nic = qemu_get_nic(s->nc.peer); +NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); + +return DO_UPCAST(VhostVDPAState, nc, nc0); +} + static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) { struct vhost_vdpa *v = &s->vhost_vdpa; @@ -492,7 +502,7 @@ dma_map_err: static int vhost_vdpa_net_cvq_start(NetClientState *nc) { -VhostVDPAState *s; +VhostVDPAState *s, *s0; struct vhost_vdpa *v; int64_t cvq_group; int r; @@ -503,7 +513,8 @@ static int vhost_vdpa_net_cvq_start(NetClientState *nc) s = DO_UPCAST(VhostVDPAState, nc, nc); v = &s->vhost_vdpa; -v->shadow_vqs_enabled = v->shared->shadow_data; +s0 = vhost_vdpa_net_first_nc_vdpa(s); +v->shadow_vqs_enabled = s0->vhost_vdpa.shadow_vqs_enabled; s->vhost_vdpa.address_space_id = VHOST_VDPA_GUEST_PA_ASID; if (v->shared->shadow_data) { -- 1.8.3.1
[PATCH 10/12] vdpa: define SVQ transitioning state for mode switching
Will be used in following patches. DISABLING(-1) means SVQ is being switched off to passthrough mode. ENABLING(1) means passthrough VQs are being switched to SVQ. DONE(0) means SVQ switching is completed. Signed-off-by: Si-Wei Liu --- include/hw/virtio/vhost-vdpa.h | 9 + 1 file changed, 9 insertions(+) diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h index ad754eb..449bf5c 100644 --- a/include/hw/virtio/vhost-vdpa.h +++ b/include/hw/virtio/vhost-vdpa.h @@ -30,6 +30,12 @@ typedef struct VhostVDPAHostNotifier { void *addr; } VhostVDPAHostNotifier; +typedef enum SVQTransitionState { +SVQ_TSTATE_DISABLING = -1, +SVQ_TSTATE_DONE, +SVQ_TSTATE_ENABLING +} SVQTransitionState; + /* Info shared by all vhost_vdpa device models */ typedef struct vhost_vdpa_shared { int device_fd; @@ -67,6 +73,9 @@ typedef struct vhost_vdpa_shared { /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */ bool shadow_data; + +/* SVQ switching is in progress, or already completed? */ +SVQTransitionState svq_switching; } VhostVDPAShared; typedef struct vhost_vdpa { -- 1.8.3.1
[PATCH 08/12] vdpa: add trace events for vhost_vdpa_net_load_cmd
For better debuggability and observability. Reviewed-by: Eugenio Pérez Signed-off-by: Si-Wei Liu --- net/trace-events | 2 ++ net/vhost-vdpa.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/net/trace-events b/net/trace-events index aab666a..88f56f2 100644 --- a/net/trace-events +++ b/net/trace-events @@ -26,3 +26,5 @@ colo_filter_rewriter_conn_offset(uint32_t offset) ": offset=%u" # vhost-vdpa.c vhost_vdpa_set_address_space_id(void *v, unsigned vq_group, unsigned asid_num) "vhost_vdpa: %p vq_group: %u asid: %u" +vhost_vdpa_net_load_cmd(void *s, uint8_t class, uint8_t cmd, int data_num, int data_size) "vdpa state: %p class: %u cmd: %u sg_num: %d size: %d" +vhost_vdpa_net_load_cmd_retval(void *s, uint8_t class, uint8_t cmd, int r) "vdpa state: %p class: %u cmd: %u retval: %d" diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 48a5608..6ee438f 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -677,6 +677,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, assert(data_size < vhost_vdpa_net_cvq_cmd_page_len() - sizeof(ctrl)); cmd_size = sizeof(ctrl) + data_size; +trace_vhost_vdpa_net_load_cmd(s, class, cmd, data_num, data_size); if (vhost_svq_available_slots(svq) < 2 || iov_size(out_cursor, 1) < cmd_size) { /* @@ -708,6 +709,7 @@ static ssize_t vhost_vdpa_net_load_cmd(VhostVDPAState *s, r = vhost_vdpa_net_cvq_add(s, &out, 1, &in, 1); if (unlikely(r < 0)) { +trace_vhost_vdpa_net_load_cmd_retval(s, class, cmd, r); return r; } -- 1.8.3.1
[PATCH 04/12] vdpa: factor out vhost_vdpa_net_get_nc_vdpa
Introduce new API. No functional change on existing API. Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 06c83b4..4168cad 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -281,13 +281,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, } -/** From any vdpa net client, get the netclient of the first queue pair */ -static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +/** From any vdpa net client, get the netclient of the i-th queue pair */ +static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i) { NICState *nic = qemu_get_nic(s->nc.peer); -NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); +NetClientState *nc_i = qemu_get_peer(nic->ncs, i); + +return DO_UPCAST(VhostVDPAState, nc, nc_i); +} -return DO_UPCAST(VhostVDPAState, nc, nc0); +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +return vhost_vdpa_net_get_nc_vdpa(s, 0); } static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) -- 1.8.3.1
[PATCH 07/12] vdpa: add vhost_vdpa_set_dev_vring_base trace for svq mode
For better debuggability and observability. Reviewed-by: Eugenio Pérez Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/virtio/trace-events | 2 +- hw/virtio/vhost-vdpa.c | 5 - 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events index 28d6d78..20577aa 100644 --- a/hw/virtio/trace-events +++ b/hw/virtio/trace-events @@ -57,7 +57,7 @@ vhost_vdpa_dev_start(void *dev, bool started) "dev: %p started: %d" vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int refcnt, int fd, void *log) "dev: %p base: 0x%"PRIx64" size: %llu refcnt: %d fd: %d log: %p" vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" log_guest_addr: 0x%"PRIx64 vhost_vdpa_set_vring_num(void *dev, unsigned int index, unsigned int num) "dev: %p index: %u num: %u" -vhost_vdpa_set_vring_base(void *dev, unsigned int index, unsigned int num) "dev: %p index: %u num: %u" +vhost_vdpa_set_dev_vring_base(void *dev, unsigned int index, unsigned int num, bool svq) "dev: %p index: %u num: %u svq: %d" vhost_vdpa_get_vring_base(void *dev, unsigned int index, unsigned int num, bool svq) "dev: %p index: %u num: %u svq: %d" vhost_vdpa_set_vring_kick(void *dev, unsigned int index, int fd) "dev: %p index: %u fd: %d" vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index: %u fd: %d" diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 0de7bdf..004110f 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -972,7 +972,10 @@ static int vhost_vdpa_get_config(struct vhost_dev *dev, uint8_t *config, static int vhost_vdpa_set_dev_vring_base(struct vhost_dev *dev, struct vhost_vring_state *ring) { -trace_vhost_vdpa_set_vring_base(dev, ring->index, ring->num); +struct vhost_vdpa *v = dev->opaque; + +trace_vhost_vdpa_set_dev_vring_base(dev, ring->index, ring->num, +v->shadow_vqs_enabled); return vhost_vdpa_call(dev, VHOST_SET_VRING_BASE, ring); } -- 1.8.3.1
[PATCH v2 1/2] vhost: dirty log should be per backend type
There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- hw/virtio/vhost.c | 49 + 1 file changed, 37 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..ef6d9b5 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,8 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +assert(dev->vhost_ops->backend_type == backend_type || r < 0); + return r; } @@ -319,16 +321,23 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share) return log; } -static struct vhost_log *vhost_log_get(uint64_t size, bool share) +static struct vhost_log *vhost_log_get(VhostBackendType backend_type, + uint64_t size, bool share) { -struct vhost_log *log = share ? vhost_log_shm : vhost_log; +struct vhost_log *log; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) +return NULL; + +log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type]; if (!log || log->size != size) { log = vhost_log_alloc(size, share); if (share) { -vhost_log_shm = log; +vhost_log_shm[backend_type] = log; } else { -vhost_log = log; +vhost_log[backend_type] = log; } } else { ++log->refcnt; @@ -340,11 +349,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, bool share) static void vhost_log_put(struct vhost_dev *dev, bool sync) { struct vhost_log *log = dev->log; +VhostBackendType backend_type; if (!log) { return; } +assert(dev->vhost_ops); +backend_type = dev->vhost_ops->backend_type; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) { +return; +} + --log->refcnt; if (log->refcnt == 0) { /* Sync only the range covered by the old log */ @@ -352,13 +370,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool sync) vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1); } -if (vhost_log == log) { +if (vhost_log[backend_type] == log) { g_free(log->log); -vhost_log = NULL; -} else if (vhost_log_shm == log) { +vhost_log[backend_type] = NULL; +} else if (vhost_log_shm[backend_type] == log) { qemu_memfd_free(log->log, log->size * sizeof(*(log->log)), log->fd); -vhost_log_shm = NULL; +vhost_log_shm[backend_type] = NULL; } g_free(log); @@ -376,7 +394,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev) static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size) { -struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev)); +struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type, + size, vhost_dev_log_is_shared(dev)); uint64_t log_base = (uintptr_t)log->log; int r; @@ -2037,8 +2056,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) uint64_t log_base; hdev->log_size = vhost_get_log_size(hdev); -hdev->log = vhost_log_get(hdev->log_size, +hdev->log = vhost_log_get(hdev->vhost_ops->backend_type, + hdev->log_size, vhost_dev_log_is_shared(hdev)); +if (!hdev->log) { 
+VHOST_OPS_DEBUG(r, "vhost_log_get failed"); +goto fail_vq; +} + log_base = (uintptr_t)hdev->log->log; r = hdev->vhost_ops->vhost_set_log_base(hdev, hdev->log_size ? log_base : 0, -- 1.8.3.1
[PATCH v2 2/2] vhost: Perform memory section dirty scans once per iteration
On setups with one or more virtio-net devices with vhost on, the cost of each dirty tracking iteration grows with the number of queues set up, e.g. on idle guest migration the following is observed with virtio-net with vhost=on:

48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13
8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13
1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13
2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14

With high memory update rates the symptom is lack of convergence as soon as there is a vhost device with a sufficiently high number of queues, or a sufficient number of vhost devices. On every migration iteration (every 100msecs) the *shared log* is redundantly queried once per vhost-configured queue that exists in the guest. For the virtqueue data this is necessary, but not for the memory sections, which are the same across queues. So essentially we end up scanning the dirty log too often.

To fix that, select one vhost device to be responsible for scanning the log with regards to memory section dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of the vhost devices.

Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- hw/virtio/vhost.c | 75 +++ include/hw/virtio/vhost.h | 1 + 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index ef6d9b5..997d560 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,9 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_dev *vhost_mem_logger[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_mlog_devices = +QLIST_HEAD_INITIALIZER(vhost_mlog_devices); /* Memslots used by backends that support private memslots (without an fd).
*/ static unsigned int used_memslots; @@ -149,6 +152,53 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static bool vhost_log_dev_enabled(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == vhost_mem_logger[dev->vhost_ops->backend_type]; +} + +static void vhost_mlog_set_dev(struct vhost_dev *hdev, bool enable) +{ +struct vhost_dev *logdev = NULL; +VhostBackendType backend_type; +bool reelect = false; + +assert(hdev->vhost_ops); +assert(hdev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(hdev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +backend_type = hdev->vhost_ops->backend_type; + +if (enable && !QLIST_IS_INSERTED(hdev, logdev_entry)) { +reelect = !vhost_mem_logger[backend_type]; +QLIST_INSERT_HEAD(&vhost_mlog_devices, hdev, logdev_entry); +} else if (!enable && QLIST_IS_INSERTED(hdev, logdev_entry)) { +reelect = vhost_mem_logger[backend_type] == hdev; +QLIST_REMOVE(hdev, logdev_entry); +} + +if (!reelect) +return; + +QLIST_FOREACH(hdev, &vhost_mlog_devices, logdev_entry) { +if (!hdev->vhost_ops || +hdev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_NONE || +hdev->vhost_ops->backend_type >= VHOST_BACKEND_TYPE_MAX) +continue; + +if (hdev->vhost_ops->backend_type == backend_type) { +logdev = hdev; +break; +} +} + +vhost_mem_logger[backend_type] = logdev; +} + static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, MemoryRegionSection *section, hwaddr first, @@ -166,12 +216,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, start_addr = MAX(first, start_addr); end_addr = MIN(last, end_addr); -for (i = 0; i < dev->mem->nregions; ++i) { -struct vhost_memory_region *reg = dev->mem->regions + i; -vhost_dev_sync_region(dev, section, start_addr, end_addr, - reg->guest_phys_addr, - range_get_last(reg->guest_phys_addr, - reg->memory_size)); +if (vhost_log_dev_enabled(dev)) { +for (i = 0; i < dev->mem->nregions; ++i) { +struct vhost_memory_region *reg = dev->mem->regions + i; +vhost_dev_sync_region(dev, section, start_addr, end_addr, + reg->guest_phys_addr, + range_get_last(reg->guest_phys_addr, +
Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
Hi Michael, On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote: On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote: Hi Eugenio, I thought this new code looks good to me and the original issue I saw with x-svq=on should be gone. However, after rebase my tree on top of this, there's a new failure I found around setting up guest mappings at early boot, please see attached the specific QEMU config and corresponding event traces. Haven't checked into the detail yet, thinking you would need to be aware of ahead. Regards, -Siwei Eugenio were you able to reproduce? Siwei did you have time to look into this? Didn't get a chance to look into the detail yet in the past week, but thought it may have something to do with the (internals of) iova tree range allocation and the lookup routine. It started to fall apart at the first vhost_vdpa_dma_unmap call showing up in the trace events, where it should've gotten IOVA=0x201000, but an incorrect IOVA address 0x1000 ended up being returned from the iova tree lookup routine.

HVA                         GPA                  IOVA
------------------------------------------------------------------------
Map
[0x7f7903e0, 0x7f7983e0)    [0x0, 0x8000)        [0x1000, 0x8000)
[0x7f7983e0, 0x7f9903e0)    [0x1, 0x208000)      [0x80001000, 0x201000)
[0x7f7903ea, 0x7f7903ec)    [0xfeda, 0xfedc)     [0x201000, 0x221000)

Unmap
[0x7f7903ea, 0x7f7903ec)    [0xfeda, 0xfedc)     [0x1000, 0x2)   ??? shouldn't it be [0x201000, 0x221000) ???

PS, I will be taking off from today and for the next two weeks. Will try to help out looking more closely after I get back. -Siwei Can't merge patches which are known to break things ...
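To spell out what the trace above suggests should happen, here is a tiny, self-contained sketch (in C) of the GPA-to-IOVA rebasing the unmap path is expected to perform: find the mapping whose GPA range covers the request and offset into that mapping's IOVA base. The struct is a simplified stand-in for QEMU's DMAMap/VhostIOVATree, purely to illustrate why the third mapping above should translate back into the 0x201000 IOVA range rather than 0x1000.

#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for one IOVA tree entry (GPA base, IOVA base, length). */
struct map_entry {
    uint64_t gpa;
    uint64_t iova;
    uint64_t size;
};

/* Translate a GPA back to its IOVA, or return UINT64_MAX if it is unmapped. */
static uint64_t gpa_to_iova(const struct map_entry *maps, size_t n, uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (gpa >= maps[i].gpa && gpa - maps[i].gpa < maps[i].size) {
            return maps[i].iova + (gpa - maps[i].gpa);
        }
    }
    return UINT64_MAX;   /* not covered by any existing mapping */
}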
Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
Hi Eugenio, Just to answer the question you had in the sync meeting as I've just tried, it seems that the issue is also reproducible even with VGA device and VNC display removed, and also reproducible with 8G mem size. You already knew that I can only repro with x-svq=on. Regards, -Siwei On 2/13/2024 8:26 AM, Eugenio Perez Martin wrote: On Tue, Feb 13, 2024 at 11:22 AM Michael S. Tsirkin wrote: On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote: Hi Eugenio, I thought this new code looks good to me and the original issue I saw with x-svq=on should be gone. However, after rebase my tree on top of this, there's a new failure I found around setting up guest mappings at early boot, please see attached the specific QEMU config and corresponding event traces. Haven't checked into the detail yet, thinking you would need to be aware of ahead. Regards, -Siwei Eugenio were you able to reproduce? Siwei did you have time to look into this? Can't merge patches which are known to break things ... Sorry for the lack of news, I'll try to reproduce this week. Meanwhile this patch should not be merged, as you mention. Thanks!
Re: [PATCH v2 1/2] vhost: dirty log should be per backend type
Hi Michael, I'm taking off for 2+ weeks, but please feel free to provide comment and feedback while I'm off. I'll be checking emails still, and am about to address any opens as soon as I am back. Thanks, -Siwei On 2/14/2024 3:50 AM, Si-Wei Liu wrote: There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- hw/virtio/vhost.c | 49 + 1 file changed, 37 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..ef6d9b5 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,8 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +assert(dev->vhost_ops->backend_type == backend_type || r < 0); + return r; } @@ -319,16 +321,23 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share) return log; } -static struct vhost_log *vhost_log_get(uint64_t size, bool share) +static struct vhost_log *vhost_log_get(VhostBackendType backend_type, + uint64_t size, bool share) { -struct vhost_log *log = share ? vhost_log_shm : vhost_log; +struct vhost_log *log; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) +return NULL; + +log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type]; if (!log || log->size != size) { log = vhost_log_alloc(size, share); if (share) { -vhost_log_shm = log; +vhost_log_shm[backend_type] = log; } else { -vhost_log = log; +vhost_log[backend_type] = log; } } else { ++log->refcnt; @@ -340,11 +349,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, bool share) static void vhost_log_put(struct vhost_dev *dev, bool sync) { struct vhost_log *log = dev->log; +VhostBackendType backend_type; if (!log) { return; } +assert(dev->vhost_ops); +backend_type = dev->vhost_ops->backend_type; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) { +return; +} + --log->refcnt; if (log->refcnt == 0) { /* Sync only the range covered by the old log */ @@ -352,13 +370,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool sync) vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1); } -if (vhost_log == log) { +if (vhost_log[backend_type] == log) { g_free(log->log); -vhost_log = NULL; -} else if (vhost_log_shm == log) { +vhost_log[backend_type] = NULL; +} else if (vhost_log_shm[backend_type] == log) { qemu_memfd_free(log->log, log->size * sizeof(*(log->log)), log->fd); -vhost_log_shm = NULL; +vhost_log_shm[backend_type] = NULL; } g_free(log); @@ -376,7 +394,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev) static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size) { -struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev)); +struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type, + size, vhost_dev_log_is_shared(dev)); uint64_t log_base = (uintptr_t)log->log; int r; @@ -2037,8 +2056,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool 
vrings) uint64_t log_base; hdev->log_size = vhost_get_log_size(hdev); -hdev->log = vhost_log_get(hdev->log_size, +hdev->log = vhost_log_get(hdev->vhost_ops->backend_type, + hdev->log_size, vhost_dev_log_is_shared(hdev)); +if (!hdev->log) { +VHOST_OPS_DEBUG(r, "vhost_log_get failed"); +goto fail_vq; +} + log_base = (uintptr_t)hdev->log->log; r = hdev->vhost_ops->vhost_set_log_base(hdev, hdev->log_size ? log_base : 0,
Re: [PATCH 04/12] vdpa: factor out vhost_vdpa_net_get_nc_vdpa
On 2/14/2024 10:54 AM, Eugenio Perez Martin wrote: On Wed, Feb 14, 2024 at 1:39 PM Si-Wei Liu wrote: Introduce new API. No functional change on existing API. Acked-by: Jason Wang Signed-off-by: Si-Wei Liu I'm ok with the new function, but doesn't the compiler complain because adding a static function is not used? Hmmm, which one? vhost_vdpa_net_get_nc_vdpa is used by vhost_vdpa_net_first_nc_vdpa internally, and vhost_vdpa_net_first_nc_vdpa is used by vhost_vdpa_net_cvq_start (Patch 01). I think we should be fine? -Siwei --- net/vhost-vdpa.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 06c83b4..4168cad 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -281,13 +281,18 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, } -/** From any vdpa net client, get the netclient of the first queue pair */ -static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +/** From any vdpa net client, get the netclient of the i-th queue pair */ +static VhostVDPAState *vhost_vdpa_net_get_nc_vdpa(VhostVDPAState *s, int i) { NICState *nic = qemu_get_nic(s->nc.peer); -NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); +NetClientState *nc_i = qemu_get_peer(nic->ncs, i); + +return DO_UPCAST(VhostVDPAState, nc, nc_i); +} -return DO_UPCAST(VhostVDPAState, nc, nc0); +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +return vhost_vdpa_net_get_nc_vdpa(s, 0); } static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) -- 1.8.3.1
Re: [PATCH 1/6] vdpa: check for iova tree initialized at net_client_start
Hi Eugenio, Maybe there's some patch missing, but I saw this core dump when x-svq=on is specified while waiting for the incoming migration on the destination host:

(gdb) bt
#0  0x5643b24cc13c in vhost_iova_tree_map_alloc (tree=0x0, map=map@entry=0x7ffd58c54830) at ../hw/virtio/vhost-iova-tree.c:89
#1  0x5643b234f193 in vhost_vdpa_listener_region_add (listener=0x5643b4403fd8, section=0x7ffd58c548d0) at /home/opc/qemu-upstream/include/qemu/int128.h:34
#2  0x5643b24e6a61 in address_space_update_topology_pass (as=as@entry=0x5643b35a3840 , old_view=old_view@entry=0x5643b442b5f0, new_view=new_view@entry=0x5643b44a2130, adding=adding@entry=true) at ../system/memory.c:1004
#3  0x5643b24e6e60 in address_space_set_flatview (as=0x5643b35a3840 ) at ../system/memory.c:1080
#4  0x5643b24ea750 in memory_region_transaction_commit () at ../system/memory.c:1132
#5  0x5643b24ea750 in memory_region_transaction_commit () at ../system/memory.c:1117
#6  0x5643b241f4c1 in pc_memory_init (pcms=pcms@entry=0x5643b43c8400, system_memory=system_memory@entry=0x5643b43d18b0, rom_memory=rom_memory@entry=0x5643b449a960, pci_hole64_size=<optimized out>) at ../hw/i386/pc.c:954
#7  0x5643b240d088 in pc_q35_init (machine=0x5643b43c8400) at ../hw/i386/pc_q35.c:222
#8  0x5643b21e1da8 in machine_run_board_init (machine=<optimized out>, mem_path=, errp=, errp@entry=0x5643b35b7958 ) at ../hw/core/machine.c:1509
#9  0x5643b237c0f6 in qmp_x_exit_preconfig () at ../system/vl.c:2613
#10 0x5643b237c0f6 in qmp_x_exit_preconfig (errp=) at ../system/vl.c:2704
#11 0x5643b237fcdd in qemu_init (errp=) at ../system/vl.c:3753
#12 0x5643b237fcdd in qemu_init (argc=, argv=) at ../system/vl.c:3753
#13 0x5643b2158249 in main (argc=, argv=<optimized out>) at ../system/main.c:47

Shall we create the iova tree early during vdpa dev init for the x-svq=on case?

+ if (s->always_svq) {
+     /* iova tree is needed because of SVQ */
+     shared->iova_tree = vhost_iova_tree_new(shared->iova_range.first,
+                                             shared->iova_range.last);
+ }
+

Regards, -Siwei

On 1/11/2024 11:02 AM, Eugenio Pérez wrote: To map the guest memory while it is migrating we need to create the iova_tree, as long as the destination uses x-svq=on. Checking to not override it. The function vhost_vdpa_net_client_stop clear it if the device is stopped. If the guest starts the device again, the iova tree is recreated by vhost_vdpa_net_data_start_first or vhost_vdpa_net_cvq_start if needed, so old behavior is kept. Signed-off-by: Eugenio Pérez --- net/vhost-vdpa.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 3726ee5d67..e11b390466 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -341,7 +341,9 @@ static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) migration_add_notifier(&s->migration_state, vdpa_net_migration_state_notifier); -if (v->shadow_vqs_enabled) { + +/* iova_tree may be initialized by vhost_vdpa_net_load_setup */ +if (v->shadow_vqs_enabled && !v->shared->iova_tree) { v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first, v->shared->iova_range.last); }
Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
Hi Eugenio, I thought this new code looks good to me and the original issue I saw with x-svq=on should be gone. However, after rebase my tree on top of this, there's a new failure I found around setting up guest mappings at early boot, please see attached the specific QEMU config and corresponding event traces. Haven't checked into the detail yet, thinking you would need to be aware of ahead. Regards, -Siwei On 2/1/2024 10:09 AM, Eugenio Pérez wrote: As we are moving to keep the mapping through all the vdpa device life instead of resetting it at VirtIO reset, we need to move all its dependencies to the initialization too. In particular devices with x-svq=on need a valid iova_tree from the beginning. Simplify the code also consolidating the two creation points: the first data vq in case of SVQ active and CVQ start in case only CVQ uses it. Suggested-by: Si-Wei Liu Signed-off-by: Eugenio Pérez --- include/hw/virtio/vhost-vdpa.h | 16 ++- net/vhost-vdpa.c | 36 +++--- 2 files changed, 18 insertions(+), 34 deletions(-) diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h index 03ed2f2be3..ad754eb803 100644 --- a/include/hw/virtio/vhost-vdpa.h +++ b/include/hw/virtio/vhost-vdpa.h @@ -37,7 +37,21 @@ typedef struct vhost_vdpa_shared { struct vhost_vdpa_iova_range iova_range; QLIST_HEAD(, vdpa_iommu) iommu_list; -/* IOVA mapping used by the Shadow Virtqueue */ +/* + * IOVA mapping used by the Shadow Virtqueue + * + * It is shared among all ASID for simplicity, whether CVQ shares ASID with + * guest or not: + * - Memory listener need access to guest's memory addresses allocated in + * the IOVA tree. + * - There should be plenty of IOVA address space for both ASID not to + * worry about collisions between them. Guest's translations are still + * validated with virtio virtqueue_pop so there is no risk for the guest + * to access memory that it shouldn't. + * + * To allocate a iova tree per ASID is doable but it complicates the code + * and it is not worth it for the moment. 
+ */ VhostIOVATree *iova_tree; /* Copy of backend features */ diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index cc589dd148..57edcf34d0 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -232,6 +232,7 @@ static void vhost_vdpa_cleanup(NetClientState *nc) return; } qemu_close(s->vhost_vdpa.shared->device_fd); +g_clear_pointer(&s->vhost_vdpa.shared->iova_tree, vhost_iova_tree_delete); g_free(s->vhost_vdpa.shared); } @@ -329,16 +330,8 @@ static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) { -struct vhost_vdpa *v = &s->vhost_vdpa; - migration_add_notifier(&s->migration_state, vdpa_net_migration_state_notifier); - -/* iova_tree may be initialized by vhost_vdpa_net_load_setup */ -if (v->shadow_vqs_enabled && !v->shared->iova_tree) { -v->shared->iova_tree = vhost_iova_tree_new(v->shared->iova_range.first, - v->shared->iova_range.last); -} } static int vhost_vdpa_net_data_start(NetClientState *nc) @@ -383,19 +376,12 @@ static int vhost_vdpa_net_data_load(NetClientState *nc) static void vhost_vdpa_net_client_stop(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); -struct vhost_dev *dev; assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); if (s->vhost_vdpa.index == 0) { migration_remove_notifier(&s->migration_state); } - -dev = s->vhost_vdpa.dev; -if (dev->vq_index + dev->nvqs == dev->vq_index_end) { -g_clear_pointer(&s->vhost_vdpa.shared->iova_tree, -vhost_iova_tree_delete); -} } static NetClientInfo net_vhost_vdpa_info = { @@ -557,24 +543,6 @@ out: return 0; } -/* - * If other vhost_vdpa already have an iova_tree, reuse it for simplicity, - * whether CVQ shares ASID with guest or not, because: - * - Memory listener need access to guest's memory addresses allocated in - * the IOVA tree. - * - There should be plenty of IOVA address space for both ASID not to - * worry about collisions between them. Guest's translations are still - * validated with virtio virtqueue_pop so there is no risk for the guest - * to access memory that it shouldn't. - * - * To allocate a iova tree per ASID is doable but it complicates the code - * and it is not worth it for the moment. - */ -if (!v->shared->iova_tree) { -v->shared->
Re: [PATCH v4 1/2] vhost: dirty log should be per backend type
On 3/19/2024 8:25 PM, Jason Wang wrote: On Tue, Mar 19, 2024 at 6:06 AM Si-Wei Liu wrote: On 3/17/2024 8:20 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:33 AM Si-Wei Liu wrote: On 3/14/2024 8:50 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. It's better to describe what's the advantage of doing this. Yes, I can add that to the log. Although it's a niche use case, it was actually a long standing limitation / bug that vhost-user and vhost-kernel loggers can't co-exist per QEMU process, but today it's just silent failure that may be ended up with. This bug fix removes that implicit limitation in the code. Ok. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- v3->v4: - remove checking NULL return value from vhost_log_get v2->v3: - remove non-effective assertion that never be reached - do not return NULL from vhost_log_get() - add neccessary assertions to vhost_log_get() --- hw/virtio/vhost.c | 45 + 1 file changed, 33 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..612f4db 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +if (r == 0) { +assert(dev->vhost_ops->backend_type == backend_type); +} + Under which condition could we hit this? Just in case some other function inadvertently corrupted this earlier, we have to capture discrepancy in the first place... On the other hand, it will be helpful for other vhost backend writers to diagnose day-one bug in the code. I feel just code comment here will not be sufficient/helpful. See below. It seems not good to assert a local logic. It seems to me quite a few local asserts are in the same file already, vhost_save_backend_state, For example it has assert for assert(!dev->started); which is not the logic of the function itself but require vhost_dev_start() not to be called before. But it looks like this patch you assert the code just a few lines above the assert itself? Yes, that was the intent - for e.g. xxx_ops may contain corrupted xxx_ops.backend_type already before coming to this vhost_set_backend_type() function. And we may capture this corrupted state by asserting the expected xxx_ops.backend_type (to be consistent with the backend_type passed in), This can happen for all variables. Not sure why backend_ops is special. The assert is just checking the backend_type field only. The other op fields in backend_ops have similar assert within the op function itself also. For e.g. vhost_user_requires_shm_log() and a lot of other vhost_user ops have the following: assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER); vhost_vdpa_vq_get_addr() and a lot of other vhost_vdpa ops have: assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_VDPA); vhost_kernel ops has similar assertions as well. 
The reason why it has to be checked here is that the callers of vhost_log_get() now pass dev->vhost_ops->backend_type into the API, and they are unable to verify the validity of that backend_type themselves. vhost_log_get() has the necessary asserts to bound-check the vhost_log[] and vhost_log_shm[] arrays, but a specific assert against the exact backend type in vhost_set_backend_type() further hardens the implementation in vhost_log_get() and the other backend ops, and that check needs to be done in the first place where such a discrepancy can be detected. In practice I think there should be no harm in adding this assert, and it adds a warranted guarantee to the current code. For example, such corruption can happen after the assert(), so a TOCTOU issue. Sure, it's best effort only. As pointed out earlier, I think together with this, the other similar asserts already present in various backend ops could be helpful to nail down the earliest point, or at least a specific range, where things may go wrong in the first place. Thanks, -Siwei Thanks Regards, -Siwei dev->vhost_ops = &xxx_ops; ... assert(dev->vhost_ops->backend_type == backend_type) ? Thanks vhost_load_backend_state, vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why local assert a problem? Thanks, -Siwei Thanks
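For reference, the per-op assertions mentioned above follow the pattern sketched here; this is representative only, the real functions live in hw/virtio/vhost-user.c and hw/virtio/vhost-vdpa.c and their bodies may differ:

/* Each backend op self-checks that it is invoked for the right backend type,
 * so a wrong ops table assigned to a vhost_dev trips on first use. */
static bool vhost_user_requires_shm_log(struct vhost_dev *dev)
{
    assert(dev->vhost_ops->backend_type == VHOST_BACKEND_TYPE_USER);
    return virtio_has_feature(dev->protocol_features,
                              VHOST_USER_PROTOCOL_F_LOG_SHMFD);
}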
Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/19/2024 8:27 PM, Jason Wang wrote: On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu wrote: On 3/17/2024 8:22 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu wrote: On 3/14/2024 9:03 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); A dumb question, why not simple check dev->log == vhost_log_shm[dev->vhost_ops->backend_type] Because we are not sure if the logger comes from vhost_log_shm[] or vhost_log[]. Don't want to complicate the check here by calling into vhost_dev_log_is_shared() everytime when the .log_sync() is called. It has very low overhead, isn't it? 
Whether this has low overhead depends on the specific backend's implementation of .vhost_requires_shm_log(), which the common vhost layer should not make assumptions about or rely on in its current form. static bool vhost_dev_log_is_shared(struct vhost_dev *dev) { return dev->vhost_ops->vhost_requires_shm_log && dev->vhost_ops->vhost_requires_shm_log(dev); } For example, if I understand the code correctly, the log type won't be changed during runtime, so we can end up with a boolean to record that instead of querying the ops? Right now the log type won't change during runtime, but I am not sure whether this may prohibit a future revisit to allow changing it at runtime; then there would be complex code involved to maintain that state. Other than this, I think it's insufficient to just check shm log vs. normal log. We need to identify a leading logger device that gets elected in vhost_dev_elect_mem_logger(): since all the dev->log pointers reference the same, reference-counted logger, we would have to add an extra field and complex logic to maintain the election
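For comparison, the alternative hinted at above (cache the log type once and compare dev->log against the global array) would look roughly like the sketch below; log_is_shm is a hypothetical cached field, not something in the posted patch:

static bool vhost_dev_should_log_alt(struct vhost_dev *dev)
{
    VhostBackendType bt = dev->vhost_ops->backend_type;
    struct vhost_log *expected = dev->log_is_shm ? vhost_log_shm[bt] : vhost_log[bt];

    /*
     * Limitation raised in the reply: every vhost_dev of the same backend
     * shares the same reference-counted vhost_log, so this comparison is true
     * for all of them and cannot by itself elect a single device to scan the
     * memory sections.
     */
    return dev->log == expected;
}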
Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/20/2024 8:56 PM, Jason Wang wrote: On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu wrote: On 3/19/2024 8:27 PM, Jason Wang wrote: On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu wrote: On 3/17/2024 8:22 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu wrote: On 3/14/2024 9:03 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); A dumb question, why not simple check dev->log == vhost_log_shm[dev->vhost_ops->backend_type] Because we are not sure if the logger comes from vhost_log_shm[] or vhost_log[]. 
Don't want to complicate the check here by calling into vhost_dev_log_is_shared() every time .log_sync() is called. It has very low overhead, doesn't it? Whether this has low overhead depends on the specific backend's implementation of .vhost_requires_shm_log(), which the common vhost layer should not make assumptions about or rely on in its current form. static bool vhost_dev_log_is_shared(struct vhost_dev *dev) { return dev->vhost_ops->vhost_requires_shm_log && dev->vhost_ops->vhost_requires_shm_log(dev); } For example, if I understand the code correctly, the log type won't be changed during runtime, so we can end up with a boolean to record that instead of querying the ops? Right now the log type won't change during runtime, but I am not sure whether this may prohibit a future revisit to allow changing it at runtime, We can be bothered when we have such a request then. and then there would be complex code involved to maintain that state. Other than this, I think it's insufficient to just check shm log vs. normal log. We need to identify a leading logger device that gets elected in vhost_dev_elect_mem_logger(
Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/21/2024 10:08 PM, Jason Wang wrote: On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu wrote: On 3/20/2024 8:56 PM, Jason Wang wrote: On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu wrote: On 3/19/2024 8:27 PM, Jason Wang wrote: On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu wrote: On 3/17/2024 8:22 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu wrote: On 3/14/2024 9:03 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). 
*/ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); A dumb question, why not simple check dev->log == vhost_log_shm[dev->vhost_ops->backend_type] Because we are not sure if the logger comes from vhost_log_shm[] or vhost_log[]. Don't want to complicate the check here by calling into vhost_dev_log_is_shared() everytime when the .log_sync() is called. It has very low overhead, isn't it? Whether this has low overhead will have to depend on the specific backend's implementation for .vhost_requires_shm_log(), which the common vhost layer should not assume upon or rely on the current implementation. static bool vhost_dev_log_is_shared(struct vhost_dev *dev) { return dev->vhost_ops->vhost_requires_shm_log && dev->vhost_ops->vhost_requires_shm_log(dev); } For example, if I understand the code correctly, the log type won't be changed during runtime, so we can endup with a boolean to record that instead of a query ops? Right now the log type won't change during runtime, but I am not sure if this may prohibit future revisit to allow change at the runtime, We can be bothered when we have such a request then. then there'll be complex code involvled to maintain the state. Other than this, I think it's insufficient to just check the shm log v.s. normal lo
Re: [External] : Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/24/2024 11:13 PM, Jason Wang wrote: On Sat, Mar 23, 2024 at 5:14 AM Si-Wei Liu wrote: On 3/21/2024 10:08 PM, Jason Wang wrote: On Fri, Mar 22, 2024 at 5:43 AM Si-Wei Liu wrote: On 3/20/2024 8:56 PM, Jason Wang wrote: On Thu, Mar 21, 2024 at 5:03 AM Si-Wei Liu wrote: On 3/19/2024 8:27 PM, Jason Wang wrote: On Tue, Mar 19, 2024 at 6:16 AM Si-Wei Liu wrote: On 3/17/2024 8:22 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu wrote: On 3/14/2024 9:03 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). 
*/ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); A dumb question, why not simple check dev->log == vhost_log_shm[dev->vhost_ops->backend_type] Because we are not sure if the logger comes from vhost_log_shm[] or vhost_log[]. Don't want to complicate the check here by calling into vhost_dev_log_is_shared() everytime when the .log_sync() is called. It has very low overhead, isn't it? Whether this has low overhead will have to depend on the specific backend's implementation for .vhost_requires_shm_log(), which the common vhost layer should not assume upon or rely on the current implementation. static bool vhost_dev_log_is_shared(struct vhost_dev *dev) { return dev->vhost_ops->vhost_requires_shm_log && dev->vhost_ops->vhost_requires_shm_log(dev); } For example, if I understand the code correctly, the log type won't be changed during runtime, so we can endup with a boolean to record that instead of a query ops? Right now the log type won't change during runtime, but I am not sure if this may prohibit future revisit to allow change at the runtime, We can be bothered when we have such a request then. then there'll be complex code involvled
Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote: On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu wrote: Hi Michael, On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote: On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote: Hi Eugenio, I thought this new code looks good to me and the original issue I saw with x-svq=on should be gone. However, after rebase my tree on top of this, there's a new failure I found around setting up guest mappings at early boot, please see attached the specific QEMU config and corresponding event traces. Haven't checked into the detail yet, thinking you would need to be aware of ahead. Regards, -Siwei Eugenio were you able to reproduce? Siwei did you have time to look into this? Didn't get a chance to look into the detail yet in the past week, but thought it may have something to do with the (internals of) iova tree range allocation and the lookup routine. It started to fall apart at the first vhost_vdpa_dma_unmap call showing up in the trace events, where it should've gotten IOVA=0x201000, but an incorrect IOVA address 0x1000 was ended up returning from the iova tree lookup routine. HVAGPAIOVA - Map [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 0x8000) [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000) [0x80001000, 0x201000) [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x201000, 0x221000) Unmap [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x1000, 0x2) ??? shouldn't it be [0x201000, 0x221000) ??? It looks the SVQ iova tree lookup routine vhost_iova_tree_find_iova(), which is called from vhost_vdpa_listener_region_del(), can't properly deal with overlapped region. Specifically, q35's mch_realize() has the following: 579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high", 580 mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE, 581 MCH_HOST_BRIDGE_SMRAM_C_SIZE); 582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda, 583 &mch->open_high_smram, 1); 584 memory_region_set_enabled(&mch->open_high_smram, false); #0 0x564c30bf6980 in iova_tree_find_address_iterator (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at ../util/iova-tree.c:96 #1 0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0 #2 0x564c30bf6b53 in iova_tree_find_iova (tree=, map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114 #3 0x564c309da0a9 in vhost_iova_tree_find_iova (tree=out>, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70 #4 0x564c3085e49d in vhost_vdpa_listener_region_del (listener=0x564c331024c8, section=0x7fffb6d74aa0) at ../hw/virtio/vhost-vdpa.c:444 #5 0x564c309f4931 in address_space_update_topology_pass (as=as@entry=0x564c31ab1840 , old_view=old_view@entry=0x564c33364cc0, new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at ../system/memory.c:977 #6 0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840 ) at ../system/memory.c:1079 #7 0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1132 #8 0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1117 #9 0x564c307cce64 in mch_realize (d=, errp=) at ../hw/pci-host/q35.c:584 However, it looks like iova_tree_find_address_iterator() only check if the translated address (HVA) falls in to the range when trying to locate the desired IOVA, causing the first DMAMap that happens to overlap in the translated address (HVA) space to be returned prematurely: 89 static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, 90 gpointer data) 91 { : : 99 if (map->translated_addr + map->size < 
needle->translated_addr || 100 needle->translated_addr + needle->size < map->translated_addr) { 101 return false; 102 } 103 104 args->result = map; 105 return true; 106 } The QEMU trace file reveals that the first DMAMap below gets returned incorrectly instead of the second, and the latter is what the actual IOVA corresponds to: HVA GPA IOVA [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 0x80001000) [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x201000, 0x221000) Maybe besides checking the HVA range, we should also match the GPA, or at least the size should match exactly? Yes,
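A minimal sketch of the direction suggested at the end, namely requiring the size (and possibly the GPA) to match rather than accepting any HVA overlap; struct and field names follow the util/iova-tree.c code quoted above, and this is not necessarily the fix that was eventually merged:

static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
                                                gpointer data)
{
    const DMAMap *map = key;
    IOVATreeFindIOVAArgs *args = data;
    const DMAMap *needle = args->needle;

    /* Reject maps that merely overlap in HVA but describe a different span,
     * e.g. the aliased open_high_smram subregion vs. the full RAM mapping. */
    if (needle->size != map->size) {
        return false;
    }

    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr) {
        return false;
    }

    args->result = map;
    return true;
}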
Re: [PATCH v2 6/7] vdpa: move iova_tree allocation to net_vhost_vdpa_init
On 4/2/2024 5:01 AM, Eugenio Perez Martin wrote: On Tue, Apr 2, 2024 at 8:19 AM Si-Wei Liu wrote: On 2/14/2024 11:11 AM, Eugenio Perez Martin wrote: On Wed, Feb 14, 2024 at 7:29 PM Si-Wei Liu wrote: Hi Michael, On 2/13/2024 2:22 AM, Michael S. Tsirkin wrote: On Mon, Feb 05, 2024 at 05:10:36PM -0800, Si-Wei Liu wrote: Hi Eugenio, I thought this new code looks good to me and the original issue I saw with x-svq=on should be gone. However, after rebase my tree on top of this, there's a new failure I found around setting up guest mappings at early boot, please see attached the specific QEMU config and corresponding event traces. Haven't checked into the detail yet, thinking you would need to be aware of ahead. Regards, -Siwei Eugenio were you able to reproduce? Siwei did you have time to look into this? Didn't get a chance to look into the detail yet in the past week, but thought it may have something to do with the (internals of) iova tree range allocation and the lookup routine. It started to fall apart at the first vhost_vdpa_dma_unmap call showing up in the trace events, where it should've gotten IOVA=0x201000, but an incorrect IOVA address 0x1000 was ended up returning from the iova tree lookup routine. HVAGPAIOVA - Map [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 0x8000) [0x7f7983e0, 0x7f9903e0)[0x1, 0x208000) [0x80001000, 0x201000) [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x201000, 0x221000) Unmap [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x1000, 0x2) ??? shouldn't it be [0x201000, 0x221000) ??? It looks the SVQ iova tree lookup routine vhost_iova_tree_find_iova(), which is called from vhost_vdpa_listener_region_del(), can't properly deal with overlapped region. Specifically, q35's mch_realize() has the following: 579 memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), "smram-open-high", 580 mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE, 581 MCH_HOST_BRIDGE_SMRAM_C_SIZE); 582 memory_region_add_subregion_overlap(mch->system_memory, 0xfeda, 583 &mch->open_high_smram, 1); 584 memory_region_set_enabled(&mch->open_high_smram, false); #0 0x564c30bf6980 in iova_tree_find_address_iterator (key=0x564c331cf8e0, value=0x564c331cf8e0, data=0x7fffb6d749b0) at ../util/iova-tree.c:96 #1 0x7f5f66479654 in g_tree_foreach () at /lib64/libglib-2.0.so.0 #2 0x564c30bf6b53 in iova_tree_find_iova (tree=, map=map@entry=0x7fffb6d74a00) at ../util/iova-tree.c:114 #3 0x564c309da0a9 in vhost_iova_tree_find_iova (tree=, map=map@entry=0x7fffb6d74a00) at ../hw/virtio/vhost-iova-tree.c:70 #4 0x564c3085e49d in vhost_vdpa_listener_region_del (listener=0x564c331024c8, section=0x7fffb6d74aa0) at ../hw/virtio/vhost-vdpa.c:444 #5 0x564c309f4931 in address_space_update_topology_pass (as=as@entry=0x564c31ab1840 , old_view=old_view@entry=0x564c33364cc0, new_view=new_view@entry=0x564c333640f0, adding=adding@entry=false) at ../system/memory.c:977 #6 0x564c309f4dcd in address_space_set_flatview (as=0x564c31ab1840 ) at ../system/memory.c:1079 #7 0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1132 #8 0x564c309f86d0 in memory_region_transaction_commit () at ../system/memory.c:1117 #9 0x564c307cce64 in mch_realize (d=, errp=) at ../hw/pci-host/q35.c:584 However, it looks like iova_tree_find_address_iterator() only check if the translated address (HVA) falls in to the range when trying to locate the desired IOVA, causing the first DMAMap that happens to overlap in the translated address (HVA) space to be returned prematurely: 89 static gboolean iova_tree_find_address_iterator(gpointer key, 
gpointer value, 90 gpointer data) 91 { : : 99 if (map->translated_addr + map->size < needle->translated_addr || 100 needle->translated_addr + needle->size < map->translated_addr) { 101 return false; 102 } 103 104 args->result = map; 105 return true; 106 } The QEMU trace file reveals that the first DMAMap below gets returned incorrectly instead of the second, and the latter is what the actual IOVA corresponds to: HVA GPA IOVA [0x7f7903e0, 0x7f7983e0)[0x0, 0x8000) [0x1000, 0x80001000) [0x7f7903ea, 0x7f7903ec)[0xfeda, 0xfedc) [0x201000, 0x221000) I think the analysis is totally accurate
Re: [PATCH 12/12] vdpa: fix network breakage after cancelling migration
On 3/13/2024 11:12 AM, Michael Tokarev wrote: 14.02.2024 14:28, Si-Wei Liu wrote: Fix an issue where cancellation of an ongoing migration ends up with no network connectivity. When canceling migration, SVQ will be switched back to the passthrough mode, but the right call fd is not programmed into the device and the SVQ's own call fd is still used. At this point of the transitioning period, shadow_vqs_enabled hadn't been set back to false yet, causing the installation of the call fd to be inadvertently bypassed. Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding capabilities") Cc: Eugenio Pérez Acked-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/virtio/vhost-vdpa.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) Is this -stable material? Probably yes, the pre-requisites of this patch are PATCH #10 and #11 from this series (where SVQ_TSTATE_DISABLING gets defined and set). If yes, is it also applicable for stable-7.2 (the mentioned commit is in 7.2.0), which lacks v7.2.0-2327-gb276524386 "vdpa: Remember last call fd set", or should that one also be picked up? Eugenio can judge, but it seems to me the relevant code path cannot be effectively exercised, since the dynamic SVQ feature (switching over to SVQ dynamically when migration is started) is not supported in 7.2. Maybe not worth it to cherry-pick this one to 7.2. Cherry-picking to stable-8.0 and above should be applicable though (it needs some tweaks on patch #10 to move svq_switching from @struct VhostVDPAShared to @struct vhost_vdpa). Regards, -Siwei Thanks, /mjt diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 004110f..dfeca8b 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -1468,7 +1468,15 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev, /* Remember last call fd because we can switch to SVQ anytime. */ vhost_svq_set_svq_call_fd(svq, file->fd); - if (v->shadow_vqs_enabled) { + /* + * When SVQ is transitioning to off, shadow_vqs_enabled has + * not been set back to false yet, but the underlying call fd + * will have to switch back to the guest notifier to signal the + * passthrough virtqueues. In other situations, SVQ's own call + * fd shall be used to signal the device model. + */ + if (v->shadow_vqs_enabled && + v->shared->svq_switching != SVQ_TSTATE_DISABLING) { return 0; }
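Patches #10 and #11 of the series are not quoted in this thread; purely for illustration, the transition state they introduce and the way it brackets a dynamic SVQ switch could look like the sketch below. The enum values and the helper are assumptions based on the discussion, not the actual hunks:

typedef enum SVQTransitionState {
    SVQ_TSTATE_DISABLING = -1,
    SVQ_TSTATE_DONE,
    SVQ_TSTATE_ENABLING,
} SVQTransitionState;

/* Sketch: bracket the dynamic SVQ switch with the state so that
 * vhost_vdpa_set_vring_call() can tell "SVQ still on but going away"
 * apart from steady-state SVQ and program the guest notifier fd. */
static void vdpa_net_switch_svq_sketch(VhostVDPAShared *shared, bool enable)
{
    shared->svq_switching = enable ? SVQ_TSTATE_ENABLING : SVQ_TSTATE_DISABLING;
    /* ... stop the device, flip shadow_vqs_enabled, restart it ... */
    shared->svq_switching = SVQ_TSTATE_DONE;
}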
Re: [PATCH v2 1/2] vhost: dirty log should be per backend type
On 3/12/2024 8:07 AM, Michael S. Tsirkin wrote: On Wed, Feb 14, 2024 at 10:42:29AM -0800, Si-Wei Liu wrote: Hi Michael, I'm taking off for 2+ weeks, but please feel free to provide comments and feedback while I'm off. I'll still be checking emails, and will address any open items as soon as I am back. Thanks, -Siwei Eugenio sent some comments. I don't have more, just address these please. Thanks! Thanks Michael, good to know you don't have more comments other than the ones from Eugenio. I will post a v3 shortly to address his comments. -Siwei
[PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration
On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 63 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 58 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index efe2f74..d91858b 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); +} + +static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add) +{ +VhostBackendType backend_type; + +assert(hdev->vhost_ops); + +backend_type = hdev->vhost_ops->backend_type; +assert(backend_type > VHOST_BACKEND_TYPE_NONE); +assert(backend_type < VHOST_BACKEND_TYPE_MAX); + +if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) { +if (QLIST_EMPTY(&vhost_log_devs[backend_type])) { +QLIST_INSERT_HEAD(&vhost_log_devs[backend_type], + hdev, logdev_entry); +} else { +/* + * The first vhost_device in the list is selected as the shared + * logger to scan memory sections. Put new entry next to the head + * to avoid inadvertent change to the underlying logger device. 
+ */ +QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]), + hdev, logdev_entry); +} +} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) { +QLIST_REMOVE(hdev, logdev_entry); +} +} + static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, MemoryRegionSection *section, hwaddr first, @@ -166,12 +204,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, start_addr = MAX(first, start_addr); end_addr = MIN(last, end_addr); -for (i = 0; i < dev->mem->nregions; ++i) { -struct vhost_memory_region *reg = dev->mem->regions + i; -vhost_dev_sync_region(dev, section, start_addr, end_addr, - reg->guest_phys_addr, - range_get_last(reg->guest_phys_addr, - reg->memory_size)); +if (vhost_dev_should_log(dev)) { +for (i = 0; i < dev->mem->nregio
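The hunks that wire the election into the logging enable/disable path are not part of the quote above. Based on the commit message (the logger device is selected when we enable the logger and cleared when we disable it), the call site looks roughly like this sketch; the function name below is illustrative, and the exact placement in hw/virtio/vhost.c, likely vhost_dev_set_log(), is an assumption:

/* Sketch: called wherever dirty logging is turned on or off for a device. */
static void vhost_dev_log_enable_sketch(struct vhost_dev *dev, bool enable_log)
{
    /* existing work: resize/assign dev->log, vhost_set_log_base(), ... */

    /*
     * New: join or leave the per-backend logger list.  The list head is the
     * device elected to scan the memory sections in vhost_sync_dirty_bitmap().
     */
    vhost_dev_elect_mem_logger(dev, enable_log);
}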
[PATCH v3 1/2] vhost: dirty log should be per backend type
There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- v2->v3: - remove non-effective assertion that never be reached - do not return NULL from vhost_log_get() - add neccessary assertions to vhost_log_get() --- hw/virtio/vhost.c | 50 ++ 1 file changed, 38 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..efe2f74 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +if (r == 0) { +assert(dev->vhost_ops->backend_type == backend_type); +} + return r; } @@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share) return log; } -static struct vhost_log *vhost_log_get(uint64_t size, bool share) +static struct vhost_log *vhost_log_get(VhostBackendType backend_type, + uint64_t size, bool share) { -struct vhost_log *log = share ? vhost_log_shm : vhost_log; +struct vhost_log *log; + +assert(backend_type > VHOST_BACKEND_TYPE_NONE); +assert(backend_type < VHOST_BACKEND_TYPE_MAX); + +log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type]; if (!log || log->size != size) { log = vhost_log_alloc(size, share); if (share) { -vhost_log_shm = log; +vhost_log_shm[backend_type] = log; } else { -vhost_log = log; +vhost_log[backend_type] = log; } } else { ++log->refcnt; @@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, bool share) static void vhost_log_put(struct vhost_dev *dev, bool sync) { struct vhost_log *log = dev->log; +VhostBackendType backend_type; if (!log) { return; } +assert(dev->vhost_ops); +backend_type = dev->vhost_ops->backend_type; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) { +return; +} + --log->refcnt; if (log->refcnt == 0) { /* Sync only the range covered by the old log */ @@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool sync) vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1); } -if (vhost_log == log) { +if (vhost_log[backend_type] == log) { g_free(log->log); -vhost_log = NULL; -} else if (vhost_log_shm == log) { +vhost_log[backend_type] = NULL; +} else if (vhost_log_shm[backend_type] == log) { qemu_memfd_free(log->log, log->size * sizeof(*(log->log)), log->fd); -vhost_log_shm = NULL; +vhost_log_shm[backend_type] = NULL; } g_free(log); @@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev) static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size) { -struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev)); +struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type, + size, vhost_dev_log_is_shared(dev)); uint64_t log_base = (uintptr_t)log->log; int r; @@ -2037,8 +2057,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) uint64_t log_base; hdev->log_size = vhost_get_log_size(hdev); -hdev->log = 
vhost_log_get(hdev->log_size, +hdev->log = vhost_log_get(hdev->vhost_ops->backend_type, + hdev->log_size, vhost_dev_log_is_shared(hdev)); +if (!hdev->log) { +VHOST_OPS_DEBUG(r, "vhost_log_get failed"); +goto fail_vq; +} + log_base = (uintptr_t)hdev->log->log; r = hdev->vhost_ops->vhost_set_log_base(hdev, hdev->log_size ? log_base : 0, -- 1.8.3.1
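To illustrate the effect of keying the loggers by backend type: a vhost-kernel and a vhost-user device in the same QEMU process now each get their own reference-counted log instead of silently sharing and resizing a single static one. A rough usage view, assuming the vhost-user backend requests a shared-memory log; kdev, udev and the sizes are placeholders:

/* vhost-kernel net device: plain allocated log */
kdev->log = vhost_log_get(VHOST_BACKEND_TYPE_KERNEL, kernel_log_size, false);

/* vhost-user net device in the same process: memfd-backed log */
udev->log = vhost_log_get(VHOST_BACKEND_TYPE_USER, user_log_size, true);

/*
 * The two entries live in vhost_log[VHOST_BACKEND_TYPE_KERNEL] and
 * vhost_log_shm[VHOST_BACKEND_TYPE_USER] respectively, each with its own
 * refcount, so putting one no longer frees or resizes the other.
 */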
Re: [PATCH v3 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/14/2024 8:34 AM, Eugenio Perez Martin wrote: On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 63 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 58 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index efe2f74..d91858b 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). 
*/ static unsigned int used_memslots; @@ -149,6 +150,43 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); +} + +static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add) +{ +VhostBackendType backend_type; + +assert(hdev->vhost_ops); + +backend_type = hdev->vhost_ops->backend_type; +assert(backend_type > VHOST_BACKEND_TYPE_NONE); +assert(backend_type < VHOST_BACKEND_TYPE_MAX); + +if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) { +if (QLIST_EMPTY(&vhost_log_devs[backend_type])) { +QLIST_INSERT_HEAD(&vhost_log_devs[backend_type], + hdev, logdev_entry); +} else { +/* + * The first vhost_device in the list is selected as the shared + * logger to scan memory sections. Put new entry next to the head + * to avoid inadvertent change to the underlying logger device. + */ Why is changing the logger device a problem? All the code paths are either changing the QLIST or logging, isn't it? Changing logger device doesn't affect functionality for sure, but may have inadvertent effect on cache locality, particularly it's relevant to the log scanning process in the hot path. The code makes sure there's no churn on the leading logger selection as a result of adding new vhost device, unless the selected logger device will be gone and a re-election of another logger is needed. -Siwei +QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]), + hdev, logdev_entry); +} +} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) { +QLIST_REMOVE(hdev, logdev_entry); +} +} + static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, MemoryRegionSection *section,
Re: [PATCH v3 1/2] vhost: dirty log should be per backend type
On 3/14/2024 8:25 AM, Eugenio Perez Martin wrote: On Thu, Mar 14, 2024 at 9:38 AM Si-Wei Liu wrote: There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- v2->v3: - remove non-effective assertion that never be reached - do not return NULL from vhost_log_get() - add neccessary assertions to vhost_log_get() --- hw/virtio/vhost.c | 50 ++ 1 file changed, 38 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..efe2f74 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +if (r == 0) { +assert(dev->vhost_ops->backend_type == backend_type); +} + return r; } @@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share) return log; } -static struct vhost_log *vhost_log_get(uint64_t size, bool share) +static struct vhost_log *vhost_log_get(VhostBackendType backend_type, + uint64_t size, bool share) { -struct vhost_log *log = share ? vhost_log_shm : vhost_log; +struct vhost_log *log; + +assert(backend_type > VHOST_BACKEND_TYPE_NONE); +assert(backend_type < VHOST_BACKEND_TYPE_MAX); + +log = share ? 
vhost_log_shm[backend_type] : vhost_log[backend_type]; if (!log || log->size != size) { log = vhost_log_alloc(size, share); if (share) { -vhost_log_shm = log; +vhost_log_shm[backend_type] = log; } else { -vhost_log = log; +vhost_log[backend_type] = log; } } else { ++log->refcnt; @@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, bool share) static void vhost_log_put(struct vhost_dev *dev, bool sync) { struct vhost_log *log = dev->log; +VhostBackendType backend_type; if (!log) { return; } +assert(dev->vhost_ops); +backend_type = dev->vhost_ops->backend_type; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) { +return; +} + --log->refcnt; if (log->refcnt == 0) { /* Sync only the range covered by the old log */ @@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool sync) vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1); } -if (vhost_log == log) { +if (vhost_log[backend_type] == log) { g_free(log->log); -vhost_log = NULL; -} else if (vhost_log_shm == log) { +vhost_log[backend_type] = NULL; +} else if (vhost_log_shm[backend_type] == log) { qemu_memfd_free(log->log, log->size * sizeof(*(log->log)), log->fd); -vhost_log_shm = NULL; +vhost_log_shm[backend_type] = NULL; } g_free(log); @@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev) static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size) { -struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev)); +struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type, + size, vhost_dev_log_is_shared(dev)); uint64_t log_base = (uintptr_t)log->log; int r; @@ -2037,8 +2057,14 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) uint64_t log_base; hdev->log_size = vhost_get_log_size(hdev); -hdev->log = vhost_log_get(hdev->log_size, +hdev->log = vhost_log_get(hdev->vhost_ops->backend_type, + hdev->log_size, vhost_dev_log_is_shared(hdev)); +if (!hdev->log) { I thought vhost_log_get couldn't return NULL :). Sure, missed that. Will post a revised v4. -Siwei Other than that, Acked-by: Eugenio Pérez +VHOST_OPS_DEBUG(r, "vhost_log_get failed"); +goto fail_vq; +} + log_base = (uintptr_t)hdev->log->log; r = hde
[PATCH v4 1/2] vhost: dirty log should be per backend type
There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- v3->v4: - remove checking NULL return value from vhost_log_get v2->v3: - remove non-effective assertion that never be reached - do not return NULL from vhost_log_get() - add neccessary assertions to vhost_log_get() --- hw/virtio/vhost.c | 45 + 1 file changed, 33 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..612f4db 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +if (r == 0) { +assert(dev->vhost_ops->backend_type == backend_type); +} + return r; } @@ -319,16 +323,22 @@ static struct vhost_log *vhost_log_alloc(uint64_t size, bool share) return log; } -static struct vhost_log *vhost_log_get(uint64_t size, bool share) +static struct vhost_log *vhost_log_get(VhostBackendType backend_type, + uint64_t size, bool share) { -struct vhost_log *log = share ? vhost_log_shm : vhost_log; +struct vhost_log *log; + +assert(backend_type > VHOST_BACKEND_TYPE_NONE); +assert(backend_type < VHOST_BACKEND_TYPE_MAX); + +log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type]; if (!log || log->size != size) { log = vhost_log_alloc(size, share); if (share) { -vhost_log_shm = log; +vhost_log_shm[backend_type] = log; } else { -vhost_log = log; +vhost_log[backend_type] = log; } } else { ++log->refcnt; @@ -340,11 +350,20 @@ static struct vhost_log *vhost_log_get(uint64_t size, bool share) static void vhost_log_put(struct vhost_dev *dev, bool sync) { struct vhost_log *log = dev->log; +VhostBackendType backend_type; if (!log) { return; } +assert(dev->vhost_ops); +backend_type = dev->vhost_ops->backend_type; + +if (backend_type == VHOST_BACKEND_TYPE_NONE || +backend_type >= VHOST_BACKEND_TYPE_MAX) { +return; +} + --log->refcnt; if (log->refcnt == 0) { /* Sync only the range covered by the old log */ @@ -352,13 +371,13 @@ static void vhost_log_put(struct vhost_dev *dev, bool sync) vhost_log_sync_range(dev, 0, dev->log_size * VHOST_LOG_CHUNK - 1); } -if (vhost_log == log) { +if (vhost_log[backend_type] == log) { g_free(log->log); -vhost_log = NULL; -} else if (vhost_log_shm == log) { +vhost_log[backend_type] = NULL; +} else if (vhost_log_shm[backend_type] == log) { qemu_memfd_free(log->log, log->size * sizeof(*(log->log)), log->fd); -vhost_log_shm = NULL; +vhost_log_shm[backend_type] = NULL; } g_free(log); @@ -376,7 +395,8 @@ static bool vhost_dev_log_is_shared(struct vhost_dev *dev) static inline void vhost_dev_log_resize(struct vhost_dev *dev, uint64_t size) { -struct vhost_log *log = vhost_log_get(size, vhost_dev_log_is_shared(dev)); +struct vhost_log *log = vhost_log_get(dev->vhost_ops->backend_type, + size, vhost_dev_log_is_shared(dev)); uint64_t log_base = (uintptr_t)log->log; int r; @@ -2037,7 +2057,8 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) uint64_t log_base; 
hdev->log_size = vhost_get_log_size(hdev); -hdev->log = vhost_log_get(hdev->log_size, +hdev->log = vhost_log_get(hdev->vhost_ops->backend_type, + hdev->log_size, vhost_dev_log_is_shared(hdev)); log_base = (uintptr_t)hdev->log->log; r = hdev->vhost_ops->vhost_set_log_base(hdev, -- 1.8.3.1
[PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); +} + +static inline void vhost_dev_elect_mem_logger(struct vhost_dev *hdev, bool add) +{ +VhostBackendType backend_type; + +assert(hdev->vhost_ops); + +backend_type = hdev->vhost_ops->backend_type; +assert(backend_type > VHOST_BACKEND_TYPE_NONE); +assert(backend_type < VHOST_BACKEND_TYPE_MAX); + +if (add && !QLIST_IS_INSERTED(hdev, logdev_entry)) { +if (QLIST_EMPTY(&vhost_log_devs[backend_type])) { +QLIST_INSERT_HEAD(&vhost_log_devs[backend_type], + hdev, logdev_entry); +} else { +/* + * The first vhost_device in the list is selected as the shared + * logger to scan memory sections. 
Put new entry next to the head + * to avoid inadvertent change to the underlying logger device. + * This is done in order to get better cache locality and to avoid + * performance churn on the hot path for log scanning. Even when + * new devices come and go quickly, it wouldn't end up changing + * the active leading logger device at all. + */ +QLIST_INSERT_AFTER(QLIST_FIRST(&vhost_log_devs[backend_type]), + hdev, logdev_entry); +} +} else if (!add && QLIST_IS_INSERTED(hdev, logdev_entry)) { +QLIST_REMOVE(hdev, logdev_entry); +} +} + static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, MemoryRegionSection *section, hwaddr first, @@ -166,12 +208,14 @@ static int vhost_sync_dirty_bitmap(struct vhost_dev *dev, start_addr = MAX(first, start_addr); end_addr = MIN(last, end_addr); -for (i = 0; i < dev->mem->nregions; ++i) { -struct vho
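A rough sketch (not the verbatim patch; the body of the cut-off hunk is an assumption based on the visible parts) of how the scan path is expected to consult vhost_dev_should_log(): only the elected logger walks the guest memory sections, while every device still syncs its own virtqueue used-ring log.

static int vhost_sync_dirty_bitmap(struct vhost_dev *dev,
                                   MemoryRegionSection *section,
                                   hwaddr first,
                                   hwaddr last)
{
    int i;
    hwaddr start_addr;
    hwaddr end_addr;

    if (!dev->log_enabled || !dev->started) {
        return 0;
    }
    start_addr = section->offset_within_address_space;
    end_addr = range_get_last(start_addr, int128_get64(section->size));
    start_addr = MAX(first, start_addr);
    end_addr = MIN(last, end_addr);

    if (vhost_dev_should_log(dev)) {
        /* Memory sections: scanned by the elected logger device only. */
        for (i = 0; i < dev->mem->nregions; ++i) {
            struct vhost_memory_region *reg = dev->mem->regions + i;
            vhost_dev_sync_region(dev, section, start_addr, end_addr,
                                  reg->guest_phys_addr,
                                  range_get_last(reg->guest_phys_addr,
                                                 reg->memory_size));
        }
    }
    for (i = 0; i < dev->nvqs; ++i) {
        /* Used-ring dirty bits: still synced by every device, per vq. */
        struct vhost_virtqueue *vq = dev->vqs + i;

        if (!vq->used_phys && !vq->used_size) {
            continue;
        }
        vhost_dev_sync_region(dev, section, start_addr, end_addr,
                              vq->used_phys,
                              range_get_last(vq->used_phys, vq->used_size));
    }
    return 0;
}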
Re: [PATCH v4 1/2] vhost: dirty log should be per backend type
On 3/14/2024 8:50 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. It's better to describe what's the advantage of doing this. Yes, I can add that to the log. Although it's a niche use case, it was actually a long standing limitation / bug that vhost-user and vhost-kernel loggers can't co-exist per QEMU process, but today it's just silent failure that may be ended up with. This bug fix removes that implicit limitation in the code. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- v3->v4: - remove checking NULL return value from vhost_log_get v2->v3: - remove non-effective assertion that never be reached - do not return NULL from vhost_log_get() - add neccessary assertions to vhost_log_get() --- hw/virtio/vhost.c | 45 + 1 file changed, 33 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..612f4db 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +if (r == 0) { +assert(dev->vhost_ops->backend_type == backend_type); +} + Under which condition could we hit this? Just in case some other function inadvertently corrupted this earlier, we have to capture discrepancy in the first place... On the other hand, it will be helpful for other vhost backend writers to diagnose day-one bug in the code. I feel just code comment here will not be sufficient/helpful. It seems not good to assert a local logic. It seems to me quite a few local asserts are in the same file already, vhost_save_backend_state, vhost_load_backend_state, vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why local assert a problem? Thanks, -Siwei Thanks
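For context on what the per-backend-type split buys (the vhost_log_get() hunk is not quoted above), a sketch of the assumed shape of the getter once the globals become arrays; the point is that a vhost-kernel device and a vhost-user device each get their own refcounted dirty log instead of silently contending for one global instance.

static struct vhost_log *vhost_log_get(VhostBackendType backend_type,
                                       uint64_t size, bool share)
{
    struct vhost_log *log;

    assert(backend_type > VHOST_BACKEND_TYPE_NONE);
    assert(backend_type < VHOST_BACKEND_TYPE_MAX);

    /* Each backend type owns its own shared-memory or anonymous log. */
    log = share ? vhost_log_shm[backend_type] : vhost_log[backend_type];

    if (!log || log->size != size) {
        log = vhost_log_alloc(size, share);
        if (share) {
            vhost_log_shm[backend_type] = log;
        } else {
            vhost_log[backend_type] = log;
        }
    } else {
        ++log->refcnt;
    }

    return log;
}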
Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/14/2024 9:03 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); A dumb question, why not simple check dev->log == vhost_log_shm[dev->vhost_ops->backend_type] Because we are not sure if the logger comes from vhost_log_shm[] or vhost_log[]. Don't want to complicate the check here by calling into vhost_dev_log_is_shared() everytime when the .log_sync() is called. -Siwei ? Thanks
Re: [PATCH v4 1/2] vhost: dirty log should be per backend type
On 3/17/2024 8:20 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:33 AM Si-Wei Liu wrote: On 3/14/2024 8:50 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: There could be a mix of both vhost-user and vhost-kernel clients in the same QEMU process, where separate vhost loggers for the specific vhost type have to be used. Make the vhost logger per backend type, and have them properly reference counted. It's better to describe what's the advantage of doing this. Yes, I can add that to the log. Although it's a niche use case, it was actually a long standing limitation / bug that vhost-user and vhost-kernel loggers can't co-exist per QEMU process, but today it's just silent failure that may be ended up with. This bug fix removes that implicit limitation in the code. Ok. Suggested-by: Michael S. Tsirkin Signed-off-by: Si-Wei Liu --- v3->v4: - remove checking NULL return value from vhost_log_get v2->v3: - remove non-effective assertion that never be reached - do not return NULL from vhost_log_get() - add neccessary assertions to vhost_log_get() --- hw/virtio/vhost.c | 45 + 1 file changed, 33 insertions(+), 12 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 2c9ac79..612f4db 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -43,8 +43,8 @@ do { } while (0) #endif -static struct vhost_log *vhost_log; -static struct vhost_log *vhost_log_shm; +static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; +static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -287,6 +287,10 @@ static int vhost_set_backend_type(struct vhost_dev *dev, r = -1; } +if (r == 0) { +assert(dev->vhost_ops->backend_type == backend_type); +} + Under which condition could we hit this? Just in case some other function inadvertently corrupted this earlier, we have to capture discrepancy in the first place... On the other hand, it will be helpful for other vhost backend writers to diagnose day-one bug in the code. I feel just code comment here will not be sufficient/helpful. See below. It seems not good to assert a local logic. It seems to me quite a few local asserts are in the same file already, vhost_save_backend_state, For example it has assert for assert(!dev->started); which is not the logic of the function itself but require vhost_dev_start() not to be called before. But it looks like this patch you assert the code just a few lines above the assert itself? Yes, that was the intent - for e.g. xxx_ops may contain corrupted xxx_ops.backend_type already before coming to this vhost_set_backend_type() function. And we may capture this corrupted state by asserting the expected xxx_ops.backend_type (to be consistent with the backend_type passed in), which needs be done in the first place when this discrepancy is detected. In practice I think there should be no harm to add this assert, but this will add warranted guarantee to the current code. Regards, -Siwei dev->vhost_ops = &xxx_ops; ... assert(dev->vhost_ops->backend_type == backend_type) ? Thanks vhost_load_backend_state, vhost_virtqueue_mask, vhost_config_mask, just to name a few. Why local assert a problem? Thanks, -Siwei Thanks
Re: [PATCH v4 2/2] vhost: Perform memory section dirty scans once per iteration
On 3/17/2024 8:22 PM, Jason Wang wrote: On Sat, Mar 16, 2024 at 2:45 AM Si-Wei Liu wrote: On 3/14/2024 9:03 PM, Jason Wang wrote: On Fri, Mar 15, 2024 at 5:39 AM Si-Wei Liu wrote: On setups with one or more virtio-net devices with vhost on, dirty tracking iteration increases cost the bigger the number amount of queues are set up e.g. on idle guests migration the following is observed with virtio-net with vhost=on: 48 queues -> 78.11% [.] vhost_dev_sync_region.isra.13 8 queues -> 40.50% [.] vhost_dev_sync_region.isra.13 1 queue -> 6.89% [.] vhost_dev_sync_region.isra.13 2 devices, 1 queue -> 18.60% [.] vhost_dev_sync_region.isra.14 With high memory rates the symptom is lack of convergence as soon as it has a vhost device with a sufficiently high number of queues, the sufficient number of vhost devices. On every migration iteration (every 100msecs) it will redundantly query the *shared log* the number of queues configured with vhost that exist in the guest. For the virtqueue data, this is necessary, but not for the memory sections which are the same. So essentially we end up scanning the dirty log too often. To fix that, select a vhost device responsible for scanning the log with regards to memory sections dirty tracking. It is selected when we enable the logger (during migration) and cleared when we disable the logger. If the vhost logger device goes away for some reason, the logger will be re-selected from the rest of vhost devices. After making mem-section logger a singleton instance, constant cost of 7%-9% (like the 1 queue report) will be seen, no matter how many queues or how many vhost devices are configured: 48 queues -> 8.71%[.] vhost_dev_sync_region.isra.13 2 devices, 8 queues -> 7.97% [.] vhost_dev_sync_region.isra.14 Co-developed-by: Joao Martins Signed-off-by: Joao Martins Signed-off-by: Si-Wei Liu --- v3 -> v4: - add comment to clarify effect on cache locality and performance v2 -> v3: - add after-fix benchmark to commit log - rename vhost_log_dev_enabled to vhost_dev_should_log - remove unneeded comparisons for backend_type - use QLIST array instead of single flat list to store vhost logger devices - simplify logger election logic --- hw/virtio/vhost.c | 67 ++- include/hw/virtio/vhost.h | 1 + 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 612f4db..58522f1 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -45,6 +45,7 @@ static struct vhost_log *vhost_log[VHOST_BACKEND_TYPE_MAX]; static struct vhost_log *vhost_log_shm[VHOST_BACKEND_TYPE_MAX]; +static QLIST_HEAD(, vhost_dev) vhost_log_devs[VHOST_BACKEND_TYPE_MAX]; /* Memslots used by backends that support private memslots (without an fd). */ static unsigned int used_memslots; @@ -149,6 +150,47 @@ bool vhost_dev_has_iommu(struct vhost_dev *dev) } } +static inline bool vhost_dev_should_log(struct vhost_dev *dev) +{ +assert(dev->vhost_ops); +assert(dev->vhost_ops->backend_type > VHOST_BACKEND_TYPE_NONE); +assert(dev->vhost_ops->backend_type < VHOST_BACKEND_TYPE_MAX); + +return dev == QLIST_FIRST(&vhost_log_devs[dev->vhost_ops->backend_type]); A dumb question, why not simple check dev->log == vhost_log_shm[dev->vhost_ops->backend_type] Because we are not sure if the logger comes from vhost_log_shm[] or vhost_log[]. Don't want to complicate the check here by calling into vhost_dev_log_is_shared() everytime when the .log_sync() is called. It has very low overhead, isn't it? 
Whether this has low overhead will have to depend on the specific backend's implementation of .vhost_requires_shm_log(), which the common vhost layer should not make assumptions about, nor rely on its current implementation. static bool vhost_dev_log_is_shared(struct vhost_dev *dev) { return dev->vhost_ops->vhost_requires_shm_log && dev->vhost_ops->vhost_requires_shm_log(dev); } And it helps to simplify the logic. Generally yes, but when it comes to hot path operations the performance consideration could override this principle. I think there's no harm in checking against the logger device cached in the vhost layer itself, and the current patch does not add much complexity or any performance side effect (actually I think the conditional should be very straightforward to turn into just a couple of assembly compare and branch instructions rather than an indirection through another jmp call). -Siwei Thanks -Siwei ? Thanks
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). Thanks, -Siwei return false; }
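To make the suggestion above concrete, here is a sketch of the skip_id_match variant; the new field and its use are hypothetical, not part of the posted patch, and the surrounding iterator is reconstructed from the quoted hunk:

typedef struct DMAMap {
    hwaddr iova;
    hwaddr translated_addr;
    hwaddr size;                /* Inclusive */
    uint64_t id;
    bool skip_id_match;         /* hypothetical: set only by SVQ lookups */
    IOMMUAccessFlags perm;
} QEMU_PACKED DMAMap;

static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
                                                gpointer data)
{
    const DMAMap *map = key;
    IOVATreeFindIOVAArgs *args = data;
    const DMAMap *needle = args->needle;

    if (map->translated_addr + map->size < needle->translated_addr ||
        needle->translated_addr + needle->size < map->translated_addr ||
        (!needle->skip_id_match && needle->id != map->id)) {
        /* No HVA overlap, or the caller asked for an exact GPA (id) match. */
        return false;
    }

    args->result = map;
    return true;
}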
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Of course, memory_region_from_host() won't search out of the guest memory space for sure. As this could be on the hot data path I have a little bit hesitance over the potential cost or performance regression this change could bring in, but maybe I'm overthinking it too much... 
Thanks, -Siwei Thanks, -Siwei return false; }
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). 
So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard performance regression we could go to more complicated solutions, like maintaining a reverse IOVATree in vhost-iova-tree too. First RFCs of SVQ did that actually. Agreed, yeap we can use memory_region_from_host for now. Any reason why reverse IOVATree was dropped, lack of users? But now we have one! Thanks, -Siwei Thanks! Of course, memory_region_from_host() won't search out of the guest memory space for sure. As this could be on the hot data path I have a little bit hesitance over the potential cost or performance regression this change could bring in, but maybe I'm overthinking it too much... Thanks, -Siwei Thanks, -Siwei return false; }
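A small standalone GLib example (plain glib, not QEMU code; compile against glib-2.0) of the point being discussed: the tree is ordered by its key (the IOVA), so a lookup by value (the HVA) has to visit nodes with g_tree_foreach() and cannot stop early on a key comparison the way an in-order search by IOVA can.

#include <glib.h>
#include <stdio.h>

typedef struct { gsize hva, len; } Mapping;
typedef struct { gsize hva, iova; gboolean found; } ReverseArgs;

/* Keys are IOVAs stored as pointers; the tree is ordered by this compare. */
static gint cmp_iova(gconstpointer a, gconstpointer b)
{
    gsize ia = GPOINTER_TO_SIZE(a), ib = GPOINTER_TO_SIZE(b);
    return ia < ib ? -1 : (ia > ib ? 1 : 0);
}

static gboolean find_by_hva(gpointer key, gpointer value, gpointer data)
{
    ReverseArgs *args = data;
    Mapping *m = value;

    /* Values (HVAs) are not ordered, so we cannot bail out early here. */
    if (args->hva >= m->hva && args->hva < m->hva + m->len) {
        args->iova = GPOINTER_TO_SIZE(key);
        args->found = TRUE;
        return TRUE;            /* returning TRUE stops g_tree_foreach() */
    }
    return FALSE;
}

int main(void)
{
    GTree *t = g_tree_new(cmp_iova);
    Mapping a = { 0x10000, 0x1000 };    /* IOVA 0x1000 -> HVA 0x10000 */
    Mapping b = { 0x6000,  0x1000 };    /* IOVA 0x2000 -> HVA 0x6000  */
    ReverseArgs args = { .hva = 0x6800 };

    g_tree_insert(t, GSIZE_TO_POINTER(0x1000), &a);
    g_tree_insert(t, GSIZE_TO_POINTER(0x2000), &b);

    /* O(N): visits IOVA 0x1000 first even though the match is at 0x2000. */
    g_tree_foreach(t, find_by_hva, &args);
    if (args.found) {
        printf("HVA 0x%lx -> IOVA 0x%lx\n",
               (unsigned long)args.hva, (unsigned long)args.iova);
    }
    g_tree_destroy(t);
    return 0;
}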
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote: On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu wrote: On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. 
Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 The GPA / IOVA are ordered but we're looking by QEMU's vaddr. If we have these translations: [0x1000, 0x2000] -> [0x1, 0x11000] [0x2000, 0x3000] -> [0x6000, 0x7000] We will see them in this order, so we cannot stop the search at the first node. Yeah, reverse lookup is unordered indeed, anyway. But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard performance regression we could go to more complicated solutions, like maintaining a reverse IOVATree in vhost-iova-tree too. First RFCs of SVQ did that actually. Agreed, yeap we can use memory_region_from_host for now. Any reason why reverse IOVATree was dropped, lack of users? But now we have one! No, it is just simplicity. We already have a user in the hot path in the master branch, vhost_svq_vring_write_descs. But I never profiled enough to find if it is a bottleneck or not to
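For reference, a rough sketch of what the "reverse IOVATree" alternative mentioned above could look like: a second tree keyed by HVA maintained alongside the existing one. iova_tree_alloc_map() and iova_tree_insert() are existing helpers; the taddr_iova_map field and the mirror bookkeeping are the hypothetical part, not an existing vhost-iova-tree API.

typedef struct VhostIOVATree {
    hwaddr iova_first;
    hwaddr iova_last;
    IOVATree *iova_taddr_map;   /* IOVA -> HVA, as today */
    IOVATree *taddr_iova_map;   /* HVA -> IOVA, the hypothetical mirror */
} VhostIOVATree;

static int vhost_iova_tree_map_alloc_sketch(VhostIOVATree *tree, DMAMap *map)
{
    int r = iova_tree_alloc_map(tree->iova_taddr_map, map,
                                tree->iova_first, tree->iova_last);
    if (r != IOVA_OK) {
        return r;
    }

    /*
     * Mirror entry keyed by translated_addr (HVA).  It must be removed in
     * the same places the primary entry is removed, or the trees drift.
     */
    DMAMap rev = {
        .iova = map->translated_addr,
        .translated_addr = map->iova,
        .size = map->size,
        .perm = map->perm,
    };
    return iova_tree_insert(tree->taddr_iova_map, &rev);
}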
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote: On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu wrote: On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote: On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu wrote: On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? 
Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 The GPA / IOVA are ordered but we're looking by QEMU's vaddr. If we have these translations: [0x1000, 0x2000] -> [0x1, 0x11000] [0x2000, 0x3000] -> [0x6000, 0x7000] We will see them in this order, so we cannot stop the search at the first node. Yeah, reverse lookup is unordered indeed, anyway. But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard performance regression we could go to more complicated solutions, like maintaining a reverse IOVATree in vhost-iova-tree too. First RFCs of SVQ did that actually. Agreed, yeap we can use memory_region_from_host for now. Any reason why reverse IOVATree was dropped, lack of users? But now we have one! No, it is just simplicity. We already have an user in the hot patch in the ma
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote: On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer wrote: On 4/29/24 4:14 AM, Eugenio Perez Martin wrote: On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu wrote: On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote: On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu wrote: On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. 
I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 The GPA / IOVA are ordered but we're looking by QEMU's vaddr. If we have these translations: [0x1000, 0x2000] -> [0x1, 0x11000] [0x2000, 0x3000] -> [0x6000, 0x7000] We will see them in this order, so we cannot stop the search at the first node. Yeah, reverse lookup is unordered indeed, anyway. But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard performance regression we could go to more complicated solutions, like maintaining a reverse IOVATree in vhost-iova-tree to
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote: On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu wrote: On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote: On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu wrote: On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote: On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu wrote: On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. 
I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 The GPA / IOVA are ordered but we're looking by QEMU's vaddr. If we have these translations: [0x1000, 0x2000] -> [0x1, 0x11000] [0x2000, 0x3000] -> [0x6000, 0x7000] We will see them in this order, so we cannot stop the search at the first node. Yeah, reverse lookup is unordered indeed, anyway. But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard performance regression we could go to more complicated solutions, like maintaining a reverse IOVATree in vhost-iova-tree too. First RFCs of SVQ did that actually. Agreed, yeap we can use memory_region_from_host for now. Any reason why reverse IOVATree was dr
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 5/1/2024 11:18 PM, Eugenio Perez Martin wrote: On Thu, May 2, 2024 at 12:09 AM Si-Wei Liu wrote: On 4/30/2024 11:11 AM, Eugenio Perez Martin wrote: On Mon, Apr 29, 2024 at 1:19 PM Jonah Palmer wrote: On 4/29/24 4:14 AM, Eugenio Perez Martin wrote: On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu wrote: On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote: On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu wrote: On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. 
I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 The GPA / IOVA are ordered but we're looking by QEMU's vaddr. If we have these translations: [0x1000, 0x2000] -> [0x1, 0x11000] [0x2000, 0x3000] -> [0x6000, 0x7000] We will see them in this order, so we cannot stop the search at the first node. Yeah, reverse lookup is unordered indeed, anyway. But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard pe
Re: [RFC 1/2] iova_tree: add an id member to DMAMap
On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote: On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu wrote: On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote: On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu wrote: On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote: On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu wrote: On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote: On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu wrote: On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote: On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu wrote: On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote: On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu wrote: On 4/10/2024 3:03 AM, Eugenio Pérez wrote: IOVA tree is also used to track the mappings of virtio-net shadow virtqueue. This mappings may not match with the GPA->HVA ones. This causes a problem when overlapped regions (different GPA but same translated HVA) exists in the tree, as looking them by HVA will return them twice. To solve this, create an id member so we can assign unique identifiers (GPA) to the maps. Signed-off-by: Eugenio Pérez --- include/qemu/iova-tree.h | 5 +++-- util/iova-tree.c | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h index 2a10a7052e..34ee230e7d 100644 --- a/include/qemu/iova-tree.h +++ b/include/qemu/iova-tree.h @@ -36,6 +36,7 @@ typedef struct DMAMap { hwaddr iova; hwaddr translated_addr; hwaddr size;/* Inclusive */ +uint64_t id; IOMMUAccessFlags perm; } QEMU_PACKED DMAMap; typedef gboolean (*iova_tree_iterator)(DMAMap *map); @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map); * @map: the mapping to search * * Search for a mapping in the iova tree that translated_addr overlaps with the - * mapping range specified. Only the first found mapping will be - * returned. + * mapping range specified and map->id is equal. Only the first found + * mapping will be returned. * * Return: DMAMap pointer if found, or NULL if not found. Note that * the returned DMAMap pointer is maintained internally. User should diff --git a/util/iova-tree.c b/util/iova-tree.c index 536789797e..0863e0a3b8 100644 --- a/util/iova-tree.c +++ b/util/iova-tree.c @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value, needle = args->needle; if (map->translated_addr + map->size < needle->translated_addr || -needle->translated_addr + needle->size < map->translated_addr) { +needle->translated_addr + needle->size < map->translated_addr || +needle->id != map->id) { It looks this iterator can also be invoked by SVQ from vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA space will be searched on without passing in the ID (GPA), and exact match for the same GPA range is not actually needed unlike the mapping removal case. Could we create an API variant, for the SVQ lookup case specifically? Or alternatively, add a special flag, say skip_id_match to DMAMap, and the id match check may look like below: (!needle->skip_id_match && needle->id != map->id) I think vhost_svq_translate_addr() could just call the API variant or pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova(). I think you're totally right. But I'd really like to not complicate the API of the iova_tree more. I think we can look for the hwaddr using memory_region_from_host and then get the hwaddr. It is another lookup though... Yeah, that will be another means of doing translation without having to complicate the API around iova_tree. 
I wonder how the lookup through memory_region_from_host() may perform compared to the iova tree one, the former looks to be an O(N) linear search on a linked list while the latter would be roughly O(log N) on an AVL tree? Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is linear too. It is not even ordered. Oh Sorry, I misread the code and I should look for g_tree_foreach () instead of g_tree_search_node(). So the former is indeed linear iteration, but it looks to be ordered? https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115 The GPA / IOVA are ordered but we're looking by QEMU's vaddr. If we have these translations: [0x1000, 0x2000] -> [0x1, 0x11000] [0x2000, 0x3000] -> [0x6000, 0x7000] We will see them in this order, so we cannot stop the search at the first node. Yeah, reverse lookup is unordered indeed, anyway. But apart from this detail you're right, I have the same concerns with this solution too. If we see a hard performance regression we could go to more complicated solutions, like maintaining a reverse IOVATree in vhost-iova-tree too. First RFCs of SV
Re: [RFC v2 12/13] vdpa: preemptive kick at enable
On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote: On Fri, Jan 13, 2023 at 4:39 AM Jason Wang wrote: On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan wrote: On 1/13/2023 10:31 AM, Jason Wang wrote: On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez wrote: Spuriously kick the destination device's queue so it knows in case there are new descriptors. RFC: This is somehow a gray area. The guest may have placed descriptors in a virtqueue but not kicked it, so it might be surprised if the device starts processing it. So I think this is kind of the work of the vDPA parent. For the parent that needs this trick, we should do it in the parent driver. Agree, it looks easier implementing this in parent driver, I can implement it in ifcvf set_vq_ready right now Great, but please check whether or not it is really needed. Some device implementation could check the available descriptions after DRIVER_OK without waiting for a kick. So IIUC we can entirely drop this from the series (and I hope we can). But then, what with the devices that does *not* check for them? I wonder how the kick can be missed from the first place. Supposedly the moment when vhost_dev_stop() calls .suspend() into vdpa driver, the vcpus already stopped running (vm_running = false) and all pending kicks are delivered through vhost-vdpa's host notifiers or mapped doorbell already then device won't get new ones. If the device intends to purposely ignore (note: this could be a device bug) pending kicks during .suspend(), then consequently it should check available descriptors after reaching driver_ok to process outstanding descriptors, making up for the missing kick. If the vdpa driver doesn't support .suspend(), then it should flush the work before .reset() - vhost-scsi does it this way. Or otherwise I think it's the norm (right thing to do) device should process pending kicks before guest memory is to be unmapped at the late game of vhost_dev_stop(). Is there any case kicks may be missing? -Siwei If we drop it it seems to me we must mandate devices to check for descriptors at queue_enable. The queue could stall if not, isn't it? Thanks! Thanks Thanks Zhu Lingshan Thanks However, that information is not in the migration stream and it should be an edge case anyhow, being resilient to parallel notifications from the guest. Signed-off-by: Eugenio Pérez --- hw/virtio/vhost-vdpa.c | 5 + 1 file changed, 5 insertions(+) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 40b7e8706a..dff94355dd 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -732,11 +732,16 @@ static int vhost_vdpa_set_vring_ready(struct vhost_dev *dev, int ready) } trace_vhost_vdpa_set_vring_ready(dev); for (i = 0; i < dev->nvqs; ++i) { +VirtQueue *vq; struct vhost_vring_state state = { .index = dev->vq_index + i, .num = 1, }; vhost_vdpa_call(dev, VHOST_VDPA_SET_VRING_ENABLE, &state); + +/* Preemptive kick */ +vq = virtio_get_queue(dev->vdev, dev->vq_index + i); +event_notifier_set(virtio_queue_get_host_notifier(vq)); } return 0; } -- 2.31.1
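As an illustration of the "check at queue_enable" alternative discussed above, a hypothetical parent-driver sketch (all names invented; this is not ifcvf code): when a ring is enabled, the driver peeks at the avail index once so descriptors queued before a missed or ignored kick are not stranded.

/* Hypothetical vDPA parent driver, kernel side; illustrative only. */
static void example_vdpa_set_vq_ready(struct vdpa_device *vdpa, u16 qid,
                                      bool ready)
{
    struct example_hw *hw = vdpa_to_example_hw(vdpa);

    example_hw_set_vq_ready(hw, qid, ready);

    /*
     * If the guest queued descriptors but the kick was lost (or ignored
     * while the device was suspended), processing would stall until the
     * next kick.  Self-kick once so the device catches up.
     */
    if (ready &&
        example_hw_get_avail_idx(hw, qid) !=
        example_hw_get_last_avail_idx(hw, qid)) {
        example_hw_kick_vq(hw, qid);
    }
}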
Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
On 1/12/2023 9:24 AM, Eugenio Pérez wrote: It's possible to migrate vdpa net devices if they are shadowed from the start. But to always shadow the dataplane effectively breaks its host passthrough, so it's not convenient in vDPA scenarios. This series enables dynamically switching to shadow mode only at migration time. This allows full data virtqueue passthrough all the time qemu is not migrating. Successfully tested with vdpa_sim_net (but it needs some patches, I will send them soon) and qemu emulated device with vp_vdpa with some restrictions: * No CVQ. * VIRTIO_RING_F_STATE patches. What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it a new feature that other vdpa drivers would need for live migration)? -Siwei * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like DPDK. Comments are welcome, especially in the patches with RFC in the message. v2: - Use a migration listener instead of a memory listener to know when the migration starts. - Add stuff not picked with ASID patches, like enable rings after driver_ok - Add rewinding on the migration src, not in dst - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html Eugenio Pérez (13): vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check vdpa net: move iova tree creation from init to start vdpa: copy cvq shadow_data from data vqs, not from x-svq vdpa: rewind at get_base, not set_base vdpa net: add migration blocker if cannot migrate cvq vhost: delay set_vring_ready after DRIVER_OK vdpa: delay set_vring_ready after DRIVER_OK vdpa: Negotiate _F_SUSPEND feature vdpa: add feature_log parameter to vhost_vdpa vdpa net: allow VHOST_F_LOG_ALL vdpa: add vdpa net migration state notifier vdpa: preemptive kick at enable vdpa: Conditionally expose _F_LOG in vhost_net devices include/hw/virtio/vhost-backend.h | 4 + include/hw/virtio/vhost-vdpa.h| 1 + hw/net/vhost_net.c| 25 ++- hw/virtio/vhost-vdpa.c| 64 +--- hw/virtio/vhost.c | 3 + net/vhost-vdpa.c | 247 +- 6 files changed, 278 insertions(+), 66 deletions(-)
Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
On 1/12/2023 9:24 AM, Eugenio Pérez wrote: This allows net to restart the device backend to configure SVQ on it. Ideally, these changes should not be net specific. However, the vdpa net backend is the one with enough knowledge to configure everything, for a few reasons: * Queues might need to be shadowed or not depending on their kind (control vs data). * Queues need to share the same map translations (iova tree). Because of that it is cleaner to restart the whole net backend and configure it again as expected, similar to how vhost-kernel moves between userspace and passthrough. If more kinds of devices need dynamic switching to SVQ we can create a callback struct like VhostOps and move most of the code there. VhostOps cannot be reused since all vdpa backends share them, and to personalize just for networking would be too heavy. Signed-off-by: Eugenio Pérez --- net/vhost-vdpa.c | 84 1 file changed, 84 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 5d7ad6e4d7..f38532b1df 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -26,6 +26,8 @@ #include #include "standard-headers/linux/virtio_net.h" #include "monitor/monitor.h" +#include "migration/migration.h" +#include "migration/misc.h" #include "migration/blocker.h" #include "hw/virtio/vhost.h" @@ -33,6 +35,7 @@ typedef struct VhostVDPAState { NetClientState nc; struct vhost_vdpa vhost_vdpa; +Notifier migration_state; Error *migration_blocker; VHostNetState *vhost_net; @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) return DO_UPCAST(VhostVDPAState, nc, nc0); } +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; +VirtIONet *n; +VirtIODevice *vdev; +int data_queue_pairs, cvq, r; +NetClientState *peer; + +/* We are only called on the first data vqs and only if x-svq is not set */ +if (s->vhost_vdpa.shadow_vqs_enabled == enable) { +return; +} + +vdev = v->dev->vdev; +n = VIRTIO_NET(vdev); +if (!n->vhost_started) { +return; +} + +if (enable) { +ioctl(v->device_fd, VHOST_VDPA_SUSPEND); +} +data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; +cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; +vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); + +peer = s->nc.peer; +for (int i = 0; i < data_queue_pairs + cvq; i++) { +VhostVDPAState *vdpa_state; +NetClientState *nc; + +if (i < data_queue_pairs) { +nc = qemu_get_peer(peer, i); +} else { +nc = qemu_get_peer(peer, n->max_queue_pairs); +} + +vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); +vdpa_state->vhost_vdpa.shadow_data = enable; + +if (i < data_queue_pairs) { +/* Do not override CVQ shadow_vqs_enabled */ +vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; +} +} + +r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); As the first revision, this method (vhost_net_stop followed by vhost_net_start) should be fine for software vhost-vdpa backends, e.g. vp_vdpa and vdpa_sim_net. However, I would like to call your attention to the fact that this method implies substantial blackout time for mode switching on real hardware: a full cycle of device reset, with memory mappings torn down, the same set of pages unpinned and re-pinned, and new mappings set up, would take a very significant amount of time, especially for a large VM.
Maybe we can do: 1) replace reset with the RESUME feature that was just added to the vhost-vdpa ioctls in the kernel; 2) add new vdpa ioctls to allow an iova range to be rebound to a new virtual address, either for QEMU's shadow vq or back to the device's vq; 3) use a lightweight sequence of suspend+rebind+resume to switch mode on the fly instead of going through the whole reset+restart cycle. I suspect the same idea could even be used to address the high live migration downtime seen on hardware vdpa devices. What do you think? Thanks, -Siwei +if (unlikely(r < 0)) { +error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); +} +} + +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) +{ +MigrationState *migration = data; +VhostVDPAState *s = container_of(notifier, VhostVDPAState, + migration_state); + +switch (migration->state) { +case MIGRATION_STATUS_SETUP: +vhost_vdpa_net_log_global_enable(s, true); +return; + +case MIGRATION_STATUS_CANCELLING: +case MIGRATION_STATUS_CANCELLED: +case MIGRATION_STATUS_FAILED: +vhost_vdpa_net_log_global_enable(s, false); +return; +}; +} + static void vhost
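To sketch the proposed flow more concretely: VHOST_VDPA_SUSPEND already exists and VHOST_VDPA_RESUME is the newly added ioctl referred to in step 1), so the only invented piece below is the rebind interface from step 2). vdpa_rebind_iova() is a hypothetical helper standing in for that not-yet-existing ioctl; the rest only shows the order of operations, not a working implementation.

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>   /* VHOST_VDPA_SUSPEND / VHOST_VDPA_RESUME (recent kernels) */

/* Hypothetical: stands in for the proposed "rebind an existing IOVA range
 * to a new virtual address" ioctl from step 2); no such interface exists. */
int vdpa_rebind_iova(int device_fd, uint64_t iova, uint64_t size,
                     uint64_t new_uaddr);

static int vdpa_switch_to_svq(int device_fd, uint64_t iova, uint64_t size,
                              uint64_t svq_uaddr)
{
    /* 1) quiesce the datapath without resetting the device */
    if (ioctl(device_fd, VHOST_VDPA_SUSPEND) < 0) {
        return -errno;
    }
    /* 2) re-point the already-pinned IOVA range at the shadow vq buffers,
     *    avoiding the unmap/unpin/repin/map cycle (hypothetical call) */
    if (vdpa_rebind_iova(device_fd, iova, size, svq_uaddr) < 0) {
        return -errno;
    }
    /* 3) resume instead of reset + full restart */
    return ioctl(device_fd, VHOST_VDPA_RESUME) < 0 ? -errno : 0;
}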
Re: [PATCH v2 2/5] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
On 4/28/2022 7:23 PM, Jason Wang wrote: 在 2022/4/27 16:30, Si-Wei Liu 写道: With MQ enabled vdpa device and non-MQ supporting guest e.g. booting vdpa with mq=on over OVMF of single vqp, below assert failure is seen: ../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs' failed. 0 0x7f8ce3ff3387 in raise () at /lib64/libc.so.6 1 0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6 2 0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6 3 0x7f8ce3fec252 in () at /lib64/libc.so.6 4 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=out>, idx=) at ../hw/virtio/vhost-vdpa.c:563 5 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=out>, idx=) at ../hw/virtio/vhost-vdpa.c:558 6 0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557 7 0x558f52c6b89a in virtio_pci_set_guest_notifier (d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, with_irqfd=with_irqfd@entry=false) at ../hw/virtio/virtio-pci.c:974 8 0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019 9 0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:361 10 0x558f52d4e5e7 in virtio_net_set_status (status=out>, n=0x558f568f91f0) at ../hw/net/virtio-net.c:289 11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 '\017') at ../hw/net/virtio-net.c:370 12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945 13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, addr=, val=, size=) at ../hw/virtio/virtio-pci.c:1292 14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, addr=20, value=, size=1, shift=, mask=, attrs=...) at ../softmmu/memory.c:492 15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x558f52d15cf0 , mr=0x558f568f19d0, attrs=...) at ../softmmu/memory.c:554 16 0x558f52d157ef in memory_region_dispatch_write (mr=mr@entry=0x558f568f19d0, addr=20, data=, op=, attrs=attrs@entry=...) 
at ../softmmu/memory.c:1504 17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, len=len@entry=1, addr1=, l=, mr=0x558f568f19d0) at /home/opc/qemu-upstream/include/qemu/host-utils.h:165 18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822 19 0x558f52d0b36b in address_space_write (as=, addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=) at ../softmmu/physmem.c:2914 20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=out>, is_write=) at ../softmmu/physmem.c:2924 21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at ../accel/kvm/kvm-all.c:2903 22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at ../accel/kvm/kvm-accel-ops.c:49 23 0x558f52f9f04a in qemu_thread_start (args=) at ../util/qemu-thread-posix.c:556 24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0 25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6 The cause for the assert failure is due to that the vhost_dev index for the ctrl vq was not aligned with actual one in use by the guest. Upon multiqueue feature negotiation in virtio_net_set_multiqueue(), if guest doesn't support multiqueue, the guest vq layout would shrink to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl). This results in ctrl_vq taking a different vhost_dev group index than the default. We can map vq to the correct vhost_dev group by checking if MQ is supported by guest and successfully negotiated. Since the MQ feature is only present along with CTRL_VQ, we make sure the index 2 is only meant for the control vq while MQ is not supported by guest. Fixes: 22288fe ("virtio-net: vhost control virtqueue support") Suggested-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/net/virtio-net.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index ffb3475..8ca0b80 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -3171,8 +3171,17 @@ static NetClientInfo net_virtio_info = { static bool virtio_net_guest_notifier_pending(VirtIODevice *vdev, int idx) { VirtIONet *n = VIRTIO_NET(vdev); - NetClientState *nc = qemu_get_subqueue(n->nic
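The patch hunk above is cut off in this archive. Purely as an illustration of the mapping described in the commit message (not the committed hunk itself), resolving a guest-visible vq index to its NetClientState would look roughly like the sketch below, assuming QEMU's virtio-net helpers as used in the surrounding patches:

/* Rough sketch only: with MQ not negotiated the guest sees 3 vqs
 * (rx, tx, ctrl), so guest vq index 2 must resolve to the cvq entry at
 * n->max_queue_pairs rather than to data queue pair 2 / 2 = 1.
 * (Control vq handling for the MQ-negotiated case is omitted here.) */
static NetClientState *sketch_vq_index_to_nc(VirtIONet *n, VirtIODevice *vdev,
                                             int idx)
{
    if (idx >= 2 && !virtio_vdev_has_feature(vdev, VIRTIO_NET_F_MQ)) {
        /* Only one data queue pair exists, so the only valid index past
         * it is 2, and that is the control virtqueue. */
        assert(idx == 2);
        return qemu_get_subqueue(n->nic, n->max_queue_pairs);
    }
    /* Data vqs come in rx/tx pairs: indexes 0/1 -> pair 0, 2/3 -> pair 1, ... */
    return qemu_get_subqueue(n->nic, idx / 2);
}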
Re: [PATCH v2 2/5] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
On 4/28/2022 7:24 PM, Jason Wang wrote: On Fri, Apr 29, 2022 at 10:24 AM Jason Wang wrote: 在 2022/4/27 16:30, Si-Wei Liu 写道: With MQ enabled vdpa device and non-MQ supporting guest e.g. booting vdpa with mq=on over OVMF of single vqp, below assert failure is seen: ../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs' failed. 0 0x7f8ce3ff3387 in raise () at /lib64/libc.so.6 1 0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6 2 0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6 3 0x7f8ce3fec252 in () at /lib64/libc.so.6 4 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, idx=) at ../hw/virtio/vhost-vdpa.c:563 5 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, idx=) at ../hw/virtio/vhost-vdpa.c:558 6 0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557 7 0x558f52c6b89a in virtio_pci_set_guest_notifier (d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, with_irqfd=with_irqfd@entry=false) at ../hw/virtio/virtio-pci.c:974 8 0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019 9 0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:361 10 0x558f52d4e5e7 in virtio_net_set_status (status=, n=0x558f568f91f0) at ../hw/net/virtio-net.c:289 11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 '\017') at ../hw/net/virtio-net.c:370 12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945 13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, addr=, val=, size=) at ../hw/virtio/virtio-pci.c:1292 14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, addr=20, value=, size=1, shift=, mask=, attrs=...) at ../softmmu/memory.c:492 15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x558f52d15cf0 , mr=0x558f568f19d0, attrs=...) at ../softmmu/memory.c:554 16 0x558f52d157ef in memory_region_dispatch_write (mr=mr@entry=0x558f568f19d0, addr=20, data=, op=, attrs=attrs@entry=...) 
at ../softmmu/memory.c:1504 17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, len=len@entry=1, addr1=, l=, mr=0x558f568f19d0) at /home/opc/qemu-upstream/include/qemu/host-utils.h:165 18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822 19 0x558f52d0b36b in address_space_write (as=, addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=) at ../softmmu/physmem.c:2914 20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=, is_write=) at ../softmmu/physmem.c:2924 21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at ../accel/kvm/kvm-all.c:2903 22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at ../accel/kvm/kvm-accel-ops.c:49 23 0x558f52f9f04a in qemu_thread_start (args=) at ../util/qemu-thread-posix.c:556 24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0 25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6 The cause for the assert failure is due to that the vhost_dev index for the ctrl vq was not aligned with actual one in use by the guest. Upon multiqueue feature negotiation in virtio_net_set_multiqueue(), if guest doesn't support multiqueue, the guest vq layout would shrink to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl). This results in ctrl_vq taking a different vhost_dev group index than the default. We can map vq to the correct vhost_dev group by checking if MQ is supported by guest and successfully negotiated. Since the MQ feature is only present along with CTRL_VQ, we make sure the index 2 is only meant for the control vq while MQ is not supported by guest. Fixes: 22288fe ("virtio-net: vhost control virtqueue support") Suggested-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/net/virtio-net.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index ffb3475..8ca0b80 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -3171,8 +3171,17 @@ static NetClientInfo net_virtio_info = { static bool virtio_net_guest_notifier_pending(VirtIODevice *vdev, int idx) { VirtIONet *n = VIRTIO_NET(vdev); -NetClientState *nc = qemu_get_subque
Re: [PATCH 0/7] vhost-vdpa multiqueue fixes
On 4/28/2022 7:30 PM, Jason Wang wrote: On Wed, Apr 27, 2022 at 5:09 PM Si-Wei Liu wrote: On 4/27/2022 1:38 AM, Jason Wang wrote: On Wed, Apr 27, 2022 at 4:30 PM Si-Wei Liu wrote: On 4/26/2022 9:28 PM, Jason Wang wrote: 在 2022/3/30 14:33, Si-Wei Liu 写道: Hi, This patch series attempt to fix a few issues in vhost-vdpa multiqueue functionality. Patch #1 is the formal submission for RFC patch in: https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/__;!!ACWV5N9M2RV99hQ!OoUKcyWauHGQOM4MTAUn88TINQo5ZP4aaYyvyUCK9ggrI_L6diSZo5Nmq55moGH769SD87drxQyqg3ysNsk$ Patch #2 and #3 were taken from a previous patchset posted on qemu-devel: https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/2027192851.65529-1-epere...@redhat.com/__;!!ACWV5N9M2RV99hQ!OoUKcyWauHGQOM4MTAUn88TINQo5ZP4aaYyvyUCK9ggrI_L6diSZo5Nmq55moGH769SD87drxQyqc3mXqDs$ albeit abandoned, two patches in that set turn out to be useful for patch #4, which is to fix a QEMU crash due to race condition. Patch #5 through #7 are obviously small bug fixes. Please find the description of each in the commit log. Thanks, -Siwei Hi Si Wei: I think we need another version of this series? Hi Jason, Apologies for the long delay. I was in the middle of reworking the patch "virtio: don't read pending event on host notifier if disabled", but found out that it would need quite some code change for the userspace fallback handler to work properly (for the queue no. change case specifically). We probably need this fix for -stable, so I wonder if we can have a workaround first and do refactoring on top? Hmmm, a nasty fix but may well address the segfault is something like this: diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 8ca0b80..3ac93a4 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -368,6 +368,10 @@ static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status) int i; uint8_t queue_status; +if (n->status_pending) +return; + +n->status_pending = true; virtio_net_vnet_endian_status(n, status); virtio_net_vhost_status(n, status); @@ -416,6 +420,7 @@ static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status) } } } +n->status_pending = false; } static void virtio_net_set_link_status(NetClientState *nc) diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h index eb87032..95efea8 100644 --- a/include/hw/virtio/virtio-net.h +++ b/include/hw/virtio/virtio-net.h @@ -216,6 +216,7 @@ struct VirtIONet { VirtioNetRssData rss_data; struct NetRxPkt *rx_pkt; struct EBPFRSSContext ebpf_rss; +bool status_pending; }; void virtio_net_set_netclient_name(VirtIONet *n, const char *name, To be honest, I am not sure if this is worth a full blown fix to make it completely work. Without applying vq suspend patch (the one I posted in https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/df7c9a87-b2bd-7758-a6b6-bd834a733...@oracle.com/__;!!ACWV5N9M2RV99hQ!L4qque3YpPr-CGp12NrNdMMT1HROfEY_Juw2vnfZXHjOhtT0XJCR9GB8cvWEbJL9Aeh-WhDogBVArJn91P0$ ), it's very hard for me to effectively verify my code change - it's very easy for the guest vq index to be out of sync if not stopping the vq once the vhost is up and running (I tested it with repeatedly set_link off and on). Can we test via vmstop? Yes, of coz, that's where the segfault happened. The tight loop of set_link on/off doesn't even work for the single queue case, hence that's why I doubted it ever worked for vhost-vdpa. 
I am not sure if there's real chance we can run into issue in practice due to the incomplete fix, if we don't fix the vq stop/suspend issue first. Anyway I will try, as other use case e.g, live migration is likely to get stumbled on it, too. Ok, so I think we probably don't need the "nasty" fix above. Let's fix it with the issue of stop/resume. Ok, then does below tentative code change suffice the need? i.e. it would fail the request of changing queue pairs when the vhost-vdpa backend falls back to the userspace handler, but it's probably the easiest way to fix the vmstop segfault. diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index ed231f9..8ba9f09 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1177,6 +1177,7 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, struct virtio_net_ctrl_mq mq; size_t s; uint16_t queue_pairs; + NetClientState *nc = qemu_get_queue(n->nic); s = iov_to_buf(iov, iov_cnt, 0, &mq, sizeof(mq)); if (s != sizeof(mq)) { @@ -1196,6 +1197,13 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } + /* avoid changing the number of queue_pairs for vdpa device in + * userspace handler. + * TO
[PATCH v3 5/6] vhost-vdpa: backend feature should set only once
The vhost_vdpa_one_time_request() branch in vhost_vdpa_set_backend_cap() incorrectly sends down ioctls on vhost_dev with non-zero index. This may end up with multiple VHOST_SET_BACKEND_FEATURES ioctl calls sent down on the vhost-vdpa fd that is shared between all these vhost_dev's. To fix it, send down ioctl only once via the first vhost_dev with index 0. For more readibility of code, vhost_vdpa_one_time_request() is renamed to vhost_vdpa_first_dev() with polarity flipped. This call is only applicable to the request that performs operation before setting up queues, and usually at the beginning of operation. Document the requirement for it in place. Signed-off-by: Si-Wei Liu Acked-by: Jason Wang Acked-by: Eugenio Pérez --- hw/virtio/vhost-vdpa.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 8adf7c0..fd1268e 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -366,11 +366,18 @@ static void vhost_vdpa_get_iova_range(struct vhost_vdpa *v) v->iova_range.last); } -static bool vhost_vdpa_one_time_request(struct vhost_dev *dev) +/* + * The use of this function is for requests that only need to be + * applied once. Typically such request occurs at the beginning + * of operation, and before setting up queues. It should not be + * used for request that performs operation until all queues are + * set, which would need to check dev->vq_index_end instead. + */ +static bool vhost_vdpa_first_dev(struct vhost_dev *dev) { struct vhost_vdpa *v = dev->opaque; -return v->index != 0; +return v->index == 0; } static int vhost_vdpa_get_dev_features(struct vhost_dev *dev, @@ -451,7 +458,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp) vhost_vdpa_get_iova_range(v); -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } @@ -594,7 +601,7 @@ static int vhost_vdpa_memslots_limit(struct vhost_dev *dev) static int vhost_vdpa_set_mem_table(struct vhost_dev *dev, struct vhost_memory *mem) { -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } @@ -623,7 +630,7 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev, struct vhost_vdpa *v = dev->opaque; int ret; -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } @@ -665,7 +672,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev) features &= f; -if (vhost_vdpa_one_time_request(dev)) { +if (vhost_vdpa_first_dev(dev)) { r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features); if (r) { return -EFAULT; @@ -1118,7 +1125,7 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base, struct vhost_log *log) { struct vhost_vdpa *v = dev->opaque; -if (v->shadow_vqs_enabled || vhost_vdpa_one_time_request(dev)) { +if (v->shadow_vqs_enabled || !vhost_vdpa_first_dev(dev)) { return 0; } @@ -1240,7 +1247,7 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev, static int vhost_vdpa_set_owner(struct vhost_dev *dev) { -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } -- 1.8.3.1
[PATCH v3 3/6] vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa
... such that no memory leaks on dangling net clients in case of error. Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- net/vhost-vdpa.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 1e9fe47..df1e69e 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -306,7 +306,9 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name, err: if (i) { -qemu_del_net_client(ncs[0]); +for (i--; i >= 0; i--) { +qemu_del_net_client(ncs[i]); +} } qemu_close(vdpa_device_fd); -- 1.8.3.1
[PATCH v3 2/6] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
With MQ enabled vdpa device and non-MQ supporting guest e.g. booting vdpa with mq=on over OVMF of single vqp, below assert failure is seen: ../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs' failed. 0 0x7f8ce3ff3387 in raise () at /lib64/libc.so.6 1 0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6 2 0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6 3 0x7f8ce3fec252 in () at /lib64/libc.so.6 4 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, idx=) at ../hw/virtio/vhost-vdpa.c:563 5 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, idx=) at ../hw/virtio/vhost-vdpa.c:558 6 0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557 7 0x558f52c6b89a in virtio_pci_set_guest_notifier (d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, with_irqfd=with_irqfd@entry=false) at ../hw/virtio/virtio-pci.c:974 8 0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019 9 0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:361 10 0x558f52d4e5e7 in virtio_net_set_status (status=, n=0x558f568f91f0) at ../hw/net/virtio-net.c:289 11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 '\017') at ../hw/net/virtio-net.c:370 12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945 13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, addr=, val=, size=) at ../hw/virtio/virtio-pci.c:1292 14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, addr=20, value=, size=1, shift=, mask=, attrs=...) at ../softmmu/memory.c:492 15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x558f52d15cf0 , mr=0x558f568f19d0, attrs=...) at ../softmmu/memory.c:554 16 0x558f52d157ef in memory_region_dispatch_write (mr=mr@entry=0x558f568f19d0, addr=20, data=, op=, attrs=attrs@entry=...) at ../softmmu/memory.c:1504 17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, len=len@entry=1, addr1=, l=, mr=0x558f568f19d0) at /home/opc/qemu-upstream/include/qemu/host-utils.h:165 18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822 19 0x558f52d0b36b in address_space_write (as=, addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=) at ../softmmu/physmem.c:2914 20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=, is_write=) at ../softmmu/physmem.c:2924 21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at ../accel/kvm/kvm-all.c:2903 22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at ../accel/kvm/kvm-accel-ops.c:49 23 0x558f52f9f04a in qemu_thread_start (args=) at ../util/qemu-thread-posix.c:556 24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0 25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6 The cause for the assert failure is due to that the vhost_dev index for the ctrl vq was not aligned with actual one in use by the guest. 
Upon multiqueue feature negotiation in virtio_net_set_multiqueue(), if guest doesn't support multiqueue, the guest vq layout would shrink to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl). This results in ctrl_vq taking a different vhost_dev group index than the default. We can map vq to the correct vhost_dev group by checking if MQ is supported by guest and successfully negotiated. Since the MQ feature is only present along with CTRL_VQ, we ensure the index 2 is only meant for the control vq while MQ is not supported by guest. Fixes: 22288fe ("virtio-net: vhost control virtqueue support") Suggested-by: Jason Wang Signed-off-by: Si-Wei Liu --- hw/net/virtio-net.c | 33 +++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index ffb3475..f0bb29c 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -14,6 +14,7 @@ #include "qemu/osdep.h" #include "qemu/atomic.h" #include "qemu/iov.h" +#include "qemu/log.h" #include "qemu/main-loop.h" #include "qemu/module.h" #include "hw/virtio/virtio.h" @@ -3171,8 +3172,22 @@ static NetClientInfo net_virtio_info = { static bool virtio_net_guest
[PATCH v3 0/6] vhost-vdpa multiqueue fixes
Hi, This patch series attempts to fix a few issues in vhost-vdpa multiqueue functionality. Patches #1 and #2 are the formal submission for the RFC patch in: https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/ Patches #3 through #5 are obviously small bug fixes. Please find the description of each in the commit log. Patch #6 is a workaround fix for the QEMU segfault described in: https://lore.kernel.org/qemu-devel/4f2acb7a-d436-9d97-80b1-3308c1b39...@oracle.com/ Thanks, -Siwei --- v3: - switch to LOG_GUEST_ERROR for guest trigger-able error - add temporary band-aid fix for QEMU crash due to recursive call v2: - split off vhost_dev notifier patch from "align ctrl_vq index for non-mq guest for vhost_vdpa" - change assert to error message - rename vhost_vdpa_one_time_request to vhost_vdpa_first_dev for clarity Si-Wei Liu (6): virtio-net: setup vhost_dev and notifiers for cvq only when feature is negotiated virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa vhost-net: fix improper cleanup in vhost_net_start vhost-vdpa: backend feature should set only once virtio-net: don't handle mq request in userspace handler for vhost-vdpa hw/net/vhost_net.c | 4 +++- hw/net/virtio-net.c | 49 ++--- hw/virtio/vhost-vdpa.c | 23 +++ net/vhost-vdpa.c | 4 +++- 4 files changed, 67 insertions(+), 13 deletions(-) -- 1.8.3.1
[PATCH v3 6/6] virtio-net: don't handle mq request in userspace handler for vhost-vdpa
virtio_queue_host_notifier_read() tends to read pending event left behind on ioeventfd in the vhost_net_stop() path, and attempts to handle outstanding kicks from userspace vq handler. However, in the ctrl_vq handler, virtio_net_handle_mq() has a recursive call into virtio_net_set_status(), which may lead to segmentation fault as shown in below stack trace: 0 0x55f800df1780 in qdev_get_parent_bus (dev=0x0) at ../hw/core/qdev.c:376 1 0x55f800c68ad8 in virtio_bus_device_iommu_enabled (vdev=vdev@entry=0x0) at ../hw/virtio/virtio-bus.c:331 2 0x55f800d70d7f in vhost_memory_unmap (dev=) at ../hw/virtio/vhost.c:318 3 0x55f800d70d7f in vhost_memory_unmap (dev=, buffer=0x7fc19bec5240, len=2052, is_write=1, access_len=2052) at ../hw/virtio/vhost.c:336 4 0x55f800d71867 in vhost_virtqueue_stop (dev=dev@entry=0x55f8037ccc30, vdev=vdev@entry=0x55f8044ec590, vq=0x55f8037cceb0, idx=0) at ../hw/virtio/vhost.c:1241 5 0x55f800d7406c in vhost_dev_stop (hdev=hdev@entry=0x55f8037ccc30, vdev=vdev@entry=0x55f8044ec590) at ../hw/virtio/vhost.c:1839 6 0x55f800bf00a7 in vhost_net_stop_one (net=0x55f8037ccc30, dev=0x55f8044ec590) at ../hw/net/vhost_net.c:315 7 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:423 8 0x55f800d4e628 in virtio_net_set_status (status=, n=0x55f8044ec590) at ../hw/net/virtio-net.c:296 9 0x55f800d4e628 in virtio_net_set_status (vdev=vdev@entry=0x55f8044ec590, status=15 '\017') at ../hw/net/virtio-net.c:370 10 0x55f800d534d8 in virtio_net_handle_ctrl (iov_cnt=, iov=, cmd=0 '\000', n=0x55f8044ec590) at ../hw/net/virtio-net.c:1408 11 0x55f800d534d8 in virtio_net_handle_ctrl (vdev=0x55f8044ec590, vq=0x7fc1a7e888d0) at ../hw/net/virtio-net.c:1452 12 0x55f800d69f37 in virtio_queue_host_notifier_read (vq=0x7fc1a7e888d0) at ../hw/virtio/virtio.c:2331 13 0x55f800d69f37 in virtio_queue_host_notifier_read (n=n@entry=0x7fc1a7e8894c) at ../hw/virtio/virtio.c:3575 14 0x55f800c688e6 in virtio_bus_cleanup_host_notifier (bus=, n=n@entry=14) at ../hw/virtio/virtio-bus.c:312 15 0x55f800d73106 in vhost_dev_disable_notifiers (hdev=hdev@entry=0x55f8035b51b0, vdev=vdev@entry=0x55f8044ec590) at ../../../include/hw/virtio/virtio-bus.h:35 16 0x55f800bf00b2 in vhost_net_stop_one (net=0x55f8035b51b0, dev=0x55f8044ec590) at ../hw/net/vhost_net.c:316 17 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:423 18 0x55f800d4e628 in virtio_net_set_status (status=, n=0x55f8044ec590) at ../hw/net/virtio-net.c:296 19 0x55f800d4e628 in virtio_net_set_status (vdev=0x55f8044ec590, status=15 '\017') at ../hw/net/virtio-net.c:370 20 0x55f800d6c4b2 in virtio_set_status (vdev=0x55f8044ec590, val=) at ../hw/virtio/virtio.c:1945 21 0x55f800d11d9d in vm_state_notify (running=running@entry=false, state=state@entry=RUN_STATE_SHUTDOWN) at ../softmmu/runstate.c:333 22 0x55f800d04e7a in do_vm_stop (state=state@entry=RUN_STATE_SHUTDOWN, send_stop=send_stop@entry=false) at ../softmmu/cpus.c:262 23 0x55f800d04e99 in vm_shutdown () at ../softmmu/cpus.c:280 24 0x55f800d126af in qemu_cleanup () at ../softmmu/runstate.c:812 25 0x55f800ad5b13 in main (argc=, argv=, envp=) at ../softmmu/main.c:51 For now, temporarily disable handling MQ request from the ctrl_vq userspace hanlder to avoid the recursive virtio_net_set_status() call. 
Some rework is needed to allow changing the number of queues without going through a full virtio_net_set_status cycle, particularly for the vhost-vdpa backend. This patch will need to be reverted as soon as future patches that handle the change of #queues in userspace are merged. Fixes: 402378407db ("vhost-vdpa: multiqueue support") Signed-off-by: Si-Wei Liu --- hw/net/virtio-net.c | 13 + 1 file changed, 13 insertions(+) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index f0bb29c..e263116 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1381,6 +1381,7 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, { VirtIODevice *vdev = VIRTIO_DEVICE(n); uint16_t queue_pairs; +NetClientState *nc = qemu_get_queue(n->nic); virtio_net_disable_rss(n); if (cmd == VIRTIO_NET_CTRL_MQ_HASH_CONFIG) { @@ -1412,6 +1413,18 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } +/* Avoid changing the number of queue_pairs for vdpa device in + * userspace handler. A future fix is needed to handle the mq + * change in userspace handler with vhost-vdpa. Let's disable + * the mq handling from userspace for now and only allow get + * done through the kernel. Ripples may be seen when falling +
[PATCH v3 1/6] virtio-net: setup vhost_dev and notifiers for cvq only when feature is negotiated
When the control virtqueue feature is absent or not negotiated, vhost_net_start() still tries to set up vhost_dev and install vhost notifiers for the control virtqueue, which results in erroneous ioctl calls with an incorrect queue index being sent down to the driver. Do that only when needed. Fixes: 22288fe ("virtio-net: vhost control virtqueue support") Signed-off-by: Si-Wei Liu --- hw/net/virtio-net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 1067e72..ffb3475 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -245,7 +245,8 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) VirtIODevice *vdev = VIRTIO_DEVICE(n); NetClientState *nc = qemu_get_queue(n->nic); int queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; -int cvq = n->max_ncs - n->max_queue_pairs; +int cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; if (!get_vhost_net(nc->peer)) { return; -- 1.8.3.1
[PATCH v3 4/6] vhost-net: fix improper cleanup in vhost_net_start
vhost_net_start() missed a corresponding stop_one() upon error from vhost_set_vring_enable(). While at it, make the error handling for err_start more robust. No real issue was found due to this though. Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- hw/net/vhost_net.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c index 30379d2..d6d7c51 100644 --- a/hw/net/vhost_net.c +++ b/hw/net/vhost_net.c @@ -381,6 +381,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs, r = vhost_set_vring_enable(peer, peer->vring_enable); if (r < 0) { +vhost_net_stop_one(get_vhost_net(peer), dev); goto err_start; } } @@ -390,7 +391,8 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs, err_start: while (--i >= 0) { -peer = qemu_get_peer(ncs , i); +peer = qemu_get_peer(ncs, i < data_queue_pairs ? + i : n->max_queue_pairs); vhost_net_stop_one(get_vhost_net(peer), dev); } e = k->set_guest_notifiers(qbus->parent, total_notifiers, false); -- 1.8.3.1
[PATCH v4 1/7] virtio-net: setup vhost_dev and notifiers for cvq only when feature is negotiated
When the control virtqueue feature is absent or not negotiated, vhost_net_start() still tries to set up vhost_dev and install vhost notifiers for the control virtqueue, which results in erroneous ioctl calls with an incorrect queue index being sent down to the driver. Do that only when needed. Fixes: 22288fe ("virtio-net: vhost control virtqueue support") Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- hw/net/virtio-net.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 1067e72..ffb3475 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -245,7 +245,8 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) VirtIODevice *vdev = VIRTIO_DEVICE(n); NetClientState *nc = qemu_get_queue(n->nic); int queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; -int cvq = n->max_ncs - n->max_queue_pairs; +int cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; if (!get_vhost_net(nc->peer)) { return; -- 1.8.3.1
[PATCH v4 6/7] vhost-vdpa: change name and polarity for vhost_vdpa_one_time_request()
The name vhost_vdpa_one_time_request() was confusing. No matter whatever it returns, its typical occurrence had always been at requests that only need to be applied once. And the name didn't suggest what it actually checks for. Change it to vhost_vdpa_first_dev() with polarity flipped for better readibility of code. That way it is able to reflect what the check is really about. This call is applicable to request which performs operation only once, before queues are set up, and usually at the beginning of the caller function. Document the requirement for it in place. Signed-off-by: Si-Wei Liu --- hw/virtio/vhost-vdpa.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 6e3dbd9..33dcaa1 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -366,11 +366,18 @@ static void vhost_vdpa_get_iova_range(struct vhost_vdpa *v) v->iova_range.last); } -static bool vhost_vdpa_one_time_request(struct vhost_dev *dev) +/* + * The use of this function is for requests that only need to be + * applied once. Typically such request occurs at the beginning + * of operation, and before setting up queues. It should not be + * used for request that performs operation until all queues are + * set, which would need to check dev->vq_index_end instead. + */ +static bool vhost_vdpa_first_dev(struct vhost_dev *dev) { struct vhost_vdpa *v = dev->opaque; -return v->index != 0; +return v->index == 0; } static int vhost_vdpa_get_dev_features(struct vhost_dev *dev, @@ -451,7 +458,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp) vhost_vdpa_get_iova_range(v); -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } @@ -594,7 +601,7 @@ static int vhost_vdpa_memslots_limit(struct vhost_dev *dev) static int vhost_vdpa_set_mem_table(struct vhost_dev *dev, struct vhost_memory *mem) { -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } @@ -623,7 +630,7 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev, struct vhost_vdpa *v = dev->opaque; int ret; -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } @@ -665,7 +672,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev) features &= f; -if (!vhost_vdpa_one_time_request(dev)) { +if (vhost_vdpa_first_dev(dev)) { r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features); if (r) { return -EFAULT; @@ -1118,7 +1125,7 @@ static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base, struct vhost_log *log) { struct vhost_vdpa *v = dev->opaque; -if (v->shadow_vqs_enabled || vhost_vdpa_one_time_request(dev)) { +if (v->shadow_vqs_enabled || !vhost_vdpa_first_dev(dev)) { return 0; } @@ -1240,7 +1247,7 @@ static int vhost_vdpa_get_features(struct vhost_dev *dev, static int vhost_vdpa_set_owner(struct vhost_dev *dev) { -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_first_dev(dev)) { return 0; } -- 1.8.3.1
[PATCH v4 0/7] vhost-vdpa multiqueue fixes
Hi, This patch series attempts to fix a few issues in vhost-vdpa multiqueue functionality. Patches #1 and #2 are the formal submission for the RFC patch in: https://lore.kernel.org/qemu-devel/c3e931ee-1a1b-9c2f-2f59-cb4395c23...@oracle.com/ Patches #3 through #6 are obviously small bug fixes. Please find the description of each in the commit log. Patch #7 is a workaround fix for the QEMU segfault described in: https://lore.kernel.org/qemu-devel/4f2acb7a-d436-9d97-80b1-3308c1b39...@oracle.com/ Thanks, -Siwei --- v4: - split off the vhost_vdpa_set_backend_cap patch v3: - switch to LOG_GUEST_ERROR for guest trigger-able error - add temporary band-aid fix for QEMU crash due to recursive call v2: - split off vhost_dev notifier patch from "align ctrl_vq index for non-mq guest for vhost_vdpa" - change assert to error message - rename vhost_vdpa_one_time_request to vhost_vdpa_first_dev for clarity --- Si-Wei Liu (7): virtio-net: setup vhost_dev and notifiers for cvq only when feature is negotiated virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa vhost-net: fix improper cleanup in vhost_net_start vhost-vdpa: backend feature should set only once vhost-vdpa: change name and polarity for vhost_vdpa_one_time_request() virtio-net: don't handle mq request in userspace handler for vhost-vdpa hw/net/vhost_net.c | 4 +++- hw/net/virtio-net.c | 49 ++--- hw/virtio/vhost-vdpa.c | 23 +++ net/vhost-vdpa.c | 4 +++- 4 files changed, 67 insertions(+), 13 deletions(-) -- 1.8.3.1
[PATCH v4 3/7] vhost-vdpa: fix improper cleanup in net_init_vhost_vdpa
... such that no memory leaks on dangling net clients in case of error. Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- net/vhost-vdpa.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 1e9fe47..df1e69e 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -306,7 +306,9 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name, err: if (i) { -qemu_del_net_client(ncs[0]); +for (i--; i >= 0; i--) { +qemu_del_net_client(ncs[i]); +} } qemu_close(vdpa_device_fd); -- 1.8.3.1
[PATCH v4 2/7] virtio-net: align ctrl_vq index for non-mq guest for vhost_vdpa
With MQ enabled vdpa device and non-MQ supporting guest e.g. booting vdpa with mq=on over OVMF of single vqp, below assert failure is seen: ../hw/virtio/vhost-vdpa.c:560: vhost_vdpa_get_vq_index: Assertion `idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs' failed. 0 0x7f8ce3ff3387 in raise () at /lib64/libc.so.6 1 0x7f8ce3ff4a78 in abort () at /lib64/libc.so.6 2 0x7f8ce3fec1a6 in __assert_fail_base () at /lib64/libc.so.6 3 0x7f8ce3fec252 in () at /lib64/libc.so.6 4 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, idx=) at ../hw/virtio/vhost-vdpa.c:563 5 0x558f52d79421 in vhost_vdpa_get_vq_index (dev=, idx=) at ../hw/virtio/vhost-vdpa.c:558 6 0x558f52d7329a in vhost_virtqueue_mask (hdev=0x558f55c01800, vdev=0x558f568f91f0, n=2, mask=) at ../hw/virtio/vhost.c:1557 7 0x558f52c6b89a in virtio_pci_set_guest_notifier (d=d@entry=0x558f568f0f60, n=n@entry=2, assign=assign@entry=true, with_irqfd=with_irqfd@entry=false) at ../hw/virtio/virtio-pci.c:974 8 0x558f52c6c0d8 in virtio_pci_set_guest_notifiers (d=0x558f568f0f60, nvqs=3, assign=true) at ../hw/virtio/virtio-pci.c:1019 9 0x558f52bf091d in vhost_net_start (dev=dev@entry=0x558f568f91f0, ncs=0x558f56937cd0, data_queue_pairs=data_queue_pairs@entry=1, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:361 10 0x558f52d4e5e7 in virtio_net_set_status (status=, n=0x558f568f91f0) at ../hw/net/virtio-net.c:289 11 0x558f52d4e5e7 in virtio_net_set_status (vdev=0x558f568f91f0, status=15 '\017') at ../hw/net/virtio-net.c:370 12 0x558f52d6c4b2 in virtio_set_status (vdev=vdev@entry=0x558f568f91f0, val=val@entry=15 '\017') at ../hw/virtio/virtio.c:1945 13 0x558f52c69eff in virtio_pci_common_write (opaque=0x558f568f0f60, addr=, val=, size=) at ../hw/virtio/virtio-pci.c:1292 14 0x558f52d15d6e in memory_region_write_accessor (mr=0x558f568f19d0, addr=20, value=, size=1, shift=, mask=, attrs=...) at ../softmmu/memory.c:492 15 0x558f52d127de in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f8cdbffe748, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x558f52d15cf0 , mr=0x558f568f19d0, attrs=...) at ../softmmu/memory.c:554 16 0x558f52d157ef in memory_region_dispatch_write (mr=mr@entry=0x558f568f19d0, addr=20, data=, op=, attrs=attrs@entry=...) at ../softmmu/memory.c:1504 17 0x558f52d078e7 in flatview_write_continue (fv=fv@entry=0x7f8accbc3b90, addr=addr@entry=103079215124, attrs=..., ptr=ptr@entry=0x7f8ce6300028, len=len@entry=1, addr1=, l=, mr=0x558f568f19d0) at /home/opc/qemu-upstream/include/qemu/host-utils.h:165 18 0x558f52d07b06 in flatview_write (fv=0x7f8accbc3b90, addr=103079215124, attrs=..., buf=0x7f8ce6300028, len=1) at ../softmmu/physmem.c:2822 19 0x558f52d0b36b in address_space_write (as=, addr=, attrs=..., buf=buf@entry=0x7f8ce6300028, len=) at ../softmmu/physmem.c:2914 20 0x558f52d0b3da in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f8ce6300028, len=, is_write=) at ../softmmu/physmem.c:2924 21 0x558f52dced09 in kvm_cpu_exec (cpu=cpu@entry=0x558f55c2da60) at ../accel/kvm/kvm-all.c:2903 22 0x558f52dcfabd in kvm_vcpu_thread_fn (arg=arg@entry=0x558f55c2da60) at ../accel/kvm/kvm-accel-ops.c:49 23 0x558f52f9f04a in qemu_thread_start (args=) at ../util/qemu-thread-posix.c:556 24 0x7f8ce4392ea5 in start_thread () at /lib64/libpthread.so.0 25 0x7f8ce40bb9fd in clone () at /lib64/libc.so.6 The cause for the assert failure is due to that the vhost_dev index for the ctrl vq was not aligned with actual one in use by the guest. 
Upon multiqueue feature negotiation in virtio_net_set_multiqueue(), if guest doesn't support multiqueue, the guest vq layout would shrink to a single queue pair, consisting of 3 vqs in total (rx, tx and ctrl). This results in ctrl_vq taking a different vhost_dev group index than the default. We can map vq to the correct vhost_dev group by checking if MQ is supported by guest and successfully negotiated. Since the MQ feature is only present along with CTRL_VQ, we ensure the index 2 is only meant for the control vq while MQ is not supported by guest. Fixes: 22288fe ("virtio-net: vhost control virtqueue support") Suggested-by: Jason Wang Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- hw/net/virtio-net.c | 33 +++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index ffb3475..f0bb29c 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -14,6 +14,7 @@ #include "qemu/osdep.h" #include "qemu/atomic.h" #include "qemu/iov.h" +#include "qemu/log.h" #include "qemu/main-loop.h" #include "qemu/module.h" #include "hw/virtio/virtio.h" @@ -3171,8 +3172,22 @@ static NetClientInfo net_virtio_info = { static
[PATCH v4 4/7] vhost-net: fix improper cleanup in vhost_net_start
vhost_net_start() missed a corresponding stop_one() upon error from vhost_set_vring_enable(). While at it, make the error handling for err_start more robust. No real issue was found due to this though. Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- hw/net/vhost_net.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c index 30379d2..d6d7c51 100644 --- a/hw/net/vhost_net.c +++ b/hw/net/vhost_net.c @@ -381,6 +381,7 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs, r = vhost_set_vring_enable(peer, peer->vring_enable); if (r < 0) { +vhost_net_stop_one(get_vhost_net(peer), dev); goto err_start; } } @@ -390,7 +391,8 @@ int vhost_net_start(VirtIODevice *dev, NetClientState *ncs, err_start: while (--i >= 0) { -peer = qemu_get_peer(ncs , i); +peer = qemu_get_peer(ncs, i < data_queue_pairs ? + i : n->max_queue_pairs); vhost_net_stop_one(get_vhost_net(peer), dev); } e = k->set_guest_notifiers(qbus->parent, total_notifiers, false); -- 1.8.3.1
[PATCH v4 5/7] vhost-vdpa: backend feature should set only once
The vhost_vdpa_one_time_request() branch in vhost_vdpa_set_backend_cap() incorrectly sends down ioctls on vhost_dev with non-zero index. This may end up with multiple VHOST_SET_BACKEND_FEATURES ioctl calls sent down on the vhost-vdpa fd that is shared between all these vhost_dev's. To fix it, send down the ioctl only once, via the first vhost_dev with index 0. Toggling the polarity of the vhost_vdpa_one_time_request() test should do the trick. Fixes: 4d191cfdc7de ("vhost-vdpa: classify one time request") Signed-off-by: Si-Wei Liu Reviewed-by: Stefano Garzarella Acked-by: Jason Wang Acked-by: Eugenio Pérez --- hw/virtio/vhost-vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 8adf7c0..6e3dbd9 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -665,7 +665,7 @@ static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev) features &= f; -if (vhost_vdpa_one_time_request(dev)) { +if (!vhost_vdpa_one_time_request(dev)) { r = vhost_vdpa_call(dev, VHOST_SET_BACKEND_FEATURES, &features); if (r) { return -EFAULT; -- 1.8.3.1
[PATCH v4 7/7] virtio-net: don't handle mq request in userspace handler for vhost-vdpa
virtio_queue_host_notifier_read() tends to read pending event left behind on ioeventfd in the vhost_net_stop() path, and attempts to handle outstanding kicks from userspace vq handler. However, in the ctrl_vq handler, virtio_net_handle_mq() has a recursive call into virtio_net_set_status(), which may lead to segmentation fault as shown in below stack trace: 0 0x55f800df1780 in qdev_get_parent_bus (dev=0x0) at ../hw/core/qdev.c:376 1 0x55f800c68ad8 in virtio_bus_device_iommu_enabled (vdev=vdev@entry=0x0) at ../hw/virtio/virtio-bus.c:331 2 0x55f800d70d7f in vhost_memory_unmap (dev=) at ../hw/virtio/vhost.c:318 3 0x55f800d70d7f in vhost_memory_unmap (dev=, buffer=0x7fc19bec5240, len=2052, is_write=1, access_len=2052) at ../hw/virtio/vhost.c:336 4 0x55f800d71867 in vhost_virtqueue_stop (dev=dev@entry=0x55f8037ccc30, vdev=vdev@entry=0x55f8044ec590, vq=0x55f8037cceb0, idx=0) at ../hw/virtio/vhost.c:1241 5 0x55f800d7406c in vhost_dev_stop (hdev=hdev@entry=0x55f8037ccc30, vdev=vdev@entry=0x55f8044ec590) at ../hw/virtio/vhost.c:1839 6 0x55f800bf00a7 in vhost_net_stop_one (net=0x55f8037ccc30, dev=0x55f8044ec590) at ../hw/net/vhost_net.c:315 7 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:423 8 0x55f800d4e628 in virtio_net_set_status (status=, n=0x55f8044ec590) at ../hw/net/virtio-net.c:296 9 0x55f800d4e628 in virtio_net_set_status (vdev=vdev@entry=0x55f8044ec590, status=15 '\017') at ../hw/net/virtio-net.c:370 10 0x55f800d534d8 in virtio_net_handle_ctrl (iov_cnt=, iov=, cmd=0 '\000', n=0x55f8044ec590) at ../hw/net/virtio-net.c:1408 11 0x55f800d534d8 in virtio_net_handle_ctrl (vdev=0x55f8044ec590, vq=0x7fc1a7e888d0) at ../hw/net/virtio-net.c:1452 12 0x55f800d69f37 in virtio_queue_host_notifier_read (vq=0x7fc1a7e888d0) at ../hw/virtio/virtio.c:2331 13 0x55f800d69f37 in virtio_queue_host_notifier_read (n=n@entry=0x7fc1a7e8894c) at ../hw/virtio/virtio.c:3575 14 0x55f800c688e6 in virtio_bus_cleanup_host_notifier (bus=, n=n@entry=14) at ../hw/virtio/virtio-bus.c:312 15 0x55f800d73106 in vhost_dev_disable_notifiers (hdev=hdev@entry=0x55f8035b51b0, vdev=vdev@entry=0x55f8044ec590) at ../../../include/hw/virtio/virtio-bus.h:35 16 0x55f800bf00b2 in vhost_net_stop_one (net=0x55f8035b51b0, dev=0x55f8044ec590) at ../hw/net/vhost_net.c:316 17 0x55f800bf0678 in vhost_net_stop (dev=dev@entry=0x55f8044ec590, ncs=0x55f80452bae0, data_queue_pairs=data_queue_pairs@entry=7, cvq=cvq@entry=1) at ../hw/net/vhost_net.c:423 18 0x55f800d4e628 in virtio_net_set_status (status=, n=0x55f8044ec590) at ../hw/net/virtio-net.c:296 19 0x55f800d4e628 in virtio_net_set_status (vdev=0x55f8044ec590, status=15 '\017') at ../hw/net/virtio-net.c:370 20 0x55f800d6c4b2 in virtio_set_status (vdev=0x55f8044ec590, val=) at ../hw/virtio/virtio.c:1945 21 0x55f800d11d9d in vm_state_notify (running=running@entry=false, state=state@entry=RUN_STATE_SHUTDOWN) at ../softmmu/runstate.c:333 22 0x55f800d04e7a in do_vm_stop (state=state@entry=RUN_STATE_SHUTDOWN, send_stop=send_stop@entry=false) at ../softmmu/cpus.c:262 23 0x55f800d04e99 in vm_shutdown () at ../softmmu/cpus.c:280 24 0x55f800d126af in qemu_cleanup () at ../softmmu/runstate.c:812 25 0x55f800ad5b13 in main (argc=, argv=, envp=) at ../softmmu/main.c:51 For now, temporarily disable handling MQ request from the ctrl_vq userspace hanlder to avoid the recursive virtio_net_set_status() call. 
Some rework is needed to allow changing the number of queues without going through a full virtio_net_set_status cycle, particularly for the vhost-vdpa backend. This patch will need to be reverted as soon as future patches that handle the change of #queues in userspace are merged. Fixes: 402378407db ("vhost-vdpa: multiqueue support") Signed-off-by: Si-Wei Liu Acked-by: Jason Wang --- hw/net/virtio-net.c | 13 + 1 file changed, 13 insertions(+) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index f0bb29c..099e650 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1381,6 +1381,7 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, { VirtIODevice *vdev = VIRTIO_DEVICE(n); uint16_t queue_pairs; +NetClientState *nc = qemu_get_queue(n->nic); virtio_net_disable_rss(n); if (cmd == VIRTIO_NET_CTRL_MQ_HASH_CONFIG) { @@ -1412,6 +1413,18 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } +/* Avoid changing the number of queue_pairs for vdpa device in + * userspace handler. A future fix is needed to handle the mq + * change in userspace handler with vhost-vdpa. Let's disable + * the mq handling from userspace for now and only allow get + * done through the kernel. Ripples may be
Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends
On 8/23/2022 9:27 PM, Jason Wang wrote: 在 2022/8/20 01:13, Eugenio Pérez 写道: It was returned as error before. Instead of it, simply update the corresponding field so qemu can send it in the migration data. Signed-off-by: Eugenio Pérez --- Looks correct. Adding Si Wei for double check. Hmmm, I understand why this change is needed for live migration, but this would easily cause userspace out of sync with the kernel for other use cases, such as link down or userspace fallback due to vdpa ioctl error. Yes, these are edge cases. Not completely against it, but I wonder if there's a way we can limit the change scope to live migration case only? -Siwei Thanks hw/net/virtio-net.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index dd0d056fde..63a8332cd0 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } - /* Avoid changing the number of queue_pairs for vdpa device in - * userspace handler. A future fix is needed to handle the mq - * change in userspace handler with vhost-vdpa. Let's disable - * the mq handling from userspace for now and only allow get - * done through the kernel. Ripples may be seen when falling - * back to userspace, but without doing it qemu process would - * crash on a recursive entry to virtio_net_set_status(). - */ + n->curr_queue_pairs = queue_pairs; if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - return VIRTIO_NET_ERR; + /* + * Avoid updating the backend for a vdpa device: We're only interested + * in updating the device model queues. + */ + return VIRTIO_NET_OK; } - - n->curr_queue_pairs = queue_pairs; /* stop the backend before changing the number of queue_pairs to avoid handling a * disabled queue */ virtio_net_set_status(vdev, vdev->status);
Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends
Hi Jason, On 8/24/2022 7:53 PM, Jason Wang wrote: On Thu, Aug 25, 2022 at 8:38 AM Si-Wei Liu wrote: On 8/23/2022 9:27 PM, Jason Wang wrote: 在 2022/8/20 01:13, Eugenio Pérez 写道: It was returned as error before. Instead of it, simply update the corresponding field so qemu can send it in the migration data. Signed-off-by: Eugenio Pérez --- Looks correct. Adding Si Wei for double check. Hmmm, I understand why this change is needed for live migration, but this would easily cause userspace out of sync with the kernel for other use cases, such as link down or userspace fallback due to vdpa ioctl error. Yes, these are edge cases. Considering 7.2 will start, maybe it's time to fix the root cause instead of having a workaround like this? The fix for the immediate cause is not hard, though what is missing from my WIP series for a full blown fix is something similar to Shadow CVQ for all general cases than just live migration: QEMU will have to apply the curr_queue_pairs to the kernel once switched back from the userspace virtqueues. I think Shadow CVQ won't work if ASID support is missing from kernel. Do you think if it bother to build another similar facility, or we reuse Shadow CVQ code to make it work without ASID support? I have been a bit busy with internal project for the moment, but I hope I can post my series next week. Here's what I have for the relevant patches from the WIP series. diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index dd0d056..16edfa3 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -361,16 +361,13 @@ static void virtio_net_drop_tx_queue_data(VirtIODevice *vdev, VirtQueue *vq) } } -static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status) +static void virtio_net_queue_status(struct VirtIONet *n, uint8_t status) { - VirtIONet *n = VIRTIO_NET(vdev); + VirtIODevice *vdev = VIRTIO_DEVICE(n); VirtIONetQueue *q; int i; uint8_t queue_status; - virtio_net_vnet_endian_status(n, status); - virtio_net_vhost_status(n, status); - for (i = 0; i < n->max_queue_pairs; i++) { NetClientState *ncs = qemu_get_subqueue(n->nic, i); bool queue_started; @@ -418,6 +415,15 @@ static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status) } } +static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status) +{ + VirtIONet *n = VIRTIO_NET(vdev); + + virtio_net_vnet_endian_status(n, status); + virtio_net_vhost_status(n, status); + virtio_net_queue_status(n, status); +} + static void virtio_net_set_link_status(NetClientState *nc) { VirtIONet *n = qemu_get_nic_opaque(nc); @@ -1380,7 +1386,6 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, { VirtIODevice *vdev = VIRTIO_DEVICE(n); uint16_t queue_pairs; - NetClientState *nc = qemu_get_queue(n->nic); virtio_net_disable_rss(n); if (cmd == VIRTIO_NET_CTRL_MQ_HASH_CONFIG) { @@ -1412,22 +1417,10 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } - /* Avoid changing the number of queue_pairs for vdpa device in - * userspace handler. A future fix is needed to handle the mq - * change in userspace handler with vhost-vdpa. Let's disable - * the mq handling from userspace for now and only allow get - * done through the kernel. Ripples may be seen when falling - * back to userspace, but without doing it qemu process would - * crash on a recursive entry to virtio_net_set_status(). 
- */ - if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - return VIRTIO_NET_ERR; - } - n->curr_queue_pairs = queue_pairs; /* stop the backend before changing the number of queue_pairs to avoid handling a * disabled queue */ - virtio_net_set_status(vdev, vdev->status); + virtio_net_queue_status(n, vdev->status); virtio_net_set_queue_pairs(n); return VIRTIO_NET_OK; Regards, -Siwei THanks Not completely against it, but I wonder if there's a way we can limit the change scope to live migration case only? -Siwei Thanks hw/net/virtio-net.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index dd0d056fde..63a8332cd0 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } -/* Avoid changing the number of queue_pairs for vdpa device in - * userspace handler. A future fix is needed to handle the mq - * change in userspace handler with vhost-vdpa. Let's disable - * the mq handling from userspace for now and only allow get - * done through the kernel. Ripples may be seen when falling - * back to userspace, but wi
Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends
On 8/24/2022 8:05 PM, Jason Wang wrote: On Thu, Aug 25, 2022 at 10:53 AM Jason Wang wrote: On Thu, Aug 25, 2022 at 8:38 AM Si-Wei Liu wrote: On 8/23/2022 9:27 PM, Jason Wang wrote: 在 2022/8/20 01:13, Eugenio Pérez 写道: It was returned as error before. Instead of it, simply update the corresponding field so qemu can send it in the migration data. Signed-off-by: Eugenio Pérez --- Looks correct. Adding Si Wei for double check. Hmmm, I understand why this change is needed for live migration, but this would easily cause userspace out of sync with the kernel for other use cases, such as link down or userspace fallback due to vdpa ioctl error. Yes, these are edge cases. Considering 7.2 will start, maybe it's time to fix the root cause instead of having a workaround like this? Btw, the patch actually tries its best to limit the behaviour, e.g it doesn't do the following set_status() stuff. So I think it won't trigger the issue you mentioned here? Well, we can claim we don't support the link down+up case while changing queue numbers in between. On the other hand, the error recovery from fallback userspace is another story, which would need more attention and care on the error path. Yes, if see it from that perspective the change is fine. For completeness, please refer to the patch in the other email. -Siwei Thanks THanks Not completely against it, but I wonder if there's a way we can limit the change scope to live migration case only? -Siwei Thanks hw/net/virtio-net.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index dd0d056fde..63a8332cd0 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -1412,19 +1412,14 @@ static int virtio_net_handle_mq(VirtIONet *n, uint8_t cmd, return VIRTIO_NET_ERR; } -/* Avoid changing the number of queue_pairs for vdpa device in - * userspace handler. A future fix is needed to handle the mq - * change in userspace handler with vhost-vdpa. Let's disable - * the mq handling from userspace for now and only allow get - * done through the kernel. Ripples may be seen when falling - * back to userspace, but without doing it qemu process would - * crash on a recursive entry to virtio_net_set_status(). - */ +n->curr_queue_pairs = queue_pairs; if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -return VIRTIO_NET_ERR; +/* + * Avoid updating the backend for a vdpa device: We're only interested + * in updating the device model queues. + */ +return VIRTIO_NET_OK; } - -n->curr_queue_pairs = queue_pairs; /* stop the backend before changing the number of queue_pairs to avoid handling a * disabled queue */ virtio_net_set_status(vdev, vdev->status);
Re: [PATCH 4/5] virtio-net: Update virtio-net curr_queue_pairs in vdpa backends
On 8/24/2022 11:19 PM, Eugenio Perez Martin wrote: On Thu, Aug 25, 2022 at 2:38 AM Si-Wei Liu wrote: On 8/23/2022 9:27 PM, Jason Wang wrote: 在 2022/8/20 01:13, Eugenio Pérez 写道: It was returned as error before. Instead of it, simply update the corresponding field so qemu can send it in the migration data. Signed-off-by: Eugenio Pérez --- Looks correct. Adding Si Wei for double check. Hmmm, I understand why this change is needed for live migration, but this would easily cause userspace out of sync with the kernel for other use cases, such as link down or userspace fallback due to vdpa ioctl error. Yes, these are edge cases. The link down case is not possible at this moment because that cvq command does not call virtio_net_handle_ctrl_iov. Right. Though shadow cvq would need to rely on extra ASID support from kernel. For the case without shadow cvq we still need to look for an alternative mechanism. A similar treatment than mq would be needed when supported, and the call to virtio_net_set_status will be avoided. So, maybe the seemingly "right" fix for the moment is to prohibit manual set_link at all (for vDPA only)? In longer term we'd need to come up with appropriate support for applying mq config regardless of asid or shadow cvq support. I'll double check device initialization ioctl failure with n->curr_queue_pairs > 1 in the destination, but I think we should be safe. Not completely against it, but I wonder if there's a way we can limit the change scope to live migration case only? The reason to update the device model is to send the curr_queue_pairs to the destination in a backend agnostic way. To send it otherwise would limit the live migration possibilities, but sure we can explore another way. A hacky workaround that came off the top of my head was to allow sending curr_queue_pairs for the !vm_running case for vdpa. It doesn't look it would affect other backend I think. But I agree with Jason, this doesn't look decent so I give up on this idea. Hence for this patch, Acked-by: Si-Wei Liu Thanks!
Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote: On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu wrote: On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote: On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu wrote: On 1/12/2023 9:24 AM, Eugenio Pérez wrote: This allows net to restart the device backend to configure SVQ on it. Ideally, these changes should not be net specific. However, the vdpa net backend is the one with enough knowledge to configure everything because of some reasons: * Queues might need to be shadowed or not depending on its kind (control vs data). * Queues need to share the same map translations (iova tree). Because of that it is cleaner to restart the whole net backend and configure again as expected, similar to how vhost-kernel moves between userspace and passthrough. If more kinds of devices need dynamic switching to SVQ we can create a callback struct like VhostOps and move most of the code there. VhostOps cannot be reused since all vdpa backend share them, and to personalize just for networking would be too heavy. Signed-off-by: Eugenio Pérez --- net/vhost-vdpa.c | 84 1 file changed, 84 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 5d7ad6e4d7..f38532b1df 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -26,6 +26,8 @@ #include #include "standard-headers/linux/virtio_net.h" #include "monitor/monitor.h" +#include "migration/migration.h" +#include "migration/misc.h" #include "migration/blocker.h" #include "hw/virtio/vhost.h" @@ -33,6 +35,7 @@ typedef struct VhostVDPAState { NetClientState nc; struct vhost_vdpa vhost_vdpa; +Notifier migration_state; Error *migration_blocker; VHostNetState *vhost_net; @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) return DO_UPCAST(VhostVDPAState, nc, nc0); } +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; +VirtIONet *n; +VirtIODevice *vdev; +int data_queue_pairs, cvq, r; +NetClientState *peer; + +/* We are only called on the first data vqs and only if x-svq is not set */ +if (s->vhost_vdpa.shadow_vqs_enabled == enable) { +return; +} + +vdev = v->dev->vdev; +n = VIRTIO_NET(vdev); +if (!n->vhost_started) { +return; +} + +if (enable) { +ioctl(v->device_fd, VHOST_VDPA_SUSPEND); +} +data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; +cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; +vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); + +peer = s->nc.peer; +for (int i = 0; i < data_queue_pairs + cvq; i++) { +VhostVDPAState *vdpa_state; +NetClientState *nc; + +if (i < data_queue_pairs) { +nc = qemu_get_peer(peer, i); +} else { +nc = qemu_get_peer(peer, n->max_queue_pairs); +} + +vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); +vdpa_state->vhost_vdpa.shadow_data = enable; + +if (i < data_queue_pairs) { +/* Do not override CVQ shadow_vqs_enabled */ +vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; +} +} + +r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); As the first revision, this method (vhost_net_stop followed by vhost_net_start) should be fine for software vhost-vdpa backend for e.g. vp_vdpa and vdpa_sim_net. However, I would like to get your attention that this method implies substantial blackout time for mode switching on real hardware - get a full cycle of device reset of getting memory mappings torn down, unpin & repin same set of pages, and set up new mapping would take very significant amount of time, especially for a large VM. 
Maybe we can do: Right, I think this is something that deserves optimization in the future. Note that we must replace the mappings anyway, with all passthrough queues stopped. Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq keep mapping to the same GPA where passthrough data virtqueues were associated with across switch (so that the mode switch is transparent to the guest)? I don't get this question, SVQ switching is already transparent to the guest. Never mind, you seem to have answered the question in the reply here and below. I was thinking of possibility to do incremental in-place update for a given IOVA range with one single call (for the on-chip IOMMU case), instead of separate unmap() and map() calls. Things like .set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps) as I ever mentioned. For platform IOMMU the ma
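To make the in-place replacement idea concrete, here is a rough sketch of what such a primitive could look like. This is illustrative only: neither the callback nor the types below exist in today's vdpa framework, and the parameter list simply mirrors the shape floated in the discussion (asid, IOVA range, replacement translations).

#include <stdint.h>
#include <stddef.h>

/* Hypothetical replacement entry; fields mirror a vhost IOTLB message. */
struct iotlb_map {
    uint64_t iova;   /* start of the IOVA range seen by the device */
    uint64_t size;   /* length of the range in bytes */
    uint64_t addr;   /* new backing address (HVA or device-physical) */
    uint8_t  perm;   /* access permissions for the new mapping */
};

struct vdpa_device;  /* opaque stand-in for the real device object */

/*
 * Hypothetical ".set_map_replace" op: swap the translations backing
 * [iova_start, iova_start + size) in address space `asid` with `maps`
 * in one call, so an on-chip IOMMU could patch its tables in place with
 * no unmapped window and without dropping page pins.
 */
typedef int (*set_map_replace_fn)(struct vdpa_device *vdev, uint32_t asid,
                                  uint64_t iova_start, uint64_t size,
                                  const struct iotlb_map *maps, size_t nmaps);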
Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start
On 2/13/2023 3:14 AM, Eugenio Perez Martin wrote: On Mon, Feb 13, 2023 at 7:51 AM Si-Wei Liu wrote: On 2/8/2023 1:42 AM, Eugenio Pérez wrote: Only create iova_tree if and when it is needed. The cleanup keeps being responsible of last VQ but this change allows it to merge both cleanup functions. Signed-off-by: Eugenio Pérez Acked-by: Jason Wang --- net/vhost-vdpa.c | 99 ++-- 1 file changed, 71 insertions(+), 28 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index de5ed8ff22..a9e6c8f28e 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -178,13 +178,9 @@ err_init: static void vhost_vdpa_cleanup(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); -struct vhost_dev *dev = &s->vhost_net->dev; qemu_vfree(s->cvq_cmd_out_buffer); qemu_vfree(s->status); -if (dev->vq_index + dev->nvqs == dev->vq_index_end) { -g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); -} if (s->vhost_net) { vhost_net_cleanup(s->vhost_net); g_free(s->vhost_net); @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, return size; } +/** From any vdpa net client, get the netclient of first queue pair */ +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +NICState *nic = qemu_get_nic(s->nc.peer); +NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); + +return DO_UPCAST(VhostVDPAState, nc, nc0); +} + +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; + +if (v->shadow_vqs_enabled) { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); +} +} + +static int vhost_vdpa_net_data_start(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_vdpa *v = &s->vhost_vdpa; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +if (v->index == 0) { +vhost_vdpa_net_data_start_first(s); +return 0; +} + +if (v->shadow_vqs_enabled) { +VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s); +v->iova_tree = s0->vhost_vdpa.iova_tree; +} + +return 0; +} + +static void vhost_vdpa_net_client_stop(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_dev *dev; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +dev = s->vhost_vdpa.dev; +if (dev->vq_index + dev->nvqs == dev->vq_index_end) { +g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); +} +} + static NetClientInfo net_vhost_vdpa_info = { .type = NET_CLIENT_DRIVER_VHOST_VDPA, .size = sizeof(VhostVDPAState), .receive = vhost_vdpa_receive, +.start = vhost_vdpa_net_data_start, +.stop = vhost_vdpa_net_client_stop, .cleanup = vhost_vdpa_cleanup, .has_vnet_hdr = vhost_vdpa_has_vnet_hdr, .has_ufo = vhost_vdpa_has_ufo, @@ -351,7 +401,7 @@ dma_map_err: static int vhost_vdpa_net_cvq_start(NetClientState *nc) { -VhostVDPAState *s; +VhostVDPAState *s, *s0; struct vhost_vdpa *v; uint64_t backend_features; int64_t cvq_group; @@ -425,6 +475,15 @@ out: return 0; } +s0 = vhost_vdpa_net_first_nc_vdpa(s); +if (s0->vhost_vdpa.iova_tree) { +/* SVQ is already configured for all virtqueues */ +v->iova_tree = s0->vhost_vdpa.iova_tree; +} else { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); I wonder how this case could happen, vhost_vdpa_net_data_start_first() should've allocated an iova tree on the first data vq. Is zero data vq ever possible on net vhost-vdpa? It's the case of the current qemu master when only CVQ is being shadowed. 
It's not that "there are no data vq": If that case were possible, CVQ vhost-vdpa state would be s0. The case is that since only CVQ vhost-vdpa is the one being migrated, only CVQ has an iova tree. OK, so this corresponds to the case where live migration is not started and CVQ starts in its own address space of VHOST_VDPA_NET_CVQ_ASID. Thanks for explaining it! With this series applied and with no migration running, the case is the same as before: only SVQ gets shadowed. When migration starts, all vqs are migrated, and share iova tree. I wonder what is the reason to share the iova tree when migration starts, I think CVQ may stay on its own VHOST_VDPA_NET_CVQ_ASID still? Actually there's discrepancy in vhost_vdpa_net_log_global_enable(), I don't see explicit c
Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start
On 2/14/2023 11:07 AM, Eugenio Perez Martin wrote: On Tue, Feb 14, 2023 at 2:45 AM Si-Wei Liu wrote: On 2/13/2023 3:14 AM, Eugenio Perez Martin wrote: On Mon, Feb 13, 2023 at 7:51 AM Si-Wei Liu wrote: On 2/8/2023 1:42 AM, Eugenio Pérez wrote: Only create iova_tree if and when it is needed. The cleanup keeps being responsible of last VQ but this change allows it to merge both cleanup functions. Signed-off-by: Eugenio Pérez Acked-by: Jason Wang --- net/vhost-vdpa.c | 99 ++-- 1 file changed, 71 insertions(+), 28 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index de5ed8ff22..a9e6c8f28e 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -178,13 +178,9 @@ err_init: static void vhost_vdpa_cleanup(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); -struct vhost_dev *dev = &s->vhost_net->dev; qemu_vfree(s->cvq_cmd_out_buffer); qemu_vfree(s->status); -if (dev->vq_index + dev->nvqs == dev->vq_index_end) { -g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); -} if (s->vhost_net) { vhost_net_cleanup(s->vhost_net); g_free(s->vhost_net); @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, return size; } +/** From any vdpa net client, get the netclient of first queue pair */ +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +NICState *nic = qemu_get_nic(s->nc.peer); +NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); + +return DO_UPCAST(VhostVDPAState, nc, nc0); +} + +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; + +if (v->shadow_vqs_enabled) { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); +} +} + +static int vhost_vdpa_net_data_start(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_vdpa *v = &s->vhost_vdpa; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +if (v->index == 0) { +vhost_vdpa_net_data_start_first(s); +return 0; +} + +if (v->shadow_vqs_enabled) { +VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s); +v->iova_tree = s0->vhost_vdpa.iova_tree; +} + +return 0; +} + +static void vhost_vdpa_net_client_stop(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_dev *dev; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +dev = s->vhost_vdpa.dev; +if (dev->vq_index + dev->nvqs == dev->vq_index_end) { +g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); +} +} + static NetClientInfo net_vhost_vdpa_info = { .type = NET_CLIENT_DRIVER_VHOST_VDPA, .size = sizeof(VhostVDPAState), .receive = vhost_vdpa_receive, +.start = vhost_vdpa_net_data_start, +.stop = vhost_vdpa_net_client_stop, .cleanup = vhost_vdpa_cleanup, .has_vnet_hdr = vhost_vdpa_has_vnet_hdr, .has_ufo = vhost_vdpa_has_ufo, @@ -351,7 +401,7 @@ dma_map_err: static int vhost_vdpa_net_cvq_start(NetClientState *nc) { -VhostVDPAState *s; +VhostVDPAState *s, *s0; struct vhost_vdpa *v; uint64_t backend_features; int64_t cvq_group; @@ -425,6 +475,15 @@ out: return 0; } +s0 = vhost_vdpa_net_first_nc_vdpa(s); +if (s0->vhost_vdpa.iova_tree) { +/* SVQ is already configured for all virtqueues */ +v->iova_tree = s0->vhost_vdpa.iova_tree; +} else { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); I wonder how this case could happen, vhost_vdpa_net_data_start_first() should've allocated an iova tree on the first data vq. Is zero data vq ever possible on net vhost-vdpa? 
It's the case of the current qemu master when only CVQ is being shadowed. It's not that "there are no data vq": If that case were possible, CVQ vhost-vdpa state would be s0. The case is that since only CVQ vhost-vdpa is the one being migrated, only CVQ has an iova tree. OK, so this corresponds to the case where live migration is not started and CVQ starts in its own address space of VHOST_VDPA_NET_CVQ_ASID. Thanks for explaining it! With this series applied and with no migration running, the case is the same as before: only SVQ gets shadowed. When migration starts, all vqs are migrated, and share iova tree. I wonder what is the reason to share the iova tree when migration starts, I think CVQ may stay on its own VHOST_VD
Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start
On 2/15/2023 11:35 PM, Eugenio Perez Martin wrote: On Thu, Feb 16, 2023 at 3:15 AM Si-Wei Liu wrote: On 2/14/2023 11:07 AM, Eugenio Perez Martin wrote: On Tue, Feb 14, 2023 at 2:45 AM Si-Wei Liu wrote: On 2/13/2023 3:14 AM, Eugenio Perez Martin wrote: On Mon, Feb 13, 2023 at 7:51 AM Si-Wei Liu wrote: On 2/8/2023 1:42 AM, Eugenio Pérez wrote: Only create iova_tree if and when it is needed. The cleanup keeps being responsible of last VQ but this change allows it to merge both cleanup functions. Signed-off-by: Eugenio Pérez Acked-by: Jason Wang --- net/vhost-vdpa.c | 99 ++-- 1 file changed, 71 insertions(+), 28 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index de5ed8ff22..a9e6c8f28e 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -178,13 +178,9 @@ err_init: static void vhost_vdpa_cleanup(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); -struct vhost_dev *dev = &s->vhost_net->dev; qemu_vfree(s->cvq_cmd_out_buffer); qemu_vfree(s->status); -if (dev->vq_index + dev->nvqs == dev->vq_index_end) { -g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); -} if (s->vhost_net) { vhost_net_cleanup(s->vhost_net); g_free(s->vhost_net); @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, return size; } +/** From any vdpa net client, get the netclient of first queue pair */ +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +NICState *nic = qemu_get_nic(s->nc.peer); +NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); + +return DO_UPCAST(VhostVDPAState, nc, nc0); +} + +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; + +if (v->shadow_vqs_enabled) { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); +} +} + +static int vhost_vdpa_net_data_start(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_vdpa *v = &s->vhost_vdpa; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +if (v->index == 0) { +vhost_vdpa_net_data_start_first(s); +return 0; +} + +if (v->shadow_vqs_enabled) { +VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s); +v->iova_tree = s0->vhost_vdpa.iova_tree; +} + +return 0; +} + +static void vhost_vdpa_net_client_stop(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_dev *dev; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +dev = s->vhost_vdpa.dev; +if (dev->vq_index + dev->nvqs == dev->vq_index_end) { +g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); +} +} + static NetClientInfo net_vhost_vdpa_info = { .type = NET_CLIENT_DRIVER_VHOST_VDPA, .size = sizeof(VhostVDPAState), .receive = vhost_vdpa_receive, +.start = vhost_vdpa_net_data_start, +.stop = vhost_vdpa_net_client_stop, .cleanup = vhost_vdpa_cleanup, .has_vnet_hdr = vhost_vdpa_has_vnet_hdr, .has_ufo = vhost_vdpa_has_ufo, @@ -351,7 +401,7 @@ dma_map_err: static int vhost_vdpa_net_cvq_start(NetClientState *nc) { -VhostVDPAState *s; +VhostVDPAState *s, *s0; struct vhost_vdpa *v; uint64_t backend_features; int64_t cvq_group; @@ -425,6 +475,15 @@ out: return 0; } +s0 = vhost_vdpa_net_first_nc_vdpa(s); +if (s0->vhost_vdpa.iova_tree) { +/* SVQ is already configured for all virtqueues */ +v->iova_tree = s0->vhost_vdpa.iova_tree; +} else { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); I wonder how this case could happen, vhost_vdpa_net_data_start_first() should've allocated an iova 
tree on the first data vq. Is zero data vq ever possible on net vhost-vdpa? It's the case of the current qemu master when only CVQ is being shadowed. It's not that "there are no data vq": If that case were possible, CVQ vhost-vdpa state would be s0. The case is that since only CVQ vhost-vdpa is the one being migrated, only CVQ has an iova tree. OK, so this corresponds to the case where live migration is not started and CVQ starts in its own address space of VHOST_VDPA_NET_CVQ_ASID. Thanks for explaining it! With this series applied and with no migration running, the case is the same as before: only SVQ gets shadowed. When migration starts, all vqs are migrated
[PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset
The citing commit has incorrect code in vhost_vdpa_receive() that returns zero instead of full packet size to the caller. This renders pending packets unable to be freed so then get clogged in the tx queue forever. When device is being reset later on, below assertion failure ensues: 0 0x7f86d53bb387 in raise () from /lib64/libc.so.6 1 0x7f86d53bca78 in abort () from /lib64/libc.so.6 2 0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6 3 0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6 4 0x55b8f6ff6fcc in virtio_net_reset (vdev=) at /usr/src/debug/qemu/hw/net/virtio-net.c:563 5 0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at /usr/src/debug/qemu/hw/virtio/virtio.c:1993 6 0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at /usr/src/debug/qemu/hw/virtio/virtio-bus.c:102 7 0x55b8f71f1620 in virtio_pci_reset (qdev=) at /usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845 8 0x55b8f6fafc6c in memory_region_write_accessor (mr=, addr=, value=, size=, shift=, mask=, attrs=...) at /usr/src/debug/qemu/memory.c:483 9 0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f867e7fb7e8, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x55b8f6fafc20 , mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544 10 0x55b8f6fb1d0b in memory_region_dispatch_write (mr=mr@entry=0x55b8faf80a50, addr=addr@entry=20, data=0, op=, attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470 11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, addr=addr@entry=549755813908, attrs=..., attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1, mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266 12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, attrs=..., buf=0x7f86d0223028 , len=1) at /usr/src/debug/qemu/exec.c:3306 13 0x55b8f6f674cb in address_space_write (as=, addr=, attrs=..., buf=, len=) at /usr/src/debug/qemu/exec.c:3396 14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=, is_write=) at /usr/src/debug/qemu/exec.c:3406 15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at /usr/src/debug/qemu/accel/kvm/kvm-all.c:2410 16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at /usr/src/debug/qemu/cpus.c:1318 17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at /usr/src/debug/qemu/util/qemu-thread-posix.c:519 18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0 19 0x7f86d5483b2d in clone () from /lib64/libc.so.6 Make vhost_vdpa_receive() return the size passed in as is, so that the caller qemu_deliver_packet_iov() would eventually propagate it back to virtio_net_flush_tx() to release pending packets from the async_tx queue. Which corresponds to the drop path where qemu_sendv_packet_async() returns non-zero in virtio_net_flush_tx(). Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback") Cc: Eugenio Perez Martin Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 4bc3fd0..182b3a1 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, ObjectClass *oc, static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, size_t size) { -return 0; +return size; } static NetClientInfo net_vhost_vdpa_info = { -- 1.8.3.1
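For readers skimming the thread, the reasoning behind the one-liner can be restated as comments on the fixed callback. This is the function from the patch with annotations based on the commit message above (a 0 return queues the packet and parks the element in async_tx; a non-zero return takes the drop path and frees it), not a re-derivation from the net core.

static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf,
                                  size_t size)
{
    /*
     * The datapath is handled entirely by the vDPA device, so packets
     * the virtio-net device model pushes to this peer can only be
     * dropped.  Returning 0 tells the net core the peer could not take
     * the packet yet, which leaves the element sitting in async_tx and
     * later trips the assertion in virtio_net_reset().  Returning the
     * full size takes the "sent/dropped" path instead, letting
     * virtio_net_flush_tx() release the element immediately.
     */
    return size;
}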
[PATCH] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa
Similar to other vhost backends, vhostfd can be passed to vhost-vdpa backend as another parameter to instantiate vhost-vdpa net client. This would benefit the use case where only open fd's, as oppposed to raw vhost-vdpa device paths, are accessible from the QEMU process. (qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1 Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 25 - qapi/net.json| 3 +++ qemu-options.hx | 6 -- 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 182b3a1..366b070 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -683,14 +683,29 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name, assert(netdev->type == NET_CLIENT_DRIVER_VHOST_VDPA); opts = &netdev->u.vhost_vdpa; -if (!opts->vhostdev) { -error_setg(errp, "vdpa character device not specified with vhostdev"); +if (!opts->has_vhostdev && !opts->has_vhostfd) { +error_setg(errp, + "vhost-vdpa: neither vhostdev= nor vhostfd= was specified"); return -1; } -vdpa_device_fd = qemu_open(opts->vhostdev, O_RDWR, errp); -if (vdpa_device_fd == -1) { -return -errno; +if (opts->has_vhostdev && opts->has_vhostfd) { +error_setg(errp, + "vhost-vdpa: vhostdev= and vhostfd= are mutually exclusive"); +return -1; +} + +if (opts->has_vhostdev) { +vdpa_device_fd = qemu_open(opts->vhostdev, O_RDWR, errp); +if (vdpa_device_fd == -1) { +return -errno; +} +} else if (opts->has_vhostfd) { +vdpa_device_fd = monitor_fd_param(monitor_cur(), opts->vhostfd, errp); +if (vdpa_device_fd == -1) { +error_prepend(errp, "vhost-vdpa: unable to parse vhostfd: "); +return -1; +} } r = vhost_vdpa_get_features(vdpa_device_fd, &features, errp); diff --git a/qapi/net.json b/qapi/net.json index dd088c0..926ecc8 100644 --- a/qapi/net.json +++ b/qapi/net.json @@ -442,6 +442,8 @@ # @vhostdev: path of vhost-vdpa device #(default:'/dev/vhost-vdpa-0') # +# @vhostfd: file descriptor of an already opened vhost vdpa device +# # @queues: number of queues to be created for multiqueue vhost-vdpa # (default: 1) # @@ -456,6 +458,7 @@ { 'struct': 'NetdevVhostVDPAOptions', 'data': { '*vhostdev': 'str', +'*vhostfd': 'str', '*queues': 'int', '*x-svq':{'type': 'bool', 'features' : [ 'unstable'] } } } diff --git a/qemu-options.hx b/qemu-options.hx index 913c71e..c040f74 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -2774,8 +2774,10 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev, "configure a vhost-user network, backed by a chardev 'dev'\n" #endif #ifdef __linux__ -"-netdev vhost-vdpa,id=str,vhostdev=/path/to/dev\n" +"-netdev vhost-vdpa,id=str[,vhostdev=/path/to/dev][,vhostfd=h]\n" "configure a vhost-vdpa network,Establish a vhost-vdpa netdev\n" +"use 'vhostdev=/path/to/dev' to open a vhost vdpa device\n" +"use 'vhostfd=h' to connect to an already opened vhost vdpa device\n" #endif #ifdef CONFIG_VMNET "-netdev vmnet-host,id=str[,isolated=on|off][,net-uuid=uuid]\n" @@ -3280,7 +3282,7 @@ SRST -netdev type=vhost-user,id=net0,chardev=chr0 \ -device virtio-net-pci,netdev=net0 -``-netdev vhost-vdpa,vhostdev=/path/to/dev`` +``-netdev vhost-vdpa[,vhostdev=/path/to/dev][,vhostfd=h]`` Establish a vhost-vdpa netdev. vDPA device is a device that uses a datapath which complies with -- 1.8.3.1
Re: [PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset
Apologies, please disregard this email. Wrong target audience it was sent to, although the content of patch is correct. For those who want to review the patch, please reply to this thread: Message-Id: <1664913563-3351-1-git-send-email-si-wei@oracle.com> Thanks, -Siwei On 10/4/2022 12:58 PM, Si-Wei Liu wrote: The citing commit has incorrect code in vhost_vdpa_receive() that returns zero instead of full packet size to the caller. This renders pending packets unable to be freed so then get clogged in the tx queue forever. When device is being reset later on, below assertion failure ensues: 0 0x7f86d53bb387 in raise () from /lib64/libc.so.6 1 0x7f86d53bca78 in abort () from /lib64/libc.so.6 2 0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6 3 0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6 4 0x55b8f6ff6fcc in virtio_net_reset (vdev=) at /usr/src/debug/qemu/hw/net/virtio-net.c:563 5 0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at /usr/src/debug/qemu/hw/virtio/virtio.c:1993 6 0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at /usr/src/debug/qemu/hw/virtio/virtio-bus.c:102 7 0x55b8f71f1620 in virtio_pci_reset (qdev=) at /usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845 8 0x55b8f6fafc6c in memory_region_write_accessor (mr=, addr=, value=, size=, shift=, mask=, attrs=...) at /usr/src/debug/qemu/memory.c:483 9 0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f867e7fb7e8, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x55b8f6fafc20 , mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544 10 0x55b8f6fb1d0b in memory_region_dispatch_write (mr=mr@entry=0x55b8faf80a50, addr=addr@entry=20, data=0, op=, attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470 11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, addr=addr@entry=549755813908, attrs=..., attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1, mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266 12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, attrs=..., buf=0x7f86d0223028 , len=1) at /usr/src/debug/qemu/exec.c:3306 13 0x55b8f6f674cb in address_space_write (as=, addr=, attrs=..., buf=, len=) at /usr/src/debug/qemu/exec.c:3396 14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=, is_write=) at /usr/src/debug/qemu/exec.c:3406 15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at /usr/src/debug/qemu/accel/kvm/kvm-all.c:2410 16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at /usr/src/debug/qemu/cpus.c:1318 17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at /usr/src/debug/qemu/util/qemu-thread-posix.c:519 18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0 19 0x7f86d5483b2d in clone () from /lib64/libc.so.6 Make vhost_vdpa_receive() return the size passed in as is, so that the caller qemu_deliver_packet_iov() would eventually propagate it back to virtio_net_flush_tx() to release pending packets from the async_tx queue. Which corresponds to the drop path where qemu_sendv_packet_async() returns non-zero in virtio_net_flush_tx(). 
Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback") Cc: Eugenio Perez Martin Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 4bc3fd0..182b3a1 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, ObjectClass *oc, static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, size_t size) { -return 0; +return size; } static NetClientInfo net_vhost_vdpa_info = {
Re: [PATCH 2/3] vdpa: load vlan configuration at NIC startup
On 9/29/2022 12:13 AM, Michael S. Tsirkin wrote: On Wed, Sep 21, 2022 at 04:00:58PM -0700, Si-Wei Liu wrote: The spec doesn't explicitly say anything about that as far as I see. Here the spec is totally ruled by the (software artifact of) implementation rather than what a real device is expected to work with VLAN rx filters. Are we sure we'd stick to this flawed device implementation? The guest driver seems to be agnostic with this broken spec behavior so far, and I am afraid it's an overkill to add another feature bit or ctrl command to VLAN filter in clean way. I agree with all of the above. So, double checking, all vlan should be allowed by default at device start? That is true only when VIRTIO_NET_F_CTRL_VLAN is not negotiated. If the guest already negotiated VIRTIO_NET_F_CTRL_VLAN before being migrated, device should resume with all VLANs filtered/disallowed. Maybe the spec needs to be more clear in that regard? Yes, I think this is crucial. Otherwise we can't get consistent behavior, either from software to vDPA, or cross various vDPA vendors. OK. Can you open a github issue for the spec? We'll try to address. Thanks, ticket filed at: https://github.com/oasis-tcs/virtio-spec/issues/147 Also, is it ok if we make it a SHOULD, i.e. best effort filtering? Yes, that's fine. -Siwei
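A self-contained sketch of the semantics agreed on in this exchange (not a copy of any existing implementation): with VIRTIO_NET_F_CTRL_VLAN negotiated the device starts with every VLAN filtered until the driver adds entries, and a migrated device must resume with the restored table; without the feature there is no filter and all VLANs pass.

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define MAX_VLAN 4096

struct vlan_filter {
    uint8_t bitmap[MAX_VLAN >> 3];   /* one bit per VLAN id */
};

static void vlan_filter_init(struct vlan_filter *f, bool ctrl_vlan_negotiated)
{
    if (ctrl_vlan_negotiated) {
        memset(f->bitmap, 0x00, sizeof(f->bitmap));  /* filter all VLANs */
    } else {
        memset(f->bitmap, 0xff, sizeof(f->bitmap));  /* no filtering at all */
    }
}

static bool vlan_allowed(const struct vlan_filter *f, uint16_t vid)
{
    vid &= 0xfff;
    return f->bitmap[vid >> 3] & (1 << (vid & 7));
}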
Re: [PATCH v2] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa
Hi Jason, Sorry for top posting, but are you going to queue this patch? It looks like the discussion has been settled and no further comment I got for 2 weeks for this patch. Thanks, -Siwei On 10/13/2022 4:12 PM, Si-Wei Liu wrote: Jason, On 10/12/2022 10:02 PM, Jason Wang wrote: 在 2022/10/12 13:59, Si-Wei Liu 写道: On 10/11/2022 8:09 PM, Jason Wang wrote: On Tue, Oct 11, 2022 at 1:18 AM Si-Wei Liu wrote: On 10/8/2022 10:43 PM, Jason Wang wrote: On Sat, Oct 8, 2022 at 5:04 PM Si-Wei Liu wrote: Similar to other vhost backends, vhostfd can be passed to vhost-vdpa backend as another parameter to instantiate vhost-vdpa net client. This would benefit the use case where only open file descriptors, as opposed to raw vhost-vdpa device paths, are accessible from the QEMU process. (qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1 Adding Cindy. This has been discussed before, we've already had vhostdev=/dev/fdset/$fd which should be functional equivalent to what has been proposed here. (And this is how libvirt works if I understand correctly). Yes, I was aware of that discussion. However, our implementation of the management software is a bit different from libvirt, in which the paths in /dev/fdset/NNN can't be dynamically passed to the container where QEMU is running. By using a specific vhostfd property with existing code, it would allow our mgmt software smooth adaption without having to add too much infra code to support the /dev/fdset/NNN trick. I think fdset has extra flexibility in e.g hot-plug to allow the file descriptor to be passed with SCM_RIGHTS. Yes, that's exactly the use case we'd like to support. Though the difference in our mgmt software stack from libvirt is that any dynamic path in /dev (like /dev/fdset/ABC or /dev/vhost-vdpa-XYZ) can't be allowed to get passed through to the container running QEMU on the fly for security reasons. fd passing is allowed, though, with very strict security checks. Interesting, any reason for disallowing fd passing? For our mgmt software stack, QEMU is running in a secured container with its own namespace(s) with minimally well known and trusted devices from root ns exposed (only) at the time when QEMU is being started. Direct fd passing via SCM_RIGHTS is allowed, but fdset device node exposure is not allowed and not even considered useful to us, as it adds an unwarranted attack surface to the QEMU's secured container unnecessarily. This has been the case and our security model for a while now w.r.t hot plugging vhost-net/tap and vhost-scsi devices, so will do for vhost-vdpa with vhostfd. It's not an open source project, though what I can share is that it's not a simple script that can be easily changed, and allow passing extra devices e.g. fdset especially on the fly is not even in consideration per suggested security guideline. I think we don't do anything special here as with other secured containers that disallow dynamic device injection on the fly. I'm asking since it's the way that libvirt work and it seems to me we don't get any complaints in the past. I guess it was because libvirt doesn't run QEMU in a container with very limited device exposure, otherwise this sort of constraints would pop up. Anyway the point and the way I see it is that passing vhostfd is proved to be working well and secure with other vhost devices, I don't see why vhost-vdpa is treated special here that would need to enforce the fdset usage. 
It's an edge case for libvirt maybe, but supporting QEMU's vhost-vdpa device to run in a securely contained environment with no dynamic device injection shouldn't be an odd or bizarre use case. Thanks, -Siwei That's the main motivation for this direct vhostfd passing support (noted fdset doesn't need to be used along with /dev/fdset node). Having it said, I found there's also nuance in the vhostdev=/dev/fdset/XyZ interface besides the /dev node limitation: the fd to open has to be dup'ed from the original one passed via SCM_RIGHTS. This also has implication on security that any ioctl call from QEMU can't be audited through the original fd. I'm not sure I get this, but management layer can enforce a ioctl whiltelist for safety. Thanks With this regard, I think vhostfd offers more flexibility than work around those qemu_open() specifics. Would these justify the use case of concern? Thanks, -Siwei It would still be good to add the support. On the other hand, the other vhost backends, e.g. tap (via vhost-net), vhost-scsi and vhost-vsock all accept vhostfd as parameter to instantiate device, although the /dev/fdset trick also works there. I think vhost-vdpa is not unprecedented in this case? Yes. Thanks Thanks, -Siwei Thanks Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- v2: - fixed typo in commit message
Re: [PATCH v2] vhost-vdpa: allow passing opened vhostfd to vhost-vdpa
On 10/27/2022 6:50 PM, Jason Wang wrote: On Fri, Oct 28, 2022 at 5:56 AM Si-Wei Liu wrote: Hi Jason, Sorry for top posting, but are you going to queue this patch? It looks like the discussion has been settled and no further comment I got for 2 weeks for this patch. Yes, I've queued this. Excellent, thanks Jason. I see it gets pulled. -Siwei Thanks Thanks, -Siwei On 10/13/2022 4:12 PM, Si-Wei Liu wrote: Jason, On 10/12/2022 10:02 PM, Jason Wang wrote: 在 2022/10/12 13:59, Si-Wei Liu 写道: On 10/11/2022 8:09 PM, Jason Wang wrote: On Tue, Oct 11, 2022 at 1:18 AM Si-Wei Liu wrote: On 10/8/2022 10:43 PM, Jason Wang wrote: On Sat, Oct 8, 2022 at 5:04 PM Si-Wei Liu wrote: Similar to other vhost backends, vhostfd can be passed to vhost-vdpa backend as another parameter to instantiate vhost-vdpa net client. This would benefit the use case where only open file descriptors, as opposed to raw vhost-vdpa device paths, are accessible from the QEMU process. (qemu) netdev_add type=vhost-vdpa,vhostfd=61,id=vhost-vdpa1 Adding Cindy. This has been discussed before, we've already had vhostdev=/dev/fdset/$fd which should be functional equivalent to what has been proposed here. (And this is how libvirt works if I understand correctly). Yes, I was aware of that discussion. However, our implementation of the management software is a bit different from libvirt, in which the paths in /dev/fdset/NNN can't be dynamically passed to the container where QEMU is running. By using a specific vhostfd property with existing code, it would allow our mgmt software smooth adaption without having to add too much infra code to support the /dev/fdset/NNN trick. I think fdset has extra flexibility in e.g hot-plug to allow the file descriptor to be passed with SCM_RIGHTS. Yes, that's exactly the use case we'd like to support. Though the difference in our mgmt software stack from libvirt is that any dynamic path in /dev (like /dev/fdset/ABC or /dev/vhost-vdpa-XYZ) can't be allowed to get passed through to the container running QEMU on the fly for security reasons. fd passing is allowed, though, with very strict security checks. Interesting, any reason for disallowing fd passing? For our mgmt software stack, QEMU is running in a secured container with its own namespace(s) with minimally well known and trusted devices from root ns exposed (only) at the time when QEMU is being started. Direct fd passing via SCM_RIGHTS is allowed, but fdset device node exposure is not allowed and not even considered useful to us, as it adds an unwarranted attack surface to the QEMU's secured container unnecessarily. This has been the case and our security model for a while now w.r.t hot plugging vhost-net/tap and vhost-scsi devices, so will do for vhost-vdpa with vhostfd. It's not an open source project, though what I can share is that it's not a simple script that can be easily changed, and allow passing extra devices e.g. fdset especially on the fly is not even in consideration per suggested security guideline. I think we don't do anything special here as with other secured containers that disallow dynamic device injection on the fly. I'm asking since it's the way that libvirt work and it seems to me we don't get any complaints in the past. I guess it was because libvirt doesn't run QEMU in a container with very limited device exposure, otherwise this sort of constraints would pop up. 
Anyway the point and the way I see it is that passing vhostfd is proved to be working well and secure with other vhost devices, I don't see why vhost-vdpa is treated special here that would need to enforce the fdset usage. It's an edge case for libvirt maybe, but supporting QEMU's vhost-vdpa device to run in a securely contained environment with no dynamic device injection shouldn't be an odd or bizarre use case. Thanks, -Siwei That's the main motivation for this direct vhostfd passing support (noted fdset doesn't need to be used along with /dev/fdset node). Having it said, I found there's also nuance in the vhostdev=/dev/fdset/XyZ interface besides the /dev node limitation: the fd to open has to be dup'ed from the original one passed via SCM_RIGHTS. This also has implication on security that any ioctl call from QEMU can't be audited through the original fd. I'm not sure I get this, but management layer can enforce a ioctl whiltelist for safety. Thanks With this regard, I think vhostfd offers more flexibility than work around those qemu_open() specifics. Would these justify the use case of concern? Thanks, -Siwei It would still be good to add the support. On the other hand, the other vhost backends, e.g. tap (via vhost-net), vhost-scsi and vhost-vsock all accept vhostfd as parameter to instantiate device, although the /dev/fdset trick also works there. I think vhost-vdpa is not unprecedented in this case? Yes.
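For reference, the management-side half of the fd handoff described here needs nothing beyond standard POSIX: open the vhost-vdpa node in the privileged context and ship the descriptor to the QEMU container over a connected AF_UNIX socket with SCM_RIGHTS. The sketch below is illustrative; the device path, socket setup and error handling are assumptions, not taken from any particular management stack.

#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open file descriptor over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F';                      /* at least one data byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;              /* forces correct alignment */
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Example: fd = open("/dev/vhost-vdpa-0", O_RDWR); send_fd(sock, fd); */

On the QEMU side the received descriptor can then be registered with the monitor (the getfd command associates it with a name) and referenced as vhostfd=<name>, or passed as a plain number when QEMU inherits the descriptor directly; monitor_fd_param() in the patch accepts either form.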
Re: [PATCH] vhost-vdpa: fix assert !virtio_net_get_subqueue(nc)->async_tx.elem in virtio_net_reset
Hi Jason, This one is a one-line simple bug fix but seems to be missed from the pull request. If there's a v2 for the PULL, would appreciate if you can piggyback. Thanks in advance! Regards, -Siwei On 10/7/2022 8:42 AM, Eugenio Perez Martin wrote: On Tue, Oct 4, 2022 at 11:05 PM Si-Wei Liu wrote: The citing commit has incorrect code in vhost_vdpa_receive() that returns zero instead of full packet size to the caller. This renders pending packets unable to be freed so then get clogged in the tx queue forever. When device is being reset later on, below assertion failure ensues: 0 0x7f86d53bb387 in raise () from /lib64/libc.so.6 1 0x7f86d53bca78 in abort () from /lib64/libc.so.6 2 0x7f86d53b41a6 in __assert_fail_base () from /lib64/libc.so.6 3 0x7f86d53b4252 in __assert_fail () from /lib64/libc.so.6 4 0x55b8f6ff6fcc in virtio_net_reset (vdev=) at /usr/src/debug/qemu/hw/net/virtio-net.c:563 5 0x55b8f7012fcf in virtio_reset (opaque=0x55b8faf881f0) at /usr/src/debug/qemu/hw/virtio/virtio.c:1993 6 0x55b8f71f0086 in virtio_bus_reset (bus=bus@entry=0x55b8faf88178) at /usr/src/debug/qemu/hw/virtio/virtio-bus.c:102 7 0x55b8f71f1620 in virtio_pci_reset (qdev=) at /usr/src/debug/qemu/hw/virtio/virtio-pci.c:1845 8 0x55b8f6fafc6c in memory_region_write_accessor (mr=, addr=, value=, size=, shift=, mask=, attrs=...) at /usr/src/debug/qemu/memory.c:483 9 0x55b8f6fadce9 in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f867e7fb7e8, size=size@entry=1, access_size_min=, access_size_max=, access_fn=0x55b8f6fafc20 , mr=0x55b8faf80a50, attrs=...) at /usr/src/debug/qemu/memory.c:544 10 0x55b8f6fb1d0b in memory_region_dispatch_write (mr=mr@entry=0x55b8faf80a50, addr=addr@entry=20, data=0, op=, attrs=attrs@entry=...) at /usr/src/debug/qemu/memory.c:1470 11 0x55b8f6f62ada in flatview_write_continue (fv=fv@entry=0x7f86ac04cd20, addr=addr@entry=549755813908, attrs=..., attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=len@entry=1, addr1=20, l=1, mr=0x55b8faf80a50) at /usr/src/debug/qemu/exec.c:3266 12 0x55b8f6f62c8f in flatview_write (fv=0x7f86ac04cd20, addr=549755813908, attrs=..., buf=0x7f86d0223028 , len=1) at /usr/src/debug/qemu/exec.c:3306 13 0x55b8f6f674cb in address_space_write (as=, addr=, attrs=..., buf=, len=) at /usr/src/debug/qemu/exec.c:3396 14 0x55b8f6f67575 in address_space_rw (as=, addr=, attrs=..., attrs@entry=..., buf=buf@entry=0x7f86d0223028 , len=, is_write=) at /usr/src/debug/qemu/exec.c:3406 15 0x55b8f6fc1cc8 in kvm_cpu_exec (cpu=cpu@entry=0x55b8f9aa0e10) at /usr/src/debug/qemu/accel/kvm/kvm-all.c:2410 16 0x55b8f6fa5f5e in qemu_kvm_cpu_thread_fn (arg=0x55b8f9aa0e10) at /usr/src/debug/qemu/cpus.c:1318 17 0x55b8f7336e16 in qemu_thread_start (args=0x55b8f9ac8480) at /usr/src/debug/qemu/util/qemu-thread-posix.c:519 18 0x7f86d575aea5 in start_thread () from /lib64/libpthread.so.0 19 0x7f86d5483b2d in clone () from /lib64/libc.so.6 Make vhost_vdpa_receive() return the size passed in as is, so that the caller qemu_deliver_packet_iov() would eventually propagate it back to virtio_net_flush_tx() to release pending packets from the async_tx queue. Which corresponds to the drop path where qemu_sendv_packet_async() returns non-zero in virtio_net_flush_tx(). 
Acked-by: Eugenio Pérez Fixes: 846a1e85da64 ("vdpa: Add dummy receive callback") Cc: Eugenio Perez Martin Signed-off-by: Si-Wei Liu --- net/vhost-vdpa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 4bc3fd0..182b3a1 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -211,7 +211,7 @@ static bool vhost_vdpa_check_peer_type(NetClientState *nc, ObjectClass *oc, static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, size_t size) { -return 0; +return size; } static NetClientInfo net_vhost_vdpa_info = { -- 1.8.3.1
Re: [RFC v2 00/13] Dinamycally switch to vhost shadow virtqueues at vdpa net migration
On 2/2/2023 3:27 AM, Eugenio Perez Martin wrote: On Thu, Feb 2, 2023 at 2:00 AM Si-Wei Liu wrote: On 1/12/2023 9:24 AM, Eugenio Pérez wrote: It's possible to migrate vdpa net devices if they are shadowed from the start. But to always shadow the dataplane is effectively break its host passthrough, so its not convenient in vDPA scenarios. This series enables dynamically switching to shadow mode only at migration time. This allow full data virtqueues passthrough all the time qemu is not migrating. Successfully tested with vdpa_sim_net (but it needs some patches, I will send them soon) and qemu emulated device with vp_vdpa with some restrictions: * No CVQ. * VIRTIO_RING_F_STATE patches. What are these patches (I'm not sure I follow VIRTIO_RING_F_STATE, is it a new feature that other vdpa driver would need for live migration)? Not really, Since vp_vdpa wraps a virtio-net-pci driver to give it vdpa capabilities it needs a virtio in-band method to set and fetch the virtqueue state. Jason sent a proposal some time ago [1], and I implemented it in qemu's virtio emulated device. I can send them as a RFC but I didn't worry about making it pretty, nor I think they should be merged at the moment. vdpa parent drivers should follow vdpa_sim changes. Got it. No bother sending RFC for now, I think it's limited to virtio backed vdpa providers only. Thanks for the clarifications. -Siwei Thanks! [1] https://lists.oasis-open.org/archives/virtio-comment/202103/msg00036.html -Siwei * Expose _F_SUSPEND, but ignore it and suspend on ring state fetch like DPDK. Comments are welcome, especially in the patcheswith RFC in the message. v2: - Use a migration listener instead of a memory listener to know when the migration starts. - Add stuff not picked with ASID patches, like enable rings after driver_ok - Add rewinding on the migration src, not in dst - v1 at https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg01664.html Eugenio Pérez (13): vdpa: fix VHOST_BACKEND_F_IOTLB_ASID flag check vdpa net: move iova tree creation from init to start vdpa: copy cvq shadow_data from data vqs, not from x-svq vdpa: rewind at get_base, not set_base vdpa net: add migration blocker if cannot migrate cvq vhost: delay set_vring_ready after DRIVER_OK vdpa: delay set_vring_ready after DRIVER_OK vdpa: Negotiate _F_SUSPEND feature vdpa: add feature_log parameter to vhost_vdpa vdpa net: allow VHOST_F_LOG_ALL vdpa: add vdpa net migration state notifier vdpa: preemptive kick at enable vdpa: Conditionally expose _F_LOG in vhost_net devices include/hw/virtio/vhost-backend.h | 4 + include/hw/virtio/vhost-vdpa.h| 1 + hw/net/vhost_net.c| 25 ++- hw/virtio/vhost-vdpa.c| 64 +--- hw/virtio/vhost.c | 3 + net/vhost-vdpa.c | 247 +- 6 files changed, 278 insertions(+), 66 deletions(-)
Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier
On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote: On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu wrote: On 1/12/2023 9:24 AM, Eugenio Pérez wrote: This allows net to restart the device backend to configure SVQ on it. Ideally, these changes should not be net specific. However, the vdpa net backend is the one with enough knowledge to configure everything because of some reasons: * Queues might need to be shadowed or not depending on its kind (control vs data). * Queues need to share the same map translations (iova tree). Because of that it is cleaner to restart the whole net backend and configure again as expected, similar to how vhost-kernel moves between userspace and passthrough. If more kinds of devices need dynamic switching to SVQ we can create a callback struct like VhostOps and move most of the code there. VhostOps cannot be reused since all vdpa backend share them, and to personalize just for networking would be too heavy. Signed-off-by: Eugenio Pérez --- net/vhost-vdpa.c | 84 1 file changed, 84 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 5d7ad6e4d7..f38532b1df 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -26,6 +26,8 @@ #include #include "standard-headers/linux/virtio_net.h" #include "monitor/monitor.h" +#include "migration/migration.h" +#include "migration/misc.h" #include "migration/blocker.h" #include "hw/virtio/vhost.h" @@ -33,6 +35,7 @@ typedef struct VhostVDPAState { NetClientState nc; struct vhost_vdpa vhost_vdpa; +Notifier migration_state; Error *migration_blocker; VHostNetState *vhost_net; @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) return DO_UPCAST(VhostVDPAState, nc, nc0); } +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; +VirtIONet *n; +VirtIODevice *vdev; +int data_queue_pairs, cvq, r; +NetClientState *peer; + +/* We are only called on the first data vqs and only if x-svq is not set */ +if (s->vhost_vdpa.shadow_vqs_enabled == enable) { +return; +} + +vdev = v->dev->vdev; +n = VIRTIO_NET(vdev); +if (!n->vhost_started) { +return; +} + +if (enable) { +ioctl(v->device_fd, VHOST_VDPA_SUSPEND); +} +data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; +cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; +vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); + +peer = s->nc.peer; +for (int i = 0; i < data_queue_pairs + cvq; i++) { +VhostVDPAState *vdpa_state; +NetClientState *nc; + +if (i < data_queue_pairs) { +nc = qemu_get_peer(peer, i); +} else { +nc = qemu_get_peer(peer, n->max_queue_pairs); +} + +vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); +vdpa_state->vhost_vdpa.shadow_data = enable; + +if (i < data_queue_pairs) { +/* Do not override CVQ shadow_vqs_enabled */ +vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; +} +} + +r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); As the first revision, this method (vhost_net_stop followed by vhost_net_start) should be fine for software vhost-vdpa backend for e.g. vp_vdpa and vdpa_sim_net. However, I would like to get your attention that this method implies substantial blackout time for mode switching on real hardware - get a full cycle of device reset of getting memory mappings torn down, unpin & repin same set of pages, and set up new mapping would take very significant amount of time, especially for a large VM. Maybe we can do: Right, I think this is something that deserves optimization in the future. 
Note that we must replace the mappings anyway, with all passthrough queues stopped. Yes, unmap and remap is needed indeed. I haven't checked: do the shadow vqs keep mapping to the same GPA ranges that the passthrough data virtqueues were associated with across the switch (so that the mode switch is transparent to the guest)? For a platform IOMMU the mapping and remapping cost is inevitable, though I wonder whether for the on-chip IOMMU case it could take some fast path to just replace the IOVA in place without destroying and setting up all the mapping entries, if the same GPA is going to be used for the data rings (copy Eli for his input). This is because SVQ vrings are not in the guest space. The pin can be skipped though, I think that's a low-hanging fruit here. Yes, that's right. For a large VM the pinning overhead usually outweighs the mapping cost. It would be a great amount of time saving if the pin can be skipped.
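As an illustration of the "skip the pin" idea (not actual vhost-vdpa kernel code; the helper names and data structures below are hypothetical), the parent could keep already-pinned userspace ranges refcounted across the SVQ switch so that only the IOMMU mapping has to be rebuilt:

/*
 * Hypothetical sketch: reuse pages already pinned for the same userspace
 * range when it is re-mapped at a new IOVA during the SVQ switch, so only
 * the IOVA->PA translation changes and pin_user_pages() is avoided.
 */
static int vdpa_map_reuse_pin(struct vdpa_iotlb *iotlb, u64 iova, u64 uaddr,
			      u64 size, u32 perm)
{
	struct vdpa_pinned_range *r = vdpa_pinned_lookup(iotlb, uaddr, size);

	if (!r) {
		r = vdpa_pin_range(iotlb, uaddr, size);	/* slow path: pin_user_pages() */
		if (IS_ERR(r))
			return PTR_ERR(r);
	} else {
		r->refcnt++;	/* fast path: pages stay pinned across the switch */
	}

	/* Only the IOVA->PA translation is rebuilt for the new mapping. */
	return vdpa_iommu_map(iotlb, iova, r->pages, size, perm);
}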
Re: [RFC v2 12/13] vdpa: preemptive kick at enable
On 2/2/2023 8:53 AM, Eugenio Perez Martin wrote: On Thu, Feb 2, 2023 at 1:57 AM Si-Wei Liu wrote: On 1/13/2023 1:06 AM, Eugenio Perez Martin wrote: On Fri, Jan 13, 2023 at 4:39 AM Jason Wang wrote: On Fri, Jan 13, 2023 at 11:25 AM Zhu, Lingshan wrote: On 1/13/2023 10:31 AM, Jason Wang wrote: On Fri, Jan 13, 2023 at 1:27 AM Eugenio Pérez wrote: Spuriously kick the destination device's queue so it knows in case there are new descriptors. RFC: This is somehow a gray area. The guest may have placed descriptors in a virtqueue but not kicked it, so it might be surprised if the device starts processing it. So I think this is kind of the work of the vDPA parent. For the parent that needs this trick, we should do it in the parent driver. Agree, it looks easier implementing this in parent driver, I can implement it in ifcvf set_vq_ready right now Great, but please check whether or not it is really needed. Some device implementation could check the available descriptions after DRIVER_OK without waiting for a kick. So IIUC we can entirely drop this from the series (and I hope we can). But then, what with the devices that does *not* check for them? I wonder how the kick can be missed from the first place. Supposedly the moment when vhost_dev_stop() calls .suspend() into vdpa driver, the vcpus already stopped running (vm_running = false) and all pending kicks are delivered through vhost-vdpa's host notifiers or mapped doorbell already then device won't get new ones. I'm thinking now in cases like the net rx queue. When the guest starts it fills it and kicks the device. Let's say avail_idx is 255. Following the qemu emulated virtio net, hw/virtio/virtio.c:virtqueue_split_pop will stash shadow_avail_idx = 255, and it will not check it again until it is out of rx descriptors. Now the NIC fills N < 255 receive buffers, and VMM migrates. Will the destination device check rx avail idx even if it has not received any kick? (here could be at startup or when it needs to receive a packet). - If the answer is yes, and it will be a bug not to check it, then we can drop this patch. We're covered even if there is a possibility of losing a kick in the source. So this is not an issue of missing delivery of kicks, but more of how device is expected to handle pending kicks during suspend? For network device, it's not required to process up to avail_idx during suspend, but this doesn't mean it should ignore the kick for new descriptors, or instead I would say the device shouldn't specifically rely on kick, either at suspend or at startup. If at suspend, the device doesn't process up to avail_idx, correspondingly the implementation of it should sync the avail_idx in memory at startup. Even if the device implementation has to process up to avail_idx at suspend, for interoperability (i.e. source device didn't sync at suspend) point of view it still needs to check avail_idx at startup (resume) time and go on to process any pending request, right? So in any case, it seems to me the "implicit" kick at startup is needed for any device implementation anyway. I wouldn't say mandatory but that's the way how its supposed to work I feel. - If the answer is that it is not mandatory, we need to solve it somehow. To me, the best way is to spuriously kick as we don't need changes in the device, all we need is here. A new feature flag _F_CHECK_AVAIL_ON_STARTUP or equivalent would work the same, but I think it complicates everything more. 
For tx the device should suspend "immediately", so it may receive a kick, fetch avail_idx with M pending descriptors, transmit P < M and then receive the suspend. If we don't want to wait indefinitely, the device should stop processing so there are still pending requests in the queue for the destination to send. So the case now is the same as rx, even if the source device actually receives the kick. Having said that, I didn't check if any code drains the vhost host notifier. Or, as mentioned in the meeting, check that the HW cannot reorder the kick and the suspend call. Not sure how the order matters here, though; I thought device suspend/resume doesn't tie in with kick ordering? If the device intends to purposely ignore (note: this could be a device bug) pending kicks during .suspend(), then consequently it should check available descriptors after reaching driver_ok and process outstanding descriptors, making up for the missing kick. If the vdpa driver doesn't support .suspend(), then it should flush the work before .reset() - vhost-scsi does it this way. Otherwise, I think it's the norm (the right thing to do) that the device should process pending kicks before guest memory is unmapped in the late stage of vhost_dev_stop(). Is there any case where kicks may be missing? So processing pending kicks means draining all tx and rx descriptors? No, it doesn't have to. What I sai
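To make the "implicit kick at startup" concrete, here is a self-contained sketch (not code from the series) of the check a device implementation can do right after DRIVER_OK or resume instead of waiting for a notification; memory barriers and endianness handling are omitted for brevity:

#include <stdbool.h>
#include <stdint.h>

/* Split virtqueue available ring header as laid out in guest memory. */
struct vring_avail {
    uint16_t flags;
    uint16_t idx;      /* the driver writes the next free slot index here */
    uint16_t ring[];
};

/*
 * Returns true if the driver queued descriptors that the device has not
 * consumed yet, regardless of whether a kick was ever delivered for them.
 * The 16-bit indices wrap, so inequality is the correct emptiness test.
 */
static bool vq_has_pending_work(const struct vring_avail *avail,
                                uint16_t last_avail_idx)
{
    return avail->idx != last_avail_idx;
}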
Re: [RFC v2 12/13] vdpa: preemptive kick at enable
On 2/5/2023 2:00 AM, Michael S. Tsirkin wrote: On Sat, Feb 04, 2023 at 03:04:02AM -0800, Si-Wei Liu wrote: For a network hardware device, I thought suspend just needs to wait until the completion of ongoing Tx/Rx DMA transactions already in flight, rather than drain all the upcoming packets until avail_idx. It depends I guess, but if the device expects to recover all state from just the ring state in memory then at least it has to drain until some index value. Yes, that's the general requirement for devices other than networking devices. For e.g., if a storage device had posted requests before suspending and there's no way to replay those requests from the destination, it needs to drain until all posted requests are completed. For a network device, this requirement can be relaxed somewhat, as networking (Ethernet) is usually tolerant to packet drops. Jason and I once had a long discussion about the expectations for the {get,set}_vq_state() driver API and we came to the conclusion that this is something a networking device can stand up to: https://lore.kernel.org/lkml/b2d18964-8cd6-6bb1-1995-5b9662070...@redhat.com/ -Siwei
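As a rough sketch of the distinction above (the types and helpers here are hypothetical device-internal names, not from any driver): a storage-like device has to drain up to the avail index observed at suspend time, while a net device only needs to wait for DMA that is already in flight:

/* Illustrative only. */
static void vq_drain_on_suspend(struct dev_vq *vq, bool is_net)
{
    uint16_t target = vq_read_avail_idx(vq);   /* snapshot at suspend request */

    if (is_net) {
        vq_wait_inflight_dma(vq);   /* drops of not-yet-fetched packets are OK */
        return;
    }

    /* Requests already made available cannot be replayed by the destination. */
    while (vq->last_avail_idx != target) {
        vq_process_one(vq);
        vq->last_avail_idx++;
    }
}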
Re: [PATCH v2 01/13] vdpa net: move iova tree creation from init to start
On 2/8/2023 1:42 AM, Eugenio Pérez wrote: Only create iova_tree if and when it is needed. The cleanup keeps being responsible of last VQ but this change allows it to merge both cleanup functions. Signed-off-by: Eugenio Pérez Acked-by: Jason Wang --- net/vhost-vdpa.c | 99 ++-- 1 file changed, 71 insertions(+), 28 deletions(-) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index de5ed8ff22..a9e6c8f28e 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -178,13 +178,9 @@ err_init: static void vhost_vdpa_cleanup(NetClientState *nc) { VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); -struct vhost_dev *dev = &s->vhost_net->dev; qemu_vfree(s->cvq_cmd_out_buffer); qemu_vfree(s->status); -if (dev->vq_index + dev->nvqs == dev->vq_index_end) { -g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); -} if (s->vhost_net) { vhost_net_cleanup(s->vhost_net); g_free(s->vhost_net); @@ -234,10 +230,64 @@ static ssize_t vhost_vdpa_receive(NetClientState *nc, const uint8_t *buf, return size; } +/** From any vdpa net client, get the netclient of first queue pair */ +static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) +{ +NICState *nic = qemu_get_nic(s->nc.peer); +NetClientState *nc0 = qemu_get_peer(nic->ncs, 0); + +return DO_UPCAST(VhostVDPAState, nc, nc0); +} + +static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; + +if (v->shadow_vqs_enabled) { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); +} +} + +static int vhost_vdpa_net_data_start(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_vdpa *v = &s->vhost_vdpa; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +if (v->index == 0) { +vhost_vdpa_net_data_start_first(s); +return 0; +} + +if (v->shadow_vqs_enabled) { +VhostVDPAState *s0 = vhost_vdpa_net_first_nc_vdpa(s); +v->iova_tree = s0->vhost_vdpa.iova_tree; +} + +return 0; +} + +static void vhost_vdpa_net_client_stop(NetClientState *nc) +{ +VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc); +struct vhost_dev *dev; + +assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + +dev = s->vhost_vdpa.dev; +if (dev->vq_index + dev->nvqs == dev->vq_index_end) { +g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); +} +} + static NetClientInfo net_vhost_vdpa_info = { .type = NET_CLIENT_DRIVER_VHOST_VDPA, .size = sizeof(VhostVDPAState), .receive = vhost_vdpa_receive, +.start = vhost_vdpa_net_data_start, +.stop = vhost_vdpa_net_client_stop, .cleanup = vhost_vdpa_cleanup, .has_vnet_hdr = vhost_vdpa_has_vnet_hdr, .has_ufo = vhost_vdpa_has_ufo, @@ -351,7 +401,7 @@ dma_map_err: static int vhost_vdpa_net_cvq_start(NetClientState *nc) { -VhostVDPAState *s; +VhostVDPAState *s, *s0; struct vhost_vdpa *v; uint64_t backend_features; int64_t cvq_group; @@ -425,6 +475,15 @@ out: return 0; } +s0 = vhost_vdpa_net_first_nc_vdpa(s); +if (s0->vhost_vdpa.iova_tree) { +/* SVQ is already configured for all virtqueues */ +v->iova_tree = s0->vhost_vdpa.iova_tree; +} else { +v->iova_tree = vhost_iova_tree_new(v->iova_range.first, + v->iova_range.last); I wonder how this case could happen, vhost_vdpa_net_data_start_first() should've allocated an iova tree on the first data vq. Is zero data vq ever possible on net vhost-vdpa? 
Thanks, -Siwei +} + r = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer, vhost_vdpa_net_cvq_cmd_page_len(), false); if (unlikely(r < 0)) { @@ -449,15 +508,9 @@ static void vhost_vdpa_net_cvq_stop(NetClientState *nc) if (s->vhost_vdpa.shadow_vqs_enabled) { vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer); vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->status); -if (!s->always_svq) { -/* - * If only the CVQ is shadowed we can delete this safely. - * If all the VQs are shadows this will be needed by the time the - * device is started again to register SVQ vrings and similar. - */ -g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); -} } + +vhost_vdpa_net_client_stop(nc); } static ssize_t vhost_vdpa_net_cvq_add(VhostVDPAState *s, size_t out_len, @@ -667,8 +720,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, int nvqs,
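If the answer to the question above is that a net vhost-vdpa setup always has at least one data queue pair, the CVQ start path could inherit the tree unconditionally; a minimal sketch of that simplification (an assumption, not what the patch does):

s0 = vhost_vdpa_net_first_nc_vdpa(s);
assert(s0->vhost_vdpa.iova_tree);          /* created by the first data vq */
v->iova_tree = s0->vhost_vdpa.iova_tree;   /* SVQ already configured for all vqs */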
Re: [PATCH v2 09/13] vdpa net: block migration if the device has CVQ
On 2/8/2023 1:42 AM, Eugenio Pérez wrote: Devices with CVQ need to migrate state beyond vq state. Leaving this to a future series. Signed-off-by: Eugenio Pérez --- net/vhost-vdpa.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index bca13f97fd..309861e56c 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -955,11 +955,17 @@ int net_init_vhost_vdpa(const Netdev *netdev, const char *name, } if (has_cvq) { +VhostVDPAState *s; + nc = net_vhost_vdpa_init(peer, TYPE_VHOST_VDPA, name, vdpa_device_fd, i, 1, false, opts->x_svq, iova_range); if (!nc) goto err; + +s = DO_UPCAST(VhostVDPAState, nc, nc); +error_setg(&s->vhost_vdpa.dev->migration_blocker, + "net vdpa cannot migrate with MQ feature"); Not sure how this can work: migration_blocker is only checked and gets added from vhost_dev_init(), which is already done through net_vhost_vdpa_init() above. The same question applies to the next patch of this series. Thanks, -Siwei } return 0;
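For reference, the usual way to make such a restriction visible to the migration core without depending on vhost_dev_init() picking up dev->migration_blocker later is to register the blocker explicitly; a sketch only (the message text and error-handling policy are illustrative), using the migrate_add_blocker() signature of that QEMU version and assuming errp is the Error ** already available in net_init_vhost_vdpa():

Error *blocker = NULL;

error_setg(&blocker, "vdpa net device does not support migration yet");
if (migrate_add_blocker(blocker, errp) < 0) {
    error_free(blocker);
    /* the caller decides whether failing to add the blocker is fatal */
}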
Re: [PATCH v2 07/13] vdpa: add vdpa net migration state notifier
On 2/8/2023 1:42 AM, Eugenio Pérez wrote: This allows net to restart the device backend to configure SVQ on it. Ideally, these changes should not be net specific. However, the vdpa net backend is the one with enough knowledge to configure everything because of some reasons: * Queues might need to be shadowed or not depending on its kind (control vs data). * Queues need to share the same map translations (iova tree). Because of that it is cleaner to restart the whole net backend and configure again as expected, similar to how vhost-kernel moves between userspace and passthrough. If more kinds of devices need dynamic switching to SVQ we can create a callback struct like VhostOps and move most of the code there. VhostOps cannot be reused since all vdpa backend share them, and to personalize just for networking would be too heavy. Signed-off-by: Eugenio Pérez --- v3: * Add TODO to use the resume operation in the future. * Use migration_in_setup and migration_has_failed instead of a complicated switch case. --- net/vhost-vdpa.c | 76 1 file changed, 76 insertions(+) diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index dd686b4514..bca13f97fd 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -26,12 +26,14 @@ #include #include "standard-headers/linux/virtio_net.h" #include "monitor/monitor.h" +#include "migration/misc.h" #include "hw/virtio/vhost.h" /* Todo:need to add the multiqueue support here */ typedef struct VhostVDPAState { NetClientState nc; struct vhost_vdpa vhost_vdpa; +Notifier migration_state; VHostNetState *vhost_net; /* Control commands shadow buffers */ @@ -241,10 +243,79 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) return DO_UPCAST(VhostVDPAState, nc, nc0); } +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) +{ +struct vhost_vdpa *v = &s->vhost_vdpa; +VirtIONet *n; +VirtIODevice *vdev; +int data_queue_pairs, cvq, r; +NetClientState *peer; + +/* We are only called on the first data vqs and only if x-svq is not set */ +if (s->vhost_vdpa.shadow_vqs_enabled == enable) { +return; +} + +vdev = v->dev->vdev; +n = VIRTIO_NET(vdev); +if (!n->vhost_started) { +return; What if vhost gets started after migration is started, will svq still be (dynamically) enabled during vhost_dev_start()? I don't see relevant code to deal with it? +} + +data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; +cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; +/* + * TODO: vhost_net_stop does suspend, get_base and reset. We can be smarter + * in the future and resume the device if read-only operations between + * suspend and reset goes wrong. + */ +vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); + +peer = s->nc.peer; +for (int i = 0; i < data_queue_pairs + cvq; i++) { +VhostVDPAState *vdpa_state; +NetClientState *nc; + +if (i < data_queue_pairs) { +nc = qemu_get_peer(peer, i); +} else { +nc = qemu_get_peer(peer, n->max_queue_pairs); +} + +vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); +vdpa_state->vhost_vdpa.shadow_data = enable; Don't get why shadow_data is set on cvq's vhost_vdpa? This may result in address space collision: data vq's iova getting improperly allocated on cvq's address space in vhost_vdpa_listener_region_{add,del}(). Noted currently there's an issue where guest VM's memory listener registration is always hooked to the last vq, which could be on the cvq in a different iova address space VHOST_VDPA_NET_CVQ_ASID. 
Thanks, -Siwei + +if (i < data_queue_pairs) { +/* Do not override CVQ shadow_vqs_enabled */ +vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; +} +} + +r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); +if (unlikely(r < 0)) { +error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); +} +} + +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) +{ +MigrationState *migration = data; +VhostVDPAState *s = container_of(notifier, VhostVDPAState, + migration_state); + +if (migration_in_setup(migration)) { +vhost_vdpa_net_log_global_enable(s, true); +} else if (migration_has_failed(migration)) { +vhost_vdpa_net_log_global_enable(s, false); +} +} + static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) { struct vhost_vdpa *v = &s->vhost_vdpa; +add_migration_state_change_notifier(&s->migration_state); if (v->shadow_vqs_enabled) { v->iova_tree = vhost_iova_tree_new(v->iova_range.first,
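One possible guard for the collision described above (a sketch of the idea only; whether it is the right fix is exactly what is under discussion) is to leave shadow_data alone for the client whose vqs live in the separate CVQ address space, assuming the ASID chosen at CVQ start is recorded in the client's vhost_vdpa.address_space_id as the CVQ start path does:

if (vdpa_state->vhost_vdpa.address_space_id != VHOST_VDPA_NET_CVQ_ASID) {
    /* only clients in the guest-memory address space share the iova tree */
    vdpa_state->vhost_vdpa.shadow_data = enable;
}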
Re: [PATCH v4 07/15] vdpa: add vhost_vdpa_suspend
On 2/24/2023 7:54 AM, Eugenio Pérez wrote: The function vhost.c:vhost_dev_stop fetches the vring base so the vq state can be migrated to other devices. However, this is unreliable in vdpa, since we didn't signal the device to suspend the queues, making the value fetched useless. Suspend the device if possible before fetching first and subsequent vring bases. Moreover, vdpa totally reset and wipes the device at the last device before fetch its vrings base, making that operation useless in the last device. This will be fixed in later patches of this series. Signed-off-by: Eugenio Pérez --- v4: * Look for _F_SUSPEND at vhost_dev->backend_cap, not backend_features * Fall back on reset & fetch used idx from guest's memory --- hw/virtio/vhost-vdpa.c | 25 + hw/virtio/trace-events | 1 + 2 files changed, 26 insertions(+) diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c index 228677895a..f542960a64 100644 --- a/hw/virtio/vhost-vdpa.c +++ b/hw/virtio/vhost-vdpa.c @@ -712,6 +712,7 @@ static int vhost_vdpa_reset_device(struct vhost_dev *dev) ret = vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &status); trace_vhost_vdpa_reset_device(dev, status); +v->suspended = false; return ret; } @@ -1109,6 +1110,29 @@ static void vhost_vdpa_svqs_stop(struct vhost_dev *dev) } } +static void vhost_vdpa_suspend(struct vhost_dev *dev) +{ +struct vhost_vdpa *v = dev->opaque; +int r; + +if (!vhost_vdpa_first_dev(dev)) { +return; +} + +if (!(dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_SUSPEND))) { Polarity reversed. This ends up device getting reset always even if the backend offers _F_SUSPEND. -Siwei +trace_vhost_vdpa_suspend(dev); +r = ioctl(v->device_fd, VHOST_VDPA_SUSPEND); +if (unlikely(r)) { +error_report("Cannot suspend: %s(%d)", g_strerror(errno), errno); +} else { +v->suspended = true; +return; +} +} + +vhost_vdpa_reset_device(dev); +} + static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started) { struct vhost_vdpa *v = dev->opaque; @@ -1123,6 +1147,7 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started) } vhost_vdpa_set_vring_ready(dev); } else { +vhost_vdpa_suspend(dev); vhost_vdpa_svqs_stop(dev); vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs); } diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events index a87c5f39a2..8f8d05cf9b 100644 --- a/hw/virtio/trace-events +++ b/hw/virtio/trace-events @@ -50,6 +50,7 @@ vhost_vdpa_set_vring_ready(void *dev) "dev: %p" vhost_vdpa_dump_config(void *dev, const char *line) "dev: %p %s" vhost_vdpa_set_config(void *dev, uint32_t offset, uint32_t size, uint32_t flags) "dev: %p offset: %"PRIu32" size: %"PRIu32" flags: 0x%"PRIx32 vhost_vdpa_get_config(void *dev, void *config, uint32_t config_len) "dev: %p config: %p config_len: %"PRIu32 +vhost_vdpa_suspend(void *dev) "dev: %p" vhost_vdpa_dev_start(void *dev, bool started) "dev: %p started: %d" vhost_vdpa_set_log_base(void *dev, uint64_t base, unsigned long long size, int refcnt, int fd, void *log) "dev: %p base: 0x%"PRIx64" size: %llu refcnt: %d fd: %d log: %p" vhost_vdpa_set_vring_addr(void *dev, unsigned int index, unsigned int flags, uint64_t desc_user_addr, uint64_t used_user_addr, uint64_t avail_user_addr, uint64_t log_guest_addr) "dev: %p index: %u flags: 0x%x desc_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" log_guest_addr: 0x%"PRIx64
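For reference, a sketch of the suspend path with the condition's polarity the comment points at, i.e. suspending when the backend does offer _F_SUSPEND and falling back to reset otherwise or on failure (assembled from the hunk quoted above, not the posted patch):

static void vhost_vdpa_suspend(struct vhost_dev *dev)
{
    struct vhost_vdpa *v = dev->opaque;
    int r;

    if (!vhost_vdpa_first_dev(dev)) {
        return;
    }

    if (dev->backend_cap & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) {
        trace_vhost_vdpa_suspend(dev);
        r = ioctl(v->device_fd, VHOST_VDPA_SUSPEND);
        if (unlikely(r)) {
            error_report("Cannot suspend: %s(%d)", g_strerror(errno), errno);
        } else {
            v->suspended = true;
            return;
        }
    }

    /* No suspend support, or the suspend failed: fall back to reset. */
    vhost_vdpa_reset_device(dev);
}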
Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator
On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote: On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer wrote: Decouples the IOVA allocator from the IOVA->HVA tree and instead adds the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree will hold all IOVA ranges that have been allocated (e.g. in the IOVA->HVA tree) and are removed when any IOVA ranges are deallocated. A new API function vhost_iova_tree_insert() is also created to add a IOVA->HVA mapping into the IOVA->HVA tree. I think this is a good first iteration but we can take steps to simplify it. Also, it is great to be able to make points on real code instead of designs on the air :). I expected a split of vhost_iova_tree_map_alloc between the current vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or similar. Similarly, a vhost_iova_tree_remove and vhost_iova_tree_remove_gpa would be needed. The first one is used for regions that don't exist in the guest, like SVQ vrings or CVQ buffers. The second one is the one used by the memory listener to map the guest regions into the vdpa device. Implementation wise, only two trees are actually needed: * Current iova_taddr_map that contains all IOVA->vaddr translations as seen by the device, so both allocation functions can work on a single tree. The function iova_tree_find_iova keeps using this one, so the I thought we had thorough discussion about this and agreed upon the decoupled IOVA allocator solution. But maybe I missed something earlier, I am not clear how come this iova_tree_find_iova function could still work with the full IOVA-> HVA tree when it comes to aliased memory or overlapped HVAs? Granted, for the memory map removal in the .region_del() path, we could rely on the GPA tree to locate the corresponding IOVA, but how come the translation path could figure out which IOVA range to return when the vaddr happens to fall in an overlapped HVA range? Do we still assume some overlapping order so we always return the first match from the tree? Or we expect every current user of iova_tree_find_iova should pass in GPA rather than HVA and use the vhost_iova_xxx_gpa API variant to look up IOVA? Thanks, -Siwei user does not need to know if the address is from the guest or only exists in QEMU by using RAMBlock etc. All insert and remove functions use this tree. * A new tree that relates IOVA to GPA, that only vhost_iova_tree_map_alloc_gpa and vhost_iova_tree_remove_gpa uses. The ideal case is that the key in this new tree is the GPA and the value is the IOVA. But IOVATree's DMA is named the reverse: iova is the key and translated_addr is the vaddr. We can create a new tree struct for that, use GTree directly, or translate the reverse linearly. As memory add / remove should not be frequent, I think the simpler is the last one, but I'd be ok with creating a new tree. vhost_iova_tree_map_alloc_gpa needs to add the map to this new tree also. Similarly, vhost_iova_tree_remove_gpa must look for the GPA in this tree, and only remove the associated DMAMap in iova_taddr_map that matches the IOVA. Does it make sense to you? Signed-off-by: Jonah Palmer --- hw/virtio/vhost-iova-tree.c | 38 - hw/virtio/vhost-iova-tree.h | 1 + hw/virtio/vhost-vdpa.c | 31 -- net/vhost-vdpa.c| 13 +++-- 4 files changed, 70 insertions(+), 13 deletions(-) diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c index 3d03395a77..32c03db2f5 100644 --- a/hw/virtio/vhost-iova-tree.c +++ b/hw/virtio/vhost-iova-tree.c @@ -28,12 +28,17 @@ struct VhostIOVATree { /* IOVA address to qemu memory maps. 
*/ IOVATree *iova_taddr_map; + +/* IOVA tree (IOVA allocator) */ +IOVATree *iova_map; }; /** - * Create a new IOVA tree + * Create a new VhostIOVATree with a new set of IOVATree's: s/IOVA tree/VhostIOVATree/ is good, but I think the rest is more an implementation detail. + * - IOVA allocator (iova_map) + * - IOVA->HVA tree (iova_taddr_map) * - * Returns the new IOVA tree + * Returns the new VhostIOVATree */ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last) { @@ -44,6 +49,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last) tree->iova_last = iova_last; tree->iova_taddr_map = iova_tree_new(); +tree->iova_map = iova_tree_new(); return tree; } @@ -53,6 +59,7 @@ VhostIOVATree *vhost_iova_tree_new(hwaddr iova_first, hwaddr iova_last) void vhost_iova_tree_delete(VhostIOVATree *iova_tree) { iova_tree_destroy(iova_tree->iova_taddr_map); +iova_tree_destroy(iova_tree->iova_map); g_free(iova_tree); } @@ -88,13 +95,12 @@ int vhost_iova_tree_map_alloc(VhostIOVATree *tree, DMAMap *map) /* Some vhost devices do not like addr 0. Skip first page */ hwaddr iova_first = tree->iova_first ?: qemu_real_host_page_size(); -if (map->translated_addr + map->size < m
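To make the proposed split concrete, here is a rough sketch of the two entry points described above (the _gpa names follow that description; the gpa_insert/gpa_find helpers and their internals are assumptions, not code from this series):

/* Guest regions: also record GPA -> IOVA so removal can be keyed on GPA. */
int vhost_iova_tree_map_alloc_gpa(VhostIOVATree *tree, DMAMap *map, hwaddr gpa)
{
    int r = vhost_iova_tree_map_alloc(tree, map);   /* existing allocator */

    if (r == IOVA_OK) {
        vhost_iova_tree_gpa_insert(tree, gpa, map->iova, map->size);
    }
    return r;
}

/* Memory listener removal path: find the IOVA by GPA, then drop the DMAMap. */
void vhost_iova_tree_remove_gpa(VhostIOVATree *tree, hwaddr gpa, hwaddr size)
{
    DMAMap needle = {
        .iova = vhost_iova_tree_gpa_find(tree, gpa, size),
        .size = size,               /* same inclusive-size convention as DMAMap */
    };

    vhost_iova_tree_remove(tree, needle);
}

SVQ vrings and CVQ buffers, which have no GPA, would keep using the existing vhost_iova_tree_map_alloc()/vhost_iova_tree_remove() directly.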
Re: [RFC 1/2] vhost-vdpa: Decouple the IOVA allocator
On 8/30/2024 1:05 AM, Eugenio Perez Martin wrote: On Fri, Aug 30, 2024 at 6:20 AM Si-Wei Liu wrote: On 8/29/2024 9:53 AM, Eugenio Perez Martin wrote: On Wed, Aug 21, 2024 at 2:56 PM Jonah Palmer wrote: Decouples the IOVA allocator from the IOVA->HVA tree and instead adds the allocated IOVA range to an IOVA-only tree (iova_map). This IOVA tree will hold all IOVA ranges that have been allocated (e.g. in the IOVA->HVA tree) and are removed when any IOVA ranges are deallocated. A new API function vhost_iova_tree_insert() is also created to add a IOVA->HVA mapping into the IOVA->HVA tree. I think this is a good first iteration but we can take steps to simplify it. Also, it is great to be able to make points on real code instead of designs on the air :). I expected a split of vhost_iova_tree_map_alloc between the current vhost_iova_tree_map_alloc and vhost_iova_tree_map_alloc_gpa, or similar. Similarly, a vhost_iova_tree_remove and vhost_iova_tree_remove_gpa would be needed. The first one is used for regions that don't exist in the guest, like SVQ vrings or CVQ buffers. The second one is the one used by the memory listener to map the guest regions into the vdpa device. Implementation wise, only two trees are actually needed: * Current iova_taddr_map that contains all IOVA->vaddr translations as seen by the device, so both allocation functions can work on a single tree. The function iova_tree_find_iova keeps using this one, so the I thought we had thorough discussion about this and agreed upon the decoupled IOVA allocator solution. My interpretation of it is to leave the allocator as the current one, and create a new tree with GPA which is guaranteed to be unique. But we can talk over it of course. But maybe I missed something earlier, I am not clear how come this iova_tree_find_iova function could still work with the full IOVA-> HVA tree when it comes to aliased memory or overlapped HVAs? Granted, for the memory map removal in the .region_del() path, we could rely on the GPA tree to locate the corresponding IOVA, but how come the translation path could figure out which IOVA range to return when the vaddr happens to fall in an overlapped HVA range? That is not a problem, as they both translate to the same address at the device. Not sure I followed, it might return a wrong IOVA (range) which the host kernel may have conflict or unmatched attribute i.e. permission, size et al in the map. The most complicated situation is where we have a region contained in another region, and the requested buffer crosses them. If the IOVA tree returns the inner region, it will return the buffer chained with the rest of the content in the outer region. Not optimal, but solved either way. Don't quite understand what it means... So in this overlapping case, speaking of the expectation of the translation API, you would like to have all IOVA ranges that match the overlapped HVA to be returned? And then to rely on the user (caller) to figure out which one is correct? Wouldn't it be easier for the user (SVQ) to use the memory system API directly to figure out? As we are talking about API we may want to build it in a way generic enough to address all possible needs (which goes with what memory subsystem is capable of), rather than just look on the current usage which has kind of narrow scope. 
Although the virtio-net device doesn't work with aliased regions now, some other virtio device may, or maybe some day virtio-net would need to use aliased regions; then the API and the users (SVQ) would have to go through another round of significant refactoring due to the iova-tree's internal workings. I feel it's just too early or too tight to abstract the iova-tree layer and get the API customized for the current use case with a lot of limitations on how users should expect to use it. We need some more flexibility and ease of extensibility if we want to take the chance to get it rewritten, given it is not a lot of code that Jonah has shown here .. The only problem that comes to my mind is the case where the inner region is RO Yes, this is one of the examples around permission or size I mentioned above, which may present a conflicting view with the memory system or the kernel. Thanks, -Siwei and it is a write command, but I don't think we have this case in a sane guest. A malicious guest cannot do any harm this way anyway. Do we still assume some overlapping order so we always return the first match from the tree? Or do we expect every current user of iova_tree_find_iova to pass in a GPA rather than an HVA and use the vhost_iova_xxx_gpa API variant to look up the IOVA? No, iova_tree_find_iova should keep asking for vaddr, as the result is guaranteed to be there. Users of VhostIOVATree only need to modify how they add or remove regions, knowing if they come from the guest or not. As shown by this series, it is easier to do in that place than in translation. Th
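As a small illustration of the aliasing concern (the values and the host_ptr address are made up), two DMAMap entries can have overlapping translated_addr ranges while belonging to different IOVA ranges, so a vaddr-keyed iova_tree_find_iova() has to pick one of them:

uint8_t *host_ptr = (uint8_t *)0x7f0000000000;  /* made-up HVA of a RAM block
                                                   the guest maps at two GPAs */

DMAMap outer = { .iova = 0x100000, .translated_addr = (hwaddr)host_ptr,
                 .size = 0x1fff, .perm = IOMMU_RW };
DMAMap inner = { .iova = 0x800000, .translated_addr = (hwaddr)host_ptr + 0x1000,
                 .size = 0x0fff, .perm = IOMMU_RO };

/*
 * A lookup for host_ptr + 0x1800 matches both entries; only the caller's
 * knowledge of the GPA, or some ordering convention, can disambiguate,
 * which is what a GPA-keyed lookup would provide.
 */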