On 11/2/2023 3:12 AM, Si-Wei Liu wrote:
On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
Current memory operations like pinning may take a lot of time at the
destination. Currently they are done after the source of the migration
is stopped, and before the workload is resumed at the destination. This
is a period where neither traffic can flow nor the VM workload can
continue (downtime).
We can do better, as we know the memory layout of the guest RAM at the
destination from the moment the migration starts. Moving that operation
earlier allows QEMU to communicate the maps to the kernel while the
workload is still running on the source, so Linux can start mapping
them. Ideally, all the IOMMU is configured by then, but even if the vDPA
parent driver uses an on-chip IOMMU and .set_map we are still saving all
the pinning time.
I get what you want to say, though I'm not sure how pinning is relevant
to on-chip IOMMU and .set_map here; essentially pinning is required for
all parent vdpa drivers that perform DMA and hence don't want VM pages
to move around.
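For reference, my reading of the mechanism is roughly the following,
assuming the listener ends up in the shared struct from patches 1-13;
the hook signature below is my guess, not copied from the patches:

/*
 * Sketch only: register the vhost-vdpa memory listener as soon as the
 * incoming migration starts, so DMA maps (and hence page pinning)
 * happen while the source is still running.  Names are illustrative.
 */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/virtio/vhost-vdpa.h"

int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
{
    /*
     * The listener callbacks end up issuing VHOST_IOTLB_UPDATE messages
     * (or .set_map calls in the parent driver), which is where the
     * expensive pinning work happens.
     */
    memory_listener_register(&shared->listener, dma_as);
    return 0;
}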
Note that further device setup at the end of the migration may alter
the guest memory layout. But, same as the previous point, many
operations are still done incrementally, like memory pinning, so we are
saving time anyway.
The first bunch of patches just reorganizes the code, so memory-related
operation parameters are shared between all vhost_vdpa devices. This is
because the destination does not know which vhost_vdpa struct will have
the registered listener member, so it is easier to place them in a
shared struct than to keep them in the vhost_vdpa struct. Future
versions may squash or omit these patches.
It looks like this VhostVDPAShared facility (patches 1-13) is also what
I need in my SVQ descriptor group series [*], for which I've built a
similar construct. If possible please try to get this merged ASAP;
I'll rework my series on top of that.
[*]
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5
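Roughly speaking, judging from the subjects of patches 1-13, the shared
struct would end up carrying something like the fields below (types and
names are my guesses from the subjects, not copied from the patches):

/* Rough reconstruction of VhostVDPAShared from the patch subjects;
 * see patches 1-13 for the real definition. */
typedef struct VhostVDPAShared {
    int device_fd;                  /* vhost-vdpa character device fd */
    struct vhost_vdpa_iova_range iova_range;
    VhostIOVATree *iova_tree;       /* IOVA allocations for shadow data */
    MemoryListener listener;        /* registered once per device group */
    QLIST_HEAD(, vdpa_iommu) iommu_list;
    uint64_t backend_cap;
    bool iotlb_batch_begin_sent;
    bool shadow_data;
} VhostVDPAShared;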
Only tested with vdpa_sim. I'm sending this before the full benchmark,
as some work like [1] can be based on it, and Si-Wei agreed to
benchmark this series with his experience.
Haven't done the full benchmark compared to pre-map at destination yet,
Hi Eugenio,

I just noticed one thing that affects the performance benchmark for
this series in terms of migration total_time (to be fair, it's
mlx5_vdpa specific). It looks like iotlb map batching is not acked (via
vhost_vdpa_set_backend_cap) at the point of vhost_vdpa_load_setup,
effectively causing quite extensive time to be spent on hundreds of
dma_map calls from listener_register(), while the equivalent code had
been implemented in my destination pre-map patch [1]. Although I could
benchmark the current patchset by removing batching from my code, I
guess that's not the goal of this benchmark, right?

It would be best to have map batching in place, so the benchmarks for
both options can match. What do you think?
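To be concrete, the batching I mean is essentially the dance below
against the vhost-vdpa uapi (a minimal standalone sketch, not the
actual QEMU code; in QEMU, vhost_vdpa_dma_map() does the write() part
and vhost_vdpa_set_backend_cap() the feature ack):

/* Minimal sketch of vhost-vdpa iotlb map batching; error handling
 * omitted.  Without the feature ack every VHOST_IOTLB_UPDATE is
 * committed (and pinned) individually. */
#include <linux/vhost.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void iotlb_msg(int fd, uint8_t type, uint64_t iova,
                      uint64_t size, uint64_t uaddr)
{
    struct vhost_msg_v2 msg = {
        .type = VHOST_IOTLB_MSG_V2,
        .iotlb = {
            .type = type,
            .iova = iova,
            .size = size,
            .uaddr = uaddr,
            .perm = VHOST_ACCESS_RW,
        },
    };

    write(fd, &msg, sizeof(msg));
}

static void map_guest_memory(int fd)
{
    uint64_t features = 1ULL << VHOST_BACKEND_F_IOTLB_BATCH;

    /* Ack batching support before the first map. */
    ioctl(fd, VHOST_SET_BACKEND_FEATURES, &features);

    iotlb_msg(fd, VHOST_IOTLB_BATCH_BEGIN, 0, 0, 0);
    /* ... one VHOST_IOTLB_UPDATE per guest memory region ... */
    iotlb_msg(fd, VHOST_IOTLB_BATCH_END, 0, 0, 0);
}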
Thanks,
-Siwei
[1]
https://github.com/siwliu-kernel/qemu/commit/0ce225b0c7e618163ea09da3846c93c4de2f85ed#diff-45489c6f25dc36fd84e1cd28cbf3b8ff03301e2d24dadb6d1c334c9e8f14c00cR639
though an observation is that the destination QEMU seems to very easily
get stuck for a very long time while in the middle of pinning pages.
During this period, any client doing a read-only QMP query or executing
an HMP info command gets frozen indefinitely (subject to how large the
memory being pinned is). Is it possible to unblock those QMP requests
or HMP commands (at least the read-only ones) while in migration? Yield
from the load_setup coroutine and spawn another thread?
Having said that, I'm not sure if .load_setup is a good fit for what we
want to do. Searching all current users of .load_setup, either the job
can be done instantly or the task is time-bound without trapping into
the kernel for too long. Maybe pinning is too special a use case
here...
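As a strawman for the yield-and-spawn-a-thread idea, something along
the lines below might keep the main loop (and hence QMP/HMP) responsive
while the kernel pins pages.  The helper and struct names here are made
up, purely to illustrate the shape of it:

/* Illustrative only: run the blocking map/pin work in a worker thread
 * and yield the load_setup coroutine until it completes, so the main
 * loop keeps serving QMP/HMP. */
#include "qemu/osdep.h"
#include "qemu/thread.h"
#include "qemu/coroutine.h"
#include "block/aio.h"

typedef struct PinJob {
    Coroutine *co;       /* load_setup coroutine to wake when done */
    AioContext *ctx;     /* context that coroutine runs in */
    void (*work)(void *opaque);
    void *opaque;
} PinJob;

static void *pin_worker(void *arg)
{
    PinJob *job = arg;

    job->work(job->opaque);              /* e.g. the dma_map/pin loop */
    aio_co_schedule(job->ctx, job->co);  /* resume the coroutine */
    return NULL;
}

static void coroutine_fn pin_in_thread(void (*work)(void *), void *opaque)
{
    QemuThread thread;
    PinJob job = {
        .co = qemu_coroutine_self(),
        .ctx = qemu_get_current_aio_context(),
        .work = work,
        .opaque = opaque,
    };

    qemu_thread_create(&thread, "vdpa-pin", pin_worker, &job,
                       QEMU_THREAD_DETACHED);
    qemu_coroutine_yield();              /* woken up by pin_worker() */
}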
-Siwei
Future directions on top of this series may include:
* Iterative migration of virtio-net devices, as it may reduce downtime
  per [1]. vhost-vdpa net can apply the configuration through CVQ in
  the destination while the source is still migrating.
* Move more things ahead of migration time, like DRIVER_OK.
* Check that the devices at the destination are valid, and cancel the
  migration if they are not.
[1]
https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/
Eugenio Pérez (18):
vdpa: add VhostVDPAShared
vdpa: move iova tree to the shared struct
vdpa: move iova_range to vhost_vdpa_shared
vdpa: move shadow_data to vhost_vdpa_shared
vdpa: use vdpa shared for tracing
vdpa: move file descriptor to vhost_vdpa_shared
vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
vdpa: move backend_cap to vhost_vdpa_shared
vdpa: remove msg type of vhost_vdpa
vdpa: move iommu_list to vhost_vdpa_shared
vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
vdpa: use dev_shared in vdpa_iommu
vdpa: move memory listener to vhost_vdpa_shared
vdpa: do not set virtio status bits if unneeded
vdpa: add vhost_vdpa_load_setup
vdpa: add vhost_vdpa_net_load_setup NetClient callback
vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
virtio_net: register incremental migration handlers
include/hw/virtio/vhost-vdpa.h | 43 +++++---
include/net/net.h | 4 +
hw/net/virtio-net.c | 23 +++++
hw/virtio/vdpa-dev.c | 7 +-
hw/virtio/vhost-vdpa.c | 183 ++++++++++++++++++---------------
net/vhost-vdpa.c | 127 ++++++++++++-----------
hw/virtio/trace-events | 14 +--
7 files changed, 239 insertions(+), 162 deletions(-)