On 11/2/2023 5:37 AM, Eugenio Perez Martin wrote:
On Thu, Nov 2, 2023 at 11:13 AM Si-Wei Liu <si-wei....@oracle.com> wrote:


On 10/19/2023 7:34 AM, Eugenio Pérez wrote:
Current memory operations like pinning may take a lot of time at the destination. Currently they are done after the source of the migration is stopped and before the workload is resumed at the destination. This is a period where neither traffic can flow nor the VM workload can continue (downtime).

We can do better, as we know the memory layout of the guest RAM at the destination from the moment the migration starts. Moving that operation earlier allows QEMU to communicate the maps to the kernel while the workload is still running on the source, so Linux can start mapping them. Ideally all of the IOMMU is configured, but even if the vDPA parent driver uses an on-chip IOMMU and .set_map we're still saving all the pinning time.
I get what you want to say, though I'm not sure how pinning is relevant to on-chip IOMMU and .set_map here; essentially, pinning is required for all parent vdpa drivers that perform DMA and hence don't want VM pages to move around.
Basically I'm highlighting that the work done under .set_map is not only pinning, but pinning is a significant fraction of it. It can be reworded or deleted, for sure.
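
For context, here is a minimal sketch of the kind of map request QEMU could push through the vhost-vdpa device fd at load_setup time, so the kernel starts pinning (or calling .set_map) while the source is still running. It relies on the existing VHOST_IOTLB_MSG_V2 / VHOST_IOTLB_UPDATE uAPI; the function name and the reduced error handling are illustrative, not the series' actual code:

/*
 * Illustrative only: issue a VHOST_IOTLB_UPDATE for one guest RAM
 * section through the vhost-vdpa device fd, so the kernel can start
 * pinning the pages before the source is stopped.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <linux/vhost_types.h>

static int early_map_section(int vdpa_fd, uint64_t iova,
                             uint64_t size, void *vaddr)
{
    struct vhost_msg_v2 msg;

    memset(&msg, 0, sizeof(msg));
    msg.type = VHOST_IOTLB_MSG_V2;
    msg.iotlb.iova = iova;
    msg.iotlb.size = size;
    msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr;
    msg.iotlb.perm = VHOST_ACCESS_RW;
    msg.iotlb.type = VHOST_IOTLB_UPDATE;

    /* The kernel pins (or .set_map's) the range as part of this write. */
    return write(vdpa_fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}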



Note that further device setup at the end of the migration may alter the guest memory layout. But, as with the previous point, many operations are still done incrementally, like memory pinning, so we're saving time anyway.


The first bunch of patches just reorganizes the code, so memory-related operation parameters are shared between all vhost_vdpa devices. This is because the destination does not know which vhost_vdpa struct will have the registered listener member, so it is easier to place them in a shared struct rather than keep them in the vhost_vdpa struct. Future versions may squash or omit these patches.
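
As a rough illustration, the shared struct could look something like the following; the field names are inferred from the patch subjects listed further down, not taken from the actual series, and the referenced types live in the QEMU tree:

/*
 * Sketch only: fields inferred from the patch subjects; the actual
 * VhostVDPAShared definition may differ.  VhostIOVATree, MemoryListener,
 * QLIST_HEAD and struct vhost_vdpa_iova_range come from the QEMU tree
 * and its imported Linux headers.
 */
typedef struct VhostVDPAShared {
    int device_fd;                      /* vhost-vdpa char device fd      */
    struct vhost_vdpa_iova_range iova_range;
    VhostIOVATree *iova_tree;           /* IOVA -> HVA mappings for SVQ   */
    MemoryListener listener;            /* guest RAM updates -> DMA maps  */
    QLIST_HEAD(, vdpa_iommu) iommu_list;
    uint64_t backend_cap;
    bool iotlb_batch_begin_sent;
    bool shadow_data;                   /* shadow virtqueue data path     */
} VhostVDPAShared;
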
It looks like this VhostVDPAShared facility (patches 1-13) is also what I need in my SVQ descriptor group series [*], for which I've built a similar construct. If possible, please try to merge this in ASAP. I'll rework my series on top of that.

[*]
https://github.com/siwliu-kernel/qemu/commit/813518354af5ee8a6e867b2bf7dff3d6004fbcd5

I can send it individually, for sure.

MST, Jason, can this first part be merged? It doesn't add a lot by
itself but it helps pave the way for future changes.
If it cannot, it doesn't matter. I can pick it from here and get my series posted with your patches 1-13 applied upfront. This should work, I think?



Only tested with vdpa_sim. I'm sending this before a full benchmark, as some work like [1] can be based on it, and Si-Wei agreed to benchmark this series with his experience.
Haven't done the full benchmark against pre-mapping at the destination yet, though one observation is that the destination QEMU gets stuck very easily for a very long time while in the middle of pinning pages. During this period, any client doing a read-only QMP query or executing an HMP info command gets frozen indefinitely (depending on how much memory is being pinned). Is it possible to unblock those QMP requests or HMP commands (at least the read-only ones) while in migration? Yield from the load_setup coroutine and spawn another thread?

Ok, I wasn't aware of that.

I think we cannot yield in a coroutine and wait for an ioctl.
I was wondering if we need a separate coroutine out of the general migration path to support this special code without overloading load_setup or its callers. For instance, unblocking the source from sending guest RAM while allowing the destination to pin pages in parallel should be possible.

Regardless, a separate thread is needed to carry out all the heavy lifting, i.e. the ioctl(2) or write(2) syscalls to map and pin pages.


One
option that came to my mind is to effectively use another thread, and
use a POSIX barrier (or equivalent on glib / QEMU) before finishing
the migration.
Yes, a separate thread is needed anyway.

  I'm not sure if there are more points where we can
check the barrier and tell the migration to continue or stop though.
I think there is, e.g. what if the dma_map fails? There must be a checkpoint for that.
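
Something like the following pthread sketch could capture the thread-plus-barrier idea, including a failure flag so the migration completion path has a checkpoint for dma_map errors. All names here (do_all_dma_maps, start_early_maps, wait_for_early_maps) are hypothetical, and QEMU would likely use its own qemu_thread / GLib primitives rather than raw pthreads:

/*
 * Illustrative only: .load_setup kicks off the map/pin work in a
 * helper thread; the migration completion path waits on a barrier and
 * checks whether any map call failed before declaring success.
 */
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical helper: the ioctl/write loop that maps and pins guest RAM. */
int do_all_dma_maps(void *opaque);

static pthread_barrier_t map_done_barrier;
static bool map_failed;

static void *map_worker(void *opaque)
{
    if (do_all_dma_maps(opaque) < 0) {
        map_failed = true;
    }
    /* Release whoever is waiting for the maps to complete. */
    pthread_barrier_wait(&map_done_barrier);
    return NULL;
}

/* Called from .load_setup: start mapping while the source still runs. */
static void start_early_maps(void *state)
{
    pthread_t tid;

    pthread_barrier_init(&map_done_barrier, NULL, 2);
    pthread_create(&tid, NULL, map_worker, state);
    pthread_detach(tid);
}

/* Called before finishing the migration: block until mapping is done. */
static int wait_for_early_maps(void)
{
    pthread_barrier_wait(&map_done_barrier);
    return map_failed ? -1 : 0;
}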


Another option is to start doing these ioctls in an asynchronous way, io_uring-cmd style, but I'd like to achieve this first.
Yes, io_uring or any async API could be another option, though this needs a new uAPI and additional kernel facilities to support it. Anyway, it's up to you to decide. :)

Regards,
-Siwei
Having said that, I'm not sure if .load_setup is a good fit for what we want to do. Looking at all current users of .load_setup, either the job can be done instantly or the task is time-bound without trapping into the kernel for too long. Maybe pinning is too special a use case here...

-Siwei


Future directions on top of this series may include:

* Iterative migration of virtio-net devices, as it may reduce downtime per [1]. vhost-vdpa net can apply the configuration through CVQ in the destination while the source is still migrating.
* Move more things ahead of migration time, like DRIVER_OK.
* Check that the devices of the destination are valid, and cancel the migration in case they are not.



[1] https://lore.kernel.org/qemu-devel/6c8ebb97-d546-3f1c-4cdd-54e23a566...@nvidia.com/T/



Eugenio Pérez (18):
    vdpa: add VhostVDPAShared
    vdpa: move iova tree to the shared struct
    vdpa: move iova_range to vhost_vdpa_shared
    vdpa: move shadow_data to vhost_vdpa_shared
    vdpa: use vdpa shared for tracing
    vdpa: move file descriptor to vhost_vdpa_shared
    vdpa: move iotlb_batch_begin_sent to vhost_vdpa_shared
    vdpa: move backend_cap to vhost_vdpa_shared
    vdpa: remove msg type of vhost_vdpa
    vdpa: move iommu_list to vhost_vdpa_shared
    vdpa: use VhostVDPAShared in vdpa_dma_map and unmap
    vdpa: use dev_shared in vdpa_iommu
    vdpa: move memory listener to vhost_vdpa_shared
    vdpa: do not set virtio status bits if unneeded
    vdpa: add vhost_vdpa_load_setup
    vdpa: add vhost_vdpa_net_load_setup NetClient callback
    vdpa: use shadow_data instead of first device v->shadow_vqs_enabled
    virtio_net: register incremental migration handlers

   include/hw/virtio/vhost-vdpa.h |  43 +++++---
   include/net/net.h              |   4 +
   hw/net/virtio-net.c            |  23 +++++
   hw/virtio/vdpa-dev.c           |   7 +-
   hw/virtio/vhost-vdpa.c         | 183 ++++++++++++++++++---------------
   net/vhost-vdpa.c               | 127 ++++++++++++-----------
   hw/virtio/trace-events         |  14 +--
   7 files changed, 239 insertions(+), 162 deletions(-)




