> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, September 25, 2019 2:03 AM

> On Tue, 24 Sep 2019 02:19:15 +0000
> "Tian, Kevin" <kevin.t...@intel.com> wrote:

> > > From: Tian, Kevin
> > > Sent: Friday, September 13, 2019 7:00 AM

> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Thursday, September 12, 2019 10:41 PM

> > > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:

> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Saturday, August 31, 2019 12:33 AM

> > > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:

> > > > > > > > From: Tian, Kevin
> > > > > > > > Sent: Friday, August 30, 2019 3:26 PM

> > > > > > > [...]

> > > > > > > > > How does QEMU handle the fact that IOVAs are potentially dynamic while performing the live portion of a migration? For example, each time a guest driver calls dma_map_page() or dma_unmap_page(), a MemoryRegionSection pops in or out of the AddressSpace for the device (I'm assuming a vIOMMU where the device AddressSpace is not system_memory). I don't see any QEMU code that intercepts that change in the AddressSpace such that the IOVA dirty pfns could be recorded and translated to GFNs. The vendor driver can't track these beyond getting an unmap notification, since it only knows the IOVA pfns, which can be re-used with different GFN backing. Once the DMA mapping is torn down, it seems those dirty pfns are lost in the ether. If this works in QEMU, please help me find the code that handles it.

> > > > > > > > I'm curious about this part too. Interestingly, I didn't find any log_sync callback registered by emulated devices in Qemu. It looks like dirty pages written by emulated DMAs are recorded in some implicit way. But KVM always reports dirty pages in GFN instead of IOVA, regardless of the presence of a vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs (the translation can be done when the DMA happens), then we don't need to worry about the transient mapping from IOVA to GFN. Along this way we also want a GFN-based dirty bitmap to be reported through VFIO, similar to what KVM does. For vendor drivers, that means translating from IOVA to HVA to GFN when tracking DMA activities on VFIO devices. IOVA->HVA is provided by VFIO. HVA->GFN could be provided by KVM, but I'm not sure whether it's exposed today.

> > > > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.

> > > > > > I thought it was bad enough that we have vendor drivers that depend on KVM, but designing a vfio interface that only supports a KVM interface is even more undesirable. I also note without comment that gfn_to_memslot() is a GPL symbol. Thanks,
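For reference, the hva_to_gfn_memslot() route mentioned above would look roughly like the sketch below in a vendor driver that already holds a struct kvm pointer. This is only a sketch, not GVT-g code: the example_ helper and its "not found" convention are invented, it assumes the two-argument kvm_for_each_memslot() iterator of that kernel generation, and its direct use of KVM internals is exactly the dependency being debated.

#include <linux/kvm_host.h>

/*
 * Hypothetical helper: resolve an HVA to a GFN by scanning KVM memslots
 * and applying hva_to_gfn_memslot() from kvm_host.h.  Returns (gfn_t)-1
 * if no memslot covers the HVA.  Locking is reduced to the srcu read
 * side for brevity.
 */
static gfn_t example_hva_to_gfn(struct kvm *kvm, unsigned long hva)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *slot;
	gfn_t gfn = (gfn_t)-1;
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	slots = kvm_memslots(kvm);
	kvm_for_each_memslot(slot, slots) {
		unsigned long start = slot->userspace_addr;
		unsigned long end = start + (slot->npages << PAGE_SHIFT);

		if (hva >= start && hva < end) {
			gfn = hva_to_gfn_memslot(hva, slot);
			break;
		}
	}
	srcu_read_unlock(&kvm->srcu, idx);

	return gfn;
}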
> > > > > Yes, it is bad, but sometimes inevitable. If you recall our discussions from 3 years back (when discussing the 1st mdev framework), there were similar hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when creating some shadow structures. gpa->hpa is definitely hypervisor-specific knowledge, which is easy in KVM (gpa->hva->hpa) but needs a hypercall in Xen. But VFIO already makes a KVM-only assumption when implementing vfio_{un}pin_page_external.

> > > > Where's the KVM assumption there? The MAP_DMA ioctl takes an IOVA and HVA. When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA to get an HPA and provide an array of HPA pfns back to the caller. The other vGPU mdev vendor manages to make use of this without KVM... the KVM interface used by GVT-g is GPL-only.

> > > To be clear, it's an assumption about host-based hypervisors, e.g. KVM. GUP is a perfect example, which doesn't work for Xen since DomU's memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA through Xen-specific hypercalls.

> > > > > So GVT-g has to maintain an internal abstraction layer to support both Xen and KVM. Maybe someday we will re-consider introducing some hypervisor abstraction layer in VFIO, if this issue starts to hurt other devices and the Xen folks are willing to support VFIO.

> > > > Once upon a time, we had a KVM-specific device assignment interface, i.e. legacy KVM device assignment. We developed VFIO specifically to get KVM out of the business of being a (bad) device driver. We do have some awareness and interaction between VFIO and KVM in the vfio-kvm pseudo device, but we still try to keep those interfaces generic. In some cases we're not very successful at that, see vfio_group_set_kvm(), but that's largely just a mechanism to associate a cookie with a group to be consumed by the mdev vendor driver such that it can work with kvm external to vfio. I don't intend to add further hypervisor awareness to vfio.

> > > > > Back to this IOVA issue, I discussed with Yan and we found another hypervisor-agnostic alternative, by learning from vhost. vhost is very similar to VFIO - DMA also happens in the kernel, while it already supports vIOMMU.

> > > > > Generally speaking, there are three paths of dirty page collection in Qemu so far (as previously noted, Qemu always tracks the dirty bitmap in GFN):

> > > > GFNs or simply PFNs within an AddressSpace?

> > > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps are updated directly when the guest memory is being updated. For example, PCI writes are completed through pci_dma_write, which goes through the vIOMMU to translate IOVA into GPA and then updates the bitmap through cpu_physical_memory_set_dirty_range.

> > > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation on the IOMMU AddressSpace.

> > > > > 2) Memory writes that are not tracked by Qemu are collected by registering a .log_sync() callback, which is invoked in the dirty logging process. Now there are two users: kvm and vhost.
> > > > > 2.1) KVM tracks CPU-side memory writes, through write-protection or EPT A/D bits (+PML). This part is always based on GFN and returned to Qemu when kvm_log_sync is invoked;

> > > > > 2.2) vhost tracks kernel-side DMA writes by interpreting the vring data structures. It maintains an internal iotlb which is synced with the Qemu vIOMMU through a specific interface:
> > > > >   - new vhost message types (VHOST_IOTLB_UPDATE/INVALIDATE) for Qemu to keep the vhost iotlb in sync
> > > > >   - a new VHOST_IOTLB_MISS message to notify Qemu in case of a miss in the vhost iotlb
> > > > >   - Qemu registers a log buffer with the kernel vhost driver, which updates the buffer (using the internal iotlb to get the GFN) when serving vring descriptors

> > > > > VFIO could also implement an internal iotlb, so vendor drivers can utilize the iotlb to update the GFN-based dirty bitmap. Ideally we don't need to re-invent another iotlb protocol as vhost does. The vIOMMU already sends map/unmap ioctl cmds upon any change of IOVA mapping. We may introduce a v2 map/unmap interface, allowing Qemu to pass {iova, gpa, hva} together to keep the internal iotlb in sync. But we may also need an iotlb_miss_upcall interface, if VFIO doesn't want to cache full-size vIOMMU mappings.

> > > > > Definitely this alternative needs more work and is possibly less performant (if maintaining a small-size iotlb) than calling straight into a KVM interface. But the gain is also obvious, since it is fully contained within VFIO.

> > > > > Thoughts? :-)

> > > > So vhost must then be configuring a listener across system memory rather than only against the device AddressSpace like we do in vfio, such that it gets log_sync() callbacks for the actual GPA space rather than only the IOVA space. OTOH, QEMU could understand that the device AddressSpace has a translate function and apply the IOVA dirty bits to the system memory AddressSpace. Wouldn't it make more sense for QEMU to perform a log_sync() prior to removing a MemoryRegionSection within an AddressSpace and update the GPA rather than pushing GPA awareness and potentially large tracking structures into the host kernel? Thanks,
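A rough sketch of that QEMU-side direction, under heavy assumptions: suppose QEMU registered an IOMMU notifier on the device AddressSpace and, on each vIOMMU unmap, pulled the device's dirty bits for the affected IOVA range and folded them into the GPA-based migration bitmap while the translation is still known. The example_ helpers (the dirty-bitmap query and the IOVA-to-GPA lookup) stand in for interfaces that do not exist today; IOMMUTLBEntry and cpu_physical_memory_set_dirty_range() are the existing QEMU pieces such code would lean on.

/* Sketch only: sync device dirty bits at vIOMMU unmap time and convert
 * them from IOVA to GPA before the mapping goes stale. */
static void example_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
{
    hwaddr iova = iotlb->iova;
    hwaddr size = iotlb->addr_mask + 1;
    long i, npages = size >> TARGET_PAGE_BITS;
    unsigned long *dirty;

    /* Hypothetical query: which pages in [iova, iova + size) did the
     * device write?  This is the kernel interface under discussion. */
    dirty = example_vfio_query_dirty_bitmap(iova, size);

    for (i = 0; i < npages; i++) {
        hwaddr gpa;

        if (!test_bit(i, dirty)) {
            continue;
        }
        /* Hypothetical lookup against QEMU's cached IOVA->GPA mapping,
         * still valid at this point in the unmap path. */
        if (example_iova_to_gpa(iova + (i << TARGET_PAGE_BITS), &gpa)) {
            continue;
        }
        /* Mark the guest page dirty for migration (glossing over the
         * GPA -> ram_addr_t conversion done on the real write path). */
        cpu_physical_memory_set_dirty_range(gpa, TARGET_PAGE_SIZE,
                                            1 << DIRTY_MEMORY_MIGRATION);
    }
    g_free(dirty);
}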
> > Hi, Alex,

> > I moved the VFIO-related discussion back to this thread, to avoid mixing it with the vhost-related discussions here:

> > https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg03126.html

> > Your latest reply still prefers the userspace approach:

> > > Same as last time, you're asking VFIO to be aware of an entirely new address space and implement tracking structures of that address space to make life easier for QEMU. Don't we typically push such complexity to userspace rather than into the kernel? I'm not convinced. Thanks,

> > I answered two points but didn't hear your further thoughts. Can you take a look and respond?

> > The first point is about complexity and performance:

> > > Is it really complex? No need for a new tracking structure. Just allow the MAP interface to carry a new parameter and then record it in the existing vfio_dma objects.

> > > Note the frequency of guest DMA map/unmap could be very high. We saw >100K invocations per second with a 40G NIC. To do the right translation, Qemu requires a log_sync for every unmap, before the mapping for the logged dirty IOVA becomes stale. In Kirti's current patch, each log_sync requires several system calls through the migration info, e.g. setting start_pfn/page_size/total_pfns and then reading data_offset/data_size. That design is fine for doing log_sync in every pre-copy round, but too costly if doing so for every IOVA unmap. If a small extension in the kernel can lead to a great overhead reduction, why not?

> You're citing a workload that already performs abysmally with vfio and a vIOMMU; we cannot handle those rates efficiently with the current vfio DMA API. The current use cases of a vIOMMU with vfio are predominantly for nesting vfio, ex. DPDK/SPDK, where we assume the mappings are relatively static or else performance problems are already very apparent. In that sort of model, I don't see that QEMU doing a log_sync on unmap is really an issue; unmaps should be relatively rare. Of course I don't want to compound the issue, but the current vfio DMA mapping interface needs to be scrapped to make this remotely performant even before we look at migration performance, so does it really make sense to introduce GPAs for a workload the ioctls are unsuited for?

> > The second point is about write-protection:

> > > There is another value in recording the GPA in VFIO. Vendor drivers (e.g. GVT-g) may need to selectively write-protect guest memory pages when interpreting certain workload descriptors. Those pages are recorded by IOVA when the vIOMMU is enabled, however the KVM write-protection API only knows about GPAs. So currently the vIOMMU must be disabled on Intel vGPUs when GVT-g is enabled. To make it work we need a way to translate IOVA into GPA in the vendor drivers. There are two options. One is having KVM export a new API for such translation purposes. But as you explained earlier, it's not good to have vendor drivers depend on KVM. The other is having VFIO maintain such knowledge through an extended MAP interface, then providing a uniform API for all vendor drivers to use.

> So the argument is that in order to interact with KVM (write-protecting guest memory) there's a missing feature (IOVA to GPA translation), but we don't want to add an API to KVM for this feature because that would create a dependency on KVM (for interacting with KVM), so let's add an API to vfio instead. That makes no sense to me. What am I missing? Thanks,
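To make the "extended MAP interface" above concrete, here is a minimal sketch of what the proposal appears to imply for the type1 uAPI and its bookkeeping. The GPA flag and fields are hypothetical additions; argsz/flags/vaddr/iova/size and the vfio_dma object are the existing pieces.

#include <linux/types.h>

/* Existing VFIO type1 MAP payload, extended with an optional GPA so that
 * QEMU can pass {iova, gpa, hva} together when a vIOMMU is present. */
struct vfio_iommu_type1_dma_map_v2 {
	__u32	argsz;
	__u32	flags;
#define VFIO_DMA_MAP_FLAG_READ	(1 << 0)	/* existing */
#define VFIO_DMA_MAP_FLAG_WRITE	(1 << 1)	/* existing */
#define VFIO_DMA_MAP_FLAG_GPA	(1 << 2)	/* hypothetical: gpa field is valid */
	__u64	vaddr;				/* HVA */
	__u64	iova;				/* IOVA (equals GPA without vIOMMU) */
	__u64	size;
	__u64	gpa;				/* hypothetical: guest-physical address */
};

/* In vfio_iommu_type1, the existing per-mapping object would simply
 * remember the extra parameter (other existing fields omitted). */
struct vfio_dma {
	dma_addr_t	iova;
	unsigned long	vaddr;
	size_t		size;
	u64		gpa;			/* hypothetical */
	bool		has_gpa;		/* hypothetical */
};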
Then do you have a recommendation for how such a feature can be implemented cleanly in a vendor driver, without introducing a direct dependency on KVM?

Thanks,
Kevin
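For illustration, if the extended MAP interface above did record a GPA in each vfio_dma, the uniform lookup described earlier might be as small as the sketch below. The example_ name and the gpa/has_gpa fields are hypothetical; vfio_find_dma() and the iommu lock are the existing type1 internals it would reuse, and exposing it to mdev vendor drivers would still need plumbing comparable to what vfio_pin_pages() has today.

/* Hypothetical VFIO-side IOVA -> GPA lookup built on the extended MAP
 * records, intended to replace a direct KVM call in vendor drivers. */
static int example_vfio_iova_to_gpa(struct vfio_iommu *iommu,
				    dma_addr_t iova, u64 *gpa)
{
	struct vfio_dma *dma;
	int ret = -ENOENT;

	mutex_lock(&iommu->lock);
	dma = vfio_find_dma(iommu, iova, 1);
	if (dma && dma->has_gpa) {
		*gpa = dma->gpa + (iova - dma->iova);
		ret = 0;
	}
	mutex_unlock(&iommu->lock);

	return ret;
}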