> From: Tian, Kevin
> Sent: Friday, September 13, 2019 7:00 AM
>
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, September 12, 2019 10:41 PM
> >
> > On Tue, 3 Sep 2019 06:57:27 +0000
> > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> >
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > >
> > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > > >
> > > > > > From: Tian, Kevin
> > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > >
> > > > > [...]
> > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > > > > > > dynamic while performing the live portion of a migration?
> > > > > > > For example, each time a guest driver calls dma_map_page()
> > > > > > > or dma_unmap_page(), a MemoryRegionSection pops in or out of
> > > > > > > the AddressSpace for the device (I'm assuming a vIOMMU where
> > > > > > > the device AddressSpace is not system_memory).  I don't see
> > > > > > > any QEMU code that intercepts that change in the AddressSpace
> > > > > > > such that the IOVA dirty pfns could be recorded and translated
> > > > > > > to GFNs.  The vendor driver can't track these beyond getting
> > > > > > > an unmap notification since it only knows the IOVA pfns, which
> > > > > > > can be re-used with different GFN backing.  Once the DMA
> > > > > > > mapping is torn down, it seems those dirty pfns are lost in
> > > > > > > the ether.  If this works in QEMU, please help me find the
> > > > > > > code that handles it.
> > > > > >
> > > > > > I'm curious about this part too. Interestingly, I didn't find
> > > > > > any log_sync callback registered by emulated devices in Qemu.
> > > > > > It looks like dirty pages from emulated DMAs are recorded in
> > > > > > some implicit way. But KVM always reports dirty pages in GFN
> > > > > > instead of IOVA, regardless of the presence of vIOMMU. If Qemu
> > > > > > also tracks dirty pages in GFN for emulated DMAs (translation
> > > > > > can be done when DMA happens), then we don't need to worry
> > > > > > about the transient mapping from IOVA to GFN. Along this way we
> > > > > > also want the GFN-based dirty bitmap to be reported through
> > > > > > VFIO, similar to what KVM does. For vendor drivers, this means
> > > > > > translating from IOVA to HVA to GFN when tracking DMA
> > > > > > activities on VFIO devices. IOVA->HVA is provided by VFIO. For
> > > > > > HVA->GFN, it can be provided by KVM but I'm not sure whether
> > > > > > it's exposed now.
> > > > >
> > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> > > >
> > > > I thought it was bad enough that we have vendor drivers that depend
> > > > on KVM, but designing a vfio interface that only supports a KVM
> > > > interface is more undesirable.  I also note without comment that
> > > > gfn_to_memslot() is a GPL symbol.  Thanks,
> > >
> > > Yes, it is bad, but sometimes inevitable. If you recall our
> > > discussions from 3 years back (when discussing the 1st mdev
> > > framework), there were similar hypervisor dependencies in GVT-g,
> > > e.g. querying gpa->hpa when creating some shadow structures.
> > > gpa->hpa is definitely hypervisor-specific knowledge, which is easy
> > > in KVM (gpa->hva->hpa) but needs a hypercall in Xen. But VFIO
> > > already makes an assumption based on the KVM-only flavor when
> > > implementing vfio_{un}pin_page_external.
> >
> > Where's the KVM assumption there?
> > The MAP_DMA ioctl takes an IOVA and HVA.  When an mdev vendor driver
> > calls vfio_pin_pages(), we GUP the HVA to get an HPA and provide an
> > array of HPA pfns back to the caller.  The other vGPU mdev vendor
> > manages to make use of this without KVM... the KVM interface used by
> > GVT-g is GPL-only.
>
> To be clear, it's an assumption on host-based hypervisors, e.g. KVM.
> GUP is a perfect example, which doesn't work for Xen since DomU's
> memory doesn't belong to Dom0.  VFIO in Dom0 has to find the HPA
> through Xen-specific hypercalls.
>
> > > So GVT-g has to maintain an internal abstraction layer to support
> > > both Xen and KVM.  Maybe someday we will re-consider introducing
> > > some hypervisor abstraction layer in VFIO, if this issue starts to
> > > hurt other devices and the Xen guys are willing to support VFIO.
> >
> > Once upon a time, we had a KVM specific device assignment interface,
> > ie. legacy KVM device assignment.  We developed VFIO specifically to
> > get KVM out of the business of being a (bad) device driver.  We do
> > have some awareness and interaction between VFIO and KVM in the
> > vfio-kvm pseudo device, but we still try to keep those interfaces
> > generic.  In some cases we're not very successful at that, see
> > vfio_group_set_kvm(), but that's largely just a mechanism to associate
> > a cookie with a group to be consumed by the mdev vendor driver such
> > that it can work with kvm external to vfio.  I don't intend to add
> > further hypervisor awareness to vfio.
> >
> > > Back to this IOVA issue, I discussed with Yan and we found another
> > > hypervisor-agnostic alternative, by learning from vhost.  vhost is
> > > very similar to VFIO - DMA also happens in the kernel, while it
> > > already supports vIOMMU.
> > >
> > > Generally speaking, there are three paths of dirty page collection
> > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > bitmap in GFN):
> >
> > GFNs or simply PFNs within an AddressSpace?
> >
> > > 1) Qemu-tracked memory writes (e.g. emulated DMAs).  Dirty bitmaps
> > > are updated directly when the guest memory is being updated.  For
> > > example, PCI writes are completed through pci_dma_write, which goes
> > > through the vIOMMU to translate IOVA into GPA and then updates the
> > > bitmap through cpu_physical_memory_set_dirty_range.
> >
> > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > on the IOMMU AddressSpace.
> >
> > > 2) Memory writes that are not tracked by Qemu are collected by
> > > registering a .log_sync() callback, which is invoked in the dirty
> > > logging process.  Now there are two users: kvm and vhost.
> > >
> > > 2.1) KVM tracks CPU-side memory writes, through write-protection
> > > or EPT A/D bits (+PML).  This part is always based on GFN and
> > > returned to Qemu when kvm_log_sync is invoked;
> > >
> > > 2.2) vhost tracks kernel-side DMA writes, by interpreting the vring
> > > data structure.  It maintains an internal iotlb which is synced with
> > > the Qemu vIOMMU through a specific interface:
> > >   - new vhost message types (VHOST_IOTLB_UPDATE/INVALIDATE) for Qemu
> > >     to keep the vhost iotlb in sync;
> > >   - a new VHOST_IOTLB_MISS message to notify Qemu in case of a miss
> > >     in the vhost iotlb;
> > >   - Qemu registers a log buffer with the kernel vhost driver, and
> > >     the latter updates the buffer (using the internal iotlb to get
> > >     the GFN) when serving vring descriptors.
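(For reference, the vhost iotlb interface described above is the
message-based one from include/uapi/linux/vhost_types.h (older kernels
keep it in vhost.h).  The layout below is paraphrased from that header
for illustration only, so details may differ slightly between kernel
versions.)

#include <linux/types.h>

/*
 * Paraphrased from include/uapi/linux/vhost_types.h: the message that
 * Qemu and the kernel vhost driver exchange to keep the in-kernel iotlb
 * in sync with the vIOMMU.  uaddr is the HVA backing the IOVA range.
 */
struct vhost_iotlb_msg {
        __u64 iova;
        __u64 size;
        __u64 uaddr;
#define VHOST_ACCESS_RO      0x1
#define VHOST_ACCESS_WO      0x2
#define VHOST_ACCESS_RW      0x3
        __u8 perm;
#define VHOST_IOTLB_MISS           1    /* kernel -> Qemu: translation miss */
#define VHOST_IOTLB_UPDATE         2    /* Qemu -> kernel: add/refresh an entry */
#define VHOST_IOTLB_INVALIDATE     3    /* Qemu -> kernel: drop an entry */
#define VHOST_IOTLB_ACCESS_FAIL    4
        __u8 type;
};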
> > >
> > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > utilize the iotlb to update the GFN-based dirty bitmap.  Ideally we
> > > don't need to re-invent another iotlb protocol as vhost does.  The
> > > vIOMMU already sends map/unmap ioctl cmds upon any change of IOVA
> > > mapping.  We may introduce a v2 map/unmap interface, allowing Qemu
> > > to pass {iova, gpa, hva} together to keep the internal iotlb in
> > > sync.  But we may also need an iotlb_miss_upcall interface, if VFIO
> > > doesn't want to cache full-size vIOMMU mappings.
> > >
> > > Definitely this alternative needs more work and is possibly less
> > > performant (if maintaining a small-size iotlb) than straightforwardly
> > > calling into the KVM interface.  But the gain is also obvious, since
> > > it is fully contained within VFIO.
> > >
> > > Thoughts? :-)
> >
> > So vhost must then be configuring a listener across system memory
> > rather than only against the device AddressSpace like we do in vfio,
> > such that it gets log_sync() callbacks for the actual GPA space rather
> > than only the IOVA space.  OTOH, QEMU could understand that the device
> > AddressSpace has a translate function and apply the IOVA dirty bits to
> > the system memory AddressSpace.  Wouldn't it make more sense for QEMU
> > to perform a log_sync() prior to removing a MemoryRegionSection within
> > an AddressSpace and update the GPA rather than pushing GPA awareness
> > and potentially large tracking structures into the host kernel?
> > Thanks,
> >
Hi, Alex,

I moved the VFIO-related discussion back to this thread, so as not to
mix it with the vhost-related discussion here:
https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg03126.html

Your latest reply still prefers the userspace approach:

> > Same as last time, you're asking VFIO to be aware of an entirely new
> > address space and implement tracking structures of that address space
> > to make life easier for QEMU.  Don't we typically push such complexity
> > to userspace rather than into the kernel?  I'm not convinced.  Thanks,

I answered with two points there but haven't heard your further
thoughts.  Could you take a look and respond?

The first point is about complexity and performance:

> Is it really complex?  No new tracking structure is needed.  Just
> allow the MAP interface to carry a new parameter and then record it in
> the existing vfio_dma objects.
>
> Note the frequency of guest DMA map/unmap can be very high.  We saw
> more than 100K invocations per second with a 40G NIC.  To do the right
> translation, Qemu requires a log_sync for every unmap, before the
> mapping for the logged dirty IOVA becomes stale.  In Kirti's current
> patch, each log_sync requires several system calls through the
> migration info, e.g. setting start_pfn/page_size/total_pfns and then
> reading data_offset/data_size.  That design is fine for doing log_sync
> in every pre-copy round, but too costly if doing so for every IOVA
> unmap.  If a small extension in the kernel can lead to a great
> reduction in overhead, why not?

(A rough sketch of what such an extended MAP payload could look like is
appended at the end of this mail.)

The second point is about write-protection:

> There is another value in recording the GPA in VFIO.  Vendor drivers
> (e.g. GVT-g) may need to selectively write-protect guest memory pages
> when interpreting certain workload descriptors.  Those pages are
> recorded in IOVA when vIOMMU is enabled, however the KVM
> write-protection API only knows GPA.  So currently vIOMMU must be
> disabled on Intel vGPUs when GVT-g is enabled.  To make it work we
> need a way to translate IOVA into GPA in the vendor drivers.  There
> are two options.  One is having KVM export a new API for such
> translation purposes.  But as you explained earlier, it's not good to
> have vendor drivers depend on KVM.  The other is having VFIO maintain
> such knowledge through an extended MAP interface, then provide a
> uniform API for all vendor drivers to use.

Thanks,
Kevin
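P.S. To make the first point a bit more concrete, below is a rough
sketch of what the extended MAP interface could look like.  The existing
fields are paraphrased from struct vfio_iommu_type1_dma_map in
include/uapi/linux/vfio.h; the VFIO_DMA_MAP_FLAG_GPA flag and the gpa
field are only illustrative names for the proposal, not an existing or
agreed uapi.

#include <linux/types.h>

/*
 * Sketch only, not an existing uapi: a v2 MAP payload that lets Qemu
 * pass the GPA together with the IOVA/HVA pair, so the vfio iommu
 * driver can record it in the existing vfio_dma object and later report
 * dirty pages (or serve IOVA->GPA lookups for vendor drivers) in GFN
 * terms.
 */
struct vfio_iommu_type1_dma_map_v2 {
        __u32   argsz;
        __u32   flags;
#define VFIO_DMA_MAP_FLAG_READ  (1 << 0)        /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)        /* writable from device */
#define VFIO_DMA_MAP_FLAG_GPA   (1 << 2)        /* sketch: gpa field below is valid */
        __u64   vaddr;                          /* HVA (process virtual address) */
        __u64   iova;                           /* IOVA programmed in the vIOMMU */
        __u64   size;                           /* size of mapping (bytes) */
        __u64   gpa;                            /* sketch: guest physical address */
};

Without vIOMMU, Qemu would simply pass gpa == iova (or omit the new
flag), so the extension stays backward compatible with the existing
MAP_DMA semantics.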