> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, September 17, 2019 10:54 PM
>
> On Tue, 17 Sep 2019 08:48:36 +0000
> "Tian, Kevin" <kevin.t...@intel.com> wrote:
>
> > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > Sent: Monday, September 16, 2019 4:33 PM
> > >
> > > On 2019/9/16 9:51 AM, Tian, Kevin wrote:
> > > > Hi, Jason
> > > >
> > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU
> > > > is enabled:
> > > >
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html
> > > >
> > > > It's actually a similar model to vhost - Qemu cannot interpose the
> > > > fast-path DMAs and thus relies on the kernel part to track and report
> > > > dirty page information. Currently Qemu tracks dirty pages at GFN
> > > > level, thus demanding a translation from IOVA to GPA. The open
> > > > question in our discussion is where this translation should happen.
> > > > Doing the translation in the kernel implies a device iotlb flavor,
> > > > which is what vhost implements today. It requires potentially large
> > > > tracking structures in the host kernel, but leverages the existing
> > > > log_sync flow in Qemu. On the other hand, Qemu may perform log_sync
> > > > for every removal of an IOVA mapping and then do the translation
> > > > itself, avoiding the GPA awareness on the kernel side. That needs
> > > > some change to the current Qemu log_sync flow, and may bring more
> > > > overhead if IOVA is frequently unmapped.
> > > >
> > > > So we'd like to hear your opinions, especially about how you came
> > > > down to the current iotlb approach for vhost.
> > >
> > > We didn't consider this much when introducing vhost. Before IOTLB,
> > > vhost already knew GPAs through its mem table (GPA->HVA), so it was
> > > natural and easier to track dirty pages at GPA level, and it didn't
> > > require any changes to the existing ABI.
> >
> > This is the same situation as VFIO.
>
> It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
> some cases IOVA is GPA, but not all.
Well, I thought vhost had a similar design: the index of its mem table
is GPA when vIOMMU is off and becomes IOVA when vIOMMU is on. But I may
be wrong here. Jason, can you help clarify? I see two interfaces which
poke the mem table: VHOST_SET_MEM_TABLE (for GPA) and VHOST_IOTLB_UPDATE
(for IOVA). Are they used exclusively or together?

> > > For the VFIO case, the only advantage of using GPA is that the log
> > > can then be shared among all the devices that belong to the VM.
> > > Otherwise syncing through IOVA is cleaner.
> >
> > I still worry about the potential performance impact of this approach.
> > In the current mdev live migration series, multiple system calls are
> > involved when retrieving the dirty bitmap for a given memory range,
> > and IOVA mappings might change frequently. Though one may argue that
> > frequent IOVA changes already perform badly, it's still not good to
> > introduce further non-negligible overhead in such a situation.
> >
> > On the other hand, I realized that adding GPA awareness in VFIO is
> > actually easy. Today VFIO already maintains a full list of IOVAs and
> > their associated HVAs in vfio_dma structures, according to VFIO_MAP
> > and VFIO_UNMAP. As long as we allow those two operations to accept
> > another parameter (GPA), the IOVA->GPA mapping can be naturally cached
> > in the existing vfio_dma objects, which are always kept up to date by
> > the MAP and UNMAP ioctls. Qemu then uniformly retrieves the VFIO dirty
> > bitmap for the entire GPA range in every pre-copy round, regardless of
> > whether vIOMMU is enabled. There is no need for another IOTLB
> > implementation; the main ask is a v2 MAP/UNMAP interface.
> >
> > Alex, your thoughts?
>
> Same as last time, you're asking VFIO to be aware of an entirely new
> address space and implement tracking structures of that address space
> to make life easier for QEMU.  Don't we typically push such complexity
> to userspace rather than into the kernel?  I'm not convinced.  Thanks,

Is it really that complex? There is no need for a new tracking
structure. Just allow the MAP interface to carry a new parameter and
record it in the existing vfio_dma objects.

Note that the frequency of guest DMA map/unmap can be very high. We saw
more than 100K invocations per second with a 40G NIC. To do the
translation correctly, Qemu has to invoke log_sync for every unmap,
before the mapping for the logged dirty IOVAs becomes stale. In Kirti's
current patch series, each log_sync requires several system calls
through the migration info, e.g. setting start_pfn/page_size/total_pfns
and then reading data_offset/data_size. That design is fine for doing
log_sync once per pre-copy round, but too costly if we do it for every
IOVA unmap. If a small extension in the kernel can bring a large
reduction in overhead, why not?

Thanks
Kevin
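P.S. To make the "small extension" more concrete, below is a rough
sketch of what I have in mind. Please treat it as illustrative only:
the _v2 struct, the VFIO_DMA_MAP_FLAG_GPA flag and the gpa field are
made-up names for discussion, not a concrete uapi proposal.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical extension of struct vfio_iommu_type1_dma_map: the same
 * layout as today plus a trailing gpa field, guarded by a new flag so
 * old userspace keeps working (argsz/flags already allow this kind of
 * growth).
 */
struct vfio_iommu_type1_dma_map_v2 {
	__u32	argsz;
	__u32	flags;
#define VFIO_DMA_MAP_FLAG_GPA	(1 << 2)	/* hypothetical: gpa field is valid */
	__u64	vaddr;	/* process virtual address (HVA) */
	__u64	iova;	/* IO virtual address programmed by the (v)IOMMU */
	__u64	size;	/* size of mapping in bytes */
	__u64	gpa;	/* new: guest physical address backing this iova */
};

/*
 * Qemu side: called from the vIOMMU MAP notifier, where both the IOVA
 * and the GPA it translates to are already known.  The kernel would
 * simply record gpa in the matching vfio_dma object and drop it at
 * UNMAP time, so log_sync can report dirty bits directly in GPA terms.
 */
static int vfio_dma_map_with_gpa(int container_fd, uint64_t iova,
				 uint64_t gpa, uint64_t size, void *hva)
{
	struct vfio_iommu_type1_dma_map_v2 map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
			 VFIO_DMA_MAP_FLAG_GPA,
		.vaddr = (uintptr_t)hva,
		.iova  = iova,
		.size  = size,
		.gpa   = gpa,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

With something like that in place Qemu could retrieve the dirty bitmap
over the GPA range in every pre-copy round, with or without vIOMMU, and
no extra tracking structure would be needed in the kernel.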