On Tue, Jan 17, 2017 at 08:46:04AM -0700, Alex Williamson wrote:
> On Tue, 17 Jan 2017 22:00:00 +0800
> Peter Xu <pet...@redhat.com> wrote:
> 
> > On Mon, Jan 16, 2017 at 12:53:57PM -0700, Alex Williamson wrote:
> > > On Fri, 13 Jan 2017 11:06:39 +0800
> > > Peter Xu <pet...@redhat.com> wrote:
> > > 
> > > > This is preparation work to finally enable dynamic switching
> > > > ON/OFF for VT-d protection. The old VT-d code uses a static
> > > > IOMMU address space, and that won't satisfy the vfio-pci device
> > > > listeners.
> > > > 
> > > > Let me explain.
> > > > 
> > > > vfio-pci devices depend on the memory region listener and the
> > > > IOMMU replay mechanism to make sure the device mapping is
> > > > coherent with the guest even if there are domain switches. And
> > > > there are two kinds of domain switches:
> > > > 
> > > > (1) switch from domain A -> B
> > > > (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > > 
> > > > Case (1) is handled by the context entry invalidation handling
> > > > in the VT-d replay logic. What the replay function should do
> > > > here is to replay the existing page mappings in domain B.
> > > 
> > > There are really 2 steps here, right? Invalidate A, replay B. I
> > > think the code handles this, but I want to make sure. We don't
> > > want to end up with a superset of both A & B.
> > 
> > First of all, this discussion should be beyond this patch's scope,
> > since this patch currently only handles the case when the guest
> > disables DMAR in general.
> > 
> > Then, my understanding of the above question: when we do an A -> B
> > domain switch, the guest will not send specific context entry
> > invalidations for A, but it will for sure send one when the context
> > entry is ready for B. In that sense, IMO we don't have a clear "two
> > steps", only one, which is the latter "replay B". We do the correct
> > unmap based on the PSIs (page-selective invalidations) of A when
> > the guest unmaps the pages in A.
> > 
> > So, for the use case of nested device assignment (which is the goal
> > of this series for now):
> > 
> > - L1 guest puts devices D1,D2,... of the L2 guest into domain A
> > - L1 guest maps the L2 memory into the L1 address space
> >   (L2GPA -> L1GPA)
> > - ... (L2 guest runs, until it stops running)
> > - L1 guest unmaps all the pages in domain A
> > - L1 guest moves devices D1,D2,... of the L2 guest outside domain A
> > 
> > This series should work for the above, since before any device
> > leaves its domain, the domain will be clean, with all its pages
> > unmapped.
> > 
> > However, if we have the following scenario (which I don't know
> > whether it's achievable):
> > 
> > - guest iommu domain A has devices D1, D2
> > - guest iommu domain B has device D3
> > - move device D2 from domain A into B
> > 
> > Here when D2 moves from A to B, IIUC our current Linux IOMMU driver
> > code will not send any PSIs (page-selective invalidations) for D2
> > or domain A, because domain A still has a device in it; the guest
> > should only send a context entry invalidation for device D2,
> > telling us that D2 has switched to domain B. In that case, I am not
> > sure whether the current series can work properly, and IMHO we may
> > need to have the domain knowledge in the VT-d emulation code (which
> > we don't have yet) in the future to further support this kind of
> > domain switch.
> 
> This is a serious issue that needs to be resolved. The context entry
> invalidation when D2 is switched from A->B must unmap anything from
> domain A before the replay of domain B. Your example is easily
> achieved: for instance, what if domain A is the SI (static identity)
> domain for the L1 guest and domain B is the device assignment domain
> for the L2 guest, currently holding device D3? The user hot adds
> device D2 into the L2 guest, moving it from the L1 SI domain to the
> device assignment domain.
> vfio will not override existing mappings on replay; it will error,
> giving the L2 guest a device with access to the static identity
> mappings of the L1 host. This isn't acceptable.
> 
> > > On the invalidation, a future optimization when disabling an
> > > entire memory region might also be to invalidate the entire range
> > > at once rather than each individual mapping within the range,
> > > which I think is what happens now, right?
> > 
> > Right. IIUC this can be an enhancement to the current page walk
> > logic: we can coalesce contiguous IOTLB entries with the same
> > properties and notify only once for these coalesced entries.
> > 
> > Noted in my todo list.
> 
> A context entry invalidation as in the example above might make use
> of this to skip any sort of page walk logic, simply invalidating the
> entire address space.
Alex,

I got one more thing to ask: I was trying to invalidate the entire
address space by sending one big IOTLB notification to vfio-pci, which
looks like:

    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = 0,
        .translated_addr = 0,
        .addr_mask = (1ULL << 63) - 1,
        .perm = IOMMU_NONE,    /* UNMAP */
    };

Then I fed this entry to the vfio-pci IOMMU notifier. However, it was
blocked in vfio_iommu_map_notify() with the error:

    qemu-system-x86_64: iommu has granularity incompatible with target AS

since we have:

    /*
     * The IOMMU TLB entry we have just covers translation through
     * this IOMMU to its immediate target.  We need to translate
     * it the rest of the way through to memory.
     */
    rcu_read_lock();
    mr = address_space_translate(&address_space_memory,
                                 iotlb->translated_addr,
                                 &xlat, &len, iotlb->perm & IOMMU_WO);
    if (!memory_region_is_ram(mr)) {
        error_report("iommu map to non memory area %"HWADDR_PRIx"",
                     xlat);
        goto out;
    }
    /*
     * Translation truncates length to the IOMMU page size,
     * check that it did not truncate too much.
     */
    if (len & iotlb->addr_mask) {
        error_report("iommu has granularity incompatible with target AS");
        goto out;
    }

In my case len == 0xa0000 (that's the translation result), while
iotlb->addr_mask == (1ULL << 63) - 1, so (len & iotlb->addr_mask) is
non-zero. It looks like the translation above split the big region,
and a simple big UNMAP doesn't work for me. Do you have any suggestion
on how I can solve this? In what case will we need the above
address_space_translate()?

> > > > 
> > > > However for case (2), we don't want to replay any domain
> > > > mappings - we just need the default GPA->HPA mappings (the
> > > > address_space_memory mapping). And this patch helps on case (2)
> > > > to build up the mapping automatically by leveraging the
> > > > vfio-pci memory listeners.
> > > 
> > > Have you thought about using this address space switching to
> > > emulate ecap.PT? I.e.,
> > > advertise hardware-based passthrough so that the guest doesn't
> > > need to waste page table entries for a direct-mapped, static
> > > identity domain.
> > 
> > Kind of. Currently we still don't have iommu=pt for the emulated
> > code. We can achieve that by leveraging this patch.
> 
> Well, we have iommu=pt, but the L1 guest will implement this as a
> fully populated SI domain rather than as a bit in the context entry
> to do hardware direct translation. Given the mapping overhead through
> vfio, the L1 guest will always want to use iommu=pt, as dynamic
> mapping performance is going to be horrid. Thanks,

I see, so we have iommu=pt in the guest even if the VT-d emulation
does not provide that bit. Anyway, supporting ecap.PT is in my todo
list.

Thanks,

-- peterx