On 03/22/2014 01:17 AM, Alex Williamson wrote: > On Fri, 2014-03-21 at 18:59 +1100, Alexey Kardashevskiy wrote: >> On 03/20/2014 06:57 AM, Alex Williamson wrote: >>> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote: >>>> From: David Gibson <da...@gibson.dropbear.id.au> >>>> >>>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices >>>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has >>>> sufficient capability and granularity to match the guest side. This works >>>> by tracking all map and unmap operations on the guest IOMMU using the >>>> notifiers, and mirroring them into VFIO. >>>> >>>> There are a number of FIXMEs, and the scheme involves rather more notifier >>>> structures than I'd like, but it should make for a reasonable proof of >>>> concept. >>>> >>>> Signed-off-by: David Gibson <da...@gibson.dropbear.id.au> >>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> >>>> >>>> --- >>>> Changes: >>>> v4: >>>> * fixed list objects naming >>>> * vfio_listener_region_add() reworked to call memory_region_ref() from one >>>> place only, it is also easier to review the changes >>>> * fixes boundary check not to fail on sections == 2^64 bytes, >>>> the "vfio: Fix debug output for int128 values" patch is required; >>>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support >>>> for 128 bit memory section sizes" patch proposal >>>> --- >>>> hw/misc/vfio.c | 126 >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++--- >>>> 1 file changed, 120 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c >>>> index 038010b..4f6f5da 100644 >>>> --- a/hw/misc/vfio.c >>>> +++ b/hw/misc/vfio.c >>>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer { >>>> }; >>>> void (*release)(struct VFIOContainer *); >>>> } iommu_data; >>>> + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; >>>> QLIST_HEAD(, VFIOGroup) group_list; >>>> QLIST_ENTRY(VFIOContainer) next; >>>> } VFIOContainer; >>>> >>>> +typedef struct VFIOGuestIOMMU { >>>> + VFIOContainer *container; >>>> + MemoryRegion *iommu; >>>> + Notifier n; >>>> + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next; >>>> +} VFIOGuestIOMMU; >>>> + >>>> /* Cache of MSI-X setup plus extra mmap and memory region for split BAR >>>> map */ >>>> typedef struct VFIOMSIXInfo { >>>> uint8_t table_bar; >>>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, >>>> hwaddr iova, >>>> >>>> static bool vfio_listener_skipped_section(MemoryRegionSection *section) >>>> { >>>> - return !memory_region_is_ram(section->mr) || >>>> - /* >>>> + return (!memory_region_is_ram(section->mr) && >>>> + !memory_region_is_iommu(section->mr)) || >>>> + /* >>> >>> White space damage >>> >>>> * Sizing an enabled 64-bit BAR can cause spurious mappings to >>>> * addresses in the upper part of the 64-bit address space. >>>> These >>>> * are never accessed by the CPU and beyond the address width >>>> of >>>> @@ -2251,6 +2260,61 @@ static bool >>>> vfio_listener_skipped_section(MemoryRegionSection *section) >>>> section->offset_within_address_space & (1ULL << 63); >>>> } >>>> >>>> +static void vfio_iommu_map_notify(Notifier *n, void *data) >>>> +{ >>>> + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); >>>> + VFIOContainer *container = giommu->container; >>>> + IOMMUTLBEntry *iotlb = data; >>>> + MemoryRegion *mr; >>>> + hwaddr xlat; >>>> + hwaddr len = iotlb->addr_mask + 1; >>>> + void *vaddr; >>>> + int ret; >>>> + >>>> + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", >>>> + iotlb->iova, iotlb->iova + iotlb->addr_mask); >>>> + >>>> + /* >>>> + * The IOMMU TLB entry we have just covers translation through >>>> + * this IOMMU to its immediate target. We need to translate >>>> + * it the rest of the way through to memory. >>>> + */ >>>> + mr = address_space_translate(&address_space_memory, >>>> + iotlb->translated_addr, >>>> + &xlat, &len, iotlb->perm & IOMMU_WO); >>> >>> Write-only? Is this supposed to be read-write to mask just 2 bits? >> >> >> The last parameter of address_space_translate() bool is_write. So I do not >> really understand the problem here. > > Oops, my bad, I didn't look at what address_space_translate() used that > for. Ok. > >>>> + if (!memory_region_is_ram(mr)) { >>>> + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n", >>>> + xlat); >>>> + return; >>>> + } >>>> + if (len & iotlb->addr_mask) { >>>> + DPRINTF("iommu has granularity incompatible with target AS\n"); >>> >>> Is this possible? Assuming len is initially a power-of-2, would the >>> translate function change it? Maybe worth a comment to explain. >> >> >> Oh. address_space_translate() actually changes @len to min(len, >> TARGET_PAGE_SIZE) and TARGET_PAGE_SIZE is hardcoded to 4K. So far it was ok >> but lately I have been implementing a huge DMA window (plus one >> sPAPRTCETable and one VFIOGuestIOMMU objects) which currently operates with >> 16MB pages (can do 64K pages too) and now this "granularity incompatible" >> is happening. >> >> I disabled that check but I need to think of better fix... >> >> Adding Paolo to cc, may be he picks the context and gives good piece of >> advise :) >> >> >> >>> >>>> + return; >>>> + } >>>> + >>>> + vaddr = memory_region_get_ram_ptr(mr) + xlat; >>> >>> This lookup isn't free and the unmap path doesn't need it, maybe move >>> the variable and lookup into the first branch below? >>> >>>> + >>>> + if (iotlb->perm != IOMMU_NONE) { >>>> + ret = vfio_dma_map(container, iotlb->iova, >>>> + iotlb->addr_mask + 1, vaddr, >>>> + !(iotlb->perm & IOMMU_WO) || mr->readonly); >>>> + if (ret) { >>>> + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >>>> + "0x%"HWADDR_PRIx", %p) = %d (%m)", >>>> + container, iotlb->iova, >>>> + iotlb->addr_mask + 1, vaddr, ret); >>>> + } >>>> + } else { >>>> + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + >>>> 1); >>>> + if (ret) { >>>> + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " >>>> + "0x%"HWADDR_PRIx") = %d (%m)", >>>> + container, iotlb->iova, >>>> + iotlb->addr_mask + 1, ret); >>>> + } >>>> + } >>>> +} >>>> + >>>> static void vfio_listener_region_add(MemoryListener *listener, >>>> MemoryRegionSection *section) >>>> { >>>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener >>>> *listener, >>>> void *vaddr; >>>> int ret; >>>> >>>> - assert(!memory_region_is_iommu(section->mr)); >>>> - >>>> if (vfio_listener_skipped_section(section)) { >>>> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", >>>> section->offset_within_address_space, >>>> @@ -2286,15 +2348,47 @@ static void >>>> vfio_listener_region_add(MemoryListener *listener, >>>> return; >>>> } >>>> >>>> + memory_region_ref(section->mr); >>>> + >>>> + if (memory_region_is_iommu(section->mr)) { >>>> + VFIOGuestIOMMU *giommu; >>>> + >>>> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", >>>> + iova, int128_get64(int128_sub(llend, int128_one()))); >>>> + /* >>>> + * FIXME: We should do some checking to see if the >>>> + * capabilities of the host VFIO IOMMU are adequate to model >>>> + * the guest IOMMU >>>> + * >>>> + * FIXME: This assumes that the guest IOMMU is empty of >>>> + * mappings at this point - we should either enforce this, or >>>> + * loop through existing mappings to map them into VFIO. >>>> + * >>>> + * FIXME: For VFIO iommu types which have KVM acceleration to >>>> + * avoid bouncing all map/unmaps through qemu this way, this >>>> + * would be the right place to wire that up (tell the KVM >>>> + * device emulation the VFIO iommu handles to use). >>>> + */ >>> >>> That's a lot of FIXMEs... The second one in particular looks like it >>> needs to expand a bit on why this is likely a valid assumption. The >>> last one is more of a TODO than a FIXME. >>> >>>> + giommu = g_malloc0(sizeof(*giommu)); >>>> + giommu->iommu = section->mr; >>>> + giommu->container = container; >>>> + giommu->n.notify = vfio_iommu_map_notify; >>>> + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next); >>>> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); >>>> + >>>> + return; >>>> + } >>>> + >>>> + /* Here we assume that memory_region_is_ram(section->mr)==true */ >>>> + >>>> end = int128_get64(llend); >>>> vaddr = memory_region_get_ram_ptr(section->mr) + >>>> section->offset_within_region + >>>> (iova - section->offset_within_address_space); >>>> >>>> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", >>>> + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", >>>> iova, end - 1, vaddr); >>>> >>>> - memory_region_ref(section->mr); >>>> ret = vfio_dma_map(container, iova, end - iova, vaddr, >>>> section->readonly); >>>> if (ret) { >>>> error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >>>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener >>>> *listener, >>>> return; >>>> } >>>> >>>> + if (memory_region_is_iommu(section->mr)) { >>>> + VFIOGuestIOMMU *giommu; >>>> + >>>> + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) { >>>> + if (giommu->iommu == section->mr) { >>>> + memory_region_unregister_iommu_notifier(&giommu->n); >>>> + QLIST_REMOVE(giommu, giommu_next); >>>> + g_free(giommu); >>>> + break; >>>> + } >>>> + } >>>> + >>>> + /* >>>> + * FIXME: We assume the one big unmap below is adequate to >>>> + * remove any individual page mappings in the IOMMU which >>>> + * might have been copied into VFIO. That may not be true for >>>> + * all IOMMU types >>>> + */ >>> >>> We assume this because the IOVA that gets unmapped is the same >>> regardless of whether a guest IOMMU is present? >> >> >> What exactly is meant by "guest IOMMU is present"? Doing the second DMA >> window, now I am really confused about terminology :( > > The confusion for me is that add_region initializes the giommu and all > the DMA mapping through VFIO is done in the notifier for the giommu. > It's therefore asymmetric that add_region doesn't vfio_dma_map anything, > but region_del does vfio_dma_unmap, which is the basis of my question. > I thought this comment was trying to address why that is, but apparently > it's something else entirely, so it would be nice to understand why this > doesn't return() and decode a bit more clearly what the FIXME is trying > to say. Thanks,
I do not mind extending that comment but need some assistance. My understanding is that: === region_del is called on memory region removal so this particular window is not going to be used anymore and this is the only place where such cleanup could be done. === Asymmetry is here but it does not look terrible. > Alex > >>>> + } >>>> + >>>> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); >>>> end = (section->offset_within_address_space + >>>> int128_get64(section->size)) & >>>> TARGET_PAGE_MASK; -- Alexey