On 10/12/2016 3:36 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:34 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> ...
>> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
>> +{
>> +	struct vfio_device *device;
>> +	struct vfio_group *group;
>> +	int ret;
>> +
>> +	device = vfio_device_get_from_dev(dev);
>
> Note how this does dev->iommu_group->vfio_group->vfio_device and then
> we back out one level to get the vfio_group, it's not a terribly
> lightweight path.  Perhaps we should have:
>
> struct vfio_device *vfio_group_get_from_dev(struct device *dev)
> {
> 	struct iommu_group *iommu_group;
> 	struct vfio_group *group;
>
> 	iommu_group = iommu_group_get(dev);
> 	if (!iommu_group)
> 		return NULL;
>
> 	group = vfio_group_get_from_iommu(iommu_group);
> 	iommu_group_put(iommu_group);
>
> 	return group;
> }
>
> vfio_device_get_from_dev() would make use of this.
>
> Then create a separate:
>
> static int vfio_group_add_container_user(struct vfio_group *group)
> {
>
>> +	if (!atomic_inc_not_zero(&group->container_users)) {
> 		return -EINVAL;
>> +	}
>> +
>> +	if (group->noiommu) {
>> +		atomic_dec(&group->container_users);
> 		return -EPERM;
>> +	}
>> +
>> +	if (!group->container->iommu_driver ||
>> +	    !vfio_group_viable(group)) {
>> +		atomic_dec(&group->container_users);
> 		return -EINVAL;
>> +	}
>> +
> 	return 0;
> }
>
> vfio_group_get_external_user() would be updated to use this.  In fact,
> creating these two functions and updating the existing code to use
> these should be a separate patch.
>

Ok. I'll update.

> Note that your version did not hold a group reference while doing the
> pin/unpin operations below, which seems like a bug.
>

container->group_lock is held for pin/unpin. I think we then don't need to
hold a reference to the group, because groups are attached and detached
while holding this lock, right?

>> +
>> +err_ret:
>> +	vfio_device_put(device);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for local
>> + * domain only.
>> + * @dev [in] : device
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>> +		    long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +	struct vfio_container *container;
>> +	struct vfio_group *group;
>> +	struct vfio_iommu_driver *driver;
>> +	ssize_t ret = -EINVAL;
>> +
>> +	if (!dev || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	group = vfio_group_from_dev(dev);
>> +	if (IS_ERR(group))
>> +		return PTR_ERR(group);
>
> As suggested above:
>
> 	group = vfio_group_get_from_dev(dev);
> 	if (!group)
> 		return -ENODEV;
>
> 	ret = vfio_group_add_container_user(group)
> 	if (ret)
> 		vfio_group_put(group);
> 		return ret;
> 	}
>

Ok.

>> +
>> +	container = group->container;
>> +	if (IS_ERR(container))
>> +		return PTR_ERR(container);
>> +
>> +	down_read(&container->group_lock);
>> +
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->pin_pages))
>> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> +					     npage, prot, phys_pfn);
>> +
>> +	up_read(&container->group_lock);
>> +	vfio_group_try_dissolve_container(group);
>
> Even if you're considering that the container_user reference holds the
> driver, I think we need a group reference throughout this and this
> should end with a vfio_group_put(group);
>

Same as I mentioned above, container->group_lock is held here.

...
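For reference, a rough sketch of what vfio_pin_pages() could look like with
the two helpers suggested above and with a group reference held across the
whole operation. The error label and the exact placement of
vfio_group_put() below are illustrative only, not the final patch:

long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
		    long npage, int prot, unsigned long *phys_pfn)
{
	struct vfio_container *container;
	struct vfio_group *group;
	struct vfio_iommu_driver *driver;
	long ret;

	if (!dev || !user_pfn || !phys_pfn)
		return -EINVAL;

	/* hold a group reference for the whole pin operation */
	group = vfio_group_get_from_dev(dev);
	if (!group)
		return -ENODEV;

	ret = vfio_group_add_container_user(group);
	if (ret)
		goto err_put;

	container = group->container;
	down_read(&container->group_lock);

	driver = container->iommu_driver;
	if (likely(driver && driver->ops->pin_pages))
		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
					     npage, prot, phys_pfn);
	else
		ret = -EINVAL;

	up_read(&container->group_lock);
	vfio_group_try_dissolve_container(group);

err_put:
	vfio_group_put(group);	/* drop the reference taken above */
	return ret;
}

Here the container_user reference still pins the iommu driver, while the
group reference keeps the group itself alive until the final
vfio_group_put().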
>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
>> +					 long npage)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
>
> We need iommu->lock here, right?
>

Oh, yes.

>> +	domain = iommu->local_domain;
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p)
>> +			unlocked += vfio_unpin_pfn(domain, p, true);
>> +
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>
> We hold this mutex outside the loop in the pin unwind case, why is it
> different here?
>

pin_unwind is the error path, so it should be done in one go. This is not an
error case; holding the lock for long could block other threads if there are
multiple threads.

>> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> +			    size_t map_size)
>> +{
>> +	dma_addr_t iova = dma->iova;
>> +	unsigned long vaddr = dma->vaddr;
>> +	size_t size = map_size, dma_size = 0;
>> +	long npage;
>> +	unsigned long pfn;
>> +	int ret = 0;
>> +
>> +	while (size) {
>> +		/* Pin a contiguous chunk of memory */
>> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
>> +						size >> PAGE_SHIFT, dma->prot,
>> +						&pfn);
>> +		if (npage <= 0) {
>> +			WARN_ON(!npage);
>> +			ret = (int)npage;
>> +			break;
>> +		}
>> +
>> +		/* Map it! */
>> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
>> +				     dma->prot);
>> +		if (ret) {
>> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
>> +			break;
>> +		}
>> +
>> +		size -= npage << PAGE_SHIFT;
>> +		dma_size += npage << PAGE_SHIFT;
>> +	}
>> +
>> +	if (ret)
>> +		vfio_remove_dma(iommu, dma);
>
>
> There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> accurate to the point of failure, it's not updated until the success
> branch below, so it's never going to unmap/unpin anything.
>

Oops, yes. I'll fix this.

>> +	else {
>> +		dma->size = dma_size;
>> +		dma->iommu_mapped = true;
>> +		vfio_update_accounting(iommu, dma);
>
> I'm confused how this works, when called from vfio_dma_do_map() we're
> populating a vfio_dma, that is we're populating part of the iova space
> of the device.  How could we have pinned pfns in the local address
> space that overlap that?  It would be invalid to have such pinned pfns
> since that part of the iova space was not previously mapped.
>
> Another issue is that if there were existing overlaps, userspace would
> need to have locked memory limits sufficient for this temporary double
> accounting.  I'm not sure how they'd come up with heuristics to handle
> that since we're potentially looking at the bulk of VM memory in a
> single vfio_dma entry.
>

I see that when QEMU boots a VM, in the case where a vGPU device is attached
first and a pass-through device is attached afterwards, on the first call to
vfio_dma_do_map() the pin and vfio_iommu_map() are skipped. Then, when the
pass-through device is attached, all mappings are unmapped and
vfio_dma_do_map() is called again. At that moment an IOMMU-capable domain is
present, so the pin and vfio_iommu_map() are done on all of system memory.
Now, if any pages are pinned by the vendor driver in between these two
device attaches, then the accounting should be updated.
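Coming back to the dma->size point above, the fix could simply grow
dma->size as each chunk is mapped, the way the old vfio_dma_do_map() loop
did, so that vfio_remove_dma() on the error path knows exactly how much to
unmap and unpin. A sketch of just the loop in vfio_pin_map_dma(), not the
actual follow-up patch (dma was linked zero-sized by the caller, so
dma->size starts at 0):

	while (size) {
		/* Pin a contiguous chunk of memory */
		npage = __vfio_pin_pages_remote(vaddr + dma->size,
						size >> PAGE_SHIFT, dma->prot,
						&pfn);
		if (npage <= 0) {
			WARN_ON(!npage);
			ret = (int)npage;
			break;
		}

		/* Map it! */
		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
				     dma->prot);
		if (ret) {
			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
			break;
		}

		size -= npage << PAGE_SHIFT;
		/* keep dma->size accurate to the point of failure */
		dma->size += npage << PAGE_SHIFT;
	}

	if (ret) {
		/* dma->size now reflects what was actually mapped */
		vfio_remove_dma(iommu, dma);
	} else {
		dma->iommu_mapped = true;
		vfio_update_accounting(iommu, dma);
	}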
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			   struct vfio_iommu_type1_dma_map *map)
>>  {
>>  	dma_addr_t iova = map->iova;
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>> -	long npage;
>>  	int ret = 0, prot = 0;
>>  	uint64_t mask;
>>  	struct vfio_dma *dma;
>> -	unsigned long pfn;
>>
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>
>> -	while (size) {
>> -		/* Pin a contiguous chunk of memory */
>> -		npage = vfio_pin_pages(vaddr + dma->size,
>> -				       size >> PAGE_SHIFT, prot, &pfn);
>> -		if (npage <= 0) {
>> -			WARN_ON(!npage);
>> -			ret = (int)npage;
>> -			break;
>> -		}
>> -
>> -		/* Map it! */
>> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>> -		if (ret) {
>> -			vfio_unpin_pages(pfn, npage, prot, true);
>> -			break;
>> -		}
>> -
>> -		size -= npage << PAGE_SHIFT;
>> -		dma->size += npage << PAGE_SHIFT;
>> -	}
>> -
>> -	if (ret)
>> -		vfio_remove_dma(iommu, dma);
>> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
>> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +		dma->size = size;
>> +	else
>> +		ret = vfio_pin_map_dma(iommu, dma, size);
>>
>>  	mutex_unlock(&iommu->lock);
>>  	return ret;
>> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>>  	n = rb_first(&iommu->dma_list);
>>
>> -	/* If there's not a domain, there better not be any mappings */
>> -	if (WARN_ON(n && !d))
>> -		return -EINVAL;
>> -
>>  	for (; n; n = rb_next(n)) {
>>  		struct vfio_dma *dma;
>>  		dma_addr_t iova;
>> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  		iova = dma->iova;
>>
>>  		while (iova < dma->iova + dma->size) {
>> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
>> +			phys_addr_t phys;
>>  			size_t size;
>>
>> -			if (WARN_ON(!phys)) {
>> -				iova += PAGE_SIZE;
>> -				continue;
>> -			}
>> +			if (dma->iommu_mapped) {
>> +				phys = iommu_iova_to_phys(d->domain, iova);
>> +
>> +				if (WARN_ON(!phys)) {
>> +					iova += PAGE_SIZE;
>> +					continue;
>> +				}
>>
>> -			size = PAGE_SIZE;
>> +				size = PAGE_SIZE;
>>
>> -			while (iova + size < dma->iova + dma->size &&
>> -			       phys + size == iommu_iova_to_phys(d->domain,
>> +				while (iova + size < dma->iova + dma->size &&
>> +				       phys + size == iommu_iova_to_phys(d->domain,
>>  								 iova + size))
>> -				size += PAGE_SIZE;
>> +					size += PAGE_SIZE;
>> +			} else {
>> +				unsigned long pfn;
>> +				unsigned long vaddr = dma->vaddr +
>> +						      (iova - dma->iova);
>> +				size_t n = dma->iova + dma->size - iova;
>> +				long npage;
>> +
>> +				npage = __vfio_pin_pages_remote(vaddr,
>> +								n >> PAGE_SHIFT,
>> +								dma->prot,
>> +								&pfn);
>> +				if (npage <= 0) {
>> +					WARN_ON(!npage);
>> +					ret = (int)npage;
>> +					return ret;
>> +				}
>> +
>> +				phys = pfn << PAGE_SHIFT;
>> +				size = npage << PAGE_SHIFT;
>> +			}
>>
>>  			ret = iommu_map(domain->domain, iova, phys,
>>  					size, dma->prot | domain->prot);
>> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>
>>  			iova += size;
>>  		}
>> +
>> +		if (!dma->iommu_mapped) {
>> +			dma->iommu_mapped = true;
>> +			vfio_update_accounting(iommu, dma);
>
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting.
> But we've fully pinned and accounted the entire iommu mapped space while
> still holding the accounting for any pfn mapped space.  So for a time,
> assuming some pfn pinned pages, we have duplicate accounting.  How does
> userspace deal with that?  For instance, if I'm using an mdev device where
> the vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size.  Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
>

Is the RLIMIT_MEMLOCK really set that low? In any case, I get your point.
I'll update __vfio_pin_pages_remote() to check, within
__vfio_pin_pages_remote() itself, whether the page being pinned is already
accounted.

>>  	}
>>
>>  	return 0;
>> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>>  	__free_pages(pages, order);
>>  }
>>
>> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>> +					   struct iommu_group *iommu_group)
>> +{
>> +	struct vfio_group *g;
>> +
>> +	list_for_each_entry(g, &domain->group_list, next) {
>> +		if (g->iommu_group == iommu_group)
>> +			return g;
>> +	}
>> +
>> +	return NULL;
>> +}
>
> It would make review easier if changes like splitting this into a
> separate function with no functional change on the calling path could
> be a separate patch.
>

OK.

Thanks,
Kirti
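P.S. A very rough illustration of the "already accounted" check mentioned
above. The helper name is hypothetical, and it assumes the vfio_iommu (and
thus the local domain) is reachable from the __vfio_pin_pages_remote() path,
which would need its signature extended; it only illustrates the idea of not
charging a page against RLIMIT_MEMLOCK twice:

static bool vfio_pfn_already_accounted(struct vfio_iommu *iommu,
				       unsigned long pfn)
{
	struct vfio_domain *domain = iommu->local_domain;
	struct vfio_pfn *p;
	bool found;

	if (!domain)
		return false;

	mutex_lock(&domain->local_addr_space->pfn_list_lock);
	/* pinned (and accounted) earlier through the vendor driver path? */
	p = vfio_find_pfn(domain, pfn);
	found = (p != NULL);
	mutex_unlock(&domain->local_addr_space->pfn_list_lock);

	return found;
}

The accounting code in __vfio_pin_pages_remote() would then skip the
locked-memory charge for any page where this returns true.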