Re: [PATCH v12 06/10] iommu: Add a page fault handler
Hi Jean, It seems that the preprocessing of the page faults(groups) here is relatively generic, and if a device driver wants to reuse it while having its own iopf_handle_single(), is there any chance for this? :-) Thanks, Shenming On 2021/1/27 23:43, Jean-Philippe Brucker wrote: > Some systems allow devices to handle I/O Page Faults in the core mm. For > example systems implementing the PCIe PRI extension or Arm SMMU stall > model. Infrastructure for reporting these recoverable page faults was > added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device > fault report API"). Add a page fault handler for host SVA. > > IOMMU driver can now instantiate several fault workqueues and link them > to IOPF-capable devices. Drivers can choose between a single global > workqueue, one per IOMMU device, one per low-level fault queue, one per > domain, etc. > > When it receives a fault event, most commonly in an IRQ handler, the > IOMMU driver reports the fault using iommu_report_device_fault(), which > calls the registered handler. The page fault handler then calls the mm > fault handler, and reports either success or failure with > iommu_page_response(). After the handler succeeds, the hardware retries > the access. > > The iopf_param pointer could be embedded into iommu_fault_param. But > putting iopf_param into the iommu_param structure allows us not to care > about ordering between calls to iopf_queue_add_device() and > iommu_register_device_fault_handler(). > > Reviewed-by: Jonathan Cameron > Signed-off-by: Jean-Philippe Brucker > --- > drivers/iommu/Makefile| 1 + > drivers/iommu/iommu-sva-lib.h | 53 > include/linux/iommu.h | 2 + > drivers/iommu/io-pgfault.c| 461 ++ > 4 files changed, 517 insertions(+) > create mode 100644 drivers/iommu/io-pgfault.c > > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile > index 61bd30cd8369..60fafc23dee6 100644 > --- a/drivers/iommu/Makefile > +++ b/drivers/iommu/Makefile > @@ -28,3 +28,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o > obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o > obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o > obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o > +obj-$(CONFIG_IOMMU_SVA_LIB) += io-pgfault.o > diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h > index b40990aef3fd..031155010ca8 100644 > --- a/drivers/iommu/iommu-sva-lib.h > +++ b/drivers/iommu/iommu-sva-lib.h > @@ -12,4 +12,57 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t > min, ioasid_t max); > void iommu_sva_free_pasid(struct mm_struct *mm); > struct mm_struct *iommu_sva_find(ioasid_t pasid); > > +/* I/O Page fault */ > +struct device; > +struct iommu_fault; > +struct iopf_queue; > + > +#ifdef CONFIG_IOMMU_SVA_LIB > +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie); > + > +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev); > +int iopf_queue_remove_device(struct iopf_queue *queue, > + struct device *dev); > +int iopf_queue_flush_dev(struct device *dev); > +struct iopf_queue *iopf_queue_alloc(const char *name); > +void iopf_queue_free(struct iopf_queue *queue); > +int iopf_queue_discard_partial(struct iopf_queue *queue); > + > +#else /* CONFIG_IOMMU_SVA_LIB */ > +static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) > +{ > + return -ENODEV; > +} > + > +static inline int iopf_queue_add_device(struct iopf_queue *queue, > + struct device *dev) > +{ > + return -ENODEV; > +} > + > +static inline int iopf_queue_remove_device(struct iopf_queue *queue, > +struct device *dev) > 
+{ > + return -ENODEV; > +} > + > +static inline int iopf_queue_flush_dev(struct device *dev) > +{ > + return -ENODEV; > +} > + > +static inline struct iopf_queue *iopf_queue_alloc(const char *name) > +{ > + return NULL; > +} > + > +static inline void iopf_queue_free(struct iopf_queue *queue) > +{ > +} > + > +static inline int iopf_queue_discard_partial(struct iopf_queue *queue) > +{ > + return -ENODEV; > +} > +#endif /* CONFIG_IOMMU_SVA_LIB */ > #endif /* _IOMMU_SVA_LIB_H */ > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index 00348e4c3c26..edc9be443a74 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -366,6 +366,7 @@ struct iommu_fault_param { > * struct dev_iommu - Collection of per-device IOMMU data > * > * @fault_param: IOMMU detected device fault reporting data > + * @iopf_param: I/O Page Fault queue and data > * @fwspec: IOMMU fwspec data > * @iommu_dev:IOMMU device this device is linked to > * @priv: IOMMU Driver private data > @@ -376,6 +377,7 @@ struct iommu_fault_param { > struct dev_iommu { > struct mutex lock; > struct iommu_fault_param*fault_param; > + struct iopf_device_param*iopf_param; > s
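For readers wiring this into an IOMMU driver, the intended usage of the iopf_queue_* declarations above is roughly the following. This is a minimal sketch based only on the prototypes in iommu-sva-lib.h; the my_iommu_device structure, its fields, and the choice of one workqueue per IOMMU device are assumptions, and error handling is trimmed.

static int my_iommu_enable_iopf(struct my_iommu_device *iommu, struct device *dev)
{
	int ret;

	/* Drivers pick the granularity; here, one fault workqueue per IOMMU device. */
	if (!iommu->iopf_queue) {
		iommu->iopf_queue = iopf_queue_alloc(dev_name(iommu->dev));
		if (!iommu->iopf_queue)
			return -ENOMEM;
	}

	ret = iopf_queue_add_device(iommu->iopf_queue, dev);
	if (ret)
		return ret;

	/* Faults reported for this device are now funnelled into iommu_queue_iopf(). */
	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
	if (ret)
		iopf_queue_remove_device(iommu->iopf_queue, dev);

	return ret;
}

The teardown path is the mirror image: iommu_unregister_device_fault_handler(), then iopf_queue_remove_device(), and finally iopf_queue_free() once no device is linked to the queue.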
Re: [PATCH v11 04/13] vfio/pci: Add VFIO_REGION_TYPE_NESTED region type
> +static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev)
> +{
> +	struct vfio_region_dma_fault *header;
> +	struct iommu_domain *domain;
> +	size_t size;
> +	bool nested;
> +	int ret;
> +
> +	domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> +	ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested);
> +	if (ret || !nested)
> +		return ret;

Hi Eric,

It seems that the type of nested should be int; the use of bool might trigger a panic in arm_smmu_domain_get_attr().

Thanks,
Shenming

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
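The underlying issue is a plain size mismatch: arm_smmu_domain_get_attr() stores a full int through the attribute pointer for DOMAIN_ATTR_NESTING, so handing it the address of a bool clobbers whatever sits next to that byte on the stack. A rough userspace-compilable illustration of just that effect (names are made up; a 1-byte _Bool, as on the usual Linux ABIs, is assumed):

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the attr handler: it performs a 4-byte store through 'data'. */
static int get_nesting_attr(void *data)
{
	*(int *)data = 1;
	return 0;
}

int main(void)
{
	struct {
		bool nested;		/* the caller's 1-byte variable */
		char neighbour[3];	/* silently overwritten by the 4-byte store */
	} stack = { false, "hi" };

	get_nesting_attr(&stack.nested);
	printf("nested=%d neighbour=\"%s\"\n", stack.nested, stack.neighbour);
	return 0;
}

In the kernel the overwritten bytes are neighbouring stack contents rather than a canary string, hence the possible panic.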
[RFC PATCH v2 0/6] Add IOPF support for VFIO passthrough
Hi,

The static pinning and mapping problem in VFIO and possible solutions have been discussed a lot [1, 2]. One of the solutions is to add I/O page fault support for VFIO devices. Different from those relatively complicated software approaches such as presenting a vIOMMU that provides the DMA buffer information (might include para-virtualized optimizations), IOPF mainly depends on the hardware faulting capability, such as the PCIe PRI extension or Arm SMMU stall model. What's more, the IOPF support in the IOMMU driver is being implemented in SVA [3]. So we add IOPF support for VFIO passthrough based on the IOPF part of SVA in this series.

We have measured its performance with UADK [4] (passthrough an accelerator to a VM) on a Hisilicon Kunpeng920 board:

Run hisi_sec_test...
 - with varying message lengths and sending times
 - with/without stage 2 IOPF enabled

when msg_len = 1MB and PREMAP_LEN (in patch 3) = 1:
                     speed (KB/s)
 times     w/o IOPF      with IOPF (num of faults)     degradation
 1         325596        119152 (518)                  36.6%
 100       7524985       5804659 (1058)                77.1%
 1000      8661817       8440209 (1071)                97.4%
 5000      8804512       8724368 (1216)                99.1%

If we use the same region to send messages, page faults occur almost only on first access, so the more times we send, the less the degradation.

when msg_len = 10MB and PREMAP_LEN = 512:
                     speed (KB/s)
 times     w/o IOPF      with IOPF (num of faults)     degradation
 1         1012758       682257 (13)                   67.4%
 100       8680688       8374154 (26)                  96.5%
 1000      8860861       8719918 (26)                  98.4%

We see that pre-mapping can help. And we also measured the performance of host SVA with the same params:

when msg_len = 1MB:
                     speed (KB/s)
 times     w/o IOPF      with IOPF (num of faults)     degradation
 1         951672        163866 (512)                  17.2%
 100       8691961       4529971 (1024)                52.1%
 1000      9158721       8376346 (1024)                91.5%
 5000      9184532       9008739 (1024)                98.1%

Besides, the avg time spent in vfio_iommu_dev_fault_handler() (in patch 3) is a little less than iopf_handle_group() (in SVA) (1.6 us vs 2.0 us).

History:

v1 -> v2
 - Numerous improvements following the suggestions. Thanks a lot to all of you.

Yet TODO:
 - Add support for PRI.
 - Consider selective-faulting. (suggested by Kevin)
 ...

Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS, 2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210302092644.2553014-1-jean-phili...@linaro.org/
[4] https://github.com/Linaro/uadk

Any comments and suggestions are very welcome. :-)

Thanks,
Shenming

Shenming Lu (6):
  iommu: Evolve to support more scenarios of using IOPF
  vfio: Add an MMU notifier to avoid pinning
  vfio: Add a page fault handler
  vfio: VFIO_IOMMU_ENABLE_IOPF
  vfio: No need to statically pin and map if IOPF enabled
  vfio: Add nested IOPF support

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |   3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  11 +-
 drivers/iommu/io-pgfault.c                  |   4 -
 drivers/iommu/iommu.c                       |  56 ++-
 drivers/vfio/vfio.c                         | 118 +
 drivers/vfio/vfio_iommu_type1.c             | 446 +-
 include/linux/iommu.h                       |  21 +-
 include/linux/vfio.h                        |  14 +
 include/uapi/linux/iommu.h                  |   3 +
 include/uapi/linux/vfio.h                   |   6 +
 10 files changed, 661 insertions(+), 21 deletions(-)

-- 
2.19.1
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH v2 1/6] iommu: Evolve to support more scenarios of using IOPF
This patch follows the discussion here: https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/ In order to support more scenarios of using IOPF (mainly consider the nested extension), besides keeping IOMMU_DEV_FEAT_IOPF as a general capability for whether delivering faults through the IOMMU, we extend iommu_register_fault_handler() with flags and introduce IOPF_REPORT_FLAT and IOPF_REPORT_NESTED to describe the page fault reporting capability under a specific configuration. IOPF_REPORT_NESTED needs additional info to indicate which level/stage is concerned since the fault client may be interested in only one level. Signed-off-by: Shenming Lu --- .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++-- drivers/iommu/io-pgfault.c| 4 -- drivers/iommu/iommu.c | 56 ++- include/linux/iommu.h | 21 ++- include/uapi/linux/iommu.h| 3 + 6 files changed, 85 insertions(+), 13 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index ee66d1f4cb81..5de9432349d4 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct arm_smmu_master *master) if (ret) return ret; - ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); + ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, + IOPF_REPORT_FLAT, dev); if (ret) { iopf_queue_remove_device(master->smmu->evtq.iopf, dev); return ret; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 363744df8d51..f40529d0075d 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1447,10 +1447,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) return -EOPNOTSUPP; } - /* Stage-2 is always pinned at the moment */ - if (evt[1] & EVTQ_1_S2) - return -EFAULT; - if (evt[1] & EVTQ_1_RnW) perm |= IOMMU_FAULT_PERM_READ; else @@ -1468,13 +1464,18 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, .grpid = FIELD_GET(EVTQ_1_STAG, evt[1]), .perm = perm, - .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), }; if (ssid_valid) { flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID; flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]); } + + if (evt[1] & EVTQ_1_S2) { + flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2; + flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]); + } else + flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]); } else { flt->type = IOMMU_FAULT_DMA_UNRECOV; flt->event = (struct iommu_fault_unrecoverable) { diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index 1df8c1dcae77..abf16e06bcf5 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -195,10 +195,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) lockdep_assert_held(¶m->lock); - if (fault->type != IOMMU_FAULT_PAGE_REQ) - /* Not a recoverable page fault */ - return -EOPNOTSUPP; - /* * As long as we're holding param->lock, the queue can't be unlinked * from the device and therefore cannot disappear. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d0b0a15dba84..cb1d93b00f7d 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group *group, } EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); +/* + * iommu_update_device_fault_handler - Update the device fault handler via flags + * @dev: the device + * @mask: bits(not set) to clear + * @set: bits to set + * + * Update the device fault handler installed by + * iommu_register_device_fault_handler(). + * + * Return 0 on success, or an error. + */ +int iommu_update_device_fault_handler(struct device *dev, u32 mask, u32 set) +{ + struct dev_iommu *param = dev->iommu; + int ret = 0; + + if (!param) + return -EINVAL; + + mutex_lock(¶m->lock); + + if (param->fault_param) { + ret = -EINVAL; + goto out_unlock; +
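Going by the @mask/@set kerneldoc above, the update is a masked read-modify-write of the reporting flags that were passed at registration time, so a caller that only wants to add a capability bit passes ~0 as the mask. A usage sketch mirroring the call in patch 6/6 of this series (the wrapper name is hypothetical):

static int vfio_switch_to_nested_l1_reporting(struct device *dev)
{
	/* Keep every bit already registered, additionally ask for nested L1 reports. */
	return iommu_update_device_fault_handler(dev, ~0,
						 IOPF_REPORT_NESTED_L1_CONCERNED);
}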
[RFC PATCH v2 2/6] vfio: Add an MMU notifier to avoid pinning
To avoid pinning pages when they are mapped in IOMMU page tables, we add an MMU notifier to tell the addresses which are no longer valid and try to unmap them. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 68 + 1 file changed, 68 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 4bb162c1d649..03ccc11057af 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -40,6 +40,7 @@ #include #include #include +#include #define DRIVER_VERSION "0.2" #define DRIVER_AUTHOR "Alex Williamson " @@ -69,6 +70,7 @@ struct vfio_iommu { struct mutexlock; struct rb_root dma_list; struct blocking_notifier_head notifier; + struct mmu_notifier mn; unsigned intdma_avail; unsigned intvaddr_invalid_count; uint64_tpgsize_bitmap; @@ -101,6 +103,7 @@ struct vfio_dma { struct task_struct *task; struct rb_root pfn_list; /* Ex-user pinned pfn list */ unsigned long *bitmap; + unsigned long *iopf_mapped_bitmap; }; struct vfio_batch { @@ -157,6 +160,10 @@ struct vfio_regions { #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) #define DIRTY_BITMAP_SIZE_MAX DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) +#define IOPF_MAPPED_BITMAP_GET(dma, i) \ + ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \ + >> ((i) % BITS_PER_LONG)) & 0x1) + #define WAITED 1 static int put_pfn(unsigned long pfn, int prot); @@ -1149,6 +1156,35 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, return unlocked; } +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu, + struct vfio_dma *dma, + dma_addr_t start, dma_addr_t end) +{ + unsigned long bit_offset; + size_t len; + struct vfio_domain *d; + + while (start < end) { + bit_offset = (start - dma->iova) >> PAGE_SHIFT; + + for (len = 0; start + len < end; len += PAGE_SIZE) { + if (!IOPF_MAPPED_BITMAP_GET(dma, + bit_offset + (len >> PAGE_SHIFT))) + break; + } + + if (len) { + list_for_each_entry(d, &iommu->domain_list, next) + iommu_unmap(d->domain, start, len); + + bitmap_clear(dma->iopf_mapped_bitmap, +bit_offset, len >> PAGE_SHIFT); + } + + start += (len + PAGE_SIZE); + } +} + static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) { WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)); @@ -3096,6 +3132,38 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, return -EINVAL; } +static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn); + struct rb_node *n; + + mutex_lock(&iommu->lock); + + for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) { + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node); + unsigned long start_n, end_n; + + if (end <= dma->vaddr || start >= dma->vaddr + dma->size) + continue; + + start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr), +PAGE_SIZE); + end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size), + PAGE_SIZE); + + vfio_unmap_partial_iopf(iommu, dma, + start_n - dma->vaddr + dma->iova, + end_n - dma->vaddr + dma->iova); + } + + mutex_unlock(&iommu->lock); +} + +static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = { + .invalidate_range = mn_invalidate_range, +}; + static long vfio_iommu_type1_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) { -- 2.19.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
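The address arithmetic in mn_invalidate_range() above (clamp the invalidated HVA range to the vfio_dma, page-align it, then rebase it onto the IOVA space) can be restated in isolation; a self-contained sketch with stand-in types:

#include <stdint.h>

#define PGSIZE			4096ULL
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define ALIGN_UP(x, a)		(((x) + (a) - 1) & ~((a) - 1))

struct dma_desc {	/* stand-in for the relevant fields of struct vfio_dma */
	uint64_t iova, vaddr, size;
};

/* Clamp an invalidated HVA range to one mapping and rebase it onto IOVA space. */
static int hva_range_to_iova(const struct dma_desc *dma,
			     uint64_t start, uint64_t end,
			     uint64_t *unmap_start, uint64_t *unmap_end)
{
	if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
		return 0;			/* no overlap, nothing to unmap */

	start = ALIGN_DOWN(start > dma->vaddr ? start : dma->vaddr, PGSIZE);
	end = ALIGN_UP(end < dma->vaddr + dma->size ? end : dma->vaddr + dma->size, PGSIZE);

	*unmap_start = start - dma->vaddr + dma->iova;
	*unmap_end = end - dma->vaddr + dma->iova;
	return 1;
}

The resulting [*unmap_start, *unmap_end) range is what vfio_unmap_partial_iopf() then walks, unmapping only the runs whose bits are set in the IOPF mapped bitmap.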
[RFC PATCH v2 3/6] vfio: Add a page fault handler
VFIO manages the DMA mapping itself. To support IOPF for VFIO devices, we add a VFIO page fault handler to serve the reported page faults from the IOMMU driver. Besides, we can pre-map more pages than requested at once to optimize for fewer page fault handlings. Signed-off-by: Shenming Lu --- drivers/vfio/vfio.c | 35 +++ drivers/vfio/vfio_iommu_type1.c | 167 include/linux/vfio.h| 5 + 3 files changed, 207 insertions(+) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 38779e6fd80c..77b29bbd3027 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -2354,6 +2354,41 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group) } EXPORT_SYMBOL_GPL(vfio_group_iommu_domain); +int vfio_iommu_dev_fault_handler(struct iommu_fault *fault, void *data) +{ + struct device *dev = (struct device *)data; + struct vfio_container *container; + struct vfio_group *group; + struct vfio_iommu_driver *driver; + int ret; + + if (!dev) + return -EINVAL; + + group = vfio_group_get_from_dev(dev); + if (!group) + return -ENODEV; + + ret = vfio_group_add_container_user(group); + if (ret) + goto out; + + container = group->container; + driver = container->iommu_driver; + if (likely(driver && driver->ops->dma_map_iopf)) + ret = driver->ops->dma_map_iopf(container->iommu_data, + fault, dev); + else + ret = -ENOTTY; + + vfio_group_try_dissolve_container(group); + +out: + vfio_group_put(group); + return ret; +} +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler); + /** * Module/class support */ diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 03ccc11057af..167d52c1468b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3330,6 +3330,172 @@ static void vfio_iommu_type1_notify(void *iommu_data, wake_up_all(&iommu->vaddr_wait); } +/* + * To optimize for fewer page fault handlings, try to + * pre-map more pages than requested. + */ +#define IOPF_PREMAP_LEN512 + +static void unpin_pages_iopf(struct vfio_dma *dma, +unsigned long pfn, unsigned long npages) +{ + while (npages--) + put_pfn(pfn++, dma->prot); +} + +/* + * Return 0 on success or a negative error code, the + * number of pages contiguously pinned is in @*pinned. 
+ */ +static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr, + unsigned long npages, unsigned long *pfn_base, + unsigned long *pinned, struct vfio_batch *batch) +{ + struct mm_struct *mm; + unsigned long pfn; + int ret = 0; + *pinned = 0; + + mm = get_task_mm(dma->task); + if (!mm) + return -ENODEV; + + if (batch->size) { + *pfn_base = page_to_pfn(batch->pages[batch->offset]); + pfn = *pfn_base; + } else + *pfn_base = 0; + + while (npages) { + if (!batch->size) { + unsigned long req_pages = min_t(unsigned long, npages, + batch->capacity); + + ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot, +&pfn, batch->pages); + if (ret < 0) + goto out; + + batch->size = ret; + batch->offset = 0; + ret = 0; + + if (!*pfn_base) + *pfn_base = pfn; + } + + while (true) { + if (pfn != *pfn_base + *pinned) + goto out; + + (*pinned)++; + npages--; + vaddr += PAGE_SIZE; + batch->offset++; + batch->size--; + + if (!batch->size) + break; + + pfn = page_to_pfn(batch->pages[batch->offset]); + } + + if (unlikely(disable_hugepages)) + break; + } + +out: + mmput(mm); + return ret; +} + +static int vfio_iommu_type1_dma_map_iopf(void *iommu_data, +struct iommu_fault *fault, +struct device *dev) +{ + struct vfio_iommu *iommu = iommu_data; + dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE); + struct vfio_dma *dma; + int access_flags = 0; + size_t premap_len, ma
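The pre-mapping idea mentioned above is easier to see with the length computation pulled out on its own: starting from the faulting page, extend the mapping by at most IOPF_PREMAP_LEN pages, but never past the end of the vfio_dma and never across a page that is already mapped. A sketch of just that step (it mirrors the loop that appears in the v3 respin, patch 4/8; the mapped() callback stands in for the IOPF_MAPPED_BITMAP_GET() test):

/*
 * How many bytes to map for a fault at page index 'fault_idx' of an
 * 'npages'-page vfio_dma.
 */
static size_t iopf_premap_bytes(unsigned long fault_idx, unsigned long npages,
				bool (*mapped)(unsigned long idx))
{
	size_t premap_len = (size_t)IOPF_PREMAP_LEN << PAGE_SHIFT;
	size_t map_len = PAGE_SIZE;
	unsigned long i;

	for (i = fault_idx + 1; i < npages; i++) {
		if (map_len >= premap_len || mapped(i))
			break;
		map_len += PAGE_SIZE;
	}

	return map_len;
}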
[RFC PATCH v2 4/6] vfio: VFIO_IOMMU_ENABLE_IOPF
Since enabling IOPF for devices may lead to a slow ramp up of performance, we add an VFIO_IOMMU_ENABLE_IOPF ioctl to make it configurable. And the IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and registering the VFIO page fault handler. Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be inflight page faults when disabling. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 139 +++- include/uapi/linux/vfio.h | 6 ++ 2 files changed, 142 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 167d52c1468b..3997473be4a7 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -71,6 +71,7 @@ struct vfio_iommu { struct rb_root dma_list; struct blocking_notifier_head notifier; struct mmu_notifier mn; + struct mm_struct*mm; unsigned intdma_avail; unsigned intvaddr_invalid_count; uint64_tpgsize_bitmap; @@ -81,6 +82,7 @@ struct vfio_iommu { booldirty_page_tracking; boolpinned_page_dirty_scope; boolcontainer_open; + booliopf_enabled; }; struct vfio_domain { @@ -2278,6 +2280,62 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, list_splice_tail(iova_copy, iova); } +static int dev_disable_iopf(struct device *dev, void *data) +{ + int *enabled_dev_cnt = data; + + if (enabled_dev_cnt && *enabled_dev_cnt <= 0) + return -1; + + iommu_unregister_device_fault_handler(dev); + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF); + + if (enabled_dev_cnt) + (*enabled_dev_cnt)--; + + return 0; +} + +static int dev_enable_iopf(struct device *dev, void *data) +{ + int *enabled_dev_cnt = data; + struct iommu_domain *domain; + int nested; + u32 flags; + int ret; + + ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF); + if (ret) + return ret; + + domain = iommu_get_domain_for_dev(dev); + if (!domain) { + ret = -ENODEV; + goto out_disable; + } + + ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested); + if (ret) + goto out_disable; + + if (nested) + flags = IOPF_REPORT_NESTED | IOPF_REPORT_NESTED_L2_CONCERNED; + else + flags = IOPF_REPORT_FLAT; + + ret = iommu_register_device_fault_handler(dev, + vfio_iommu_dev_fault_handler, flags, dev); + if (ret) + goto out_disable; + + (*enabled_dev_cnt)++; + return 0; + +out_disable: + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF); + return ret; +} + static int vfio_iommu_type1_attach_group(void *iommu_data, struct iommu_group *iommu_group) { @@ -2291,6 +2349,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, struct iommu_domain_geometry geo; LIST_HEAD(iova_copy); LIST_HEAD(group_resv_regions); + int iopf_enabled_dev_cnt = 0; mutex_lock(&iommu->lock); @@ -2368,6 +2427,13 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, if (ret) goto out_domain; + if (iommu->iopf_enabled) { + ret = iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt, + dev_enable_iopf); + if (ret) + goto out_detach; + } + /* Get aperture info */ iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo); @@ -2449,9 +2515,11 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, vfio_test_domain_fgsp(domain); /* replay mappings on new domains */ - ret = vfio_iommu_replay(iommu, domain); - if (ret) - goto out_detach; + if (!iommu->iopf_enabled) { + ret = vfio_iommu_replay(iommu, domain); + if (ret) + goto out_detach; + } if (resv_msi) { ret = iommu_get_msi_cookie(domain->domain, resv_msi_base); @@ -2482,6 +2550,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, 
iommu_domain_free(domain->domain); vfio_iommu_iova_free(&iova_copy); vfio_iommu_resv_free(&group_resv_regions); + iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt, +dev_disable_iopf); out_free: kfree(domain); kfree(group); @@ -2643,6 +2713,10 @@ static void vfio_iommu_type1_detach_group(void *iommu_data, if (!group)
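One detail of the enable path above that is easy to miss: the same counter is threaded through dev_enable_iopf() and dev_disable_iopf(), so a failure partway through iommu_group_for_each_dev() rolls back only the devices that were actually enabled; once the count reaches zero, dev_disable_iopf() returns nonzero and the walk stops. A stripped-down, userspace-compilable restatement of the idiom (all names hypothetical):

#include <stdio.h>

/* Stand-in for dev_disable_iopf(): undo one device, stop once none remain. */
static int undo_one(int id, int *remaining)
{
	if (*remaining <= 0)
		return -1;		/* nonzero return stops the for-each walk */
	printf("disabling IOPF on device %d\n", id);
	(*remaining)--;
	return 0;
}

int main(void)
{
	int enabled = 2;	/* enabling succeeded on two devices, failed on the third */
	int id;

	/* Stand-in for iommu_group_for_each_dev() over a four-device group. */
	for (id = 0; id < 4; id++)
		if (undo_one(id, &enabled))
			break;

	return 0;
}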
[RFC PATCH v2 5/6] vfio: No need to statically pin and map if IOPF enabled
If IOPF enabled for the VFIO container, there is no need to statically pin and map the entire DMA range, we can do it on demand. And unmap according to the IOPF mapped bitmap when removing vfio_dma. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 35 - 1 file changed, 30 insertions(+), 5 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 3997473be4a7..8d14ced649a6 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -165,6 +165,7 @@ struct vfio_regions { #define IOPF_MAPPED_BITMAP_GET(dma, i) \ ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \ >> ((i) % BITS_PER_LONG)) & 0x1) +#define IOPF_MAPPED_BITMAP_BYTES(n)DIRTY_BITMAP_BYTES(n) #define WAITED 1 @@ -877,7 +878,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, * already pinned and accounted. Accouting should be done if there is no * iommu capable domain in the container. */ - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || + iommu->iopf_enabled; for (i = 0; i < npage; i++) { struct vfio_pfn *vpfn; @@ -966,7 +968,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data, mutex_lock(&iommu->lock); - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || + iommu->iopf_enabled; for (i = 0; i < npage; i++) { struct vfio_dma *dma; dma_addr_t iova; @@ -1087,7 +1090,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, if (!dma->size) return 0; - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled) return 0; /* @@ -1187,11 +1190,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu, } } +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma) +{ + vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size); + + kfree(dma->iopf_mapped_bitmap); +} + static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) { WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)); vfio_unmap_unpin(iommu, dma, true); vfio_unlink_dma(iommu, dma); + if (iommu->iopf_enabled) + vfio_dma_clean_iopf(iommu, dma); put_task_struct(dma->task); vfio_dma_bitmap_free(dma); if (dma->vaddr_invalid) { @@ -1655,6 +1667,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, goto out_unlock; } + if (iommu->iopf_enabled) { + dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES( + size >> PAGE_SHIFT), GFP_KERNEL); + if (!dma->iopf_mapped_bitmap) { + ret = -ENOMEM; + kfree(dma); + goto out_unlock; + } + } + iommu->dma_avail--; dma->iova = iova; dma->vaddr = vaddr; @@ -1694,8 +1716,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, /* Insert zero-sized and grow as we map chunks of it */ vfio_link_dma(iommu, dma); - /* Don't pin and map if container doesn't contain IOMMU capable domain*/ - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) + /* +* Don't pin and map if container doesn't contain IOMMU capable domain, +* or IOPF enabled for the container. +*/ + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled) dma->size = size; else ret = vfio_pin_map_dma(iommu, dma, size); -- 2.19.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
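For a sense of the memory cost of the iopf_mapped_bitmap allocated above: it is one bit per page of the vfio_dma, rounded up to whole u64 words (IOPF_MAPPED_BITMAP_BYTES() reuses DIRTY_BITMAP_BYTES(), assumed here to keep its usual ALIGN(n, 64) / 8 definition from this file). A quick userspace check with 4K pages:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12

/* One tracking bit per page, rounded up to whole 64-bit words. */
static uint64_t iopf_bitmap_bytes(uint64_t dma_size)
{
	uint64_t pages = dma_size >> PAGE_SHIFT;

	return (pages + 63) / 64 * 8;
}

int main(void)
{
	/* A 1 GiB DMA mapping needs a 32 KiB bitmap; 64 GiB needs 2 MiB. */
	printf("%llu %llu\n",
	       (unsigned long long)iopf_bitmap_bytes(1ULL << 30),
	       (unsigned long long)iopf_bitmap_bytes(64ULL << 30));
	return 0;
}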
[RFC PATCH v2 6/6] vfio: Add nested IOPF support
To set up nested mode, drivers such as vfio_pci have to register a handler to receive stage/level 1 faults from the IOMMU, but since currently each device can only have one iommu dev fault handler, and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF), we choose to update the registered handler (a combined one) via flags (set IOPF_REPORT_NESTED_L1_CONCERNED), and further deliver the received stage 1 faults in the handler to the guest through a newly added vfio_device_ops callback. Signed-off-by: Shenming Lu --- drivers/vfio/vfio.c | 83 + drivers/vfio/vfio_iommu_type1.c | 37 +++ include/linux/vfio.h| 9 3 files changed, 129 insertions(+) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 77b29bbd3027..c6a01d947d0d 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -2389,6 +2389,89 @@ int vfio_iommu_dev_fault_handler(struct iommu_fault *fault, void *data) } EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler); +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev) +{ + struct vfio_container *container; + struct vfio_group *group; + struct vfio_iommu_driver *driver; + int ret; + + if (!dev) + return -EINVAL; + + group = vfio_group_get_from_dev(dev); + if (!group) + return -ENODEV; + + ret = vfio_group_add_container_user(group); + if (ret) + goto out; + + container = group->container; + driver = container->iommu_driver; + if (likely(driver && driver->ops->unregister_hdlr_nested)) + ret = driver->ops->unregister_hdlr_nested(container->iommu_data, + dev); + else + ret = -ENOTTY; + + vfio_group_try_dissolve_container(group); + +out: + vfio_group_put(group); + return ret; +} +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested); + +/* + * Register/Update the VFIO page fault handler + * to receive nested stage/level 1 faults. + */ +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev) +{ + struct vfio_container *container; + struct vfio_group *group; + struct vfio_iommu_driver *driver; + int ret; + + if (!dev) + return -EINVAL; + + group = vfio_group_get_from_dev(dev); + if (!group) + return -ENODEV; + + ret = vfio_group_add_container_user(group); + if (ret) + goto out; + + container = group->container; + driver = container->iommu_driver; + if (likely(driver && driver->ops->register_hdlr_nested)) + ret = driver->ops->register_hdlr_nested(container->iommu_data, + dev); + else + ret = -ENOTTY; + + vfio_group_try_dissolve_container(group); + +out: + vfio_group_put(group); + return ret; +} +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested); + +int vfio_transfer_dev_fault(struct device *dev, struct iommu_fault *fault) +{ + struct vfio_device *device = dev_get_drvdata(dev); + + if (unlikely(!device->ops->transfer)) + return -EOPNOTSUPP; + + return device->ops->transfer(device->device_data, fault); +} +EXPORT_SYMBOL_GPL(vfio_transfer_dev_fault); + /** * Module/class support */ diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 8d14ced649a6..62ad4a47de4a 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3581,6 +3581,13 @@ static int vfio_iommu_type1_dma_map_iopf(void *iommu_data, enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; struct iommu_page_response resp = {0}; + /* +* When configured in nested mode, further deliver +* stage/level 1 faults to the guest. 
+*/ + if (iommu->nesting && !(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2)) + return vfio_transfer_dev_fault(dev, fault); + mutex_lock(&iommu->lock); dma = vfio_find_dma(iommu, iova, PAGE_SIZE); @@ -3654,6 +3661,34 @@ static int vfio_iommu_type1_dma_map_iopf(void *iommu_data, return 0; } +static int vfio_iommu_type1_register_hdlr_nested(void *iommu_data, +struct device *dev) +{ + struct vfio_iommu *iommu = iommu_data; + + if (iommu->iopf_enabled) + return iommu_update_device_fault_handler(dev, ~0, + IOPF_REPORT_NESTED_L1_CONCERNED); + else + return iommu_register_device_fault_handler(dev, + vfio_iommu_dev_fault_handler, +
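From the device-driver side, the two exports above are what a vfio_pci-style driver would call around its nested setup and teardown, so that stage/level 1 faults start flowing to the combined handler and from there to the guest via the .transfer callback. A minimal sketch; the wrapper names and the place they are hooked are assumptions, not part of this series:

static int my_vfio_pci_enable_nested_iopf(struct vfio_pci_device *vdev)
{
	/* Stage/level 1 faults now reach the combined VFIO handler. */
	return vfio_iommu_dev_fault_handler_register_nested(&vdev->pdev->dev);
}

static void my_vfio_pci_disable_nested_iopf(struct vfio_pci_device *vdev)
{
	vfio_iommu_dev_fault_handler_unregister_nested(&vdev->pdev->dev);
}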
Re: [RFC PATCH v2 1/6] iommu: Evolve to support more scenarios of using IOPF
Hi Baolu, On 2021/3/10 10:09, Lu Baolu wrote: > Hi Shenming, > > On 3/9/21 2:22 PM, Shenming Lu wrote: >> This patch follows the discussion here: >> >> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/ >> >> In order to support more scenarios of using IOPF (mainly consider >> the nested extension), besides keeping IOMMU_DEV_FEAT_IOPF as a >> general capability for whether delivering faults through the IOMMU, >> we extend iommu_register_fault_handler() with flags and introduce >> IOPF_REPORT_FLAT and IOPF_REPORT_NESTED to describe the page fault >> reporting capability under a specific configuration. >> IOPF_REPORT_NESTED needs additional info to indicate which level/stage >> is concerned since the fault client may be interested in only one >> level. >> >> Signed-off-by: Shenming Lu >> --- >> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +- >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++-- >> drivers/iommu/io-pgfault.c | 4 -- >> drivers/iommu/iommu.c | 56 ++- >> include/linux/iommu.h | 21 ++- >> include/uapi/linux/iommu.h | 3 + >> 6 files changed, 85 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c >> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c >> index ee66d1f4cb81..5de9432349d4 100644 >> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c >> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c >> @@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct >> arm_smmu_master *master) >> if (ret) >> return ret; >> - ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); >> + ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, >> + IOPF_REPORT_FLAT, dev); >> if (ret) { >> iopf_queue_remove_device(master->smmu->evtq.iopf, dev); >> return ret; >> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> index 363744df8d51..f40529d0075d 100644 >> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> @@ -1447,10 +1447,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device >> *smmu, u64 *evt) >> return -EOPNOTSUPP; >> } >> - /* Stage-2 is always pinned at the moment */ >> - if (evt[1] & EVTQ_1_S2) >> - return -EFAULT; >> - >> if (evt[1] & EVTQ_1_RnW) >> perm |= IOMMU_FAULT_PERM_READ; >> else >> @@ -1468,13 +1464,18 @@ static int arm_smmu_handle_evt(struct >> arm_smmu_device *smmu, u64 *evt) >> .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, >> .grpid = FIELD_GET(EVTQ_1_STAG, evt[1]), >> .perm = perm, >> - .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), >> }; >> if (ssid_valid) { >> flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID; >> flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]); >> } >> + >> + if (evt[1] & EVTQ_1_S2) { >> + flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2; >> + flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]); >> + } else >> + flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]); >> } else { >> flt->type = IOMMU_FAULT_DMA_UNRECOV; >> flt->event = (struct iommu_fault_unrecoverable) { >> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c >> index 1df8c1dcae77..abf16e06bcf5 100644 >> --- a/drivers/iommu/io-pgfault.c >> +++ b/drivers/iommu/io-pgfault.c >> @@ -195,10 +195,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void >> *cookie) >> lockdep_assert_held(¶m->lock); >> - if (fault->type != IOMMU_FAULT_PAGE_REQ) >> - /* Not a recoverable page fault */ >> - return -EOPNOTSUPP; >> - > > Any reasons why do you want to remove this check? 
My thought was to make the reporting cap more detailed: IOPF_REPORT_ is only for recoverable page faults (IOMMU_FAULT_PAGE_REQ), and we may add UNRECOV_FAULT_REPORT_ later for unrecoverable faults (IOMMU_FAULT_DMA_UNRECOV)...

>
>>  /*
>>   * As long as we're holding param->lock, the queue can't be unlinked
>>   * from
[RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling
To optimize for fewer page fault handlings, we can pre-map more pages than requested at once. Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we could try further tuning. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 131 ++-- 1 file changed, 123 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 1cb9d1f2717b..01e296c6dc9e 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3217,6 +3217,91 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, return -EINVAL; } +/* + * To optimize for fewer page fault handlings, try to + * pre-map more pages than requested. + */ +#define IOPF_PREMAP_LEN512 + +/* + * Return 0 on success or a negative error code, the + * number of pages contiguously pinned is in @pinned. + */ +static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr, + unsigned long npages, unsigned long *pfn_base, + unsigned long *pinned, struct vfio_batch *batch) +{ + struct mm_struct *mm; + unsigned long pfn; + int ret = 0; + *pinned = 0; + + mm = get_task_mm(dma->task); + if (!mm) + return -ENODEV; + + if (batch->size) { + *pfn_base = page_to_pfn(batch->pages[batch->offset]); + pfn = *pfn_base; + } else { + *pfn_base = 0; + } + + while (npages) { + if (!batch->size) { + unsigned long req_pages = min_t(unsigned long, npages, + batch->capacity); + + ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot, +&pfn, batch->pages); + if (ret < 0) + goto out; + + batch->size = ret; + batch->offset = 0; + ret = 0; + + if (!*pfn_base) + *pfn_base = pfn; + } + + while (true) { + if (pfn != *pfn_base + *pinned) + goto out; + + (*pinned)++; + npages--; + vaddr += PAGE_SIZE; + batch->offset++; + batch->size--; + + if (!batch->size) + break; + + pfn = page_to_pfn(batch->pages[batch->offset]); + } + + if (unlikely(disable_hugepages)) + break; + } + +out: + if (batch->size == 1 && !batch->offset) { + put_pfn(pfn, dma->prot); + batch->size = 0; + } + + mmput(mm); + return ret; +} + +static void unpin_pages_iopf(struct vfio_dma *dma, +unsigned long pfn, unsigned long npages) +{ + while (npages--) + put_pfn(pfn++, dma->prot); +} + /* VFIO I/O Page Fault handler */ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) { @@ -3225,9 +3310,11 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) struct vfio_iopf_group *iopf_group; struct vfio_iommu *iommu; struct vfio_dma *dma; + struct vfio_batch batch; dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE); int access_flags = 0; - unsigned long bit_offset, vaddr, pfn; + size_t premap_len, map_len, mapped_len = 0; + unsigned long bit_offset, vaddr, pfn, i, npages; int ret; enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; struct iommu_page_response resp = {0}; @@ -3263,19 +3350,47 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset)) goto out_success; + premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT; + npages = dma->size >> PAGE_SHIFT; + map_len = PAGE_SIZE; + for (i = bit_offset + 1; i < npages; i++) { + if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i)) + break; + map_len += PAGE_SIZE; + } vaddr = iova - dma->iova + dma->vaddr; + vfio_batch_init(&batch); - if (vfio_pin_page_external(dma, vaddr, &pfn, false)) - goto out_invalid; + while (map_len) { + ret = pin_pages_iopf(dma, vaddr + mapped_len, +map_len >> PAGE_SHIFT, &pfn, +&npages, &batch); + if (!npag
[RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
Hi, Requesting for your comments and suggestions. :-) The static pinning and mapping problem in VFIO and possible solutions have been discussed a lot [1, 2]. One of the solutions is to add I/O Page Fault support for VFIO devices. Different from those relatively complicated software approaches such as presenting a vIOMMU that provides the DMA buffer information (might include para-virtualized optimizations), IOPF mainly depends on the hardware faulting capability, such as the PCIe PRI extension or Arm SMMU stall model. What's more, the IOPF support in the IOMMU driver has already been implemented in SVA [3]. So we add IOPF support for VFIO passthrough based on the IOPF part of SVA in this series. We have measured its performance with UADK [4] (passthrough an accelerator to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA): Run hisi_sec_test... - with varying sending times and message lengths - with/without IOPF enabled (speed slowdown) when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1): slowdown (num of faults) times VFIO IOPF host SVA 1 63.4% (518)82.8% (512) 10022.9% (1058) 47.9% (1024) 1000 2.6% (1071)8.5% (1024) when msg_len = 10MB (and PREMAP_LEN = 512): slowdown (num of faults) times VFIO IOPF 1 32.6% (13) 1003.5% (26) 1000 1.6% (26) History: v2 -> v3 - Nit fixes. - No reason to disable reporting the unrecoverable faults. (baolu) - Maintain a global IOPF enabled group list. - Split the pre-mapping optimization to be a separate patch. - Add selective faulting support (use vfio_pin_pages to indicate the non-faultable scope and add a new struct vfio_range to record it, untested). (Kevin) v1 -> v2 - Numerous improvements following the suggestions. Thanks a lot to all of you. Note that PRI is not supported at the moment since there is no hardware. Links: [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS, 2016. [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020. [3] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/ [4] https://github.com/Linaro/uadk Thanks, Shenming Shenming Lu (8): iommu: Evolve the device fault reporting framework vfio/type1: Add a page fault handler vfio/type1: Add an MMU notifier to avoid pinning vfio/type1: Pre-map more pages than requested in the IOPF handling vfio/type1: VFIO_IOMMU_ENABLE_IOPF vfio/type1: No need to statically pin and map if IOPF enabled vfio/type1: Add selective DMA faulting support vfio: Add nested IOPF support .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |3 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +- drivers/iommu/iommu.c | 56 +- drivers/vfio/vfio.c | 85 +- drivers/vfio/vfio_iommu_type1.c | 1000 - include/linux/iommu.h | 19 +- include/linux/vfio.h | 13 + include/uapi/linux/iommu.h|4 + include/uapi/linux/vfio.h |6 + 9 files changed, 1181 insertions(+), 23 deletions(-) -- 2.19.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework
This patch follows the discussion here: https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/ Besides SVA/vSVA, such as VFIO may also enable (2nd level) IOPF to remove pinning restriction. In order to better support more scenarios of using device faults, we extend iommu_register_fault_handler() with flags and introduce FAULT_REPORT_ to describe the device fault reporting capability under a specific configuration. Note that we don't further distinguish recoverable and unrecoverable faults by flags in the fault reporting cap, having PAGE_FAULT_REPORT_ + UNRECOV_FAULT_REPORT_ seems not a clean way. In addition, still take VFIO as an example, in nested mode, the 1st level and 2nd level fault reporting may be configured separately and currently each device can only register one iommu dev fault handler, so we add a handler update interface for this. Signed-off-by: Shenming Lu --- .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 3 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 -- drivers/iommu/iommu.c | 56 ++- include/linux/iommu.h | 19 ++- include/uapi/linux/iommu.h| 4 ++ 5 files changed, 90 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index ee66d1f4cb81..e6d766fb8f1a 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct arm_smmu_master *master) if (ret) return ret; - ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev); + ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, + FAULT_REPORT_FLAT, dev); if (ret) { iopf_queue_remove_device(master->smmu->evtq.iopf, dev); return ret; diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 53abad8fdd91..51843f54a87f 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1448,10 +1448,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) return -EOPNOTSUPP; } - /* Stage-2 is always pinned at the moment */ - if (evt[1] & EVTQ_1_S2) - return -EFAULT; - if (evt[1] & EVTQ_1_RnW) perm |= IOMMU_FAULT_PERM_READ; else @@ -1469,26 +1465,36 @@ static int arm_smmu_handle_evt(struct arm_smmu_device *smmu, u64 *evt) .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, .grpid = FIELD_GET(EVTQ_1_STAG, evt[1]), .perm = perm, - .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), }; if (ssid_valid) { flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID; flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]); } + + if (evt[1] & EVTQ_1_S2) { + flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2; + flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]); + } else + flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]); } else { flt->type = IOMMU_FAULT_DMA_UNRECOV; flt->event = (struct iommu_fault_unrecoverable) { .reason = reason, .flags = IOMMU_FAULT_UNRECOV_ADDR_VALID, .perm = perm, - .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]), }; if (ssid_valid) { flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID; flt->event.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]); } + + if (evt[1] & EVTQ_1_S2) { + flt->event.flags |= IOMMU_FAULT_UNRECOV_L2; + flt->event.addr = FIELD_GET(EVTQ_3_IPA, evt[3]); + } else + flt->event.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]); } mutex_lock(&smmu->streams_mutex); diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d0b0a15dba84..b50b526b45ac 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ 
-1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group *group, } EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier); +/* + * iommu_update_device_fault_handler - Update the device fault handler via flags + * @dev: the device + * @mask: bits(not set) to clear + * @set: bits to set + * + * Update the device fault handler installed by + * iommu_register_device_fault_handler(). + * + * Return 0 on su
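The iommu.c hunk above is truncated, but the point of the FAULT_REPORT_ flags is that iommu_report_device_fault() should only forward a fault that the registered handler asked for. A sketch of what such a filter presumably looks like; this is an assumption about the truncated code, including where the registration flags are stored:

static bool fault_report_allowed(u32 report_flags, struct iommu_fault *fault)
{
	bool l2;

	if (report_flags & FAULT_REPORT_FLAT)
		return true;

	if (fault->type == IOMMU_FAULT_PAGE_REQ)
		l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
	else
		l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;

	return l2 ? !!(report_flags & FAULT_REPORT_NESTED_L2)
		  : !!(report_flags & FAULT_REPORT_NESTED_L1);
}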
[RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
Since enabling IOPF for devices may lead to a slow ramp up of performance, we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and registering the VFIO IOPF handler. Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be inflight page faults when disabling. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 223 +++- include/uapi/linux/vfio.h | 6 + 2 files changed, 226 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 01e296c6dc9e..7df5711e743a 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -71,6 +71,7 @@ struct vfio_iommu { struct rb_root dma_list; struct blocking_notifier_head notifier; struct mmu_notifier mn; + struct mm_struct*mm; unsigned intdma_avail; unsigned intvaddr_invalid_count; uint64_tpgsize_bitmap; @@ -81,6 +82,7 @@ struct vfio_iommu { booldirty_page_tracking; boolpinned_page_dirty_scope; boolcontainer_open; + booliopf_enabled; }; struct vfio_domain { @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group) return node ? iopf_group : NULL; } +static void vfio_link_iopf_group(struct vfio_iopf_group *new) +{ + struct rb_node **link, *parent = NULL; + struct vfio_iopf_group *iopf_group; + + mutex_lock(&iopf_group_list_lock); + + link = &iopf_group_list.rb_node; + + while (*link) { + parent = *link; + iopf_group = rb_entry(parent, struct vfio_iopf_group, node); + + if (new->iommu_group < iopf_group->iommu_group) + link = &(*link)->rb_left; + else + link = &(*link)->rb_right; + } + + rb_link_node(&new->node, parent, link); + rb_insert_color(&new->node, &iopf_group_list); + + mutex_unlock(&iopf_group_list_lock); +} + +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old) +{ + mutex_lock(&iopf_group_list_lock); + rb_erase(&old->node, &iopf_group_list); + mutex_unlock(&iopf_group_list_lock); +} + static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) { struct mm_struct *mm; @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, list_splice_tail(iova_copy, iova); } +static int vfio_dev_domian_nested(struct device *dev, int *nested) +{ + struct iommu_domain *domain; + + domain = iommu_get_domain_for_dev(dev); + if (!domain) + return -ENODEV; + + return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested); +} + +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data); + +static int dev_enable_iopf(struct device *dev, void *data) +{ + int *enabled_dev_cnt = data; + int nested; + u32 flags; + int ret; + + ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF); + if (ret) + return ret; + + ret = vfio_dev_domian_nested(dev, &nested); + if (ret) + goto out_disable; + + if (nested) + flags = FAULT_REPORT_NESTED_L2; + else + flags = FAULT_REPORT_FLAT; + + ret = iommu_register_device_fault_handler(dev, + vfio_iommu_type1_dma_map_iopf, flags, dev); + if (ret) + goto out_disable; + + (*enabled_dev_cnt)++; + return 0; + +out_disable: + iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF); + return ret; +} + +static int dev_disable_iopf(struct device *dev, void *data) +{ + int *enabled_dev_cnt = data; + + if (enabled_dev_cnt && *enabled_dev_cnt <= 0) + return -1; + + WARN_ON(iommu_unregister_device_fault_handler(dev)); + WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF)); + + if (enabled_dev_cnt) + (*enabled_dev_cnt)--; + + return 0; +} + static int 
vfio_iommu_type1_attach_group(void *iommu_data, struct iommu_group *iommu_group) { @@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, struct iommu_domain_geometry geo; LIST_HEAD(iova_copy); LIST_HEAD(group_resv_regions); + int iopf_enabled_dev_cnt = 0; + struct vfio_iopf_group *iopf_group = NULL; mutex_lock(&iommu->lock); @@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, if (ret) goto out_domain;
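From userspace, opting a container into this is a single new ioctl on the container fd once the type1 IOMMU has been set. A sketch; whether VFIO_IOMMU_ENABLE_IOPF carries an argument is not visible in the excerpt, so none is passed here:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Opt an already-set-up type1 container into IOPF-based on-demand mapping. */
static int container_enable_iopf(int container_fd)
{
	if (ioctl(container_fd, VFIO_IOMMU_ENABLE_IOPF) < 0)
		return -1;	/* fall back to static pinning via VFIO_IOMMU_MAP_DMA */

	return 0;
}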
[RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled
If IOPF enabled for the VFIO container, there is no need to statically pin and map the entire DMA range, we can do it on demand. And unmap according to the IOPF mapped bitmap when removing vfio_dma. Note that we still mark all pages dirty even if IOPF enabled, we may add IOPF-based fine grained dirty tracking support in the future. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 38 +++-- 1 file changed, 32 insertions(+), 6 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 7df5711e743a..dcc93c3b258c 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -175,6 +175,7 @@ struct vfio_iopf_group { #define IOPF_MAPPED_BITMAP_GET(dma, i) \ ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \ >> ((i) % BITS_PER_LONG)) & 0x1) +#define IOPF_MAPPED_BITMAP_BYTES(n)DIRTY_BITMAP_BYTES(n) #define WAITED 1 @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, * already pinned and accounted. Accouting should be done if there is no * iommu capable domain in the container. */ - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || + iommu->iopf_enabled; for (i = 0; i < npage; i++) { struct vfio_pfn *vpfn; @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data, mutex_lock(&iommu->lock); - do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); + do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || + iommu->iopf_enabled; for (i = 0; i < npage; i++) { struct vfio_dma *dma; dma_addr_t iova; @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, if (!dma->size) return 0; - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled) return 0; /* @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu, } } +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma) +{ + vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size); + + kfree(dma->iopf_mapped_bitmap); +} + static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) { WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)); vfio_unmap_unpin(iommu, dma, true); vfio_unlink_dma(iommu, dma); + if (iommu->iopf_enabled) + vfio_dma_clean_iopf(iommu, dma); put_task_struct(dma->task); vfio_dma_bitmap_free(dma); if (dma->vaddr_invalid) { @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu, * mark all pages dirty if any IOMMU capable device is not able * to report dirty pages and all pages are pinned and mapped. 
*/ - if (iommu->num_non_pinned_groups && dma->iommu_mapped) + if (iommu->num_non_pinned_groups && + (dma->iommu_mapped || iommu->iopf_enabled)) bitmap_set(dma->bitmap, 0, nbits); if (shift) { @@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, goto out_unlock; } + if (iommu->iopf_enabled) { + dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES( + size >> PAGE_SHIFT), GFP_KERNEL); + if (!dma->iopf_mapped_bitmap) { + ret = -ENOMEM; + kfree(dma); + goto out_unlock; + } + } + iommu->dma_avail--; dma->iova = iova; dma->vaddr = vaddr; @@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, /* Insert zero-sized and grow as we map chunks of it */ vfio_link_dma(iommu, dma); - /* Don't pin and map if container doesn't contain IOMMU capable domain*/ - if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) + /* +* Don't pin and map if container doesn't contain IOMMU capable domain, +* or IOPF enabled for the container. +*/ + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled) dma->size = size; else ret = vfio_pin_map_dma(iommu, dma, size); -- 2.19.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[RFC PATCH v3 8/8] vfio: Add nested IOPF support
To set up nested mode, drivers such as vfio_pci need to register a handler to receive stage/level 1 faults from the IOMMU, but since currently each device can only have one iommu dev fault handler, and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF), we choose to update the registered handler (a consolidated one) via flags (set FAULT_REPORT_NESTED_L1), and further deliver the received stage 1 faults in the handler to the guest through a newly added vfio_device_ops callback. Signed-off-by: Shenming Lu --- drivers/vfio/vfio.c | 81 + drivers/vfio/vfio_iommu_type1.c | 49 +++- include/linux/vfio.h| 12 + 3 files changed, 141 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 44c8dfabf7de..4245f15914bf 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group) } EXPORT_SYMBOL_GPL(vfio_group_iommu_domain); +/* + * Register/Update the VFIO IOPF handler to receive + * nested stage/level 1 faults. + */ +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev) +{ + struct vfio_container *container; + struct vfio_group *group; + struct vfio_iommu_driver *driver; + int ret; + + if (!dev) + return -EINVAL; + + group = vfio_group_get_from_dev(dev); + if (!group) + return -ENODEV; + + ret = vfio_group_add_container_user(group); + if (ret) + goto out; + + container = group->container; + driver = container->iommu_driver; + if (likely(driver && driver->ops->register_handler)) + ret = driver->ops->register_handler(container->iommu_data, dev); + else + ret = -ENOTTY; + + vfio_group_try_dissolve_container(group); + +out: + vfio_group_put(group); + return ret; +} +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested); + +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev) +{ + struct vfio_container *container; + struct vfio_group *group; + struct vfio_iommu_driver *driver; + int ret; + + if (!dev) + return -EINVAL; + + group = vfio_group_get_from_dev(dev); + if (!group) + return -ENODEV; + + ret = vfio_group_add_container_user(group); + if (ret) + goto out; + + container = group->container; + driver = container->iommu_driver; + if (likely(driver && driver->ops->unregister_handler)) + ret = driver->ops->unregister_handler(container->iommu_data, dev); + else + ret = -ENOTTY; + + vfio_group_try_dissolve_container(group); + +out: + vfio_group_put(group); + return ret; +} +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested); + +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault) +{ + struct vfio_device *device = dev_get_drvdata(dev); + + if (unlikely(!device->ops->transfer)) + return -EOPNOTSUPP; + + return device->ops->transfer(device->device_data, fault); +} +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault); + /** * Module/class support */ diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index ba2b5a1cf6e9..9d1adeddb303 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) struct vfio_batch batch; struct vfio_range *range; dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE); - int access_flags = 0; + int access_flags = 0, nested; size_t premap_len, map_len, mapped_len = 0; unsigned long bit_offset, vaddr, pfn, i, npages; int ret; enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; struct iommu_page_response resp = {0}; + 
if (vfio_dev_domian_nested(dev, &nested)) + return -ENODEV; + + /* +* When configured in nested mode, further deliver the +* stage/level 1 faults to the guest. +*/ + if (nested) { + bool l2; + + if (fault->type == IOMMU_FAULT_PAGE_REQ) + l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2; + if (fault->type == IOMMU_FAULT_DMA_UNRECOV) + l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2; + + if (!l2) + return vfio_transfer_iommu_fault(dev, fault); + } + if (fault->type != IOMMU_FAULT_PAGE_REQ) return -EOPNOTSUPP; @@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu
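A minimal sketch, for illustration only, of how a bus driver such as vfio_pci might consume the symbols added by this patch. The demo_* names and the way a fault would actually be queued to userspace are assumptions, not part of the series.

	#include <linux/iommu.h>
	#include <linux/pci.h>
	#include <linux/vfio.h>

	/* Hypothetical .transfer callback: deliver a received stage/level 1
	 * fault to the guest (e.g. queue it to a userspace-visible fault
	 * region). The actual delivery path is left out of this sketch. */
	static int demo_transfer(void *device_data, struct iommu_fault *fault)
	{
		pr_debug("forwarding a stage 1 fault to the guest\n");
		return 0;
	}

	static const struct vfio_device_ops demo_vfio_pci_ops = {
		.name     = "demo-vfio-pci",
		.transfer = demo_transfer,
	};

	/* Called once the device has been configured for nested translation:
	 * update the consolidated type1 handler so that stage/level 1 faults
	 * are also reported and reach demo_transfer() via
	 * vfio_transfer_iommu_fault(). */
	static int demo_enable_nested_iopf(struct pci_dev *pdev)
	{
		return vfio_iommu_dev_fault_handler_register_nested(&pdev->dev);
	}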
[RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support
Some devices only allow selective DMA faulting. Similar to the selective dirty page tracking, the vendor driver can call vfio_pin_pages() to indicate the non-faultable scope, we add a new struct vfio_range to record it, then when the IOPF handler receives any page request out of the scope, we can directly return with an invalid response. Suggested-by: Kevin Tian Signed-off-by: Shenming Lu --- drivers/vfio/vfio.c | 4 +- drivers/vfio/vfio_iommu_type1.c | 357 +++- include/linux/vfio.h| 1 + 3 files changed, 358 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 38779e6fd80c..44c8dfabf7de 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -2013,7 +2013,8 @@ int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, int npage) container = group->container; driver = container->iommu_driver; if (likely(driver && driver->ops->unpin_pages)) - ret = driver->ops->unpin_pages(container->iommu_data, user_pfn, + ret = driver->ops->unpin_pages(container->iommu_data, + group->iommu_group, user_pfn, npage); else ret = -ENOTTY; @@ -2112,6 +2113,7 @@ int vfio_group_unpin_pages(struct vfio_group *group, driver = container->iommu_driver; if (likely(driver && driver->ops->unpin_pages)) ret = driver->ops->unpin_pages(container->iommu_data, + group->iommu_group, user_iova_pfn, npage); else ret = -ENOTTY; diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index dcc93c3b258c..ba2b5a1cf6e9 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -150,10 +150,19 @@ struct vfio_regions { static struct rb_root iopf_group_list = RB_ROOT; static DEFINE_MUTEX(iopf_group_list_lock); +struct vfio_range { + struct rb_node node; + dma_addr_t base_iova; + size_t span; + unsigned intref_count; +}; + struct vfio_iopf_group { struct rb_node node; struct iommu_group *iommu_group; struct vfio_iommu *iommu; + struct rb_root pinned_range_list; + boolselective_faulting; }; #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\ @@ -496,6 +505,255 @@ static void vfio_unlink_iopf_group(struct vfio_iopf_group *old) mutex_unlock(&iopf_group_list_lock); } +/* + * Helper functions for range list, handle one page at a time. + */ +static struct vfio_range *vfio_find_range(struct rb_root *range_list, + dma_addr_t iova) +{ + struct rb_node *node = range_list->rb_node; + struct vfio_range *range; + + while (node) { + range = rb_entry(node, struct vfio_range, node); + + if (iova + PAGE_SIZE <= range->base_iova) + node = node->rb_left; + else if (iova >= range->base_iova + range->span) + node = node->rb_right; + else + return range; + } + + return NULL; +} + +/* Do the possible merge adjacent to the input range. */ +static void vfio_merge_range_list(struct rb_root *range_list, + struct vfio_range *range) +{ + struct rb_node *node_prev = rb_prev(&range->node); + struct rb_node *node_next = rb_next(&range->node); + + if (node_next) { + struct vfio_range *range_next = rb_entry(node_next, +struct vfio_range, +node); + + if (range_next->base_iova == (range->base_iova + range->span) && + range_next->ref_count == range->ref_count) { + rb_erase(node_next, range_list); + range->span += range_next->span; + kfree(range_next); + } + } + + if (node_prev) { + struct vfio_range *range_prev = rb_entry(node_prev, +struct vfio_range, +node); + + if (range->base_iova == (range_prev->base_iova + range_prev->span) + && range->ref_count == range_prev->ref_count) { + rb_erase(&range->node, range_list); + range_prev->span += range->span; +
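A minimal usage sketch of the vfio_find_range() helper added above: the range list stores [base_iova, base_iova + span) intervals with a ref_count, lookups are done one page at a time, and the helper answers whether a faulting iova falls inside a range recorded by vfio_pin_pages(). The demo_* wrapper is only illustrative; how the handler then responds is the selective-faulting policy discussed in this patch.

	static bool demo_iova_is_pinned(struct rb_root *range_list,
					dma_addr_t iova)
	{
		return vfio_find_range(range_list,
				       ALIGN_DOWN(iova, PAGE_SIZE)) != NULL;
	}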
[RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
VFIO manages the DMA mapping itself. To support IOPF (on-demand paging) for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to serve the reported page faults from the IOMMU driver. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 114 1 file changed, 114 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 45cbfd4879a5..ab0ff60ee207 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -101,6 +101,7 @@ struct vfio_dma { struct task_struct *task; struct rb_root pfn_list; /* Ex-user pinned pfn list */ unsigned long *bitmap; + unsigned long *iopf_mapped_bitmap; }; struct vfio_batch { @@ -141,6 +142,16 @@ struct vfio_regions { size_t len; }; +/* A global IOPF enabled group list */ +static struct rb_root iopf_group_list = RB_ROOT; +static DEFINE_MUTEX(iopf_group_list_lock); + +struct vfio_iopf_group { + struct rb_node node; + struct iommu_group *iommu_group; + struct vfio_iommu *iommu; +}; + #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\ (!list_empty(&iommu->domain_list)) @@ -157,6 +168,10 @@ struct vfio_regions { #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) #define DIRTY_BITMAP_SIZE_MAX DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) +#define IOPF_MAPPED_BITMAP_GET(dma, i) \ + ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] \ + >> ((i) % BITS_PER_LONG)) & 0x1) + #define WAITED 1 static int put_pfn(unsigned long pfn, int prot); @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn) return ret; } +/* + * Helper functions for iopf_group_list + */ +static struct vfio_iopf_group * +vfio_find_iopf_group(struct iommu_group *iommu_group) +{ + struct vfio_iopf_group *iopf_group; + struct rb_node *node; + + mutex_lock(&iopf_group_list_lock); + + node = iopf_group_list.rb_node; + + while (node) { + iopf_group = rb_entry(node, struct vfio_iopf_group, node); + + if (iommu_group < iopf_group->iommu_group) + node = node->rb_left; + else if (iommu_group > iopf_group->iommu_group) + node = node->rb_right; + else + break; + } + + mutex_unlock(&iopf_group_list_lock); + return node ? 
iopf_group : NULL; +} + static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) { struct mm_struct *mm; @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu, return -EINVAL; } +/* VFIO I/O Page Fault handler */ +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) +{ + struct device *dev = (struct device *)data; + struct iommu_group *iommu_group; + struct vfio_iopf_group *iopf_group; + struct vfio_iommu *iommu; + struct vfio_dma *dma; + dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE); + int access_flags = 0; + unsigned long bit_offset, vaddr, pfn; + int ret; + enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; + struct iommu_page_response resp = {0}; + + if (fault->type != IOMMU_FAULT_PAGE_REQ) + return -EOPNOTSUPP; + + iommu_group = iommu_group_get(dev); + if (!iommu_group) + return -ENODEV; + + iopf_group = vfio_find_iopf_group(iommu_group); + iommu_group_put(iommu_group); + if (!iopf_group) + return -ENODEV; + + iommu = iopf_group->iommu; + + mutex_lock(&iommu->lock); + + ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma); + if (ret < 0) + goto out_invalid; + + if (fault->prm.perm & IOMMU_FAULT_PERM_READ) + access_flags |= IOMMU_READ; + if (fault->prm.perm & IOMMU_FAULT_PERM_WRITE) + access_flags |= IOMMU_WRITE; + if ((dma->prot & access_flags) != access_flags) + goto out_invalid; + + bit_offset = (iova - dma->iova) >> PAGE_SHIFT; + if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset)) + goto out_success; + + vaddr = iova - dma->iova + dma->vaddr; + + if (vfio_pin_page_external(dma, vaddr, &pfn, true)) + goto out_invalid; + + if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) { + if (put_pfn(pfn, dma->prot)) + vfio_lock_acct(dma, -1, true); + goto out_invalid; + } + + bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1); + +out_succe
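For context, a minimal sketch of how this handler gets wired up to the IOMMU fault reporting path (the actual enablement lands in patch 5/8 of this series). It assumes the mainline iommu_register_device_fault_handler() signature, which patch 1/8 later extends with a flags argument; the demo_* name is illustrative.

	#include <linux/iommu.h>

	static int demo_enable_iopf_for_dev(struct device *dev)
	{
		int ret;

		/* Turn on I/O page fault support for the device ... */
		ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
		if (ret)
			return ret;

		/* ... and have reported faults delivered to the type1 handler. */
		return iommu_register_device_fault_handler(dev,
					vfio_iommu_type1_dma_map_iopf, dev);
	}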
[RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning
To avoid pinning pages when they are mapped in IOMMU page tables, we add an MMU notifier to tell the addresses which are no longer valid and try to unmap them. Signed-off-by: Shenming Lu --- drivers/vfio/vfio_iommu_type1.c | 112 +++- 1 file changed, 109 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index ab0ff60ee207..1cb9d1f2717b 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -40,6 +40,7 @@ #include #include #include +#include #define DRIVER_VERSION "0.2" #define DRIVER_AUTHOR "Alex Williamson " @@ -69,6 +70,7 @@ struct vfio_iommu { struct mutexlock; struct rb_root dma_list; struct blocking_notifier_head notifier; + struct mmu_notifier mn; unsigned intdma_avail; unsigned intvaddr_invalid_count; uint64_tpgsize_bitmap; @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma, return unlocked; } +/* Unmap the IOPF mapped pages in the specified range. */ +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu, + struct vfio_dma *dma, + dma_addr_t start, dma_addr_t end) +{ + struct iommu_iotlb_gather *gathers; + struct vfio_domain *d; + int i, num_domains = 0; + + list_for_each_entry(d, &iommu->domain_list, next) + num_domains++; + + gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL); + if (gathers) { + for (i = 0; i < num_domains; i++) + iommu_iotlb_gather_init(&gathers[i]); + } + + while (start < end) { + unsigned long bit_offset; + size_t len; + + bit_offset = (start - dma->iova) >> PAGE_SHIFT; + + for (len = 0; start + len < end; len += PAGE_SIZE) { + if (!IOPF_MAPPED_BITMAP_GET(dma, + bit_offset + (len >> PAGE_SHIFT))) + break; + } + + if (len) { + i = 0; + list_for_each_entry(d, &iommu->domain_list, next) { + size_t unmapped; + + if (gathers) + unmapped = iommu_unmap_fast(d->domain, + start, len, + &gathers[i++]); + else + unmapped = iommu_unmap(d->domain, + start, len); + + if (WARN_ON(unmapped != len)) + goto out; + } + + bitmap_clear(dma->iopf_mapped_bitmap, +bit_offset, len >> PAGE_SHIFT); + + cond_resched(); + } + + start += (len + PAGE_SIZE); + } + +out: + if (gathers) { + i = 0; + list_for_each_entry(d, &iommu->domain_list, next) + iommu_iotlb_sync(d->domain, &gathers[i++]); + + kfree(gathers); + } +} + static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) { WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)); @@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) vaddr = iova - dma->iova + dma->vaddr; - if (vfio_pin_page_external(dma, vaddr, &pfn, true)) + if (vfio_pin_page_external(dma, vaddr, &pfn, false)) goto out_invalid; if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) { - if (put_pfn(pfn, dma->prot)) - vfio_lock_acct(dma, -1, true); + put_pfn(pfn, dma->prot); goto out_invalid; } bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1); + put_pfn(pfn, dma->prot); + out_success: status = IOMMU_PAGE_RESP_SUCCESS; @@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data) return 0; } +static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn); + struct rb_node *n; + int ret; + + mutex_lock(&iommu->lock); + + ret = vfio_wait_all_valid(iommu); + if (WAR
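For reference, a minimal sketch of how the notifier above could be registered against the faulting task's mm. The demo_* names and the registration point are simplifications, and error/unregister handling is omitted.

	#include <linux/mmu_notifier.h>
	#include <linux/sched.h>

	static const struct mmu_notifier_ops demo_vfio_iommu_mn_ops = {
		.invalidate_range = mn_invalidate_range,
	};

	static int demo_register_mn(struct vfio_iommu *iommu)
	{
		iommu->mn.ops = &demo_vfio_iommu_mn_ops;
		/* IOPF-mapped pages get unmapped when their CPU mappings go away. */
		return mmu_notifier_register(&iommu->mn, current->mm);
	}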
Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
On 2021/4/9 11:44, Shenming Lu wrote: > Hi, > > Requesting for your comments and suggestions. :-) Kind ping... > > The static pinning and mapping problem in VFIO and possible solutions > have been discussed a lot [1, 2]. One of the solutions is to add I/O > Page Fault support for VFIO devices. Different from those relatively > complicated software approaches such as presenting a vIOMMU that provides > the DMA buffer information (might include para-virtualized optimizations), > IOPF mainly depends on the hardware faulting capability, such as the PCIe > PRI extension or Arm SMMU stall model. What's more, the IOPF support in > the IOMMU driver has already been implemented in SVA [3]. So we add IOPF > support for VFIO passthrough based on the IOPF part of SVA in this series. > > We have measured its performance with UADK [4] (passthrough an accelerator > to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA): > > Run hisi_sec_test... > - with varying sending times and message lengths > - with/without IOPF enabled (speed slowdown) > > when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1): > slowdown (num of faults) > times VFIO IOPF host SVA > 1 63.4% (518)82.8% (512) > 10022.9% (1058) 47.9% (1024) > 1000 2.6% (1071)8.5% (1024) > > when msg_len = 10MB (and PREMAP_LEN = 512): > slowdown (num of faults) > times VFIO IOPF > 1 32.6% (13) > 1003.5% (26) > 1000 1.6% (26) > > History: > > v2 -> v3 > - Nit fixes. > - No reason to disable reporting the unrecoverable faults. (baolu) > - Maintain a global IOPF enabled group list. > - Split the pre-mapping optimization to be a separate patch. > - Add selective faulting support (use vfio_pin_pages to indicate the >non-faultable scope and add a new struct vfio_range to record it, >untested). (Kevin) > > v1 -> v2 > - Numerous improvements following the suggestions. Thanks a lot to all >of you. > > Note that PRI is not supported at the moment since there is no hardware. > > Links: > [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS, > 2016. > [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer > Tracking > for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020. > [3] > https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/ > [4] https://github.com/Linaro/uadk > > Thanks, > Shenming > > > Shenming Lu (8): > iommu: Evolve the device fault reporting framework > vfio/type1: Add a page fault handler > vfio/type1: Add an MMU notifier to avoid pinning > vfio/type1: Pre-map more pages than requested in the IOPF handling > vfio/type1: VFIO_IOMMU_ENABLE_IOPF > vfio/type1: No need to statically pin and map if IOPF enabled > vfio/type1: Add selective DMA faulting support > vfio: Add nested IOPF support > > .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |3 +- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +- > drivers/iommu/iommu.c | 56 +- > drivers/vfio/vfio.c | 85 +- > drivers/vfio/vfio_iommu_type1.c | 1000 - > include/linux/iommu.h | 19 +- > include/linux/vfio.h | 13 + > include/uapi/linux/iommu.h|4 + > include/uapi/linux/vfio.h |6 + > 9 files changed, 1181 insertions(+), 23 deletions(-) > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
Hi Alex, Hope for some suggestions or comments from you since there seems to be many unsure points in this series. :-) Thanks, Shenming On 2021/4/26 9:41, Shenming Lu wrote: > On 2021/4/9 11:44, Shenming Lu wrote: >> Hi, >> >> Requesting for your comments and suggestions. :-) > > Kind ping... > >> >> The static pinning and mapping problem in VFIO and possible solutions >> have been discussed a lot [1, 2]. One of the solutions is to add I/O >> Page Fault support for VFIO devices. Different from those relatively >> complicated software approaches such as presenting a vIOMMU that provides >> the DMA buffer information (might include para-virtualized optimizations), >> IOPF mainly depends on the hardware faulting capability, such as the PCIe >> PRI extension or Arm SMMU stall model. What's more, the IOPF support in >> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF >> support for VFIO passthrough based on the IOPF part of SVA in this series. >> >> We have measured its performance with UADK [4] (passthrough an accelerator >> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA): >> >> Run hisi_sec_test... >> - with varying sending times and message lengths >> - with/without IOPF enabled (speed slowdown) >> >> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1): >> slowdown (num of faults) >> times VFIO IOPF host SVA >> 1 63.4% (518)82.8% (512) >> 10022.9% (1058) 47.9% (1024) >> 1000 2.6% (1071)8.5% (1024) >> >> when msg_len = 10MB (and PREMAP_LEN = 512): >> slowdown (num of faults) >> times VFIO IOPF >> 1 32.6% (13) >> 1003.5% (26) >> 1000 1.6% (26) >> >> History: >> >> v2 -> v3 >> - Nit fixes. >> - No reason to disable reporting the unrecoverable faults. (baolu) >> - Maintain a global IOPF enabled group list. >> - Split the pre-mapping optimization to be a separate patch. >> - Add selective faulting support (use vfio_pin_pages to indicate the >>non-faultable scope and add a new struct vfio_range to record it, >>untested). (Kevin) >> >> v1 -> v2 >> - Numerous improvements following the suggestions. Thanks a lot to all >>of you. >> >> Note that PRI is not supported at the moment since there is no hardware. >> >> Links: >> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS, >> 2016. >> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer >> Tracking >> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020. 
>> [3] >> https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/ >> [4] https://github.com/Linaro/uadk >> >> Thanks, >> Shenming >> >> >> Shenming Lu (8): >> iommu: Evolve the device fault reporting framework >> vfio/type1: Add a page fault handler >> vfio/type1: Add an MMU notifier to avoid pinning >> vfio/type1: Pre-map more pages than requested in the IOPF handling >> vfio/type1: VFIO_IOMMU_ENABLE_IOPF >> vfio/type1: No need to statically pin and map if IOPF enabled >> vfio/type1: Add selective DMA faulting support >> vfio: Add nested IOPF support >> >> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |3 +- >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 +- >> drivers/iommu/iommu.c | 56 +- >> drivers/vfio/vfio.c | 85 +- >> drivers/vfio/vfio_iommu_type1.c | 1000 - >> include/linux/iommu.h | 19 +- >> include/linux/vfio.h | 13 + >> include/uapi/linux/iommu.h|4 + >> include/uapi/linux/vfio.h |6 + >> 9 files changed, 1181 insertions(+), 23 deletions(-) >> ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
Hi Alex, On 2021/5/19 2:57, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:12 +0800 > Shenming Lu wrote: > >> Hi, >> >> Requesting for your comments and suggestions. :-) >> >> The static pinning and mapping problem in VFIO and possible solutions >> have been discussed a lot [1, 2]. One of the solutions is to add I/O >> Page Fault support for VFIO devices. Different from those relatively >> complicated software approaches such as presenting a vIOMMU that provides >> the DMA buffer information (might include para-virtualized optimizations), >> IOPF mainly depends on the hardware faulting capability, such as the PCIe >> PRI extension or Arm SMMU stall model. What's more, the IOPF support in >> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF >> support for VFIO passthrough based on the IOPF part of SVA in this series. > > The SVA proposals are being reworked to make use of a new IOASID > object, it's not clear to me that this shouldn't also make use of that > work as it does a significant expansion of the type1 IOMMU with fault > handlers that would duplicate new work using that new model. It seems that the IOASID extension for guest SVA would not affect this series, will we do any host-guest IOASID translation in the VFIO fault handler? > >> We have measured its performance with UADK [4] (passthrough an accelerator >> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA): >> >> Run hisi_sec_test... >> - with varying sending times and message lengths >> - with/without IOPF enabled (speed slowdown) >> >> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1): >> slowdown (num of faults) >> times VFIO IOPF host SVA >> 1 63.4% (518)82.8% (512) >> 10022.9% (1058) 47.9% (1024) >> 1000 2.6% (1071)8.5% (1024) >> >> when msg_len = 10MB (and PREMAP_LEN = 512): >> slowdown (num of faults) >> times VFIO IOPF >> 1 32.6% (13) >> 1003.5% (26) >> 1000 1.6% (26) > > It seems like this is only an example that you can make a benchmark > show anything you want. The best results would be to pre-map > everything, which is what we have without this series. What is an > acceptable overhead to incur to avoid page pinning? What if userspace > had more fine grained control over which mappings were available for > faulting and which were statically mapped? I don't really see what > sense the pre-mapping range makes. If I assume the user is QEMU in a > non-vIOMMU configuration, pre-mapping the beginning of each RAM section > doesn't make any logical sense relative to device DMA. As you said in Patch 4, we can introduce full end-to-end functionality before trying to improve performance, and I will drop the pre-mapping patch in the current stage... Is there a need that userspace wants more fine grained control over which mappings are available for faulting? If so, we may evolve the MAP ioctl to support for specifying the faulting range. As for the overhead of IOPF, it is unavoidable if enabling on-demand paging (and page faults occur almost only when first accessing), and it seems that there is a large optimization space compared to CPU page faulting. Thanks, Shenming > > Comments per patch to follow. Thanks, > > Alex > > >> History: >> >> v2 -> v3 >> - Nit fixes. >> - No reason to disable reporting the unrecoverable faults. (baolu) >> - Maintain a global IOPF enabled group list. >> - Split the pre-mapping optimization to be a separate patch. 
>> - Add selective faulting support (use vfio_pin_pages to indicate the >>non-faultable scope and add a new struct vfio_range to record it, >>untested). (Kevin) >> >> v1 -> v2 >> - Numerous improvements following the suggestions. Thanks a lot to all >>of you. >> >> Note that PRI is not supported at the moment since there is no hardware. >> >> Links: >> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS, >> 2016. >> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer >> Tracking >> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020. >> [3] >> https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/ >> [4] https://github.com/Linaro/uadk >> >> Thanks, >> Shenming >> >> >> Shenming Lu (8): >> iommu: Evolve the device fault reporting framework >> vfio/type1: Add a page fault handler >&g
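A purely hypothetical illustration of the "evolve the MAP ioctl" idea mentioned above: userspace could mark individual mappings as faultable with a new flag. The flag name and value below are invented for this sketch and are not part of this series or of the VFIO uAPI.

	#include <linux/types.h>

	#define DEMO_VFIO_DMA_MAP_FLAG_IOPF	(1 << 3)	/* hypothetical flag */

	/* Mirrors struct vfio_iommu_type1_dma_map; a mapping created with the
	 * hypothetical IOPF flag would be faulted in on demand instead of
	 * being pinned and mapped up front. */
	struct demo_vfio_iommu_type1_dma_map {
		__u32	argsz;
		__u32	flags;
		__u64	vaddr;
		__u64	iova;
		__u64	size;
	};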
Re: [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:13 +0800 > Shenming Lu wrote: > >> This patch follows the discussion here: >> >> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/ >> >> Besides SVA/vSVA, such as VFIO may also enable (2nd level) IOPF to remove >> pinning restriction. In order to better support more scenarios of using >> device faults, we extend iommu_register_fault_handler() with flags and >> introduce FAULT_REPORT_ to describe the device fault reporting capability >> under a specific configuration. >> >> Note that we don't further distinguish recoverable and unrecoverable faults >> by flags in the fault reporting cap, having PAGE_FAULT_REPORT_ + >> UNRECOV_FAULT_REPORT_ seems not a clean way. >> >> In addition, still take VFIO as an example, in nested mode, the 1st level >> and 2nd level fault reporting may be configured separately and currently >> each device can only register one iommu dev fault handler, so we add a >> handler update interface for this. > > > IIUC, you're introducing flags for the fault handler callout, which > essentially allows the IOMMU layer to filter which types of faults the > handler can receive. You then need an update function to modify those > flags. Why can't the handler itself perform this filtering? For > instance in your vfio example, the handler registered by the type1 > backend could itself return fault until the fault transfer path to the > device driver is established. Thanks, As discussed in [1]: In nested IOPF, we have to figure out whether a fault comes from L1 or L2. A SMMU stall event comes with this information, but a PRI page request doesn't. The IOMMU driver can walk the page tables to find out the level information. If the walk terminates at L1, further inject to the guest. Otherwise fix the fault at L2 in VFIO. It's not efficient compared to hardware-provided info. And in pinned case, if VFIO can tell the IOMMU driver that no L2 fault is expected, there is no need to walk the page tables in the IOMMU driver? [1] https://patchwork.kernel.org/project/linux-arm-kernel/patch/20210108145217.2254447-4-jean-phili...@linaro.org/ Thanks, Shenming > > Alex > > . > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
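A minimal sketch of the level dispatch being discussed, done inside a consolidated handler. IOMMU_FAULT_PAGE_REQUEST_L2 and IOMMU_FAULT_UNRECOV_L2 are the flags this series proposes for hardware (such as the SMMU stall model) that reports the level; the fallback for hardware that cannot tell is an assumption of this sketch.

	static bool demo_fault_is_l2(struct iommu_fault *fault)
	{
		if (fault->type == IOMMU_FAULT_PAGE_REQ)
			return fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
		if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
			return fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
		/* Level unknown: treat it as L2 and let the host handle it. */
		return true;
	}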
Re: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:15 +0800 > Shenming Lu wrote: > >> To avoid pinning pages when they are mapped in IOMMU page tables, we >> add an MMU notifier to tell the addresses which are no longer valid >> and try to unmap them. >> >> Signed-off-by: Shenming Lu >> --- >> drivers/vfio/vfio_iommu_type1.c | 112 +++- >> 1 file changed, 109 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/vfio/vfio_iommu_type1.c >> b/drivers/vfio/vfio_iommu_type1.c >> index ab0ff60ee207..1cb9d1f2717b 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -40,6 +40,7 @@ >> #include >> #include >> #include >> +#include >> >> #define DRIVER_VERSION "0.2" >> #define DRIVER_AUTHOR "Alex Williamson " >> @@ -69,6 +70,7 @@ struct vfio_iommu { >> struct mutexlock; >> struct rb_root dma_list; >> struct blocking_notifier_head notifier; >> +struct mmu_notifier mn; >> unsigned intdma_avail; >> unsigned intvaddr_invalid_count; >> uint64_tpgsize_bitmap; >> @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu >> *iommu, struct vfio_dma *dma, >> return unlocked; >> } >> >> +/* Unmap the IOPF mapped pages in the specified range. */ >> +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu, >> +struct vfio_dma *dma, >> +dma_addr_t start, dma_addr_t end) >> +{ >> +struct iommu_iotlb_gather *gathers; >> +struct vfio_domain *d; >> +int i, num_domains = 0; >> + >> +list_for_each_entry(d, &iommu->domain_list, next) >> +num_domains++; >> + >> +gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL); >> +if (gathers) { >> +for (i = 0; i < num_domains; i++) >> +iommu_iotlb_gather_init(&gathers[i]); >> +} > > > If we're always serialized on iommu->lock, would it make sense to have > gathers pre-allocated on the vfio_iommu object? Yeah, we can do it. > >> + >> +while (start < end) { >> +unsigned long bit_offset; >> +size_t len; >> + >> +bit_offset = (start - dma->iova) >> PAGE_SHIFT; >> + >> +for (len = 0; start + len < end; len += PAGE_SIZE) { >> +if (!IOPF_MAPPED_BITMAP_GET(dma, >> +bit_offset + (len >> PAGE_SHIFT))) >> +break; > > > There are bitmap helpers for this, find_first_bit(), > find_next_zero_bit(). Thanks for the hint. :-) > > >> +} >> + >> +if (len) { >> +i = 0; >> +list_for_each_entry(d, &iommu->domain_list, next) { >> +size_t unmapped; >> + >> +if (gathers) >> +unmapped = iommu_unmap_fast(d->domain, >> +start, len, >> + >> &gathers[i++]); >> +else >> +unmapped = iommu_unmap(d->domain, >> + start, len); >> + >> +if (WARN_ON(unmapped != len)) > > The IOMMU API does not guarantee arbitrary unmap sizes, this will > trigger and this exit path is wrong. If we've already unmapped the > IOMMU, shouldn't we proceed with @unmapped rather than @len so the > device can re-fault the extra mappings? Otherwise the IOMMU state > doesn't match the iopf bitmap state. OK, I will correct it. And can we assume that the @unmapped values (returned from iommu_unmap) of all domains are the same (since all domains share the same iopf_mapped_bitmap)? > >> +goto out; >> +} >> + >> +bitmap_clear(dma->iopf_mapped_bitmap, >> + bit_offset, len >> PAGE_SHIFT); >> + >> +cond_resched(); >> +} >> + >> +start += (len + PAGE_SIZE); >> +
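A minimal sketch of the bitmap-helper rewrite suggested above: compute the length of the contiguous IOPF-mapped run starting at @start with find_next_zero_bit() instead of testing one bit per page. Names follow vfio_unmap_partial_iopf(); this is an untested illustration.

	#include <linux/bitmap.h>

	static size_t demo_mapped_run_len(struct vfio_dma *dma,
					  dma_addr_t start, dma_addr_t end)
	{
		unsigned long first = (start - dma->iova) >> PAGE_SHIFT;
		unsigned long last = (end - dma->iova) >> PAGE_SHIFT;
		unsigned long next_clear;

		/* Returns @last if every bit in [first, last) is set. */
		next_clear = find_next_zero_bit(dma->iopf_mapped_bitmap,
						last, first);
		return (size_t)(next_clear - first) << PAGE_SHIFT;
	}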
Re: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:16 +0800 > Shenming Lu wrote: > >> To optimize for fewer page fault handlings, we can pre-map more pages >> than requested at once. >> >> Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we >> could try further tuning. > > I'd prefer that the series introduced full end-to-end functionality > before trying to improve performance. The pre-map value seems > arbitrary here and as noted in the previous patch, the IOMMU API does > not guarantee unmaps of ranges smaller than the original mapping. This > would need to map with single page granularity in order to guarantee > page granularity at the mmu notifier when the IOMMU supports > superpages. Thanks, Yeah, I will drop this patch in the current stage. Thanks, Shenming > > Alex > > . > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:17 +0800 > Shenming Lu wrote: > >> Since enabling IOPF for devices may lead to a slow ramp up of performance, >> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the >> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and >> registering the VFIO IOPF handler. >> >> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be >> inflight page faults when disabling. >> >> Signed-off-by: Shenming Lu >> --- >> drivers/vfio/vfio_iommu_type1.c | 223 +++- >> include/uapi/linux/vfio.h | 6 + >> 2 files changed, 226 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/vfio/vfio_iommu_type1.c >> b/drivers/vfio/vfio_iommu_type1.c >> index 01e296c6dc9e..7df5711e743a 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -71,6 +71,7 @@ struct vfio_iommu { >> struct rb_root dma_list; >> struct blocking_notifier_head notifier; >> struct mmu_notifier mn; >> +struct mm_struct*mm; > > We currently have no requirement that a single mm is used for all > container mappings. Does enabling IOPF impose such a requirement? > Shouldn't MAP/UNMAP enforce that? Did you mean that there is a possibility that each vfio_dma in a container has a different mm? If so, could we register one MMU notifier for each vfio_dma in the MAP ioctl? > >> unsigned intdma_avail; >> unsigned intvaddr_invalid_count; >> uint64_tpgsize_bitmap; >> @@ -81,6 +82,7 @@ struct vfio_iommu { >> booldirty_page_tracking; >> boolpinned_page_dirty_scope; >> boolcontainer_open; >> +booliopf_enabled; >> }; >> >> struct vfio_domain { >> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group) >> return node ? iopf_group : NULL; >> } >> >> +static void vfio_link_iopf_group(struct vfio_iopf_group *new) >> +{ >> +struct rb_node **link, *parent = NULL; >> +struct vfio_iopf_group *iopf_group; >> + >> +mutex_lock(&iopf_group_list_lock); >> + >> +link = &iopf_group_list.rb_node; >> + >> +while (*link) { >> +parent = *link; >> +iopf_group = rb_entry(parent, struct vfio_iopf_group, node); >> + >> +if (new->iommu_group < iopf_group->iommu_group) >> +link = &(*link)->rb_left; >> +else >> +link = &(*link)->rb_right; >> +} >> + >> +rb_link_node(&new->node, parent, link); >> +rb_insert_color(&new->node, &iopf_group_list); >> + >> +mutex_unlock(&iopf_group_list_lock); >> +} >> + >> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old) >> +{ >> +mutex_lock(&iopf_group_list_lock); >> +rb_erase(&old->node, &iopf_group_list); >> +mutex_unlock(&iopf_group_list_lock); >> +} >> + >> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) >> { >> struct mm_struct *mm; >> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct >> vfio_iommu *iommu, >> list_splice_tail(iova_copy, iova); >> } >> >> +static int vfio_dev_domian_nested(struct device *dev, int *nested) > > > s/domian/domain/ > > >> +{ >> +struct iommu_domain *domain; >> + >> +domain = iommu_get_domain_for_dev(dev); >> +if (!domain) >> +return -ENODEV; >> + >> +return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested); > > > Wouldn't this be easier to use if it returned bool, false on -errno? Make sense. 
> > >> +} >> + >> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void >> *data); >> + >> +static int dev_enable_iopf(struct device *dev, void *data) >> +{ >> +int *enabled_dev_cnt = data; >> +int nested; >> +u32 flags; >> +int ret; >> + >> +ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF); >> +if (ret) >> +return ret; >> + >> +ret = vfio_dev_domian_nested(dev, &nested); >> +if (ret) >> +goto out_disable; &
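A minimal sketch of the bool-returning variant agreed on above (with the spelling fixed to "domain"); DOMAIN_ATTR_NESTING is the existing domain attribute the original helper queries, and the demo_* name is illustrative.

	static bool demo_vfio_dev_domain_nested(struct device *dev)
	{
		struct iommu_domain *domain;
		int nested = 0;

		domain = iommu_get_domain_for_dev(dev);
		if (!domain)
			return false;

		if (iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested))
			return false;

		return nested;
	}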
Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:14 +0800 > Shenming Lu wrote: > >> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging) >> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to >> serve the reported page faults from the IOMMU driver. >> >> Signed-off-by: Shenming Lu >> --- >> drivers/vfio/vfio_iommu_type1.c | 114 >> 1 file changed, 114 insertions(+) >> >> diff --git a/drivers/vfio/vfio_iommu_type1.c >> b/drivers/vfio/vfio_iommu_type1.c >> index 45cbfd4879a5..ab0ff60ee207 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -101,6 +101,7 @@ struct vfio_dma { >> struct task_struct *task; >> struct rb_root pfn_list; /* Ex-user pinned pfn list */ >> unsigned long *bitmap; >> +unsigned long *iopf_mapped_bitmap; >> }; >> >> struct vfio_batch { >> @@ -141,6 +142,16 @@ struct vfio_regions { >> size_t len; >> }; >> >> +/* A global IOPF enabled group list */ >> +static struct rb_root iopf_group_list = RB_ROOT; >> +static DEFINE_MUTEX(iopf_group_list_lock); >> + >> +struct vfio_iopf_group { >> +struct rb_node node; >> +struct iommu_group *iommu_group; >> +struct vfio_iommu *iommu; >> +}; >> + >> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \ >> (!list_empty(&iommu->domain_list)) >> >> @@ -157,6 +168,10 @@ struct vfio_regions { >> #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) >> #define DIRTY_BITMAP_SIZE_MAX >> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) >> >> +#define IOPF_MAPPED_BITMAP_GET(dma, i) \ >> + ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] >> \ >> + >> ((i) % BITS_PER_LONG)) & 0x1) > > > Can't we just use test_bit()? Yeah, we can use it. > > >> + >> #define WAITED 1 >> >> static int put_pfn(unsigned long pfn, int prot); >> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, >> struct vfio_pfn *vpfn) >> return ret; >> } >> >> +/* >> + * Helper functions for iopf_group_list >> + */ >> +static struct vfio_iopf_group * >> +vfio_find_iopf_group(struct iommu_group *iommu_group) >> +{ >> +struct vfio_iopf_group *iopf_group; >> +struct rb_node *node; >> + >> +mutex_lock(&iopf_group_list_lock); >> + >> +node = iopf_group_list.rb_node; >> + >> +while (node) { >> +iopf_group = rb_entry(node, struct vfio_iopf_group, node); >> + >> +if (iommu_group < iopf_group->iommu_group) >> +node = node->rb_left; >> +else if (iommu_group > iopf_group->iommu_group) >> +node = node->rb_right; >> +else >> +break; >> +} >> + >> +mutex_unlock(&iopf_group_list_lock); >> +return node ? iopf_group : NULL; >> +} > > This looks like a pretty heavy weight operation per DMA fault. > > I'm also suspicious of this validity of this iopf_group after we've > dropped the locking, the ordering of patches makes this very confusing. My thought was to include the handling of DMA faults completely in the type1 backend by introducing the vfio_iopf_group struct. But it seems that introducing a struct with an unknown lifecycle causes more problems... I will use the path from vfio-core as in the v2 for simplicity and validity. Sorry for the confusing, I will reconstruct the series later. 
:-) > >> + > >> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) >> { >> struct mm_struct *mm; >> @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct >> vfio_iommu *iommu, >> return -EINVAL; >> } >> >> +/* VFIO I/O Page Fault handler */ >> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void >> *data) > > From the comment, this seems like the IOMMU fault handler (the > construction of this series makes this difficult to follow) and > eventually it handles more than DMA mapping, for example transferring > faults to the device driver. "dma_map_iopf" seems like a poorly scoped > name. Maybe just call it dev_fault_handler?
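A minimal sketch of the test_bit() replacement agreed on above; the bit index is the page offset of the (page-aligned) faulting iova within the vfio_dma, exactly as in the IOPF_MAPPED_BITMAP_GET() macro it replaces.

	#include <linux/bitops.h>

	static bool demo_iopf_page_mapped(struct vfio_dma *dma, dma_addr_t iova)
	{
		unsigned long bit_offset = (iova - dma->iova) >> PAGE_SHIFT;

		return test_bit(bit_offset, dma->iopf_mapped_bitmap);
	}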
Re: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:18 +0800 > Shenming Lu wrote: > >> If IOPF enabled for the VFIO container, there is no need to statically >> pin and map the entire DMA range, we can do it on demand. And unmap >> according to the IOPF mapped bitmap when removing vfio_dma. >> >> Note that we still mark all pages dirty even if IOPF enabled, we may >> add IOPF-based fine grained dirty tracking support in the future. >> >> Signed-off-by: Shenming Lu >> --- >> drivers/vfio/vfio_iommu_type1.c | 38 +++-- >> 1 file changed, 32 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/vfio/vfio_iommu_type1.c >> b/drivers/vfio/vfio_iommu_type1.c >> index 7df5711e743a..dcc93c3b258c 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -175,6 +175,7 @@ struct vfio_iopf_group { >> #define IOPF_MAPPED_BITMAP_GET(dma, i) \ >>((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] >> \ >> >> ((i) % BITS_PER_LONG)) & 0x1) >> +#define IOPF_MAPPED_BITMAP_BYTES(n) DIRTY_BITMAP_BYTES(n) >> >> #define WAITED 1 >> >> @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data, >> * already pinned and accounted. Accouting should be done if there is no >> * iommu capable domain in the container. >> */ >> -do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); >> +do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || >> +iommu->iopf_enabled; >> >> for (i = 0; i < npage; i++) { >> struct vfio_pfn *vpfn; >> @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void >> *iommu_data, >> >> mutex_lock(&iommu->lock); >> >> -do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu); >> +do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || >> +iommu->iopf_enabled; > > pin/unpin are actually still pinning pages, why does iopf exempt them > from accounting? If iopf_enabled is true, do_accounting will be true too, we will account the external pinned pages? > > >> for (i = 0; i < npage; i++) { >> struct vfio_dma *dma; >> dma_addr_t iova; >> @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, >> struct vfio_dma *dma, >> if (!dma->size) >> return 0; >> >> -if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) >> +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled) >> return 0; >> >> /* >> @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct >> vfio_iommu *iommu, >> } >> } >> >> +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma >> *dma) >> +{ >> +vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size); >> + >> +kfree(dma->iopf_mapped_bitmap); >> +} >> + >> static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) >> { >> WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list)); >> vfio_unmap_unpin(iommu, dma, true); >> vfio_unlink_dma(iommu, dma); >> +if (iommu->iopf_enabled) >> +vfio_dma_clean_iopf(iommu, dma); >> put_task_struct(dma->task); >> vfio_dma_bitmap_free(dma); >> if (dma->vaddr_invalid) { >> @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, >> struct vfio_iommu *iommu, >> * mark all pages dirty if any IOMMU capable device is not able >> * to report dirty pages and all pages are pinned and mapped. >> */ >> -if (iommu->num_non_pinned_groups && dma->iommu_mapped) >> +if (iommu->num_non_pinned_groups && >> +(dma->iommu_mapped || iommu->iopf_enabled)) >> bitmap_set(dma->bitmap, 0, nbits); > > This seems like really poor integration of iopf into dirty page > tracking. 
I'd expect dirty logging to flush the mapped pages and > write faults to mark pages dirty. Shouldn't the fault handler also > provide only the access faulted, so for example a read fault wouldn't > mark the page dirty? I just want to keep the behavior here the same as before: if IOPF is enabled, we still mark all pages dirty. We can distinguish between write and read faults in the fa
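A minimal sketch of the finer-grained marking suggested above: only a write fault dirties the page. It assumes PAGE_SIZE dirty granularity and that dirty tracking is already enabled (dma->bitmap allocated); this is an untested illustration, not what the current patch does.

	static void demo_mark_dirty_on_fault(struct vfio_dma *dma,
					     struct iommu_fault *fault,
					     dma_addr_t iova)
	{
		if (!dma->bitmap || !(fault->prm.perm & IOMMU_FAULT_PERM_WRITE))
			return;

		bitmap_set(dma->bitmap, (iova - dma->iova) >> PAGE_SHIFT, 1);
	}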
Re: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:19 +0800 > Shenming Lu wrote: > >> Some devices only allow selective DMA faulting. Similar to the selective >> dirty page tracking, the vendor driver can call vfio_pin_pages() to >> indicate the non-faultable scope, we add a new struct vfio_range to >> record it, then when the IOPF handler receives any page request out >> of the scope, we can directly return with an invalid response. > > Seems like this highlights a deficiency in the design, that the user > can't specify mappings as iopf enabled or disabled. Also, if the > vendor driver has pinned pages within the range, shouldn't that prevent > them from faulting in the first place? Why do we need yet more > tracking structures? Pages pinned by the vendor driver need to count > against the user's locked memory limits regardless of iopf. Thanks, Currently we only have a vfio_pfn struct to track the external pinned pages (single page granularity), so I add a vfio_range struct for efficient lookup. Yeah, by this patch, for the non-pinned scope, we can directly return INVALID, but for the pinned(non-faultable) scope, tracking the pinned range doesn't seem to help more... Thanks, Shenming > > Alex > > . > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
On 2021/5/19 2:58, Alex Williamson wrote: > On Fri, 9 Apr 2021 11:44:20 +0800 > Shenming Lu wrote: > >> To set up nested mode, drivers such as vfio_pci need to register a >> handler to receive stage/level 1 faults from the IOMMU, but since >> currently each device can only have one iommu dev fault handler, >> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF), >> we choose to update the registered handler (a consolidated one) via >> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received >> stage 1 faults in the handler to the guest through a newly added >> vfio_device_ops callback. > > Are there proposed in-kernel drivers that would use any of these > symbols? I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can use these symbols to consolidate the two page fault handlers into one. [1] https://patchwork.kernel.org/project/kvm/cover/20210411114659.15051-1-eric.au...@redhat.com/ > >> Signed-off-by: Shenming Lu >> --- >> drivers/vfio/vfio.c | 81 + >> drivers/vfio/vfio_iommu_type1.c | 49 +++- >> include/linux/vfio.h| 12 + >> 3 files changed, 141 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c >> index 44c8dfabf7de..4245f15914bf 100644 >> --- a/drivers/vfio/vfio.c >> +++ b/drivers/vfio/vfio.c >> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct >> vfio_group *group) >> } >> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain); >> >> +/* >> + * Register/Update the VFIO IOPF handler to receive >> + * nested stage/level 1 faults. >> + */ >> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev) >> +{ >> +struct vfio_container *container; >> +struct vfio_group *group; >> +struct vfio_iommu_driver *driver; >> +int ret; >> + >> +if (!dev) >> +return -EINVAL; >> + >> +group = vfio_group_get_from_dev(dev); >> +if (!group) >> +return -ENODEV; >> + >> +ret = vfio_group_add_container_user(group); >> +if (ret) >> +goto out; >> + >> +container = group->container; >> +driver = container->iommu_driver; >> +if (likely(driver && driver->ops->register_handler)) >> +ret = driver->ops->register_handler(container->iommu_data, dev); >> +else >> +ret = -ENOTTY; >> + >> +vfio_group_try_dissolve_container(group); >> + >> +out: >> +vfio_group_put(group); >> +return ret; >> +} >> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested); >> + >> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev) >> +{ >> +struct vfio_container *container; >> +struct vfio_group *group; >> +struct vfio_iommu_driver *driver; >> +int ret; >> + >> +if (!dev) >> +return -EINVAL; >> + >> +group = vfio_group_get_from_dev(dev); >> +if (!group) >> +return -ENODEV; >> + >> +ret = vfio_group_add_container_user(group); >> +if (ret) >> +goto out; >> + >> +container = group->container; >> +driver = container->iommu_driver; >> +if (likely(driver && driver->ops->unregister_handler)) >> +ret = driver->ops->unregister_handler(container->iommu_data, >> dev); >> +else >> +ret = -ENOTTY; >> + >> +vfio_group_try_dissolve_container(group); >> + >> +out: >> +vfio_group_put(group); >> +return ret; >> +} >> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested); >> + >> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault) >> +{ >> +struct vfio_device *device = dev_get_drvdata(dev); >> + >> +if (unlikely(!device->ops->transfer)) >> +return -EOPNOTSUPP; >> + >> +return device->ops->transfer(device->device_data, fault); >> +} >> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault); >> + >> /** >> * 
Module/class support >> */ >> diff --git a/drivers/vfio/vfio_iommu_type1.c >> b/drivers/vfio/vfio_iommu_type1.c >> index ba2b5a1cf6e9..9d1adeddb303 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -3821,13 +3821,32 @@ static int
Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
On 2021/5/21 15:59, Shenming Lu wrote: > On 2021/5/19 2:58, Alex Williamson wrote: >> On Fri, 9 Apr 2021 11:44:20 +0800 >> Shenming Lu wrote: >> >>> To set up nested mode, drivers such as vfio_pci need to register a >>> handler to receive stage/level 1 faults from the IOMMU, but since >>> currently each device can only have one iommu dev fault handler, >>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF), >>> we choose to update the registered handler (a consolidated one) via >>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received >>> stage 1 faults in the handler to the guest through a newly added >>> vfio_device_ops callback. >> >> Are there proposed in-kernel drivers that would use any of these >> symbols? > > I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can > use these symbols to consolidate the two page fault handlers into one. > > [1] > https://patchwork.kernel.org/project/kvm/cover/2021044659.15051-1-eric.au...@redhat.com/ > >> >>> Signed-off-by: Shenming Lu >>> --- >>> drivers/vfio/vfio.c | 81 + >>> drivers/vfio/vfio_iommu_type1.c | 49 +++- >>> include/linux/vfio.h| 12 + >>> 3 files changed, 141 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c >>> index 44c8dfabf7de..4245f15914bf 100644 >>> --- a/drivers/vfio/vfio.c >>> +++ b/drivers/vfio/vfio.c >>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct >>> vfio_group *group) >>> } >>> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain); >>> >>> +/* >>> + * Register/Update the VFIO IOPF handler to receive >>> + * nested stage/level 1 faults. >>> + */ >>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev) >>> +{ >>> + struct vfio_container *container; >>> + struct vfio_group *group; >>> + struct vfio_iommu_driver *driver; >>> + int ret; >>> + >>> + if (!dev) >>> + return -EINVAL; >>> + >>> + group = vfio_group_get_from_dev(dev); >>> + if (!group) >>> + return -ENODEV; >>> + >>> + ret = vfio_group_add_container_user(group); >>> + if (ret) >>> + goto out; >>> + >>> + container = group->container; >>> + driver = container->iommu_driver; >>> + if (likely(driver && driver->ops->register_handler)) >>> + ret = driver->ops->register_handler(container->iommu_data, dev); >>> + else >>> + ret = -ENOTTY; >>> + >>> + vfio_group_try_dissolve_container(group); >>> + >>> +out: >>> + vfio_group_put(group); >>> + return ret; >>> +} >>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested); >>> + >>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev) >>> +{ >>> + struct vfio_container *container; >>> + struct vfio_group *group; >>> + struct vfio_iommu_driver *driver; >>> + int ret; >>> + >>> + if (!dev) >>> + return -EINVAL; >>> + >>> + group = vfio_group_get_from_dev(dev); >>> + if (!group) >>> + return -ENODEV; >>> + >>> + ret = vfio_group_add_container_user(group); >>> + if (ret) >>> + goto out; >>> + >>> + container = group->container; >>> + driver = container->iommu_driver; >>> + if (likely(driver && driver->ops->unregister_handler)) >>> + ret = driver->ops->unregister_handler(container->iommu_data, >>> dev); >>> + else >>> + ret = -ENOTTY; >>> + >>> + vfio_group_try_dissolve_container(group); >>> + >>> +out: >>> + vfio_group_put(group); >>> + return ret; >>> +} >>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested); >>> + >>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault >>> *fault) >>> +{ >>> + struct vfio_device *device = dev_get_drvdata(dev); >>> + 
>>> + if (unlikely(!device->ops->transfer)) >>> + return -EOPNOTSUPP; >>> + >>> + return device-
Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
On 2021/5/25 6:11, Alex Williamson wrote: > On Mon, 24 May 2021 21:11:11 +0800 > Shenming Lu wrote: > >> On 2021/5/21 15:59, Shenming Lu wrote: >>> On 2021/5/19 2:58, Alex Williamson wrote: >>>> On Fri, 9 Apr 2021 11:44:20 +0800 >>>> Shenming Lu wrote: >>>> >>>>> To set up nested mode, drivers such as vfio_pci need to register a >>>>> handler to receive stage/level 1 faults from the IOMMU, but since >>>>> currently each device can only have one iommu dev fault handler, >>>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF), >>>>> we choose to update the registered handler (a consolidated one) via >>>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received >>>>> stage 1 faults in the handler to the guest through a newly added >>>>> vfio_device_ops callback. >>>> >>>> Are there proposed in-kernel drivers that would use any of these >>>> symbols? >>> >>> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can >>> use these symbols to consolidate the two page fault handlers into one. >>> >>> [1] >>> https://patchwork.kernel.org/project/kvm/cover/2021044659.15051-1-eric.au...@redhat.com/ >>> >>>> >>>>> Signed-off-by: Shenming Lu >>>>> --- >>>>> drivers/vfio/vfio.c | 81 + >>>>> drivers/vfio/vfio_iommu_type1.c | 49 +++- >>>>> include/linux/vfio.h| 12 + >>>>> 3 files changed, 141 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c >>>>> index 44c8dfabf7de..4245f15914bf 100644 >>>>> --- a/drivers/vfio/vfio.c >>>>> +++ b/drivers/vfio/vfio.c >>>>> @@ -2356,6 +2356,87 @@ struct iommu_domain >>>>> *vfio_group_iommu_domain(struct vfio_group *group) >>>>> } >>>>> EXPORT_SYMBOL_GPL(vfio_group_iommu_domain); >>>>> >>>>> +/* >>>>> + * Register/Update the VFIO IOPF handler to receive >>>>> + * nested stage/level 1 faults. >>>>> + */ >>>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev) >>>>> +{ >>>>> + struct vfio_container *container; >>>>> + struct vfio_group *group; >>>>> + struct vfio_iommu_driver *driver; >>>>> + int ret; >>>>> + >>>>> + if (!dev) >>>>> + return -EINVAL; >>>>> + >>>>> + group = vfio_group_get_from_dev(dev); >>>>> + if (!group) >>>>> + return -ENODEV; >>>>> + >>>>> + ret = vfio_group_add_container_user(group); >>>>> + if (ret) >>>>> + goto out; >>>>> + >>>>> + container = group->container; >>>>> + driver = container->iommu_driver; >>>>> + if (likely(driver && driver->ops->register_handler)) >>>>> + ret = driver->ops->register_handler(container->iommu_data, dev); >>>>> + else >>>>> + ret = -ENOTTY; >>>>> + >>>>> + vfio_group_try_dissolve_container(group); >>>>> + >>>>> +out: >>>>> + vfio_group_put(group); >>>>> + return ret; >>>>> +} >>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested); >>>>> + >>>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev) >>>>> +{ >>>>> + struct vfio_container *container; >>>>> + struct vfio_group *group; >>>>> + struct vfio_iommu_driver *driver; >>>>> + int ret; >>>>> + >>>>> + if (!dev) >>>>> + return -EINVAL; >>>>> + >>>>> + group = vfio_group_get_from_dev(dev); >>>>> + if (!group) >>>>> + return -ENODEV; >>>>> + >>>>> + ret = vfio_group_add_container_user(group); >>>>> + if (ret) >>>>> + goto out; >>>>> + >>>>> + container = group->container; >>>>> + driver = container->iommu_driver; >>>>> + if (likely(driver && driver->ops->unregi
Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF
On 2021/5/25 6:11, Alex Williamson wrote: > On Fri, 21 May 2021 14:38:25 +0800 > Shenming Lu wrote: > >> On 2021/5/19 2:58, Alex Williamson wrote: >>> On Fri, 9 Apr 2021 11:44:17 +0800 >>> Shenming Lu wrote: >>> >>>> Since enabling IOPF for devices may lead to a slow ramp up of performance, >>>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the >>>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and >>>> registering the VFIO IOPF handler. >>>> >>>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be >>>> inflight page faults when disabling. >>>> >>>> Signed-off-by: Shenming Lu >>>> --- >>>> drivers/vfio/vfio_iommu_type1.c | 223 +++- >>>> include/uapi/linux/vfio.h | 6 + >>>> 2 files changed, 226 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c >>>> b/drivers/vfio/vfio_iommu_type1.c >>>> index 01e296c6dc9e..7df5711e743a 100644 >>>> --- a/drivers/vfio/vfio_iommu_type1.c >>>> +++ b/drivers/vfio/vfio_iommu_type1.c >>>> @@ -71,6 +71,7 @@ struct vfio_iommu { >>>>struct rb_root dma_list; >>>>struct blocking_notifier_head notifier; >>>>struct mmu_notifier mn; >>>> + struct mm_struct*mm; >>> >>> We currently have no requirement that a single mm is used for all >>> container mappings. Does enabling IOPF impose such a requirement? >>> Shouldn't MAP/UNMAP enforce that? >> >> Did you mean that there is a possibility that each vfio_dma in a >> container has a different mm? If so, could we register one MMU >> notifier for each vfio_dma in the MAP ioctl? > > We don't prevent it currently. There needs to be some balance, > typically we'll have one mm, so would it make sense to have potentially > thousands of mmu notifiers registered against the same mm? If we have one MMU notifier per vfio_dma, there is no need to search the iommu->dma_list when receiving an address invalidation event in mn_invalidate_range(). Or could we divide the vfio_dma ranges into parts with different mm, and each part would have one MMU notifier? > >>>>unsigned intdma_avail; >>>>unsigned intvaddr_invalid_count; >>>>uint64_tpgsize_bitmap; >>>> @@ -81,6 +82,7 @@ struct vfio_iommu { >>>>booldirty_page_tracking; >>>>boolpinned_page_dirty_scope; >>>>boolcontainer_open; >>>> + booliopf_enabled; >>>> }; >>>> >>>> struct vfio_domain { >>>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group) >>>>return node ? iopf_group : NULL; >>>> } >>>> >>>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new) >>>> +{ >>>> + struct rb_node **link, *parent = NULL; >>>> + struct vfio_iopf_group *iopf_group; >>>> + >>>> + mutex_lock(&iopf_group_list_lock); >>>> + >>>> + link = &iopf_group_list.rb_node; >>>> + >>>> + while (*link) { >>>> + parent = *link; >>>> + iopf_group = rb_entry(parent, struct vfio_iopf_group, node); >>>> + >>>> + if (new->iommu_group < iopf_group->iommu_group) >>>> + link = &(*link)->rb_left; >>>> + else >>>> + link = &(*link)->rb_right; >>>> + } >>>> + >>>> + rb_link_node(&new->node, parent, link); >>>> + rb_insert_color(&new->node, &iopf_group_list); >>>> + >>>> + mutex_unlock(&iopf_group_list_lock); >>>> +} >>>> + >>>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old) >>>> +{ >>>> + mutex_lock(&iopf_group_list_lock); >>>> + rb_erase(&old->node, &iopf_group_list); >>>> + mutex_unlock(&iopf_group_list_lock); >>>> +} >>>> + >>>> static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) >>>> { >>>>struct mm_struct *mm; >>>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_i
Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler
On 2021/5/25 6:11, Alex Williamson wrote: > On Fri, 21 May 2021 14:38:52 +0800 > Shenming Lu wrote: > >> On 2021/5/19 2:58, Alex Williamson wrote: >>> On Fri, 9 Apr 2021 11:44:14 +0800 >>> Shenming Lu wrote: >>> >>>> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging) >>>> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to >>>> serve the reported page faults from the IOMMU driver. >>>> >>>> Signed-off-by: Shenming Lu >>>> --- >>>> drivers/vfio/vfio_iommu_type1.c | 114 >>>> 1 file changed, 114 insertions(+) >>>> >>>> diff --git a/drivers/vfio/vfio_iommu_type1.c >>>> b/drivers/vfio/vfio_iommu_type1.c >>>> index 45cbfd4879a5..ab0ff60ee207 100644 >>>> --- a/drivers/vfio/vfio_iommu_type1.c >>>> +++ b/drivers/vfio/vfio_iommu_type1.c >>>> @@ -101,6 +101,7 @@ struct vfio_dma { >>>>struct task_struct *task; >>>>struct rb_root pfn_list; /* Ex-user pinned pfn list */ >>>>unsigned long *bitmap; >>>> + unsigned long *iopf_mapped_bitmap; >>>> }; >>>> >>>> struct vfio_batch { >>>> @@ -141,6 +142,16 @@ struct vfio_regions { >>>>size_t len; >>>> }; >>>> >>>> +/* A global IOPF enabled group list */ >>>> +static struct rb_root iopf_group_list = RB_ROOT; >>>> +static DEFINE_MUTEX(iopf_group_list_lock); >>>> + >>>> +struct vfio_iopf_group { >>>> + struct rb_node node; >>>> + struct iommu_group *iommu_group; >>>> + struct vfio_iommu *iommu; >>>> +}; >>>> + >>>> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \ >>>>(!list_empty(&iommu->domain_list)) >>>> >>>> @@ -157,6 +168,10 @@ struct vfio_regions { >>>> #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX) >>>> #define DIRTY_BITMAP_SIZE_MAX >>>> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX) >>>> >>>> +#define IOPF_MAPPED_BITMAP_GET(dma, i)\ >>>> +((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG] >>>> \ >>>> + >> ((i) % BITS_PER_LONG)) & 0x1) >>> >>> >>> Can't we just use test_bit()? >> >> Yeah, we can use it. >> >>> >>> >>>> + >>>> #define WAITED 1 >>>> >>>> static int put_pfn(unsigned long pfn, int prot); >>>> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma >>>> *dma, struct vfio_pfn *vpfn) >>>>return ret; >>>> } >>>> >>>> +/* >>>> + * Helper functions for iopf_group_list >>>> + */ >>>> +static struct vfio_iopf_group * >>>> +vfio_find_iopf_group(struct iommu_group *iommu_group) >>>> +{ >>>> + struct vfio_iopf_group *iopf_group; >>>> + struct rb_node *node; >>>> + >>>> + mutex_lock(&iopf_group_list_lock); >>>> + >>>> + node = iopf_group_list.rb_node; >>>> + >>>> + while (node) { >>>> + iopf_group = rb_entry(node, struct vfio_iopf_group, node); >>>> + >>>> + if (iommu_group < iopf_group->iommu_group) >>>> + node = node->rb_left; >>>> + else if (iommu_group > iopf_group->iommu_group) >>>> + node = node->rb_right; >>>> + else >>>> + break; >>>> + } >>>> + >>>> + mutex_unlock(&iopf_group_list_lock); >>>> + return node ? iopf_group : NULL; >>>> +} >>> >>> This looks like a pretty heavy weight operation per DMA fault. >>> >>> I'm also suspicious of this validity of this iopf_group after we've >>> dropped the locking, the ordering of patches makes this very confusing. >> >> My thought was to include the handling of DMA faults completely in the type1 >> backend by introducing the vfio_iopf_group struct. But it seems that >> introducing >> a struct with an unknown lifecycle causes more problems... >> I will use the path from vfio-core as in the v2
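For reference, Alex's test_bit() suggestion amounts to the small substitution sketched below; the wrapper names are illustrative, test_bit()/set_bit() are the standard kernel bitmap helpers, and the struct is trimmed to the one field used here.

#include <linux/bitops.h>
#include <linux/types.h>

struct vfio_dma {			/* trimmed to the field used in this sketch */
	unsigned long *iopf_mapped_bitmap;
};

/* Instead of the open-coded IOPF_MAPPED_BITMAP_GET() macro: */
static inline bool vfio_dma_iopf_mapped(struct vfio_dma *dma, unsigned long i)
{
	return test_bit(i, dma->iopf_mapped_bitmap);
}

/* The fault handler and unmap paths would pair it with set_bit()/clear_bit(). */
static inline void vfio_dma_iopf_set_mapped(struct vfio_dma *dma, unsigned long i)
{
	set_bit(i, dma->iopf_mapped_bitmap);
}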
Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough
On 2021/5/25 6:11, Alex Williamson wrote: > On Fri, 21 May 2021 14:37:21 +0800 > Shenming Lu wrote: > >> Hi Alex, >> >> On 2021/5/19 2:57, Alex Williamson wrote: >>> On Fri, 9 Apr 2021 11:44:12 +0800 >>> Shenming Lu wrote: >>> >>>> Hi, >>>> >>>> Requesting for your comments and suggestions. :-) >>>> >>>> The static pinning and mapping problem in VFIO and possible solutions >>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O >>>> Page Fault support for VFIO devices. Different from those relatively >>>> complicated software approaches such as presenting a vIOMMU that provides >>>> the DMA buffer information (might include para-virtualized optimizations), >>>> IOPF mainly depends on the hardware faulting capability, such as the PCIe >>>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in >>>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF >>>> support for VFIO passthrough based on the IOPF part of SVA in this series. >>>> >>> >>> The SVA proposals are being reworked to make use of a new IOASID >>> object, it's not clear to me that this shouldn't also make use of that >>> work as it does a significant expansion of the type1 IOMMU with fault >>> handlers that would duplicate new work using that new model. >> >> It seems that the IOASID extension for guest SVA would not affect this >> series, >> will we do any host-guest IOASID translation in the VFIO fault handler? > > Surely it will, we don't currently have any IOMMU fault handling or > forwarding of IOMMU faults through to the vfio bus driver, both of > those would be included in an IOASID implementation. I think Jason's > vision is to use IOASID to deprecate type1 for all use cases, so even > if we were to choose to implement IOPF in type1 we should agree on > common interfaces with IOASID. Yeah, the guest IOPF(L1) handling may include the host-guest IOASID translation, which can be placed in the IOASID layer (in fact it can be placed in many places such as the vfio pci driver since it don't really handle the fault event, it just transfers the event to the vIOMMU). But the host IOPF(L2) has no relationship with IOASID at all, it needs to have a knowledge of the vfio_dma ranges. Could we add the host IOPF support to type1 first (integrate it within the MAP ioctl)? And we may migrate the generic iommu controls (such as MAP/UNMAP...) from type1 to IOASID in the future (it seems to be a huge work, I will be very happy if I could help this)... :-) > >>>> We have measured its performance with UADK [4] (passthrough an accelerator >>>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA): >>>> >>>> Run hisi_sec_test... >>>> - with varying sending times and message lengths >>>> - with/without IOPF enabled (speed slowdown) >>>> >>>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1): >>>> slowdown (num of faults) >>>> times VFIO IOPF host SVA >>>> 1 63.4% (518)82.8% (512) >>>> 10022.9% (1058) 47.9% (1024) >>>> 1000 2.6% (1071)8.5% (1024) >>>> >>>> when msg_len = 10MB (and PREMAP_LEN = 512): >>>> slowdown (num of faults) >>>> times VFIO IOPF >>>> 1 32.6% (13) >>>> 1003.5% (26) >>>> 1000 1.6% (26) >>> >>> It seems like this is only an example that you can make a benchmark >>> show anything you want. The best results would be to pre-map >>> everything, which is what we have without this series. What is an >>> acceptable overhead to incur to avoid page pinning? 
What if userspace >>> had more fine grained control over which mappings were available for >>> faulting and which were statically mapped? I don't really see what >>> sense the pre-mapping range makes. If I assume the user is QEMU in a >>> non-vIOMMU configuration, pre-mapping the beginning of each RAM section >>> doesn't make any logical sense relative to device DMA. >> >> As you said in Patch 4, we can introduce full end-to-end functionality >> before trying to improve performance, and I will drop the pre-mapping patch >> in the current stage... >> >> Is there a need that userspace wants more fine grained control over which >> mappings are available for faulting? If so, we
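One possible shape for the finer-grained control Alex asks about, sketched purely as an assumption (the VFIO_DMA_MAP_FLAG_IOPF flag does not exist in the upstream uAPI), is a per-mapping flag on VFIO_IOMMU_MAP_DMA so userspace can choose, mapping by mapping, between up-front pinning and fault-driven mapping:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Hypothetical per-mapping flag, NOT part of the upstream uAPI. */
#define VFIO_DMA_MAP_FLAG_IOPF	(1 << 3)

static int map_range(int container_fd, __u64 iova, __u64 size, __u64 vaddr,
		     int faultable)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = vaddr,
		.iova  = iova,
		.size  = size,
	};

	if (faultable)
		map.flags |= VFIO_DMA_MAP_FLAG_IOPF;	/* pin and map on demand */

	/* Without the flag, the range is pinned and mapped up front as today. */
	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

With something like this, a non-vIOMMU QEMU could, for example, keep the ranges it knows are DMA-hot statically mapped and let the rest of guest RAM fault, rather than relying on a fixed pre-mapping length.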
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 10:36, Jason Wang wrote: > > On 2021/5/31 4:41 PM, Liu Yi L wrote: >>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support >>> hardware nesting. Or is there way to detect the capability before? >> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid >> is not able to support nesting, then should fail it. >> >>> I think GET_INFO only works after the ATTACH. >> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the >> gpa_ioasid and check if nesting is supported or not. right? > > > Some more questions: > > 1) Is the handle returned by IOASID_ALLOC an fd? > 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. > (This is the question that is not answered) and what happens if we call > GET_INFO for the ioasid_fd? > 3) If not, how GET_INFO work? It seems that the return value from IOASID_ALLOC is an IOASID number in the ioasid_data struct. Then when calling GET_INFO, we should pass this IOASID number to get the associated I/O address space attributes (depending on the physical IOMMU, which could be discovered when attaching a device to the IOASID fd or number), right? Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
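To make the ALLOC/GET_INFO question concrete, a userspace flow following the RFC's description might look like the sketch below. Every ioctl name, number and struct layout here is an assumption made only for illustration, since the real uAPI was still a proposal at this point (error handling omitted).

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Placeholder ioctl definitions so the sketch is self-contained; the
 * proposal had not fixed any numbers or layouts. */
#define IOASID_TYPE		';'
#define IOASID_ALLOC		_IO(IOASID_TYPE, 0)
#define IOASID_GET_INFO		_IO(IOASID_TYPE, 1)

struct ioasid_alloc_req {		/* hypothetical */
	uint32_t argsz;
	uint32_t flags;
};

struct ioasid_info {			/* hypothetical */
	uint32_t argsz;
	uint32_t ioasid;		/* IOASID number being queried */
	uint64_t format;		/* page table format, etc. */
	uint64_t addr_width;
};

static int ioasid_example(void)
{
	int ioasid_fd = open("/dev/ioasid", O_RDWR);
	struct ioasid_alloc_req alloc = { .argsz = sizeof(alloc) };
	struct ioasid_info info = { .argsz = sizeof(info) };
	int gpa_ioasid;

	/* Per the RFC, ALLOC returns an IOASID number on the existing fd,
	 * not a new file descriptor. */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);

	/* GET_INFO then takes that number; the reported attributes only
	 * become meaningful once a device has been attached. */
	info.ioasid = gpa_ioasid;
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	return gpa_ioasid;
}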
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/5/27 15:58, Tian, Kevin wrote: > /dev/ioasid provides an unified interface for managing I/O page tables for > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, > etc.) are expected to use this interface instead of creating their own logic > to > isolate untrusted device DMAs initiated by userspace. > > This proposal describes the uAPI of /dev/ioasid and also sample sequences > with VFIO as example in typical usages. The driver-facing kernel API provided > by the iommu layer is still TBD, which can be discussed after consensus is > made on this uAPI. > > It's based on a lengthy discussion starting from here: > > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ > > It ends up to be a long writing due to many things to be summarized and > non-trivial effort required to connect them into a complete proposal. > Hope it provides a clean base to converge. > [..] > > /* > * Page fault report and response > * > * This is TBD. Can be added after other parts are cleared up. Likely it > * will be a ring buffer shared between user/kernel, an eventfd to notify > * the user and an ioctl to complete the fault. > * > * The fault data is per I/O address space, i.e.: IOASID + faulting_addr > */ Hi, It seems that the ioasid has different usages in different situations: it could be directly used in the physical routing, or be just a virtual handle that indicates a page table or a vPASID table (such as the GPA address space; in the simple passthrough case, the DMA input to the IOMMU will just contain a Stream ID, no Substream ID), right? And Baolu suggested that since one device might consume multiple page tables, it's more reasonable to have one fault handler per page table. By this, do we have to maintain such an ioasid info list in the IOMMU layer? Then if we add host IOPF support (for the GPA address space) in the future (I have sent a series for this, but it aimed at VFIO; I will convert it for IOASID later [1] :-)), how could we find the handler for the received fault event which only contains a Stream ID... Do we also have to maintain a dev(vPASID)->ioasid mapping in the IOMMU layer? [1] https://lore.kernel.org/patchwork/cover/1410223/ Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
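The Stream-ID-only routing question can be restated in code: some layer has to map (device, optional PASID) to an IOASID before any handler can be chosen. The sketch below is purely illustrative; none of these structures exist, and whether such a table belongs in the IOMMU layer or in /dev/ioasid is exactly the open question.

#include <linux/device.h>
#include <linux/ioasid.h>
#include <linux/list.h>
#include <linux/types.h>

#define IOASID_NO_PASID		((u32)-1)	/* fault carried no Substream ID */

struct ioasid_route {			/* hypothetical routing entry */
	struct list_head node;
	struct device	*dev;		/* the faulting RID */
	u32		 pasid;		/* IOASID_NO_PASID for the default/GPA entry */
	ioasid_t	 ioasid;	/* the I/O address space to notify */
};

static ioasid_t ioasid_route_fault(struct list_head *routes, struct device *dev,
				   u32 pasid)
{
	struct ioasid_route *r;

	list_for_each_entry(r, routes, node) {
		/*
		 * A fault that only carries a Stream ID looks up the entry
		 * registered with IOASID_NO_PASID, e.g. the GPA space.
		 */
		if (r->dev == dev && r->pasid == pasid)
			return r->ioasid;
	}
	return INVALID_IOASID;		/* no consumer: respond with failure */
}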
Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support
On 2021/5/27 19:18, Lu Baolu wrote: > Hi Shenming and Alex, > > On 5/27/21 7:03 PM, Shenming Lu wrote: >>> I haven't fully read all the references, but who imposes the fact that >>> there's only one fault handler per device? If type1 could register one >>> handler and the vfio-pci bus driver another for the other level, would >>> we need this path through vfio-core? >> If we could register more than one handler per device, things would become >> much more simple, and the path through vfio-core would not be needed. >> >> Hi Baolu, >> Is there any restriction for having more than one handler per device? >> > > Currently, each device could only have one fault handler. But one device > might consume multiple page tables. From this point of view, it's more > reasonable to have one handler per page table. Sounds good to me. I have pointed it out in the IOASID uAPI proposal. :-) Thanks, Shenming > > Best regards, > baolu > . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
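For context, the one-handler-per-device model being discussed is the existing iommu core interface shown below (the registration call is the real API of that time; the handler body is only a stub). A per-page-table variant would essentially move the registration granularity from the device to the I/O page table.

#include <linux/errno.h>
#include <linux/iommu.h>

/*
 * Stub consumer-side handler: with the current API this one function
 * receives every recoverable fault for the device, whichever page table
 * (PASID) actually faulted.
 */
static int example_dev_fault_handler(struct iommu_fault *fault, void *data)
{
	if (fault->type == IOMMU_FAULT_PAGE_REQ)
		return 0;	/* queue the page request for handling */

	return -EOPNOTSUPP;	/* unrecoverable events handled elsewhere */
}

static int example_enable_faults(struct device *dev, void *cookie)
{
	/* The iommu core accepts a single registered handler per device. */
	return iommu_register_device_fault_handler(dev,
						   example_dev_fault_handler,
						   cookie);
}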
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 13:10, Lu Baolu wrote: > Hi Shenming, > > On 6/1/21 12:31 PM, Shenming Lu wrote: >> On 2021/5/27 15:58, Tian, Kevin wrote: >>> /dev/ioasid provides an unified interface for managing I/O page tables for >>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, >>> etc.) are expected to use this interface instead of creating their own >>> logic to >>> isolate untrusted device DMAs initiated by userspace. >>> >>> This proposal describes the uAPI of /dev/ioasid and also sample sequences >>> with VFIO as example in typical usages. The driver-facing kernel API >>> provided >>> by the iommu layer is still TBD, which can be discussed after consensus is >>> made on this uAPI. >>> >>> It's based on a lengthy discussion starting from here: >>> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ >>> >>> It ends up to be a long writing due to many things to be summarized and >>> non-trivial effort required to connect them into a complete proposal. >>> Hope it provides a clean base to converge. >>> >> >> [..] >> >>> >>> /* >>> * Page fault report and response >>> * >>> * This is TBD. Can be added after other parts are cleared up. Likely it >>> * will be a ring buffer shared between user/kernel, an eventfd to notify >>> * the user and an ioctl to complete the fault. >>> * >>> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr >>> */ >> >> Hi, >> >> It seems that the ioasid has different usage in different situation, it could >> be directly used in the physical routing, or just a virtual handle that >> indicates >> a page table or a vPASID table (such as the GPA address space, in the simple >> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no >> Substream ID), right? >> >> And Baolu suggested that since one device might consume multiple page tables, >> it's more reasonable to have one fault handler per page table. By this, do we >> have to maintain such an ioasid info list in the IOMMU layer? > > As discussed earlier, the I/O page fault and cache invalidation paths > will have "device labels" so that the information could be easily > translated and routed. > > So it's likely the per-device fault handler registering API in iommu > core can be kept, but /dev/ioasid will be grown with a layer to > translate and propagate I/O page fault information to the right > consumers. Yeah, having a general preprocessing of the faults in IOASID seems to be a doable direction. But since there may be more than one consumer at the same time, who is responsible for registering the per-device fault handler? Thanks, Shenming > > If things evolve in this way, probably the SVA I/O page fault also needs > to be ported to /dev/ioasid. > >> >> Then if we add host IOPF support (for the GPA address space) in the future >> (I have sent a series for this but it aimed for VFIO, I will convert it for >> IOASID later [1] :-)), how could we find the handler for the received fault >> event which only contains a Stream ID... Do we also have to maintain a >> dev(vPASID)->ioasid mapping in the IOMMU layer? >> >> [1] https://lore.kernel.org/patchwork/cover/1410223/ > > Best regards, > baolu > . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/1 20:30, Lu Baolu wrote: > On 2021/6/1 15:15, Shenming Lu wrote: >> On 2021/6/1 13:10, Lu Baolu wrote: >>> Hi Shenming, >>> >>> On 6/1/21 12:31 PM, Shenming Lu wrote: >>>> On 2021/5/27 15:58, Tian, Kevin wrote: >>>>> /dev/ioasid provides an unified interface for managing I/O page tables for >>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, >>>>> etc.) are expected to use this interface instead of creating their own >>>>> logic to >>>>> isolate untrusted device DMAs initiated by userspace. >>>>> >>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences >>>>> with VFIO as example in typical usages. The driver-facing kernel API >>>>> provided >>>>> by the iommu layer is still TBD, which can be discussed after consensus is >>>>> made on this uAPI. >>>>> >>>>> It's based on a lengthy discussion starting from here: >>>>> >>>>> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ >>>>> >>>>> It ends up to be a long writing due to many things to be summarized and >>>>> non-trivial effort required to connect them into a complete proposal. >>>>> Hope it provides a clean base to converge. >>>>> >>>> [..] >>>> >>>>> /* >>>>> * Page fault report and response >>>>> * >>>>> * This is TBD. Can be added after other parts are cleared up. Likely >>>>> it >>>>> * will be a ring buffer shared between user/kernel, an eventfd to >>>>> notify >>>>> * the user and an ioctl to complete the fault. >>>>> * >>>>> * The fault data is per I/O address space, i.e.: IOASID + >>>>> faulting_addr >>>>> */ >>>> Hi, >>>> >>>> It seems that the ioasid has different usage in different situation, it >>>> could >>>> be directly used in the physical routing, or just a virtual handle that >>>> indicates >>>> a page table or a vPASID table (such as the GPA address space, in the >>>> simple >>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no >>>> Substream ID), right? >>>> >>>> And Baolu suggested that since one device might consume multiple page >>>> tables, >>>> it's more reasonable to have one fault handler per page table. By this, do >>>> we >>>> have to maintain such an ioasid info list in the IOMMU layer? >>> As discussed earlier, the I/O page fault and cache invalidation paths >>> will have "device labels" so that the information could be easily >>> translated and routed. >>> >>> So it's likely the per-device fault handler registering API in iommu >>> core can be kept, but /dev/ioasid will be grown with a layer to >>> translate and propagate I/O page fault information to the right >>> consumers. >> Yeah, having a general preprocessing of the faults in IOASID seems to be >> a doable direction. But since there may be more than one consumer at the >> same time, who is responsible for registering the per-device fault handler? > > The drivers register per page table fault handlers to /dev/ioasid which > will then register itself to iommu core to listen and route the per- > device I/O page faults. This is just a top level thought. Haven't gone > through the details yet. Need to wait and see what /dev/ioasid finally > looks like. OK. And it needs to be confirmed by Jean since we might migrate the code from io-pgfault.c to IOASID... Anyway, finalize /dev/ioasid first. Thanks, Shenming > > Best regards, > baolu > . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
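Baolu's layering ("drivers register per page table fault handlers to /dev/ioasid, which then registers itself to the iommu core") could be pictured roughly as below. Everything here is a guess at a design that had not been written yet; in particular, ioasid_resolve_fault() is a hypothetical helper standing in for the translation step.

#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/xarray.h>

struct ioasid_pgtable_handler {		/* hypothetical, one per page table */
	int (*handle)(u32 ioasid, struct iommu_fault *fault, void *data);
	void *data;
};

struct ioasid_fault_router {		/* hypothetical /dev/ioasid state */
	struct xarray handlers;		/* indexed by ioasid */
};

/* Hypothetical: translate a raw fault (device + optional PASID) to an ioasid. */
u32 ioasid_resolve_fault(struct iommu_fault *fault);

/*
 * The single per-device handler that /dev/ioasid itself would register with
 * the iommu core (via iommu_register_device_fault_handler(), shown earlier).
 */
static int ioasid_core_fault_handler(struct iommu_fault *fault, void *cookie)
{
	struct ioasid_fault_router *router = cookie;
	struct ioasid_pgtable_handler *h;
	u32 ioasid = ioasid_resolve_fault(fault);

	h = xa_load(&router->handlers, ioasid);
	if (!h)
		return -ENODEV;		/* no consumer for this page table */

	/*
	 * Dispatch to the per-page-table consumer: vIOMMU injection for a
	 * guest page table, on-demand mapping for the GPA space, etc.
	 */
	return h->handle(ioasid, fault, h->data);
}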
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/2 1:33, Jason Gunthorpe wrote: > On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote: > >> The drivers register per page table fault handlers to /dev/ioasid which >> will then register itself to iommu core to listen and route the per- >> device I/O page faults. > > I'm still confused why drivers need fault handlers at all? Essentially it is the userspace that needs the fault handlers: one case is to deliver the faults to the vIOMMU, and another is to enable IOPF on the GPA address space for on-demand paging. It seems that both could be specified in/through the IOASID_ALLOC ioctl? Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/4 2:19, Jacob Pan wrote: > Hi Shenming, > > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu > wrote: > >> On 2021/6/2 1:33, Jason Gunthorpe wrote: >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote: >>> >>>> The drivers register per page table fault handlers to /dev/ioasid which >>>> will then register itself to iommu core to listen and route the per- >>>> device I/O page faults. >>> >>> I'm still confused why drivers need fault handlers at all? >> >> Essentially it is the userspace that needs the fault handlers, >> one case is to deliver the faults to the vIOMMU, and another >> case is to enable IOPF on the GPA address space for on-demand >> paging, it seems that both could be specified in/through the >> IOASID_ALLOC ioctl? >> > I would think IOASID_BIND_PGTABLE is where fault handler should be > registered. There wouldn't be any IO page fault without the binding anyway. Yeah, I also proposed this before; registering the handler in the BIND_PGTABLE ioctl does make sense for the guest page faults. :-) But how about the page faults from the GPA address space (its page table is mapped through the MAP_DMA ioctl)? From your point of view, it seems that we should register the handler for the GPA address space in the (first) MAP_DMA ioctl. > > I also don't understand why device drivers should register the fault > handler, the fault is detected by the pIOMMU and injected to the vIOMMU. So > I think it should be the IOASID itself register the handler. Yeah, and it can also be said that the provider of the page table registers the handler (Baolu). Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] /dev/ioasid uAPI proposal
On 2021/6/7 20:19, Liu, Yi L wrote: >> From: Shenming Lu >> Sent: Friday, June 4, 2021 10:03 AM >> >> On 2021/6/4 2:19, Jacob Pan wrote: >>> Hi Shenming, >>> >>> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu >> >>> wrote: >>> >>>> On 2021/6/2 1:33, Jason Gunthorpe wrote: >>>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote: >>>>> >>>>>> The drivers register per page table fault handlers to /dev/ioasid which >>>>>> will then register itself to iommu core to listen and route the per- >>>>>> device I/O page faults. >>>>> >>>>> I'm still confused why drivers need fault handlers at all? >>>> >>>> Essentially it is the userspace that needs the fault handlers, >>>> one case is to deliver the faults to the vIOMMU, and another >>>> case is to enable IOPF on the GPA address space for on-demand >>>> paging, it seems that both could be specified in/through the >>>> IOASID_ALLOC ioctl? >>>> >>> I would think IOASID_BIND_PGTABLE is where fault handler should be >>> registered. There wouldn't be any IO page fault without the binding >> anyway. >> >> Yeah, I also proposed this before, registering the handler in the >> BIND_PGTABLE >> ioctl does make sense for the guest page faults. :-) >> >> But how about the page faults from the GPA address space (it's page table is >> mapped through the MAP_DMA ioctl)? From your point of view, it seems >> that we should register the handler for the GPA address space in the (first) >> MAP_DMA ioctl. > > under new proposal, I think the page fault handler is also registered > per ioasid object. The difference compared with guest page table case > is there is no need to inject the fault to VM. Yeah. And there are some issues specific to the GPA address space case which have been discussed with Alex.. Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/9 15:48, Tian, Kevin wrote: > 4.6. I/O page fault > +++ > > uAPI is TBD. Here is just about the high-level flow from host IOMMU driver > to guest IOMMU driver and backwards. This flow assumes that I/O page faults > are reported via IOMMU interrupts. Some devices report faults via device > specific way instead of going through the IOMMU. That usage is not covered > here: > > - Host IOMMU driver receives a I/O page fault with raw fault_data {rid, > pasid, addr}; > > - Host IOMMU driver identifies the faulting I/O page table according to > {rid, pasid} and calls the corresponding fault handler with an opaque > object (registered by the handler) and raw fault_data (rid, pasid, addr); > > - IOASID fault handler identifies the corresponding ioasid and device > cookie according to the opaque object, generates an user fault_data > (ioasid, cookie, addr) in the fault region, and triggers eventfd to > userspace; > Hi, I have some doubts here: For mdev, it seems that the rid in the raw fault_data is the parent device's; then in the vSVA scenario, how can we get to know the mdev (cookie) from the rid and pasid? And from this point of view, would it be better to register the mdev (iommu_register_device()) with the parent device info? Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
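The quoted flow ends with a user-visible fault_data (ioasid, cookie, addr) written into a fault region plus an eventfd signal. A sketch of that last step is below; the record layout and ring bookkeeping are assumptions, since the RFC explicitly leaves the uAPI TBD, while eventfd_signal() is the real kernel primitive.

#include <linux/errno.h>
#include <linux/eventfd.h>
#include <linux/types.h>

struct iommu_user_fault_data {		/* hypothetical uAPI record */
	__u32 ioasid;
	__u32 flags;
	__u64 cookie;			/* identifies the device to userspace */
	__u64 addr;			/* faulting IOVA */
};

struct ioasid_fault_region {		/* hypothetical, shared with userspace */
	struct iommu_user_fault_data *ring;
	u32 head, tail, size;
	struct eventfd_ctx *trigger;	/* userspace-provided eventfd */
};

static int ioasid_report_user_fault(struct ioasid_fault_region *region,
				    u32 ioasid, u64 cookie, u64 addr)
{
	u32 next = (region->head + 1) % region->size;

	if (next == region->tail)
		return -ENOSPC;		/* ring full: auto-respond with failure */

	region->ring[region->head] = (struct iommu_user_fault_data) {
		.ioasid = ioasid,
		.cookie = cookie,
		.addr	= addr,
	};
	region->head = next;

	eventfd_signal(region->trigger, 1);	/* wake the userspace handler */
	return 0;
}

Userspace would then read the record, fix up the mapping or inject the fault into the vIOMMU, and complete it through the (also TBD) response ioctl.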
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/15 11:55, Tian, Kevin wrote: >> From: Shenming Lu >> Sent: Thursday, July 15, 2021 11:21 AM >> >> On 2021/7/9 15:48, Tian, Kevin wrote: >>> 4.6. I/O page fault >>> +++ >>> >>> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver >>> to guest IOMMU driver and backwards. This flow assumes that I/O page >> faults >>> are reported via IOMMU interrupts. Some devices report faults via device >>> specific way instead of going through the IOMMU. That usage is not >> covered >>> here: >>> >>> - Host IOMMU driver receives a I/O page fault with raw fault_data {rid, >>> pasid, addr}; >>> >>> - Host IOMMU driver identifies the faulting I/O page table according to >>> {rid, pasid} and calls the corresponding fault handler with an opaque >>> object (registered by the handler) and raw fault_data (rid, pasid, >>> addr); >>> >>> - IOASID fault handler identifies the corresponding ioasid and device >>> cookie according to the opaque object, generates an user fault_data >>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to >>> userspace; >>> >> >> Hi, I have some doubts here: >> >> For mdev, it seems that the rid in the raw fault_data is the parent device's, >> then in the vSVA scenario, how can we get to know the mdev(cookie) from >> the >> rid and pasid? >> >> And from this point of view,would it be better to register the mdev >> (iommu_register_device()) with the parent device info? >> > > This is what is proposed in this RFC. A successful binding generates a new > iommu_dev object for each vfio device. For mdev this object includes > its parent device, the defPASID marking this mdev, and the cookie > representing it in userspace. Later it is iommu_dev being recorded in > the attaching_data when the mdev is attached to an IOASID: > > struct iommu_attach_data *__iommu_device_attach( > struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags); > > Then when a fault is reported, the fault handler just needs to figure out > iommu_dev according to {rid, pasid} in the raw fault data. > Yeah, we have the defPASID that marks the mdev and refers to the default I/O address space, but how about the non-default I/O address spaces? Is there a case that two different mdevs (on the same parent device) are used by the same process in the guest, thus have a same pasid route in the physical IOMMU? It seems that we can't figure out the mdev from the rid and pasid in this case... Did I misunderstand something?... :-) Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
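To make the last step concrete ("the fault handler just needs to figure out iommu_dev according to {rid, pasid} in the raw fault data"), a lookup over the attach data could look like the sketch below. The structure names follow the RFC's pseudo-code; the fields and the list walk are otherwise assumptions.

#include <linux/device.h>
#include <linux/list.h>
#include <linux/types.h>

struct iommu_dev {			/* per-binding object from the RFC */
	struct device	*dev;		/* parent device for an mdev */
	u32		 def_pasid;	/* defPASID marking the mdev, if any */
	u64		 user_cookie;	/* reported back to userspace */
};

struct iommu_attach_data {		/* recorded by __iommu_device_attach() */
	struct list_head node;
	struct iommu_dev *idev;
	u32		  ioasid;
	u32		  pasid;	/* PASID occupied by this attach, if any */
};

static struct iommu_dev *
iommu_fault_to_dev(struct list_head *attachments, struct device *rid_dev,
		   u32 pasid)
{
	struct iommu_attach_data *a;

	list_for_each_entry(a, attachments, node) {
		if (a->idev->dev != rid_dev)
			continue;
		/*
		 * For mdevs sharing a parent, the PASID (defPASID or the
		 * guest-bound PASID recorded at attach time) is what tells
		 * them apart; as the discussion here notes, two mdevs bound
		 * to the same guest process become indistinguishable.
		 */
		if (a->pasid == pasid || a->idev->def_pasid == pasid)
			return a->idev;
	}
	return NULL;
}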
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/15 14:49, Tian, Kevin wrote: >> From: Shenming Lu >> Sent: Thursday, July 15, 2021 2:29 PM >> >> On 2021/7/15 11:55, Tian, Kevin wrote: >>>> From: Shenming Lu >>>> Sent: Thursday, July 15, 2021 11:21 AM >>>> >>>> On 2021/7/9 15:48, Tian, Kevin wrote: >>>>> 4.6. I/O page fault >>>>> +++ >>>>> >>>>> uAPI is TBD. Here is just about the high-level flow from host IOMMU >> driver >>>>> to guest IOMMU driver and backwards. This flow assumes that I/O page >>>> faults >>>>> are reported via IOMMU interrupts. Some devices report faults via >> device >>>>> specific way instead of going through the IOMMU. That usage is not >>>> covered >>>>> here: >>>>> >>>>> - Host IOMMU driver receives a I/O page fault with raw fault_data {rid, >>>>> pasid, addr}; >>>>> >>>>> - Host IOMMU driver identifies the faulting I/O page table according to >>>>> {rid, pasid} and calls the corresponding fault handler with an opaque >>>>> object (registered by the handler) and raw fault_data (rid, pasid, >>>>> addr); >>>>> >>>>> - IOASID fault handler identifies the corresponding ioasid and device >>>>> cookie according to the opaque object, generates an user fault_data >>>>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to >>>>> userspace; >>>>> >>>> >>>> Hi, I have some doubts here: >>>> >>>> For mdev, it seems that the rid in the raw fault_data is the parent >>>> device's, >>>> then in the vSVA scenario, how can we get to know the mdev(cookie) from >>>> the >>>> rid and pasid? >>>> >>>> And from this point of view,would it be better to register the mdev >>>> (iommu_register_device()) with the parent device info? >>>> >>> >>> This is what is proposed in this RFC. A successful binding generates a new >>> iommu_dev object for each vfio device. For mdev this object includes >>> its parent device, the defPASID marking this mdev, and the cookie >>> representing it in userspace. Later it is iommu_dev being recorded in >>> the attaching_data when the mdev is attached to an IOASID: >>> >>> struct iommu_attach_data *__iommu_device_attach( >>> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags); >>> >>> Then when a fault is reported, the fault handler just needs to figure out >>> iommu_dev according to {rid, pasid} in the raw fault data. >>> >> >> Yeah, we have the defPASID that marks the mdev and refers to the default >> I/O address space, but how about the non-default I/O address spaces? >> Is there a case that two different mdevs (on the same parent device) >> are used by the same process in the guest, thus have a same pasid route >> in the physical IOMMU? It seems that we can't figure out the mdev from >> the rid and pasid in this case... >> >> Did I misunderstand something?... :-) >> > > No. You are right on this case. I don't think there is a way to > differentiate one mdev from the other if they come from the > same parent and attached by the same guest process. In this > case the fault could be reported on either mdev (e.g. the first > matching one) to get it fixed in the guest. > OK. Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/16 9:20, Tian, Kevin wrote: > To summarize, for vIOMMU we can work with the spec owner to > define a proper interface to feedback such restriction into the guest > if necessary. For the kernel part, it's clear that IOMMU fd should > disallow two devices attached to a single [RID] or [RID, PASID] slot > in the first place. > > Then the next question is how to communicate such restriction > to the userspace. It sounds like a group, but different in concept. > An iommu group describes the minimal isolation boundary thus all > devices in the group can be only assigned to a single user. But this > case is opposite - the two mdevs (both support ENQCMD submission) > with the same parent have problem when assigned to a single VM > (in this case vPASID is vm-wide translated thus a same pPASID will be > used cross both mdevs) while they instead work pretty well when > assigned to different VMs (completely different vPASID spaces thus > different pPASIDs). > > One thought is to have vfio device driver deal with it. In this proposal > it is the vfio device driver to define the PASID virtualization policy and > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands > the restriction thus could just hide the vPASID capability when the user > calls GET_INFO on the 2nd mdev in above scenario. In this way the > user even doesn't need to know such restriction at all and both mdevs > can be assigned to a single VM w/o any problem. > The restriction probably only happens when two mdevs are assigned to one VM, so how could the vfio device driver get to know this info to accurately hide the vPASID capability for the 2nd mdev at VFIO_DEVICE_GET_INFO time? There is no need to do this in other cases. Thanks, Shenming ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu