Re: [PATCH v12 06/10] iommu: Add a page fault handler

2021-02-01 Thread Shenming Lu
Hi Jean,

It seems that the preprocessing of the page faults (groups) here is fairly
generic. If a device driver wants to reuse it while providing its own
iopf_handle_single(), is there any chance of supporting that? :-)

Thanks,
Shenming

On 2021/1/27 23:43, Jean-Philippe Brucker wrote:
> Some systems allow devices to handle I/O Page Faults in the core mm. For
> example systems implementing the PCIe PRI extension or Arm SMMU stall
> model. Infrastructure for reporting these recoverable page faults was
> added to the IOMMU core by commit 0c830e6b3282 ("iommu: Introduce device
> fault report API"). Add a page fault handler for host SVA.
> 
> The IOMMU driver can now instantiate several fault workqueues and link them
> to IOPF-capable devices. Drivers can choose between a single global
> workqueue, one per IOMMU device, one per low-level fault queue, one per
> domain, etc.
> 
> When it receives a fault event, most commonly in an IRQ handler, the
> IOMMU driver reports the fault using iommu_report_device_fault(), which
> calls the registered handler. The page fault handler then calls the mm
> fault handler, and reports either success or failure with
> iommu_page_response(). After the handler succeeds, the hardware retries
> the access.
> 
> The iopf_param pointer could be embedded into iommu_fault_param. But
> putting iopf_param into the iommu_param structure allows us not to care
> about ordering between calls to iopf_queue_add_device() and
> iommu_register_device_fault_handler().
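
As a rough sketch of the flow described above (for illustration only, not part
of the quoted patch; field usage follows include/linux/iommu.h of that era,
error handling is omitted and the helper name is made up):

static void example_report_prq(struct device *dev, u32 pasid, u32 grpid,
			       u64 addr)
{
	struct iommu_fault_event evt = {
		.fault = {
			.type = IOMMU_FAULT_PAGE_REQ,
			.prm = {
				.flags = IOMMU_FAULT_PAGE_REQUEST_PASID_VALID |
					 IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
				.pasid = pasid,
				.grpid = grpid,
				.perm  = IOMMU_FAULT_PERM_READ,
				.addr  = addr,
			},
		},
	};

	/*
	 * Invokes the handler registered with
	 * iommu_register_device_fault_handler() -- iommu_queue_iopf() for
	 * host SVA -- which queues the fault; the fault workqueue later
	 * calls the mm fault handler and completes the group with
	 * iommu_page_response().
	 */
	iommu_report_device_fault(dev, &evt);
}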
> 
> Reviewed-by: Jonathan Cameron 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/iommu-sva-lib.h |  53 
>  include/linux/iommu.h |   2 +
>  drivers/iommu/io-pgfault.c| 461 ++
>  4 files changed, 517 insertions(+)
>  create mode 100644 drivers/iommu/io-pgfault.c
> 
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 61bd30cd8369..60fafc23dee6 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -28,3 +28,4 @@ obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
>  obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
>  obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
>  obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o
> +obj-$(CONFIG_IOMMU_SVA_LIB) += io-pgfault.o
> diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h
> index b40990aef3fd..031155010ca8 100644
> --- a/drivers/iommu/iommu-sva-lib.h
> +++ b/drivers/iommu/iommu-sva-lib.h
> @@ -12,4 +12,57 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t 
> min, ioasid_t max);
>  void iommu_sva_free_pasid(struct mm_struct *mm);
>  struct mm_struct *iommu_sva_find(ioasid_t pasid);
>  
> +/* I/O Page fault */
> +struct device;
> +struct iommu_fault;
> +struct iopf_queue;
> +
> +#ifdef CONFIG_IOMMU_SVA_LIB
> +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie);
> +
> +int iopf_queue_add_device(struct iopf_queue *queue, struct device *dev);
> +int iopf_queue_remove_device(struct iopf_queue *queue,
> +  struct device *dev);
> +int iopf_queue_flush_dev(struct device *dev);
> +struct iopf_queue *iopf_queue_alloc(const char *name);
> +void iopf_queue_free(struct iopf_queue *queue);
> +int iopf_queue_discard_partial(struct iopf_queue *queue);
> +
> +#else /* CONFIG_IOMMU_SVA_LIB */
> +static inline int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> +{
> + return -ENODEV;
> +}
> +
> +static inline int iopf_queue_add_device(struct iopf_queue *queue,
> + struct device *dev)
> +{
> + return -ENODEV;
> +}
> +
> +static inline int iopf_queue_remove_device(struct iopf_queue *queue,
> +struct device *dev)
> +{
> + return -ENODEV;
> +}
> +
> +static inline int iopf_queue_flush_dev(struct device *dev)
> +{
> + return -ENODEV;
> +}
> +
> +static inline struct iopf_queue *iopf_queue_alloc(const char *name)
> +{
> + return NULL;
> +}
> +
> +static inline void iopf_queue_free(struct iopf_queue *queue)
> +{
> +}
> +
> +static inline int iopf_queue_discard_partial(struct iopf_queue *queue)
> +{
> + return -ENODEV;
> +}
> +#endif /* CONFIG_IOMMU_SVA_LIB */
>  #endif /* _IOMMU_SVA_LIB_H */
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 00348e4c3c26..edc9be443a74 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -366,6 +366,7 @@ struct iommu_fault_param {
>   * struct dev_iommu - Collection of per-device IOMMU data
>   *
>   * @fault_param: IOMMU detected device fault reporting data
> + * @iopf_param:   I/O Page Fault queue and data
>   * @fwspec:   IOMMU fwspec data
>   * @iommu_dev:IOMMU device this device is linked to
>   * @priv: IOMMU Driver private data
> @@ -376,6 +377,7 @@ struct iommu_fault_param {
>  struct dev_iommu {
>   struct mutex lock;
>   struct iommu_fault_param*fault_param;
> + struct iopf_device_param*iopf_param;
>   s

Re: [PATCH v11 04/13] vfio/pci: Add VFIO_REGION_TYPE_NESTED region type

2021-02-23 Thread Shenming Lu
> +static int vfio_pci_dma_fault_init(struct vfio_pci_device *vdev)
> +{
> + struct vfio_region_dma_fault *header;
> + struct iommu_domain *domain;
> + size_t size;
> + bool nested;
> + int ret;
> +
> + domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> + ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested);
> + if (ret || !nested)
> + return ret;

Hi Eric,

It seems that the type of 'nested' should be int; using a bool here might
trigger a panic in arm_smmu_domain_get_attr().

Thanks,
Shenming


[RFC PATCH v2 0/6] Add IOPF support for VFIO passthrough

2021-03-08 Thread Shenming Lu
Hi,

The static pinning and mapping problem in VFIO and possible solutions
have been discussed a lot [1, 2]. One of the solutions is to add I/O
page fault support for VFIO devices. Different from those relatively
complicated software approaches such as presenting a vIOMMU that provides
the DMA buffer information (might include para-virtualized optimizations),
IOPF mainly depends on the hardware faulting capability, such as the PCIe
PRI extension or Arm SMMU stall model. What's more, the IOPF support in
the IOMMU driver is being implemented in SVA [3]. So we add IOPF support
for VFIO passthrough based on the IOPF part of SVA in this series.

We have measured its performance with UADK [4] (passthrough an accelerator
to a VM) on Hisilicon Kunpeng920 board:

Run hisi_sec_test...
 - with varying message lengths and sending times
 - with/without stage 2 IOPF enabled

when msg_len = 1MB and PREMAP_LEN (in patch 3) = 1:
                       speed (KB/s)
 times   w/o IOPF    with IOPF (num of faults)   relative speed
 1       325596      119152  (518)               36.6%
 100     7524985     5804659 (1058)              77.1%
 1000    8661817     8440209 (1071)              97.4%
 5000    8804512     8724368 (1216)              99.1%

If we use the same region to send messages, page faults occur almost only on
first access, so the more times we send, the smaller the performance loss.

when msg_len = 10MB and PREMAP_LEN = 512:
                       speed (KB/s)
 times   w/o IOPF    with IOPF (num of faults)   relative speed
 1       1012758     682257  (13)                67.4%
 100     8680688     8374154 (26)                96.5%
 1000    8860861     8719918 (26)                98.4%

We see that pre-mapping can help.

And we also measured the performance of host SVA with the same params:

when msg_len = 1MB:
                       speed (KB/s)
 times   w/o IOPF    with IOPF (num of faults)   relative speed
 1       951672      163866  (512)               17.2%
 100     8691961     4529971 (1024)              52.1%
 1000    9158721     8376346 (1024)              91.5%
 5000    9184532     9008739 (1024)              98.1%

Besides, the average time spent in vfio_iommu_dev_fault_handler() (in patch 3)
is a little less than that in iopf_handle_group() (in SVA): 1.6 us vs 2.0 us.

History:

v1 -> v2
 - Numerous improvements following the suggestions. Thanks a lot to all
   of you.

Yet TODO:
 - Add support for PRI.
 - Consider selective-faulting. (suggested by Kevin)
 ...

Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] 
https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210302092644.2553014-1-jean-phili...@linaro.org/
[4] https://github.com/Linaro/uadk

Any comments and suggestions are very welcome. :-)

Thanks,
Shenming


Shenming Lu (6):
  iommu: Evolve to support more scenarios of using IOPF
  vfio: Add an MMU notifier to avoid pinning
  vfio: Add a page fault handler
  vfio: VFIO_IOMMU_ENABLE_IOPF
  vfio: No need to statically pin and map if IOPF enabled
  vfio: Add nested IOPF support

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |   3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  11 +-
 drivers/iommu/io-pgfault.c|   4 -
 drivers/iommu/iommu.c |  56 ++-
 drivers/vfio/vfio.c   | 118 +
 drivers/vfio/vfio_iommu_type1.c   | 446 +-
 include/linux/iommu.h |  21 +-
 include/linux/vfio.h  |  14 +
 include/uapi/linux/iommu.h|   3 +
 include/uapi/linux/vfio.h |   6 +
 10 files changed, 661 insertions(+), 21 deletions(-)

-- 
2.19.1



[RFC PATCH v2 1/6] iommu: Evolve to support more scenarios of using IOPF

2021-03-08 Thread Shenming Lu
This patch follows the discussion here:

https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/

In order to support more scenarios of using IOPF (mainly the nested
extension), besides keeping IOMMU_DEV_FEAT_IOPF as a general capability
that indicates whether faults can be delivered through the IOMMU, we
extend iommu_register_device_fault_handler() with flags and introduce
IOPF_REPORT_FLAT and IOPF_REPORT_NESTED to describe the page fault
reporting capability under a specific configuration. IOPF_REPORT_NESTED
needs additional info to indicate which level/stage is concerned, since
the fault client may be interested in only one level.

Signed-off-by: Shenming Lu 
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 11 ++--
 drivers/iommu/io-pgfault.c|  4 --
 drivers/iommu/iommu.c | 56 ++-
 include/linux/iommu.h | 21 ++-
 include/uapi/linux/iommu.h|  3 +
 6 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index ee66d1f4cb81..5de9432349d4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct 
arm_smmu_master *master)
if (ret)
return ret;
 
-   ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
+   ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
+ IOPF_REPORT_FLAT, dev);
if (ret) {
iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 363744df8d51..f40529d0075d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1447,10 +1447,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
*smmu, u64 *evt)
return -EOPNOTSUPP;
}
 
-   /* Stage-2 is always pinned at the moment */
-   if (evt[1] & EVTQ_1_S2)
-   return -EFAULT;
-
if (evt[1] & EVTQ_1_RnW)
perm |= IOMMU_FAULT_PERM_READ;
else
@@ -1468,13 +1464,18 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
*smmu, u64 *evt)
.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
.perm = perm,
-   .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
};
 
if (ssid_valid) {
flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
}
+
+   if (evt[1] & EVTQ_1_S2) {
+   flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2;
+   flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+   } else
+   flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
} else {
flt->type = IOMMU_FAULT_DMA_UNRECOV;
flt->event = (struct iommu_fault_unrecoverable) {
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index 1df8c1dcae77..abf16e06bcf5 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -195,10 +195,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void 
*cookie)
 
lockdep_assert_held(¶m->lock);
 
-   if (fault->type != IOMMU_FAULT_PAGE_REQ)
-   /* Not a recoverable page fault */
-   return -EOPNOTSUPP;
-
/*
 * As long as we're holding param->lock, the queue can't be unlinked
 * from the device and therefore cannot disappear.
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..cb1d93b00f7d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group 
*group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+/*
+ * iommu_update_device_fault_handler - Update the device fault handler via flags
+ * @dev: the device
+ * @mask: mask of bits to keep (bits not set in @mask are cleared)
+ * @set: bits to set
+ *
+ * Update the device fault handler installed by
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_update_device_fault_handler(struct device *dev, u32 mask, u32 set)
+{
+   struct dev_iommu *param = dev->iommu;
+   int ret = 0;
+
+   if (!param)
+   return -EINVAL;
+
+   mutex_lock(¶m->lock);
+
+   if (param->fault_param) {
+   ret = -EINVAL;
+   goto out_unlock;
+  

[RFC PATCH v2 2/6] vfio: Add an MMU notifier to avoid pinning

2021-03-08 Thread Shenming Lu
To avoid pinning pages while they are mapped in the IOMMU page tables,
we add an MMU notifier which tells us about addresses that are no longer
valid so that we can unmap them.
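
A minimal sketch of how such a notifier could be hooked up (the actual
registration site of this series is not shown in the excerpt below, so the
helper here is only an illustration of the mmu_notifier API):

static int example_register_vfio_mn(struct vfio_iommu *iommu,
				    struct mm_struct *mm)
{
	iommu->mn.ops = &vfio_iommu_type1_mn_ops;

	/*
	 * After this, mn_invalidate_range() is called whenever a range of
	 * this mm is invalidated, and the corresponding IOVAs are unmapped.
	 */
	return mmu_notifier_register(&iommu->mn, mm);
}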

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 68 +
 1 file changed, 68 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4bb162c1d649..03ccc11057af 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -69,6 +70,7 @@ struct vfio_iommu {
struct mutexlock;
struct rb_root  dma_list;
struct blocking_notifier_head notifier;
+   struct mmu_notifier mn;
unsigned intdma_avail;
unsigned intvaddr_invalid_count;
uint64_tpgsize_bitmap;
@@ -101,6 +103,7 @@ struct vfio_dma {
struct task_struct  *task;
struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
unsigned long   *bitmap;
+   unsigned long   *iopf_mapped_bitmap;
 };
 
 struct vfio_batch {
@@ -157,6 +160,10 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX   DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+#define IOPF_MAPPED_BITMAP_GET(dma, i) \
+	((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]		\
+	  >> ((i) % BITS_PER_LONG)) & 0x1)
+
 #define WAITED 1
 
 static int put_pfn(unsigned long pfn, int prot);
@@ -1149,6 +1156,35 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
return unlocked;
 }
 
+static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
+   struct vfio_dma *dma,
+   dma_addr_t start, dma_addr_t end)
+{
+   unsigned long bit_offset;
+   size_t len;
+   struct vfio_domain *d;
+
+   while (start < end) {
+   bit_offset = (start - dma->iova) >> PAGE_SHIFT;
+
+   for (len = 0; start + len < end; len += PAGE_SIZE) {
+   if (!IOPF_MAPPED_BITMAP_GET(dma,
+   bit_offset + (len >> PAGE_SHIFT)))
+   break;
+   }
+
+   if (len) {
+   list_for_each_entry(d, &iommu->domain_list, next)
+   iommu_unmap(d->domain, start, len);
+
+   bitmap_clear(dma->iopf_mapped_bitmap,
+bit_offset, len >> PAGE_SHIFT);
+   }
+
+   start += (len + PAGE_SIZE);
+   }
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
@@ -3096,6 +3132,38 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
return -EINVAL;
 }
 
+static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+   unsigned long start, unsigned long end)
+{
+   struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
+   struct rb_node *n;
+
+   mutex_lock(&iommu->lock);
+
+   for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+   struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+   unsigned long start_n, end_n;
+
+   if (end <= dma->vaddr || start >= dma->vaddr + dma->size)
+   continue;
+
+   start_n = ALIGN_DOWN(max_t(unsigned long, start, dma->vaddr),
+PAGE_SIZE);
+   end_n = ALIGN(min_t(unsigned long, end, dma->vaddr + dma->size),
+ PAGE_SIZE);
+
+   vfio_unmap_partial_iopf(iommu, dma,
+   start_n - dma->vaddr + dma->iova,
+   end_n - dma->vaddr + dma->iova);
+   }
+
+   mutex_unlock(&iommu->lock);
+}
+
+static const struct mmu_notifier_ops vfio_iommu_type1_mn_ops = {
+   .invalidate_range   = mn_invalidate_range,
+};
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
   unsigned int cmd, unsigned long arg)
 {
-- 
2.19.1



[RFC PATCH v2 3/6] vfio: Add a page fault handler

2021-03-08 Thread Shenming Lu
VFIO manages the DMA mapping itself. To support IOPF for VFIO
devices, we add a VFIO page fault handler to serve the page faults
reported by the IOMMU driver. Besides, to reduce the number of page
fault handlings, we can pre-map more pages than requested at once.
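
Once the handler has (or has not) mapped the faulting page, it completes the
fault group with iommu_page_response(). A generic sketch of that completion
(not the exact code of this patch, which is truncated below; it mirrors what
the SVA fault queue does):

static void example_complete_iopf(struct device *dev,
				  struct iommu_fault *fault,
				  enum iommu_page_response_code code)
{
	struct iommu_page_response resp = {
		.version = IOMMU_PAGE_RESP_VERSION_1,
		.pasid	 = fault->prm.pasid,
		.grpid	 = fault->prm.grpid,
		.code	 = code,	/* e.g. IOMMU_PAGE_RESP_SUCCESS */
	};

	if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)
		resp.flags = IOMMU_PAGE_RESP_PASID_VALID;

	/* On success the hardware retries the faulted access. */
	iommu_page_response(dev, &resp);
}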

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio.c |  35 +++
 drivers/vfio/vfio_iommu_type1.c | 167 
 include/linux/vfio.h|   5 +
 3 files changed, 207 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..77b29bbd3027 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2354,6 +2354,41 @@ struct iommu_domain *vfio_group_iommu_domain(struct 
vfio_group *group)
 }
 EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
 
+int vfio_iommu_dev_fault_handler(struct iommu_fault *fault, void *data)
+{
+   struct device *dev = (struct device *)data;
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (!group)
+   return -ENODEV;
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto out;
+
+   container = group->container;
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->dma_map_iopf))
+   ret = driver->ops->dma_map_iopf(container->iommu_data,
+   fault, dev);
+   else
+   ret = -ENOTTY;
+
+   vfio_group_try_dissolve_container(group);
+
+out:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 03ccc11057af..167d52c1468b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3330,6 +3330,172 @@ static void vfio_iommu_type1_notify(void *iommu_data,
wake_up_all(&iommu->vaddr_wait);
 }
 
+/*
+ * To optimize for fewer page fault handlings, try to
+ * pre-map more pages than requested.
+ */
+#define IOPF_PREMAP_LEN		512
+
+static void unpin_pages_iopf(struct vfio_dma *dma,
+unsigned long pfn, unsigned long npages)
+{
+   while (npages--)
+   put_pfn(pfn++, dma->prot);
+}
+
+/*
+ * Return 0 on success or a negative error code, the
+ * number of pages contiguously pinned is in @*pinned.
+ */
+static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr,
+ unsigned long npages, unsigned long *pfn_base,
+ unsigned long *pinned, struct vfio_batch *batch)
+{
+   struct mm_struct *mm;
+   unsigned long pfn;
+   int ret = 0;
+   *pinned = 0;
+
+   mm = get_task_mm(dma->task);
+   if (!mm)
+   return -ENODEV;
+
+   if (batch->size) {
+   *pfn_base = page_to_pfn(batch->pages[batch->offset]);
+   pfn = *pfn_base;
+   } else
+   *pfn_base = 0;
+
+   while (npages) {
+   if (!batch->size) {
+   unsigned long req_pages = min_t(unsigned long, npages,
+   batch->capacity);
+
+   ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
+&pfn, batch->pages);
+   if (ret < 0)
+   goto out;
+
+   batch->size = ret;
+   batch->offset = 0;
+   ret = 0;
+
+   if (!*pfn_base)
+   *pfn_base = pfn;
+   }
+
+   while (true) {
+   if (pfn != *pfn_base + *pinned)
+   goto out;
+
+   (*pinned)++;
+   npages--;
+   vaddr += PAGE_SIZE;
+   batch->offset++;
+   batch->size--;
+
+   if (!batch->size)
+   break;
+
+   pfn = page_to_pfn(batch->pages[batch->offset]);
+   }
+
+   if (unlikely(disable_hugepages))
+   break;
+   }
+
+out:
+   mmput(mm);
+   return ret;
+}
+
+static int vfio_iommu_type1_dma_map_iopf(void *iommu_data,
+struct iommu_fault *fault,
+struct device *dev)
+{
+   struct vfio_iommu *iommu = iommu_data;
+   dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
+   struct vfio_dma *dma;
+   int access_flags = 0;
+   size_t premap_len, ma

[RFC PATCH v2 4/6] vfio: VFIO_IOMMU_ENABLE_IOPF

2021-03-08 Thread Shenming Lu
Since enabling IOPF for devices may lead to a slow ramp up of
performance, we add a VFIO_IOMMU_ENABLE_IOPF ioctl to make it
configurable. Enabling IOPF for a VFIO device consists of setting
IOMMU_DEV_FEAT_IOPF and registering the VFIO page fault handler.
Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may
be in-flight page faults when disabling.
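
From userspace the intended usage would roughly be the following (a sketch
only; the exact calling convention of the new ioctl is defined by the uapi
change in this patch, which is not shown here, so the argument-less form is
an assumption):

#include <sys/ioctl.h>
#include <stdio.h>
#include <linux/vfio.h>

/* Enable IOPF on an already set-up VFIO container. */
static int enable_container_iopf(int container_fd)
{
	if (ioctl(container_fd, VFIO_IOMMU_ENABLE_IOPF)) {
		perror("VFIO_IOMMU_ENABLE_IOPF");
		return -1;
	}
	return 0;	/* no DISABLE counterpart exists in this series */
}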

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 139 +++-
 include/uapi/linux/vfio.h   |   6 ++
 2 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 167d52c1468b..3997473be4a7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -71,6 +71,7 @@ struct vfio_iommu {
struct rb_root  dma_list;
struct blocking_notifier_head notifier;
struct mmu_notifier mn;
+   struct mm_struct*mm;
unsigned intdma_avail;
unsigned intvaddr_invalid_count;
uint64_tpgsize_bitmap;
@@ -81,6 +82,7 @@ struct vfio_iommu {
booldirty_page_tracking;
boolpinned_page_dirty_scope;
boolcontainer_open;
+   booliopf_enabled;
 };
 
 struct vfio_domain {
@@ -2278,6 +2280,62 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
 }
 
+static int dev_disable_iopf(struct device *dev, void *data)
+{
+   int *enabled_dev_cnt = data;
+
+   if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
+   return -1;
+
+   iommu_unregister_device_fault_handler(dev);
+   iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+
+   if (enabled_dev_cnt)
+   (*enabled_dev_cnt)--;
+
+   return 0;
+}
+
+static int dev_enable_iopf(struct device *dev, void *data)
+{
+   int *enabled_dev_cnt = data;
+   struct iommu_domain *domain;
+   int nested;
+   u32 flags;
+   int ret;
+
+   ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+   if (ret)
+   return ret;
+
+   domain = iommu_get_domain_for_dev(dev);
+   if (!domain) {
+   ret = -ENODEV;
+   goto out_disable;
+   }
+
+   ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested);
+   if (ret)
+   goto out_disable;
+
+   if (nested)
+   flags = IOPF_REPORT_NESTED | IOPF_REPORT_NESTED_L2_CONCERNED;
+   else
+   flags = IOPF_REPORT_FLAT;
+
+   ret = iommu_register_device_fault_handler(dev,
+   vfio_iommu_dev_fault_handler, flags, dev);
+   if (ret)
+   goto out_disable;
+
+   (*enabled_dev_cnt)++;
+   return 0;
+
+out_disable:
+   iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+   return ret;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 struct iommu_group *iommu_group)
 {
@@ -2291,6 +2349,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_domain_geometry geo;
LIST_HEAD(iova_copy);
LIST_HEAD(group_resv_regions);
+   int iopf_enabled_dev_cnt = 0;
 
mutex_lock(&iommu->lock);
 
@@ -2368,6 +2427,13 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
if (ret)
goto out_domain;
 
+   if (iommu->iopf_enabled) {
+   ret = iommu_group_for_each_dev(iommu_group, 
&iopf_enabled_dev_cnt,
+  dev_enable_iopf);
+   if (ret)
+   goto out_detach;
+   }
+
/* Get aperture info */
iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
 
@@ -2449,9 +2515,11 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
vfio_test_domain_fgsp(domain);
 
/* replay mappings on new domains */
-   ret = vfio_iommu_replay(iommu, domain);
-   if (ret)
-   goto out_detach;
+   if (!iommu->iopf_enabled) {
+   ret = vfio_iommu_replay(iommu, domain);
+   if (ret)
+   goto out_detach;
+   }
 
if (resv_msi) {
ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
@@ -2482,6 +2550,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
iommu_domain_free(domain->domain);
vfio_iommu_iova_free(&iova_copy);
vfio_iommu_resv_free(&group_resv_regions);
+   iommu_group_for_each_dev(iommu_group, &iopf_enabled_dev_cnt,
+dev_disable_iopf);
 out_free:
kfree(domain);
kfree(group);
@@ -2643,6 +2713,10 @@ static void vfio_iommu_type1_detach_group(void 
*iommu_data,
if (!group)

[RFC PATCH v2 5/6] vfio: No need to statically pin and map if IOPF enabled

2021-03-08 Thread Shenming Lu
If IOPF is enabled for the VFIO container, there is no need to
statically pin and map the entire DMA range; we can do it on
demand. And when removing a vfio_dma, unmap according to the
IOPF mapped bitmap.

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 35 -
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3997473be4a7..8d14ced649a6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -165,6 +165,7 @@ struct vfio_regions {
 #define IOPF_MAPPED_BITMAP_GET(dma, i) \
 	((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]		\
 	  >> ((i) % BITS_PER_LONG)) & 0x1)
+#define IOPF_MAPPED_BITMAP_BYTES(n)	DIRTY_BITMAP_BYTES(n)
 
 #define WAITED 1
 
@@ -877,7 +878,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 * already pinned and accounted. Accouting should be done if there is no
 * iommu capable domain in the container.
 */
-   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+   iommu->iopf_enabled;
 
for (i = 0; i < npage; i++) {
struct vfio_pfn *vpfn;
@@ -966,7 +968,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
mutex_lock(&iommu->lock);
 
-   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+   iommu->iopf_enabled;
for (i = 0; i < npage; i++) {
struct vfio_dma *dma;
dma_addr_t iova;
@@ -1087,7 +1090,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
if (!dma->size)
return 0;
 
-   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
return 0;
 
/*
@@ -1187,11 +1190,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu 
*iommu,
}
 }
 
+static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+   vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
+
+   kfree(dma->iopf_mapped_bitmap);
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
vfio_unmap_unpin(iommu, dma, true);
vfio_unlink_dma(iommu, dma);
+   if (iommu->iopf_enabled)
+   vfio_dma_clean_iopf(iommu, dma);
put_task_struct(dma->task);
vfio_dma_bitmap_free(dma);
if (dma->vaddr_invalid) {
@@ -1655,6 +1667,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
goto out_unlock;
}
 
+   if (iommu->iopf_enabled) {
+   dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
+   size >> PAGE_SHIFT), 
GFP_KERNEL);
+   if (!dma->iopf_mapped_bitmap) {
+   ret = -ENOMEM;
+   kfree(dma);
+   goto out_unlock;
+   }
+   }
+
iommu->dma_avail--;
dma->iova = iova;
dma->vaddr = vaddr;
@@ -1694,8 +1716,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
/* Insert zero-sized and grow as we map chunks of it */
vfio_link_dma(iommu, dma);
 
-   /* Don't pin and map if container doesn't contain IOMMU capable domain*/
-   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+   /*
+* Don't pin and map if container doesn't contain IOMMU capable domain,
+* or IOPF enabled for the container.
+*/
+   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
dma->size = size;
else
ret = vfio_pin_map_dma(iommu, dma, size);
-- 
2.19.1



[RFC PATCH v2 6/6] vfio: Add nested IOPF support

2021-03-08 Thread Shenming Lu
To set up nested mode, drivers such as vfio_pci have to register
a handler to receive stage/level 1 faults from the IOMMU. But since
currently each device can only have one iommu dev fault handler,
if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF), we
choose to update the registered (combined) handler via flags
(by setting IOPF_REPORT_NESTED_L1_CONCERNED), and to further deliver
the stage 1 faults received by that handler to the guest through a
newly added vfio_device_ops callback.

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio.c | 83 +
 drivers/vfio/vfio_iommu_type1.c | 37 +++
 include/linux/vfio.h|  9 
 3 files changed, 129 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 77b29bbd3027..c6a01d947d0d 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2389,6 +2389,89 @@ int vfio_iommu_dev_fault_handler(struct iommu_fault 
*fault, void *data)
 }
 EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler);
 
+int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (!group)
+   return -ENODEV;
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto out;
+
+   container = group->container;
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->unregister_hdlr_nested))
+   ret = driver->ops->unregister_hdlr_nested(container->iommu_data,
+ dev);
+   else
+   ret = -ENOTTY;
+
+   vfio_group_try_dissolve_container(group);
+
+out:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
+
+/*
+ * Register/Update the VFIO page fault handler
+ * to receive nested stage/level 1 faults.
+ */
+int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (!group)
+   return -ENODEV;
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto out;
+
+   container = group->container;
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->register_hdlr_nested))
+   ret = driver->ops->register_hdlr_nested(container->iommu_data,
+   dev);
+   else
+   ret = -ENOTTY;
+
+   vfio_group_try_dissolve_container(group);
+
+out:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
+
+int vfio_transfer_dev_fault(struct device *dev, struct iommu_fault *fault)
+{
+   struct vfio_device *device = dev_get_drvdata(dev);
+
+   if (unlikely(!device->ops->transfer))
+   return -EOPNOTSUPP;
+
+   return device->ops->transfer(device->device_data, fault);
+}
+EXPORT_SYMBOL_GPL(vfio_transfer_dev_fault);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 8d14ced649a6..62ad4a47de4a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3581,6 +3581,13 @@ static int vfio_iommu_type1_dma_map_iopf(void 
*iommu_data,
enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
struct iommu_page_response resp = {0};
 
+   /*
+* When configured in nested mode, further deliver
+* stage/level 1 faults to the guest.
+*/
+   if (iommu->nesting && !(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2))
+   return vfio_transfer_dev_fault(dev, fault);
+
mutex_lock(&iommu->lock);
 
dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
@@ -3654,6 +3661,34 @@ static int vfio_iommu_type1_dma_map_iopf(void 
*iommu_data,
return 0;
 }
 
+static int vfio_iommu_type1_register_hdlr_nested(void *iommu_data,
+struct device *dev)
+{
+   struct vfio_iommu *iommu = iommu_data;
+
+   if (iommu->iopf_enabled)
+   return iommu_update_device_fault_handler(dev, ~0,
+   IOPF_REPORT_NESTED_L1_CONCERNED);
+   else
+   return iommu_register_device_fault_handler(dev,
+   vfio_iommu_dev_fault_handler,
+  

Re: [RFC PATCH v2 1/6] iommu: Evolve to support more scenarios of using IOPF

2021-03-09 Thread Shenming Lu
Hi Baolu,

On 2021/3/10 10:09, Lu Baolu wrote:
> Hi Shenming,
> 
> On 3/9/21 2:22 PM, Shenming Lu wrote:
>> This patch follows the discussion here:
>>
>> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
>>
>> In order to support more scenarios of using IOPF (mainly consider
>> the nested extension), besides keeping IOMMU_DEV_FEAT_IOPF as a
>> general capability for whether delivering faults through the IOMMU,
>> we extend iommu_register_fault_handler() with flags and introduce
>> IOPF_REPORT_FLAT and IOPF_REPORT_NESTED to describe the page fault
>> reporting capability under a specific configuration.
>> IOPF_REPORT_NESTED needs additional info to indicate which level/stage
>> is concerned since the fault client may be interested in only one
>> level.
>>
>> Signed-off-by: Shenming Lu 
>> ---
>>   .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  3 +-
>>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 11 ++--
>>   drivers/iommu/io-pgfault.c    |  4 --
>>   drivers/iommu/iommu.c | 56 ++-
>>   include/linux/iommu.h | 21 ++-
>>   include/uapi/linux/iommu.h    |  3 +
>>   6 files changed, 85 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
>> index ee66d1f4cb81..5de9432349d4 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
>> @@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct 
>> arm_smmu_master *master)
>>   if (ret)
>>   return ret;
>>   -    ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
>> +    ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
>> +  IOPF_REPORT_FLAT, dev);
>>   if (ret) {
>>   iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
>>   return ret;
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 363744df8d51..f40529d0075d 100644
>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -1447,10 +1447,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
>> *smmu, u64 *evt)
>>   return -EOPNOTSUPP;
>>   }
>>   -    /* Stage-2 is always pinned at the moment */
>> -    if (evt[1] & EVTQ_1_S2)
>> -    return -EFAULT;
>> -
>>   if (evt[1] & EVTQ_1_RnW)
>>   perm |= IOMMU_FAULT_PERM_READ;
>>   else
>> @@ -1468,13 +1464,18 @@ static int arm_smmu_handle_evt(struct 
>> arm_smmu_device *smmu, u64 *evt)
>>   .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
>>   .grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
>>   .perm = perm,
>> -    .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
>>   };
>>     if (ssid_valid) {
>>   flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
>>   flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
>>   }
>> +
>> +    if (evt[1] & EVTQ_1_S2) {
>> +    flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2;
>> +    flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
>> +    } else
>> +    flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
>>   } else {
>>   flt->type = IOMMU_FAULT_DMA_UNRECOV;
>>   flt->event = (struct iommu_fault_unrecoverable) {
>> diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
>> index 1df8c1dcae77..abf16e06bcf5 100644
>> --- a/drivers/iommu/io-pgfault.c
>> +++ b/drivers/iommu/io-pgfault.c
>> @@ -195,10 +195,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void 
>> *cookie)
>>     lockdep_assert_held(¶m->lock);
>>   -    if (fault->type != IOMMU_FAULT_PAGE_REQ)
>> -    /* Not a recoverable page fault */
>> -    return -EOPNOTSUPP;
>> -
> 
> Any reasons why do you want to remove this check?

My thought was to make the reporting cap more detailed: IOPF_REPORT_ is only
for recoverable page faults (IOMMU_FAULT_PAGE_REQ), and we may add
UNRECOV_FAULT_REPORT_ later for unrecoverable faults (IOMMU_FAULT_DMA_UNRECOV)...

> 
>>   /*
>>    * As long as we're holding param->lock, the queue can't be unlinked
>>    * from

[RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling

2021-04-08 Thread Shenming Lu
To reduce the number of page fault handlings, we can pre-map more pages
than requested at once.

Note that IOPF_PREMAP_LEN is just an arbitrary value for now (512 pages,
i.e. 2 MB with 4 KB pages), which we could tune further.

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 131 ++--
 1 file changed, 123 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1cb9d1f2717b..01e296c6dc9e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3217,6 +3217,91 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
return -EINVAL;
 }
 
+/*
+ * To optimize for fewer page fault handlings, try to
+ * pre-map more pages than requested.
+ */
+#define IOPF_PREMAP_LEN		512
+
+/*
+ * Return 0 on success or a negative error code, the
+ * number of pages contiguously pinned is in @pinned.
+ */
+static int pin_pages_iopf(struct vfio_dma *dma, unsigned long vaddr,
+ unsigned long npages, unsigned long *pfn_base,
+ unsigned long *pinned, struct vfio_batch *batch)
+{
+   struct mm_struct *mm;
+   unsigned long pfn;
+   int ret = 0;
+   *pinned = 0;
+
+   mm = get_task_mm(dma->task);
+   if (!mm)
+   return -ENODEV;
+
+   if (batch->size) {
+   *pfn_base = page_to_pfn(batch->pages[batch->offset]);
+   pfn = *pfn_base;
+   } else {
+   *pfn_base = 0;
+   }
+
+   while (npages) {
+   if (!batch->size) {
+   unsigned long req_pages = min_t(unsigned long, npages,
+   batch->capacity);
+
+   ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
+&pfn, batch->pages);
+   if (ret < 0)
+   goto out;
+
+   batch->size = ret;
+   batch->offset = 0;
+   ret = 0;
+
+   if (!*pfn_base)
+   *pfn_base = pfn;
+   }
+
+   while (true) {
+   if (pfn != *pfn_base + *pinned)
+   goto out;
+
+   (*pinned)++;
+   npages--;
+   vaddr += PAGE_SIZE;
+   batch->offset++;
+   batch->size--;
+
+   if (!batch->size)
+   break;
+
+   pfn = page_to_pfn(batch->pages[batch->offset]);
+   }
+
+   if (unlikely(disable_hugepages))
+   break;
+   }
+
+out:
+   if (batch->size == 1 && !batch->offset) {
+   put_pfn(pfn, dma->prot);
+   batch->size = 0;
+   }
+
+   mmput(mm);
+   return ret;
+}
+
+static void unpin_pages_iopf(struct vfio_dma *dma,
+unsigned long pfn, unsigned long npages)
+{
+   while (npages--)
+   put_pfn(pfn++, dma->prot);
+}
+
 /* VFIO I/O Page Fault handler */
 static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
 {
@@ -3225,9 +3310,11 @@ static int vfio_iommu_type1_dma_map_iopf(struct 
iommu_fault *fault, void *data)
struct vfio_iopf_group *iopf_group;
struct vfio_iommu *iommu;
struct vfio_dma *dma;
+   struct vfio_batch batch;
dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
int access_flags = 0;
-   unsigned long bit_offset, vaddr, pfn;
+   size_t premap_len, map_len, mapped_len = 0;
+   unsigned long bit_offset, vaddr, pfn, i, npages;
int ret;
enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
struct iommu_page_response resp = {0};
@@ -3263,19 +3350,47 @@ static int vfio_iommu_type1_dma_map_iopf(struct 
iommu_fault *fault, void *data)
if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
goto out_success;
 
+   premap_len = IOPF_PREMAP_LEN << PAGE_SHIFT;
+   npages = dma->size >> PAGE_SHIFT;
+   map_len = PAGE_SIZE;
+   for (i = bit_offset + 1; i < npages; i++) {
+   if (map_len >= premap_len || IOPF_MAPPED_BITMAP_GET(dma, i))
+   break;
+   map_len += PAGE_SIZE;
+   }
vaddr = iova - dma->iova + dma->vaddr;
+   vfio_batch_init(&batch);
 
-   if (vfio_pin_page_external(dma, vaddr, &pfn, false))
-   goto out_invalid;
+   while (map_len) {
+   ret = pin_pages_iopf(dma, vaddr + mapped_len,
+map_len >> PAGE_SHIFT, &pfn,
+&npages, &batch);
+   if (!npag

[RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

2021-04-08 Thread Shenming Lu
Hi,

Requesting your comments and suggestions. :-)

The static pinning and mapping problem in VFIO and possible solutions
have been discussed a lot [1, 2]. One of the solutions is to add I/O
Page Fault support for VFIO devices. Different from those relatively
complicated software approaches such as presenting a vIOMMU that provides
the DMA buffer information (might include para-virtualized optimizations),
IOPF mainly depends on the hardware faulting capability, such as the PCIe
PRI extension or Arm SMMU stall model. What's more, the IOPF support in
the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
support for VFIO passthrough based on the IOPF part of SVA in this series.

We have measured its performance with UADK [4] (passthrough an accelerator
to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):

Run hisi_sec_test...
 - with varying sending times and message lengths
 - with/without IOPF enabled (speed slowdown)

when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
                 slowdown (num of faults)
 times   VFIO IOPF       host SVA
 1       63.4% (518)     82.8% (512)
 100     22.9% (1058)    47.9% (1024)
 1000    2.6% (1071)     8.5% (1024)

when msg_len = 10MB (and PREMAP_LEN = 512):
                 slowdown (num of faults)
 times   VFIO IOPF
 1       32.6% (13)
 100     3.5% (26)
 1000    1.6% (26)

History:

v2 -> v3
 - Nit fixes.
 - No reason to disable reporting the unrecoverable faults. (baolu)
 - Maintain a global IOPF enabled group list.
 - Split the pre-mapping optimization to be a separate patch.
 - Add selective faulting support (use vfio_pin_pages to indicate the
   non-faultable scope and add a new struct vfio_range to record it,
   untested). (Kevin)

v1 -> v2
 - Numerous improvements following the suggestions. Thanks a lot to all
   of you.

Note that PRI is not supported at the moment since there is no hardware.

Links:
[1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
2016.
[2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer Tracking
for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
[3] 
https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/
[4] https://github.com/Linaro/uadk

Thanks,
Shenming


Shenming Lu (8):
  iommu: Evolve the device fault reporting framework
  vfio/type1: Add a page fault handler
  vfio/type1: Add an MMU notifier to avoid pinning
  vfio/type1: Pre-map more pages than requested in the IOPF handling
  vfio/type1: VFIO_IOMMU_ENABLE_IOPF
  vfio/type1: No need to statically pin and map if IOPF enabled
  vfio/type1: Add selective DMA faulting support
  vfio: Add nested IOPF support

 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
 drivers/iommu/iommu.c |   56 +-
 drivers/vfio/vfio.c   |   85 +-
 drivers/vfio/vfio_iommu_type1.c   | 1000 -
 include/linux/iommu.h |   19 +-
 include/linux/vfio.h  |   13 +
 include/uapi/linux/iommu.h|4 +
 include/uapi/linux/vfio.h |6 +
 9 files changed, 1181 insertions(+), 23 deletions(-)

-- 
2.19.1



[RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework

2021-04-08 Thread Shenming Lu
This patch follows the discussion here:

https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/

Besides SVA/vSVA, users such as VFIO may also enable (2nd level) IOPF to
remove the pinning restriction. In order to better support more scenarios
of using device faults, we extend iommu_register_device_fault_handler()
with flags and introduce the FAULT_REPORT_ flags to describe the device
fault reporting capability under a specific configuration.

Note that we don't further distinguish recoverable and unrecoverable faults
by flags in the fault reporting cap; having PAGE_FAULT_REPORT_ +
UNRECOV_FAULT_REPORT_ does not seem like a clean way.

In addition, still taking VFIO as an example: in nested mode, the 1st level
and 2nd level fault reporting may be configured separately, and since
currently each device can only register one iommu dev fault handler, we add
a handler update interface for this.
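
For example, with the extended interface a fault consumer picks the reporting
flags according to its configuration and can later adjust them without
re-registering (a sketch based on the hunks below; the helper name and the
choice of handler/data are placeholders):

static int example_setup_fault_reporting(struct device *dev, bool nested,
					 iommu_dev_fault_handler_t handler,
					 void *data)
{
	u32 flags = nested ? FAULT_REPORT_NESTED_L2 : FAULT_REPORT_FLAT;
	int ret;

	ret = iommu_register_device_fault_handler(dev, handler, flags, data);
	if (ret)
		return ret;

	/*
	 * Later, e.g. when the guest configures nested translation, stage 1
	 * reporting can be turned on without re-registering:
	 */
	return iommu_update_device_fault_handler(dev, ~0, FAULT_REPORT_NESTED_L1);
}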

Signed-off-by: Shenming Lu 
---
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  3 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 18 --
 drivers/iommu/iommu.c | 56 ++-
 include/linux/iommu.h | 19 ++-
 include/uapi/linux/iommu.h|  4 ++
 5 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index ee66d1f4cb81..e6d766fb8f1a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct 
arm_smmu_master *master)
if (ret)
return ret;
 
-   ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);
+   ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
+ FAULT_REPORT_FLAT, dev);
if (ret) {
iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 53abad8fdd91..51843f54a87f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1448,10 +1448,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
*smmu, u64 *evt)
return -EOPNOTSUPP;
}
 
-   /* Stage-2 is always pinned at the moment */
-   if (evt[1] & EVTQ_1_S2)
-   return -EFAULT;
-
if (evt[1] & EVTQ_1_RnW)
perm |= IOMMU_FAULT_PERM_READ;
else
@@ -1469,26 +1465,36 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
*smmu, u64 *evt)
.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
.perm = perm,
-   .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
};
 
if (ssid_valid) {
flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
}
+
+   if (evt[1] & EVTQ_1_S2) {
+   flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2;
+   flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+   } else
+   flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
} else {
flt->type = IOMMU_FAULT_DMA_UNRECOV;
flt->event = (struct iommu_fault_unrecoverable) {
.reason = reason,
.flags = IOMMU_FAULT_UNRECOV_ADDR_VALID,
.perm = perm,
-   .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
};
 
if (ssid_valid) {
flt->event.flags |= IOMMU_FAULT_UNRECOV_PASID_VALID;
flt->event.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
}
+
+   if (evt[1] & EVTQ_1_S2) {
+   flt->event.flags |= IOMMU_FAULT_UNRECOV_L2;
+   flt->event.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+   } else
+   flt->event.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
}
 
mutex_lock(&smmu->streams_mutex);
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..b50b526b45ac 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group 
*group,
 }
 EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
 
+/*
+ * iommu_update_device_fault_handler - Update the device fault handler via flags
+ * @dev: the device
+ * @mask: mask of bits to keep (bits not set in @mask are cleared)
+ * @set: bits to set
+ *
+ * Update the device fault handler installed by
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on su

[RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

2021-04-08 Thread Shenming Lu
Since enabling IOPF for devices may lead to a slow ramp up of performance,
we add an ioctl, VFIO_IOMMU_ENABLE_IOPF, to make it configurable. Enabling
IOPF for a VFIO device consists of setting IOMMU_DEV_FEAT_IOPF and
registering the VFIO IOPF handler.

Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
in-flight page faults when disabling.

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 223 +++-
 include/uapi/linux/vfio.h   |   6 +
 2 files changed, 226 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 01e296c6dc9e..7df5711e743a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -71,6 +71,7 @@ struct vfio_iommu {
struct rb_root  dma_list;
struct blocking_notifier_head notifier;
struct mmu_notifier mn;
+   struct mm_struct*mm;
unsigned intdma_avail;
unsigned intvaddr_invalid_count;
uint64_tpgsize_bitmap;
@@ -81,6 +82,7 @@ struct vfio_iommu {
booldirty_page_tracking;
boolpinned_page_dirty_scope;
boolcontainer_open;
+   booliopf_enabled;
 };
 
 struct vfio_domain {
@@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
return node ? iopf_group : NULL;
 }
 
+static void vfio_link_iopf_group(struct vfio_iopf_group *new)
+{
+   struct rb_node **link, *parent = NULL;
+   struct vfio_iopf_group *iopf_group;
+
+   mutex_lock(&iopf_group_list_lock);
+
+   link = &iopf_group_list.rb_node;
+
+   while (*link) {
+   parent = *link;
+   iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
+
+   if (new->iommu_group < iopf_group->iommu_group)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(&new->node, parent, link);
+   rb_insert_color(&new->node, &iopf_group_list);
+
+   mutex_unlock(&iopf_group_list_lock);
+}
+
+static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
+{
+   mutex_lock(&iopf_group_list_lock);
+   rb_erase(&old->node, &iopf_group_list);
+   mutex_unlock(&iopf_group_list_lock);
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
struct mm_struct *mm;
@@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct 
vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
 }
 
+static int vfio_dev_domian_nested(struct device *dev, int *nested)
+{
+   struct iommu_domain *domain;
+
+   domain = iommu_get_domain_for_dev(dev);
+   if (!domain)
+   return -ENODEV;
+
+   return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
+}
+
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void 
*data);
+
+static int dev_enable_iopf(struct device *dev, void *data)
+{
+   int *enabled_dev_cnt = data;
+   int nested;
+   u32 flags;
+   int ret;
+
+   ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+   if (ret)
+   return ret;
+
+   ret = vfio_dev_domian_nested(dev, &nested);
+   if (ret)
+   goto out_disable;
+
+   if (nested)
+   flags = FAULT_REPORT_NESTED_L2;
+   else
+   flags = FAULT_REPORT_FLAT;
+
+   ret = iommu_register_device_fault_handler(dev,
+   vfio_iommu_type1_dma_map_iopf, flags, dev);
+   if (ret)
+   goto out_disable;
+
+   (*enabled_dev_cnt)++;
+   return 0;
+
+out_disable:
+   iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF);
+   return ret;
+}
+
+static int dev_disable_iopf(struct device *dev, void *data)
+{
+   int *enabled_dev_cnt = data;
+
+   if (enabled_dev_cnt && *enabled_dev_cnt <= 0)
+   return -1;
+
+   WARN_ON(iommu_unregister_device_fault_handler(dev));
+   WARN_ON(iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_IOPF));
+
+   if (enabled_dev_cnt)
+   (*enabled_dev_cnt)--;
+
+   return 0;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 struct iommu_group *iommu_group)
 {
@@ -2376,6 +2472,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_domain_geometry geo;
LIST_HEAD(iova_copy);
LIST_HEAD(group_resv_regions);
+   int iopf_enabled_dev_cnt = 0;
+   struct vfio_iopf_group *iopf_group = NULL;
 
mutex_lock(&iommu->lock);
 
@@ -2453,6 +2551,24 @@ static int vfio_iommu_type1_attach_group(void 
*iommu_data,
if (ret)
goto out_domain;
 

[RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled

2021-04-08 Thread Shenming Lu
If IOPF is enabled for the VFIO container, there is no need to statically
pin and map the entire DMA range; we can do it on demand. And when removing
a vfio_dma, unmap according to the IOPF mapped bitmap.

Note that we still mark all pages dirty even if IOPF is enabled; we may
add IOPF-based fine-grained dirty tracking support in the future.

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 38 +++--
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7df5711e743a..dcc93c3b258c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -175,6 +175,7 @@ struct vfio_iopf_group {
 #define IOPF_MAPPED_BITMAP_GET(dma, i) \
 	((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]		\
 	  >> ((i) % BITS_PER_LONG)) & 0x1)
+#define IOPF_MAPPED_BITMAP_BYTES(n)	DIRTY_BITMAP_BYTES(n)
 
 #define WAITED 1
 
@@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 * already pinned and accounted. Accouting should be done if there is no
 * iommu capable domain in the container.
 */
-   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+   iommu->iopf_enabled;
 
for (i = 0; i < npage; i++) {
struct vfio_pfn *vpfn;
@@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void *iommu_data,
 
mutex_lock(&iommu->lock);
 
-   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+   do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
+   iommu->iopf_enabled;
for (i = 0; i < npage; i++) {
struct vfio_dma *dma;
dma_addr_t iova;
@@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
if (!dma->size)
return 0;
 
-   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
return 0;
 
/*
@@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct vfio_iommu 
*iommu,
}
 }
 
+static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+   vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
+
+   kfree(dma->iopf_mapped_bitmap);
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
vfio_unmap_unpin(iommu, dma, true);
vfio_unlink_dma(iommu, dma);
+   if (iommu->iopf_enabled)
+   vfio_dma_clean_iopf(iommu, dma);
put_task_struct(dma->task);
vfio_dma_bitmap_free(dma);
if (dma->vaddr_invalid) {
@@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, struct 
vfio_iommu *iommu,
 * mark all pages dirty if any IOMMU capable device is not able
 * to report dirty pages and all pages are pinned and mapped.
 */
-   if (iommu->num_non_pinned_groups && dma->iommu_mapped)
+   if (iommu->num_non_pinned_groups &&
+   (dma->iommu_mapped || iommu->iopf_enabled))
bitmap_set(dma->bitmap, 0, nbits);
 
if (shift) {
@@ -1772,6 +1785,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
goto out_unlock;
}
 
+   if (iommu->iopf_enabled) {
+   dma->iopf_mapped_bitmap = kvzalloc(IOPF_MAPPED_BITMAP_BYTES(
+   size >> PAGE_SHIFT), 
GFP_KERNEL);
+   if (!dma->iopf_mapped_bitmap) {
+   ret = -ENOMEM;
+   kfree(dma);
+   goto out_unlock;
+   }
+   }
+
iommu->dma_avail--;
dma->iova = iova;
dma->vaddr = vaddr;
@@ -1811,8 +1834,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
/* Insert zero-sized and grow as we map chunks of it */
vfio_link_dma(iommu, dma);
 
-   /* Don't pin and map if container doesn't contain IOMMU capable domain*/
-   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+   /*
+* Don't pin and map if container doesn't contain IOMMU capable domain,
+* or IOPF enabled for the container.
+*/
+   if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
dma->size = size;
else
ret = vfio_pin_map_dma(iommu, dma, size);
-- 
2.19.1



[RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-04-08 Thread Shenming Lu
To set up nested mode, drivers such as vfio_pci need to register a
handler to receive stage/level 1 faults from the IOMMU. However, each
device can currently have only one iommu dev fault handler, and stage 2
IOPF may already be enabled (VFIO_IOMMU_ENABLE_IOPF), so we choose to
update the registered (consolidated) handler via flags (set
FAULT_REPORT_NESTED_L1) and further deliver the received stage 1 faults
in the handler to the guest through a newly added vfio_device_ops
callback.
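
For illustration, a vendor driver would roughly hook this up as below (a
sketch only, under assumptions: the vfio_pci_* names are made up here, the
.transfer callback signature is inferred from the vfio_transfer_iommu_fault()
caller in this patch, and how the fault is actually forwarded to userspace
is left out):

#include <linux/vfio.h>
#include <linux/iommu.h>

/* Sketch: forward a received stage/level 1 fault towards the guest. */
static int vfio_pci_transfer_fault(void *device_data, struct iommu_fault *fault)
{
	/* e.g. queue the fault into a fault region or signal an eventfd */
	return 0;
}

static const struct vfio_device_ops vfio_pci_ops_sketch = {
	.name		= "vfio-pci",
	/* ...existing callbacks... */
	.transfer	= vfio_pci_transfer_fault,
};

static int vfio_pci_enable_nested_iopf(struct device *dev)
{
	/* Register/update the consolidated VFIO IOPF handler for L1 faults. */
	return vfio_iommu_dev_fault_handler_register_nested(dev);
}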

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio.c | 81 +
 drivers/vfio/vfio_iommu_type1.c | 49 +++-
 include/linux/vfio.h| 12 +
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 44c8dfabf7de..4245f15914bf 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct 
vfio_group *group)
 }
 EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
 
+/*
+ * Register/Update the VFIO IOPF handler to receive
+ * nested stage/level 1 faults.
+ */
+int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (!group)
+   return -ENODEV;
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto out;
+
+   container = group->container;
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->register_handler))
+   ret = driver->ops->register_handler(container->iommu_data, dev);
+   else
+   ret = -ENOTTY;
+
+   vfio_group_try_dissolve_container(group);
+
+out:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
+
+int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   int ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   group = vfio_group_get_from_dev(dev);
+   if (!group)
+   return -ENODEV;
+
+   ret = vfio_group_add_container_user(group);
+   if (ret)
+   goto out;
+
+   container = group->container;
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->unregister_handler))
+   ret = driver->ops->unregister_handler(container->iommu_data, dev);
+   else
+   ret = -ENOTTY;
+
+   vfio_group_try_dissolve_container(group);
+
+out:
+   vfio_group_put(group);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
+
+int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
+{
+   struct vfio_device *device = dev_get_drvdata(dev);
+
+   if (unlikely(!device->ops->transfer))
+   return -EOPNOTSUPP;
+
+   return device->ops->transfer(device->device_data, fault);
+}
+EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ba2b5a1cf6e9..9d1adeddb303 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct 
iommu_fault *fault, void *data)
struct vfio_batch batch;
struct vfio_range *range;
dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
-   int access_flags = 0;
+   int access_flags = 0, nested;
size_t premap_len, map_len, mapped_len = 0;
unsigned long bit_offset, vaddr, pfn, i, npages;
int ret;
enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
struct iommu_page_response resp = {0};
 
+   if (vfio_dev_domian_nested(dev, &nested))
+   return -ENODEV;
+
+   /*
+* When configured in nested mode, further deliver the
+* stage/level 1 faults to the guest.
+*/
+   if (nested) {
+   bool l2;
+
+   if (fault->type == IOMMU_FAULT_PAGE_REQ)
+   l2 = fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_L2;
+   if (fault->type == IOMMU_FAULT_DMA_UNRECOV)
+   l2 = fault->event.flags & IOMMU_FAULT_UNRECOV_L2;
+
+   if (!l2)
+   return vfio_transfer_iommu_fault(dev, fault);
+   }
+
if (fault->type != IOMMU_FAULT_PAGE_REQ)
return -EOPNOTSUPP;
 
@@ -4201,6 +4220,32 @@ static void vfio_iommu_type1_notify(void *iommu

[RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support

2021-04-08 Thread Shenming Lu
Some devices only allow selective DMA faulting. Similar to the selective
dirty page tracking, the vendor driver can call vfio_pin_pages() to
indicate the non-faultable scope, and we add a new struct vfio_range to
record it; then, when the IOPF handler receives any page request out of
that scope, we can directly return an invalid response.

Suggested-by: Kevin Tian 
Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio.c |   4 +-
 drivers/vfio/vfio_iommu_type1.c | 357 +++-
 include/linux/vfio.h|   1 +
 3 files changed, 358 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..44c8dfabf7de 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2013,7 +2013,8 @@ int vfio_unpin_pages(struct device *dev, unsigned long 
*user_pfn, int npage)
container = group->container;
driver = container->iommu_driver;
if (likely(driver && driver->ops->unpin_pages))
-   ret = driver->ops->unpin_pages(container->iommu_data, user_pfn,
+   ret = driver->ops->unpin_pages(container->iommu_data,
+  group->iommu_group, user_pfn,
   npage);
else
ret = -ENOTTY;
@@ -2112,6 +2113,7 @@ int vfio_group_unpin_pages(struct vfio_group *group,
driver = container->iommu_driver;
if (likely(driver && driver->ops->unpin_pages))
ret = driver->ops->unpin_pages(container->iommu_data,
+  group->iommu_group,
   user_iova_pfn, npage);
else
ret = -ENOTTY;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index dcc93c3b258c..ba2b5a1cf6e9 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -150,10 +150,19 @@ struct vfio_regions {
 static struct rb_root iopf_group_list = RB_ROOT;
 static DEFINE_MUTEX(iopf_group_list_lock);
 
+struct vfio_range {
+   struct rb_node  node;
+   dma_addr_t  base_iova;
+   size_t  span;
+   unsigned intref_count;
+};
+
 struct vfio_iopf_group {
struct rb_node  node;
struct iommu_group  *iommu_group;
struct vfio_iommu   *iommu;
+   struct rb_root  pinned_range_list;
+   boolselective_faulting;
 };
 
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
@@ -496,6 +505,255 @@ static void vfio_unlink_iopf_group(struct vfio_iopf_group 
*old)
mutex_unlock(&iopf_group_list_lock);
 }
 
+/*
+ * Helper functions for range list, handle one page at a time.
+ */
+static struct vfio_range *vfio_find_range(struct rb_root *range_list,
+ dma_addr_t iova)
+{
+   struct rb_node *node = range_list->rb_node;
+   struct vfio_range *range;
+
+   while (node) {
+   range = rb_entry(node, struct vfio_range, node);
+
+   if (iova + PAGE_SIZE <= range->base_iova)
+   node = node->rb_left;
+   else if (iova >= range->base_iova + range->span)
+   node = node->rb_right;
+   else
+   return range;
+   }
+
+   return NULL;
+}
+
+/* Do the possible merge adjacent to the input range. */
+static void vfio_merge_range_list(struct rb_root *range_list,
+ struct vfio_range *range)
+{
+   struct rb_node *node_prev = rb_prev(&range->node);
+   struct rb_node *node_next = rb_next(&range->node);
+
+   if (node_next) {
+   struct vfio_range *range_next = rb_entry(node_next,
+struct vfio_range,
+node);
+
+   if (range_next->base_iova == (range->base_iova + range->span) &&
+   range_next->ref_count == range->ref_count) {
+   rb_erase(node_next, range_list);
+   range->span += range_next->span;
+   kfree(range_next);
+   }
+   }
+
+   if (node_prev) {
+   struct vfio_range *range_prev = rb_entry(node_prev,
+struct vfio_range,
+node);
+
+   if (range->base_iova == (range_prev->base_iova + range_prev->span)
+   && range->ref_count == range_prev->ref_count) {
+   rb_erase(&range->node, range_list);
+   range_prev->span += range->span;
+  

[RFC PATCH v3 2/8] vfio/type1: Add a page fault handler

2021-04-08 Thread Shenming Lu
VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
serve the reported page faults from the IOMMU driver.

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 114 
 1 file changed, 114 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 45cbfd4879a5..ab0ff60ee207 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -101,6 +101,7 @@ struct vfio_dma {
struct task_struct  *task;
struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
unsigned long   *bitmap;
+   unsigned long   *iopf_mapped_bitmap;
 };
 
 struct vfio_batch {
@@ -141,6 +142,16 @@ struct vfio_regions {
size_t len;
 };
 
+/* A global IOPF enabled group list */
+static struct rb_root iopf_group_list = RB_ROOT;
+static DEFINE_MUTEX(iopf_group_list_lock);
+
+struct vfio_iopf_group {
+   struct rb_node  node;
+   struct iommu_group  *iommu_group;
+   struct vfio_iommu   *iommu;
+};
+
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)\
(!list_empty(&iommu->domain_list))
 
@@ -157,6 +168,10 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX  ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX   DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+#define IOPF_MAPPED_BITMAP_GET(dma, i) \
+ ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]	\
+  >> ((i) % BITS_PER_LONG)) & 0x1)
+
 #define WAITED 1
 
 static int put_pfn(unsigned long pfn, int prot);
@@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, 
struct vfio_pfn *vpfn)
return ret;
 }
 
+/*
+ * Helper functions for iopf_group_list
+ */
+static struct vfio_iopf_group *
+vfio_find_iopf_group(struct iommu_group *iommu_group)
+{
+   struct vfio_iopf_group *iopf_group;
+   struct rb_node *node;
+
+   mutex_lock(&iopf_group_list_lock);
+
+   node = iopf_group_list.rb_node;
+
+   while (node) {
+   iopf_group = rb_entry(node, struct vfio_iopf_group, node);
+
+   if (iommu_group < iopf_group->iommu_group)
+   node = node->rb_left;
+   else if (iommu_group > iopf_group->iommu_group)
+   node = node->rb_right;
+   else
+   break;
+   }
+
+   mutex_unlock(&iopf_group_list_lock);
+   return node ? iopf_group : NULL;
+}
+
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
struct mm_struct *mm;
@@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
return -EINVAL;
 }
 
+/* VFIO I/O Page Fault handler */
+static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void *data)
+{
+   struct device *dev = (struct device *)data;
+   struct iommu_group *iommu_group;
+   struct vfio_iopf_group *iopf_group;
+   struct vfio_iommu *iommu;
+   struct vfio_dma *dma;
+   dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
+   int access_flags = 0;
+   unsigned long bit_offset, vaddr, pfn;
+   int ret;
+   enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID;
+   struct iommu_page_response resp = {0};
+
+   if (fault->type != IOMMU_FAULT_PAGE_REQ)
+   return -EOPNOTSUPP;
+
+   iommu_group = iommu_group_get(dev);
+   if (!iommu_group)
+   return -ENODEV;
+
+   iopf_group = vfio_find_iopf_group(iommu_group);
+   iommu_group_put(iommu_group);
+   if (!iopf_group)
+   return -ENODEV;
+
+   iommu = iopf_group->iommu;
+
+   mutex_lock(&iommu->lock);
+
+   ret = vfio_find_dma_valid(iommu, iova, PAGE_SIZE, &dma);
+   if (ret < 0)
+   goto out_invalid;
+
+   if (fault->prm.perm & IOMMU_FAULT_PERM_READ)
+   access_flags |= IOMMU_READ;
+   if (fault->prm.perm & IOMMU_FAULT_PERM_WRITE)
+   access_flags |= IOMMU_WRITE;
+   if ((dma->prot & access_flags) != access_flags)
+   goto out_invalid;
+
+   bit_offset = (iova - dma->iova) >> PAGE_SHIFT;
+   if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset))
+   goto out_success;
+
+   vaddr = iova - dma->iova + dma->vaddr;
+
+   if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+   goto out_invalid;
+
+   if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
+   if (put_pfn(pfn, dma->prot))
+   vfio_lock_acct(dma, -1, true);
+   goto out_invalid;
+   }
+
+   bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
+
+out_succe

[RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning

2021-04-08 Thread Shenming Lu
To avoid pinning pages when they are mapped in IOMMU page tables, we
add an MMU notifier to be told which addresses are no longer valid and
try to unmap them.
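
As a reference, the registration side could look roughly like this (a
minimal sketch under assumptions: the notifier is registered against
current->mm when IOPF is enabled for the container, and mn_invalidate_range()
is the callback added by this patch; the actual hookup in the series may
differ):

static const struct mmu_notifier_ops vfio_iommu_mn_ops = {
	.invalidate_range = mn_invalidate_range,
};

static int vfio_register_mm_notifier(struct vfio_iommu *iommu)
{
	iommu->mn.ops = &vfio_iommu_mn_ops;
	/* assumption: the container mappings are backed by current->mm */
	return mmu_notifier_register(&iommu->mn, current->mm);
}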

Signed-off-by: Shenming Lu 
---
 drivers/vfio/vfio_iommu_type1.c | 112 +++-
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ab0ff60ee207..1cb9d1f2717b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -69,6 +70,7 @@ struct vfio_iommu {
struct mutexlock;
struct rb_root  dma_list;
struct blocking_notifier_head notifier;
+   struct mmu_notifier mn;
unsigned intdma_avail;
unsigned intvaddr_invalid_count;
uint64_tpgsize_bitmap;
@@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
struct vfio_dma *dma,
return unlocked;
 }
 
+/* Unmap the IOPF mapped pages in the specified range. */
+static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
+   struct vfio_dma *dma,
+   dma_addr_t start, dma_addr_t end)
+{
+   struct iommu_iotlb_gather *gathers;
+   struct vfio_domain *d;
+   int i, num_domains = 0;
+
+   list_for_each_entry(d, &iommu->domain_list, next)
+   num_domains++;
+
+   gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
+   if (gathers) {
+   for (i = 0; i < num_domains; i++)
+   iommu_iotlb_gather_init(&gathers[i]);
+   }
+
+   while (start < end) {
+   unsigned long bit_offset;
+   size_t len;
+
+   bit_offset = (start - dma->iova) >> PAGE_SHIFT;
+
+   for (len = 0; start + len < end; len += PAGE_SIZE) {
+   if (!IOPF_MAPPED_BITMAP_GET(dma,
+   bit_offset + (len >> PAGE_SHIFT)))
+   break;
+   }
+
+   if (len) {
+   i = 0;
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   size_t unmapped;
+
+   if (gathers)
+   unmapped = iommu_unmap_fast(d->domain,
+   start, len,
+   &gathers[i++]);
+   else
+   unmapped = iommu_unmap(d->domain,
+  start, len);
+
+   if (WARN_ON(unmapped != len))
+   goto out;
+   }
+
+   bitmap_clear(dma->iopf_mapped_bitmap,
+bit_offset, len >> PAGE_SHIFT);
+
+   cond_resched();
+   }
+
+   start += (len + PAGE_SIZE);
+   }
+
+out:
+   if (gathers) {
+   i = 0;
+   list_for_each_entry(d, &iommu->domain_list, next)
+   iommu_iotlb_sync(d->domain, &gathers[i++]);
+
+   kfree(gathers);
+   }
+}
+
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
@@ -3197,17 +3265,18 @@ static int vfio_iommu_type1_dma_map_iopf(struct 
iommu_fault *fault, void *data)
 
vaddr = iova - dma->iova + dma->vaddr;
 
-   if (vfio_pin_page_external(dma, vaddr, &pfn, true))
+   if (vfio_pin_page_external(dma, vaddr, &pfn, false))
goto out_invalid;
 
if (vfio_iommu_map(iommu, iova, pfn, 1, dma->prot)) {
-   if (put_pfn(pfn, dma->prot))
-   vfio_lock_acct(dma, -1, true);
+   put_pfn(pfn, dma->prot);
goto out_invalid;
}
 
bitmap_set(dma->iopf_mapped_bitmap, bit_offset, 1);
 
+   put_pfn(pfn, dma->prot);
+
 out_success:
status = IOMMU_PAGE_RESP_SUCCESS;
 
@@ -3220,6 +3289,43 @@ static int vfio_iommu_type1_dma_map_iopf(struct 
iommu_fault *fault, void *data)
return 0;
 }
 
+static void mn_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
+   unsigned long start, unsigned long end)
+{
+   struct vfio_iommu *iommu = container_of(mn, struct vfio_iommu, mn);
+   struct rb_node *n;
+   int ret;
+
+   mutex_lock(&iommu->lock);
+
+   ret = vfio_wait_all_valid(iommu);
+   if (WAR

Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

2021-04-25 Thread Shenming Lu
On 2021/4/9 11:44, Shenming Lu wrote:
> Hi,
> 
> Requesting for your comments and suggestions. :-)

Kind ping...

> 
> The static pinning and mapping problem in VFIO and possible solutions
> have been discussed a lot [1, 2]. One of the solutions is to add I/O
> Page Fault support for VFIO devices. Different from those relatively
> complicated software approaches such as presenting a vIOMMU that provides
> the DMA buffer information (might include para-virtualized optimizations),
> IOPF mainly depends on the hardware faulting capability, such as the PCIe
> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
> support for VFIO passthrough based on the IOPF part of SVA in this series.
> 
> We have measured its performance with UADK [4] (passthrough an accelerator
> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
> 
> Run hisi_sec_test...
>  - with varying sending times and message lengths
>  - with/without IOPF enabled (speed slowdown)
> 
> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
> slowdown (num of faults)
>  times  VFIO IOPF      host SVA
>  1      63.4% (518)    82.8% (512)
>  100    22.9% (1058)   47.9% (1024)
>  1000   2.6% (1071)    8.5% (1024)
> 
> when msg_len = 10MB (and PREMAP_LEN = 512):
> slowdown (num of faults)
>  times  VFIO IOPF
>  1      32.6% (13)
>  100    3.5% (26)
>  1000   1.6% (26)
> 
> History:
> 
> v2 -> v3
>  - Nit fixes.
>  - No reason to disable reporting the unrecoverable faults. (baolu)
>  - Maintain a global IOPF enabled group list.
>  - Split the pre-mapping optimization to be a separate patch.
>  - Add selective faulting support (use vfio_pin_pages to indicate the
>non-faultable scope and add a new struct vfio_range to record it,
>untested). (Kevin)
> 
> v1 -> v2
>  - Numerous improvements following the suggestions. Thanks a lot to all
>of you.
> 
> Note that PRI is not supported at the moment since there is no hardware.
> 
> Links:
> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
> 2016.
> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer 
> Tracking
> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
> [3] 
> https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/
> [4] https://github.com/Linaro/uadk
> 
> Thanks,
> Shenming
> 
> 
> Shenming Lu (8):
>   iommu: Evolve the device fault reporting framework
>   vfio/type1: Add a page fault handler
>   vfio/type1: Add an MMU notifier to avoid pinning
>   vfio/type1: Pre-map more pages than requested in the IOPF handling
>   vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>   vfio/type1: No need to statically pin and map if IOPF enabled
>   vfio/type1: Add selective DMA faulting support
>   vfio: Add nested IOPF support
> 
>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |3 +-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
>  drivers/iommu/iommu.c |   56 +-
>  drivers/vfio/vfio.c   |   85 +-
>  drivers/vfio/vfio_iommu_type1.c   | 1000 -
>  include/linux/iommu.h |   19 +-
>  include/linux/vfio.h  |   13 +
>  include/uapi/linux/iommu.h|4 +
>  include/uapi/linux/vfio.h |6 +
>  9 files changed, 1181 insertions(+), 23 deletions(-)
> 


Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

2021-05-11 Thread Shenming Lu
Hi Alex,

Hoping for some suggestions or comments from you, since there seem to be
many uncertain points in this series. :-)

Thanks,
Shenming


On 2021/4/26 9:41, Shenming Lu wrote:
> On 2021/4/9 11:44, Shenming Lu wrote:
>> Hi,
>>
>> Requesting for your comments and suggestions. :-)
> 
> Kind ping...
> 
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
>>
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>>  - with varying sending times and message lengths
>>  - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>> slowdown (num of faults)
>>  times  VFIO IOPF      host SVA
>>  1      63.4% (518)    82.8% (512)
>>  100    22.9% (1058)   47.9% (1024)
>>  1000   2.6% (1071)    8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>> slowdown (num of faults)
>>  times  VFIO IOPF
>>  1      32.6% (13)
>>  100    3.5% (26)
>>  1000   1.6% (26)
>>
>> History:
>>
>> v2 -> v3
>>  - Nit fixes.
>>  - No reason to disable reporting the unrecoverable faults. (baolu)
>>  - Maintain a global IOPF enabled group list.
>>  - Split the pre-mapping optimization to be a separate patch.
>>  - Add selective faulting support (use vfio_pin_pages to indicate the
>>non-faultable scope and add a new struct vfio_range to record it,
>>untested). (Kevin)
>>
>> v1 -> v2
>>  - Numerous improvements following the suggestions. Thanks a lot to all
>>of you.
>>
>> Note that PRI is not supported at the moment since there is no hardware.
>>
>> Links:
>> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>> 2016.
>> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer 
>> Tracking
>> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
>> [3] 
>> https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/
>> [4] https://github.com/Linaro/uadk
>>
>> Thanks,
>> Shenming
>>
>>
>> Shenming Lu (8):
>>   iommu: Evolve the device fault reporting framework
>>   vfio/type1: Add a page fault handler
>>   vfio/type1: Add an MMU notifier to avoid pinning
>>   vfio/type1: Pre-map more pages than requested in the IOPF handling
>>   vfio/type1: VFIO_IOMMU_ENABLE_IOPF
>>   vfio/type1: No need to statically pin and map if IOPF enabled
>>   vfio/type1: Add selective DMA faulting support
>>   vfio: Add nested IOPF support
>>
>>  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |3 +-
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   18 +-
>>  drivers/iommu/iommu.c |   56 +-
>>  drivers/vfio/vfio.c   |   85 +-
>>  drivers/vfio/vfio_iommu_type1.c   | 1000 -
>>  include/linux/iommu.h |   19 +-
>>  include/linux/vfio.h  |   13 +
>>  include/uapi/linux/iommu.h|4 +
>>  include/uapi/linux/vfio.h |6 +
>>  9 files changed, 1181 insertions(+), 23 deletions(-)
>>


Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

2021-05-20 Thread Shenming Lu
Hi Alex,

On 2021/5/19 2:57, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:12 +0800
> Shenming Lu  wrote:
> 
>> Hi,
>>
>> Requesting for your comments and suggestions. :-)
>>
>> The static pinning and mapping problem in VFIO and possible solutions
>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>> Page Fault support for VFIO devices. Different from those relatively
>> complicated software approaches such as presenting a vIOMMU that provides
>> the DMA buffer information (might include para-virtualized optimizations),
>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>> support for VFIO passthrough based on the IOPF part of SVA in this series.
> 
> The SVA proposals are being reworked to make use of a new IOASID
> object, it's not clear to me that this shouldn't also make use of that
> work as it does a significant expansion of the type1 IOMMU with fault
> handlers that would duplicate new work using that new model.

It seems that the IOASID extension for guest SVA would not affect this series;
would we do any host-guest IOASID translation in the VFIO fault handler?

> 
>> We have measured its performance with UADK [4] (passthrough an accelerator
>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>
>> Run hisi_sec_test...
>>  - with varying sending times and message lengths
>>  - with/without IOPF enabled (speed slowdown)
>>
>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>> slowdown (num of faults)
>>  times  VFIO IOPF      host SVA
>>  1      63.4% (518)    82.8% (512)
>>  100    22.9% (1058)   47.9% (1024)
>>  1000   2.6% (1071)    8.5% (1024)
>>
>> when msg_len = 10MB (and PREMAP_LEN = 512):
>> slowdown (num of faults)
>>  times  VFIO IOPF
>>  1      32.6% (13)
>>  100    3.5% (26)
>>  1000   1.6% (26)
> 
> It seems like this is only an example that you can make a benchmark
> show anything you want.  The best results would be to pre-map
> everything, which is what we have without this series.  What is an
> acceptable overhead to incur to avoid page pinning?  What if userspace
> had more fine grained control over which mappings were available for
> faulting and which were statically mapped?  I don't really see what
> sense the pre-mapping range makes.  If I assume the user is QEMU in a
> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
> doesn't make any logical sense relative to device DMA.

As you said in Patch 4, we should introduce full end-to-end functionality
before trying to improve performance, so I will drop the pre-mapping patch
at this stage...

Is there a need for userspace to have more fine-grained control over which
mappings are available for faulting? If so, we could evolve the MAP ioctl
to support specifying the faulting range.
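
Something like the following from userspace, for example (a purely
hypothetical sketch: VFIO_DMA_MAP_FLAG_IOPF and its bit value are invented
for illustration and are not part of this series or the current uAPI;
container_fd, buf, iova and size are assumed to be set up as usual):

#define VFIO_DMA_MAP_FLAG_IOPF	(1 << 3)	/* hypothetical bit */

struct vfio_iommu_type1_dma_map map = {
	.argsz	= sizeof(map),
	.flags	= VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
		  VFIO_DMA_MAP_FLAG_IOPF,	/* this mapping may fault on demand */
	.vaddr	= (__u64)(uintptr_t)buf,
	.iova	= iova,
	.size	= size,
};

ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);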

As for the overhead of IOPF, it is unavoidable when enabling on-demand paging
(page faults occur almost only on first access), and there still seems to be
a large optimization space compared to CPU page faulting.

Thanks,
Shenming

> 
> Comments per patch to follow.  Thanks,
> 
> Alex
> 
> 
>> History:
>>
>> v2 -> v3
>>  - Nit fixes.
>>  - No reason to disable reporting the unrecoverable faults. (baolu)
>>  - Maintain a global IOPF enabled group list.
>>  - Split the pre-mapping optimization to be a separate patch.
>>  - Add selective faulting support (use vfio_pin_pages to indicate the
>>non-faultable scope and add a new struct vfio_range to record it,
>>untested). (Kevin)
>>
>> v1 -> v2
>>  - Numerous improvements following the suggestions. Thanks a lot to all
>>of you.
>>
>> Note that PRI is not supported at the moment since there is no hardware.
>>
>> Links:
>> [1] Lesokhin I, et al. Page Fault Support for Network Controllers. In ASPLOS,
>> 2016.
>> [2] Tian K, et al. coIOMMU: A Virtual IOMMU with Cooperative DMA Buffer 
>> Tracking
>> for Efficient Memory Management in Direct I/O. In USENIX ATC, 2020.
>> [3] 
>> https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210401154718.307519-1-jean-phili...@linaro.org/
>> [4] https://github.com/Linaro/uadk
>>
>> Thanks,
>> Shenming
>>
>>
>> Shenming Lu (8):
>>   iommu: Evolve the device fault reporting framework
>>   vfio/type1: Add a page fault handler
>&g

Re: [RFC PATCH v3 1/8] iommu: Evolve the device fault reporting framework

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:13 +0800
> Shenming Lu  wrote:
> 
>> This patch follows the discussion here:
>>
>> https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/
>>
>> Besides SVA/vSVA, such as VFIO may also enable (2nd level) IOPF to remove
>> pinning restriction. In order to better support more scenarios of using
>> device faults, we extend iommu_register_fault_handler() with flags and
>> introduce FAULT_REPORT_ to describe the device fault reporting capability
>> under a specific configuration.
>>
>> Note that we don't further distinguish recoverable and unrecoverable faults
>> by flags in the fault reporting cap, having PAGE_FAULT_REPORT_ +
>> UNRECOV_FAULT_REPORT_ seems not a clean way.
>>
>> In addition, still take VFIO as an example, in nested mode, the 1st level
>> and 2nd level fault reporting may be configured separately and currently
>> each device can only register one iommu dev fault handler, so we add a
>> handler update interface for this.
> 
> 
> IIUC, you're introducing flags for the fault handler callout, which
> essentially allows the IOMMU layer to filter which types of faults the
> handler can receive.  You then need an update function to modify those
> flags.  Why can't the handler itself perform this filtering?  For
> instance in your vfio example, the handler registered by the type1
> backend could itself return fault until the fault transfer path to the
> device driver is established.  Thanks,

As discussed in [1]:

In nested IOPF, we have to figure out whether a fault comes from L1 or L2.
A SMMU stall event comes with this information, but a PRI page request doesn't.
The IOMMU driver can walk the page tables to find out the level information.
If the walk terminates at L1, the fault is further injected to the guest;
otherwise it is fixed at L2 in VFIO. This is not efficient compared to
hardware-provided info.

And in the pinned case, if VFIO can tell the IOMMU driver that no L2 faults
are expected, there is no need to walk the page tables in the IOMMU driver?

[1] 
https://patchwork.kernel.org/project/linux-arm-kernel/patch/20210108145217.2254447-4-jean-phili...@linaro.org/

Thanks,
Shenming

> 
> Alex
> 
> .
> 


Re: [RFC PATCH v3 3/8] vfio/type1: Add an MMU notifier to avoid pinning

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:15 +0800
> Shenming Lu  wrote:
> 
>> To avoid pinning pages when they are mapped in IOMMU page tables, we
>> add an MMU notifier to tell the addresses which are no longer valid
>> and try to unmap them.
>>
>> Signed-off-by: Shenming Lu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 112 +++-
>>  1 file changed, 109 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index ab0ff60ee207..1cb9d1f2717b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -40,6 +40,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #define DRIVER_VERSION  "0.2"
>>  #define DRIVER_AUTHOR   "Alex Williamson "
>> @@ -69,6 +70,7 @@ struct vfio_iommu {
>>  struct mutexlock;
>>  struct rb_root  dma_list;
>>  struct blocking_notifier_head notifier;
>> +struct mmu_notifier mn;
>>  unsigned intdma_avail;
>>  unsigned intvaddr_invalid_count;
>>  uint64_tpgsize_bitmap;
>> @@ -1204,6 +1206,72 @@ static long vfio_unmap_unpin(struct vfio_iommu 
>> *iommu, struct vfio_dma *dma,
>>  return unlocked;
>>  }
>>  
>> +/* Unmap the IOPF mapped pages in the specified range. */
>> +static void vfio_unmap_partial_iopf(struct vfio_iommu *iommu,
>> +struct vfio_dma *dma,
>> +dma_addr_t start, dma_addr_t end)
>> +{
>> +struct iommu_iotlb_gather *gathers;
>> +struct vfio_domain *d;
>> +int i, num_domains = 0;
>> +
>> +list_for_each_entry(d, &iommu->domain_list, next)
>> +num_domains++;
>> +
>> +gathers = kzalloc(sizeof(*gathers) * num_domains, GFP_KERNEL);
>> +if (gathers) {
>> +for (i = 0; i < num_domains; i++)
>> +iommu_iotlb_gather_init(&gathers[i]);
>> +}
> 
> 
> If we're always serialized on iommu->lock, would it make sense to have
> gathers pre-allocated on the vfio_iommu object?

Yeah, we can do it.

> 
>> +
>> +while (start < end) {
>> +unsigned long bit_offset;
>> +size_t len;
>> +
>> +bit_offset = (start - dma->iova) >> PAGE_SHIFT;
>> +
>> +for (len = 0; start + len < end; len += PAGE_SIZE) {
>> +if (!IOPF_MAPPED_BITMAP_GET(dma,
>> +bit_offset + (len >> PAGE_SHIFT)))
>> +break;
> 
> 
> There are bitmap helpers for this, find_first_bit(),
> find_next_zero_bit().

Thanks for the hint. :-)
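
Something like this is what I have in mind for the reworked loop (an
untested sketch; 'first' and 'last' are just the bit indices that bound
the range being unmapped):

	unsigned long first = (start - dma->iova) >> PAGE_SHIFT;
	unsigned long last = (end - dma->iova) >> PAGE_SHIFT;
	unsigned long next_zero;
	size_t len;

	/* length of the run of IOPF-mapped pages starting at 'first' */
	next_zero = find_next_zero_bit(dma->iopf_mapped_bitmap, last, first);
	len = (size_t)(next_zero - first) << PAGE_SHIFT;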

> 
> 
>> +}
>> +
>> +if (len) {
>> +i = 0;
>> +list_for_each_entry(d, &iommu->domain_list, next) {
>> +size_t unmapped;
>> +
>> +if (gathers)
>> +unmapped = iommu_unmap_fast(d->domain,
>> +start, len,
>> +
>> &gathers[i++]);
>> +else
>> +unmapped = iommu_unmap(d->domain,
>> +   start, len);
>> +
>> +if (WARN_ON(unmapped != len))
> 
> The IOMMU API does not guarantee arbitrary unmap sizes, this will
> trigger and this exit path is wrong.  If we've already unmapped the
> IOMMU, shouldn't we proceed with @unmapped rather than @len so the
> device can re-fault the extra mappings?  Otherwise the IOMMU state
> doesn't match the iopf bitmap state.

OK, I will correct it.

And can we assume that the @unmapped values (returned from iommu_unmap)
of all domains are the same (since all domains share the same 
iopf_mapped_bitmap)?

> 
>> +goto out;
>> +}
>> +
>> +bitmap_clear(dma->iopf_mapped_bitmap,
>> + bit_offset, len >> PAGE_SHIFT);
>> +
>> +cond_resched();
>> +}
>> +
>> +start += (len + PAGE_SIZE);
>> + 

Re: [RFC PATCH v3 4/8] vfio/type1: Pre-map more pages than requested in the IOPF handling

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:16 +0800
> Shenming Lu  wrote:
> 
>> To optimize for fewer page fault handlings, we can pre-map more pages
>> than requested at once.
>>
>> Note that IOPF_PREMAP_LEN is just an arbitrary value for now, which we
>> could try further tuning.
> 
> I'd prefer that the series introduced full end-to-end functionality
> before trying to improve performance.  The pre-map value seems
> arbitrary here and as noted in the previous patch, the IOMMU API does
> not guarantee unmaps of ranges smaller than the original mapping.  This
> would need to map with single page granularity in order to guarantee
> page granularity at the mmu notifier when the IOMMU supports
> superpages.  Thanks,

Yeah, I will drop this patch in the current stage.

Thanks,
Shenming

> 
> Alex
> 
> .
> 


Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:17 +0800
> Shenming Lu  wrote:
> 
>> Since enabling IOPF for devices may lead to a slow ramp up of performance,
>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
>> registering the VFIO IOPF handler.
>>
>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
>> inflight page faults when disabling.
>>
>> Signed-off-by: Shenming Lu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 223 +++-
>>  include/uapi/linux/vfio.h   |   6 +
>>  2 files changed, 226 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 01e296c6dc9e..7df5711e743a 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>>  struct rb_root  dma_list;
>>  struct blocking_notifier_head notifier;
>>  struct mmu_notifier mn;
>> +struct mm_struct*mm;
> 
> We currently have no requirement that a single mm is used for all
> container mappings.  Does enabling IOPF impose such a requirement?
> Shouldn't MAP/UNMAP enforce that?

Did you mean that there is a possibility that each vfio_dma in a
container has a different mm? If so, could we register one MMU
notifier for each vfio_dma in the MAP ioctl?

> 
>>  unsigned intdma_avail;
>>  unsigned intvaddr_invalid_count;
>>  uint64_tpgsize_bitmap;
>> @@ -81,6 +82,7 @@ struct vfio_iommu {
>>  booldirty_page_tracking;
>>  boolpinned_page_dirty_scope;
>>  boolcontainer_open;
>> +booliopf_enabled;
>>  };
>>  
>>  struct vfio_domain {
>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>>  return node ? iopf_group : NULL;
>>  }
>>  
>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
>> +{
>> +struct rb_node **link, *parent = NULL;
>> +struct vfio_iopf_group *iopf_group;
>> +
>> +mutex_lock(&iopf_group_list_lock);
>> +
>> +link = &iopf_group_list.rb_node;
>> +
>> +while (*link) {
>> +parent = *link;
>> +iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
>> +
>> +if (new->iommu_group < iopf_group->iommu_group)
>> +link = &(*link)->rb_left;
>> +else
>> +link = &(*link)->rb_right;
>> +}
>> +
>> +rb_link_node(&new->node, parent, link);
>> +rb_insert_color(&new->node, &iopf_group_list);
>> +
>> +mutex_unlock(&iopf_group_list_lock);
>> +}
>> +
>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
>> +{
>> +mutex_lock(&iopf_group_list_lock);
>> +rb_erase(&old->node, &iopf_group_list);
>> +mutex_unlock(&iopf_group_list_lock);
>> +}
>> +
>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>  {
>>  struct mm_struct *mm;
>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct 
>> vfio_iommu *iommu,
>>  list_splice_tail(iova_copy, iova);
>>  }
>>  
>> +static int vfio_dev_domian_nested(struct device *dev, int *nested)
> 
> 
> s/domian/domain/
> 
> 
>> +{
>> +struct iommu_domain *domain;
>> +
>> +domain = iommu_get_domain_for_dev(dev);
>> +if (!domain)
>> +return -ENODEV;
>> +
>> +return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);
> 
> 
> Wouldn't this be easier to use if it returned bool, false on -errno?

Makes sense.
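
Something like this, then (an untested sketch that keeps the same
DOMAIN_ATTR_NESTING query and just folds errors into "not nested"):

static bool vfio_dev_domain_nested(struct device *dev)
{
	struct iommu_domain *domain;
	int nested;

	domain = iommu_get_domain_for_dev(dev);
	if (!domain)
		return false;

	if (iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested))
		return false;

	return nested;
}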

> 
> 
>> +}
>> +
>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void 
>> *data);
>> +
>> +static int dev_enable_iopf(struct device *dev, void *data)
>> +{
>> +int *enabled_dev_cnt = data;
>> +int nested;
>> +u32 flags;
>> +int ret;
>> +
>> +ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
>> +if (ret)
>> +return ret;
>> +
>> +ret = vfio_dev_domian_nested(dev, &nested);
>> +if (ret)
>> +goto out_disable;
&

Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:14 +0800
> Shenming Lu  wrote:
> 
>> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
>> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
>> serve the reported page faults from the IOMMU driver.
>>
>> Signed-off-by: Shenming Lu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 114 
>>  1 file changed, 114 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 45cbfd4879a5..ab0ff60ee207 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -101,6 +101,7 @@ struct vfio_dma {
>>  struct task_struct  *task;
>>  struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
>>  unsigned long   *bitmap;
>> +unsigned long   *iopf_mapped_bitmap;
>>  };
>>  
>>  struct vfio_batch {
>> @@ -141,6 +142,16 @@ struct vfio_regions {
>>  size_t len;
>>  };
>>  
>> +/* A global IOPF enabled group list */
>> +static struct rb_root iopf_group_list = RB_ROOT;
>> +static DEFINE_MUTEX(iopf_group_list_lock);
>> +
>> +struct vfio_iopf_group {
>> +struct rb_node  node;
>> +struct iommu_group  *iommu_group;
>> +struct vfio_iommu   *iommu;
>> +};
>> +
>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) \
>>  (!list_empty(&iommu->domain_list))
>>  
>> @@ -157,6 +168,10 @@ struct vfio_regions {
>>  #define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
>>  #define DIRTY_BITMAP_SIZE_MAX
>> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>  
>> +#define IOPF_MAPPED_BITMAP_GET(dma, i)  \
>> +  ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]
>> \
>> +   >> ((i) % BITS_PER_LONG)) & 0x1)
> 
> 
> Can't we just use test_bit()?

Yeah, we can use it.
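
For example (sketch), the open-coded lookup in the fault handler would
simply become:

	/* was: if (IOPF_MAPPED_BITMAP_GET(dma, bit_offset)) */
	if (test_bit(bit_offset, dma->iopf_mapped_bitmap))
		goto out_success;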

> 
> 
>> +
>>  #define WAITED 1
>>  
>>  static int put_pfn(unsigned long pfn, int prot);
>> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, 
>> struct vfio_pfn *vpfn)
>>  return ret;
>>  }
>>  
>> +/*
>> + * Helper functions for iopf_group_list
>> + */
>> +static struct vfio_iopf_group *
>> +vfio_find_iopf_group(struct iommu_group *iommu_group)
>> +{
>> +struct vfio_iopf_group *iopf_group;
>> +struct rb_node *node;
>> +
>> +mutex_lock(&iopf_group_list_lock);
>> +
>> +node = iopf_group_list.rb_node;
>> +
>> +while (node) {
>> +iopf_group = rb_entry(node, struct vfio_iopf_group, node);
>> +
>> +if (iommu_group < iopf_group->iommu_group)
>> +node = node->rb_left;
>> +else if (iommu_group > iopf_group->iommu_group)
>> +node = node->rb_right;
>> +else
>> +break;
>> +}
>> +
>> +mutex_unlock(&iopf_group_list_lock);
>> +return node ? iopf_group : NULL;
>> +}
> 
> This looks like a pretty heavy weight operation per DMA fault.
> 
> I'm also suspicious of this validity of this iopf_group after we've
> dropped the locking, the ordering of patches makes this very confusing.

My thought was to include the handling of DMA faults completely in the type1
backend by introducing the vfio_iopf_group struct. But it seems that introducing
a struct with an unknown lifecycle causes more problems...
I will use the path from vfio-core as in the v2 for simplicity and validity.

Sorry for the confusion; I will rework the series later. :-)

> 
>> +
>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>  {
>>  struct mm_struct *mm;
>> @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct 
>> vfio_iommu *iommu,
>>  return -EINVAL;
>>  }
>>  
>> +/* VFIO I/O Page Fault handler */
>> +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void 
>> *data)
> 
>>From the comment, this seems like the IOMMU fault handler (the
> construction of this series makes this difficult to follow) and
> eventually it handles more than DMA mapping, for example transferring
> faults to the device driver.  "dma_map_iopf" seems like a poorly scoped
> name.

Maybe just call it dev_fault_handler?

Re: [RFC PATCH v3 6/8] vfio/type1: No need to statically pin and map if IOPF enabled

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:18 +0800
> Shenming Lu  wrote:
> 
>> If IOPF enabled for the VFIO container, there is no need to statically
>> pin and map the entire DMA range, we can do it on demand. And unmap
>> according to the IOPF mapped bitmap when removing vfio_dma.
>>
>> Note that we still mark all pages dirty even if IOPF enabled, we may
>> add IOPF-based fine grained dirty tracking support in the future.
>>
>> Signed-off-by: Shenming Lu 
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 38 +++--
>>  1 file changed, 32 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 7df5711e743a..dcc93c3b258c 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -175,6 +175,7 @@ struct vfio_iopf_group {
>>  #define IOPF_MAPPED_BITMAP_GET(dma, i)  \
>>((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]
>> \
>> >> ((i) % BITS_PER_LONG)) & 0x1)  
>> +#define IOPF_MAPPED_BITMAP_BYTES(n) DIRTY_BITMAP_BYTES(n)
>>  
>>  #define WAITED 1
>>  
>> @@ -959,7 +960,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>   * already pinned and accounted. Accouting should be done if there is no
>>   * iommu capable domain in the container.
>>   */
>> -do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>> +do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
>> +iommu->iopf_enabled;
>>  
>>  for (i = 0; i < npage; i++) {
>>  struct vfio_pfn *vpfn;
>> @@ -1048,7 +1050,8 @@ static int vfio_iommu_type1_unpin_pages(void 
>> *iommu_data,
>>  
>>  mutex_lock(&iommu->lock);
>>  
>> -do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
>> +do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) ||
>> +iommu->iopf_enabled;
> 
> pin/unpin are actually still pinning pages, why does iopf exempt them
> from accounting?

If iopf_enabled is true, do_accounting will be true too, so we will account
the externally pinned pages?

> 
> 
>>  for (i = 0; i < npage; i++) {
>>  struct vfio_dma *dma;
>>  dma_addr_t iova;
>> @@ -1169,7 +1172,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, 
>> struct vfio_dma *dma,
>>  if (!dma->size)
>>  return 0;
>>  
>> -if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> +if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu) || iommu->iopf_enabled)
>>  return 0;
>>  
>>  /*
>> @@ -1306,11 +1309,20 @@ static void vfio_unmap_partial_iopf(struct 
>> vfio_iommu *iommu,
>>  }
>>  }
>>  
>> +static void vfio_dma_clean_iopf(struct vfio_iommu *iommu, struct vfio_dma 
>> *dma)
>> +{
>> +vfio_unmap_partial_iopf(iommu, dma, dma->iova, dma->iova + dma->size);
>> +
>> +kfree(dma->iopf_mapped_bitmap);
>> +}
>> +
>>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>>  WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
>>  vfio_unmap_unpin(iommu, dma, true);
>>  vfio_unlink_dma(iommu, dma);
>> +if (iommu->iopf_enabled)
>> +vfio_dma_clean_iopf(iommu, dma);
>>  put_task_struct(dma->task);
>>  vfio_dma_bitmap_free(dma);
>>  if (dma->vaddr_invalid) {
>> @@ -1359,7 +1371,8 @@ static int update_user_bitmap(u64 __user *bitmap, 
>> struct vfio_iommu *iommu,
>>   * mark all pages dirty if any IOMMU capable device is not able
>>   * to report dirty pages and all pages are pinned and mapped.
>>   */
>> -if (iommu->num_non_pinned_groups && dma->iommu_mapped)
>> +if (iommu->num_non_pinned_groups &&
>> +(dma->iommu_mapped || iommu->iopf_enabled))
>>  bitmap_set(dma->bitmap, 0, nbits);
> 
> This seems like really poor integration of iopf into dirty page
> tracking.  I'd expect dirty logging to flush the mapped pages and
> write faults to mark pages dirty.  Shouldn't the fault handler also
> provide only the access faulted, so for example a read fault wouldn't
> mark the page dirty?
I just want to keep the behavior here as before: if IOPF is enabled, we
will still mark all pages dirty.

We can distinguish between write and read faults in the fa
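
A rough sketch of that direction (an assumption only, not implemented in
this series): mark pages dirty from the fault handler only on write faults,
reusing the existing per-vfio_dma dirty bitmap:

	if (iommu->dirty_page_tracking &&
	    (fault->prm.perm & IOMMU_FAULT_PERM_WRITE))
		bitmap_set(dma->bitmap, bit_offset, 1);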

Re: [RFC PATCH v3 7/8] vfio/type1: Add selective DMA faulting support

2021-05-20 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:19 +0800
> Shenming Lu  wrote:
> 
>> Some devices only allow selective DMA faulting. Similar to the selective
>> dirty page tracking, the vendor driver can call vfio_pin_pages() to
>> indicate the non-faultable scope, we add a new struct vfio_range to
>> record it, then when the IOPF handler receives any page request out
>> of the scope, we can directly return with an invalid response.
> 
> Seems like this highlights a deficiency in the design, that the user
> can't specify mappings as iopf enabled or disabled.  Also, if the
> vendor driver has pinned pages within the range, shouldn't that prevent
> them from faulting in the first place?  Why do we need yet more
> tracking structures?  Pages pinned by the vendor driver need to count
> against the user's locked memory limits regardless of iopf.  Thanks,

Currently we only have the vfio_pfn struct to track the externally pinned
pages (single page granularity), so I added a vfio_range struct for
efficient lookup.

Yeah, with this patch, for the non-pinned scope we can directly return
INVALID, but for the pinned (non-faultable) scope, tracking the pinned
range doesn't seem to help more...

Thanks,
Shenming

> 
> Alex
> 
> .
> 


Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-05-21 Thread Shenming Lu
On 2021/5/19 2:58, Alex Williamson wrote:
> On Fri, 9 Apr 2021 11:44:20 +0800
> Shenming Lu  wrote:
> 
>> To set up nested mode, drivers such as vfio_pci need to register a
>> handler to receive stage/level 1 faults from the IOMMU, but since
>> currently each device can only have one iommu dev fault handler,
>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>> we choose to update the registered handler (a consolidated one) via
>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>> stage 1 faults in the handler to the guest through a newly added
>> vfio_device_ops callback.
> 
> Are there proposed in-kernel drivers that would use any of these
> symbols?

I hope that, for example, Eric's SMMUv3 Nested Stage Setup series [1] can
use these symbols to consolidate the two page fault handlers into one.

[1] 
https://patchwork.kernel.org/project/kvm/cover/20210411114659.15051-1-eric.au...@redhat.com/

> 
>> Signed-off-by: Shenming Lu 
>> ---
>>  drivers/vfio/vfio.c | 81 +
>>  drivers/vfio/vfio_iommu_type1.c | 49 +++-
>>  include/linux/vfio.h| 12 +
>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 44c8dfabf7de..4245f15914bf 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct 
>> vfio_group *group)
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>  
>> +/*
>> + * Register/Update the VFIO IOPF handler to receive
>> + * nested stage/level 1 faults.
>> + */
>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>> +{
>> +struct vfio_container *container;
>> +struct vfio_group *group;
>> +struct vfio_iommu_driver *driver;
>> +int ret;
>> +
>> +if (!dev)
>> +return -EINVAL;
>> +
>> +group = vfio_group_get_from_dev(dev);
>> +if (!group)
>> +return -ENODEV;
>> +
>> +ret = vfio_group_add_container_user(group);
>> +if (ret)
>> +goto out;
>> +
>> +container = group->container;
>> +driver = container->iommu_driver;
>> +if (likely(driver && driver->ops->register_handler))
>> +ret = driver->ops->register_handler(container->iommu_data, dev);
>> +else
>> +ret = -ENOTTY;
>> +
>> +vfio_group_try_dissolve_container(group);
>> +
>> +out:
>> +vfio_group_put(group);
>> +return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>> +
>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>> +{
>> +struct vfio_container *container;
>> +struct vfio_group *group;
>> +struct vfio_iommu_driver *driver;
>> +int ret;
>> +
>> +if (!dev)
>> +return -EINVAL;
>> +
>> +group = vfio_group_get_from_dev(dev);
>> +if (!group)
>> +return -ENODEV;
>> +
>> +ret = vfio_group_add_container_user(group);
>> +if (ret)
>> +goto out;
>> +
>> +container = group->container;
>> +driver = container->iommu_driver;
>> +if (likely(driver && driver->ops->unregister_handler))
>> +ret = driver->ops->unregister_handler(container->iommu_data, 
>> dev);
>> +else
>> +ret = -ENOTTY;
>> +
>> +vfio_group_try_dissolve_container(group);
>> +
>> +out:
>> +vfio_group_put(group);
>> +return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>> +
>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault *fault)
>> +{
>> +struct vfio_device *device = dev_get_drvdata(dev);
>> +
>> +if (unlikely(!device->ops->transfer))
>> +return -EOPNOTSUPP;
>> +
>> +return device->ops->transfer(device->device_data, fault);
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
>> +
>>  /**
>>   * Module/class support
>>   */
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index ba2b5a1cf6e9..9d1adeddb303 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -3821,13 +3821,32 @@ static int

Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-05-24 Thread Shenming Lu
On 2021/5/21 15:59, Shenming Lu wrote:
> On 2021/5/19 2:58, Alex Williamson wrote:
>> On Fri, 9 Apr 2021 11:44:20 +0800
>> Shenming Lu  wrote:
>>
>>> To set up nested mode, drivers such as vfio_pci need to register a
>>> handler to receive stage/level 1 faults from the IOMMU, but since
>>> currently each device can only have one iommu dev fault handler,
>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>>> we choose to update the registered handler (a consolidated one) via
>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>>> stage 1 faults in the handler to the guest through a newly added
>>> vfio_device_ops callback.
>>
>> Are there proposed in-kernel drivers that would use any of these
>> symbols?
> 
> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
> use these symbols to consolidate the two page fault handlers into one.
> 
> [1] 
> https://patchwork.kernel.org/project/kvm/cover/2021044659.15051-1-eric.au...@redhat.com/
> 
>>
>>> Signed-off-by: Shenming Lu 
>>> ---
>>>  drivers/vfio/vfio.c | 81 +
>>>  drivers/vfio/vfio_iommu_type1.c | 49 +++-
>>>  include/linux/vfio.h| 12 +
>>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 44c8dfabf7de..4245f15914bf 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -2356,6 +2356,87 @@ struct iommu_domain *vfio_group_iommu_domain(struct 
>>> vfio_group *group)
>>>  }
>>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>>  
>>> +/*
>>> + * Register/Update the VFIO IOPF handler to receive
>>> + * nested stage/level 1 faults.
>>> + */
>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>>> +{
>>> +   struct vfio_container *container;
>>> +   struct vfio_group *group;
>>> +   struct vfio_iommu_driver *driver;
>>> +   int ret;
>>> +
>>> +   if (!dev)
>>> +   return -EINVAL;
>>> +
>>> +   group = vfio_group_get_from_dev(dev);
>>> +   if (!group)
>>> +   return -ENODEV;
>>> +
>>> +   ret = vfio_group_add_container_user(group);
>>> +   if (ret)
>>> +   goto out;
>>> +
>>> +   container = group->container;
>>> +   driver = container->iommu_driver;
>>> +   if (likely(driver && driver->ops->register_handler))
>>> +   ret = driver->ops->register_handler(container->iommu_data, dev);
>>> +   else
>>> +   ret = -ENOTTY;
>>> +
>>> +   vfio_group_try_dissolve_container(group);
>>> +
>>> +out:
>>> +   vfio_group_put(group);
>>> +   return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>>> +
>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>>> +{
>>> +   struct vfio_container *container;
>>> +   struct vfio_group *group;
>>> +   struct vfio_iommu_driver *driver;
>>> +   int ret;
>>> +
>>> +   if (!dev)
>>> +   return -EINVAL;
>>> +
>>> +   group = vfio_group_get_from_dev(dev);
>>> +   if (!group)
>>> +   return -ENODEV;
>>> +
>>> +   ret = vfio_group_add_container_user(group);
>>> +   if (ret)
>>> +   goto out;
>>> +
>>> +   container = group->container;
>>> +   driver = container->iommu_driver;
>>> +   if (likely(driver && driver->ops->unregister_handler))
>>> +   ret = driver->ops->unregister_handler(container->iommu_data, 
>>> dev);
>>> +   else
>>> +   ret = -ENOTTY;
>>> +
>>> +   vfio_group_try_dissolve_container(group);
>>> +
>>> +out:
>>> +   vfio_group_put(group);
>>> +   return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
>>> +
>>> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault 
>>> *fault)
>>> +{
>>> +   struct vfio_device *device = dev_get_drvdata(dev);
>>> +
>>> +   if (unlikely(!device->ops->transfer))
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   return device-

Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Mon, 24 May 2021 21:11:11 +0800
> Shenming Lu  wrote:
> 
>> On 2021/5/21 15:59, Shenming Lu wrote:
>>> On 2021/5/19 2:58, Alex Williamson wrote:  
>>>> On Fri, 9 Apr 2021 11:44:20 +0800
>>>> Shenming Lu  wrote:
>>>>  
>>>>> To set up nested mode, drivers such as vfio_pci need to register a
>>>>> handler to receive stage/level 1 faults from the IOMMU, but since
>>>>> currently each device can only have one iommu dev fault handler,
>>>>> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
>>>>> we choose to update the registered handler (a consolidated one) via
>>>>> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
>>>>> stage 1 faults in the handler to the guest through a newly added
>>>>> vfio_device_ops callback.  
>>>>
>>>> Are there proposed in-kernel drivers that would use any of these
>>>> symbols?  
>>>
>>> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
>>> use these symbols to consolidate the two page fault handlers into one.
>>>
>>> [1] 
>>> https://patchwork.kernel.org/project/kvm/cover/2021044659.15051-1-eric.au...@redhat.com/
>>>   
>>>>  
>>>>> Signed-off-by: Shenming Lu 
>>>>> ---
>>>>>  drivers/vfio/vfio.c | 81 +
>>>>>  drivers/vfio/vfio_iommu_type1.c | 49 +++-
>>>>>  include/linux/vfio.h| 12 +
>>>>>  3 files changed, 141 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>>> index 44c8dfabf7de..4245f15914bf 100644
>>>>> --- a/drivers/vfio/vfio.c
>>>>> +++ b/drivers/vfio/vfio.c
>>>>> @@ -2356,6 +2356,87 @@ struct iommu_domain 
>>>>> *vfio_group_iommu_domain(struct vfio_group *group)
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>>>>>  
>>>>> +/*
>>>>> + * Register/Update the VFIO IOPF handler to receive
>>>>> + * nested stage/level 1 faults.
>>>>> + */
>>>>> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
>>>>> +{
>>>>> + struct vfio_container *container;
>>>>> + struct vfio_group *group;
>>>>> + struct vfio_iommu_driver *driver;
>>>>> + int ret;
>>>>> +
>>>>> + if (!dev)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + group = vfio_group_get_from_dev(dev);
>>>>> + if (!group)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + ret = vfio_group_add_container_user(group);
>>>>> + if (ret)
>>>>> + goto out;
>>>>> +
>>>>> + container = group->container;
>>>>> + driver = container->iommu_driver;
>>>>> + if (likely(driver && driver->ops->register_handler))
>>>>> + ret = driver->ops->register_handler(container->iommu_data, dev);
>>>>> + else
>>>>> + ret = -ENOTTY;
>>>>> +
>>>>> + vfio_group_try_dissolve_container(group);
>>>>> +
>>>>> +out:
>>>>> + vfio_group_put(group);
>>>>> + return ret;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
>>>>> +
>>>>> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
>>>>> +{
>>>>> + struct vfio_container *container;
>>>>> + struct vfio_group *group;
>>>>> + struct vfio_iommu_driver *driver;
>>>>> + int ret;
>>>>> +
>>>>> + if (!dev)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + group = vfio_group_get_from_dev(dev);
>>>>> + if (!group)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + ret = vfio_group_add_container_user(group);
>>>>> + if (ret)
>>>>> + goto out;
>>>>> +
>>>>> + container = group->container;
>>>>> + driver = container->iommu_driver;
>>>>> + if (likely(driver && driver->ops->unregi

Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:25 +0800
> Shenming Lu  wrote:
> 
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:17 +0800
>>> Shenming Lu  wrote:
>>>   
>>>> Since enabling IOPF for devices may lead to a slow ramp up of performance,
>>>> we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
>>>> IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
>>>> registering the VFIO IOPF handler.
>>>>
>>>> Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
>>>> inflight page faults when disabling.
>>>>
>>>> Signed-off-by: Shenming Lu 
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 223 +++-
>>>>  include/uapi/linux/vfio.h   |   6 +
>>>>  2 files changed, 226 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>>>> b/drivers/vfio/vfio_iommu_type1.c
>>>> index 01e296c6dc9e..7df5711e743a 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -71,6 +71,7 @@ struct vfio_iommu {
>>>>struct rb_root  dma_list;
>>>>struct blocking_notifier_head notifier;
>>>>struct mmu_notifier mn;
>>>> +  struct mm_struct*mm;  
>>>
>>> We currently have no requirement that a single mm is used for all
>>> container mappings.  Does enabling IOPF impose such a requirement?
>>> Shouldn't MAP/UNMAP enforce that?  
>>
>> Did you mean that there is a possibility that each vfio_dma in a
>> container has a different mm? If so, could we register one MMU
>> notifier for each vfio_dma in the MAP ioctl?
> 
> We don't prevent it currently.  There needs to be some balance,
> typically we'll have one mm, so would it make sense to have potentially
> thousands of mmu notifiers registered against the same mm?

If we have one MMU notifier per vfio_dma, there is no need to search the
iommu->dma_list when receiving an address invalidation event in
mn_invalidate_range() (a rough sketch of that variant is below).
Or could we divide the vfio_dma ranges into groups by mm, so that each
group has one MMU notifier?
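
To make that concrete, here is only a minimal sketch of the per-vfio_dma
variant (it is not part of the posted series; the back-pointer fields and
example_unmap_iopf_range() are hypothetical): the notifier is embedded next
to the vfio_dma, so the callback recovers it with container_of() and never
has to walk iommu->dma_list:

#include <linux/mmu_notifier.h>

struct example_dma_notifier {
        struct mmu_notifier     mn;
        struct vfio_iommu       *iommu;         /* hypothetical back-pointers */
        struct vfio_dma         *dma;
};

static void example_mn_invalidate_range(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start, unsigned long end)
{
        struct example_dma_notifier *n =
                container_of(mn, struct example_dma_notifier, mn);

        /* Only this vfio_dma can be affected: unmap the IOPF-mapped pages
         * overlapping [start, end) without searching iommu->dma_list. */
        example_unmap_iopf_range(n->iommu, n->dma, start, end);
}

static const struct mmu_notifier_ops example_mn_ops = {
        .invalidate_range = example_mn_invalidate_range,
};

/* Called from the MAP_DMA path, mm being the mapping task's mm. */
static int example_register_dma_notifier(struct example_dma_notifier *n,
                                         struct mm_struct *mm)
{
        n->mn.ops = &example_mn_ops;
        return mmu_notifier_register(&n->mn, mm);
}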

>  
>>>>unsigned intdma_avail;
>>>>unsigned intvaddr_invalid_count;
>>>>uint64_tpgsize_bitmap;
>>>> @@ -81,6 +82,7 @@ struct vfio_iommu {
>>>>booldirty_page_tracking;
>>>>boolpinned_page_dirty_scope;
>>>>boolcontainer_open;
>>>> +  booliopf_enabled;
>>>>  };
>>>>  
>>>>  struct vfio_domain {
>>>> @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
>>>>return node ? iopf_group : NULL;
>>>>  }
>>>>  
>>>> +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
>>>> +{
>>>> +  struct rb_node **link, *parent = NULL;
>>>> +  struct vfio_iopf_group *iopf_group;
>>>> +
>>>> +  mutex_lock(&iopf_group_list_lock);
>>>> +
>>>> +  link = &iopf_group_list.rb_node;
>>>> +
>>>> +  while (*link) {
>>>> +  parent = *link;
>>>> +  iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
>>>> +
>>>> +  if (new->iommu_group < iopf_group->iommu_group)
>>>> +  link = &(*link)->rb_left;
>>>> +  else
>>>> +  link = &(*link)->rb_right;
>>>> +  }
>>>> +
>>>> +  rb_link_node(&new->node, parent, link);
>>>> +  rb_insert_color(&new->node, &iopf_group_list);
>>>> +
>>>> +  mutex_unlock(&iopf_group_list_lock);
>>>> +}
>>>> +
>>>> +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
>>>> +{
>>>> +  mutex_lock(&iopf_group_list_lock);
>>>> +  rb_erase(&old->node, &iopf_group_list);
>>>> +  mutex_unlock(&iopf_group_list_lock);
>>>> +}
>>>> +
>>>>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>>>>  {
>>>>struct mm_struct *mm;
>>>> @@ -2363,6 +2397,68 @@ static void vfio_iommu_i

Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:52 +0800
> Shenming Lu  wrote:
> 
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:14 +0800
>>> Shenming Lu  wrote:
>>>   
>>>> VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
>>>> for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
>>>> serve the reported page faults from the IOMMU driver.
>>>>
>>>> Signed-off-by: Shenming Lu 
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 114 
>>>>  1 file changed, 114 insertions(+)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>>>> b/drivers/vfio/vfio_iommu_type1.c
>>>> index 45cbfd4879a5..ab0ff60ee207 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -101,6 +101,7 @@ struct vfio_dma {
>>>>struct task_struct  *task;
>>>>struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
>>>>unsigned long   *bitmap;
>>>> +  unsigned long   *iopf_mapped_bitmap;
>>>>  };
>>>>  
>>>>  struct vfio_batch {
>>>> @@ -141,6 +142,16 @@ struct vfio_regions {
>>>>size_t len;
>>>>  };
>>>>  
>>>> +/* A global IOPF enabled group list */
>>>> +static struct rb_root iopf_group_list = RB_ROOT;
>>>> +static DEFINE_MUTEX(iopf_group_list_lock);
>>>> +
>>>> +struct vfio_iopf_group {
>>>> +  struct rb_node  node;
>>>> +  struct iommu_group  *iommu_group;
>>>> +  struct vfio_iommu   *iommu;
>>>> +};
>>>> +
>>>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)   \
>>>>(!list_empty(&iommu->domain_list))
>>>>  
>>>> @@ -157,6 +168,10 @@ struct vfio_regions {
>>>>  #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX)
>>>>  #define DIRTY_BITMAP_SIZE_MAX  
>>>> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>>>  
>>>> +#define IOPF_MAPPED_BITMAP_GET(dma, i)                                \
>>>> +  ((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]                      \
>>>> +    >> ((i) % BITS_PER_LONG)) & 0x1)
>>>
>>>
>>> Can't we just use test_bit()?  
>>
>> Yeah, we can use it.
>>
>>>
>>>   
>>>> +
>>>>  #define WAITED 1
>>>>  
>>>>  static int put_pfn(unsigned long pfn, int prot);
>>>> @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma 
>>>> *dma, struct vfio_pfn *vpfn)
>>>>return ret;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Helper functions for iopf_group_list
>>>> + */
>>>> +static struct vfio_iopf_group *
>>>> +vfio_find_iopf_group(struct iommu_group *iommu_group)
>>>> +{
>>>> +  struct vfio_iopf_group *iopf_group;
>>>> +  struct rb_node *node;
>>>> +
>>>> +  mutex_lock(&iopf_group_list_lock);
>>>> +
>>>> +  node = iopf_group_list.rb_node;
>>>> +
>>>> +  while (node) {
>>>> +  iopf_group = rb_entry(node, struct vfio_iopf_group, node);
>>>> +
>>>> +  if (iommu_group < iopf_group->iommu_group)
>>>> +  node = node->rb_left;
>>>> +  else if (iommu_group > iopf_group->iommu_group)
>>>> +  node = node->rb_right;
>>>> +  else
>>>> +  break;
>>>> +  }
>>>> +
>>>> +  mutex_unlock(&iopf_group_list_lock);
>>>> +  return node ? iopf_group : NULL;
>>>> +}  
>>>
>>> This looks like a pretty heavy weight operation per DMA fault.
>>>
>>> I'm also suspicious of this validity of this iopf_group after we've
>>> dropped the locking, the ordering of patches makes this very confusing.  
>>
>> My thought was to include the handling of DMA faults completely in the type1
>> backend by introducing the vfio_iopf_group struct. But it seems that 
>> introducing
>> a struct with an unknown lifecycle causes more problems...
>> I will use the path from vfio-core as in the v2
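
By the way, on the test_bit() point above: the open-coded
IOPF_MAPPED_BITMAP_GET() macro can simply become the generic helpers from
<linux/bitops.h>, roughly like the sketch below (the helper names are just
for illustration):

static bool example_iopf_page_mapped(struct vfio_dma *dma, unsigned long bit)
{
        /* bit is the page index within the vfio_dma range, as in the macro */
        return test_bit(bit, dma->iopf_mapped_bitmap);
}

static void example_iopf_mark_page_mapped(struct vfio_dma *dma, unsigned long bit)
{
        set_bit(bit, dma->iopf_mapped_bitmap);
}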

Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:37:21 +0800
> Shenming Lu  wrote:
> 
>> Hi Alex,
>>
>> On 2021/5/19 2:57, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:12 +0800
>>> Shenming Lu  wrote:
>>>   
>>>> Hi,
>>>>
>>>> Requesting for your comments and suggestions. :-)
>>>>
>>>> The static pinning and mapping problem in VFIO and possible solutions
>>>> have been discussed a lot [1, 2]. One of the solutions is to add I/O
>>>> Page Fault support for VFIO devices. Different from those relatively
>>>> complicated software approaches such as presenting a vIOMMU that provides
>>>> the DMA buffer information (might include para-virtualized optimizations),
>>>> IOPF mainly depends on the hardware faulting capability, such as the PCIe
>>>> PRI extension or Arm SMMU stall model. What's more, the IOPF support in
>>>> the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
>>>> support for VFIO passthrough based on the IOPF part of SVA in this series. 
>>>>  
>>>
>>> The SVA proposals are being reworked to make use of a new IOASID
>>> object, it's not clear to me that this shouldn't also make use of that
>>> work as it does a significant expansion of the type1 IOMMU with fault
>>> handlers that would duplicate new work using that new model.  
>>
>> It seems that the IOASID extension for guest SVA would not affect this 
>> series,
>> will we do any host-guest IOASID translation in the VFIO fault handler?
> 
> Surely it will, we don't currently have any IOMMU fault handling or
> forwarding of IOMMU faults through to the vfio bus driver, both of
> those would be included in an IOASID implementation.  I think Jason's
> vision is to use IOASID to deprecate type1 for all use cases, so even
> if we were to choose to implement IOPF in type1 we should agree on
> common interfaces with IOASID.

Yeah, the guest IOPF (L1) handling may include the host-guest IOASID translation,
which can be placed in the IOASID layer (in fact it could be placed in many
places, such as the vfio-pci driver, since that driver doesn't really handle the
fault event, it just forwards it to the vIOMMU).
But the host IOPF (L2) has no relationship with IOASID at all; it needs to have
knowledge of the vfio_dma ranges.
Could we add the host IOPF support to type1 first (integrated within the MAP
ioctl)? A rough sketch of what such a handler could look like is below.
And we could migrate the generic iommu controls (such as MAP/UNMAP...) from
type1 to IOASID in the future (it seems to be a huge amount of work, and I would
be very happy to help with it)... :-)
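
For reference, below is only a rough sketch of the shape such a type1 handler
could take (it is not the patch as posted: example_dev_to_vfio_iommu() and
example_dev_to_domain() are hypothetical lookups, a single 4K page is handled
per fault, and pin accounting, unmapping and most error paths are omitted):

static int example_type1_iopf_handler(struct iommu_fault *fault, void *cookie)
{
        struct device *dev = cookie;
        struct vfio_iommu *iommu = example_dev_to_vfio_iommu(dev); /* hypothetical */
        struct iommu_domain *domain = example_dev_to_domain(dev);  /* hypothetical */
        struct iommu_page_response resp = {
                .version = IOMMU_PAGE_RESP_VERSION_1,
                .grpid   = fault->prm.grpid,
                .code    = IOMMU_PAGE_RESP_INVALID,
        };
        dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
        struct vfio_dma *dma;
        unsigned long vaddr;
        struct page *page;
        long pinned;

        if (fault->type != IOMMU_FAULT_PAGE_REQ)
                return -EOPNOTSUPP;

        mutex_lock(&iommu->lock);

        /* The handler only needs to know the vfio_dma ranges. */
        dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
        if (!dma)
                goto out_unlock;

        vaddr = iova - dma->iova + dma->vaddr;

        /* Assumes iommu->mm (introduced earlier in this series) stays valid. */
        mmap_read_lock(iommu->mm);
        pinned = pin_user_pages_remote(iommu->mm, vaddr, 1,
                                       FOLL_WRITE | FOLL_LONGTERM,
                                       &page, NULL, NULL);
        mmap_read_unlock(iommu->mm);
        if (pinned != 1)
                goto out_unlock;

        if (!iommu_map(domain, iova, page_to_phys(page), PAGE_SIZE,
                       IOMMU_READ | IOMMU_WRITE))
                resp.code = IOMMU_PAGE_RESP_SUCCESS;

out_unlock:
        mutex_unlock(&iommu->lock);

        if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID) {
                resp.flags = IOMMU_PAGE_RESP_PASID_VALID;
                resp.pasid = fault->prm.pasid;
        }
        if (fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)
                iommu_page_response(dev, &resp);

        return 0;
}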

>  
>>>> We have measured its performance with UADK [4] (passthrough an accelerator
>>>> to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):
>>>>
>>>> Run hisi_sec_test...
>>>>  - with varying sending times and message lengths
>>>>  - with/without IOPF enabled (speed slowdown)
>>>>
>>>> when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
>>>> slowdown (num of faults)
>>>>  times   VFIO IOPF      host SVA
>>>>  1       63.4% (518)    82.8% (512)
>>>>  100     22.9% (1058)   47.9% (1024)
>>>>  1000    2.6% (1071)    8.5% (1024)
>>>>
>>>> when msg_len = 10MB (and PREMAP_LEN = 512):
>>>> slowdown (num of faults)
>>>>  times   VFIO IOPF
>>>>  1       32.6% (13)
>>>>  100     3.5% (26)
>>>>  1000    1.6% (26)
>>>
>>> It seems like this is only an example that you can make a benchmark
>>> show anything you want.  The best results would be to pre-map
>>> everything, which is what we have without this series.  What is an
>>> acceptable overhead to incur to avoid page pinning?  What if userspace
>>> had more fine grained control over which mappings were available for
>>> faulting and which were statically mapped?  I don't really see what
>>> sense the pre-mapping range makes.  If I assume the user is QEMU in a
>>> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
>>> doesn't make any logical sense relative to device DMA.  
>>
>> As you said in Patch 4, we can introduce full end-to-end functionality
>> before trying to improve performance, and I will drop the pre-mapping patch
>> in the current stage...
>>
>> Is there a need that userspace wants more fine grained control over which
>> mappings are available for faulting? If so, we 

Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Shenming Lu
On 2021/6/1 10:36, Jason Wang wrote:
> 
> On 2021/5/31 4:41 PM, Liu Yi L wrote:
>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>> hardware nesting. Or is there way to detect the capability before?
>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>> is not able to support nesting, then should fail it.
>>
>>> I think GET_INFO only works after the ATTACH.
>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>> gpa_ioasid and check if nesting is supported or not. right?
> 
> 
> Some more questions:
> 
> 1) Is the handle returned by IOASID_ALLOC an fd?
> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. 
> (This is the question that is not answered) and what happens if we call 
> GET_INFO for the ioasid_fd?
> 3) If not, how GET_INFO work?

It seems that the return value from IOASID_ALLOC is an IOASID number in the
ioasid_data struct. Then, when calling GET_INFO, we should pass this IOASID
number to get the attributes of the associated I/O address space (which depend
on the physical IOMMU, and could be discovered when attaching a device to the
IOASID fd or number), right?

Thanks,
Shenming

Re: [RFC] /dev/ioasid uAPI proposal

2021-05-31 Thread Shenming Lu
On 2021/5/27 15:58, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic 
> to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/ioasid and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
>   
> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 
> 
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
> 

[..]

> 
> /*
>   * Page fault report and response
>   *
>   * This is TBD. Can be added after other parts are cleared up. Likely it 
>   * will be a ring buffer shared between user/kernel, an eventfd to notify 
>   * the user and an ioctl to complete the fault.
>   *
>   * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>   */

Hi,

It seems that the ioasid has different usages in different situations: it could
be used directly for the physical routing, or just be a virtual handle that
indicates a page table or a vPASID table (such as the GPA address space; in the
simple passthrough case, the DMA input to the IOMMU will just contain a Stream
ID, no Substream ID), right?

And Baolu suggested that since one device might consume multiple page tables,
it's more reasonable to have one fault handler per page table. If so, do we
have to maintain such an ioasid info list in the IOMMU layer?

Then, if we add host IOPF support (for the GPA address space) in the future
(I have sent a series for this, but it was aimed at VFIO; I will convert it to
IOASID later [1] :-)), how could we find the handler for a received fault
event which only contains a Stream ID... Do we also have to maintain a
dev(vPASID)->ioasid mapping in the IOMMU layer?

[1] https://lore.kernel.org/patchwork/cover/1410223/

Thanks,
Shenming


Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-05-31 Thread Shenming Lu
On 2021/5/27 19:18, Lu Baolu wrote:
> Hi Shenming and Alex,
> 
> On 5/27/21 7:03 PM, Shenming Lu wrote:
>>> I haven't fully read all the references, but who imposes the fact that
>>> there's only one fault handler per device?  If type1 could register one
>>> handler and the vfio-pci bus driver another for the other level, would
>>> we need this path through vfio-core?
>> If we could register more than one handler per device, things would become
>> much more simple, and the path through vfio-core would not be needed.
>>
>> Hi Baolu,
>> Is there any restriction for having more than one handler per device?
>>
> 
> Currently, each device could only have one fault handler. But one device
> might consume multiple page tables. From this point of view, it's more
> reasonable to have one handler per page table.

Sounds good to me. I have pointed it out in the IOASID uAPI proposal. :-)
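
For reference, the one-handler-per-device limitation simply comes from the
current iommu core API: iommu_register_device_fault_handler() stores a single
handler in the device's fault params, and a second registration for the same
device returns -EBUSY. A minimal sketch of how a consumer uses it today:

static int example_fault_handler(struct iommu_fault *fault, void *data)
{
        /* Being the only handler for the device, it would have to
         * demultiplex on fault->prm.pasid to find the right page table. */
        return 0;
}

static int example_enable_iopf(struct device *dev, void *cookie)
{
        /* A later call for the same dev (by another consumer) gets -EBUSY. */
        return iommu_register_device_fault_handler(dev, example_fault_handler,
                                                   cookie);
}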

Thanks,
Shenming

> 
> Best regards,
> baolu
> .

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Shenming Lu
On 2021/6/1 13:10, Lu Baolu wrote:
> Hi Shenming,
> 
> On 6/1/21 12:31 PM, Shenming Lu wrote:
>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>> etc.) are expected to use this interface instead of creating their own 
>>> logic to
>>> isolate untrusted device DMAs initiated by userspace.
>>>
>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>> with VFIO as example in typical usages. The driver-facing kernel API 
>>> provided
>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>> made on this uAPI.
>>>
>>> It's based on a lengthy discussion starting from here:
>>> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
>>>
>>> It ends up to be a long writing due to many things to be summarized and
>>> non-trivial effort required to connect them into a complete proposal.
>>> Hope it provides a clean base to converge.
>>>
>>
>> [..]
>>
>>>
>>> /*
>>>    * Page fault report and response
>>>    *
>>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>    * the user and an ioctl to complete the fault.
>>>    *
>>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>    */
>>
>> Hi,
>>
>> It seems that the ioasid has different usage in different situation, it could
>> be directly used in the physical routing, or just a virtual handle that 
>> indicates
>> a page table or a vPASID table (such as the GPA address space, in the simple
>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>> Substream ID), right?
>>
>> And Baolu suggested that since one device might consume multiple page tables,
>> it's more reasonable to have one fault handler per page table. By this, do we
>> have to maintain such an ioasid info list in the IOMMU layer?
> 
> As discussed earlier, the I/O page fault and cache invalidation paths
> will have "device labels" so that the information could be easily
> translated and routed.
> 
> So it's likely the per-device fault handler registering API in iommu
> core can be kept, but /dev/ioasid will be grown with a layer to
> translate and propagate I/O page fault information to the right
> consumers.

Yeah, having a general preprocessing of the faults in IOASID seems to be
a doable direction. But since there may be more than one consumer at the
same time, who is responsible for registering the per-device fault handler?

Thanks,
Shenming

> 
> If things evolve in this way, probably the SVA I/O page fault also needs
> to be ported to /dev/ioasid.
> 
>>
>> Then if we add host IOPF support (for the GPA address space) in the future
>> (I have sent a series for this but it aimed for VFIO, I will convert it for
>> IOASID later [1] :-)), how could we find the handler for the received fault
>> event which only contains a Stream ID... Do we also have to maintain a
>> dev(vPASID)->ioasid mapping in the IOMMU layer?
>>
>> [1] https://lore.kernel.org/patchwork/cover/1410223/
> 
> Best regards,
> baolu
> .

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Shenming Lu
On 2021/6/1 20:30, Lu Baolu wrote:
> On 2021/6/1 15:15, Shenming Lu wrote:
>> On 2021/6/1 13:10, Lu Baolu wrote:
>>> Hi Shenming,
>>>
>>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>>> etc.) are expected to use this interface instead of creating their own 
>>>>> logic to
>>>>> isolate untrusted device DMAs initiated by userspace.
>>>>>
>>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>>> with VFIO as example in typical usages. The driver-facing kernel API 
>>>>> provided
>>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>>> made on this uAPI.
>>>>>
>>>>> It's based on a lengthy discussion starting from here:
>>>>>  
>>>>> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
>>>>>
>>>>> It ends up to be a long writing due to many things to be summarized and
>>>>> non-trivial effort required to connect them into a complete proposal.
>>>>> Hope it provides a clean base to converge.
>>>>>
>>>> [..]
>>>>
>>>>> /*
>>>>>     * Page fault report and response
>>>>>     *
>>>>>     * This is TBD. Can be added after other parts are cleared up. Likely 
>>>>> it
>>>>>     * will be a ring buffer shared between user/kernel, an eventfd to 
>>>>> notify
>>>>>     * the user and an ioctl to complete the fault.
>>>>>     *
>>>>>     * The fault data is per I/O address space, i.e.: IOASID + 
>>>>> faulting_addr
>>>>>     */
>>>> Hi,
>>>>
>>>> It seems that the ioasid has different usage in different situation, it 
>>>> could
>>>> be directly used in the physical routing, or just a virtual handle that 
>>>> indicates
>>>> a page table or a vPASID table (such as the GPA address space, in the 
>>>> simple
>>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>>> Substream ID), right?
>>>>
>>>> And Baolu suggested that since one device might consume multiple page 
>>>> tables,
>>>> it's more reasonable to have one fault handler per page table. By this, do 
>>>> we
>>>> have to maintain such an ioasid info list in the IOMMU layer?
>>> As discussed earlier, the I/O page fault and cache invalidation paths
>>> will have "device labels" so that the information could be easily
>>> translated and routed.
>>>
>>> So it's likely the per-device fault handler registering API in iommu
>>> core can be kept, but /dev/ioasid will be grown with a layer to
>>> translate and propagate I/O page fault information to the right
>>> consumers.
>> Yeah, having a general preprocessing of the faults in IOASID seems to be
>> a doable direction. But since there may be more than one consumer at the
>> same time, who is responsible for registering the per-device fault handler?
> 
> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. This is just a top level thought. Haven't gone
> through the details yet. Need to wait and see what /dev/ioasid finally
> looks like.

OK. And it needs to be confirmed with Jean since we might migrate the code from
io-pgfault.c to IOASID... Anyway, let's finalize /dev/ioasid first.  Thanks,

Shenming

> 
> Best regards,
> baolu
> .

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-01 Thread Shenming Lu
On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> 
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults. 
> 
> I'm still confused why drivers need fault handlers at all?

Essentially it is the userspace that needs the fault handlers: one case is to
deliver the faults to the vIOMMU, and another is to enable IOPF on the GPA
address space for on-demand paging. It seems that both could be specified
in/through the IOASID_ALLOC ioctl? (A purely illustrative sketch of what that
could look like follows.)
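
Purely for illustration (nothing below exists in the proposal, every name is
made up for the sketch), the two cases might be expressed at allocation time
roughly like:

#include <stdint.h>

/* Hypothetical flags: either the kernel faults pages in itself (on-demand
 * paging of the GPA space), or faults are delivered to userspace through
 * an eventfd/fault region so the vIOMMU can inject them. */
#define EXAMPLE_IOASID_IOPF_KERNEL      (1u << 0)
#define EXAMPLE_IOASID_IOPF_USER        (1u << 1)

struct example_ioasid_alloc {
        uint32_t argsz;
        uint32_t iopf_flags;    /* one of EXAMPLE_IOASID_IOPF_* */
        int32_t  iopf_eventfd;  /* only used with EXAMPLE_IOASID_IOPF_USER */
        uint32_t out_ioasid;    /* IOASID number returned by the kernel */
};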

Thanks,
Shenming



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-03 Thread Shenming Lu
On 2021/6/4 2:19, Jacob Pan wrote:
> Hi Shenming,
> 
> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu 
> wrote:
> 
>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>   
>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>> will then register itself to iommu core to listen and route the per-
>>>> device I/O page faults.   
>>>
>>> I'm still confused why drivers need fault handlers at all?  
>>
>> Essentially it is the userspace that needs the fault handlers,
>> one case is to deliver the faults to the vIOMMU, and another
>> case is to enable IOPF on the GPA address space for on-demand
>> paging, it seems that both could be specified in/through the
>> IOASID_ALLOC ioctl?
>>
> I would think IOASID_BIND_PGTABLE is where fault handler should be
> registered. There wouldn't be any IO page fault without the binding anyway.

Yeah, I also proposed this before; registering the handler in the BIND_PGTABLE
ioctl does make sense for the guest page faults. :-)

But how about the page faults from the GPA address space (its page table is
mapped through the MAP_DMA ioctl)? From your point of view, it seems that we
should register the handler for the GPA address space in the (first) MAP_DMA
ioctl.

> 
> I also don't understand why device drivers should register the fault
> handler, the fault is detected by the pIOMMU and injected to the vIOMMU. So
> I think it should be the IOASID itself register the handler.

Yeah, and it can also be said that the provider of the page table registers the
handler (Baolu).

Thanks,
Shenming



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Shenming Lu
On 2021/6/7 20:19, Liu, Yi L wrote:
>> From: Shenming Lu 
>> Sent: Friday, June 4, 2021 10:03 AM
>>
>> On 2021/6/4 2:19, Jacob Pan wrote:
>>> Hi Shenming,
>>>
>>> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
>> 
>>> wrote:
>>>
>>>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>>>
>>>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>>>> will then register itself to iommu core to listen and route the per-
>>>>>> device I/O page faults.
>>>>>
>>>>> I'm still confused why drivers need fault handlers at all?
>>>>
>>>> Essentially it is the userspace that needs the fault handlers,
>>>> one case is to deliver the faults to the vIOMMU, and another
>>>> case is to enable IOPF on the GPA address space for on-demand
>>>> paging, it seems that both could be specified in/through the
>>>> IOASID_ALLOC ioctl?
>>>>
>>> I would think IOASID_BIND_PGTABLE is where fault handler should be
>>> registered. There wouldn't be any IO page fault without the binding
>> anyway.
>>
>> Yeah, I also proposed this before, registering the handler in the
>> BIND_PGTABLE
>> ioctl does make sense for the guest page faults. :-)
>>
>> But how about the page faults from the GPA address space (it's page table is
>> mapped through the MAP_DMA ioctl)? From your point of view, it seems
>> that we should register the handler for the GPA address space in the (first)
>> MAP_DMA ioctl.
> 
> under new proposal, I think the page fault handler is also registered
> per ioasid object. The difference compared with guest page table case
> is there is no need to inject the fault to VM.

Yeah. And there are some issues specific to the GPA address space case
which have been discussed with Alex...  Thanks,

Shenming


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-14 Thread Shenming Lu
On 2021/7/9 15:48, Tian, Kevin wrote:
> 4.6. I/O page fault
> +++
> 
> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards. This flow assumes that I/O page faults
> are reported via IOMMU interrupts. Some devices report faults via device
> specific way instead of going through the IOMMU. That usage is not covered
> here:
> 
> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid, 
> pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
> {rid, pasid} and calls the corresponding fault handler with an opaque
> object (registered by the handler) and raw fault_data (rid, pasid, addr);
> 
> -   IOASID fault handler identifies the corresponding ioasid and device 
> cookie according to the opaque object, generates an user fault_data 
> (ioasid, cookie, addr) in the fault region, and triggers eventfd to 
> userspace;
> 

Hi, I have some doubts here:

For mdev, it seems that the rid in the raw fault_data is the parent device's.
Then, in the vSVA scenario, how can we know which mdev (cookie) the fault
belongs to from the rid and pasid?

And from this point of view, would it be better to register the mdev
(iommu_register_device()) along with the parent device info?

Thanks,
Shenming

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-14 Thread Shenming Lu
On 2021/7/15 11:55, Tian, Kevin wrote:
>> From: Shenming Lu 
>> Sent: Thursday, July 15, 2021 11:21 AM
>>
>> On 2021/7/9 15:48, Tian, Kevin wrote:
>>> 4.6. I/O page fault
>>> +++
>>>
>>> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
>>> to guest IOMMU driver and backwards. This flow assumes that I/O page
>> faults
>>> are reported via IOMMU interrupts. Some devices report faults via device
>>> specific way instead of going through the IOMMU. That usage is not
>> covered
>>> here:
>>>
>>> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
>>> pasid, addr};
>>>
>>> -   Host IOMMU driver identifies the faulting I/O page table according to
>>> {rid, pasid} and calls the corresponding fault handler with an opaque
>>> object (registered by the handler) and raw fault_data (rid, pasid, 
>>> addr);
>>>
>>> -   IOASID fault handler identifies the corresponding ioasid and device
>>> cookie according to the opaque object, generates an user fault_data
>>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
>>> userspace;
>>>
>>
>> Hi, I have some doubts here:
>>
>> For mdev, it seems that the rid in the raw fault_data is the parent device's,
>> then in the vSVA scenario, how can we get to know the mdev(cookie) from
>> the
>> rid and pasid?
>>
>> And from this point of view,would it be better to register the mdev
>> (iommu_register_device()) with the parent device info?
>>
> 
> This is what is proposed in this RFC. A successful binding generates a new
> iommu_dev object for each vfio device. For mdev this object includes 
> its parent device, the defPASID marking this mdev, and the cookie 
> representing it in userspace. Later it is iommu_dev being recorded in
> the attaching_data when the mdev is attached to an IOASID:
> 
>   struct iommu_attach_data *__iommu_device_attach(
>   struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
> 
> Then when a fault is reported, the fault handler just needs to figure out 
> iommu_dev according to {rid, pasid} in the raw fault data.
> 

Yeah, we have the defPASID that marks the mdev and refers to the default
I/O address space, but how about the non-default I/O address spaces?
Could there be a case where two different mdevs (on the same parent device)
are used by the same process in the guest, and thus have the same pasid route
in the physical IOMMU? It seems that we can't figure out the mdev from
the rid and pasid in that case...

Did I misunderstand something?... :-)

Thanks,
Shenming

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Shenming Lu
On 2021/7/15 14:49, Tian, Kevin wrote:
>> From: Shenming Lu 
>> Sent: Thursday, July 15, 2021 2:29 PM
>>
>> On 2021/7/15 11:55, Tian, Kevin wrote:
>>>> From: Shenming Lu 
>>>> Sent: Thursday, July 15, 2021 11:21 AM
>>>>
>>>> On 2021/7/9 15:48, Tian, Kevin wrote:
>>>>> 4.6. I/O page fault
>>>>> +++
>>>>>
>>>>> uAPI is TBD. Here is just about the high-level flow from host IOMMU
>> driver
>>>>> to guest IOMMU driver and backwards. This flow assumes that I/O page
>>>> faults
>>>>> are reported via IOMMU interrupts. Some devices report faults via
>> device
>>>>> specific way instead of going through the IOMMU. That usage is not
>>>> covered
>>>>> here:
>>>>>
>>>>> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
>>>>> pasid, addr};
>>>>>
>>>>> -   Host IOMMU driver identifies the faulting I/O page table according to
>>>>> {rid, pasid} and calls the corresponding fault handler with an opaque
>>>>> object (registered by the handler) and raw fault_data (rid, pasid, 
>>>>> addr);
>>>>>
>>>>> -   IOASID fault handler identifies the corresponding ioasid and device
>>>>> cookie according to the opaque object, generates an user fault_data
>>>>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
>>>>> userspace;
>>>>>
>>>>
>>>> Hi, I have some doubts here:
>>>>
>>>> For mdev, it seems that the rid in the raw fault_data is the parent 
>>>> device's,
>>>> then in the vSVA scenario, how can we get to know the mdev(cookie) from
>>>> the
>>>> rid and pasid?
>>>>
>>>> And from this point of view,would it be better to register the mdev
>>>> (iommu_register_device()) with the parent device info?
>>>>
>>>
>>> This is what is proposed in this RFC. A successful binding generates a new
>>> iommu_dev object for each vfio device. For mdev this object includes
>>> its parent device, the defPASID marking this mdev, and the cookie
>>> representing it in userspace. Later it is iommu_dev being recorded in
>>> the attaching_data when the mdev is attached to an IOASID:
>>>
>>> struct iommu_attach_data *__iommu_device_attach(
>>> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
>>>
>>> Then when a fault is reported, the fault handler just needs to figure out
>>> iommu_dev according to {rid, pasid} in the raw fault data.
>>>
>>
>> Yeah, we have the defPASID that marks the mdev and refers to the default
>> I/O address space, but how about the non-default I/O address spaces?
>> Is there a case that two different mdevs (on the same parent device)
>> are used by the same process in the guest, thus have a same pasid route
>> in the physical IOMMU? It seems that we can't figure out the mdev from
>> the rid and pasid in this case...
>>
>> Did I misunderstand something?... :-)
>>
> 
> No. You are right on this case. I don't think there is a way to 
> differentiate one mdev from the other if they come from the
> same parent and attached by the same guest process. In this
> case the fault could be reported on either mdev (e.g. the first
> matching one) to get it fixed in the guest.
> 

OK. Thanks,

Shenming

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-16 Thread Shenming Lu
On 2021/7/16 9:20, Tian, Kevin wrote:
 > To summarize, for vIOMMU we can work with the spec owner to
> define a proper interface to feedback such restriction into the guest 
> if necessary. For the kernel part, it's clear that IOMMU fd should 
> disallow two devices attached to a single [RID] or [RID, PASID] slot 
> in the first place.
> 
> Then the next question is how to communicate such restriction
> to the userspace. It sounds like a group, but different in concept.
> An iommu group describes the minimal isolation boundary thus all
> devices in the group can be only assigned to a single user. But this
> case is opposite - the two mdevs (both support ENQCMD submission)
> with the same parent have problem when assigned to a single VM 
> (in this case vPASID is vm-wide translated thus a same pPASID will be 
> used cross both mdevs) while they instead work pretty well when 
> assigned to different VMs (completely different vPASID spaces thus 
> different pPASIDs).
> 
> One thought is to have vfio device driver deal with it. In this proposal
> it is the vfio device driver to define the PASID virtualization policy and
> report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> the restriction thus could just hide the vPASID capability when the user 
> calls GET_INFO on the 2nd mdev in above scenario. In this way the 
> user even doesn't need to know such restriction at all and both mdevs
> can be assigned to a single VM w/o any problem.
> 

The restriction probably only applies when the two mdevs are assigned to one VM,
and there is no need to hide anything in the other cases. So how could the vfio
device driver learn this info in time to accurately hide the vPASID capability
for the 2nd mdev at VFIO_DEVICE_GET_INFO?

Thanks,
Shenming