Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-29 Thread Yi Liu

On 2022/3/24 06:51, Alex Williamson wrote:

On Fri, 18 Mar 2022 14:27:36 -0300
Jason Gunthorpe  wrote:


iommufd can directly implement the /dev/vfio/vfio container IOCTLs by
mapping them into io_pagetable operations. Doing so allows the use of
iommufd by symlinking /dev/vfio/vfio to /dev/iommufd. Allowing VFIO to
SET_CONTAINER using an iommufd instead of a container fd is a follow-up
series.

Internally the compatibility API uses a normal IOAS object that, like
vfio, is automatically allocated when the first device is
attached.

Userspace can also query or set this IOAS object directly using the
IOMMU_VFIO_IOAS ioctl. This allows mixing and matching new iommufd only
features while still using the VFIO style map/unmap ioctls.

While this is enough to operate qemu, it is still a bit of a WIP with a
few gaps to be resolved:

  - Only the TYPE1v2 mode is supported where unmap cannot punch holes or
split areas. The old mode can be implemented with a new operation to
split an iopt_area into two without disturbing the iopt_pages or the
domains, then unmapping a whole area as normal.

  - Resource limits rely on memory cgroups to bound what userspace can do
instead of the module parameter dma_entry_limit.

  - VFIO P2P is not implemented. Avoiding the follow_pfn() mis-design will
require some additional work to properly expose PFN lifecycle between
VFIO and iommufd.

  - Various components of the mdev API are not completed yet

  - Indefinite suspend of SW access (VFIO_DMA_MAP_FLAG_VADDR) is not
implemented.

  - The 'dirty tracking' is not implemented

  - A full audit for pedantic compatibility details (e.g. errnos, etc.) has
not yet been done

  - powerpc SPAPR is left out, as it is not connected to the iommu_domain
framework. My hope is that SPAPR will be moved into the iommu_domain
framework as a special HW specific type and would expect power to
support the generic interface through a normal iommu_domain.


My overall question here would be whether we can actually achieve a
compatibility interface that has sufficient feature transparency that we
can dump vfio code in favor of this interface, or will there be enough
niche use cases that we need to keep type1 and vfio containers around
through a deprecation process?

The locked memory differences for one seem like something that libvirt
wouldn't want hidden and we have questions regarding support for vaddr
hijacking and different ideas how to implement dirty page tracking, not
to mention the missing features that are currently well used, like p2p
mappings, coherency tracking, mdev, etc.

It seems like quite an endeavor to fill all these gaps, while at the
same time QEMU will be working to move to use iommufd directly in order
to gain all the new features.


Hi Alex,

Jason hasn't included the vfio changes for adapting to iommufd, but they are
in this branch
(https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6). Eric and I
are working on adding iommufd support in QEMU as well. To run the new QEMU on
an old kernel, QEMU needs to support both the legacy group/container
interface and the new device/iommufd interface. We've got some draft code in
this direction
(https://github.com/luxis1999/qemu/commits/qemu-for-5.17-rc4-vm). It works
for both the legacy group/container path and the device/iommufd path. It's
just for reference so far; Eric and I will have a further sync on it.



Where do we focus attention?  Is symlinking device files our proposal
to userspace and is that something achievable, or do we want to use
this compatibility interface as a means to test the interface and
allow userspace to make use of it for transition, if their use cases
allow it, perhaps eventually performing the symlink after deprecation
and eventual removal of the vfio container and type1 code?  Thanks,


I'm sure it is possible that one day the group/container interface will be
removed from the kernel. Perhaps this will happen when SPAPR is supported by
iommufd. But what about QEMU? Should QEMU keep backward compatibility
forever, or may it one day also remove the group/container path and hence be
unable to work on old kernels?

--
Regards,
Yi Liu


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-30 Thread Yi Liu
+ * [...] is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+   __u32 size;
+   __u32 ioas_id;
+   __u32 out_num_iovas;
+   __u32 __reserved;
+   struct iommu_valid_iovas {
+   __aligned_u64 start;
+   __aligned_u64 last;
+   } out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ * IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+   IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+   IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+   IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from
+ * @length: Number of bytes to map
+ * @iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is set
+ *        then this must be provided as input.
+ *
+ * Set an IOVA mapping from a user pointer. If FIXED_IOVA is specified then the
+ * mapping will be established at iova, otherwise a suitable location will be
+ * automatically selected and returned in iova.
+ */
+struct iommu_ioas_map {
+   __u32 size;
+   __u32 flags;
+   __u32 ioas_id;
+   __u32 __reserved;
+   __aligned_u64 user_va;
+   __aligned_u64 length;
+   __aligned_u64 iova;
+};
+#define IOMMU_IOAS_MAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_MAP)
+
+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from


So the dst and src ioas_ids are allocated via the same iommufd, right? Just
out of curiosity, do you think it is possible that the src/dst ioas_ids come
from different iommufds? In that case we may need to add src/dst iommufd
fields. It's not needed today, just to see if there is any blocker in the
kernel to supporting such a copy. :-)


+ * @length: Number of bytes to copy and map
+ * @dst_iova: IOVA the mapping was placed at. If IOMMU_IOAS_MAP_FIXED_IOVA is
+ *        set then this must be provided as input.
+ * @src_iova: IOVA to start the copy
+ *
+ * Copy an already existing mapping from src_ioas_id and establish it in
+ * dst_ioas_id. The src iova/length must exactly match a range used with
+ * IOMMU_IOAS_MAP.
+ */
+struct iommu_ioas_copy {
+   __u32 size;
+   __u32 flags;
+   __u32 dst_ioas_id;
+   __u32 src_ioas_id;
+   __aligned_u64 length;
+   __aligned_u64 dst_iova;
+   __aligned_u64 src_iova;
+};
+#define IOMMU_IOAS_COPY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_COPY)
+
+/**
+ * struct iommu_ioas_unmap - ioctl(IOMMU_IOAS_UNMAP)
+ * @size: sizeof(struct iommu_ioas_unmap)
+ * @ioas_id: IOAS ID to change the mapping of
+ * @iova: IOVA to start the unmapping at
+ * @length: Number of bytes to unmap
+ *
+ * Unmap an IOVA range. The iova/length must exactly match a range
+ * used with IOMMU_IOAS_MAP, or be the values 0 & U64_MAX.
+ * In the latter case all IOVAs will be unmapped.
+ */
+struct iommu_ioas_unmap {
+   __u32 size;
+   __u32 ioas_id;
+   __aligned_u64 iova;
+   __aligned_u64 length;
+};
+#define IOMMU_IOAS_UNMAP _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_UNMAP)
  #endif
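
For reference, a minimal userspace sketch of driving these three ioctls end
to end. This is illustrative only: the uapi header name and the
already-opened iommufd and ioas_id are assumptions, and error handling is
trimmed.

#include <stdlib.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/iommufd.h>	/* assumed uapi header name */

static int ioas_example(int iommufd, __u32 ioas_id, void *buf, size_t len)
{
	struct iommu_ioas_iova_ranges *ranges;
	struct iommu_ioas_map map = {};
	struct iommu_ioas_unmap unmap = {};
	/* size must include the allocated flex array */
	size_t sz = sizeof(*ranges) + 8 * sizeof(ranges->out_valid_iovas[0]);
	int rc;

	ranges = calloc(1, sz);
	if (!ranges)
		return -1;
	ranges->size = sz;
	ranges->ioas_id = ioas_id;
	/* out_num_iovas may exceed 8; the array is filled as space permits */
	rc = ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, ranges);
	free(ranges);
	if (rc)
		return rc;

	/* FIXED_IOVA is clear, so the kernel selects the IOVA and returns it */
	map.size = sizeof(map);
	map.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE;
	map.ioas_id = ioas_id;
	map.user_va = (__u64)(uintptr_t)buf;
	map.length = len;
	rc = ioctl(iommufd, IOMMU_IOAS_MAP, &map);
	if (rc)
		return rc;

	/* unmap must exactly match the mapped range (or be 0 & U64_MAX) */
	unmap.size = sizeof(unmap);
	unmap.ioas_id = ioas_id;
	unmap.iova = map.iova;
	unmap.length = len;
	return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
}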


--
Regards,
Yi Liu


Re: [PATCH RFC v2 02/11] iommu: Add iommu_group_singleton_lockdown()

2022-03-31 Thread Yi Liu



On 2022/3/29 16:42, Tian, Kevin wrote:

From: Lu Baolu 
Sent: Tuesday, March 29, 2022 1:38 PM

Some of the interfaces in the IOMMU core require that only a single
kernel device driver controls the device in the IOMMU group. The
existing method is to check the device count in the IOMMU group in
the interfaces. This is unreliable because any device added to the
IOMMU group later breaks this assumption without notifying the driver
using the interface. This adds iommu_group_singleton_lockdown() that
checks the requirement and locks down the IOMMU group with only a single
device driver bound.

Signed-off-by: Lu Baolu 
---
  drivers/iommu/iommu.c | 30 ++
  1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0c42ece25854..9fb8a5b4491e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -48,6 +48,7 @@ struct iommu_group {
struct list_head entry;
unsigned int owner_cnt;
void *owner;
+   bool singleton_lockdown;
  };

  struct group_device {
@@ -968,15 +969,16 @@ void iommu_group_remove_device(struct device *dev)
  }
  EXPORT_SYMBOL_GPL(iommu_group_remove_device);

-static int iommu_group_device_count(struct iommu_group *group)
+/* Callers should hold the group->mutex. */
+static bool iommu_group_singleton_lockdown(struct iommu_group *group)
  {
-   struct group_device *entry;
-   int ret = 0;
-
-   list_for_each_entry(entry, &group->devices, list)
-   ret++;
+   if (group->owner_cnt != 1 ||
+   group->domain != group->default_domain ||
+   group->owner)
+   return false;


Curious why there would be a case where the group uses default_domain
while still having an owner? I have the impression that owner is used
for userspace DMA, where a different domain is used.


+   group->singleton_lockdown = true;

-   return ret;
+   return true;
  }


btw I'm not sure whether this is what SVA requires. IIRC the problem with
SVA is that the PASID TLP prefix is not counted in PCI packet routing, thus
a DMA target address with PASID might be treated as P2P if the address
falls into the MMIO BAR of other devices in the group. This is why the
original code needs to strictly apply SVA in a group containing a single
device, instead of a group attached by a single driver, unless we want to
reserve those MMIO ranges in CPU VA space.

Jean can correct me if my memory is wrong.


I think so. I remember the major gap is that PASID info is not used in
PCI's address-based TLP routing mechanism, so P2P may happen if the address
happens to be device MMIO. Per the commit message from Jean, even for
singleton groups it's not an easy thing, so non-singleton groups are not
considered so far.

"
IOMMU groups with more than one device aren't supported for SVA at the
moment. There may be P2P traffic between devices within a group, which
cannot be seen by an IOMMU (note that supporting PASID doesn't add any
form of isolation with regard to P2P). Supporting groups would require
calling bind() for all bound processes every time a device is added to a
group, to perform sanity checks (e.g. ensure that new devices support
PASIDs at least as big as those already allocated in the group). It also
means making sure that reserved ranges (IOMMU_RESV_*) of all devices are
carved out of processes. This is already tricky with single devices, but
becomes very difficult with groups. Since SVA-capable devices are expected
to be cleanly isolated, and since we don't have any way to test groups or
hot-plug, we only allow singular groups for now.
"

https://lore.kernel.org/kvm/20180511190641.23008-3-jean-philippe.bruc...@arm.com/

--
Regards,
Yi Liu


Re: [PATCH RFC v2 02/11] iommu: Add iommu_group_singleton_lockdown()

2022-03-31 Thread Yi Liu




On 2022/3/29 19:42, Jason Gunthorpe wrote:

On Tue, Mar 29, 2022 at 08:42:13AM +, Tian, Kevin wrote:


btw I'm not sure whether this is what SVA requires. IIRC the problem with
SVA is that the PASID TLP prefix is not counted in PCI packet routing, thus
a DMA target address with PASID might be treated as P2P if the address
falls into the MMIO BAR of other devices in the group. This is why the
original code needs to strictly apply SVA in a group containing a single
device, instead of a group attached by a single driver, unless we want to
reserve those MMIO ranges in CPU VA space.


I think it is not such a good idea to mix up group with this test

Here you want to say that all TLPs from the RID route to the host
bridge - ie ACS is on/etc. This is subtly different from a group with
a single device. Specifically it is an immutable property of the
fabric and doesn't change after hot plug events.


So the group size can be immutable for a specific topology, right? I think
non-multi-function devices plugged behind a PCIe bridge which has ACS
enabled should have their own groups. Under such a topology the group size
should constantly be 1, so we may just enable SVA for such devices.



ie if we have a singleton group that doesn't have ACS and someone
hotplugs in another device on a bridge, then our SVA is completely
broken and we get data corruption.


I think this may be the case of a device plugged into a PCIe-to-PCI bridge,
where another device is then hotplugged behind the same bridge. The group
size is variable there, right? Per my understanding, such a bridge probably
cannot support the PASID prefix at all, hence no SVA support for such
devices.


--
Regards,
Yi Liu


Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-04-01 Thread Yi Liu




On 2022/3/31 20:59, Jason Gunthorpe wrote:

On Wed, Mar 30, 2022 at 09:35:52PM +0800, Yi Liu wrote:


+/**
+ * struct iommu_ioas_copy - ioctl(IOMMU_IOAS_COPY)
+ * @size: sizeof(struct iommu_ioas_copy)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @dst_ioas_id: IOAS ID to change the mapping of
+ * @src_ioas_id: IOAS ID to copy from


So the dst and src ioas_ids are allocated via the same iommufd, right? Just
out of curiosity, do you think it is possible that the src/dst ioas_ids come
from different iommufds? In that case we may need to add src/dst iommufd
fields. It's not needed today, just to see if there is any blocker in the
kernel to supporting such a copy. :-)


Yes, all IDs in all ioctls are within the scope of a single iommufd.

There should be no reason for a single application to open multiple
iommufds.


Then should this be documented?

--
Regards,
Yi Liu


Re: [PATCH RFC v3 02/12] iommu: Add a flag to indicate immutable singleton group

2022-04-11 Thread Yi Liu
[...] connected to the root bridge and it is not an
MFD, do we still need ACS on it? The Intel idxd device seems to be such
a device. I had a quick check with lspci, it has no ACS support.

I probably missed something.


Seems not necessary, per my knowledge.






+
+    /* Filter out devices which have an alias device. */
+    if (pci_for_each_dma_alias(pdev, has_pci_alias, NULL))
+    return false;
+
+    return true;
+}
+
  /**
   * iommu_group_add_device - add a device to an iommu group
   * @group: the group into which to add the device (reference should be held)

@@ -898,6 +941,20 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
  list_add_tail(&device->list, &group->devices);
  if (group->domain  && !iommu_is_attach_deferred(dev))
  ret = __iommu_attach_device(group->domain, dev);
+
+    /*
+ * Use standard PCI bus topology, isolation features, and DMA
+ * alias quirks to set the immutable singleton attribute. If
+ * the device came from DT, assume it is static; the singleton
+ * attribute can then be known from the device count in the group.
+ */
+    if (dev_is_pci(dev))
+    group->immutable_singleton =
+    pci_immutably_isolated(to_pci_dev(dev));
+    else if (is_of_node(dev_fwnode(dev)))
+    group->immutable_singleton =
+    (iommu_group_device_count(group) == 1);
+
  mutex_unlock(&group->mutex);
  if (ret)
  goto err_put_group;
@@ -1290,16 +1347,6 @@ EXPORT_SYMBOL_GPL(iommu_group_id);
  static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
 unsigned long *devfns);

-/*
- * To consider a PCI device isolated, we require ACS to support Source
- * Validation, Request Redirection, Completer Redirection, and Upstream
- * Forwarding.  This effectively means that devices cannot spoof their
- * requester ID, requests and completions cannot be redirected, and all
- * transactions are forwarded upstream, even as it passes through a
- * bridge where the target device is downstream.
- */
-#define REQ_ACS_FLAGS   (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
-
  /*
   * For multifunction devices which are not isolated from each other, find
   * all the other non-isolated functions and look for existing groups.  
For

--
2.25.1




Best regards,
baolu


--
Regards,
Yi Liu

Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-04-13 Thread Yi Liu
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+ struct iommu_group *group,
+ phys_addr_t *sw_msi_start)
+{
+   struct iommu_resv_region *resv;
+   struct iommu_resv_region *tmp;
+   LIST_HEAD(group_resv_regions);
+   int rc;
+
+   down_write(&iopt->iova_rwsem);
+   rc = iommu_get_group_resv_regions(group, &group_resv_regions);
+   if (rc)
+   goto out_unlock;
+
+   list_for_each_entry (resv, &group_resv_regions, list) {
+   if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
+   continue;
+
+   /*
+* The presence of any 'real' MSI regions should take precedence
+* over the software-managed one if the IOMMU driver happens to
+* advertise both types.
+*/
+   if (sw_msi_start && resv->type == IOMMU_RESV_MSI) {
+   *sw_msi_start = 0;
+   sw_msi_start = NULL;
+   }
+   if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI)
+   *sw_msi_start = resv->start;
+
+   rc = iopt_reserve_iova(iopt, resv->start,
+  resv->length - 1 + resv->start, group);
+   if (rc)
+   goto out_reserved;
+   }
+   rc = 0;
+   goto out_free_resv;
+
+out_reserved:
+   iopt_remove_reserved_iova(iopt, group);
+out_free_resv:
+   list_for_each_entry_safe (resv, tmp, &group_resv_regions, list)
+   kfree(resv);
+out_unlock:
+   up_write(&iopt->iova_rwsem);
+   return rc;
+}
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 2f1301d39bba7c..bcf08e61bc87e9 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -9,6 +9,9 @@
  #include 
  #include 
  
+struct iommu_domain;
+struct iommu_group;
+
  /*
   * The IOVA to PFN map. The mapper automatically copies the PFNs into multiple
   * domains and permits sharing of PFNs between io_pagetable instances. This
@@ -27,8 +30,40 @@ struct io_pagetable {
struct rw_semaphore iova_rwsem;
struct rb_root_cached area_itree;
struct rb_root_cached reserved_iova_itree;
+   unsigned long iova_alignment;
  };
  
+int iopt_init_table(struct io_pagetable *iopt);
+void iopt_destroy_table(struct io_pagetable *iopt);
+struct iopt_pages *iopt_get_pages(struct io_pagetable *iopt, unsigned long iova,
+ unsigned long *start_byte,
+ unsigned long length);
+enum { IOPT_ALLOC_IOVA = 1 << 0 };
+int iopt_map_user_pages(struct io_pagetable *iopt, unsigned long *iova,
+   void __user *uptr, unsigned long length, int iommu_prot,
+   unsigned int flags);
+int iopt_map_pages(struct io_pagetable *iopt, struct iopt_pages *pages,
+  unsigned long *dst_iova, unsigned long start_byte,
+  unsigned long length, int iommu_prot, unsigned int flags);
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+   unsigned long length);
+int iopt_unmap_all(struct io_pagetable *iopt);
+
+int iopt_access_pages(struct io_pagetable *iopt, unsigned long iova,
+ unsigned long npages, struct page **out_pages, bool write);
+void iopt_unaccess_pages(struct io_pagetable *iopt, unsigned long iova,
+size_t npages);
+int iopt_table_add_domain(struct io_pagetable *iopt,
+ struct iommu_domain *domain);
+void iopt_table_remove_domain(struct io_pagetable *iopt,
+ struct iommu_domain *domain);
+int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
+ struct iommu_group *group,
+ phys_addr_t *sw_msi_start);
+int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
+ unsigned long last, void *owner);
+void iopt_remove_reserved_iova(struct io_pagetable *iopt, void *owner);
+
  struct iommufd_ctx {
struct file *filp;
struct xarray objects;


--
Regards,
Yi Liu


Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-04-13 Thread Yi Liu

On 2022/4/13 22:36, Jason Gunthorpe wrote:

On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:

+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ *
+ * The requested range must exactly match an existing range.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+   unsigned long length)
+{
+   struct iopt_pages *pages;
+   struct iopt_area *area;
+   unsigned long iova_end;
+   int rc;
+
+   if (!length)
+   return -EINVAL;
+
+   if (check_add_overflow(iova, length - 1, &iova_end))
+   return -EOVERFLOW;
+
+   down_read(&iopt->domains_rwsem);
+   down_write(&iopt->iova_rwsem);
+   area = iopt_find_exact_area(iopt, iova, iova_end);


When testing vIOMMU with QEMU using iommufd, I hit a problem as log #3
shows. QEMU failed when trying to do a map due to an IOVA still being in
use. After debugging, the 0xf000 IOVA is mapped but never unmapped. Per log
#2, QEMU has issued an unmap with a larger range (0xff00 -
0x1) which includes 0xf000, but iopt_find_exact_area() doesn't find any
area, so 0xf000 is not unmapped. Is this correct? The same test passes with
the vfio iommu type1 driver. Any idea?


There are a couple of good reasons why iopt_unmap_iova() should
process any contiguous range of fully contained areas, so I would
consider this something worth fixing. Can you send a small patch and
test case and I'll fold it in?


Sure, just spotted it, so I haven't got a fix patch yet. I may work on
it tomorrow.
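
For reference, the failing sequence sketched against the uapi from patch 08
(hypothetical IOVA values; QEMU reaches the same exact-match rule through
the vfio compat unmap path):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* assumed uapi header name */

static void reproduce(int iommufd, __u32 ioas_id, void *buf)
{
	/* Map one page at a fixed IOVA... */
	struct iommu_ioas_map map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
			 IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.user_va = (__u64)(uintptr_t)buf,
		.length = 0x1000,
		.iova = 0xf000,		/* hypothetical guest IOVA */
	};
	/* ...then unmap with a larger range covering it, as QEMU does */
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = 0,
		.length = 0x100000,	/* covers 0xf000, matches no single area */
	};

	ioctl(iommufd, IOMMU_IOAS_MAP, &map);	  /* succeeds */
	ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap); /* fails: no exact match */
}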

--
Regards,
Yi Liu


Re: [PATCH RFC 00/12] IOMMUFD Generic interface

2022-04-14 Thread Yi Liu

On 2022/3/19 01:27, Jason Gunthorpe wrote:

iommufd is the user API to control the IOMMU subsystem as it relates to
managing IO page tables that point at user space memory.

It takes over from drivers/vfio/vfio_iommu_type1.c (aka the VFIO
container) which is the VFIO specific interface for a similar idea.

We see a broad need for extended features, some being highly IOMMU device
specific:
  - Binding iommu_domain's to PASID/SSID
  - Userspace page tables, for ARM, x86 and S390
  - Kernel bypass'd invalidation of user page tables
  - Re-use of the KVM page table in the IOMMU
  - Dirty page tracking in the IOMMU
  - Runtime Increase/Decrease of IOPTE size
  - PRI support with faults resolved in userspace

As well as a need to access these features beyond just VFIO, VDPA for
instance, but other classes of accelerator HW are touching on these areas
now too.

The v1 series proposed re-using the VFIO type 1 data structure, however it
was suggested that if we are doing this big update then we should also
come with a data structure that solves the limitations that VFIO type1
has. Notably this addresses:

  - Multiple IOAS/'containers' and multiple domains inside a single FD

  - Single-pin operation no matter how many domains and containers use
a page

  - A fine grained locking scheme supporting user managed concurrency for
multi-threaded map/unmap

  - A pre-registration mechanism to optimize vIOMMU use cases by
pre-pinning pages

  - Extended ioctl API that can manage these new objects and exposes
domains directly to user space

  - domains are sharable between subsystems, eg VFIO and VDPA

The bulk of this code is a new data structure design to track how the
IOVAs are mapped to PFNs.

iommufd intends to be general and consumable by any driver that wants to
DMA to userspace. From a driver perspective it can largely be dropped in
in-place of iommu_attach_device() and provides a uniform full feature set
to all consumers.

As this is a larger project this series is the first step. This series
provides the iommfd "generic interface" which is designed to be suitable
for applications like DPDK and VMM flows that are not optimized to
specific HW scenarios. It is close to being a drop in replacement for the
existing VFIO type 1.

This is part two of three for an initial sequence:
  - Move IOMMU Group security into the iommu layer

https://lore.kernel.org/linux-iommu/20220218005521.172832-1-baolu...@linux.intel.com/
  * Generic IOMMUFD implementation
  - VFIO ability to consume IOMMUFD
An early exploration of this is available here:
 https://github.com/luxis1999/iommufd/commits/iommufd-v5.17-rc6


Eric Auger and I have posted a QEMU RFC based on this branch.

https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l....@intel.com/

--
Regards,
Yi Liu


Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-04-17 Thread Yi Liu

On 2022/4/13 22:49, Yi Liu wrote:

On 2022/4/13 22:36, Jason Gunthorpe wrote:

On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:

+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ *
+ * The requested range must exactly match an existing range.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+    unsigned long length)
+{
+    struct iopt_pages *pages;
+    struct iopt_area *area;
+    unsigned long iova_end;
+    int rc;
+
+    if (!length)
+    return -EINVAL;
+
+    if (check_add_overflow(iova, length - 1, &iova_end))
+    return -EOVERFLOW;
+
+    down_read(&iopt->domains_rwsem);
+    down_write(&iopt->iova_rwsem);
+    area = iopt_find_exact_area(iopt, iova, iova_end);


When testing vIOMMU with QEMU using iommufd, I hit a problem as log #3
shows. QEMU failed when trying to do a map due to an IOVA still being in
use. After debugging, the 0xf000 IOVA is mapped but never unmapped. Per log
#2, QEMU has issued an unmap with a larger range (0xff00 -
0x1) which includes 0xf000, but iopt_find_exact_area() doesn't find any
area, so 0xf000 is not unmapped. Is this correct? The same test passes with
the vfio iommu type1 driver. Any idea?


There are a couple of good reasons why iopt_unmap_iova() should
process any contiguous range of fully contained areas, so I would
consider this something worth fixing. Can you send a small patch and
test case and I'll fold it in?


Sure, just spotted it, so I haven't got a fix patch yet. I may work on
it tomorrow.


Hi Jason,

Got the patch below for it; also pushed it to the exploration branch.

https://github.com/luxis1999/iommufd/commit/d764f3288de0fd52c578684788a437701ec31b2d

From 22a758c401a1c7f6656625013bb87204c9ea65fe Mon Sep 17 00:00:00 2001
From: Yi Liu 
Date: Sun, 17 Apr 2022 07:39:03 -0700
Subject: [PATCH] iommufd/io_pagetable: Support unmap fully contained areas

Changes:
- return the unmapped bytes to caller
- supports unmapping fully contained contiguous areas
- add a test case in selftest

Signed-off-by: Yi Liu 
---
 drivers/iommu/iommufd/io_pagetable.c| 90 -
 drivers/iommu/iommufd/ioas.c|  8 ++-
 drivers/iommu/iommufd/iommufd_private.h |  4 +-
 drivers/iommu/iommufd/vfio_compat.c |  8 ++-
 include/uapi/linux/iommufd.h|  2 +-
 tools/testing/selftests/iommu/iommufd.c | 40 +++
 6 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index f9f3b06946bf..5142f797a812 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -315,61 +315,26 @@ static int __iopt_unmap_iova(struct io_pagetable *iopt, struct iopt_area *area,

return 0;
 }

-/**
- * iopt_unmap_iova() - Remove a range of iova
- * @iopt: io_pagetable to act on
- * @iova: Starting iova to unmap
- * @length: Number of bytes to unmap
- *
- * The requested range must exactly match an existing range.
- * Splitting/truncating IOVA mappings is not allowed.
- */
-int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
-   unsigned long length)
-{
-   struct iopt_pages *pages;
-   struct iopt_area *area;
-   unsigned long iova_end;
-   int rc;
-
-   if (!length)
-   return -EINVAL;
-
-   if (check_add_overflow(iova, length - 1, &iova_end))
-   return -EOVERFLOW;
-
-   down_read(&iopt->domains_rwsem);
-   down_write(&iopt->iova_rwsem);
-   area = iopt_find_exact_area(iopt, iova, iova_end);
-   if (!area) {
-   up_write(&iopt->iova_rwsem);
-   up_read(&iopt->domains_rwsem);
-   return -ENOENT;
-   }
-   pages = area->pages;
-   area->pages = NULL;
-   up_write(&iopt->iova_rwsem);
-
-   rc = __iopt_unmap_iova(iopt, area, pages);
-   up_read(&iopt->domains_rwsem);
-   return rc;
-}
-
-int iopt_unmap_all(struct io_pagetable *iopt)
+static int __iopt_unmap_iova_range(struct io_pagetable *iopt,
+  unsigned long start,
+  unsigned long end,
+  unsigned long *unmapped)
 {
struct iopt_area *area;
+   unsigned long unmapped_bytes = 0;
int rc;

down_read(&iopt->domains_rwsem);
down_write(&iopt->iova_rwsem);
-   while ((area = iopt_area_iter_first(iopt, 0, ULONG_MAX))) {
+   while ((area = iopt_area_iter_first(iopt, start, end))) {
struct iopt_pages *pages;

-   /* Userspace should not race unmap all and map */
-   if (!area->pages) {
-   rc = -EBUSY;
+   [...]

Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-04-18 Thread Yi Liu

Hi Jason,

On 2022/4/17 22:56, Yi Liu wrote:

On 2022/4/13 22:49, Yi Liu wrote:

On 2022/4/13 22:36, Jason Gunthorpe wrote:

On Wed, Apr 13, 2022 at 10:02:58PM +0800, Yi Liu wrote:

+/**
+ * iopt_unmap_iova() - Remove a range of iova
+ * @iopt: io_pagetable to act on
+ * @iova: Starting iova to unmap
+ * @length: Number of bytes to unmap
+ *
+ * The requested range must exactly match an existing range.
+ * Splitting/truncating IOVA mappings is not allowed.
+ */
+int iopt_unmap_iova(struct io_pagetable *iopt, unsigned long iova,
+    unsigned long length)
+{
+    struct iopt_pages *pages;
+    struct iopt_area *area;
+    unsigned long iova_end;
+    int rc;
+
+    if (!length)
+    return -EINVAL;
+
+    if (check_add_overflow(iova, length - 1, &iova_end))
+    return -EOVERFLOW;
+
+    down_read(&iopt->domains_rwsem);
+    down_write(&iopt->iova_rwsem);
+    area = iopt_find_exact_area(iopt, iova, iova_end);


When testing vIOMMU with QEMU using iommufd, I hit a problem as log #3
shows. QEMU failed when trying to do a map due to an IOVA still being in
use. After debugging, the 0xf000 IOVA is mapped but never unmapped. Per log
#2, QEMU has issued an unmap with a larger range (0xff00 -
0x1) which includes 0xf000, but iopt_find_exact_area() doesn't find any
area, so 0xf000 is not unmapped. Is this correct? The same test passes with
the vfio iommu type1 driver. Any idea?


There are a couple of good reasons why iopt_unmap_iova() should
process any contiguous range of fully contained areas, so I would
consider this something worth fixing. Can you send a small patch and
test case and I'll fold it in?


Sure, just spotted it, so I haven't got a fix patch yet. I may work on
it tomorrow.


Hi Jason,

Got the patch below for it; also pushed it to the exploration branch.

https://github.com/luxis1999/iommufd/commit/d764f3288de0fd52c578684788a437701ec31b2d 


0-day reported a use-without-initialization issue to me, so I updated it.
Please take the change from the commit below. Sorry for the noise.

https://github.com/luxis1999/iommufd/commit/10674417c235cb4a4caf2202fffb078611441da2

--
Regards,
Yi Liu

Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers

2022-05-11 Thread Yi Liu



On 2022/3/19 01:27, Jason Gunthorpe wrote:


+
+/**
+ * iommufd_device_attach - Connect a device to an iommu_domain
+ * @idev: device to attach
+ * @pt_id: Input a IOMMUFD_OBJ_IOAS, or IOMMUFD_OBJ_HW_PAGETABLE
+ * Output the IOMMUFD_OBJ_HW_PAGETABLE ID
+ * @flags: Optional flags
+ *
+ * This connects the device to an iommu_domain, either automatically or 
manually
+ * selected. Once this completes the device could do DMA.
+ *
+ * The caller should return the resulting pt_id back to userspace.
+ * This function is undone by calling iommufd_device_detach().
+ */
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+ unsigned int flags)
+{
+   struct iommufd_hw_pagetable *hwpt;
+   int rc;
+
+   refcount_inc(&idev->obj.users);
+
+   hwpt = iommufd_hw_pagetable_from_id(idev->ictx, *pt_id, idev->dev);
+   if (IS_ERR(hwpt)) {
+   rc = PTR_ERR(hwpt);
+   goto out_users;
+   }
+
+   mutex_lock(&hwpt->devices_lock);
+   /* FIXME: Use a device-centric iommu api. For now check if the
+* hw_pagetable already has a device of the same group joined to tell if
+* we are the first and need to attach the group. */
+   if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
+   phys_addr_t sw_msi_start = 0;
+
+   rc = iommu_attach_group(hwpt->domain, idev->group);
+   if (rc)
+   goto out_unlock;
+
+   /*
+* hwpt is now the exclusive owner of the group so this is the
+* first time enforce is called for this group.
+*/
+   rc = iopt_table_enforce_group_resv_regions(
+   &hwpt->ioas->iopt, idev->group, &sw_msi_start);
+   if (rc)
+   goto out_detach;
+   rc = iommufd_device_setup_msi(idev, hwpt, sw_msi_start, flags);
+   if (rc)
+   goto out_iova;
+   }
+
+   idev->hwpt = hwpt;


Could the below list_empty() check be moved into the above "if branch"? If
the above "if branch" is false, that means there is already a group attached
to hwpt->domain, so hwpt->devices should be non-empty. Only if the above
"if branch" is true can hwpt->devices possibly be empty.

So moving it into the above "if branch" may be better? (A sketch of the
suggested restructure follows the quoted function below.)


+   if (list_empty(&hwpt->devices)) {
+   rc = iopt_table_add_domain(&hwpt->ioas->iopt, hwpt->domain);
+   if (rc)
+   goto out_iova;
+   }
+   list_add(&idev->devices_item, &hwpt->devices);
+   mutex_unlock(&hwpt->devices_lock);
+
+   *pt_id = idev->hwpt->obj.id;
+   return 0;
+
+out_iova:
+   iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->group);
+out_detach:
+   iommu_detach_group(hwpt->domain, idev->group);
+out_unlock:
+   mutex_unlock(&hwpt->devices_lock);
+   iommufd_hw_pagetable_put(idev->ictx, hwpt);
+out_users:
+   refcount_dec(&idev->obj.users);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(iommufd_device_attach);
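
A sketch of the restructure suggested above, derived from the quoted code
(illustrative and untested):

	if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
		/* ... iommu_attach_group(), resv region enforcement and
		 * iommufd_device_setup_msi() as in the quoted code ... */

		/*
		 * Only a first-time group attach can find hwpt->devices
		 * empty, so the domain can be connected to the IOAS here
		 * rather than after the branch.
		 */
		if (list_empty(&hwpt->devices)) {
			rc = iopt_table_add_domain(&hwpt->ioas->iopt,
						   hwpt->domain);
			if (rc)
				goto out_iova;
		}
	}

	idev->hwpt = hwpt;
	list_add(&idev->devices_item, &hwpt->devices);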


--
Regards,
Yi Liu


Re: [PATCH RFC 10/12] iommufd: Add kAPI toward external drivers

2022-05-19 Thread Yi Liu
void iommufd_hw_pagetable_put(struct iommufd_ctx *ictx,
  struct iommufd_hw_pagetable *hwpt);
  void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
  
+void iommufd_device_destroy(struct iommufd_object *obj);
+
  #endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 954cde173c86fc..6a895489fb5b82 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -284,6 +284,9 @@ struct iommufd_ctx *iommufd_fget(int fd)
  }
  
   static struct iommufd_object_ops iommufd_object_ops[] = {
+   [IOMMUFD_OBJ_DEVICE] = {
+   .destroy = iommufd_device_destroy,
+   },
[IOMMUFD_OBJ_IOAS] = {
.destroy = iommufd_ioas_destroy,
},
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
new file mode 100644
index 00..6caac05475e39f
--- /dev/null
+++ b/include/linux/iommufd.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 Intel Corporation
+ * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __LINUX_IOMMUFD_H
+#define __LINUX_IOMMUFD_H
+
+#include 
+#include 
+#include 
+#include 
+
+struct pci_dev;
+struct iommufd_device;
+
+#if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_device *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
+  u32 *id);
+void iommufd_unbind_device(struct iommufd_device *idev);
+
+enum {
+   IOMMUFD_ATTACH_FLAGS_ALLOW_UNSAFE_INTERRUPT = 1 << 0,
+};
+int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
+ unsigned int flags);
+void iommufd_device_detach(struct iommufd_device *idev);
+
+#else /* !CONFIG_IOMMUFD */
+static inline struct iommufd_device *
+iommufd_bind_pci_device(int fd, struct pci_dev *pdev, u32 *id)
+{
+   return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline void iommufd_unbind_device(struct iommufd_device *idev)
+{
+}
+
+static inline int iommufd_device_attach(struct iommufd_device *idev,
+   u32 *pt_id, unsigned int flags)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void iommufd_device_detach(struct iommufd_device *idev)
+{
+}
+#endif /* CONFIG_IOMMUFD */
+#endif
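
As a usage illustration, a consumer driver might drive this kAPI roughly as
follows (a sketch; the function name and surrounding context are
hypothetical, only the declarations above are assumed):

#include <linux/err.h>
#include <linux/iommufd.h>
#include <linux/pci.h>

static int example_enable_iommufd(struct pci_dev *pdev, int iommufd,
				  u32 ioas_id)
{
	struct iommufd_device *idev;
	u32 dev_id, pt_id = ioas_id;	/* attach outputs a HW_PAGETABLE ID */
	int rc;

	idev = iommufd_bind_pci_device(iommufd, pdev, &dev_id);
	if (IS_ERR(idev))
		return PTR_ERR(idev);

	rc = iommufd_device_attach(idev, &pt_id, 0);
	if (rc) {
		iommufd_unbind_device(idev);
		return rc;
	}
	/* pt_id now names the hw_pagetable; report it back to userspace */
	return 0;
}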


--
Regards,
Yi Liu


Re: [PATCH 1/1] iommu/vt-d: Fix RID2PASID setup failure

2022-06-20 Thread Yi Liu

Hi Baolu,

On 2022/6/20 16:17, Lu Baolu wrote:

The IOMMU driver shares the pasid table for PCI alias devices. When the
RID2PASID entry of the shared pasid table has been filled by the first
device, the subsequent devices will encounter the "DMAR: Setup RID2PASID
failed" failure as the pasid entry has already been marke as present. As


s/marke/marked/


the result, the IOMMU probing process will be aborted.

This fixes it by skipping RID2PASID setting if the pasid entry has been
populated. This works because the IOMMU core ensures that only the same
IOMMU domain can be attached to all PCI alias devices at the same time.


nit: this sentence is a little bit hard to interpret. :-) I guess what you
want to describe is that the PCI alias devices should be attached to the
same domain instead of different domains, right?

Also, does it apply to all domain types? E.g. the SVA domains introduced in
"iommu: SVA and IOPF refactoring"?



Therefore the subsequent devices just try to setup the RID2PASID entry
with the same domain, which is negligible.

Fixes: ef848b7e5a6a0 ("iommu/vt-d: Setup pasid entry for RID2PASID support")
Reported-by: Chenyi Qiang 
Cc: sta...@vger.kernel.org
Signed-off-by: Lu Baolu 
---
  drivers/iommu/intel/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 44016594831d..b9966c01a2a2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2564,7 +2564,7 @@ static int domain_add_dev_info(struct dmar_domain *domain, struct device *dev)
ret = intel_pasid_setup_second_level(iommu, domain,
dev, PASID_RID2PASID);
spin_unlock_irqrestore(&iommu->lock, flags);
-   if (ret) {
+   if (ret && ret != -EBUSY) {
dev_err(dev, "Setup RID2PASID failed\n");
dmar_remove_one_dev_info(dev);
    return ret;


--
Regards,
Yi Liu