ioasid uAPI proposal

Tian, Kevin Tue, 01 Jun 2021 01:38:20 -0700

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, May 29, 2021 3:59 AM
> 
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here we assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio, so a device fd can be acquired w/o
> > going through the legacy container/group interface. For illustration,
> > those devices are simply called dev[1...N]:
> >
> >     device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use 
> > cases:
> >
> >     ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity, the examples below are all drawn from the virtualization
> > story. They are representative and could be easily adapted to a
> > non-virtualization scenario.
> 
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.


Jason, I want to confirm something here. From the earlier discussion we
had the impression that you want VFIO to be a pure device driver, with
container/group used only by legacy applications. Are you suggesting
with this comment that VFIO can still keep the container/group
concepts, and that the user simply stops using the vfio iommu uAPI
(e.g. VFIO_SET_IOMMU) in favor of /dev/ioasid (which has a simple
policy: an IOASID rejects a command if a partially-attached group
exists)?

> 
> 
> > Three types of IOASIDs are considered:
> >
> >     gpa_ioasid[1...N]:      for GPA address space
> >     giova_ioasid[1...N]:    for guest IOVA address space
> >     gva_ioasid[1...N]:      for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > To keep the illustration simple, IOASID_CHECK_EXTENSION and
> > IOASID_GET_INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> >     /* Bind device to IOASID fd */
> >     device_fd = open("/dev/vfio/devices/dev1", mode);
> >     ioasid_fd = open("/dev/ioasid", mode);
> >     ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> >     /* Attach device to IOASID */
> >     gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >     at_data = { .ioasid = gpa_ioasid};
> >     ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Setup GPA mapping */
> >     dma_map = {
> >             .ioasid = gpa_ioasid;
> >             .iova   = 0;            // GPA
> >             .vaddr  = 0x40000000;   // HVA
> >             .size   = 1GB;
> >     };
> >     ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned more devices than dev1, the user follows the
> > above sequence to attach the other devices to the same gpa_ioasid, i.e.
> > sharing the GPA address space across all assigned devices.
> 
> eg
> 
>       device2_fd = open("/dev/vfio/devices/dev1", mode);
>       ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>       ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
> 
> Right?

Exactly, except for a small typo ('dev1' -> 'dev2'). :)

> 
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu needs to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> >     device_fd1 = open("/dev/vfio/devices/dev1", mode);
> >     device_fd2 = open("/dev/vfio/devices/dev2", mode);
> >     ioasid_fd = open("/dev/ioasid", mode);
> >     ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> >     ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> >     /* pre-register the virtual address range for accounting */
> >     mem_info = { .vaddr = 0x40000000; .size = 1GB };
> >     ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> >     /* Attach dev1 and dev2 to gpa_ioasid */
> >     gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >     at_data = { .ioasid = gpa_ioasid};
> >     ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >     ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Setup GPA mapping */
> >     dma_map = {
> >             .ioasid = gpa_ioasid;
> >             .iova   = 0;            // GPA
> >             .vaddr  = 0x40000000;   // HVA
> >             .size   = 1GB;
> >     };
> >     ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> >     /* After boot, guest enables a GIOVA space for dev2 */
> >     giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> >     /* First detach dev2 from previous address space */
> >     at_data = { .ioasid = gpa_ioasid};
> >     ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> >     /* Then attach dev2 to the new address space */
> >     at_data = { .ioasid = giova_ioasid};
> >     ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Setup a shadow DMA mapping according to vIOMMU
> >       * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> >       */
> 
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' means the merged mapping: GIOVA(0x2000) -> HVA (0x40001000)
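
To make the merge concrete, here is a minimal userspace sketch in plain C
of how the two translations from 5.2 compose. It is only an illustration:
struct range_map, translate() and shadow_hva() are hypothetical names made
up for this sketch and are not part of the proposed uAPI. The two tables
hold the values from the example above (GIOVA 0x2000 -> GPA 0x1000, and
GPA 0 -> HVA 0x40000000 for 1GB).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type: one contiguous linear mapping. */
struct range_map {
        uint64_t in_base;   /* start of the input range */
        uint64_t out_base;  /* start of the output range */
        uint64_t size;      /* length in bytes */
};

/* Translate addr through one table; returns 0 on success, -1 if unmapped. */
static int translate(const struct range_map *tbl, size_t n,
                     uint64_t addr, uint64_t *out)
{
        assert(out != NULL);
        for (size_t i = 0; i < n; i++) {
                if (addr >= tbl[i].in_base &&
                    addr - tbl[i].in_base < tbl[i].size) {
                        *out = tbl[i].out_base + (addr - tbl[i].in_base);
                        return 0;
                }
        }
        return -1;
}

/* Compose GIOVA->GPA (from the vIOMMU) with GPA->HVA (Qemu's memory map). */
static int shadow_hva(const struct range_map *giova_to_gpa, size_t n1,
                      const struct range_map *gpa_to_hva, size_t n2,
                      uint64_t giova, uint64_t *hva)
{
        uint64_t gpa;

        if (translate(giova_to_gpa, n1, giova, &gpa))
                return -1;
        return translate(gpa_to_hva, n2, gpa, hva);
}

/* Values from the 5.2 example. */
static const struct range_map guest_iommu[] = {
        { 0x2000, 0x1000, 0x1000 },          /* GIOVA -> GPA, 4KB */
};
static const struct range_map guest_ram[] = {
        { 0x0, 0x40000000, 1ULL << 30 },     /* GPA -> HVA, 1GB */
};
```

The result of shadow_hva() for a GIOVA is the .vaddr that userspace would
place in dma_map for giova_ioasid, alongside .iova = GIOVA.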

> 
> >     dma_map = {
> >             .ioasid = giova_ioasid;
> >             .iova   = 0x2000;       // GIOVA
> >             .vaddr  = 0x40001000;   // HVA
> 
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

yes

> 
> 
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode the kernel, rather than the user, creates the
> > shadow mapping.
> >
> > The flow before the guest boots is the same as in 5.2, with one
> > exception: because giova_ioasid is nested on gpa_ioasid, locked page
> > accounting is only conducted for gpa_ioasid, so it's not necessary to
> > pre-register virtual memory.
> >
> > To save space we only list the steps after boot (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before the guest boots):
> >
> >     /* After boot */
> >     /* Make GIOVA space nested on GPA space */
> >     giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                             gpa_ioasid);
> >
> >     /* Attach dev2 to the new address space (child)
> >       * Note dev2 is still attached to gpa_ioasid (parent)
> >       */
> >     at_data = { .ioasid = giova_ioasid};
> >     ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> >       * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> >       * to form a shadow mapping.
> >       */
> >     dma_map = {
> >             .ioasid = giova_ioasid;
> >             .iova   = 0x2000;       // GIOVA
> >             .vaddr  = 0x1000;       // GPA
> >             .size   = 4KB;
> >     };
> >     ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> 
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient

yes.

> 
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> >     /* After boot */
> >     /* Make GIOVA space nested on GPA space */
> >     giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                             gpa_ioasid);
> >
> >     /* Attach dev2 to the new address space (child)
> >       * Note dev2 is still attached to gpa_ioasid (parent)
> >       */
> >     at_data = { .ioasid = giova_ioasid};
> >     ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Bind guest I/O page table  */
> >     bind_data = {
> >             .ioasid = giova_ioasid;
> >             .addr   = giova_pgtable;
> >             // and format information
> >     };
> >     ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
> 
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boot the guest further creates a GVA address space (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, the user should avoid exposing ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (pdev or mdev), except for
> > one additional step to call KVM for an ENQCMD-capable mdev:
> >
> >     /* After boot */
> >     /* Make GVA space nested on GPA space */
> >     gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                             gpa_ioasid);
> >
> >     /* Attach dev1 to the new address space and specify vPASID */
> >     at_data = {
> >             .ioasid         = gva_ioasid;
> >             .flag           = IOASID_ATTACH_USER_PASID;
> >             .user_pasid     = gpasid1;
> >     };
> >     ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> 
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane for the guest to link an I/O page table to
different vpasids on dev1 and dev2. The IOMMU doesn't mandate
that when multiple devices share an I/O page table they must use
the same PASID#.
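
As a rough illustration of why the vPASID goes with the attach operation
rather than with the IOASID itself, the sketch below keeps one record per
(device, IOASID) attachment, so two devices can share an IOASID while each
carries its own vPASID. All names here (struct attach_entry, attach(),
lookup_vpasid(), MAX_ATTACH) are hypothetical and exist only for this
example; rejecting a duplicate attach is likewise an assumption of the
sketch, not something the proposal specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ATTACH 8

/* Hypothetical bookkeeping record: one (device, IOASID) attachment. */
struct attach_entry {
        int      dev_id;      /* which device (e.g. 1 for dev1) */
        uint32_t ioasid;      /* the shared I/O address space */
        uint32_t user_pasid;  /* vPASID the guest chose on this device */
};

struct attach_table {
        struct attach_entry e[MAX_ATTACH];
        size_t n;
};

/* Record an attachment; returns 0, or -1 if full or already attached. */
static int attach(struct attach_table *t, int dev_id,
                  uint32_t ioasid, uint32_t user_pasid)
{
        assert(t != NULL);
        for (size_t i = 0; i < t->n; i++)
                if (t->e[i].dev_id == dev_id && t->e[i].ioasid == ioasid)
                        return -1;
        if (t->n == MAX_ATTACH)
                return -1;
        t->e[t->n++] = (struct attach_entry){ dev_id, ioasid, user_pasid };
        return 0;
}

/* Look up the vPASID a device uses for an IOASID; returns 0 or -1. */
static int lookup_vpasid(const struct attach_table *t, int dev_id,
                         uint32_t ioasid, uint32_t *user_pasid)
{
        assert(t != NULL && user_pasid != NULL);
        for (size_t i = 0; i < t->n; i++) {
                if (t->e[i].dev_id == dev_id && t->e[i].ioasid == ioasid) {
                        *user_pasid = t->e[i].user_pasid;
                        return 0;
                }
        }
        return -1;
}
```

If the vPASID were a property of the IOASID instead, the table would need
only one PASID per IOASID and the dev1/dev2 case above could not be
expressed.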

> 
> >     /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> >       * translation structure through KVM
> >       */
> >     pa_data = {
> >             .ioasid_fd      = ioasid_fd;
> >             .ioasid         = gva_ioasid;
> >             .guest_pasid    = gpasid1;
> >     };
> >     ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> 
> Makes sense
> 
> >     /* Bind guest I/O page table  */
> >     bind_data = {
> >             .ioasid = gva_ioasid;
> >             .addr   = gva_pgtable1;
> >             // and format information
> >     };
> >     ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

> 
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. This only describes the high-level flow from the host
> > IOMMU driver to the guest IOMMU driver and back.)
> >
> > -   Host IOMMU driver receives a page request with raw fault_data {rid,
> >     pasid, addr};
> >
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> >     information registered by IOASID fault handler;
> >
> > -   IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> >     which is saved in ioasid_data->fault_data (used for response);
> >
> > -   IOASID fault handler generates a user fault_data (ioasid, addr), links
> >     it to the shared ring buffer and triggers eventfd to userspace;
> 
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> is bound, the label might match a rid or a (rid, pasid) pair.

Yes, I acknowledged this input from you and Jean about page fault and 
bind_pasid_table. I summarized it as open#3 in another mail.

Thus the following is skipped...

Thanks
Kevin

> 
> > -   Upon receiving the event, Qemu needs to find the virtual routing
> >     information (v_rid + v_pasid) of the device attached to the faulting
> >     ioasid. If there are multiple, pick a random one. This should be fine
> >     since the purpose is to fix the I/O page table on the guest;
> 
> The device label should fix this
> 
> > -   Qemu finds the pending fault event, converts virtual completion data
> >     into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> >     complete the pending fault;
> >
> > -   /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> >     ioasid_data->fault_data, and then calls iommu api to complete it with
> >     {rid, pasid, response_code};
> 
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
> 
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > The PASID table is put in the GPA space on some platforms, and thus must
> > be updated by the guest. It is treated as another user page table to be
> > bound with the IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified across platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which,
> > once enabled, requires the guest to invalidate the PASID cache for any
> > change to the PASID table. This allows Qemu to track the lifespan of
> > guest I/O page tables.
> >
> > If such a capability is missing, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> >     /* After boot */
> >     /* Make vPASID space nested on GPA space */
> >     pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                             gpa_ioasid);
> >
> >     /* Attach dev1 to pasidtbl_ioasid */
> >     at_data = { .ioasid = pasidtbl_ioasid};
> >     ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Bind PASID table */
> >     bind_data = {
> >             .ioasid = pasidtbl_ioasid;
> >             .addr   = gpa_pasid_table;
> >             // and format information
> >     };
> >     ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> >     /* vIOMMU detects a new GVA I/O space created */
> >     gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> >                             gpa_ioasid);
> >
> >     /* Attach dev1 to the new address space, with gpasid1 */
> >     at_data = {
> >             .ioasid         = gva_ioasid;
> >             .flag           = IOASID_ATTACH_USER_PASID;
> >             .user_pasid     = gpasid1;
> >     };
> >     ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> >     /* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> >       * used, the kernel will not update the PASID table. Instead, just
> >       * track the bound I/O page table for handling invalidation and
> >       * I/O page faults.
> >       */
> >     bind_data = {
> >             .ioasid = gva_ioasid;
> >             .addr   = gva_pgtable1;
> >             // and format information
> >     };
> >     ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> 
> I still don't quite get the benefit from doing this.
> 
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
> 
> Cache invalidate seems easy enough to support
> 
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
> 
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
> 
> Jason
