Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Jacob Pan Tue, 08 Sep 2020 10:23:27 -0700

On Mon, 7 Sep 2020 10:03:39 +0200
Auger Eric <eric.au...@redhat.com> wrote:


> Hi Jacob,
> 
> On 9/1/20 6:56 PM, Jacob Pan wrote:
> > Hi Eric,
> > 
> > On Thu, 27 Aug 2020 18:21:07 +0200
> > Auger Eric <eric.au...@redhat.com> wrote:
> >   
> >> Hi Jacob,
> >> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:  
> >>> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:    
> >>>> IOASID is used to identify address spaces that can be targeted by
> >>>> device DMA. It is a system-wide resource that is essential to its
> >>>> many users. This document is an attempt to help developers from
> >>>> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s
> >>>> Scalable IO Virtualization (SIOV) enabled platforms are the
> >>>> primary users of IOASID. Examples of how SIOV components interact
> >>>> with IOASID APIs are provided in that many APIs are driven by the
> >>>> requirements from SIOV.
> >>>>
> >>>> Signed-off-by: Liu Yi L <yi.l....@intel.com>
> >>>> Signed-off-by: Wu Hao <hao...@intel.com>
> >>>> Signed-off-by: Jacob Pan <jacob.jun....@linux.intel.com>
> >>>> ---
> >>>>  Documentation/ioasid.rst | 618
> >>>> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed,
> >>>> 618 insertions(+) create mode 100644 Documentation/ioasid.rst
> >>>>
> >>>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst    
> >>>
> >>> Thanks for writing this up. Should it go to
> >>> Documentation/driver-api/, or Documentation/driver-api/iommu/? I
> >>> think this also needs to Cc linux-...@vger.kernel.org and
> >>> cor...@lwn.net   
> >>>> new file mode 100644
> >>>> index 000000000000..b6a8cdc885ff
> >>>> --- /dev/null
> >>>> +++ b/Documentation/ioasid.rst
> >>>> @@ -0,0 +1,618 @@
> >>>> +.. ioasid:
> >>>> +
> >>>> +=====================================
> >>>> +IO Address Space ID
> >>>> +=====================================
> >>>> +
> >>>> +IOASID is a generic name for PCIe Process Address ID (PASID) or
> >>>> ARM +SMMU sub-stream ID. An IOASID identifies an address space
> >>>> that DMA    
> >>>
> >>> "SubstreamID"    
> >> On ARM if we don't use PASIDs we have streamids (SID) which can also
> >> identify address spaces that DMA requests can target. So maybe this
> >> definition is not sufficient.
> >>  
> > According to SMMU spec, the SubstreamID is equivalent to PASID. My
> > understanding is that SID is equivalent to PCI requester ID that
> > identifies stage 2. Do you plan to use IOASID for stage 2?  
> No. So actually if PASID is not used we still have a default single
> IOASID matching the single context. So that may be fine as a definition.
OK, thanks for explaining.

> > IOASID is mostly for SVA and DMA request w/ PASID.
> >   
> >>>     
> >>>> +requests can target.
> >>>> +
> >>>> +The primary use cases for IOASID are Shared Virtual Address (SVA)
> >>>> and +IO Virtual Address (IOVA). However, the requirements for
> >>>> IOASID    
> >>>
> >>> IOVA alone isn't a use case, maybe "multiple IOVA spaces per
> >>> device"?   
> >>>> +management can vary among hardware architectures.
> >>>> +
> >>>> +This document covers the generic features supported by IOASID
> >>>> +APIs. Vendor-specific use cases are also illustrated with Intel's
> >>>> VT-d +based platforms as the first example.
> >>>> +
> >>>> +.. contents:: :local:
> >>>> +
> >>>> +Glossary
> >>>> +========
> >>>> +PASID - Process Address Space ID
> >>>> +
> >>>> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> >>>> +sub-stream ID in SMMU)    
> >>>
> >>> "SubstreamID"
> >>>     
> >>>> +
> >>>> +SVA/SVM - Shared Virtual Addressing/Memory
> >>>> +
> >>>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission
> >>>> [1]    
> >>>
> >>> Maybe drop the "New", to keep the documentation perennial. It might
> >>> be good to add internal links here to the specifications URLs at
> >>> the bottom.   
> >>>> +
> >>>> +DSA - Intel Data Streaming Accelerator [2]
> >>>> +
> >>>> +VDCM - Virtual device composition module [3]
> >>>> +
> >>>> +SIOV - Intel Scalable IO Virtualization
> >>>> +
> >>>> +
> >>>> +Key Concepts
> >>>> +============
> >>>> +
> >>>> +IOASID Set
> >>>> +-----------
> >>>> +An IOASID set is a group of IOASIDs allocated from the system-wide
> >>>> +IOASID pool. An IOASID set is created and can be identified by a
> >>>> +token of u64. Refer to IOASID set APIs for more details.    
> >>>
> >>> Identified either by an u64 or an mm_struct, right?  Maybe just
> >>> drop the second sentence if it's detailed in the IOASID set section
> >>> below.   
> >>>> +
> >>>> +IOASID set is particularly useful for guest SVA where each guest
> >>>> could +have its own IOASID set for security and efficiency reasons.
> >>>> +
> >>>> +IOASID Set Private ID (SPID)
> >>>> +----------------------------
> >>>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to
> >>>> a +system-wide IOASID but the namespace of SPID is within its
> >>>> IOASID +set.    
> >>>
> >>> The intro isn't super clear. Perhaps this is simpler:
> >>> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> >>> single system-wide IOASID."    
> >> or, "within an ioasid set, each ioasid can be associated with an alias
> >> ID, named SPID."  
> > I don't have strong opinion, I feel it is good to explain the
> > relationship between SPID and IOASID in both directions, how about add?
> > " Conversely, each IOASID is associated with an alias ID, named SPID."  
> yep. I amy suggest: each IOASID may be associated with an alias ID,
> local to the IOASID set, named SPID.
This is more precise. thanks for suggesting that.

> >   
> >>>     
> >>>> SPIDs can be used as guest IOASIDs where each guest could do
> >>>> +IOASID allocation from its own pool and map them to host physical
> >>>> +IOASIDs. SPIDs are particularly useful for supporting live
> >>>> migration +where decoupling guest and host physical resources are
> >>>> necessary. +
> >>>> +For example, two VMs can both allocate guest PASID/SPID #101 but
> >>>> map to +different host PASIDs #201 and #202 respectively as shown
> >>>> in the +diagram below.
> >>>> +::
> >>>> +
> >>>> + .------------------.    .------------------.
> >>>> + |   VM 1           |    |   VM 2           |
> >>>> + |                  |    |                  |
> >>>> + |------------------|    |------------------|
> >>>> + | GPASID/SPID 101  |    | GPASID/SPID 101  |
> >>>> + '------------------'    -------------------'     Guest
> >>>> + __________|______________________|______________________
> >>>> +           |                      |               Host
> >>>> +           v                      v
> >>>> + .------------------.    .------------------.
> >>>> + | Host IOASID 201  |    | Host IOASID 202  |
> >>>> + '------------------'    '------------------'
> >>>> + |   IOASID set 1   |    |   IOASID set 2   |
> >>>> + '------------------'    '------------------'
> >>>> +
> >>>> +Guest PASID is treated as IOASID set private ID (SPID) within an
> >>>> +IOASID set, mappings between guest and host IOASIDs are stored in
> >>>> the +set for inquiry.
> >>>> +
> >>>> +IOASID APIs
> >>>> +===========
> >>>> +To get the IOASID APIs, users must #include <linux/ioasid.h>.
> >>>> These APIs +serve the following functionalities:
> >>>> +
> >>>> +  - IOASID allocation/Free
> >>>> +  - Group management in the form of ioasid_set
> >>>> +  - Private data storage and lookup
> >>>> +  - Reference counting
> >>>> +  - Event notification in case of state change    
> >> (a)  
> > got it
> >   
> >>>> +
> >>>> +IOASID Set Level APIs
> >>>> +--------------------------
> >>>> +For use cases such as guest SVA it is necessary to manage IOASIDs
> >>>> at +a group level. For example, VMs may allocate multiple IOASIDs
> >>>> for    
> >> I would use the introduced ioasid_set terminology instead of "group".  
> > Right, we already introduced it.
> >   
> >>>> +guest process address sharing (vSVA). It is imperative to enforce
> >>>> +VM-IOASID ownership such that malicious guest cannot target DMA    
> >>>
> >>> "a malicious guest"
> >>>     
> >>>> +traffic outside its own IOASIDs, or free an active IOASID belong
> >>>> to    
> >>>
> >>> "that belongs to"
> >>>     
> >>>> +another VM.
> >>>> +::
> >>>> +
> >>>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> >>>> u32 type)    
> >> what is this void *token? also the type may be explained here.  
> > token is explained in the text following API list. I can move it up.
> >   
> >>>> +
> >>>> + int ioasid_adjust_set(struct ioasid_set *set, int quota);    
> >>>
> >>> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to
> >>> be consistent with the rest of the API.
> >>>     
> >>>> +
> >>>> + void ioasid_set_get(struct ioasid_set *set)
> >>>> +
> >>>> + void ioasid_set_put(struct ioasid_set *set)
> >>>> +
> >>>> + void ioasid_set_get_locked(struct ioasid_set *set)
> >>>> +
> >>>> + void ioasid_set_put_locked(struct ioasid_set *set)
> >>>> +
> >>>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,    
> >>>
> >>> Might be nicer to keep the same argument names within the API. Here
> >>> "set" rather than "sdata".
> >>>     
> >>>> +                                void (*fn)(ioasid_t id, void
> >>>> *data),
> >>>> +                                void *data)    
> >>>
> >>> (alignment)
> >>>     
> >>>> +
> >>>> +
> >>>> +IOASID set concept is introduced to represent such IOASID groups.
> >>>> Each    
> >>>
> >>> Or just "IOASID sets represent such IOASID groups", but might be
> >>> redundant.
> >>>     
> >>>> +IOASID set is created with a token which can be one of the
> >>>> following +types:    
> >> I think this explanation should happen before the above function
> >> prototypes  
> > ditto.
> >   
> >>>> +
> >>>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> >>>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> >>>> +
> >>>> +The explicit MM token type is useful when multiple users of an
> >>>> IOASID +set under the same process need to communicate about their
> >>>> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
> >>>> can be associated +with the KVM instance for the same guest since
> >>>> they share a common mm_struct. +
> >>>> +The IOASID set APIs serve the following purposes:
> >>>> +
> >>>> + - Ownership/permission enforcement
> >>>> + - Take collective actions, e.g. free an entire set
> >>>> + - Event notifications within a set
> >>>> + - Look up a set based on token
> >>>> + - Quota enforcement    
> >>>
> >>> This paragraph could be earlier in the section    
> >>
> >> yes this is a kind of repetition of (a), above  
> > I meant to highlight on what the APIs do such that readers don't
> > need to read the code instead.
> >   
> >>>     
> >>>> +
> >>>> +Individual IOASID APIs
> >>>> +----------------------
> >>>> +Once an ioasid_set is created, IOASIDs can be allocated from the
> >>>> set. +Within the IOASID set namespace, set private ID (SPID) is
> >>>> supported. In +the VM use case, SPID can be used for storing guest
> >>>> PASID. +
> >>>> +::
> >>>> +
> >>>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> >>>> ioasid_t max,
> >>>> +                       void *private);
> >>>> +
> >>>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> >>>> +
> >>>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> >>>> +
> >>>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> >>>> +
> >>>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> >>>> +
> >>>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> >>>> +                   bool (*getter)(void *));
> >>>> +
> >>>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
> >>>> spid) +
> >>>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> >>>> +                        void *data);
> >>>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> >>>> +                        ioasid_t ssid);    
> >>>     
> >>> s/ssid/spid>  
> > got it
> >   
> >>>> +
> >>>> +
> >>>> +Notifications
> >>>> +-------------
> >>>> +An IOASID may have multiple users, each user may have hardware
> >>>> context +associated with an IOASID. When the status of an IOASID
> >>>> changes, +e.g. an IOASID is being freed, users need to be notified
> >>>> such that the +associated hardware context can be cleared,
> >>>> flushed, and drained. +
> >>>> +::
> >>>> +
> >>>> + int ioasid_register_notifier(struct ioasid_set *set, struct
> >>>> +                              notifier_block *nb)
> >>>> +
> >>>> + void ioasid_unregister_notifier(struct ioasid_set *set,
> >>>> +                                 struct notifier_block *nb)
> >>>> +
> >>>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> >>>> +                                 notifier_block *nb)
> >>>> +
> >>>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> >>>> +                                    notifier_block *nb)    
> >> the mm_struct prototypes may be justified  
> > This is the mm type token, i.e.
> >  - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> > I am not sure if it is better to keep the explanation in code or in
> > this document, certainly don't want to duplicate.  
> OK. Maybe add a text explaining why it makes sense to register a
> notifier at mm_struct granularity.
OK. I will add the following:
"_mm" flavor of the ioasid_register_notifier() APIs are used when
an IOASID user need to isten to the IOASID events belong to a
process but without the knowledge of the associated ioasid_set.

Thanks,

Jacob

> >   
> >>>> +
> >>>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> >>>> +                   unsigned int flags)    
> >> this one is not obvious either.  
> > Here I just wanted to list the API functions, perhaps readers can check
> > out the code comments?  
> OK never mind. The exercise is difficult anyway.
> >   
> >>>> +
> >>>> +
> >>>> +Events
> >>>> +~~~~~~
> >>>> +Notification events are pertinent to individual IOASIDs, they can
> >>>> be +one of the following:
> >>>> +
> >>>> + - ALLOC
> >>>> + - FREE
> >>>> + - BIND
> >>>> + - UNBIND
> >>>> +
> >>>> +Ordering
> >>>> +~~~~~~~~
> >>>> +Ordering is supported by IOASID notification priorities as the
> >>>> +following (in ascending order):
> >>>> +
> >>>> +::
> >>>> +
> >>>> + enum ioasid_notifier_prios {
> >>>> +        IOASID_PRIO_LAST,
> >>>> +        IOASID_PRIO_IOMMU,
> >>>> +        IOASID_PRIO_DEVICE,
> >>>> +        IOASID_PRIO_CPU,
> >>>> + };    
> >>
> >> Maybe:
> >> when registered, notifiers are assigned a priority that affect the
> >> call order. Notifiers with CPU priority get called before notifiers
> >> with device priority and so on.  
> > Sounds good.
> >   
> >>>> +
> >>>> +The typical use case is when an IOASID is freed due to an
> >>>> exception, DMA +source should be quiesced before tearing down
> >>>> other hardware contexts +in the system. This will reduce the churn
> >>>> in handling faults. DMA work +submission is performed by the CPU
> >>>> which is granted higher priority than +devices.
> >>>> +
> >>>> +
> >>>> +Scopes
> >>>> +~~~~~~
> >>>> +There are two types of notifiers in IOASID core: system-wide and
> >>>> +ioasid_set-wide.
> >>>> +
> >>>> +System-wide notifier is catering for users that need to handle all
> >>>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> >>>> +
> >>>> +Per ioasid_set notifier can be used by VM specific components
> >>>> such as +KVM. After all, each KVM instance only cares about
> >>>> IOASIDs within its +own set.
> >>>> +
> >>>> +
> >>>> +Atomicity
> >>>> +~~~~~~~~~
> >>>> +IOASID notifiers are atomic due to spinlocks used inside the
> >>>> IOASID +core. For tasks cannot be completed in the notifier
> >>>> handler, async work    
> >>>
> >>> "tasks that cannot be"
> >>>     
> >>>> +can be submitted to complete the work later as long as there is no
> >>>> +ordering requirement.
> >>>> +
> >>>> +Reference counting
> >>>> +------------------
> >>>> +IOASID lifecycle management is based on reference counting. Users
> >>>> of +IOASID intend to align lifecycle with the IOASID need to hold    
> >>>
> >>> "who intend to"
> >>>     
> >>>> +reference of the IOASID. IOASID will not be returned to the pool
> >>>> for    
> >>>
> >>> "a reference to the IOASID. The IOASID"
> >>>     
> >>>> +allocation until all references are dropped. Calling ioasid_free()
> >>>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> >>>> +reference. ioasid_get() is not allowed once an IOASID is in the
> >>>> +FREE_PENDING state.
> >>>> +
> >>>> +Event notifications are used to inform users of IOASID status
> >>>> change. +IOASID_FREE event prompts users to drop their references
> >>>> after +clearing its context.
> >>>> +
> >>>> +For example, on VT-d platform when an IOASID is freed, teardown
> >>>> +actions are performed on KVM, device driver, and IOMMU driver.
> >>>> +KVM shall register notifier block with::
> >>>> +
> >>>> + static struct notifier_block pasid_nb_kvm = {
> >>>> +        .notifier_call = pasid_status_change_kvm,
> >>>> +        .priority      = IOASID_PRIO_CPU,
> >>>> + };
> >>>> +
> >>>> +VDCM driver shall register notifier block with::
> >>>> +
> >>>> + static struct notifier_block pasid_nb_vdcm = {
> >>>> +        .notifier_call = pasid_status_change_vdcm,
> >>>> +        .priority      = IOASID_PRIO_DEVICE,
> >>>> + };    
> >> not sure those code snippets are really useful. Maybe simply say who
> >> is supposed to use each prio.  
> > Agreed, not all the bits in the snippets are explained. I will explain
> > KVM and VDCM need to use priority to ensure call order.
> >   
> >>>> +
> >>>> +In both cases, notifier blocks shall be registered on the IOASID
> >>>> set +such that *only* events from the matching VM is received.
> >>>> +
> >>>> +If KVM attempts to register notifier block before the IOASID set
> >>>> is +created for the MM token, the notifier block will be placed on
> >>>> a    
> >> using the MM token  
> > sounds good
> >   
> >>>> +pending list inside IOASID core. Once the token matching IOASID
> >>>> set +is created, IOASID will register the notifier block
> >>>> automatically.    
> >> Is this implementation mandated? Can't you enforce the ioasid_set to
> >> be created before the notifier gets registered?  
> >>>> +IOASID core does not replay events for the existing IOASIDs in the
> >>>> +set. For IOASID set of MM type, notification blocks can be
> >>>> registered +on empty sets only. This is to avoid lost events.
> >>>> +
> >>>> +IOMMU driver shall register notifier block on global chain::
> >>>> +
> >>>> + static struct notifier_block pasid_nb_vtd = {
> >>>> +        .notifier_call = pasid_status_change_vtd,
> >>>> +        .priority      = IOASID_PRIO_IOMMU,
> >>>> + };
> >>>> +
> >>>> +Custom allocator APIs
> >>>> +---------------------
> >>>> +
> >>>> +::
> >>>> +
> >>>> + int ioasid_register_allocator(struct ioasid_allocator_ops
> >>>> *allocator); +
> >>>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops
> >>>> *allocator); +
> >>>> +Allocator Choices
> >>>> +~~~~~~~~~~~~~~~~~
> >>>> +IOASIDs are allocated for both host and guest SVA/IOVA usage.
> >>>> However, +allocators can be different. For example, on VT-d guest
> >>>> PASID +allocation must be performed via a virtual command
> >>>> interface which is +emulated by VMM.
> >>>> +
> >>>> +IOASID core has the notion of "custom allocator" such that guest
> >>>> can +register virtual command allocator that precedes the default
> >>>> one. +
> >>>> +Namespaces
> >>>> +~~~~~~~~~~
> >>>> +IOASIDs are limited system resources that default to 20 bits in
> >>>> +size. Since each device has its own table, theoretically the
> >>>> namespace +can be per device also. However, for security reasons
> >>>> sharing PASID +tables among devices are not good for isolation.
> >>>> Therefore, IOASID +namespace is system-wide.    
> >>>
> >>> I don't follow this development. Having per-device PASID table
> >>> would work fine for isolation (assuming no hardware bug
> >>> necessitating IOMMU groups). If I remember correctly IOASID space
> >>> was chosen to be OS-wide because it simplifies the management code
> >>> (single PASID per task), and it is system-wide across VMs only in
> >>> the case of VT-d scalable mode.   
> >>>> +
> >>>> +There are also other reasons to have this simpler system-wide
> >>>> +namespace. Take VT-d as an example, VT-d supports shared workqueue
> >>>> +and ENQCMD[1] where one IOASID could be used to submit work on    
> >>>
> >>> Maybe use the Sphinx glossary syntax rather than "[1]"
> >>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
> >>>     
> >>>> +multiple devices that are shared with other VMs. This requires
> >>>> IOASID +to be system-wide. This is also the reason why guests must
> >>>> use an +emulated virtual command interface to allocate IOASID from
> >>>> the host. +
> >>>> +
> >>>> +Life cycle
> >>>> +==========
> >>>> +This section covers IOASID lifecycle management for both
> >>>> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is
> >>>> directly hooked +up with IOMMU driver, therefore the process
> >>>> address space (MM) +lifecycle is aligned with IOASID.    
> >> therefore the IOASID lifecyle matches the process address space (MM)
> >> lifecyle?  
> > Sounds good.
> >   
> >>>> +
> >>>> +However, guest MMU notifier is not available to host IOMMU
> >>>> driver,    
> >> the guest MMU notifier  
> >>>> +when guest MM terminates unexpectedly, the events have to go
> >>>> through    
> >> the guest MM  
> >>>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also
> >>>> more +parties involved in guest SVA, e.g. on Intel VT-d platform,
> >>>> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
> >>>> +
> >>>> +Native IOASID Life Cycle (VT-d Example)
> >>>> +---------------------------------------
> >>>> +
> >>>> +The normal flow of native SVA code with Intel Data Streaming
> >>>> +Accelerator(DSA) [2] as example:
> >>>> +
> >>>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> >>>> +2. DSA driver allocate WQ, do sva_bind_device();
> >>>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> >>>> +   mmu_notifier_get()
> >>>> +4. DMA starts by DSA driver userspace
> >>>> +5. DSA userspace close FD
> >>>> +6. DSA/uacce kernel driver handles FD.close()
> >>>> +7. DSA driver stops DMA
> >>>> +8. DSA driver calls sva_unbind_device();
> >>>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> >>>> +   TLBs. mmu_notifier_put() called.
> >>>> +10. mmu_notifier.release() called, IOMMU SVA code calls
> >>>> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
> >>>> +
> >>>> +::
> >>>> +    
> >>>
> >>> Use a footnote?
> >>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
> >>>    
> >>>> +   * With ENQCMD, PASID used on VT-d is not released in
> >>>> mmu_notifier() but
> >>>> +     mmdrop(). mmdrop comes after FD close. Should not matter.    
> >>>
> >>> "comes after FD close, which doesn't make a difference?"
> >>> The following might not be necessary since early process
> >>> termination is described later.
> >>>     
> >>>> +     If the user process dies unexpectedly, Step #10 may come
> >>>> before
> >>>> +     Step #5, in between, all DMA faults discarded. PRQ responded
> >>>> with    
> >>>
> >>> PRQ hasn't been defined in this document.
> >>>     
> >>>> +     code INVALID REQUEST.
> >>>> +
> >>>> +During the normal teardown, the following three steps would
> >>>> happen in +order:    
> >> can't this be illustrated in the above 1-11 sequence, just adding
> >> NORMAL TEARDONW before #7?  
> >>>> +
> >>>> +1. Device driver stops DMA request
> >>>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
> >>>> in-flight
> >>>> +   requests.
> >>>> +3. IOASID freed
> >>>> +    
> >> Then you can just focus on abnormal termination  
> > Yes, will refer to the steps starting #7. These can be removed.
> >   
> >>>> +Exception happens when process terminates *before* device driver
> >>>> stops +DMA and call IOMMU driver to unbind. The flow of process
> >>>> exists are as    
> >> Can't this be explained with something simpler looking at the steps
> >> 1-11?  
> > It meant to be educational given this level of details. Simpler
> > steps are labeled with (1) (2) (3). Perhaps these labels didn't stand
> > out right? I will use the steps in the 1-11 sequence.
> >   
> >>>
> >>> "exits"
> >>>     
> >>>> +follows:
> >>>> +
> >>>> +::
> >>>> +
> >>>> +   do_exit() {
> >>>> +        exit_mm() {
> >>>> +                mm_put();
> >>>> +                exit_mmap() {
> >>>> +                        intel_invalidate_range() //mmu notifier
> >>>> +                        tlb_finish_mmu()
> >>>> +                        mmu_notifier_release(mm) {
> >>>> +                                intel_iommu_release() {
> >>>> +   [2]
> >>>> intel_iommu_teardown_pasid();    
> >>>
> >>> Parentheses might be better than square brackets for step numbers
> >>>     
> >>>> +                                        intel_iommu_flush_tlbs();
> >>>> +                                }
> >>>> +                                // tlb_invalidate_range cb removed
> >>>> +                        }
> >>>> +                        unmap_vmas();
> >>>> +                        free_pgtables(); // IOMMU cannot walk PGT
> >>>> after this
> >>>> +                };
> >>>> +        }
> >>>> +        exit_files(tsk) {
> >>>> +                close_files() {
> >>>> +                        dsa_close();
> >>>> +   [1]                  dsa_stop_dma();
> >>>> +                        intel_svm_unbind_pasid(); //nothing to do
> >>>> +                }
> >>>> +        }
> >>>> +   }
> >>>> +
> >>>> +   mmdrop() /* some random time later, lazy mm user */ {
> >>>> +        mm_free_pgd();
> >>>> +        destroy_context(mm); {
> >>>> +   [3]          ioasid_free();
> >>>> +        }
> >>>> +   }
> >>>> +
> >>>> +As shown in the list above, step #2 could happen before
> >>>> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> >>>> +
> >>>> +Also notice that TLB invalidation occurs at mmu_notifier
> >>>> +invalidate_range callback as well as the release callback. The
> >>>> reason +is that release callback will delete IOMMU driver from the
> >>>> notifier +chain which may skip invalidate_range() calls during the
> >>>> exit path. +
> >>>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall
> >>>> disable    
> >> UR?  
> > Unrecoverable, mentioned in the previous paragraph.
> >   
> >>>> +fault reporting after free and before unbind.
> >>>> +
> >>>> +Guest IOASID Life Cycle (VT-d Example)
> >>>> +--------------------------------------
> >>>> +Guest IOASID life cycle starts with guest driver open(), this
> >>>> could be +uacce or individual accelerator driver such as DSA. At
> >>>> FD open, +sva_bind_device() is called which triggers a series of
> >>>> actions. +
> >>>> +The example below is an illustration of *normal* operations that
> >>>> +involves *all* the SW components in VT-d. The flow can be simpler
> >>>> if +no ENQCMD is supported.
> >>>> +
> >>>> +::
> >>>> +
> >>>> +     VFIO        IOMMU        KVM        VDCM        IOASID
> >>>> Ref
> >>>> +   ..................................................................
> >>>> +   1             ioasid_register_notifier/_mm()
> >>>> +   2 ioasid_alloc()
> >>>> 1
> >>>> +   3 bind_gpasid()
> >>>> +   4             iommu_bind()->ioasid_get()
> >>>> 2
> >>>> +   5             ioasid_notify(BIND)
> >>>> +   6                          -> ioasid_get()
> >>>> 3
> >>>> +   7                          -> vmcs_update_atomic()
> >>>> +   8 mdev_write(gpasid)
> >>>> +   9                                    hpasid=
> >>>> +   10                                   find_by_spid(gpasid)
> >>>> 4
> >>>> +   11                                   vdev_write(hpasid)
> >>>> +   12 -------- GUEST STARTS DMA --------------------------
> >>>> +   13 -------- GUEST STOPS DMA --------------------------
> >>>> +   14 mdev_clear(gpasid)
> >>>> +   15                                   vdev_clear(hpasid)
> >>>> +   16
> >>>> ioasid_put()               3
> >>>> +   17 unbind_gpasid()
> >>>> +   18            iommu_ubind()
> >>>> +   19            ioasid_notify(UNBIND)
> >>>> +   20                          -> vmcs_update_atomic()
> >>>> +   21                          ->
> >>>> ioasid_put()                     2
> >>>> +   22
> >>>> ioasid_free()                                                1
> >>>> +   23
> >>>> ioasid_put()                                      0
> >>>> +   24                                                 Reclaimed
> >>>> +   -------------- New Life Cycle Begin
> >>>> ----------------------------
> >>>> +   1  ioasid_alloc()  
> >>>> ->           1 +  
> >>>> +   Note: IOASID Notification Events: FREE, BIND, UNBIND
> >>>> +
> >>>> +Exception cases arise when a guest crashes or a malicious guest
> >>>> +attempts to cause disruption on the host system. The fault
> >>>> handling +rules are:
> >>>> +
> >>>> +1. IOASID free must *always* succeed.
> >>>> +2. An inactive period may be required before the freed IOASID is
> >>>> +   reclaimed. During this period, consumers of IOASID perform
> >>>> cleanup. +3. Malfunction is limited to the guest owned resources
> >>>> for all
> >>>> +   programming errors.
> >>>> +
> >>>> +The primary source of exception is when the following are out of
> >>>> +order:
> >>>> +
> >>>> +1. Start/Stop of DMA activity
> >>>> +   (Guest device driver, mdev via VFIO)    
> >> please explain the meaning of what is inside (): initiator?  
> >>>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> >>>> +   (Host IOMMU driver bind/unbind)
> >>>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> >>>> +   case of ENQCMD
> >>>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> >>>> +5. IOASID alloc/free (Host IOASID)
> >>>> +
> >>>> +VFIO is the *only* user-kernel interface, which is ultimately
> >>>> +responsible for exception handlings.    
> >>>
> >>> "handling"
> >>>     
> >>>> +
> >>>> +#1 is processed the same way as the assigned device today based on
> >>>> +device file descriptors and events. There is no special handling.
> >>>> +
> >>>> +#3 is based on bind/unbind events emitted by #2.
> >>>> +
> >>>> +#4 is naturally aligned with IOASID life cycle in that an illegal
> >>>> +guest PASID programming would fail in obtaining reference of the
> >>>> +matching host IOASID.
> >>>> +
> >>>> +#5 is similar to #4. The fault will be reported to the user if
> >>>> PASID +used in the ENQCMD is not set up in VMCS PASID translation
> >>>> table. +
> >>>> +Therefore, the remaining out of order problem is between #2 and
> >>>> +#5. I.e. unbind vs. free. More specifically, free before unbind.
> >>>> +
> >>>> +IOASID notifier and refcounting are used to ensure order.
> >>>> Following +a publisher-subscriber pattern where:    
> >> with the following actors:  
> >>>> +
> >>>> +- Publishers: VFIO & IOMMU
> >>>> +- Subscribers: KVM, VDCM, IOMMU    
> >> this may be introduced before.  
> >>>> +
> >>>> +IOASID notifier is atomic which requires subscribers to do quick
> >>>> +handling of the event in the atomic context. Workqueue can be
> >>>> used for +any processing that requires thread context.    
> >> repetition of what was said before.
> >>  IOASID reference must be  
> > Right, will remove.
> >   
> >>>> +acquired before receiving the FREE event. The reference must be
> >>>> +dropped at the end of the processing in order to return the
> >>>> IOASID to +the pool.
> >>>> +
> >>>> +Let's examine the IOASID life cycle again when free happens
> >>>> *before* +unbind. This could be a result of misbehaving guests or
> >>>> crash. Assuming +VFIO cannot enforce unbind->free order. Notice
> >>>> that the setup part up +until step #12 is identical to the normal
> >>>> case, the flow below starts +with step 13.
> >>>> +
> >>>> +::
> >>>> +
> >>>> +     VFIO        IOMMU        KVM        VDCM        IOASID
> >>>> Ref
> >>>> +   ..................................................................
> >>>> +   13 -------- GUEST STARTS DMA --------------------------
> >>>> +   14 -------- *GUEST MISBEHAVES!!!* ----------------
> >>>> +   15 ioasid_free()
> >>>> +   16
> >>>> ioasid_notify(FREE)
> >>>> +   17
> >>>> mark_ioasid_inactive[1]
> >>>> +   18                          kvm_nb_handler(FREE)
> >>>> +   19                          vmcs_update_atomic()
> >>>> +   20                          ioasid_put_locked()   ->
> >>>> 3
> >>>> +   21                                   vdcm_nb_handler(FREE)
> >>>> +   22            iomm_nb_handler(FREE)
> >>>> +   23 ioasid_free() returns[2]          schedule_work()
> >>>> 2
> >>>> +   24            schedule_work()        vdev_clear_wk(hpasid)
> >>>> +   25            teardown_pasid_wk()
> >>>> +   26                                   ioasid_put() ->
> >>>> 1
> >>>> +   27            ioasid_put()
> >>>> 0
> >>>> +   28                                                 Reclaimed
> >>>> +   29 unbind_gpasid()
> >>>> +   30            iommu_unbind()->ioasid_find() Fails[3]
> >>>> +   -------------- New Life Cycle Begin
> >>>> ---------------------------- +
> >>>> +Note:
> >>>> +
> >>>> +1. By marking IOASID inactive at step #17, no new references can
> >>>> be    
> >>>
> >>> Is "inactive" FREE_PENDING?
> >>>     
> >>>> +   held. ioasid_get/find() will return -ENOENT;
> >>>> +2. After step #23, all events can go out of order. Shall not
> >>>> affect
> >>>> +   the outcome.
> >>>> +3. IOMMU driver fails to find private data for unbinding. If
> >>>> unbind is
> >>>> +   called after the same IOASID is allocated for the same guest
> >>>> again,
> >>>> +   this is a programming error. The damage is limited to the guest
> >>>> +   itself since unbind performs permission checking based on the
> >>>> +   IOASID set associated with the guest process.
> >>>> +
> >>>> +KVM PASID Translation Table Updates
> >>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> +Per VM PASID translation table is maintained by KVM in order to
> >>>> +support ENQCMD in the guest. The table contains host-guest PASID
> >>>> +translations to be consumed by CPU ucode. The synchronization of
> >>>> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
> >>>> atomic +notifiers are used. KVM must register IOASID notifier per
> >>>> VM instance +during launch time. The following events are handled:
> >>>> +
> >>>> +1. BIND/UNBIND
> >>>> +2. FREE
> >>>> +
> >>>> +Rules:
> >>>> +
> >>>> +1. Multiple devices can bind with the same PASID, this can be
> >>>> different PCI
> >>>> +   devices or mdevs within the same PCI device. However, only the
> >>>> +   *first* BIND and *last* UNBIND emit notifications.
> >>>> +2. IOASID code is responsible for ensuring the correctness of H-G
> >>>> +   PASID mapping. There is no need for KVM to validate the
> >>>> +   notification data.
> >>>> +3. When UNBIND happens *after* FREE, KVM will see error in
> >>>> +   ioasid_get() even when the reclaim is not done. IOMMU driver
> >>>> will
> >>>> +   also avoid sending UNBIND if the PASID is already FREE.
> >>>> +4. When KVM terminates *before* FREE & UNBIND, references will be
> >>>> +   dropped for all host PASIDs.
> >>>> +
> >>>> +VDCM PASID Programming
> >>>> +~~~~~~~~~~~~~~~~~~~~~~
> >>>> +VDCM composes virtual devices and exposes them to the guests. When
> >>>> +the guest allocates a PASID then program it to the virtual
> >>>> device, VDCM    
> >> programs as well  
> >>>> +intercepts the programming attempt then program the matching
> >>>> host    
> >>>
> >>> "programs"
> >>>
> >>> Thanks,
> >>> Jean
> >>>     
> >>>> +PASID on to the hardware.
> >>>> +Conversely, when a device is going away, VDCM must be informed
> >>>> such +that PASID context on the hardware can be cleared. There
> >>>> could be +multiple mdevs assigned to different guests in the same
> >>>> VDCM. Since +the PASID table is shared at PCI device level, lazy
> >>>> clearing is not +secure. A malicious guest can attack by using
> >>>> newly freed PASIDs that +are allocated by another guest.
> >>>> +
> >>>> +By holding a reference of the PASID until VDCM cleans up the HW
> >>>> context, +it is guaranteed that PASID life cycles do not cross
> >>>> within the same +device.
> >>>> +
> >>>> +
> >>>> +Reference
> >>>> +====================================================
> >>>> +1.
> >>>> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> >>>> + +2.
> >>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> >>>> + +3.
> >>>> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> >>>> -- 2.7.4    
> >>
> >> Thanks
> >>
> >> Eric  
> >>>>    
> >>>     
> >>
> >> _______________________________________________
> >> iommu mailing list
> >> iommu@lists.linux-foundation.org
> >> https://lists.linuxfoundation.org/mailman/listinfo/iommu  
> > [Jacob Pan]
> >   
> Thanks
> 
> Eric
> 

[Jacob Pan]

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Reply via email to