On Mon, 7 Sep 2020 10:03:39 +0200 Auger Eric <eric.au...@redhat.com> wrote:
> Hi Jacob, > > On 9/1/20 6:56 PM, Jacob Pan wrote: > > Hi Eric, > > > > On Thu, 27 Aug 2020 18:21:07 +0200 > > Auger Eric <eric.au...@redhat.com> wrote: > > > >> Hi Jacob, > >> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote: > >>> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote: > >>>> IOASID is used to identify address spaces that can be targeted by > >>>> device DMA. It is a system-wide resource that is essential to its > >>>> many users. This document is an attempt to help developers from > >>>> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s > >>>> Scalable IO Virtualization (SIOV) enabled platforms are the > >>>> primary users of IOASID. Examples of how SIOV components interact > >>>> with IOASID APIs are provided in that many APIs are driven by the > >>>> requirements from SIOV. > >>>> > >>>> Signed-off-by: Liu Yi L <yi.l....@intel.com> > >>>> Signed-off-by: Wu Hao <hao...@intel.com> > >>>> Signed-off-by: Jacob Pan <jacob.jun....@linux.intel.com> > >>>> --- > >>>> Documentation/ioasid.rst | 618 > >>>> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, > >>>> 618 insertions(+) create mode 100644 Documentation/ioasid.rst > >>>> > >>>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst > >>> > >>> Thanks for writing this up. Should it go to > >>> Documentation/driver-api/, or Documentation/driver-api/iommu/? I > >>> think this also needs to Cc linux-...@vger.kernel.org and > >>> cor...@lwn.net > >>>> new file mode 100644 > >>>> index 000000000000..b6a8cdc885ff > >>>> --- /dev/null > >>>> +++ b/Documentation/ioasid.rst > >>>> @@ -0,0 +1,618 @@ > >>>> +.. ioasid: > >>>> + > >>>> +===================================== > >>>> +IO Address Space ID > >>>> +===================================== > >>>> + > >>>> +IOASID is a generic name for PCIe Process Address ID (PASID) or > >>>> ARM +SMMU sub-stream ID. An IOASID identifies an address space > >>>> that DMA > >>> > >>> "SubstreamID" > >> On ARM if we don't use PASIDs we have streamids (SID) which can also > >> identify address spaces that DMA requests can target. So maybe this > >> definition is not sufficient. > >> > > According to SMMU spec, the SubstreamID is equivalent to PASID. My > > understanding is that SID is equivalent to PCI requester ID that > > identifies stage 2. Do you plan to use IOASID for stage 2? > No. So actually if PASID is not used we still have a default single > IOASID matching the single context. So that may be fine as a definition. OK, thanks for explaining. > > IOASID is mostly for SVA and DMA request w/ PASID. > > > >>> > >>>> +requests can target. > >>>> + > >>>> +The primary use cases for IOASID are Shared Virtual Address (SVA) > >>>> and +IO Virtual Address (IOVA). However, the requirements for > >>>> IOASID > >>> > >>> IOVA alone isn't a use case, maybe "multiple IOVA spaces per > >>> device"? > >>>> +management can vary among hardware architectures. > >>>> + > >>>> +This document covers the generic features supported by IOASID > >>>> +APIs. Vendor-specific use cases are also illustrated with Intel's > >>>> VT-d +based platforms as the first example. > >>>> + > >>>> +.. 
contents:: :local: > >>>> + > >>>> +Glossary > >>>> +======== > >>>> +PASID - Process Address Space ID > >>>> + > >>>> +IOASID - IO Address Space ID (generic term for PCIe PASID and > >>>> +sub-stream ID in SMMU) > >>> > >>> "SubstreamID" > >>> > >>>> + > >>>> +SVA/SVM - Shared Virtual Addressing/Memory > >>>> + > >>>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission > >>>> [1] > >>> > >>> Maybe drop the "New", to keep the documentation perennial. It might > >>> be good to add internal links here to the specifications URLs at > >>> the bottom. > >>>> + > >>>> +DSA - Intel Data Streaming Accelerator [2] > >>>> + > >>>> +VDCM - Virtual device composition module [3] > >>>> + > >>>> +SIOV - Intel Scalable IO Virtualization > >>>> + > >>>> + > >>>> +Key Concepts > >>>> +============ > >>>> + > >>>> +IOASID Set > >>>> +----------- > >>>> +An IOASID set is a group of IOASIDs allocated from the system-wide > >>>> +IOASID pool. An IOASID set is created and can be identified by a > >>>> +token of u64. Refer to IOASID set APIs for more details. > >>> > >>> Identified either by an u64 or an mm_struct, right? Maybe just > >>> drop the second sentence if it's detailed in the IOASID set section > >>> below. > >>>> + > >>>> +IOASID set is particularly useful for guest SVA where each guest > >>>> could +have its own IOASID set for security and efficiency reasons. > >>>> + > >>>> +IOASID Set Private ID (SPID) > >>>> +---------------------------- > >>>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to > >>>> a +system-wide IOASID but the namespace of SPID is within its > >>>> IOASID +set. > >>> > >>> The intro isn't super clear. Perhaps this is simpler: > >>> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a > >>> single system-wide IOASID." > >> or, "within an ioasid set, each ioasid can be associated with an alias > >> ID, named SPID." > > I don't have strong opinion, I feel it is good to explain the > > relationship between SPID and IOASID in both directions, how about add? > > " Conversely, each IOASID is associated with an alias ID, named SPID." > yep. I amy suggest: each IOASID may be associated with an alias ID, > local to the IOASID set, named SPID. This is more precise. thanks for suggesting that. > > > >>> > >>>> SPIDs can be used as guest IOASIDs where each guest could do > >>>> +IOASID allocation from its own pool and map them to host physical > >>>> +IOASIDs. SPIDs are particularly useful for supporting live > >>>> migration +where decoupling guest and host physical resources are > >>>> necessary. + > >>>> +For example, two VMs can both allocate guest PASID/SPID #101 but > >>>> map to +different host PASIDs #201 and #202 respectively as shown > >>>> in the +diagram below. > >>>> +:: > >>>> + > >>>> + .------------------. .------------------. > >>>> + | VM 1 | | VM 2 | > >>>> + | | | | > >>>> + |------------------| |------------------| > >>>> + | GPASID/SPID 101 | | GPASID/SPID 101 | > >>>> + '------------------' -------------------' Guest > >>>> + __________|______________________|______________________ > >>>> + | | Host > >>>> + v v > >>>> + .------------------. .------------------. 
> >>>> + | Host IOASID 201 | | Host IOASID 202 | > >>>> + '------------------' '------------------' > >>>> + | IOASID set 1 | | IOASID set 2 | > >>>> + '------------------' '------------------' > >>>> + > >>>> +Guest PASID is treated as IOASID set private ID (SPID) within an > >>>> +IOASID set, mappings between guest and host IOASIDs are stored in > >>>> the +set for inquiry. > >>>> + > >>>> +IOASID APIs > >>>> +=========== > >>>> +To get the IOASID APIs, users must #include <linux/ioasid.h>. > >>>> These APIs +serve the following functionalities: > >>>> + > >>>> + - IOASID allocation/Free > >>>> + - Group management in the form of ioasid_set > >>>> + - Private data storage and lookup > >>>> + - Reference counting > >>>> + - Event notification in case of state change > >> (a) > > got it > > > >>>> + > >>>> +IOASID Set Level APIs > >>>> +-------------------------- > >>>> +For use cases such as guest SVA it is necessary to manage IOASIDs > >>>> at +a group level. For example, VMs may allocate multiple IOASIDs > >>>> for > >> I would use the introduced ioasid_set terminology instead of "group". > > Right, we already introduced it. > > > >>>> +guest process address sharing (vSVA). It is imperative to enforce > >>>> +VM-IOASID ownership such that malicious guest cannot target DMA > >>> > >>> "a malicious guest" > >>> > >>>> +traffic outside its own IOASIDs, or free an active IOASID belong > >>>> to > >>> > >>> "that belongs to" > >>> > >>>> +another VM. > >>>> +:: > >>>> + > >>>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, > >>>> u32 type) > >> what is this void *token? also the type may be explained here. > > token is explained in the text following API list. I can move it up. > > > >>>> + > >>>> + int ioasid_adjust_set(struct ioasid_set *set, int quota); > >>> > >>> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to > >>> be consistent with the rest of the API. > >>> > >>>> + > >>>> + void ioasid_set_get(struct ioasid_set *set) > >>>> + > >>>> + void ioasid_set_put(struct ioasid_set *set) > >>>> + > >>>> + void ioasid_set_get_locked(struct ioasid_set *set) > >>>> + > >>>> + void ioasid_set_put_locked(struct ioasid_set *set) > >>>> + > >>>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata, > >>> > >>> Might be nicer to keep the same argument names within the API. Here > >>> "set" rather than "sdata". > >>> > >>>> + void (*fn)(ioasid_t id, void > >>>> *data), > >>>> + void *data) > >>> > >>> (alignment) > >>> > >>>> + > >>>> + > >>>> +IOASID set concept is introduced to represent such IOASID groups. > >>>> Each > >>> > >>> Or just "IOASID sets represent such IOASID groups", but might be > >>> redundant. > >>> > >>>> +IOASID set is created with a token which can be one of the > >>>> following +types: > >> I think this explanation should happen before the above function > >> prototypes > > ditto. > > > >>>> + > >>>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value) > >>>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct) > >>>> + > >>>> +The explicit MM token type is useful when multiple users of an > >>>> IOASID +set under the same process need to communicate about their > >>>> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest > >>>> can be associated +with the KVM instance for the same guest since > >>>> they share a common mm_struct. + > >>>> +The IOASID set APIs serve the following purposes: > >>>> + > >>>> + - Ownership/permission enforcement > >>>> + - Take collective actions, e.g. 
free an entire set > >>>> + - Event notifications within a set > >>>> + - Look up a set based on token > >>>> + - Quota enforcement > >>> > >>> This paragraph could be earlier in the section > >> > >> yes this is a kind of repetition of (a), above > > I meant to highlight on what the APIs do such that readers don't > > need to read the code instead. > > > >>> > >>>> + > >>>> +Individual IOASID APIs > >>>> +---------------------- > >>>> +Once an ioasid_set is created, IOASIDs can be allocated from the > >>>> set. +Within the IOASID set namespace, set private ID (SPID) is > >>>> supported. In +the VM use case, SPID can be used for storing guest > >>>> PASID. + > >>>> +:: > >>>> + > >>>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, > >>>> ioasid_t max, > >>>> + void *private); > >>>> + > >>>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid); > >>>> + > >>>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid); > >>>> + > >>>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid); > >>>> + > >>>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid); > >>>> + > >>>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, > >>>> + bool (*getter)(void *)); > >>>> + > >>>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t > >>>> spid) + > >>>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid, > >>>> + void *data); > >>>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid, > >>>> + ioasid_t ssid); > >>> > >>> s/ssid/spid> > > got it > > > >>>> + > >>>> + > >>>> +Notifications > >>>> +------------- > >>>> +An IOASID may have multiple users, each user may have hardware > >>>> context +associated with an IOASID. When the status of an IOASID > >>>> changes, +e.g. an IOASID is being freed, users need to be notified > >>>> such that the +associated hardware context can be cleared, > >>>> flushed, and drained. + > >>>> +:: > >>>> + > >>>> + int ioasid_register_notifier(struct ioasid_set *set, struct > >>>> + notifier_block *nb) > >>>> + > >>>> + void ioasid_unregister_notifier(struct ioasid_set *set, > >>>> + struct notifier_block *nb) > >>>> + > >>>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct > >>>> + notifier_block *nb) > >>>> + > >>>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct > >>>> + notifier_block *nb) > >> the mm_struct prototypes may be justified > > This is the mm type token, i.e. > > - IOASID_SET_TYPE_MM (Set token is a mm_struct) > > I am not sure if it is better to keep the explanation in code or in > > this document, certainly don't want to duplicate. > OK. Maybe add a text explaining why it makes sense to register a > notifier at mm_struct granularity. OK. I will add the following: The "_mm" flavor of the ioasid_register_notifier() APIs is used when an IOASID user needs to listen to the IOASID events belonging to a process but without the knowledge of the associated ioasid_set. Thanks, Jacob > > > >>>> + > >>>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, > >>>> + unsigned int flags) > >> this one is not obvious either. > > Here I just wanted to list the API functions, perhaps readers can check > > out the code comments? > OK never mind. The exercise is difficult anyway.
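To make that added text more concrete, the doc could also carry a small, untested sketch of the subscriber side. Only the prototypes already quoted in this document are assumed (pasid_nb_kvm, pasid_status_change_kvm and IOASID_PRIO_CPU appear later in the doc); the handler body and the kvm_example_register() wrapper are placeholders:

    #include <linux/ioasid.h>
    #include <linux/notifier.h>

    /*
     * Sketch only: a KVM-like consumer listens to IOASID events for one
     * guest process without knowing the associated ioasid_set.
     */
    static int pasid_status_change_kvm(struct notifier_block *nb,
                                       unsigned long cmd, void *data)
    {
            /* cmd carries the event (e.g. FREE/BIND/UNBIND), data the
             * per-IOASID information; handler body is a placeholder. */
            return NOTIFY_OK;
    }

    static struct notifier_block pasid_nb_kvm = {
            .notifier_call  = pasid_status_change_kvm,
            .priority       = IOASID_PRIO_CPU,
    };

    /* hypothetical call site: at VM launch, with the guest mm at hand */
    static int kvm_example_register(struct mm_struct *mm)
    {
            return ioasid_register_notifier_mm(mm, &pasid_nb_kvm);
    }

The same pattern would apply to ioasid_register_notifier() when the caller does know the ioasid_set.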
> > > >>>> + > >>>> + > >>>> +Events > >>>> +~~~~~~ > >>>> +Notification events are pertinent to individual IOASIDs, they can > >>>> be +one of the following: > >>>> + > >>>> + - ALLOC > >>>> + - FREE > >>>> + - BIND > >>>> + - UNBIND > >>>> + > >>>> +Ordering > >>>> +~~~~~~~~ > >>>> +Ordering is supported by IOASID notification priorities as the > >>>> +following (in ascending order): > >>>> + > >>>> +:: > >>>> + > >>>> + enum ioasid_notifier_prios { > >>>> + IOASID_PRIO_LAST, > >>>> + IOASID_PRIO_IOMMU, > >>>> + IOASID_PRIO_DEVICE, > >>>> + IOASID_PRIO_CPU, > >>>> + }; > >> > >> Maybe: > >> when registered, notifiers are assigned a priority that affect the > >> call order. Notifiers with CPU priority get called before notifiers > >> with device priority and so on. > > Sounds good. > > > >>>> + > >>>> +The typical use case is when an IOASID is freed due to an > >>>> exception, DMA +source should be quiesced before tearing down > >>>> other hardware contexts +in the system. This will reduce the churn > >>>> in handling faults. DMA work +submission is performed by the CPU > >>>> which is granted higher priority than +devices. > >>>> + > >>>> + > >>>> +Scopes > >>>> +~~~~~~ > >>>> +There are two types of notifiers in IOASID core: system-wide and > >>>> +ioasid_set-wide. > >>>> + > >>>> +System-wide notifier is catering for users that need to handle all > >>>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs. > >>>> + > >>>> +Per ioasid_set notifier can be used by VM specific components > >>>> such as +KVM. After all, each KVM instance only cares about > >>>> IOASIDs within its +own set. > >>>> + > >>>> + > >>>> +Atomicity > >>>> +~~~~~~~~~ > >>>> +IOASID notifiers are atomic due to spinlocks used inside the > >>>> IOASID +core. For tasks cannot be completed in the notifier > >>>> handler, async work > >>> > >>> "tasks that cannot be" > >>> > >>>> +can be submitted to complete the work later as long as there is no > >>>> +ordering requirement. > >>>> + > >>>> +Reference counting > >>>> +------------------ > >>>> +IOASID lifecycle management is based on reference counting. Users > >>>> of +IOASID intend to align lifecycle with the IOASID need to hold > >>> > >>> "who intend to" > >>> > >>>> +reference of the IOASID. IOASID will not be returned to the pool > >>>> for > >>> > >>> "a reference to the IOASID. The IOASID" > >>> > >>>> +allocation until all references are dropped. Calling ioasid_free() > >>>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding > >>>> +reference. ioasid_get() is not allowed once an IOASID is in the > >>>> +FREE_PENDING state. > >>>> + > >>>> +Event notifications are used to inform users of IOASID status > >>>> change. +IOASID_FREE event prompts users to drop their references > >>>> after +clearing its context. > >>>> + > >>>> +For example, on VT-d platform when an IOASID is freed, teardown > >>>> +actions are performed on KVM, device driver, and IOMMU driver. > >>>> +KVM shall register notifier block with:: > >>>> + > >>>> + static struct notifier_block pasid_nb_kvm = { > >>>> + .notifier_call = pasid_status_change_kvm, > >>>> + .priority = IOASID_PRIO_CPU, > >>>> + }; > >>>> + > >>>> +VDCM driver shall register notifier block with:: > >>>> + > >>>> + static struct notifier_block pasid_nb_vdcm = { > >>>> + .notifier_call = pasid_status_change_vdcm, > >>>> + .priority = IOASID_PRIO_DEVICE, > >>>> + }; > >> not sure those code snippets are really useful. Maybe simply say who > >> is supposed to use each prio. 
> > Agreed, not all the bits in the snippets are explained. I will explain > > KVM and VDCM need to use priority to ensure call order. > > > >>>> + > >>>> +In both cases, notifier blocks shall be registered on the IOASID > >>>> set +such that *only* events from the matching VM is received. > >>>> + > >>>> +If KVM attempts to register notifier block before the IOASID set > >>>> is +created for the MM token, the notifier block will be placed on > >>>> a > >> using the MM token > > sounds good > > > >>>> +pending list inside IOASID core. Once the token matching IOASID > >>>> set +is created, IOASID will register the notifier block > >>>> automatically. > >> Is this implementation mandated? Can't you enforce the ioasid_set to > >> be created before the notifier gets registered? > >>>> +IOASID core does not replay events for the existing IOASIDs in the > >>>> +set. For IOASID set of MM type, notification blocks can be > >>>> registered +on empty sets only. This is to avoid lost events. > >>>> + > >>>> +IOMMU driver shall register notifier block on global chain:: > >>>> + > >>>> + static struct notifier_block pasid_nb_vtd = { > >>>> + .notifier_call = pasid_status_change_vtd, > >>>> + .priority = IOASID_PRIO_IOMMU, > >>>> + }; > >>>> + > >>>> +Custom allocator APIs > >>>> +--------------------- > >>>> + > >>>> +:: > >>>> + > >>>> + int ioasid_register_allocator(struct ioasid_allocator_ops > >>>> *allocator); + > >>>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops > >>>> *allocator); + > >>>> +Allocator Choices > >>>> +~~~~~~~~~~~~~~~~~ > >>>> +IOASIDs are allocated for both host and guest SVA/IOVA usage. > >>>> However, +allocators can be different. For example, on VT-d guest > >>>> PASID +allocation must be performed via a virtual command > >>>> interface which is +emulated by VMM. > >>>> + > >>>> +IOASID core has the notion of "custom allocator" such that guest > >>>> can +register virtual command allocator that precedes the default > >>>> one. + > >>>> +Namespaces > >>>> +~~~~~~~~~~ > >>>> +IOASIDs are limited system resources that default to 20 bits in > >>>> +size. Since each device has its own table, theoretically the > >>>> namespace +can be per device also. However, for security reasons > >>>> sharing PASID +tables among devices are not good for isolation. > >>>> Therefore, IOASID +namespace is system-wide. > >>> > >>> I don't follow this development. Having per-device PASID table > >>> would work fine for isolation (assuming no hardware bug > >>> necessitating IOMMU groups). If I remember correctly IOASID space > >>> was chosen to be OS-wide because it simplifies the management code > >>> (single PASID per task), and it is system-wide across VMs only in > >>> the case of VT-d scalable mode. > >>>> + > >>>> +There are also other reasons to have this simpler system-wide > >>>> +namespace. Take VT-d as an example, VT-d supports shared workqueue > >>>> +and ENQCMD[1] where one IOASID could be used to submit work on > >>> > >>> Maybe use the Sphinx glossary syntax rather than "[1]" > >>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive > >>> > >>>> +multiple devices that are shared with other VMs. This requires > >>>> IOASID +to be system-wide. This is also the reason why guests must > >>>> use an +emulated virtual command interface to allocate IOASID from > >>>> the host. + > >>>> + > >>>> +Life cycle > >>>> +========== > >>>> +This section covers IOASID lifecycle management for both > >>>> bare-metal +and guest usages. 
In bare-metal SVA, MMU notifier is > >>>> directly hooked +up with IOMMU driver, therefore the process > >>>> address space (MM) +lifecycle is aligned with IOASID. > >> therefore the IOASID lifecyle matches the process address space (MM) > >> lifecyle? > > Sounds good. > > > >>>> + > >>>> +However, guest MMU notifier is not available to host IOMMU > >>>> driver, > >> the guest MMU notifier > >>>> +when guest MM terminates unexpectedly, the events have to go > >>>> through > >> the guest MM > >>>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also > >>>> more +parties involved in guest SVA, e.g. on Intel VT-d platform, > >>>> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO. > >>>> + > >>>> +Native IOASID Life Cycle (VT-d Example) > >>>> +--------------------------------------- > >>>> + > >>>> +The normal flow of native SVA code with Intel Data Streaming > >>>> +Accelerator(DSA) [2] as example: > >>>> + > >>>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce; > >>>> +2. DSA driver allocate WQ, do sva_bind_device(); > >>>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device, > >>>> + mmu_notifier_get() > >>>> +4. DMA starts by DSA driver userspace > >>>> +5. DSA userspace close FD > >>>> +6. DSA/uacce kernel driver handles FD.close() > >>>> +7. DSA driver stops DMA > >>>> +8. DSA driver calls sva_unbind_device(); > >>>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush > >>>> + TLBs. mmu_notifier_put() called. > >>>> +10. mmu_notifier.release() called, IOMMU SVA code calls > >>>> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed. > >>>> + > >>>> +:: > >>>> + > >>> > >>> Use a footnote? > >>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes > >>> > >>>> + * With ENQCMD, PASID used on VT-d is not released in > >>>> mmu_notifier() but > >>>> + mmdrop(). mmdrop comes after FD close. Should not matter. > >>> > >>> "comes after FD close, which doesn't make a difference?" > >>> The following might not be necessary since early process > >>> termination is described later. > >>> > >>>> + If the user process dies unexpectedly, Step #10 may come > >>>> before > >>>> + Step #5, in between, all DMA faults discarded. PRQ responded > >>>> with > >>> > >>> PRQ hasn't been defined in this document. > >>> > >>>> + code INVALID REQUEST. > >>>> + > >>>> +During the normal teardown, the following three steps would > >>>> happen in +order: > >> can't this be illustrated in the above 1-11 sequence, just adding > >> NORMAL TEARDONW before #7? > >>>> + > >>>> +1. Device driver stops DMA request > >>>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain > >>>> in-flight > >>>> + requests. > >>>> +3. IOASID freed > >>>> + > >> Then you can just focus on abnormal termination > > Yes, will refer to the steps starting #7. These can be removed. > > > >>>> +Exception happens when process terminates *before* device driver > >>>> stops +DMA and call IOMMU driver to unbind. The flow of process > >>>> exists are as > >> Can't this be explained with something simpler looking at the steps > >> 1-11? > > It meant to be educational given this level of details. Simpler > > steps are labeled with (1) (2) (3). Perhaps these labels didn't stand > > out right? I will use the steps in the 1-11 sequence. 
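On top of re-using the 1-11 steps, perhaps a tiny annotated snippet helps readers connect the teardown order with the refcounting rules. This is a sketch only: the ioasid_free() signature is not spelled out in this document, so the single-argument form below is an assumption; the other calls use the prototypes listed earlier:

    #include <linux/ioasid.h>

    /* Sketch only: refcounting keeps the "free" step safe even while
     * other consumers still hold references. */
    static void ioasid_refcount_example(struct ioasid_set *set)
    {
            ioasid_t ioasid;

            ioasid = ioasid_alloc(set, 1, 1000, NULL);  /* step 3 above */
            if (ioasid == INVALID_IOASID)
                    return;

            ioasid_get(set, ioasid);   /* a consumer ties its HW context
                                        * lifetime to the IOASID */

            /* ... DMA runs, device driver stops DMA (steps 4-7) ... */

            ioasid_free(ioasid);       /* assumed signature; marks the IOASID
                                        * FREE_PENDING, new gets now fail */

            ioasid_put(set, ioasid);   /* last reference dropped: IOASID is
                                        * returned to the pool (step 11) */
    }

In the misbehaving-guest flow later in the doc, it is exactly this ioasid_put() from each subscriber's cleanup that finally reclaims the IOASID.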
> > > >>> > >>> "exits" > >>> > >>>> +follows: > >>>> + > >>>> +:: > >>>> + > >>>> + do_exit() { > >>>> + exit_mm() { > >>>> + mm_put(); > >>>> + exit_mmap() { > >>>> + intel_invalidate_range() //mmu notifier > >>>> + tlb_finish_mmu() > >>>> + mmu_notifier_release(mm) { > >>>> + intel_iommu_release() { > >>>> + [2] > >>>> intel_iommu_teardown_pasid(); > >>> > >>> Parentheses might be better than square brackets for step numbers > >>> > >>>> + intel_iommu_flush_tlbs(); > >>>> + } > >>>> + // tlb_invalidate_range cb removed > >>>> + } > >>>> + unmap_vmas(); > >>>> + free_pgtables(); // IOMMU cannot walk PGT > >>>> after this > >>>> + }; > >>>> + } > >>>> + exit_files(tsk) { > >>>> + close_files() { > >>>> + dsa_close(); > >>>> + [1] dsa_stop_dma(); > >>>> + intel_svm_unbind_pasid(); //nothing to do > >>>> + } > >>>> + } > >>>> + } > >>>> + > >>>> + mmdrop() /* some random time later, lazy mm user */ { > >>>> + mm_free_pgd(); > >>>> + destroy_context(mm); { > >>>> + [3] ioasid_free(); > >>>> + } > >>>> + } > >>>> + > >>>> +As shown in the list above, step #2 could happen before > >>>> +#1. Unrecoverable(UR) faults could happen between #2 and #1. > >>>> + > >>>> +Also notice that TLB invalidation occurs at mmu_notifier > >>>> +invalidate_range callback as well as the release callback. The > >>>> reason +is that release callback will delete IOMMU driver from the > >>>> notifier +chain which may skip invalidate_range() calls during the > >>>> exit path. + > >>>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall > >>>> disable > >> UR? > > Unrecoverable, mentioned in the previous paragraph. > > > >>>> +fault reporting after free and before unbind. > >>>> + > >>>> +Guest IOASID Life Cycle (VT-d Example) > >>>> +-------------------------------------- > >>>> +Guest IOASID life cycle starts with guest driver open(), this > >>>> could be +uacce or individual accelerator driver such as DSA. At > >>>> FD open, +sva_bind_device() is called which triggers a series of > >>>> actions. + > >>>> +The example below is an illustration of *normal* operations that > >>>> +involves *all* the SW components in VT-d. The flow can be simpler > >>>> if +no ENQCMD is supported. > >>>> + > >>>> +:: > >>>> + > >>>> + VFIO IOMMU KVM VDCM IOASID > >>>> Ref > >>>> + .................................................................. > >>>> + 1 ioasid_register_notifier/_mm() > >>>> + 2 ioasid_alloc() > >>>> 1 > >>>> + 3 bind_gpasid() > >>>> + 4 iommu_bind()->ioasid_get() > >>>> 2 > >>>> + 5 ioasid_notify(BIND) > >>>> + 6 -> ioasid_get() > >>>> 3 > >>>> + 7 -> vmcs_update_atomic() > >>>> + 8 mdev_write(gpasid) > >>>> + 9 hpasid= > >>>> + 10 find_by_spid(gpasid) > >>>> 4 > >>>> + 11 vdev_write(hpasid) > >>>> + 12 -------- GUEST STARTS DMA -------------------------- > >>>> + 13 -------- GUEST STOPS DMA -------------------------- > >>>> + 14 mdev_clear(gpasid) > >>>> + 15 vdev_clear(hpasid) > >>>> + 16 > >>>> ioasid_put() 3 > >>>> + 17 unbind_gpasid() > >>>> + 18 iommu_ubind() > >>>> + 19 ioasid_notify(UNBIND) > >>>> + 20 -> vmcs_update_atomic() > >>>> + 21 -> > >>>> ioasid_put() 2 > >>>> + 22 > >>>> ioasid_free() 1 > >>>> + 23 > >>>> ioasid_put() 0 > >>>> + 24 Reclaimed > >>>> + -------------- New Life Cycle Begin > >>>> ---------------------------- > >>>> + 1 ioasid_alloc() > >>>> -> 1 + > >>>> + Note: IOASID Notification Events: FREE, BIND, UNBIND > >>>> + > >>>> +Exception cases arise when a guest crashes or a malicious guest > >>>> +attempts to cause disruption on the host system. 
The fault > >>>> handling +rules are: > >>>> + > >>>> +1. IOASID free must *always* succeed. > >>>> +2. An inactive period may be required before the freed IOASID is > >>>> + reclaimed. During this period, consumers of IOASID perform > >>>> cleanup. +3. Malfunction is limited to the guest owned resources > >>>> for all > >>>> + programming errors. > >>>> + > >>>> +The primary source of exception is when the following are out of > >>>> +order: > >>>> + > >>>> +1. Start/Stop of DMA activity > >>>> + (Guest device driver, mdev via VFIO) > >> please explain the meaning of what is inside (): initiator? > >>>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes > >>>> + (Host IOMMU driver bind/unbind) > >>>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in > >>>> + case of ENQCMD > >>>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver) > >>>> +5. IOASID alloc/free (Host IOASID) > >>>> + > >>>> +VFIO is the *only* user-kernel interface, which is ultimately > >>>> +responsible for exception handlings. > >>> > >>> "handling" > >>> > >>>> + > >>>> +#1 is processed the same way as the assigned device today based on > >>>> +device file descriptors and events. There is no special handling. > >>>> + > >>>> +#3 is based on bind/unbind events emitted by #2. > >>>> + > >>>> +#4 is naturally aligned with IOASID life cycle in that an illegal > >>>> +guest PASID programming would fail in obtaining reference of the > >>>> +matching host IOASID. > >>>> + > >>>> +#5 is similar to #4. The fault will be reported to the user if > >>>> PASID +used in the ENQCMD is not set up in VMCS PASID translation > >>>> table. + > >>>> +Therefore, the remaining out of order problem is between #2 and > >>>> +#5. I.e. unbind vs. free. More specifically, free before unbind. > >>>> + > >>>> +IOASID notifier and refcounting are used to ensure order. > >>>> Following +a publisher-subscriber pattern where: > >> with the following actors: > >>>> + > >>>> +- Publishers: VFIO & IOMMU > >>>> +- Subscribers: KVM, VDCM, IOMMU > >> this may be introduced before. > >>>> + > >>>> +IOASID notifier is atomic which requires subscribers to do quick > >>>> +handling of the event in the atomic context. Workqueue can be > >>>> used for +any processing that requires thread context. > >> repetition of what was said before. > >> IOASID reference must be > > Right, will remove. > > > >>>> +acquired before receiving the FREE event. The reference must be > >>>> +dropped at the end of the processing in order to return the > >>>> IOASID to +the pool. > >>>> + > >>>> +Let's examine the IOASID life cycle again when free happens > >>>> *before* +unbind. This could be a result of misbehaving guests or > >>>> crash. Assuming +VFIO cannot enforce unbind->free order. Notice > >>>> that the setup part up +until step #12 is identical to the normal > >>>> case, the flow below starts +with step 13. > >>>> + > >>>> +:: > >>>> + > >>>> + VFIO IOMMU KVM VDCM IOASID > >>>> Ref > >>>> + .................................................................. 
> >>>> + 13 -------- GUEST STARTS DMA -------------------------- > >>>> + 14 -------- *GUEST MISBEHAVES!!!* ---------------- > >>>> + 15 ioasid_free() > >>>> + 16 > >>>> ioasid_notify(FREE) > >>>> + 17 > >>>> mark_ioasid_inactive[1] > >>>> + 18 kvm_nb_handler(FREE) > >>>> + 19 vmcs_update_atomic() > >>>> + 20 ioasid_put_locked() -> > >>>> 3 > >>>> + 21 vdcm_nb_handler(FREE) > >>>> + 22 iomm_nb_handler(FREE) > >>>> + 23 ioasid_free() returns[2] schedule_work() > >>>> 2 > >>>> + 24 schedule_work() vdev_clear_wk(hpasid) > >>>> + 25 teardown_pasid_wk() > >>>> + 26 ioasid_put() -> > >>>> 1 > >>>> + 27 ioasid_put() > >>>> 0 > >>>> + 28 Reclaimed > >>>> + 29 unbind_gpasid() > >>>> + 30 iommu_unbind()->ioasid_find() Fails[3] > >>>> + -------------- New Life Cycle Begin > >>>> ---------------------------- + > >>>> +Note: > >>>> + > >>>> +1. By marking IOASID inactive at step #17, no new references can > >>>> be > >>> > >>> Is "inactive" FREE_PENDING? > >>> > >>>> + held. ioasid_get/find() will return -ENOENT; > >>>> +2. After step #23, all events can go out of order. Shall not > >>>> affect > >>>> + the outcome. > >>>> +3. IOMMU driver fails to find private data for unbinding. If > >>>> unbind is > >>>> + called after the same IOASID is allocated for the same guest > >>>> again, > >>>> + this is a programming error. The damage is limited to the guest > >>>> + itself since unbind performs permission checking based on the > >>>> + IOASID set associated with the guest process. > >>>> + > >>>> +KVM PASID Translation Table Updates > >>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > >>>> +Per VM PASID translation table is maintained by KVM in order to > >>>> +support ENQCMD in the guest. The table contains host-guest PASID > >>>> +translations to be consumed by CPU ucode. The synchronization of > >>>> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and > >>>> atomic +notifiers are used. KVM must register IOASID notifier per > >>>> VM instance +during launch time. The following events are handled: > >>>> + > >>>> +1. BIND/UNBIND > >>>> +2. FREE > >>>> + > >>>> +Rules: > >>>> + > >>>> +1. Multiple devices can bind with the same PASID, this can be > >>>> different PCI > >>>> + devices or mdevs within the same PCI device. However, only the > >>>> + *first* BIND and *last* UNBIND emit notifications. > >>>> +2. IOASID code is responsible for ensuring the correctness of H-G > >>>> + PASID mapping. There is no need for KVM to validate the > >>>> + notification data. > >>>> +3. When UNBIND happens *after* FREE, KVM will see error in > >>>> + ioasid_get() even when the reclaim is not done. IOMMU driver > >>>> will > >>>> + also avoid sending UNBIND if the PASID is already FREE. > >>>> +4. When KVM terminates *before* FREE & UNBIND, references will be > >>>> + dropped for all host PASIDs. > >>>> + > >>>> +VDCM PASID Programming > >>>> +~~~~~~~~~~~~~~~~~~~~~~ > >>>> +VDCM composes virtual devices and exposes them to the guests. When > >>>> +the guest allocates a PASID then program it to the virtual > >>>> device, VDCM > >> programs as well > >>>> +intercepts the programming attempt then program the matching > >>>> host > >>> > >>> "programs" > >>> > >>> Thanks, > >>> Jean > >>> > >>>> +PASID on to the hardware. > >>>> +Conversely, when a device is going away, VDCM must be informed > >>>> such +that PASID context on the hardware can be cleared. There > >>>> could be +multiple mdevs assigned to different guests in the same > >>>> VDCM. 
Since +the PASID table is shared at PCI device level, lazy > >>>> clearing is not +secure. A malicious guest can attack by using > >>>> newly freed PASIDs that +are allocated by another guest. > >>>> + > >>>> +By holding a reference of the PASID until VDCM cleans up the HW > >>>> context, +it is guaranteed that PASID life cycles do not cross > >>>> within the same +device. > >>>> + > >>>> + > >>>> +Reference > >>>> +==================================================== > >>>> +1. > >>>> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf > >>>> + +2. > >>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator > >>>> + +3. > >>>> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification > >>>> -- 2.7.4 > >> > >> Thanks > >> > >> Eric > >>>> > >>> > >> > >> _______________________________________________ > >> iommu mailing list > >> iommu@lists.linux-foundation.org > >> https://lists.linuxfoundation.org/mailman/listinfo/iommu > > [Jacob Pan] > > > Thanks > > Eric > [Jacob Pan] _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu