Introduction
============

The VT-d code currently has a number of places where completion of an operation is waited for by spinning. The majority of instances use that timeout indirectly through the IOMMU_WAIT_OP() macro, allowing loops of up to 1 second (DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the invalidation case is particularly problematic.
Currently the hypervisor polls the status address of the wait descriptor for up to 1 second to get the invalidation flush result. When the invalidation queue includes a Device-TLB invalidation, a 1 second limit is wrong for the invalidation sync: the 1 second timeout reflects response times of the IOMMU engine, but for Device-TLB invalidation with PCI-e Address Translation Services (ATS) in use, the ATS specification mandates a timeout of 1 _minute_ for cache flush. The ATS case therefore needs to be taken into consideration when doing invalidations. Obviously we can't spin for a minute, so invalidation absolutely needs to be converted to a non-spinning model.

I should also fix a memory security issue: a page freed by the domain should be held until the Device-TLB flush is completed (the ATS timeout of 1 _minute_). The page previously backing the freed portion of GPA space must not be reallocated for another purpose until the appropriate invalidations have been performed; otherwise the original page owner can still access the freed page through DMA.

Why RFC
=======

Patches 0001--0005 and 0013 are IOMMU related. Patch 0006 introduces a new flag (vCPU / MMU related). Patch 0007 is vCPU related. Patches 0008--0012 are MMU related.

1. The Xen MMU is very complicated. Could Xen MMU experts help me verify whether I have covered all of the cases?

2. For gnttab_transfer: if the Device-TLB flush has still not completed when the transferred page is to be mapped into a remote domain, schedule and wait on a waitqueue until the flush is completed. Is this correct?
   (I have tested the waitqueue in decrease_reservation() [the do_memory_op() hypercall]: I wake up a domain (with only one vCPU) with a debug-key tool, and the domain is still working after waiting 60s on a waitqueue.)

Design Overview
===============

This design implements a non-spinning model for Device-TLB invalidation, using an interrupt based mechanism.
Track the Device-TLB invalidation status in a per-domain invalidation table. The invalidation table keeps the count of in-flight Device-TLB invalidation requests, and also provides a per-domain global polling parameter for those in-flight requests. When invalidation requests are submitted, update the table's count of in-flight requests and put the address of the per-domain polling parameter in the Status Address of each invalidation wait descriptor. For example:

    .
    |invl |
    |wait |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
    |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
    .
    .
    |invl |
    |wait |  Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
    |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
    .
    .
    |invl |
    |wait |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
    |dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
    .

For more information about the VT-d Invalidation Wait Descriptor, please refer to
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
section 6.5.2.8, Invalidation Wait Descriptor.

Status Address and Data: used by hardware to perform the wait descriptor completion status write when the Status Write (SW) field is Set. Hardware behavior is undefined if the Status Address falls in a range treated specially by the platform (such as the interrupt address range 0xFEEX_XXXX). The Status Address and Data fields are ignored by hardware when the Status Write field is Clear.

The invalidation completion event interrupt is generated only after the invalidation wait descriptor completes. The invalidation interrupt handler schedules a soft-irq to do the following check:

    if the invalidation table's count of in-flight Device-TLB invalidation requests == the polling parameter:
        This domain has no in-flight Device-TLB invalidation requests.
    else:
        This domain has in-flight Device-TLB invalidation requests.

Tracking domain status: a vCPU is NOT allowed to enter guest mode, and is put on the SCHEDOP_yield list, if its domain has in-flight Device-TLB invalidation requests.

Memory security issue: with PCI-e Address Translation Services (ATS) in use, the ATS spec mandates a timeout of 1 minute for cache flush. A page freed by the domain should be held until the Device-TLB flush is completed. The page previously backing the freed portion of GPA space must not be reallocated for another purpose until the appropriate invalidations have been performed; otherwise the original page owner can still access the freed page through DMA.

  * Hold the page until the Device-TLB flush is completed:
    - Unlink the page from the original owner.
    - Remove the page from the domain's page_list.
    - Decrease the domain's total page count.
    - Add the page to qi_hold_page_list.
  * Put the page in the Queued Invalidation (QI) interrupt handler once the Device-TLB flush is completed.

Invalidation fault: a fault event will be generated if an invalidation fails; we can then disable the devices.

For Context invalidation and IOTLB invalidation without Device-TLB invalidation, Queued Invalidation (QI) submits invalidation requests as before. (This is a tradeoff, as the interrupt costs overhead; it will be revisited in a coming patch series.)

More details
============

1. Invalidation table. We define a qi_table structure per domain:

    +struct qi_table {
    +    u64 qi_table_poll_slot;
    +    u32 qi_table_status_data;
    +};

    @ struct hvm_iommu {
    +    /* IOMMU Queued Invalidation(QI) */
    +    struct qi_table table;
      }

2. Modification to Device-TLB invalidation:
   - Enable interrupt notification when hardware completes the invalidations: set the FN, IF and SW bits in the Invalidation Wait Descriptor. The SW bit is set as well because the notification interrupt is global, not per domain.
     So in the QI interrupt handler we still need to poll the status address to know which Device-TLB invalidation request has completed.
   - A new per-domain flag (*qi_flag) tracks the status of Device-TLB invalidation requests. *qi_flag is set before submitting the Device-TLB invalidation requests. A vCPU is NOT allowed to enter guest mode, and is put on the SCHEDOP_yield list, while *qi_flag is Set.
   - New synchronization logic:

        if no Device-TLB invalidation:
            Fall back to the current invalidation logic.
        else:
            Set the IF, SW, FN bits in the wait descriptor and prepare the Status Data.
            Set *qi_flag.
            Put the domain on the pending-flush list (its vCPUs are NOT allowed to
            enter guest mode and are put on the SCHEDOP_yield list while *qi_flag is Set).
            Return.

   For more information about the VT-d Invalidation Wait Descriptor, please refer to
   http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
   section 6.5.2.8, Invalidation Wait Descriptor.

   SW: indicates invalidation wait descriptor completion by performing a coherent DWORD write of the value in the Status Data field to the address specified in the Status Address field.
   FN: indicates that descriptors following the invalidation wait descriptor must be processed by hardware only after the invalidation wait descriptor completes.
   IF: indicates invalidation wait descriptor completion by generating an invalidation completion event per the programming of the Invalidation Completion Event Registers.

3. Modification to the domain running lifecycle:
   - While *qi_flag is Set, the domain's vCPUs are not allowed to enter guest mode and are put on the SCHEDOP_yield list, since there are in-flight Device-TLB invalidation requests.

4. New interrupt handler for invalidation completion:
   - When hardware completes the Device-TLB invalidation requests, it generates an interrupt to notify the hypervisor.
   - The interrupt handler schedules a tasklet.
   - The tasklet does the following:
     * Clear the IWC field in the Invalidation Completion Status register.
       If the IWC field was already Set at the time hardware sets it, the event is not treated as a new interrupt condition.
     * Scan the domain list (domains with VT-d passthrough devices; scan iommu->domid_bitmap).
       For each domain, compare the invalidation table values (qi_table_poll_slot and qi_table_status_data).
       If they are equal:
           Put the on-hold pages.
           Clear the invalidation table.
           Clear *qi_flag.
     * If the IP field of the Invalidation Event Control Register is Set, clear IWC and scan the domain list again, instead of generating another interrupt.
     * Clear the IM field of the Invalidation Event Control Register.

   (Logic of IWC / IP / IM:

        Interrupt condition (an invalidation wait descriptor
        with the Interrupt Flag (IF) field Set completed)
                            |
                            v
                          [IWC?]
                Set ----/       \---- not Set
                 |                       |
                 v                       v
       (not treated as a new      (Set IWC and IP)
        interrupt condition)             |
                                         v
                                       [IM?]
                             Set ----/       \---- not Set
                              |                       |
                              v                       v
                (interrupt held pending;    (cause interrupt message;
                 clearing IM causes the      hardware then clears IP)
                 interrupt message)

    If the IWC field is cleared, the IP field is cleared as well.)

5. Invalidation failure:
   - A fault event will be generated if an invalidation fails. We can disable the devices on receiving an invalidation fault event.

6. Memory security issue:
   A page freed by the domain should be held until the Device-TLB flush is completed. The page previously backing the freed portion of GPA space must not be reallocated for another purpose until the appropriate invalidations have been performed; otherwise the original page owner can still access the freed page through DMA.

   * Hold the page until the Device-TLB flush is completed:
     - Unlink the page from the original owner.
     - Remove the page from the domain's page_list.
     - Decrease the domain's total page count.
     - Add the page to qi_hold_page_list.
   * Put the page in the Queued Invalidation (QI) interrupt handler once the Device-TLB flush is completed.

   ----

   Device-TLB invalidation requests are submitted for 3 reasons:
   * VT-d initialization.
   * Reassigning device ownership.
   * Memory modification.

6.1 VT-d initialization

   While VT-d is initializing, no guest domain is running yet, so there is no memory security issue.

       iotlb (iotlb/device-tlb)
       |-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
                                                      |--init_vtd_hw()

6.2 Reassigning device ownership

   Reassigning device ownership is invoked by 2 hypercalls: do_physdev_op() and arch_do_domctl(). While *qi_flag is Set, the domain is not allowed to enter guest mode. If the appropriate invalidations have not yet been performed, *qi_flag is still Set, and the devices are not ready for guest domains to launch DMA with. So once *qi_flag is introduced, there is no memory security issue here.

       iotlb (iotlb/device-tlb)
       |-iommu_flush_iotlb_dsi()
       |--domain_context_mapping_one() ...
       |--domain_context_unmap_one() ...
       |-iommu_flush_iotlb_psi()
       |--domain_context_mapping_one() ...
       |--domain_context_unmap_one() ...

6.3 Memory modification

   When memory is modified, there are many invocation paths that update the EPT, but not all of them update the IOMMU page tables. The IOMMU page tables are updated only when all three of the following conditions are met:
   * The p2m is the host p2m ( p2m_is_hostp2m(p2m) ).
   * The previous mfn is not equal to the new mfn (prev_mfn != new_mfn).
   * The domain needs an IOMMU (need_iommu(d)).

   ## |--iommu_pte_flush()--ept_set_entry()

   # PoD (populate on demand) is not supported while IOMMU passthrough is enabled, so ignore the PoD invocation paths below:
       |--p2m_pod_zero_check_superpage() ...
       |--p2m_pod_zero_check() ...
       |--p2m_pod_demand_populate() ...
       |--p2m_pod_decrease_reservation() ...
       |--guest_physmap_mark_populate_on_demand() ...

   # Xen paging is not supported while IOMMU passthrough is enabled, so ignore the Xen paging invocation paths below:
       |--p2m_mem_paging_evict() ...
       |--p2m_mem_paging_resume() ...
       |--p2m_mem_paging_prep() ...
       |--p2m_mem_paging_populate() ...
       |--p2m_mem_paging_nominate() ...
       |--p2m_alloc_table()--shadow_enable()--paging_enable()--shadow_domctl()
                           --paging_domctl()--arch_do_domctl()--do_domctl()
       |--paging_domctl_continuation()

   # Xen sharing is not supported while IOMMU passthrough is enabled, so ignore the sharing invocation paths below:
       |--set_shared_p2m_entry() ...

   # The domain is paused, so it cannot launch DMA:
       |--relinquish_shared_pages()--domain_relinquish_resources( case RELMEM_shared: )
                                   --domain_kill()--do_domctl()

   # The p2m below is not the host p2m (it is L2 to L0), so ignore this path:
       |--nestedhap_fix_p2m()--nestedhvm_hap_nested_page_fault()
                             --hvm_hap_nested_page_fault()
                             --ept_handle_violation()--vmx_vmexit_handler()

   # If prev_mfn == new_mfn, the IOMMU page tables are not updated, so ignore the paths below:
       |--p2m_mem_access_check()--hvm_hap_nested_page_fault()
                                --ept_handle_violation()--vmx_vmexit_handler()
                                (L1 --> L0, but only the p2m_type_t is checked)
       |--p2m_set_mem_access() ...
       |--guest_physmap_mark_populate_on_demand() ...
       |--p2m_change_type_one() ...

   # The previous page is neither put nor allocated to Xen or other guest domains, so there is no memory security issue; ignore the paths below:
       |--p2m_remove_page()--guest_physmap_remove_page() ...
       |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
       |--map_mmio_regions()--do_domctl()

   # Hold the pages removed in guest_remove_page(), and put them in the QI interrupt handler once there are no in-flight Device-TLB invalidation requests:
       |--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
       |--xenmem_add_to_physmap_one()--xenmem_add_to_physmap()
                                      /xenmem_add_to_physmap_batch() ..
                                     --do_memory_op()
       |--p2m_add_foreign()--xenmem_add_to_physmap_one() ..--do_memory_op()
       |--guest_physmap_add_entry()--create_grant_p2m_mapping() ...
                                  --do_grant_table_op()

   (Further explanation: the previous pages may be mapped from the Xen heap for guest domains in decrease_reservation() / xenmem_add_to_physmap_one() / p2m_add_foreign(), but they are not mapped into the IOMMU table. The following 4 cases map Xen heap pages for guest domains:
     * the shared page for Xen Oprofile;
     * the vLAPIC mapping;
     * the grant table shared page;
     * the domain's share_info page.)

   # For grant_unmap*: ignore it at this point, as we can hold the page when the domain frees the xenballooned page:
       |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()
                          --gnttab_unmap_grant_ref()--do_grant_table_op()
       |--__gnttab_unmap_and_replace()--gnttab_unmap_and_replace()
                                      --do_grant_table_op()

   # For grant_map*: ignore it, as there is no pfn<--->mfn mapping in the Device-TLB.

   # For grant_transfer:
       |--p2m_remove_page()--guest_physmap_remove_page()
                           |--gnttab_transfer() ... --do_grant_table_op()

     If the Device-TLB flush has still not completed when the transferred page is to be mapped into a remote domain, schedule and wait on a waitqueue until the flush is completed.

     Plan B: (If the Device-TLB flush has still not completed before the transferred page is added to the target domain, allocate a new page for the target domain and hold the old page, to be put in the QI interrupt handler once there are no in-flight Device-TLB invalidation requests.)

Quan Xu (13):
  vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
  vt-d: Register MSI for async invalidation completion interrupt.
  vt-d: Track the Device-TLB invalidation status in an invalidation table.
  vt-d: Clear invalidation table in invaidation interrupt handler
  vt-d: Clear the IWC field of Invalidation Event Control Register in
  vt-d: Introduce a new per-domain flag - qi_flag.
  vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  vt-d: Held on the freed page until the Device-TLB flush is completed.
  vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
  vt-d: Held on the removed page until the Device-TLB flush is completed.
  vt-d: If the Device-TLB flush is still not completed when
  vt-d: For gnttab_transfer, If the Device-TLB flush is still
  vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB

 xen/arch/x86/hvm/vmx/entry.S         |  10 ++
 xen/arch/x86/x86_64/asm-offsets.c    |   1 +
 xen/common/domain.c                  |  15 ++
 xen/common/grant_table.c             |  16 ++
 xen/common/memory.c                  |  16 +-
 xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h  |  18 +++
 xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
 xen/include/xen/hvm/iommu.h          |  42 +++++
 9 files changed, 443 insertions(+), 16 deletions(-)

--
1.8.3.2

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel