Introduction
============

   The VT-d code currently has a number of cases where completion of certain
operations is waited for by spinning. The majority of instances spin indirectly
through the IOMMU_WAIT_OP() macro, allowing for loops of up to 1 second
(DMAR_OPERATION_TIMEOUT). While in many of the cases this may be acceptable, the
invalidation case is particularly problematic.

Currently the hypervisor polls the status address of the wait descriptor for up to
1 second to get the invalidation flush result. When the invalidation queue includes
a Device-TLB invalidation, a 1 second timeout is a mistake: that timeout is sized
for response times of the IOMMU engine, not for Device-TLB invalidation. With PCI-e
Address Translation Services (ATS) in use, the ATS specification mandates a timeout
of 1 _minute_ for cache flush. The ATS case therefore needs to be taken into
consideration when doing invalidations. Obviously we can't spin for a minute, so
invalidation absolutely needs to be converted to a non-spinning model.

   This series also fixes a memory security issue. A page freed by the domain must
be held until the Device-TLB flush is completed (the ATS timeout being 1 _minute_).
The page previously associated with the freed portion of GPA must not be reallocated
for another purpose until the appropriate invalidations have been performed;
otherwise, the original page owner can still access the freed page through DMA.

Why RFC
=======
    Patch 0001--0005, 0013 are IOMMU related.
    Patch 0006 is about new flag (vCPU / MMU related).
    Patch 0007 is vCPU related.
    Patch 0008--0012 are MMU related.

    1. Xen MMU is very complicated. Could Xen MMU experts help me verify whether I
       have covered all of the cases?

    2. For gnttab_transfer, if the Device-TLB flush is still not completed when the
       transferring page is mapped to a remote domain, schedule and wait on a
       waitqueue until the Device-TLB flush is completed. Is this correct?

       (I have tested the waitqueue in decrease_reservation() [do_memory_op()
        hypercall]. I woke the domain (with only one vCPU) up with the debug-key
        tool, and the domain was still working after waiting 60s on the waitqueue.)


Design Overview
===============

This design implements a non-spinning model for Device-TLB invalidation, using an
interrupt-based mechanism. The Device-TLB invalidation status is tracked in a
per-domain invalidation table. The invalidation table keeps the count of in-flight
Device-TLB invalidation requests, and also provides a per-domain global polling
parameter for those in-flight requests. When invalidation requests are submitted,
the table's count of in-flight requests is updated, and the address of the
per-domain global polling parameter is assigned to the Status Address of each
invalidation wait descriptor.

For example:
  .

|invl |  Status Data = 1 (the count of in-flight Device-TLB invalidation requests)
|wait |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
|dsc  |
  .
  .

|invl |
|wait |  Status Data = 2 (the count of in-flight Device-TLB invalidation requests)
|dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
  .
  .

|invl |
|wait |  Status Data = 3 (the count of in-flight Device-TLB invalidation requests)
|dsc  |  Status Address = virt_to_maddr(&_a_global_polling_parameter_per_domain_)
  .
  .
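The bookkeeping above can be sketched as follows. This is a minimal standalone
model: `struct qi_table`, `struct wait_desc`, and the field names are illustrative
stand-ins, not the actual Xen definitions, and the `virt_to_maddr()` translation is
only hinted at in a comment.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-domain invalidation table and wait descriptor. */
struct qi_table {
    volatile uint64_t poll_slot;  /* written by hardware on completion */
    uint32_t status_data;         /* count of in-flight Device-TLB requests */
};

struct wait_desc {
    uint32_t status_data;         /* value hardware writes back */
    uint64_t status_addr;         /* address hardware writes to */
};

/* Each submitted Device-TLB invalidation bumps the in-flight count and
 * points the wait descriptor at the same per-domain polling slot. */
static void queue_dev_tlb_flush(struct qi_table *t, struct wait_desc *wd)
{
    wd->status_data = ++t->status_data;
    /* In Xen this would be virt_to_maddr(&t->poll_slot). */
    wd->status_addr = (uint64_t)(uintptr_t)&t->poll_slot;
}
```

Note that every wait descriptor of a domain targets the same polling slot, so
hardware's last status write always carries the highest submitted count.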

For more information about the VT-d Invalidation Wait Descriptor, please refer to
  http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
  6.5.2.8 Invalidation Wait Descriptor.
Status Address and Data: Status address and data are used by hardware to perform the
                         wait descriptor completion status write when the Status
                         Write (SW) field is Set. Hardware behavior is undefined if
                         the Status Address is in the interrupt address range
                         (0xFEEX_XXXX). The Status Address and Data fields are
                         ignored by hardware when the Status Write field is Clear.

The invalidation completion event interrupt is generated only after the invalidation
wait descriptor completes. The invalidation interrupt handler schedules a soft-irq
to do the following check:

  if the invalidation table's count of in-flight Device-TLB invalidation requests == polling parameter:
    This domain has no in-flight Device-TLB invalidation requests.
  else:
    This domain has in-flight Device-TLB invalidation requests.
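The check above reduces to a single comparison; a minimal sketch, assuming the
illustrative `qi_table` layout (hardware writes the Status Data of the last
completed wait descriptor into the polling slot):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the per-domain invalidation table. */
struct qi_table {
    volatile uint64_t poll_slot;  /* last Status Data written by hardware */
    uint32_t status_data;         /* count of submitted in-flight requests */
};

/* The domain is flush-complete when hardware's status write has caught up
 * with the submitted in-flight count. */
static int dev_tlb_flush_done(const struct qi_table *t)
{
    return t->poll_slot == t->status_data;
}
```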

Track domain status:
   A vCPU is NOT allowed to enter guest mode, and is put into the SCHEDOP_yield
list, if its domain has in-flight Device-TLB invalidation requests.

Memory security issue:
    With PCI-e Address Translation Services (ATS) in use, the ATS spec mandates a
timeout of 1 minute for cache flush.
    A page freed by the domain must be held until the Device-TLB flush is completed.
The page previously associated with the freed portion of GPA must not be reallocated
for another purpose until the appropriate invalidations have been performed;
otherwise, the original page owner can still access the freed page through DMA.

   *Hold the page until the Device-TLB flush is completed:
      - Unlink the page from the original owner.
      - Remove the page from the page_list of the domain.
      - Decrease the total page count of the domain.
      - Add the page to qi_hold_page_list.

    *Put the page in the Queued Invalidation (QI) interrupt handler once the
Device-TLB flush is completed.

Invalidation fault:
A fault event will be generated if an invalidation fails. We can then disable the
devices.

For Context invalidation and IOTLB invalidation without Device-TLB invalidation,
Queued Invalidation (QI) submits invalidation requests as before. (This is a
tradeoff, since the interrupt adds overhead; it will be revisited in a coming
patch series.)

More details
============

1. Invalidation table. We define a qi_table structure per domain.
+struct qi_table {
+    u64 qi_table_poll_slot;
+    u32 qi_table_status_data;
+};

@ struct hvm_iommu {
+    /* IOMMU Queued Invalidation(QI) */
+    struct qi_table table;
}

2. Modification to Device-TLB invalidation:
    - Enable interrupt notification when hardware completes the invalidations:
      Set the FN, IF and SW bits in the Invalidation Wait Descriptor. The SW bit is
      also set because the notification interrupt is global, not per domain, so we
      still need to poll the status address in the QI interrupt handler to know
      which Device-TLB invalidation request has completed.
    - A new per-domain flag (*qi_flag) is used to track the status of Device-TLB
      invalidation requests. The *qi_flag is set before submitting the Device-TLB
      invalidation requests. A vCPU is NOT allowed to enter guest mode, and is put
      into the SCHEDOP_yield list, while the *qi_flag is Set.
    - New synchronization logic:
        if no Device-TLB invalidation:
            Fall back to the current invalidation logic.
        else:
            Set the IF, SW, FN bits in the wait descriptor and prepare the Status Data.
            Set *qi_flag.
            Put the domain in the pending flush list. (A vCPU is NOT allowed to
            enter guest mode, and is put into SCHEDOP_yield, while the *qi_flag is Set.)
        Return
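The synchronization logic above can be sketched as a small standalone function.
The `QI_WAIT_*` bit values and all structure/field names here are hypothetical
placeholders; the real bit positions live in Xen's qinval descriptor definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the wait-descriptor control bits. */
#define QI_WAIT_IF (1u << 4)  /* interrupt on completion */
#define QI_WAIT_SW (1u << 5)  /* status write */
#define QI_WAIT_FN (1u << 6)  /* fence: later descriptors wait for this one */

struct flush_state {
    uint32_t wait_flags;      /* bits programmed into the wait descriptor */
    int qi_flag;              /* domain has in-flight Device-TLB flushes */
};

/* Spin as before when there is no Device-TLB invalidation in the queue;
 * otherwise go asynchronous and block the domain from guest mode. */
static void queue_invalidation_sync(struct flush_state *s, int has_dev_tlb)
{
    if ( !has_dev_tlb )
    {
        s->wait_flags = QI_WAIT_SW;   /* current (spinning) model */
        return;
    }
    s->wait_flags = QI_WAIT_IF | QI_WAIT_SW | QI_WAIT_FN;
    s->qi_flag = 1;                   /* vCPUs may not enter guest mode */
}
```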

For more information about the VT-d Invalidation Wait Descriptor, please refer to
  http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
  6.5.2.8 Invalidation Wait Descriptor.
   SW: Indicate invalidation wait descriptor completion by performing a coherent
       DWORD write of the value in the Status Data field to the address specified
       in the Status Address field.
   FN: Indicate that descriptors following the invalidation wait descriptor must be
       processed by hardware only after the invalidation wait descriptor completes.
   IF: Indicate invalidation wait descriptor completion by generating an
       invalidation completion event per the programming of the Invalidation
       Completion Event Registers.

3. Modification to the domain running lifecycle:
    - While the *qi_flag is set (i.e. there are in-flight Device-TLB invalidation
      requests), the domain's vCPUs are not allowed to enter guest mode and are put
      into the SCHEDOP_yield list.

4. New interrupt handler for invalidation completion:
    - When hardware completes the Device-TLB invalidation requests, it generates an
      interrupt to notify the hypervisor.
    - The interrupt handler schedules a tasklet to handle it.
    - The tasklet does the following:
        *Clear the IWC field in the Invalidation Completion Status register. If the
         IWC field in the Invalidation Completion Status register was already Set at
         the time of setting this field, it is not treated as a new interrupt
         condition.
        *Scan the domain list (domains with VT-d passthrough devices; scan
         'iommu->domid_bitmap'):
                for each domain:
                check the invalidation table values (qi_table_poll_slot and
                qi_table_status_data) of the domain.
                if equal:
                   Put the on-hold pages.
                   Clear the invalidation table.
                   Clear *qi_flag.

        *If the IP field of the Invalidation Event Control Register is Set, try to
         *Clear IWC and *Scan the domain list again, instead of generating another
         interrupt.
        *Clear the IM field of the Invalidation Event Control Register.
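The per-domain part of the tasklet can be sketched as a scan loop over mock
domain state. The `struct dom` fields are illustrative stand-ins for the
per-domain invalidation table, qi_flag and hold list; in Xen the scan would walk
iommu->domid_bitmap rather than a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock per-domain state for the scan. */
struct dom {
    volatile uint64_t poll_slot;  /* written by hardware */
    uint32_t status_data;         /* submitted in-flight count */
    int qi_flag;                  /* blocks guest-mode entry while set */
    int held_pages;               /* pages parked on qi_hold_page_list */
};

/* For every passthrough domain whose polling slot has caught up with the
 * submitted count: release the held pages, clear the table, clear qi_flag. */
static void qi_tasklet_scan(struct dom *doms, size_t n)
{
    for ( size_t i = 0; i < n; i++ )
    {
        struct dom *d = &doms[i];
        if ( d->qi_flag && d->poll_slot == d->status_data )
        {
            d->held_pages = 0;              /* put the on-hold pages */
            d->poll_slot = 0;
            d->status_data = 0;             /* clear the invalidation table */
            d->qi_flag = 0;                 /* vCPUs may run again */
        }
    }
}
```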

((
  Logic of IWC / IP / IM as below:

                          Interrupt condition (an invalidation wait descriptor
                          with the Interrupt Flag (IF) field Set completed)
                                  ||
                                   v
           ----------------------(IWC)----------------------
     (IWC is Set)                                (IWC is not Set)
          ||                                            ||
          V                                             ||
(Not treated as a new interrupt condition)              ||
                                                         V
                                                   (Set IWC / IP)
                                                        ||
                                                         V
                                  ---------------------(IM)---------------------
                              (IM is Set)                        (IM is not Set)
                                  ||                                        ||
                                  ||                                        V
                                  ||               (cause interrupt message,
                                  ||                then hardware clears IP)
                                   V
   (interrupt is held pending; clearing IM causes the interrupt message)

* When the IWC field is cleared, the IP field is also cleared.
))

5. Invalidation failure.
    - A fault event will be generated if an invalidation fails. We can disable the
      devices if we receive an invalidation fault event.

6. Memory security issue:

    A page freed by the domain must be held until the Device-TLB flush is completed.
The page previously associated with the freed portion of GPA must not be reallocated
for another purpose until the appropriate invalidations have been performed;
otherwise, the original page owner can still access the freed page through DMA.

   *Hold the page until the Device-TLB flush is completed:
      - Unlink the page from the original owner.
      - Remove the page from the page_list of the domain.
      - Decrease the total page count of the domain.
      - Add the page to qi_hold_page_list.

  *Put the page in the Queued Invalidation (QI) interrupt handler once the
Device-TLB flush is completed.
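The hold/put steps above can be sketched with a simple singly-linked list. The
structures and list handling are simplified stand-ins for Xen's page_list
operations and put_page(); for brevity the sketch assumes the page being freed is
at the head of the owner's list.

```c
#include <assert.h>
#include <stddef.h>

struct page { struct page *next; };

/* Mock of the relevant per-domain fields. */
struct dom {
    struct page *page_list;       /* pages owned by the domain */
    unsigned int tot_pages;
    struct page *qi_hold_list;    /* stands in for qi_hold_page_list */
};

/* Hold a freed page instead of returning it to the allocator. */
static void qi_hold_page(struct dom *d, struct page *pg)
{
    d->page_list = pg->next;      /* unlink from the owner (head assumed) */
    d->tot_pages--;               /* decrease the total page count */
    pg->next = d->qi_hold_list;   /* park it on the hold list */
    d->qi_hold_list = pg;
}

/* Called from the QI interrupt handler once the flush has completed;
 * returns how many pages were released. */
static unsigned int qi_release_held(struct dom *d)
{
    unsigned int n = 0;
    while ( d->qi_hold_list )
    {
        struct page *pg = d->qi_hold_list;
        d->qi_hold_list = pg->next;
        pg->next = NULL;          /* put_page() in the real code */
        n++;
    }
    return n;
}
```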


----
There are 3 cases in which Device-TLB invalidation requests are submitted:
    *VT-d initialization.
    *Reassign device ownership.
    *Memory modification.

6.1 *VT-d initialization
    When VT-d is initializing, no guest domain is running yet, so there is no memory
security issue.
iotlb(iotlb/device-tlb)
|-iommu_flush_iotlb_global()--iommu_flush_all()--intel_iommu_hwdom_init()
                                              |--init_vtd_hw()
6.2 *Reassign device ownership
    Reassigning device ownership is invoked by 2 hypercalls: do_physdev_op() and
arch_do_domctl().
While the *qi_flag is Set, the domain is not allowed to enter guest mode. If the
appropriate invalidations have not yet been performed, the *qi_flag is still Set,
and the devices are not ready for guest domains to launch DMA with. So with the
*qi_flag introduced, there is no memory security issue.

iotlb(iotlb/device-tlb)
|-iommu_flush_iotlb_dsi()
                       |--domain_context_mapping_one() ...
                       |--domain_context_unmap_one() ...

|-iommu_flush_iotlb_psi()
                       |--domain_context_mapping_one() ...
                       |--domain_context_unmap_one() ...

6.3 *Memory modification.
When memory is modified, there are many invocation paths that update the EPT, but
not all of them update the IOMMU page tables. The IOMMU page tables are updated
only when all of the following three conditions are met:
  * The P2M is the host p2m. ( p2m_is_hostp2m(p2m) )
  * The previous mfn is not equal to the new mfn. (prev_mfn != new_mfn)
  * The domain needs an IOMMU. (need_iommu(d))
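The three-way gate can be written as a single predicate; a minimal sketch with
the conditions passed in as plain values (the real checks are p2m_is_hostp2m(),
the mfn comparison, and need_iommu() inside the EPT update path):

```c
#include <assert.h>

/* IOMMU page tables need updating only when all three conditions hold. */
static int iommu_update_needed(int is_hostp2m, unsigned long prev_mfn,
                               unsigned long new_mfn, int need_iommu)
{
    return is_hostp2m && prev_mfn != new_mfn && need_iommu;
}
```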

##
|--iommu_pte_flush()--ept_set_entry()

#PoD (populate on demand) is not supported while IOMMU passthrough is enabled, so
ignore the PoD invoker flows below.
      |--p2m_pod_zero_check_superpage()  ...
      |--p2m_pod_zero_check()  ...
      |--p2m_pod_demand_populate()  ...
      |--p2m_pod_decrease_reservation()  ...
      |--guest_physmap_mark_populate_on_demand() ...

#Xen paging is not supported while IOMMU passthrough is enabled, so ignore the Xen
paging invoker flows below.
      |--p2m_mem_paging_evict() ...
      |--p2m_mem_paging_resume()...
      |--p2m_mem_paging_prep()...
      |--p2m_mem_paging_populate()...
      |--p2m_mem_paging_nominate()...
      |--p2m_alloc_table()--shadow_enable()--paging_enable()--shadow_domctl()--paging_domctl()--arch_do_domctl()--do_domctl()
                                                                            |--paging_domctl_continuation()

#Xen sharing is not supported while IOMMU passthrough is enabled, so ignore the Xen
sharing invoker flow below.
      |--set_shared_p2m_entry()...


#The domain is paused, so it can't launch DMA.
      |--relinquish_shared_pages()--domain_relinquish_resources( case RELMEM_shared: )--domain_kill()--do_domctl()

#The p2m below is not the host p2m; it is L2 to L0. So ignore the invoker flow below.
      |--nestedhap_fix_p2m()--nestedhvm_hap_nested_page_fault()--hvm_hap_nested_page_fault()--ept_handle_violation()--vmx_vmexit_handler()

#If prev_mfn == new_mfn, the IOMMU page tables are not updated. So ignore the
invoker flows below.
      |--p2m_mem_access_check()--hvm_hap_nested_page_fault()--ept_handle_violation()--vmx_vmexit_handler() (L1 --> L0, but it only checks p2m_type_t)
      |--p2m_set_mem_access() ...
      |--guest_physmap_mark_populate_on_demand() ...
      |--p2m_change_type_one() ...
# The previous page is neither put nor allocated to Xen or other guest domains, so
there is no memory security issue. Ignore the invoker flows below.
   |--p2m_remove_page()--guest_physmap_remove_page() ...

   |--clear_mmio_p2m_entry()--unmap_mmio_regions()--do_domctl()
                           |--map_mmio_regions()--do_domctl()


# Hold the pages which are removed in guest_remove_page(), and put them in the QI
interrupt handler when there are no in-flight Device-TLB invalidation requests.

|--clear_mmio_p2m_entry()--*guest_remove_page()*--decrease_reservation()
                                               |--xenmem_add_to_physmap_one()--xenmem_add_to_physmap() /xenmem_add_to_physmap_batch() .. --do_memory_op()
                                               |--p2m_add_foreign()--xenmem_add_to_physmap_one() ..--do_memory_op()
                                               |--guest_physmap_add_entry()--create_grant_p2m_mapping() ... --do_grant_table_op()

((
   More explanation:
   The previous pages may actually be mapped from the Xen heap for guest domains in
   decrease_reservation() / xenmem_add_to_physmap_one() / p2m_add_foreign(), but
   they are not mapped into the IOMMU table. The following 4 cases map Xen heap
   pages for guest domains:
          * shared page for Xen Oprofile.
          * vLAPIC mapping.
          * grant table shared page.
          * domain shared_info page.
))

# For grant_unmap*, ignore it at this point, as we can hold the page when the
domain frees the xenballooned page.

    |--iommu_map_page()--__gnttab_unmap_common()--__gnttab_unmap_grant_ref()--gnttab_unmap_grant_ref()--do_grant_table_op()
                                               |--__gnttab_unmap_and_replace()--gnttab_unmap_and_replace()--do_grant_table_op()

# For grant_map*, ignore it as there is no pfn<--->mfn in Device-TLB.

# For grant_transfer:
  |--p2m_remove_page()--guest_physmap_remove_page()
                                                 |--gnttab_transfer() ... --do_grant_table_op()

    If the Device-TLB flush is still not completed when the transferring page is to
    be mapped into a remote domain, schedule and wait on a waitqueue until the
    Device-TLB flush is completed.

   Plan B:
   ((If the Device-TLB flush is still not completed before adding the transferring
   page to the target domain, allocate a new page for the target domain and hold
   the old transferring page, which will be put in the QI interrupt handler when
   there are no in-flight Device-TLB invalidation requests.))
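The waitqueue decision for gnttab_transfer can be sketched as a toy model. Here
`wq_sleep()` is a hypothetical stand-in for blocking on Xen's per-domain
waitqueue (the real code would use the waitqueue primitives from xen/wait.h, and
the QI interrupt handler would do the wakeup); the stand-in simply simulates the
flush completing while we wait.

```c
#include <assert.h>

/* Mock per-domain state. */
struct dom {
    int qi_flag;       /* Device-TLB flush still in flight */
    int wait_count;    /* how many times we slept on the waitqueue */
};

/* Stand-in for sleeping on the waitqueue until the flush completes. */
static void wq_sleep(struct dom *d)
{
    d->wait_count++;   /* in Xen: block here until woken by the QI handler */
    d->qi_flag = 0;    /* simulate the QI handler completing the flush */
}

/* Map the transferring page into the remote domain only once the
 * Device-TLB flush is done; returns 1 when mapping is safe. */
static int transfer_page(struct dom *d)
{
    if ( d->qi_flag )
        wq_sleep(d);
    return !d->qi_flag;
}
```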


Quan Xu (13):
  vt-d: Redefine iommu_set_interrupt() for registering MSI interrupt
  vt-d: Register MSI for async invalidation completion interrupt.
  vt-d: Track the Device-TLB invalidation status in an invalidation table.
  vt-d: Clear invalidation table in invalidation interrupt handler
  vt-d: Clear the IWC field of Invalidation Event Control Register in
  vt-d: Introduce a new per-domain flag - qi_flag.
  vt-d: If the qi_flag is Set, the domain's vCPUs are not allowed to
  vt-d: Hold the freed page until the Device-TLB flush is completed.
  vt-d: Put the page in Queued Invalidation(QI) interrupt handler if
  vt-d: Hold the removed page until the Device-TLB flush is completed.
  vt-d: If the Device-TLB flush is still not completed when
  vt-d: For gnttab_transfer, If the Device-TLB flush is still
  vt-d: Set the IF bit in Invalidation Wait Descriptor When submit Device-TLB

 xen/arch/x86/hvm/vmx/entry.S         |  10 ++
 xen/arch/x86/x86_64/asm-offsets.c    |   1 +
 xen/common/domain.c                  |  15 ++
 xen/common/grant_table.c             |  16 ++
 xen/common/memory.c                  |  16 +-
 xen/drivers/passthrough/vtd/iommu.c  | 290 +++++++++++++++++++++++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h  |  18 +++
 xen/drivers/passthrough/vtd/qinval.c |  51 +++++-
 xen/include/xen/hvm/iommu.h          |  42 +++++
 9 files changed, 443 insertions(+), 16 deletions(-)

-- 
1.8.3.2


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
