This patchset improves the scalability of the Intel IOMMU code by resolving two spinlock bottlenecks, yielding up to ~5x performance improvement and approaching iommu=off performance.
For example, here's the throughput obtained by 16 memcached instances
running on a 16-core Sandy Bridge system, accessed using memslap on
another machine that has iommu=off, using the default memslap config
(64-byte keys, 1024-byte values, and 10%/90% SET/GET ops):

  stock iommu=off:    990,803 memcached transactions/sec
                      (=100%, median of 10 runs)
  stock iommu=on:     221,416 memcached transactions/sec (=22%)
                      [61.70%  0.63%  memcached  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave]
  patched iommu=on:   963,159 memcached transactions/sec (=97%)
                      [ 1.29%  1.10%  memcached  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave]

The two resolved spinlocks:

- Deferred IOTLB invalidations are batched in a global data structure
  and serialized under a spinlock (add_unmap() & flush_unmaps()); this
  patchset batches IOTLB invalidations in a per-CPU data structure
  instead (a rough sketch of the idea is appended at the end of this
  mail).

- IOVA management (alloc_iova() & __free_iova()) is serialized under
  the rbtree spinlock; this patchset adds per-CPU caches of allocated
  IOVAs so that the rbtree is accessed much less frequently (a rough
  sketch is likewise appended).  Adding a cache above the existing
  IOVA allocator is less intrusive than dynamic identity mapping and
  helps keep IOMMU page table usage low; see Patch 7.

The paper "Utilizing the IOMMU Scalably", presented at the 2015 USENIX
Annual Technical Conference, contains many more details and experiments:

  https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf

v3:
 * Patch 7/7: Respect the caller-passed limit IOVA when satisfying an
   IOVA allocation from the cache.
 * Patch 7/7: Flush the IOVA cache if an rbtree IOVA allocation fails,
   and then retry the allocation.  This addresses the possibility that
   all desired IOVA ranges were in other CPUs' caches.
 * Patch 4/7: Clean up intel_unmap_sg() to use sg accessors.

v2:
 * Extend the IOVA API instead of modifying it, so as not to break the
   API's other, non-Intel callers.
 * Flush all CPUs' pending invalidations if one CPU hits its per-CPU
   limit, so that invalidations are not deferred any longer than before.
 * Smaller cap on the per-CPU cache size, to consume less of the IOVA
   space.
 * Free resources and perform IOTLB invalidations when a CPU is
   hot-unplugged.

Omer Peleg (7):
  iommu: refactoring of deferred flush entries
  iommu: per-cpu deferred invalidation queues
  iommu: correct flush_unmaps pfn usage
  iommu: only unmap mapped entries
  iommu: avoid dev iotlb logic in intel-iommu for domains with no dev iotlbs
  iommu: change intel-iommu to use IOVA frame numbers
  iommu: introduce per-cpu caching to iova allocation

 drivers/iommu/intel-iommu.c | 318 +++++++++++++++++++++++----------
 drivers/iommu/iova.c        | 416 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/iova.h        |  23 ++-
 3 files changed, 637 insertions(+), 120 deletions(-)

--
1.9.1
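
To make the per-CPU deferred-invalidation idea (patches 1-2) concrete,
here is a rough sketch.  It is NOT the patch code itself: the structure
layout, the queue size, and the drain_flush_queue() helper are invented
purely for illustration.

  #include <linux/init.h>
  #include <linux/cpumask.h>
  #include <linux/percpu.h>
  #include <linux/spinlock.h>

  struct dmar_domain;			/* intel-iommu's internal domain type */

  #define FQ_ENTRIES	256		/* assumed per-CPU batch size */

  struct fq_entry {
  	struct dmar_domain	*domain;	/* domain owning the IOVA range */
  	unsigned long		iova_pfn;
  	unsigned long		nrpages;
  };

  struct flush_queue {
  	spinlock_t		lock;	/* protects this CPU's queue only */
  	unsigned int		next;
  	struct fq_entry		entries[FQ_ENTRIES];
  };

  static DEFINE_PER_CPU(struct flush_queue, flush_queues);

  /* Hypothetical helper: invalidate the IOTLB for the queued ranges,
   * free their IOVAs, and reset fq->next to 0. */
  static void drain_flush_queue(struct flush_queue *fq);

  static void __init init_flush_queues(void)
  {
  	int cpu;

  	for_each_possible_cpu(cpu)
  		spin_lock_init(&per_cpu_ptr(&flush_queues, cpu)->lock);
  }

  /* Unmap path: defer the IOTLB invalidation instead of doing it inline.
   * Only this CPU's lock is taken, so unmaps on different CPUs no longer
   * contend on a global flush_unmaps() lock. */
  static void queue_deferred_flush(struct dmar_domain *domain,
  				 unsigned long iova_pfn, unsigned long nrpages)
  {
  	struct flush_queue *fq = get_cpu_ptr(&flush_queues);
  	unsigned long flags;

  	spin_lock_irqsave(&fq->lock, flags);
  	if (fq->next == FQ_ENTRIES) {
  		/* Batch full.  (The actual patches flush *all* CPUs' queues
  		 * at this point, so invalidations are not deferred any
  		 * longer than with the old global queue -- see v2 notes.) */
  		drain_flush_queue(fq);
  	}
  	fq->entries[fq->next].domain   = domain;
  	fq->entries[fq->next].iova_pfn = iova_pfn;
  	fq->entries[fq->next].nrpages  = nrpages;
  	fq->next++;
  	spin_unlock_irqrestore(&fq->lock, flags);

  	put_cpu_ptr(&flush_queues);
  }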
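
And a similarly rough sketch of the per-CPU IOVA caching of patch 7,
layered above the existing alloc_iova()/__free_iova() rbtree allocator.
Again, this is only an illustration of the idea: cached_alloc_iova(),
cached_free_iova(), flush_all_iova_caches() and IOVA_CACHE_SIZE are
invented here, and the actual patch is organized differently (its caches
live in the iova_domain rather than in one global per-CPU variable, so
multiple IOVA domains work).

  #include <linux/percpu.h>
  #include <linux/spinlock.h>
  #include <linux/iova.h>

  #define IOVA_CACHE_SIZE	32	/* assumed per-CPU cap */

  struct iova_cpu_cache {
  	spinlock_t	lock;		/* initialized at setup time (not shown) */
  	unsigned int	count;
  	struct iova	*ranges[IOVA_CACHE_SIZE];
  };

  /* Sketch simplification: one global cache, i.e. a single iova_domain. */
  static DEFINE_PER_CPU(struct iova_cpu_cache, iova_caches);

  /* Hypothetical helper: return every CPU's cached ranges to the rbtree. */
  static void flush_all_iova_caches(struct iova_domain *iovad);

  /* Free path: stash the range locally instead of taking the rbtree lock. */
  static void cached_free_iova(struct iova_domain *iovad, struct iova *iova)
  {
  	struct iova_cpu_cache *cache = get_cpu_ptr(&iova_caches);
  	unsigned long flags;

  	spin_lock_irqsave(&cache->lock, flags);
  	if (cache->count < IOVA_CACHE_SIZE)
  		cache->ranges[cache->count++] = iova;
  	else
  		__free_iova(iovad, iova);	/* cache full: slow path */
  	spin_unlock_irqrestore(&cache->lock, flags);

  	put_cpu_ptr(&iova_caches);
  }

  /* Alloc path: try the local cache before falling back to the rbtree. */
  static struct iova *cached_alloc_iova(struct iova_domain *iovad,
  				      unsigned long size,
  				      unsigned long limit_pfn)
  {
  	struct iova_cpu_cache *cache = get_cpu_ptr(&iova_caches);
  	struct iova *iova = NULL;
  	unsigned long flags;

  	spin_lock_irqsave(&cache->lock, flags);
  	while (cache->count) {
  		iova = cache->ranges[--cache->count];
  		/* v3: only reuse a cached range that fits the request and
  		 * respects the caller-passed limit_pfn. */
  		if (iova->pfn_hi - iova->pfn_lo + 1 == size &&
  		    iova->pfn_hi <= limit_pfn)
  			break;
  		__free_iova(iovad, iova);	/* doesn't fit: give it back */
  		iova = NULL;
  	}
  	spin_unlock_irqrestore(&cache->lock, flags);
  	put_cpu_ptr(&iova_caches);

  	if (!iova) {
  		iova = alloc_iova(iovad, size, limit_pfn, true);
  		if (!iova) {
  			/* v3: the space we need may be sitting in other
  			 * CPUs' caches -- flush them back into the rbtree
  			 * and retry once. */
  			flush_all_iova_caches(iovad);
  			iova = alloc_iova(iovad, size, limit_pfn, true);
  		}
  	}
  	return iova;
  }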