Hi Will, Robin,

While analyzing an arm64 issue in interrupt handling for NVMe [0], we have noticed a worryingly high CPU utilization in the SMMU driver.

The background is that we may get a CPU lockup during high-throughput NVMe testing, and we noticed that disabling the SMMU during testing avoids the issue. However, this lockup is a cross-architecture issue and there are attempts to address it, like [1]. To me, disabling the SMMU just sidesteps that specific issue rather than fixing it.

Anyway, we should still consider this high CPU loading:

PerfTop: 1694 irqs/sec kernel:97.3% exact: 0.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, CPU: 0)
--------------------------------------------------------------------------------------------------------------------------

    50.84%  [kernel]       [k] arm_smmu_cmdq_issue_cmdlist
    19.51%  [kernel]       [k] _raw_spin_unlock_irqrestore
     5.14%  [kernel]       [k] __slab_free
     2.37%  [kernel]       [k] bio_release_pages.part.42
     2.20%  [kernel]       [k] fput_many
     1.92%  [kernel]       [k] aio_complete_rw
     1.85%  [kernel]       [k] __arm_lpae_unmap
     1.71%  [kernel]       [k] arm_smmu_atc_inv_domain.constprop.42
     1.11%  [kernel]       [k] sbitmap_queue_clear
     1.05%  [kernel]       [k] blk_mq_free_request
     0.97%  [kernel]       [k] nvme_irq
     0.71%  [kernel]       [k] blk_account_io_done
     0.66%  [kernel]       [k] kmem_cache_free
     0.66%  [kernel]       [k] blk_mq_complete_request

This is for a CPU servicing the NVMe interrupt and doing the DMA unmap. The DMA unmap is done in threaded interrupt context.

And for the overall system, we have:

PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--------------------------------------------------------------------------------------------------------------------------

    27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
    11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
     6.35%  [kernel]          [k] _raw_spin_unlock_irq
     2.65%  [kernel]          [k] get_user_pages_fast
     2.03%  [kernel]          [k] __slab_free
     1.55%  [kernel]          [k] tick_nohz_idle_exit
     1.47%  [kernel]          [k] arm_lpae_map
     1.39%  [kernel]          [k] __fget
     1.14%  [kernel]          [k] __lock_text_start
     1.09%  [kernel]          [k] _raw_spin_lock
     1.08%  [kernel]          [k] bio_release_pages.part.42
     1.03%  [kernel]          [k] __sbitmap_get_word
     0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
     0.91%  [kernel]          [k] fput_many
     0.88%  [kernel]          [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD_SYNCs for small batches.
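
To put rough numbers on the batch-size point, here is a toy model in C. It is only a sketch under stated assumptions: it counts queue entries and does not reflect the driver's actual code. The names count_cmds(), total_cmds() and struct cmd_count are hypothetical, and the model assumes one invalidation command per unmap plus one CMD_SYNC per batch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: counts command-queue entries, touches no hardware. */
struct cmd_count {
	size_t inv;	/* invalidation commands inserted */
	size_t sync;	/* CMD_SYNC commands inserted */
};

/*
 * Assumption: every unmap produces one invalidation command, and every
 * batch of up to batch_size invalidations is terminated by one CMD_SYNC.
 */
static struct cmd_count count_cmds(size_t n_unmaps, size_t batch_size)
{
	struct cmd_count c = { 0, 0 };

	assert(batch_size > 0);
	c.inv = n_unmaps;
	/* Round up: a partially filled final batch still needs a CMD_SYNC. */
	c.sync = (n_unmaps + batch_size - 1) / batch_size;
	return c;
}

/* Total queue entries consumed for the given workload. */
static size_t total_cmds(struct cmd_count c)
{
	return c.inv + c.sync;
}
```

In this model, 1000 unmaps with a batch size of 1 cost 2000 queue entries, half of them CMD_SYNCs, whereas a batch size of 8 would cost 1125. Whether the driver can actually share a CMD_SYNC across independent callers is the open question; the model only shows how much queue traffic is at stake.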

Anyway, let me know your thoughts or any questions. I'll have a look for other possible bottlenecks if I get a chance.

[0] https://lore.kernel.org/lkml/[email protected]/

[1] https://lore.kernel.org/linux-nvme/[email protected]/

Cheers,
John

On 21/08/2019 16:17, Will Deacon wrote:
Hi again,

This is version two of the patches I posted yesterday:

   v1: https://lkml.kernel.org/r/[email protected]

Changes since then include:

   * Fix 'ats_enabled' checking when enabling ATS
   * Remove redundant 'dev_is_pci()' calls
   * Remove bool bitfield
   * Add patch temporarily disabling ATS detection for -stable
   * Issue ATC invalidation even when non-leaf
   * Elide invalidation/SYNC for zero-sized address ranges
   * Shuffle the patches round a bit

Thanks,

Will

Cc: Zhen Lei <[email protected]>
Cc: Jean-Philippe Brucker <[email protected]>
Cc: John Garry <[email protected]>
Cc: Robin Murphy <[email protected]>

--->8

Will Deacon (8):
   iommu/arm-smmu-v3: Document ordering guarantees of command insertion
   iommu/arm-smmu-v3: Disable detection of ATS and PRI
   iommu/arm-smmu-v3: Remove boolean bitfield for 'ats_enabled' flag
   iommu/arm-smmu-v3: Don't issue CMD_SYNC for zero-length invalidations
   iommu/arm-smmu-v3: Rework enabling/disabling of ATS for PCI masters
   iommu/arm-smmu-v3: Fix ATC invalidation ordering wrt main TLBs
   iommu/arm-smmu-v3: Avoid locking on invalidation path when not using
     ATS
   Revert "iommu/arm-smmu-v3: Disable detection of ATS and PRI"

  drivers/iommu/arm-smmu-v3.c | 117 ++++++++++++++++++++++++++++++++------------
  1 file changed, 87 insertions(+), 30 deletions(-)


