Hi Rob, On 2/13/20 8:54 PM, Rob Herring wrote: > On Thu, Jan 30, 2020 at 9:06 AM Auger Eric <eric.au...@redhat.com> wrote: >> >> Hi Rob, >> On 1/17/20 10:16 PM, Rob Herring wrote: >>> Arm SMMUv3.2 adds support for TLB range invalidate operations. >>> Support for range invalidate is determined by the RIL bit in the IDR3 >>> register. >>> >>> The range invalidate is in units of the leaf page size and operates on >>> 1-32 chunks of a power of 2 multiple pages. First, we determine from the >>> size what power of 2 multiple we can use. Then we calculate how many >>> chunks (1-31) of the power of 2 size for the range on the iteration. On >>> each iteration, we move up in size by at least 5 bits. >>> >>> Cc: Eric Auger <eric.au...@redhat.com> >>> Cc: Jean-Philippe Brucker <jean-phili...@linaro.org> >>> Cc: Will Deacon <w...@kernel.org> >>> Cc: Robin Murphy <robin.mur...@arm.com> >>> Cc: Joerg Roedel <j...@8bytes.org> >>> Signed-off-by: Rob Herring <r...@kernel.org> >>> --- >>> drivers/iommu/arm-smmu-v3.c | 66 ++++++++++++++++++++++++++++++++++++- >>> 1 file changed, 65 insertions(+), 1 deletion(-) > > >>> @@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long >>> iova, size_t size, >>> { >>> u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS]; >>> struct arm_smmu_device *smmu = smmu_domain->smmu; >>> - unsigned long start = iova, end = iova + size; >>> + unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0; >>> int i = 0; >>> struct arm_smmu_cmdq_ent cmd = { >>> .tlbi = { >>> @@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long >>> iova, size_t size, >>> cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid; >>> } >>> >>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { >>> + /* Get the leaf page size */ >>> + tg = __ffs(smmu_domain->domain.pgsize_bitmap); >>> + >>> + /* Convert page size of 12,14,16 (log2) to 1,2,3 */ >>> + cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1; >>> + >>> + /* Determine what level the granule is at */ >>> + cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3)); >>> + >>> + num_pages = size / (1UL << tg); >>> + } >>> + >>> while (iova < end) { >>> if (i == CMDQ_BATCH_ENTRIES) { >>> arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, false); >>> i = 0; >>> } >>> >>> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) { >>> + /* >>> + * On each iteration of the loop, the range is 5 bits >>> + * worth of the aligned size remaining. >>> + * The range in pages is: >>> + * >>> + * range = (num_pages & (0x1f << __ffs(num_pages))) >>> + */ >>> + unsigned long scale, num; >>> + >>> + /* Determine the power of 2 multiple number of pages >>> */ >>> + scale = __ffs(num_pages); >>> + cmd.tlbi.scale = scale; >>> + >>> + /* Determine how many chunks of 2^scale size we have >>> */ >>> + num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX; >>> + cmd.tlbi.num = num - 1; >>> + >>> + /* range is num * 2^scale * pgsize */ >>> + granule = num << (scale + tg); >>> + >>> + /* Clear out the lower order bits for the next >>> iteration */ >>> + num_pages -= num << scale; >> Regarding the 2 options given in >> https://lore.kernel.org/linux-arm-kernel/CAL_JsqKABoE+0crGwyZdNogNgEoG=moopf6deqgh6s73c0u...@mail.gmail.com/raw, >> >> I understand you implemented 2) but I still do not understand why you >> preferred that one against 1). >> >> In your case of 1023*4k pages this will invalidate by 31 32*2^0*4K + >> 31*2^0*4K pages >> whereas you could achieve that with 10 invalidations with the 1st algo. >> I did not get the case where it is more efficient. Please can you detail. > > No, it's only 2 commands. We do 31*4K and then 31*2^5*4K. Here's a the > output of a test case: > > iova=10001000, num_pages=0x3e0, granule=1f000, num=31, scale=0, ttl=3 > iova=10020000, num_pages=0x0, granule=3e0000, num=31, scale=5, ttl=3 > > (num_pages being what's left at end of the loop)
> > As I mentioned on v1, worst case is 4 commands for up to 4GB. It's > 20-bits of size (32-12) and each loop processes a minimum of 5 bits. > Each loop becomes a larger aligned size, so scale goes up each pass. > This is what I tried to explain in the top comment. Sorry for the delay, I was out of the office. Yes indeed I misunderstood the code and the algo looks good to me now. just a minor comment, I would use a dedicated "inv_range" loop block local variable instead of "granule". Besides Reviewed-by: Eric Auger <eric.au...@redhat.com> Thanks Eric > > Rob > _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu