On 21/02/2020 4:47 pm, Rob Herring wrote:
On Fri, Feb 21, 2020 at 10:19 AM Robin Murphy <robin.mur...@arm.com> wrote:

On 20/02/2020 5:54 pm, Rob Herring wrote:
On Mon, Feb 17, 2020 at 1:17 PM Robin Murphy <robin.mur...@arm.com> wrote:

On 13/02/2020 9:49 pm, Rob Herring wrote:
On Thu, Jan 30, 2020 at 11:34 AM Robin Murphy <robin.mur...@arm.com> wrote:

On 30/01/2020 3:06 pm, Auger Eric wrote:
Hi Rob,
On 1/17/20 10:16 PM, Rob Herring wrote:
Arm SMMUv3.2 adds support for TLB range invalidate operations.
Support for range invalidate is determined by the RIL bit in the IDR3
register.

The range invalidate is in units of the leaf page size and operates on
1-32 chunks of a power of 2 multiple pages. First, we determine from the
size what power of 2 multiple we can use. Then we calculate how many
chunks (1-31) of the power of 2 size for the range on the iteration. On
each iteration, we move up in size by at least 5 bits.

Cc: Eric Auger <eric.au...@redhat.com>
Cc: Jean-Philippe Brucker <jean-phili...@linaro.org>
Cc: Will Deacon <w...@kernel.org>
Cc: Robin Murphy <robin.mur...@arm.com>
Cc: Joerg Roedel <j...@8bytes.org>
Signed-off-by: Rob Herring <r...@kernel.org>


@@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, 
size_t size,
     {
        u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
        struct arm_smmu_device *smmu = smmu_domain->smmu;
-    unsigned long start = iova, end = iova + size;
+    unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0;
        int i = 0;
        struct arm_smmu_cmdq_ent cmd = {
                .tlbi = {
@@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, 
size_t size,
                cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
        }

+    if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+            /* Get the leaf page size */
+            tg = __ffs(smmu_domain->domain.pgsize_bitmap);
+
+            /* Convert page size of 12,14,16 (log2) to 1,2,3 */
+            cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;

Given the comment, I think "(tg - 10) / 2" would suffice ;)

Well, duh...


+
+            /* Determine what level the granule is at */
+            cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));

Is it possible to rephrase that with logs and shifts to avoid the division?

Well, with a loop is the only other way I came up with:

bpl = tg - 3;
ttl = 3;
mask = BIT(bpl) - 1;
while ((granule & (mask << ((4 - ttl) * bpl + 3))) == 0)
       ttl--;

Doesn't seem like much improvement to me given we have h/w divide...

Sure, it doesn't take too many extra instructions to start matching
typical IDIV latency, so there's no point being silly if there really
isn't a clean alternative.

Some quick scribbling suggests "4 - ilog2(granule) / 10" might actually
be close enough, but perhaps that's a bit too cheeky.

How does divide by 10 save a divide?

Well, by that point I was more just thinking about the smallest
expression, since *some* division seems unavoidable. Although GCC does
apparently still think that transforming constant division is a win ;)

Okay. Still, divide by 10 happens to work, but it is very much not
obvious. It also doesn't work for level 1 and 16 or 64KB pages (though

(FWIW the more subtle part of the trick is that 16K doesn't allow level 1 blocks, and even in the vanishingly unlikely 64K level 1 case TTL=0 isn't "wrong")

we'll never see that granule). Subtracting 3 is not that obvious
either, but is at least in the spec for calculating the bits per
level.

I think we've spent enough time micro-optimizing this and have better
things to worry about.

Agreed! Invalidation performance has just been the bogeyman for so long that it's an area to which I've become overly sensitive.

Cheers,
Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to