Hi Rob,

On 2/13/20 8:54 PM, Rob Herring wrote:
> On Thu, Jan 30, 2020 at 9:06 AM Auger Eric <eric.au...@redhat.com> wrote:
>>
>> Hi Rob,
>> On 1/17/20 10:16 PM, Rob Herring wrote:
>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
>>> Support for range invalidate is determined by the RIL bit in the IDR3
>>> register.
>>>
>>> The range invalidate is in units of the leaf page size and operates on
>>> 1-32 chunks of a power of 2 multiple pages. First, we determine from the
>>> size what power of 2 multiple we can use. Then we calculate how many
>>> chunks (1-31) of the power of 2 size for the range on the iteration. On
>>> each iteration, we move up in size by at least 5 bits.
>>>
>>> Cc: Eric Auger <eric.au...@redhat.com>
>>> Cc: Jean-Philippe Brucker <jean-phili...@linaro.org>
>>> Cc: Will Deacon <w...@kernel.org>
>>> Cc: Robin Murphy <robin.mur...@arm.com>
>>> Cc: Joerg Roedel <j...@8bytes.org>
>>> Signed-off-by: Rob Herring <r...@kernel.org>
>>> ---
>>>  drivers/iommu/arm-smmu-v3.c | 66 ++++++++++++++++++++++++++++++++++++-
>>>  1 file changed, 65 insertions(+), 1 deletion(-)
> 
> 
>>> @@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long 
>>> iova, size_t size,
>>>  {
>>>       u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
>>>       struct arm_smmu_device *smmu = smmu_domain->smmu;
>>> -     unsigned long start = iova, end = iova + size;
>>> +     unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0;
>>>       int i = 0;
>>>       struct arm_smmu_cmdq_ent cmd = {
>>>               .tlbi = {
>>> @@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long 
>>> iova, size_t size,
>>>               cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>>>       }
>>>
>>> +     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>> +             /* Get the leaf page size */
>>> +             tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>>> +
>>> +             /* Convert page size of 12,14,16 (log2) to 1,2,3 */
>>> +             cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
>>> +
>>> +             /* Determine what level the granule is at */
>>> +             cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
>>> +
>>> +             num_pages = size / (1UL << tg);
>>> +     }
>>> +
>>>       while (iova < end) {
>>>               if (i == CMDQ_BATCH_ENTRIES) {
>>>                       arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, false);
>>>                       i = 0;
>>>               }
>>>
>>> +             if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>> +                     /*
>>> +                      * On each iteration of the loop, the range is 5 bits
>>> +                      * worth of the aligned size remaining.
>>> +                      * The range in pages is:
>>> +                      *
>>> +                      * range = (num_pages & (0x1f << __ffs(num_pages)))
>>> +                      */
>>> +                     unsigned long scale, num;
>>> +
>>> +                     /* Determine the power of 2 multiple number of pages 
>>> */
>>> +                     scale = __ffs(num_pages);
>>> +                     cmd.tlbi.scale = scale;
>>> +
>>> +                     /* Determine how many chunks of 2^scale size we have 
>>> */
>>> +                     num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX;
>>> +                     cmd.tlbi.num = num - 1;
>>> +
>>> +                     /* range is num * 2^scale * pgsize */
>>> +                     granule = num << (scale + tg);
>>> +
>>> +                     /* Clear out the lower order bits for the next 
>>> iteration */
>>> +                     num_pages -= num << scale;
>> Regarding the 2 options given in
>> https://lore.kernel.org/linux-arm-kernel/CAL_JsqKABoE+0crGwyZdNogNgEoG=moopf6deqgh6s73c0u...@mail.gmail.com/raw,
>>
>> I understand you implemented 2) but I still do not understand why you
>> preferred that one against 1).
>>
>> In your case of 1023*4k pages this will invalidate by 31 32*2^0*4K +
>> 31*2^0*4K pages
>> whereas you could achieve that with 10 invalidations with the 1st algo.
>> I did not get the case where it is more efficient. Please can you detail.
> 
> No, it's only 2 commands. We do 31*4K and then 31*2^5*4K. Here's a the
> output of a test case:
> 
> iova=10001000, num_pages=0x3e0, granule=1f000, num=31, scale=0, ttl=3
> iova=10020000, num_pages=0x0, granule=3e0000, num=31, scale=5, ttl=3
> 
> (num_pages being what's left at end of the loop)

> 
> As I mentioned on v1, worst case is 4 commands for up to 4GB. It's
> 20-bits of size (32-12) and each loop processes a minimum of 5 bits.
> Each loop becomes a larger aligned size, so scale goes up each pass.
> This is what I tried to explain in the top comment.

Sorry for the delay, I was out of the office. Yes indeed I misunderstood
the code and the algo looks good to me now.

just a minor comment, I would use a dedicated "inv_range" loop block
local variable instead of "granule".

Besides
Reviewed-by: Eric Auger <eric.au...@redhat.com>

Thanks

Eric




> 
> Rob
> 

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to