On 13/10/17 19:32, Will Deacon wrote:
> Hi Robin,
> 
> This mostly looks good. Just a few comments below.
> 
> On Thu, Aug 31, 2017 at 02:44:27PM +0100, Robin Murphy wrote:
>> As an IRQ, the CMD_SYNC interrupt is not particularly useful, not least
>> because we often need to wait for sync completion within someone else's
>> IRQ handler anyway. However, when the SMMU is both coherent and supports
>> MSIs, we can have a lot more fun by not using it as an interrupt at all.
>> Following the example suggested in the architecture and using a write
>> targeting normal memory, we can let callers wait on a status variable
>> outside the lock instead of having to stall the entire queue or even
>> touch MMIO registers. Since multiple sync commands are guaranteed to
>> complete in order, a simple incrementing sequence count is all we need
>> to unambiguously support any realistic number of overlapping waiters.
>>
>> Signed-off-by: Robin Murphy <robin.mur...@arm.com>
>> ---
>>
>> v2: Remove redundant 'bool msi' command member, other cosmetic tweaks
>>
>>  drivers/iommu/arm-smmu-v3.c | 47 +++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index f066725298cd..311f482b93d5 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -377,7 +377,16 @@
>>  
>>  #define CMDQ_SYNC_0_CS_SHIFT                12
>>  #define CMDQ_SYNC_0_CS_NONE         (0UL << CMDQ_SYNC_0_CS_SHIFT)
>> +#define CMDQ_SYNC_0_CS_IRQ          (1UL << CMDQ_SYNC_0_CS_SHIFT)
>>  #define CMDQ_SYNC_0_CS_SEV          (2UL << CMDQ_SYNC_0_CS_SHIFT)
>> +#define CMDQ_SYNC_0_MSH_SHIFT               22
>> +#define CMDQ_SYNC_0_MSH_ISH         (3UL << CMDQ_SYNC_0_MSH_SHIFT)
>> +#define CMDQ_SYNC_0_MSIATTR_SHIFT   24
>> +#define CMDQ_SYNC_0_MSIATTR_OIWB    (0xfUL << CMDQ_SYNC_0_MSIATTR_SHIFT)
>> +#define CMDQ_SYNC_0_MSIDATA_SHIFT   32
>> +#define CMDQ_SYNC_0_MSIDATA_MASK    0xffffffffUL
>> +#define CMDQ_SYNC_1_MSIADDR_SHIFT   0
>> +#define CMDQ_SYNC_1_MSIADDR_MASK    0xffffffffffffcUL
>>  
>>  /* Event queue */
>>  #define EVTQ_ENT_DWORDS                     4
>> @@ -409,6 +418,7 @@
>>  /* High-level queue structures */
>>  #define ARM_SMMU_POLL_TIMEOUT_US    100
>>  #define ARM_SMMU_CMDQ_DRAIN_TIMEOUT_US      1000000 /* 1s! */
>> +#define ARM_SMMU_SYNC_TIMEOUT_US    1000000 /* 1s! */
> 
> We only ever do this when waiting for the queue to drain, so may as well
> just reuse the drain timeout.

As you've discovered, we remove the "drain" case entirely in the end.

>>  #define MSI_IOVA_BASE                       0x8000000
>>  #define MSI_IOVA_LENGTH                     0x100000
>> @@ -504,6 +514,10 @@ struct arm_smmu_cmdq_ent {
>>              } pri;
>>  
>>              #define CMDQ_OP_CMD_SYNC        0x46
>> +            struct {
>> +                    u32                     msidata;
>> +                    u64                     msiaddr;
>> +            } sync;
>>      };
>>  };
>>  
>> @@ -617,6 +631,9 @@ struct arm_smmu_device {
>>      int                             gerr_irq;
>>      int                             combined_irq;
>>  
>> +    atomic_t                        sync_nr;
>> +    u32                             sync_count;
> 
> It's probably worth sticking these in separate cachelines so we don't
> get spurious wakeups when sync_nr is incremented. (yes, I know it should
> be the ERG, but that can be unreasonably huge!).

Good point - we've got 8K of bitmaps embedded in the structure anyway,
so even maximum ERG separation is easily possible.
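
Something along these lines for v3, perhaps (sketch only - the exact
placement is illustrative, and ____cacheline_aligned only buys L1-line
separation rather than the full ERG):

	struct arm_smmu_device {
		/* ... */
		int				gerr_irq;
		int				combined_irq;

		atomic_t			sync_nr;

		unsigned long			ias; /* IPA */
		unsigned long			oas; /* PA */
		/* ... */

		/* Waiters poll this; keep it off the line sync_nr bounces */
		u32				sync_count ____cacheline_aligned;
	};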

>> +
>>      unsigned long                   ias; /* IPA */
>>      unsigned long                   oas; /* PA */
>>      unsigned long                   pgsize_bitmap;
>> @@ -878,7 +895,13 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>>              }
>>              break;
>>      case CMDQ_OP_CMD_SYNC:
>> -            cmd[0] |= CMDQ_SYNC_0_CS_SEV;
>> +            if (ent->sync.msiaddr)
>> +                    cmd[0] |= CMDQ_SYNC_0_CS_IRQ;
>> +            else
>> +                    cmd[0] |= CMDQ_SYNC_0_CS_SEV;
>> +            cmd[0] |= CMDQ_SYNC_0_MSH_ISH | CMDQ_SYNC_0_MSIATTR_OIWB;
>> +            cmd[0] |= (u64)ent->sync.msidata << CMDQ_SYNC_0_MSIDATA_SHIFT;
>> +            cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
>>              break;
>>      default:
>>              return -ENOENT;
>> @@ -964,21 +987,40 @@ static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
>>      spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>>  }
>>  
>> +static int arm_smmu_sync_poll_msi(struct arm_smmu_device *smmu, u32 sync_idx)
>> +{
>> +    ktime_t timeout = ktime_add_us(ktime_get(), ARM_SMMU_SYNC_TIMEOUT_US);
>> +    u32 val = smp_cond_load_acquire(&smmu->sync_count,
>> +                                    (int)(VAL - sync_idx) >= 0 ||
>> +                                    !ktime_before(ktime_get(), timeout));
>> +
>> +    return (int)(val - sync_idx) < 0 ? -ETIMEDOUT : 0;
> 
> There are some theoretical overflow issues here which I don't think will
> ever occur in practice, but deserve at least a comment to explain why.

Even if we did have 2^31 or more CPUs, the size of a queue is bounded at
2^20, so we can never have enough in-flight syncs to get near to an
overflow problem. I can certainly document that if you like.
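
For reference, the comment would essentially spell out this arithmetic
(standalone illustration rather than anything lifted from the patch):

	/*
	 * "val is at or beyond idx" survives 32-bit wraparound as long as
	 * the two values stay within 2^31 of each other, which holds in
	 * practice because only a queue's worth (at most 2^20) of syncs
	 * can be in flight at any time.
	 */
	static bool sync_idx_complete(u32 val, u32 idx)
	{
		return (int)(val - idx) >= 0;
	}

	/* e.g. val = 0x00000002, idx = 0xfffffffe: the counter has wrapped,
	 * but val - idx == 4, so the sync is correctly seen as complete. */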

>> +}
>> +
>>  static void arm_smmu_cmdq_issue_sync(struct arm_smmu_device *smmu)
>>  {
>>      u64 cmd[CMDQ_ENT_DWORDS];
>>      unsigned long flags;
>>      bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
>> +    bool msi = (smmu->features & ARM_SMMU_FEAT_MSI) &&
>> +               (smmu->features & ARM_SMMU_FEAT_COHERENCY);
> 
> I don't think this is sufficient for the case where we fail to setup MSIs
> and fall back on legacy IRQs.

Remember this 'MSI' is going nowhere near the GIC, so the IRQ
configuration is irrelevant (especially after patch #2) - the feature
bits tell us "is the SMMU capable of generating sync-completion writes?"
and "are those writes coherent?", which is precisely what matters here.

>>      struct arm_smmu_cmdq_ent ent = { .opcode = CMDQ_OP_CMD_SYNC };
>>      int ret;
>>  
>> +    if (msi) {
>> +            ent.sync.msidata = atomic_inc_return(&smmu->sync_nr);
> 
> I don't think you need barrier semantics here.

You mean atomic_inc_return_relaxed() would be sufficient? TBH I don't
think I'd given this any thought - I guess the coherency protocols make
it impossible to do an atomic op on stale data, so that seems reasonable.
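
i.e. v3 would end up with something like this (on the assumption that
a unique, increasing value is all we actually need from sync_nr):

	if (msi) {
		/*
		 * Nothing needs ordering against the increment itself -
		 * the value only has to be unique per sync - so the
		 * relaxed form should be sufficient.
		 */
		ent.sync.msidata = atomic_inc_return_relaxed(&smmu->sync_nr);
		ent.sync.msiaddr = virt_to_phys(&smmu->sync_count);
	}
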
>> +            ent.sync.msiaddr = virt_to_phys(&smmu->sync_count);
>> +    }
>>      arm_smmu_cmdq_build_cmd(cmd, &ent);
>>  
>>      spin_lock_irqsave(&smmu->cmdq.lock, flags);
>>      arm_smmu_cmdq_insert_cmd(smmu, cmd);
>> -    ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
>> +    if (!msi)
>> +            ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
>>      spin_unlock_irqrestore(&smmu->cmdq.lock, flags);
>>  
>> +    if (msi)
>> +            ret = arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
> 
> This looks like the queue polling should be wrapped up in a function that
> returns with the spinlock released.

I did ponder that, but I can't help finding such asymmetric interfaces
pretty grim, and things do get better again once both cases can wait
outside the lock.
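
To check we're picturing the same thing, the shape I was imagining (name
entirely made up) is roughly:

	/* Called with cmdq.lock held, returns with it released */
	static int arm_smmu_cmdq_sync_wait_unlock(struct arm_smmu_device *smmu,
						  bool poll, bool wfe,
						  unsigned long flags)
		__releases(&smmu->cmdq.lock)
	{
		int ret = 0;

		if (poll)
			ret = queue_poll_cons(&smmu->cmdq.q, true, wfe);
		spin_unlock_irqrestore(&smmu->cmdq.lock, flags);

		return ret;
	}

which saves the caller a couple of lines but hides the unlock, hence my
reluctance.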

Robin.