On Thu, Jan 23, 2025 at 2:26 PM Prakash Gupta <quic_gup...@quicinc.com> wrote: > > On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote: > > On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta <quic_gup...@quicinc.com> > > wrote: > > > > > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott wrote: > > > > > > > + /* > > > > + * On some implementations FSR.SS asserts a context fault > > > > + * interrupt. We do not want this behavior, because resolving the > > > > + * original context fault typically requires operations that > > > > cannot be > > > > + * performed in IRQ context but leaving the stall unacknowledged > > > > will > > > > + * immediately lead to another spurious interrupt as FSR.SS is > > > > still > > > > + * set. Work around this by disabling interrupts for this context > > > > bank. > > > > + * It's expected that interrupts are re-enabled after resuming the > > > > + * translation. > > > > + * > > > > + * We have to do this before report_iommu_fault() so that we don't > > > > + * leave interrupts disabled in case the downstream user decides > > > > the > > > > + * fault can be resolved inside its fault handler. > > > > + * > > > > + * There is a possible race if there are multiple context banks > > > > sharing > > > > + * the same interrupt and both signal an interrupt in between > > > > writing > > > > + * RESUME and SCTLR. We could disable interrupts here before we > Not sure if multiple context bank with shared interrupt supported for > arm-smmu driver, but even if does than context fault handler they would > operate on their respective context register space, so I don't see the race > at context register update. > > > > > + * re-enable them in the resume handler, leaving interrupts > > > > enabled. > > > > + * Lock the write to serialize it with the resume handler. > > > > + */ > > > > + if (cfi.fsr & ARM_SMMU_CB_FSR_SS) { > > > > + u32 val; > > > > + > > > > + spin_lock(&smmu_domain->cb_lock); > > > > + val = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_SCTLR); > > > > + val &= ~ARM_SMMU_SCTLR_CFIE; > > > > + arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, val); > > > > + spin_unlock(&smmu_domain->cb_lock); > > > > + } > > > > + > > > > + /* > > > > + * The SMMUv2 architecture specification says that if > > > > stall-on-fault is > > > > + * enabled the correct sequence is to write to SMMU_CBn_FSR to > > > > clear > > > > + * the fault and then write to SMMU_CBn_RESUME. Clear the > > > > interrupt > > > > + * first before running the user's fault handler to make sure we > > > > follow > > > > + * this sequence. It should be ok if there is another fault in the > > > > + * meantime because we have already read the fault info. > > > > + */ > qcom_adreno_smmu_get_fault_info() reads the fault info as part of client > fault hanlder. So it would not be ok to clear FSR before reporting the > fault to client.
That's a good point, but as long as stall-on-fault is enabled it doesn't matter because subsequent transactions that fault will be stalled. Patch 3 of this series disables stall-on-fault after the first fault in drm/msm, but we don't care as much about the accuracy of those subsequent faults. > > > > The context would remain stalled till we write to CBn_RESUME. Which is > > > done > > > in qcom_adreno_smmu_resume_translation(). For a stalled context further > > > transactions are not processed and we shouldn't see further faults and > > > or fault inerrupts. Do you observe faults with stalled context? > > > > Yes. I've observed that on MMU-500 writing RESUME before the interrupt > > has been cleared doesn't clear SS. This happened with v2 in the case > > where there was already a devcoredump and drm/msm called > > qcom_adreno_smmu_resume_translation() immediately from its fault > > handler, and we'd get a storm of unhandled interrupts until it was > > disabled. Given that the architecture spec says we're supposed to > > clear the interrupt first this may have been an attempt to "help" > > developers. > > > > Just to clarify, present sequence is: > 1. context fault with stall enable. FSR.SS set. > 2. Report fault to client > 3. resume/terminate stalled transaction > 4. clear FSR > > At what point when you try #2->#3->#4 or #4->#2->#3 sequence, is FSR.SS > cleared and interrupt storm is observed. The way CFIE disable is helping > with current patch indicates write FSR is unstalling context and subsequent > transactions are faulted. Do you stop getting interrupt storm after write > RESUME. If you can mention your SCTLR configuration and FSR state in test > sequence, it would be clearer. > > An aside, If reducing delay between FSR and RESUME write helps then both > can be done as part of qcom_adreno_smmu_resume_translation(). This will > follow spec recommendation and also make sure fault registers are not > cleared before reporting fault to client. > > Thanks, > Prakash