On Tue, Jan 28, 2025 at 7:34 PM Connor Abbott <cwabbo...@gmail.com> wrote: > > On Tue, Jan 28, 2025 at 10:25 PM Rob Clark <robdcl...@gmail.com> wrote: > > > > On Tue, Jan 28, 2025 at 6:31 PM Connor Abbott <cwabbo...@gmail.com> wrote: > > > > > > On Tue, Jan 28, 2025 at 9:21 PM Rob Clark <robdcl...@gmail.com> wrote: > > > > > > > > On Tue, Jan 28, 2025 at 2:15 PM Connor Abbott <cwabbo...@gmail.com> > > > > wrote: > > > > > > > > > > On Tue, Jan 28, 2025 at 5:10 PM Rob Clark <robdcl...@gmail.com> wrote: > > > > > > > > > > > > On Tue, Jan 28, 2025 at 3:08 AM Prakash Gupta > > > > > > <quic_gup...@quicinc.com> wrote: > > > > > > > > > > > > > > On Thu, Jan 23, 2025 at 03:14:16PM -0500, Connor Abbott wrote: > > > > > > > > On Thu, Jan 23, 2025 at 2:26 PM Prakash Gupta > > > > > > > > <quic_gup...@quicinc.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Jan 23, 2025 at 09:00:17AM -0500, Connor Abbott wrote: > > > > > > > > > > On Thu, Jan 23, 2025 at 6:10 AM Prakash Gupta > > > > > > > > > > <quic_gup...@quicinc.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, Jan 22, 2025 at 03:00:58PM -0500, Connor Abbott > > > > > > > > > > > wrote: > > > > > > > > > > > The context would remain stalled till we write to > > > > > > > > > > > CBn_RESUME. Which is done > > > > > > > > > > > in qcom_adreno_smmu_resume_translation(). For a stalled > > > > > > > > > > > context further > > > > > > > > > > > transactions are not processed and we shouldn't see > > > > > > > > > > > further faults and > > > > > > > > > > > or fault inerrupts. Do you observe faults with stalled > > > > > > > > > > > context? > > > > > > > > > > > > > > > > > > > > Yes. I've observed that on MMU-500 writing RESUME before > > > > > > > > > > the interrupt > > > > > > > > > > has been cleared doesn't clear SS. This happened with v2 in > > > > > > > > > > the case > > > > > > > > > > where there was already a devcoredump and drm/msm called > > > > > > > > > > qcom_adreno_smmu_resume_translation() immediately from its > > > > > > > > > > fault > > > > > > > > > > handler, and we'd get a storm of unhandled interrupts until > > > > > > > > > > it was > > > > > > > > > > disabled. Given that the architecture spec says we're > > > > > > > > > > supposed to > > > > > > > > > > clear the interrupt first this may have been an attempt to > > > > > > > > > > "help" > > > > > > > > > > developers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Just to clarify, present sequence is: > > > > > > > > > 1. context fault with stall enable. FSR.SS set. > > > > > > > > > 2. Report fault to client > > > > > > > > > 3. resume/terminate stalled transaction > > > > > > > > > 4. clear FSR > > > > > > > > > > > > > > > > > > At what point when you try #2->#3->#4 or #4->#2->#3 sequence, > > > > > > > > > is FSR.SS > > > > > > > > > cleared and interrupt storm is observed. > > > > > > > > > > > > > > > > With #2->#3->#4 FSR.SS is *not* cleared and there is a > > > > > > > > subsequent > > > > > > > > interrupt storm with only FSR.SS asserted. With #4->#2->#3 > > > > > > > > there is no > > > > > > > > interrupt storm. From this we conclude that MMU-500 does not > > > > > > > > clear > > > > > > > > FSR.SS unless #4 happens before #3. > > > > > > > > > > > > > > > Thanks Connor for clarification. I get your point now. I think > > > > > > > it's > > > > > > > expected interrupt storm with #2->#3->#4 sequence is expected. > > > > > > > With > > > > > > > CONFIG_ARM_SMMU_QCOM_DEBUG enabled, context fault follows the > > > > > > > sequence of > > > > > > > #1->#2->#4->#3, which is spec recommended. > > > > > > > > > > > > > > so, common fault handler can be modified to follow the same > > > > > > > sequence, but I > > > > > > > have concern regarding clearing FSR before reporting fault to > > > > > > > client. > > > > > > > qcom_adreno_smmu_get_fault_info() is an example I gave but other > > > > > > > client > > > > > > > handler may have similar expecation of fault register not changed > > > > > > > before > > > > > > > client fault handler is called. > > > > > > > > > > > > Simple solution would be to move the clearing of FSR to after the > > > > > > fault is reported. It doesn't really matter if it is before or > > > > > > after, > > > > > > as long as it happens before the irq handler returns, AFAIU. And > > > > > > drm/msm will collect the fault info from the irq handler. > > > > > > > > > > As I said in the earlier mail: "From this we conclude that MMU-500 > > > > > does not clear FSR.SS unless #4 happens before #3." #4 is clearing FSR > > > > > and #3 is writing RESUME. So no, unfortunately it does actually matter > > > > > and we get a storm of unhandled IRQs if we don't clear FSR first. Your > > > > > solution is what v2 did and it didn't work. > > > > > > > > So, just clear FSR also in the resume path > > > > > > Then we'd run the risk of incorrectly accepting a second fault if it > > > happens immediately after we resume. Not terrible for our use-case > > > where we only really need the first fault to be accurate and we > > > disable stall-on-fault afterwards, but if you ever leave it on then it > > > would result in another interrupt storm. > > > > > > > Howso? We'd still be ensuring that #4 happens before #3? If needed we > > can add a param to resume_translation() so drm can tell it to only > > clear FSR if it calls resume_translation from the interrupt handler, > > or something like that. But I don't want to lose capturing the FSR, > > and I don't think we need to. > > > > BR, > > -R > > Because we could clear FSR in the resume handler (ok), then resume, > then fault again, then clear FSR again in arm_smmu_context_fault() and > unintentionally acknowledge the fault without a corresponding resume. > Again, it's not possible with drm/msm after this series since we > disable stall-on-fault before resuming but still this means it's very > fragile.
Alternatively we could use the return code from the fault handler (maybe, -EFAULT) to indicate that arm-smmu irq handler should terminate+resume immediately, rather than calling resume_translation() in the case that drm doesn't schedule fault_handler. We can change both sides of the adreno_smmu_priv interface if needed to come up with something sensible. > To be clear, FSR should already be frozen until we resume, if the > SMMUv2 architecture pseudocode is to be believed. So any fault that's > recorded in the devcoredump will still be accurate with this patch, > just subsequent faults might not be accurate. I guess by default I'd be inclined to "trust but verify" that (and also on qcom,smmu-v2). BR, -R