Chris Wilson <ch...@chris-wilson.co.uk> writes:

> If the engine isn't being retired (worker starvation?) then it is
> possible for us to repeatedly observe that between consecutive
> hangchecks the seqno on the ring is the same and there remain
> unretired requests. Ignore these completely and only regard the engine
> as busy for the purpose of hang detection (not stall detection) if there
> are outstanding breadcrumbs.
>
> In recent history we have looked at using both the request and seqno as
> indication of activity on the engine, but that was reduced to just
> inspecting seqno in commit cffa781e5907 ("drm/i915: Simplify check for
> idleness in hangcheck"). However, in commit dcff85c8443e ("drm/i915:
> Enable i915_gem_wait_for_idle() without holding struct_mutex"), I made
> the decision to use the new common lockless function, under the
> assumption that request retirement was more frequent than hangcheck and
> so we would not have a stuck busy check. The flaw there was in
> forgetting that we accumulate the hang score, and so successive checks
> seeing a stuck request, albeit with the GPU advancing elsewhere and so
> not necessarily the same stuck request, would eventually trigger the hang.
>
> Fixes: dcff85c8443e ("drm/i915: Enable i915_gem_wait_for_idle()...")
> Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuopp...@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index ebb83d5a448b..7610eca4f687 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -3079,6 +3079,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>               bool busy = intel_engine_has_waiter(engine);
>               u64 acthd;
>               u32 seqno;
> +             u32 submit;
>  
>               semaphore_clear_deadlocks(dev_priv);
>  
> @@ -3094,9 +3095,10 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  
>               acthd = intel_engine_get_active_head(engine);
>               seqno = intel_engine_get_seqno(engine);
> +             submit = READ_ONCE(engine->last_submitted_seqno);
>  
>               if (engine->hangcheck.seqno == seqno) {
> -                     if (!intel_engine_is_active(engine)) {
> +                     if (i915_seqno_passed(seqno, submit)) {

The setting of busy could be moved into the inner scope.

Also, the check could be seqno == submit, with a warning if we see
the seqno on the engine past submit.
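
To illustrate the difference, here is a minimal userspace sketch, not
driver code. seqno_passed() mirrors the wraparound-safe comparison that
i915_seqno_passed() performs; engine_idle() is a hypothetical helper
showing one reading of the stricter check, where only seqno == submit
counts as idle and a seqno beyond the last submission is reported.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Wraparound-safe "seq1 is at or after seq2", the same idea as
 * i915_seqno_passed(): compare the signed 32-bit difference.
 */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Hypothetical helper (not in the patch): treat the engine as idle only
 * when it has completed exactly what was submitted, and warn if the
 * engine reports a seqno beyond the last submission.
 */
static bool engine_idle(uint32_t seqno, uint32_t submit)
{
	if (seqno != submit && seqno_passed(seqno, submit))
		fprintf(stderr, "warning: seqno %u is past last submitted %u\n",
			(unsigned int)seqno, (unsigned int)submit);

	return seqno == submit;
}
```

With the patch as posted, a seqno past submit is silently treated as
idle; with the sketch above it would be flagged and not treated as idle.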

But the patch fixes what it says it does,

Reviewed-by: Mika Kuoppala <mika.kuopp...@intel.com>

>                               engine->hangcheck.action = HANGCHECK_IDLE;
>                               if (busy) {
>                                       /* Safeguard against driver failure */
> -- 
> 2.9.3