On 18/05/2018 11:09, Chris Wilson wrote:
Inside the live_hangcheck (reset) selftests, we occasionally see
failures like

<7>[  239.094840] i915_gem_set_wedged rcs0
<7>[  239.094843] i915_gem_set_wedged     current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
<7>[  239.094846] i915_gem_set_wedged     Reset count: 6239 (global 1)
<7>[  239.094848] i915_gem_set_wedged     Requests:
<7>[  239.095052] i915_gem_set_wedged             first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095056] i915_gem_set_wedged             last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
<7>[  239.095059] i915_gem_set_wedged             active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095062] i915_gem_set_wedged             [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
<7>[  239.100050] i915_gem_set_wedged             ring->start:  0x00283000
<7>[  239.100053] i915_gem_set_wedged             ring->head:   0x000001f8
<7>[  239.100055] i915_gem_set_wedged             ring->tail:   0x000002a8
<7>[  239.100057] i915_gem_set_wedged             ring->emit:   0x000002a8
<7>[  239.100059] i915_gem_set_wedged             ring->space:  0x00000f10
<7>[  239.100085] i915_gem_set_wedged     RING_START: 0x00283000
<7>[  239.100088] i915_gem_set_wedged     RING_HEAD:  0x00000260
<7>[  239.100091] i915_gem_set_wedged     RING_TAIL:  0x000002a8
<7>[  239.100094] i915_gem_set_wedged     RING_CTL:   0x00000001
<7>[  239.100097] i915_gem_set_wedged     RING_MODE:  0x00000300 [idle]
<7>[  239.100100] i915_gem_set_wedged     RING_IMR: fffffefe
<7>[  239.100104] i915_gem_set_wedged     ACTHD:  0x00000000_0000609c
<7>[  239.100108] i915_gem_set_wedged     BBADDR: 0x00000000_0000609d
<7>[  239.100111] i915_gem_set_wedged     DMA_FADDR: 0x00000000_00283260
<7>[  239.100114] i915_gem_set_wedged     IPEIR: 0x00000000
<7>[  239.100117] i915_gem_set_wedged     IPEHR: 0x02800000
<7>[  239.100120] i915_gem_set_wedged     Execlist status: 0x00044052 00000002
<7>[  239.100124] i915_gem_set_wedged     Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  239.100128] i915_gem_set_wedged             ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100132] i915_gem_set_wedged             ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100135] i915_gem_set_wedged             HW active? 0x5
<7>[  239.100250] i915_gem_set_wedged             E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100338] i915_gem_set_wedged             E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100340] i915_gem_set_wedged             Queue priority: 139
<7>[  239.100343] i915_gem_set_wedged             Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
<7>[  239.100346] i915_gem_set_wedged             Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
<7>[  239.100349] i915_gem_set_wedged             Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
<7>[  239.100352] i915_gem_set_wedged             Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
<7>[  239.100356] i915_gem_set_wedged             Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
<7>[  239.100362] i915_gem_set_wedged     drv_selftest [5894] waiting for 19a99

where the GPU saw an arbitration point and idled; AND HAS NOT BEEN RESET!
The RING_MODE indicates that the ring is idle and has the STOP_RING bit
set, so try clearing it.
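
(For reference, RING_MI_MODE is a masked register: the upper 16 bits of
a write select which of the lower 16 bits are actually updated, so a
single bit can be set or cleared without a read-modify-write. A minimal
sketch of how the i915 helpers in i915_reg.h expand:)

  /* Masked-register helpers: the high word is the write-enable mask
   * for the low word.
   */
  #define _MASKED_BIT_ENABLE(a)  ({ typeof(a) _a = (a); (_a << 16) | _a; })
  #define _MASKED_BIT_DISABLE(a) (((a) << 16))

  /* Clearing STOP_RING therefore writes (STOP_RING << 16) | 0: only
   * the STOP_RING bit is modified, all other bits are left untouched.
   */
  I915_WRITE(RING_MI_MODE(engine->mmio_base),
             _MASKED_BIT_DISABLE(STOP_RING));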

v2: Only clear the bit on restarting the ring, as we want to be sure the
STOP_RING bit is kept if reset fails on wedging.
v3: Spot when the ring state doesn't make sense as we re-initialise the
engine, and dump it to the logs so that we don't have to wait for a
later error and then guess what happened earlier.
v4: Prepare to print all the unexpected state, not just the first.

Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursu...@linux.intel.com>
---
  drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
  1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3744f5750624..ba8411ba4abf 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
                I915_WRITE(RING_MODE_GEN7(engine),
                           _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
+       I915_WRITE(RING_MI_MODE(engine->mmio_base),
+                  _MASKED_BIT_DISABLE(STOP_RING));

It worries me a bit to clear it unconditionally, since the documentation says nothing (that I can find) about this scenario.
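
One way to narrow the exposure would be to only clear the bit when it is actually seen set, e.g. (untested sketch, reusing the names from the patch above):

  u32 mode = I915_READ(RING_MI_MODE(engine->mmio_base));

  /* Only touch STOP_RING when the anomaly is actually present. */
  if (mode & STOP_RING)
          I915_WRITE(RING_MI_MODE(engine->mmio_base),
                     _MASKED_BIT_DISABLE(STOP_RING));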

+
        I915_WRITE(RING_HWS_PGA(engine->mmio_base),
                   engine->status_page.ggtt_offset);
        POSTING_READ(RING_HWS_PGA(engine->mmio_base));
@@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
        engine->execlists.csb_head = -1;
  }
+static bool unexpected_starting_state(struct intel_engine_cs *engine)
+{
+       struct drm_i915_private *dev_priv = engine->i915;
+       bool unexpected = false;
+
+       if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
+               DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");

It also worries me to bury this as a debug message, unless we can find in the documentation that this is known to happen occasionally.

If we had the check, and its logging, immediately after reset, at a more visible log level, we could build a larger data set of reported failures and then go back to the hw people and ask them about it.
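
Roughly what I have in mind, as an untested sketch (the exact hook point in the reset path and the log level are of course open):

  /* Hypothetical check to run right after the engine reset completes:
   * shout loudly so reports are easy to harvest from CI logs.
   */
  static void check_ring_stopped_after_reset(struct intel_engine_cs *engine)
  {
          struct drm_i915_private *dev_priv = engine->i915;

          if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING)
                  DRM_ERROR("%s: STOP_RING still set after reset\n",
                            engine->name);
  }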

+               unexpected = true;
+       }
+
+       return unexpected;
+}
+
  static int gen8_init_common_ring(struct intel_engine_cs *engine)
  {
        struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -1801,6 +1817,12 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
        intel_engine_reset_breadcrumbs(engine);
        intel_engine_init_hangcheck(engine);
+       if (GEM_SHOW_DEBUG() && unexpected_starting_state(engine)) {
+               struct drm_printer p = drm_debug_printer(__func__);
+
+               intel_engine_dump(engine, &p, NULL);
+       }
+
        enable_execlists(engine);
        /* After a GPU reset, we may have requests to replay */


Regards,

Tvrtko