Quoting Michel Thierry (2018-04-27 21:27:46)
> On 4/27/2018 1:24 PM, Chris Wilson wrote:
> > Previously, we just reset the ring register in the context image such
> > that we could skip over the broken batch and emit the closing
> > breadcrumb. However, on resume the context image and GPU state would be
> > reloaded, which may have been left in an inconsistent state by the
> > reset. The presumption was that at worst it would just cause another
> > reset and skip again until it recovered, however it seems just as likely
> > to cause an unrecoverable hang. Instead of risking loading an incomplete
> > context image, restore it back to the default state.
> > 
> > v2: Fix up off-by-one from including the ppHWSP in with the register
> > state.
> > 
> > Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuopp...@linux.intel.com>
> > Cc: Michał Winiarski <michal.winiar...@intel.com>
> > Cc: Michel Thierry <michel.thie...@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
> 
> Reviewed-by: Michel Thierry <michel.thie...@intel.com>
> 
> Does it need a 'Fixes:' tag or have a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at 

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways the reset might appear to work but the
recovery still fail, with

<7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  521.765176] missed_breadcrumb     current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[  521.765191] missed_breadcrumb     Reset count: 0 (global 0)
<7>[  521.765206] missed_breadcrumb     Requests:
<7>[  521.765223] missed_breadcrumb             first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765239] missed_breadcrumb             last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765256] missed_breadcrumb             active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765274] missed_breadcrumb             [head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[  521.765289] missed_breadcrumb             ring->start:  0x008ef000
<7>[  521.765301] missed_breadcrumb             ring->head:   0x000038f8
<7>[  521.765313] missed_breadcrumb             ring->tail:   0x00003948
<7>[  521.765325] missed_breadcrumb             ring->emit:   0x00003950
<7>[  521.765337] missed_breadcrumb             ring->space:  0x00002618
<7>[  521.765372] missed_breadcrumb     RING_START: 0x008ef000
<7>[  521.765389] missed_breadcrumb     RING_HEAD:  0x000038f8
<7>[  521.765404] missed_breadcrumb     RING_TAIL:  0x00003948
<7>[  521.765422] missed_breadcrumb     RING_CTL:   0x00003001
<7>[  521.765438] missed_breadcrumb     RING_MODE:  0x00000000
<7>[  521.765453] missed_breadcrumb     RING_IMR: fffffefe
<7>[  521.765473] missed_breadcrumb     ACTHD:  0x00000000_022039b8
<7>[  521.765492] missed_breadcrumb     BBADDR: 0x00000000_00042004
<7>[  521.765511] missed_breadcrumb     DMA_FADDR: 0x00000000_008f28f8
<7>[  521.765537] missed_breadcrumb     IPEIR: 0x00000000
<7>[  521.765552] missed_breadcrumb     IPEHR: 0x11000011
<7>[  521.765570] missed_breadcrumb     Execlist status: 0x00044032 00000002
<7>[  521.765586] missed_breadcrumb     Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  521.765604] missed_breadcrumb     Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  521.765619] missed_breadcrumb             ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765632] missed_breadcrumb             ELSP[1] idle
<7>[  521.765645] missed_breadcrumb             HW active? 0x1
<7>[  521.765660] missed_breadcrumb             E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765670] missed_breadcrumb             Queue priority: -2147483648
<7>[  521.765684] missed_breadcrumb     gem_sync [3112] waiting for e4f
<7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  521.765707] missed_breadcrumb HWSP:
<7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765733] missed_breadcrumb *
<7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765784] missed_breadcrumb *
<7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765833] missed_breadcrumb *
<7>[  521.765845] missed_breadcrumb Idle? no

Of particular note: the IPEHR is an MI_LRI, the ring is idle (it hasn't
moved on from the earlier reset), and the fetch address is unconnected to
the rings, so naturally I assume it died loading the context image on
resume.
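
For anyone wanting to repeat that reading of the dump, here is a rough
sketch (not part of the patch) of the two checks involved: decoding the
IPEHR value as an MI command header, and testing whether an address falls
inside the ring described by RING_START and RING_CTL. The helper names are
hypothetical. The bit positions are assumptions based on how i915 builds
MI commands (opcode in bits 28:23, MI_LOAD_REGISTER_IMM being opcode 0x22
with a length field of 2*n - 1) and on the RING_CTL buffer length field
(bits 20:12, in pages minus one). It also assumes that "fetch address"
refers to ACTHD; the log itself does not say which register was meant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, mirroring i915's MI_INSTR(opcode, flags). */
#define MI_OPCODE_SHIFT 23
#define MI_OPCODE_MASK  0x3fu
#define MI_LRI_OPCODE   0x22u        /* MI_LOAD_REGISTER_IMM */

/* Assumed RING_CTL buffer length field: (pages - 1) in bits 20:12. */
#define RING_NR_PAGES   0x001ff000u
#define RING_PAGE_SIZE  0x1000u

/* Hypothetical helper: is this header an MI_LOAD_REGISTER_IMM? */
static bool is_mi_lri(uint32_t header)
{
	/* Bits 31:29 select the command client; 0 means MI. */
	return (header >> 29) == 0 &&
	       ((header >> MI_OPCODE_SHIFT) & MI_OPCODE_MASK) == MI_LRI_OPCODE;
}

/* Hypothetical helper: number of (register, value) pairs in an LRI. */
static unsigned int mi_lri_num_regs(uint32_t header)
{
	assert(is_mi_lri(header));
	/* The low byte holds 2*n - 1 for n register writes. */
	return ((header & 0xffu) + 1) / 2;
}

/* Hypothetical helper: ring size in bytes as encoded in RING_CTL. */
static uint32_t ring_size_from_ctl(uint32_t ctl)
{
	return (ctl & RING_NR_PAGES) + RING_PAGE_SIZE;
}

/* Hypothetical helper: does addr lie within [start, start + size)? */
static bool addr_in_ring(uint32_t addr, uint32_t start, uint32_t ctl)
{
	return addr >= start && addr - start < ring_size_from_ctl(ctl);
}
```

Fed the values from the dump, is_mi_lri(0x11000011) holds and
mi_lri_num_regs() gives 9 register writes; RING_CTL 0x00003001 decodes to
a 0x4000-byte ring at 0x008ef000, which does not contain the low dword of
ACTHD (0x022039b8). DMA_FADDR (0x008f28f8) does sit inside the ring, at
RING_START + RING_HEAD.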
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx