On 4/27/2018 1:35 PM, Chris Wilson wrote:
Quoting Michel Thierry (2018-04-27 21:27:46)
On 4/27/2018 1:24 PM, Chris Wilson wrote:
Previously, we just reset the ring register in the context image such
that we could skip over the broken batch and emit the closing
breadcrumb. However, on resume the context image and GPU state would be
reloaded, which may have been left in an inconsistent state by the
reset. The presumption was that at worst it would just cause another
reset and skip again until it recovered, however it seems just as likely
to cause an unrecoverable hang. Instead of risking loading an incomplete
context image, restore it back to the default state.

v2: Fix up off-by-one from including the ppHWSP in with the register
state.

Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuopp...@linux.intel.com>
Cc: Michał Winiarski <michal.winiar...@intel.com>
Cc: Michel Thierry <michel.thie...@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
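
As a rough illustration of the approach described in the commit message
(not the patch itself), the sketch below copies a saved default image
over the register state of a context image while leaving the first page
alone. The layout (one ppHWSP page followed by register state), the
4096-byte page size, and every sketch_* / SKETCH_* name are assumptions
made for this example. The v2 note corresponds to the copy length being
the context size minus that first page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed page size for this sketch. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical layout: page 0 of the context image is the ppHWSP,
 * and the register state starts on the following page.
 */
#define SKETCH_STATE_OFFSET SKETCH_PAGE_SIZE

/*
 * Overwrite the register state of a possibly inconsistent context image
 * with the default state. context_size covers the whole image, ppHWSP
 * page included, so the copy length must exclude that page.
 */
static void sketch_restore_default_state(uint8_t *image,
					 const uint8_t *defaults,
					 size_t context_size)
{
	assert(context_size >= SKETCH_STATE_OFFSET);

	memcpy(image + SKETCH_STATE_OFFSET,
	       defaults + SKETCH_STATE_OFFSET,
	       context_size - SKETCH_STATE_OFFSET);
}
```

Copying context_size bytes from the state offset instead would run one
page past the end of both buffers, which is the kind of off-by-one the
v2 note refers to.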

Reviewed-by: Michel Thierry <michel.thie...@intel.com>

Does it need a 'Fixes:' tag, or is there a bugzilla reference?

I suspect it's rare enough that the unrecoverable hang might not be
recognisable in bugzilla. I was just looking at

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4108/fi-bsw-n3050/dmesg0.log

trying to think of ways how the reset might appear to work but the
recovery fail with

<7>[  521.765114] missed_breadcrumb vecs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x5a/0x80 [i915]
<7>[  521.765176] missed_breadcrumb       current seqno e4e, last e4f, hangcheck e4e [2048 ms], inflight 1
<7>[  521.765191] missed_breadcrumb       Reset count: 0 (global 0)
<7>[  521.765206] missed_breadcrumb       Requests:
<7>[  521.765223] missed_breadcrumb               first  e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765239] missed_breadcrumb               last   e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765256] missed_breadcrumb               active e4f [9b82:e4f] prio=0 @ 3766ms: gem_sync[3107]/0
<7>[  521.765274] missed_breadcrumb               [head 3900, postfix 3930, tail 3948, batch 0x00000000_00042000]
<7>[  521.765289] missed_breadcrumb               ring->start:  0x008ef000
<7>[  521.765301] missed_breadcrumb               ring->head:   0x000038f8
<7>[  521.765313] missed_breadcrumb               ring->tail:   0x00003948
<7>[  521.765325] missed_breadcrumb               ring->emit:   0x00003950
<7>[  521.765337] missed_breadcrumb               ring->space:  0x00002618
<7>[  521.765372] missed_breadcrumb       RING_START: 0x008ef000
<7>[  521.765389] missed_breadcrumb       RING_HEAD:  0x000038f8
<7>[  521.765404] missed_breadcrumb       RING_TAIL:  0x00003948
<7>[  521.765422] missed_breadcrumb       RING_CTL:   0x00003001
<7>[  521.765438] missed_breadcrumb       RING_MODE:  0x00000000
<7>[  521.765453] missed_breadcrumb       RING_IMR: fffffefe
<7>[  521.765473] missed_breadcrumb       ACTHD:  0x00000000_022039b8
<7>[  521.765492] missed_breadcrumb       BBADDR: 0x00000000_00042004
<7>[  521.765511] missed_breadcrumb       DMA_FADDR: 0x00000000_008f28f8
<7>[  521.765537] missed_breadcrumb       IPEIR: 0x00000000
<7>[  521.765552] missed_breadcrumb       IPEHR: 0x11000011
<7>[  521.765570] missed_breadcrumb       Execlist status: 0x00044032 00000002
<7>[  521.765586] missed_breadcrumb       Execlist CSB read 1 [1 cached], write 2 [2 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  521.765604] missed_breadcrumb       Execlist CSB[2]: 0x00000001 [0x00000001 in hwsp], context: 0 [0 in hwsp]
<7>[  521.765619] missed_breadcrumb               ELSP[0] count=1, rq: e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765632] missed_breadcrumb               ELSP[1] idle
<7>[  521.765645] missed_breadcrumb               HW active? 0x1
<7>[  521.765660] missed_breadcrumb               E e4f [9b82:e4f] prio=0 @ 3767ms: gem_sync[3107]/0
<7>[  521.765670] missed_breadcrumb               Queue priority: -2147483648
<7>[  521.765684] missed_breadcrumb       gem_sync [3112] waiting for e4f
<7>[  521.765697] missed_breadcrumb IRQ? 0x1 (breadcrumbs? yes) (execlists? no)
<7>[  521.765707] missed_breadcrumb HWSP:
<7>[  521.765723] missed_breadcrumb 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765733] missed_breadcrumb *
<7>[  521.765747] missed_breadcrumb 00000040 00000001 00000000 00000018 00000002 00000001 00000000 00000018 00000002
<7>[  521.765760] missed_breadcrumb 00000060 00000001 00000000 00000018 00000002 00000000 00000000 00000000 00000002
<7>[  521.765774] missed_breadcrumb 00000080 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765784] missed_breadcrumb *
<7>[  521.765809] missed_breadcrumb 000000c0 00000e4e 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765823] missed_breadcrumb 000000e0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7>[  521.765833] missed_breadcrumb *
<7>[  521.765845] missed_breadcrumb Idle? no

Of particular note: the IPEHR is MI_LRI, the ring is idle (it hasn't
moved on from the earlier reset), and the fetch address is unconnected
to the rings, so naturally I assume it died loading the context image
on resume.
Plus it is a bsw...

Agreed, this looks like an issue during the ctx restore.
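
For readers following the register dump, here is a small sketch of how
the values above can be decoded by hand. It assumes the usual MI command
header layout (client in bits 31:29, opcode in bits 28:23, and for
MI_LOAD_REGISTER_IMM a length field of 2 * pairs - 1 in the low byte)
and that RING_CTL bits 20:12 hold the ring size in pages minus one. The
sketch_* names are hypothetical, and treating the low dword of ACTHD as
the address to test against the ring is an assumption of this example,
not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MI command header fields (assumed layout, see lead-in). */
#define SKETCH_MI_CLIENT(x)  ((uint32_t)(x) >> 29)
#define SKETCH_MI_OPCODE(x)  (((uint32_t)(x) >> 23) & 0x3fu)
#define SKETCH_MI_LRI_OPCODE 0x22u

/* True if the header dword looks like MI_LOAD_REGISTER_IMM. */
static bool sketch_is_mi_lri(uint32_t header)
{
	return SKETCH_MI_CLIENT(header) == 0 &&
	       SKETCH_MI_OPCODE(header) == SKETCH_MI_LRI_OPCODE;
}

/* Number of register/value pairs: length field is 2 * pairs - 1. */
static unsigned int sketch_lri_pairs(uint32_t header)
{
	assert(sketch_is_mi_lri(header));
	return ((header & 0xffu) + 1) / 2;
}

/* Ring size in bytes: RING_CTL bits 20:12 encode (pages - 1). */
static uint32_t sketch_ring_size(uint32_t ring_ctl)
{
	return (ring_ctl & 0x001ff000u) + 4096u;
}

/* True if addr lies inside the ring described by start and RING_CTL. */
static bool sketch_addr_in_ring(uint32_t addr, uint32_t ring_start,
				uint32_t ring_ctl)
{
	return addr >= ring_start &&
	       addr - ring_start < sketch_ring_size(ring_ctl);
}
```

Under those assumptions, IPEHR 0x11000011 decodes as an MI_LRI carrying
9 register/value pairs, RING_CTL 0x00003001 describes a 16 KiB ring
starting at 0x008ef000, and 0x022039b8 falls outside that range.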

-Chris

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx