On Tue, Jul 29, 2014 at 05:25:48PM +0000, Mcaulay, Alistair wrote: > > drv_suspend, gem_hangcheck_forcewake are working fine now with PPGTT > enabled. Both with and without my patch. > > The results are the same with and without my patch for: > > $ sudo ~/drm_nightly/intel-gpu-tools/tests/drv_hangman > IGT-Version: 1.7-g70e6ed9 (x86_64) (Linux: 3.16.0-rc5+ x86_64) > Subtest error-state-debugfs-entry: SUCCESS > Subtest error-state-sysfs-entry: SUCCESS > Subtest ring-stop-sysfs-entry: SUCCESS > Subtest error-state-basic: SUCCESS > Subtest error-state-capture-render: SUCCESS > Test assertion failure function check_error_state, file drv_hangman.c:303: > Failed assertion: gtt_offset == expected_offset > Subtest error-state-capture-bsd: FAIL > Test assertion failure function check_error_state, file drv_hangman.c:303: > Failed assertion: gtt_offset == expected_offset > Subtest error-state-capture-blt: FAIL > Test assertion failure function check_error_state, file drv_hangman.c:303: > Failed assertion: gtt_offset == expected_offset > Subtest error-state-capture-vebox: FAIL > > $ sudo ~/drm_nightly/intel-gpu-tools/tests/gem_reset_stats > IGT-Version: 1.7-g70e6ed9 (x86_64) (Linux: 3.16.0-rc5+ x86_64) > Subtest params: SUCCESS > Subtest params-ctx-render: SUCCESS > Subtest reset-stats-render: SUCCESS > Subtest reset-stats-ctx-render: SUCCESS > Subtest ban-render: SUCCESS > Subtest ban-ctx-render: SUCCESS > Subtest reset-count-render: SUCCESS > Subtest reset-count-ctx-render: SUCCESS > Subtest unrelated-ctx-render: SUCCESS > Subtest close-pending-render: SUCCESS > Subtest close-pending-ctx-render: SUCCESS > <test now hangs> > > > Both good without PPGTT. > I haven't yet seen a regression with the patch.
gem_ringfill might be able to hit the -EAGAIN issue. But that testcase is missing subtest with signals to interrupt the ioctls. I'll add that. -Daniel > > -----Original Message----- > From: Ben Widawsky [mailto:b...@bwidawsk.net] > Sent: Tuesday, July 29, 2014 1:17 AM > To: Mcaulay, Alistair > Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org > Subject: Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence to match > driver load & thaw > > On Mon, Jul 28, 2014 at 05:12:59PM +0000, Mcaulay, Alistair wrote: > > Hi Ben / Daniel, > > I agree that this needs to be properly tested. Are there any particular igt > > tests you would suggest I use? > > I've been running: > > drv_hangman, drv_suspend, gem_hangcheck_forcewake. > > I thought IGT had added some random reset stuff in the way that we have for > signals. However, it looks like I imagined it. I guess you can add gem_hang, > gem_reset_stats, and tests/debugfs_wedged to that list. Daniel can probably > provide any I might have missed. > > The way that igt quiesces everything these days really hurts the ability to > test multi-process. If every tests starts off with no work, and running the > default context, things are pretty trivial. Similarly, running these tests in > isolation, even if it isn't quiescing doesn't help the situation. The way I > wrote the code originally was through debugging hangs on a desktop as I > developed patches, and not with IGT (though drv_hangman could catch many > issues). I'd definitely recommend trying to invoke hangs on a running > desktop. I'd advise doing this by modifying mesa to submit a crap batch, I > can provide you more details on how to do this if you need it. Also try to > disable the quiescing in IGT and run more than these tests in isolation. > > > > > Also do you have a set of PPGTT Patches that should work with these > > tests. Michel sent me a set of patches to enable PPGTT, but these 3 tests > > fail with the patches. > > I will try to reproduce this on my patch series when I have some time and if > nothing else, that should be a good preparation/refresher for the patch > review anyway. The patches I wrote shouldn't have touched much on these paths > - not sure if Michel changed anything there. > > With patch on top of what Michel sent you, everything passes? > > > > > Thanks, > > Alistair. > > > > -----Original Message----- > > From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of > > Daniel Vetter > > Sent: Monday, July 28, 2014 10:27 AM > > To: Ben Widawsky > > Cc: Mcaulay, Alistair; intel-gfx@lists.freedesktop.org > > Subject: Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence > > to match driver load & thaw > > > > On Fri, Jul 25, 2014 at 06:05:29PM -0700, Ben Widawsky wrote: > > > On Wed, Jul 16, 2014 at 04:05:59PM +0100, alistair.mcau...@intel.com > > > wrote: > > > > From: "McAulay, Alistair" <alistair.mcau...@intel.com> > > > > > > > > This patch is to address Daniels concerns over different code during > > > > reset: > > > > > > > > http://lists.freedesktop.org/archives/intel-gfx/2014-June/047758.h > > > > tm > > > > l > > > > > > > > "The reason for aiming as hard as possible to use the exact same > > > > code for driver load, gpu reset and runtime pm/system resume is > > > > that we've simply seen too many bugs due to slight variations and > > > > unintended omissions." > > > > > > > > Tested using igt drv_hangman. > > > > > > > > Signed-off-by: McAulay, Alistair <alistair.mcau...@intel.com> > > > > > > 2 quick comments before I actually do a real review. > > > 1. Did you actually run this with and without full ppgtt? > > > 2. I don't think this is actually fulfilling what Daniel is > > > requesting, though we can let him comment. > > > > Mostly looks like what I think we need. Comments below. > > > > > 3. Did you reall do #1? > > > > > > Assuming you satisifed #1, can you please post the igt results for > > > the permutations (pre patch w/ and w/o ppgtt; post patch w/ and w/o > > > ppgtt) > > > > > > I really want this data because I spent *a lot* of time with these > > > specific areas in the PPGTT work, and I am somewhat skeptical enough > > > of the code has changed that this will magically work. I also tried > > > the trickiness with the ring handling functions, and never succeeded. > > > Also, with the context stuff, I'm simply not convinced it can > > > magically vanish. If igt looks good, and Daniel agrees that this is > > > what he actually wanted, I will go fishing for corner cases and do the > > > review. > > > > I think the hack in ring_begin might explain why it never worked before. > > But fully agreed, we really need to test this well (and fill gaps if igt > > misses anything around resets - we don't have any systematic gpu reset > > coverage anywhere outside of igt). > > > > > > > > Thanks. > > > > > > > --- > > > > drivers/gpu/drm/i915/i915_gem.c | 2 - > > > > drivers/gpu/drm/i915/i915_gem_context.c | 42 +-------------------- > > > > drivers/gpu/drm/i915/i915_gem_gtt.c | 67 > > > > +++++---------------------------- > > > > drivers/gpu/drm/i915/i915_gem_gtt.h | 3 +- > > > > drivers/gpu/drm/i915/intel_ringbuffer.c | 4 +- > > > > 5 files changed, 14 insertions(+), 104 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c > > > > b/drivers/gpu/drm/i915/i915_gem.c index ef047bc..b38e086 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem.c > > > > @@ -2590,8 +2590,6 @@ void i915_gem_reset(struct drm_device *dev) > > > > for_each_ring(ring, dev_priv, i) > > > > i915_gem_reset_ring_cleanup(dev_priv, ring); > > > > > > > > - i915_gem_context_reset(dev); > > > > - > > > > i915_gem_restore_fences(dev); > > > > } > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c > > > > b/drivers/gpu/drm/i915/i915_gem_context.c > > > > index de72a28..d96219f 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c > > > > @@ -372,42 +372,6 @@ err_destroy: > > > > return ERR_PTR(ret); > > > > } > > > > > > > > -void i915_gem_context_reset(struct drm_device *dev) -{ > > > > - struct drm_i915_private *dev_priv = dev->dev_private; > > > > - int i; > > > > - > > > > - /* Prevent the hardware from restoring the last context (which > > > > hung) on > > > > - * the next switch */ > > > > - for (i = 0; i < I915_NUM_RINGS; i++) { > > > > - struct intel_engine_cs *ring = &dev_priv->ring[i]; > > > > - struct intel_context *dctx = ring->default_context; > > > > - struct intel_context *lctx = ring->last_context; > > > > - > > > > - /* Do a fake switch to the default context */ > > > > - if (lctx == dctx) > > > > - continue; > > > > - > > > > - if (!lctx) > > > > - continue; > > > > - > > > > - if (dctx->legacy_hw_ctx.rcs_state && i == RCS) { > > > > - > > > > WARN_ON(i915_gem_obj_ggtt_pin(dctx->legacy_hw_ctx.rcs_state, > > > > - > > > > get_context_alignment(dev), 0)); > > > > - /* Fake a finish/inactive */ > > > > - > > > > dctx->legacy_hw_ctx.rcs_state->base.write_domain = 0; > > > > - dctx->legacy_hw_ctx.rcs_state->active = 0; > > > > - } > > > > - > > > > - if (lctx->legacy_hw_ctx.rcs_state && i == RCS) > > > > - > > > > i915_gem_object_ggtt_unpin(lctx->legacy_hw_ctx.rcs_state); > > > > - > > > > - i915_gem_context_unreference(lctx); > > > > - i915_gem_context_reference(dctx); > > > > - ring->last_context = dctx; > > > > - } > > > > -} > > > > I don't understand why we no longer need this - after reset we probably > > have the default context loaded (if we resue the driver load sequence > > exactly), so I expect that we must reset the software tracking accordingly. > > > > > > - > > > > int i915_gem_context_init(struct drm_device *dev) { > > > > struct drm_i915_private *dev_priv = dev->dev_private; @@ -498,10 > > > > +462,6 @@ int i915_gem_context_enable(struct drm_i915_private > > > > +*dev_priv) > > > > ppgtt->enable(ppgtt); > > > > } > > > > > > > > - /* FIXME: We should make this work, even in reset */ > > > > - if (i915_reset_in_progress(&dev_priv->gpu_error)) > > > > - return 0; > > > > - > > > > BUG_ON(!dev_priv->ring[RCS].default_context); > > > > > > > > for_each_ring(ring, dev_priv, i) { @@ -645,7 +605,7 @@ static > > > > int do_switch(struct intel_engine_cs *ring, > > > > from = ring->last_context; > > > > > > > > if (USES_FULL_PPGTT(ring->dev)) { > > > > - ret = ppgtt->switch_mm(ppgtt, ring, false); > > > > + ret = ppgtt->switch_mm(ppgtt, ring); > > > > if (ret) > > > > goto unpin_out; > > > > } > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c > > > > b/drivers/gpu/drm/i915/i915_gem_gtt.c > > > > index 5188936..450c8a9 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c > > > > @@ -216,19 +216,12 @@ static gen6_gtt_pte_t > > > > iris_pte_encode(dma_addr_t addr, > > > > > > > > /* Broadwell Page Directory Pointer Descriptors */ static int > > > > gen8_write_pdp(struct intel_engine_cs *ring, unsigned entry, > > > > - uint64_t val, bool synchronous) > > > > + uint64_t val) > > > > { > > > > - struct drm_i915_private *dev_priv = ring->dev->dev_private; > > > > int ret; > > > > > > > > BUG_ON(entry >= 4); > > > > > > > > - if (synchronous) { > > > > - I915_WRITE(GEN8_RING_PDP_UDW(ring, entry), val >> 32); > > > > - I915_WRITE(GEN8_RING_PDP_LDW(ring, entry), (u32)val); > > > > - return 0; > > > > - } > > > > - > > > > ret = intel_ring_begin(ring, 6); > > > > if (ret) > > > > return ret; > > > > @@ -245,8 +238,7 @@ static int gen8_write_pdp(struct > > > > intel_engine_cs *ring, unsigned entry, } > > > > > > > > static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt, > > > > - struct intel_engine_cs *ring, > > > > - bool synchronous) > > > > + struct intel_engine_cs *ring) > > > > { > > > > int i, ret; > > > > > > > > @@ -255,7 +247,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt > > > > *ppgtt, > > > > > > > > for (i = used_pd - 1; i >= 0; i--) { > > > > dma_addr_t addr = ppgtt->pd_dma_addr[i]; > > > > - ret = gen8_write_pdp(ring, i, addr, synchronous); > > > > + ret = gen8_write_pdp(ring, i, addr); > > > > if (ret) > > > > return ret; > > > > } > > > > @@ -724,29 +716,10 @@ static uint32_t get_pd_offset(struct > > > > i915_hw_ppgtt *ppgtt) } > > > > > > > > static int hsw_mm_switch(struct i915_hw_ppgtt *ppgtt, > > > > - struct intel_engine_cs *ring, > > > > - bool synchronous) > > > > + struct intel_engine_cs *ring) > > > > { > > > > - struct drm_device *dev = ppgtt->base.dev; > > > > - struct drm_i915_private *dev_priv = dev->dev_private; > > > > int ret; > > > > > > > > - /* If we're in reset, we can assume the GPU is sufficiently > > > > idle to > > > > - * manually frob these bits. Ideally we could use the ring > > > > functions, > > > > - * except our error handling makes it quite difficult (can't use > > > > - * intel_ring_begin, ring->flush, or intel_ring_advance) > > > > - * > > > > - * FIXME: We should try not to special case reset > > > > - */ > > > > - if (synchronous || > > > > - i915_reset_in_progress(&dev_priv->gpu_error)) { > > > > - WARN_ON(ppgtt != dev_priv->mm.aliasing_ppgtt); > > > > - I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G); > > > > - I915_WRITE(RING_PP_DIR_BASE(ring), > > > > get_pd_offset(ppgtt)); > > > > - POSTING_READ(RING_PP_DIR_BASE(ring)); > > > > - return 0; > > > > - } > > > > - > > > > /* NB: TLBs must be flushed and invalidated before a switch */ > > > > ret = ring->flush(ring, I915_GEM_GPU_DOMAINS, > > > > I915_GEM_GPU_DOMAINS); > > > > if (ret) > > > > @@ -768,29 +741,10 @@ static int hsw_mm_switch(struct > > > > i915_hw_ppgtt *ppgtt, } > > > > > > > > static int gen7_mm_switch(struct i915_hw_ppgtt *ppgtt, > > > > - struct intel_engine_cs *ring, > > > > - bool synchronous) > > > > + struct intel_engine_cs *ring) > > > > { > > > > - struct drm_device *dev = ppgtt->base.dev; > > > > - struct drm_i915_private *dev_priv = dev->dev_private; > > > > int ret; > > > > > > > > - /* If we're in reset, we can assume the GPU is sufficiently > > > > idle to > > > > - * manually frob these bits. Ideally we could use the ring > > > > functions, > > > > - * except our error handling makes it quite difficult (can't use > > > > - * intel_ring_begin, ring->flush, or intel_ring_advance) > > > > - * > > > > - * FIXME: We should try not to special case reset > > > > - */ > > > > - if (synchronous || > > > > - i915_reset_in_progress(&dev_priv->gpu_error)) { > > > > - WARN_ON(ppgtt != dev_priv->mm.aliasing_ppgtt); > > > > - I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G); > > > > - I915_WRITE(RING_PP_DIR_BASE(ring), > > > > get_pd_offset(ppgtt)); > > > > - POSTING_READ(RING_PP_DIR_BASE(ring)); > > > > - return 0; > > > > - } > > > > - > > > > /* NB: TLBs must be flushed and invalidated before a switch */ > > > > ret = ring->flush(ring, I915_GEM_GPU_DOMAINS, > > > > I915_GEM_GPU_DOMAINS); > > > > if (ret) > > > > @@ -819,14 +773,11 @@ static int gen7_mm_switch(struct > > > > i915_hw_ppgtt *ppgtt, } > > > > > > > > static int gen6_mm_switch(struct i915_hw_ppgtt *ppgtt, > > > > - struct intel_engine_cs *ring, > > > > - bool synchronous) > > > > + struct intel_engine_cs *ring) > > > > { > > > > struct drm_device *dev = ppgtt->base.dev; > > > > struct drm_i915_private *dev_priv = dev->dev_private; > > > > > > > > - if (!synchronous) > > > > - return 0; > > > > > > > > I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G); > > > > I915_WRITE(RING_PP_DIR_BASE(ring), get_pd_offset(ppgtt)); @@ > > > > -852,7 +803,7 @@ static int gen8_ppgtt_enable(struct i915_hw_ppgtt > > > > *ppgtt) > > > > if (USES_FULL_PPGTT(dev)) > > > > continue; > > > > > > > > - ret = ppgtt->switch_mm(ppgtt, ring, true); > > > > + ret = ppgtt->switch_mm(ppgtt, ring); > > > > if (ret) > > > > goto err_out; > > > > } > > > > @@ -897,7 +848,7 @@ static int gen7_ppgtt_enable(struct i915_hw_ppgtt > > > > *ppgtt) > > > > if (USES_FULL_PPGTT(dev)) > > > > continue; > > > > > > > > - ret = ppgtt->switch_mm(ppgtt, ring, true); > > > > + ret = ppgtt->switch_mm(ppgtt, ring); > > > > if (ret) > > > > return ret; > > > > } > > > > @@ -926,7 +877,7 @@ static int gen6_ppgtt_enable(struct i915_hw_ppgtt > > > > *ppgtt) > > > > I915_WRITE(GFX_MODE, _MASKED_BIT_ENABLE(GFX_PPGTT_ENABLE)); > > > > > > > > for_each_ring(ring, dev_priv, i) { > > > > - int ret = ppgtt->switch_mm(ppgtt, ring, true); > > > > + int ret = ppgtt->switch_mm(ppgtt, ring); > > > > if (ret) > > > > return ret; > > > > } > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h > > > > b/drivers/gpu/drm/i915/i915_gem_gtt.h > > > > index 8d6f7c1..bf1e4fc 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.h > > > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h > > > > @@ -262,8 +262,7 @@ struct i915_hw_ppgtt { > > > > > > > > int (*enable)(struct i915_hw_ppgtt *ppgtt); > > > > int (*switch_mm)(struct i915_hw_ppgtt *ppgtt, > > > > - struct intel_engine_cs *ring, > > > > - bool synchronous); > > > > + struct intel_engine_cs *ring); > > > > void (*debug_dump)(struct i915_hw_ppgtt *ppgtt, struct seq_file > > > > *m); }; > > > > > > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c > > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c > > > > index 599709e..e33c2e1 100644 > > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c > > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c > > > > @@ -1832,7 +1832,9 @@ int intel_ring_begin(struct intel_engine_cs > > > > *ring, > > > > > > > > ret = i915_gem_check_wedge(&dev_priv->gpu_error, > > > > dev_priv->mm.interruptible); > > > > - if (ret) > > > > + > > > > + /* -EAGAIN means a reset is in progress, it is Ok to continue */ > > > > + if (ret && (ret != -EAGAIN)) > > > > return ret; > > > > Oh, I guess that's the tricky bit why the old approach never worked - > > because reset_in_progress is set we failed the context/ppgtt loading > > through the rings and screwed up. > > > > Problem with your approach is that we want to bail out here if a reset is > > in progress, so we can't just eat the EAGAIN. If we do that we potentially > > deadlock or overflow the ring. > > > > I think we need a different hack here, and a few layers down (i.e. at the > > place where we actually generate that offending -EAGAIN). > > > > - Around the re-init sequence in the reset function we set > > dev_priv->mm.reload_in_reset or similar. Since we hold dev->struct_mutex > > no one will see that, as long as we never leak it out of the critical > > section. > > > > - In the ring_begin code that checks for gpu hangs we ignore > > reset_in_progress if this bit is set. > > > > - Both places need fairly big comments to explain what exactly is going > > on. > > > > Thanks, Daniel > > > > > > > > > > ret = __intel_ring_prepare(ring, num_dwords * sizeof(uint32_t)); > > > > -- > > > > 2.0.0 > > > > > > > > _______________________________________________ > > > > Intel-gfx mailing list > > > > Intel-gfx@lists.freedesktop.org > > > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx > > > > > > -- > > > Ben Widawsky, Intel Open Source Technology Center > > > _______________________________________________ > > > Intel-gfx mailing list > > > Intel-gfx@lists.freedesktop.org > > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx > > > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > +41 (0) 79 365 57 48 - http://blog.ffwll.ch > > -- > Ben Widawsky, Intel Open Source Technology Center -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch _______________________________________________ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx