On Tue, Jul 29, 2014 at 05:25:48PM +0000, Mcaulay, Alistair wrote:
> 
> drv_suspend,  gem_hangcheck_forcewake are working fine now with PPGTT 
> enabled. Both with and without my patch.
> 
> The results are the same with and without my patch for:
> 
> $ sudo ~/drm_nightly/intel-gpu-tools/tests/drv_hangman 
> IGT-Version: 1.7-g70e6ed9 (x86_64) (Linux: 3.16.0-rc5+ x86_64)
> Subtest error-state-debugfs-entry: SUCCESS
> Subtest error-state-sysfs-entry: SUCCESS
> Subtest ring-stop-sysfs-entry: SUCCESS
> Subtest error-state-basic: SUCCESS
> Subtest error-state-capture-render: SUCCESS
> Test assertion failure function check_error_state, file drv_hangman.c:303:
> Failed assertion: gtt_offset == expected_offset
> Subtest error-state-capture-bsd: FAIL
> Test assertion failure function check_error_state, file drv_hangman.c:303:
> Failed assertion: gtt_offset == expected_offset
> Subtest error-state-capture-blt: FAIL
> Test assertion failure function check_error_state, file drv_hangman.c:303:
> Failed assertion: gtt_offset == expected_offset
> Subtest error-state-capture-vebox: FAIL
> 
> $ sudo ~/drm_nightly/intel-gpu-tools/tests/gem_reset_stats
> IGT-Version: 1.7-g70e6ed9 (x86_64) (Linux: 3.16.0-rc5+ x86_64)
> Subtest params: SUCCESS
> Subtest params-ctx-render: SUCCESS
> Subtest reset-stats-render: SUCCESS
> Subtest reset-stats-ctx-render: SUCCESS
> Subtest ban-render: SUCCESS
> Subtest ban-ctx-render: SUCCESS
> Subtest reset-count-render: SUCCESS
> Subtest reset-count-ctx-render: SUCCESS
> Subtest unrelated-ctx-render: SUCCESS
> Subtest close-pending-render: SUCCESS
> Subtest close-pending-ctx-render: SUCCESS
> <test now hangs>
> 
> 
> Both good without PPGTT.
> I haven't yet seen a regression with the patch.

gem_ringfill might be able to hit the -EAGAIN issue. But that testcase is
missing subtest with signals to interrupt the ioctls. I'll add that.
-Daniel

> 
> -----Original Message-----
> From: Ben Widawsky [mailto:b...@bwidawsk.net] 
> Sent: Tuesday, July 29, 2014 1:17 AM
> To: Mcaulay, Alistair
> Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence to match 
> driver load & thaw
> 
> On Mon, Jul 28, 2014 at 05:12:59PM +0000, Mcaulay, Alistair wrote:
> > Hi Ben / Daniel,
> > I agree that this needs to be properly tested. Are there any particular igt 
> > tests you would suggest I use?
> > I've been running:
> > drv_hangman, drv_suspend, gem_hangcheck_forcewake.
> 
> I thought IGT had added some random reset stuff in the way that we have for 
> signals. However, it looks like I imagined it. I guess you can add gem_hang, 
> gem_reset_stats, and tests/debugfs_wedged to that list. Daniel can probably 
> provide any I might have missed.
> 
> The way that igt quiesces everything these days really hurts the ability to 
> test multi-process. If every tests starts off with no work, and running the 
> default context, things are pretty trivial. Similarly, running these tests in 
> isolation, even if it isn't quiescing doesn't help the situation. The way I 
> wrote the code originally was through debugging hangs on a desktop as I 
> developed patches, and not with IGT (though drv_hangman could catch many 
> issues). I'd definitely recommend trying to invoke hangs on a running 
> desktop. I'd advise doing this by modifying mesa to submit a crap batch, I 
> can provide you more details on how to do this if you need it. Also try to 
> disable the quiescing in IGT and run more than these tests in isolation.
> 
> > 
> > Also do you have a set of PPGTT Patches that should work with these 
> > tests. Michel sent me a set of patches to enable PPGTT, but these 3 tests 
> > fail with the patches.
> 
> I will try to reproduce this on my patch series when I have some time and if 
> nothing else, that should be a good preparation/refresher for the patch 
> review anyway. The patches I wrote shouldn't have touched much on these paths 
> - not sure if Michel changed anything there.
> 
> With patch on top of what Michel sent you, everything passes?
> 
> > 
> > Thanks,
> > Alistair.
> > 
> > -----Original Message-----
> > From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch] On Behalf Of 
> > Daniel Vetter
> > Sent: Monday, July 28, 2014 10:27 AM
> > To: Ben Widawsky
> > Cc: Mcaulay, Alistair; intel-gfx@lists.freedesktop.org
> > Subject: Re: [Intel-gfx] [PATCH] drm/i915: Rework GPU reset sequence 
> > to match driver load & thaw
> > 
> > On Fri, Jul 25, 2014 at 06:05:29PM -0700, Ben Widawsky wrote:
> > > On Wed, Jul 16, 2014 at 04:05:59PM +0100, alistair.mcau...@intel.com 
> > > wrote:
> > > > From: "McAulay, Alistair" <alistair.mcau...@intel.com>
> > > > 
> > > > This patch is to address Daniels concerns over different code during 
> > > > reset:
> > > > 
> > > > http://lists.freedesktop.org/archives/intel-gfx/2014-June/047758.h
> > > > tm
> > > > l
> > > > 
> > > > "The reason for aiming as hard as possible to use the exact same 
> > > > code for driver load, gpu reset and runtime pm/system resume is 
> > > > that we've simply seen too many bugs due to slight variations and 
> > > > unintended omissions."
> > > > 
> > > > Tested using igt drv_hangman.
> > > > 
> > > > Signed-off-by: McAulay, Alistair <alistair.mcau...@intel.com>
> > > 
> > > 2 quick comments before I actually do a real review.
> > > 1. Did you actually run this with and without full ppgtt?
> > > 2. I don't think this is actually fulfilling what Daniel is 
> > > requesting, though we can let him comment.
> > 
> > Mostly looks like what I think we need. Comments below.
> > 
> > > 3. Did you reall do #1?
> > > 
> > > Assuming you satisifed #1, can you please post the igt results for 
> > > the permutations (pre patch w/ and w/o ppgtt; post patch w/ and w/o 
> > > ppgtt)
> > > 
> > > I really want this data because I spent *a lot* of time with these 
> > > specific areas in the PPGTT work, and I am somewhat skeptical enough 
> > > of the code has changed that this will magically work. I also tried 
> > > the trickiness with the ring handling functions, and never succeeded.
> > > Also, with the context stuff, I'm simply not convinced it can 
> > > magically vanish. If igt looks good, and Daniel agrees that this is 
> > > what he actually wanted, I will go fishing for corner cases and do the 
> > > review.
> > 
> > I think the hack in ring_begin might explain why it never worked before.
> > But fully agreed, we really need to test this well (and fill gaps if igt 
> > misses anything around resets - we don't have any systematic gpu reset 
> > coverage anywhere outside of igt).
> > 
> > > 
> > > Thanks.
> > > 
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gem.c         |  2 -
> > > >  drivers/gpu/drm/i915/i915_gem_context.c | 42 +--------------------
> > > >  drivers/gpu/drm/i915/i915_gem_gtt.c     | 67 
> > > > +++++----------------------------
> > > >  drivers/gpu/drm/i915/i915_gem_gtt.h     |  3 +-
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.c |  4 +-
> > > >  5 files changed, 14 insertions(+), 104 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem.c 
> > > > b/drivers/gpu/drm/i915/i915_gem.c index ef047bc..b38e086 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > > @@ -2590,8 +2590,6 @@ void i915_gem_reset(struct drm_device *dev)
> > > >         for_each_ring(ring, dev_priv, i)
> > > >                 i915_gem_reset_ring_cleanup(dev_priv, ring);
> > > >  
> > > > -       i915_gem_context_reset(dev);
> > > > -
> > > >         i915_gem_restore_fences(dev);
> > > >  }
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
> > > > b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > index de72a28..d96219f 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > @@ -372,42 +372,6 @@ err_destroy:
> > > >         return ERR_PTR(ret);
> > > >  }
> > > >  
> > > > -void i915_gem_context_reset(struct drm_device *dev) -{
> > > > -       struct drm_i915_private *dev_priv = dev->dev_private;
> > > > -       int i;
> > > > -
> > > > -       /* Prevent the hardware from restoring the last context (which 
> > > > hung) on
> > > > -        * the next switch */
> > > > -       for (i = 0; i < I915_NUM_RINGS; i++) {
> > > > -               struct intel_engine_cs *ring = &dev_priv->ring[i];
> > > > -               struct intel_context *dctx = ring->default_context;
> > > > -               struct intel_context *lctx = ring->last_context;
> > > > -
> > > > -               /* Do a fake switch to the default context */
> > > > -               if (lctx == dctx)
> > > > -                       continue;
> > > > -
> > > > -               if (!lctx)
> > > > -                       continue;
> > > > -
> > > > -               if (dctx->legacy_hw_ctx.rcs_state && i == RCS) {
> > > > -                       
> > > > WARN_ON(i915_gem_obj_ggtt_pin(dctx->legacy_hw_ctx.rcs_state,
> > > > -                                                     
> > > > get_context_alignment(dev), 0));
> > > > -                       /* Fake a finish/inactive */
> > > > -                       
> > > > dctx->legacy_hw_ctx.rcs_state->base.write_domain = 0;
> > > > -                       dctx->legacy_hw_ctx.rcs_state->active = 0;
> > > > -               }
> > > > -
> > > > -               if (lctx->legacy_hw_ctx.rcs_state && i == RCS)
> > > > -                       
> > > > i915_gem_object_ggtt_unpin(lctx->legacy_hw_ctx.rcs_state);
> > > > -
> > > > -               i915_gem_context_unreference(lctx);
> > > > -               i915_gem_context_reference(dctx);
> > > > -               ring->last_context = dctx;
> > > > -       }
> > > > -}
> > 
> > I don't understand why we no longer need this - after reset we probably 
> > have the default context loaded (if we resue the driver load sequence 
> > exactly), so I expect that we must reset the software tracking accordingly.
> > 
> > > > -
> > > >  int i915_gem_context_init(struct drm_device *dev)  {
> > > >         struct drm_i915_private *dev_priv = dev->dev_private; @@ -498,10
> > > > +462,6 @@ int i915_gem_context_enable(struct drm_i915_private 
> > > > +*dev_priv)
> > > >                 ppgtt->enable(ppgtt);
> > > >         }
> > > >  
> > > > -       /* FIXME: We should make this work, even in reset */
> > > > -       if (i915_reset_in_progress(&dev_priv->gpu_error))
> > > > -               return 0;
> > > > -
> > > >         BUG_ON(!dev_priv->ring[RCS].default_context);
> > > >  
> > > >         for_each_ring(ring, dev_priv, i) { @@ -645,7 +605,7 @@ static 
> > > > int do_switch(struct intel_engine_cs *ring,
> > > >         from = ring->last_context;
> > > >  
> > > >         if (USES_FULL_PPGTT(ring->dev)) {
> > > > -               ret = ppgtt->switch_mm(ppgtt, ring, false);
> > > > +               ret = ppgtt->switch_mm(ppgtt, ring);
> > > >                 if (ret)
> > > >                         goto unpin_out;
> > > >         }
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > index 5188936..450c8a9 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
> > > > @@ -216,19 +216,12 @@ static gen6_gtt_pte_t 
> > > > iris_pte_encode(dma_addr_t addr,
> > > >  
> > > >  /* Broadwell Page Directory Pointer Descriptors */  static int 
> > > > gen8_write_pdp(struct intel_engine_cs *ring, unsigned entry,
> > > > -                          uint64_t val, bool synchronous)
> > > > +                          uint64_t val)
> > > >  {
> > > > -       struct drm_i915_private *dev_priv = ring->dev->dev_private;
> > > >         int ret;
> > > >  
> > > >         BUG_ON(entry >= 4);
> > > >  
> > > > -       if (synchronous) {
> > > > -               I915_WRITE(GEN8_RING_PDP_UDW(ring, entry), val >> 32);
> > > > -               I915_WRITE(GEN8_RING_PDP_LDW(ring, entry), (u32)val);
> > > > -               return 0;
> > > > -       }
> > > > -
> > > >         ret = intel_ring_begin(ring, 6);
> > > >         if (ret)
> > > >                 return ret;
> > > > @@ -245,8 +238,7 @@ static int gen8_write_pdp(struct 
> > > > intel_engine_cs *ring, unsigned entry,  }
> > > >  
> > > >  static int gen8_mm_switch(struct i915_hw_ppgtt *ppgtt,
> > > > -                         struct intel_engine_cs *ring,
> > > > -                         bool synchronous)
> > > > +                         struct intel_engine_cs *ring)
> > > >  {
> > > >         int i, ret;
> > > >  
> > > > @@ -255,7 +247,7 @@ static int gen8_mm_switch(struct i915_hw_ppgtt 
> > > > *ppgtt,
> > > >  
> > > >         for (i = used_pd - 1; i >= 0; i--) {
> > > >                 dma_addr_t addr = ppgtt->pd_dma_addr[i];
> > > > -               ret = gen8_write_pdp(ring, i, addr, synchronous);
> > > > +               ret = gen8_write_pdp(ring, i, addr);
> > > >                 if (ret)
> > > >                         return ret;
> > > >         }
> > > > @@ -724,29 +716,10 @@ static uint32_t get_pd_offset(struct 
> > > > i915_hw_ppgtt *ppgtt)  }
> > > >  
> > > >  static int hsw_mm_switch(struct i915_hw_ppgtt *ppgtt,
> > > > -                        struct intel_engine_cs *ring,
> > > > -                        bool synchronous)
> > > > +                        struct intel_engine_cs *ring)
> > > >  {
> > > > -       struct drm_device *dev = ppgtt->base.dev;
> > > > -       struct drm_i915_private *dev_priv = dev->dev_private;
> > > >         int ret;
> > > >  
> > > > -       /* If we're in reset, we can assume the GPU is sufficiently 
> > > > idle to
> > > > -        * manually frob these bits. Ideally we could use the ring 
> > > > functions,
> > > > -        * except our error handling makes it quite difficult (can't use
> > > > -        * intel_ring_begin, ring->flush, or intel_ring_advance)
> > > > -        *
> > > > -        * FIXME: We should try not to special case reset
> > > > -        */
> > > > -       if (synchronous ||
> > > > -           i915_reset_in_progress(&dev_priv->gpu_error)) {
> > > > -               WARN_ON(ppgtt != dev_priv->mm.aliasing_ppgtt);
> > > > -               I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G);
> > > > -               I915_WRITE(RING_PP_DIR_BASE(ring), 
> > > > get_pd_offset(ppgtt));
> > > > -               POSTING_READ(RING_PP_DIR_BASE(ring));
> > > > -               return 0;
> > > > -       }
> > > > -
> > > >         /* NB: TLBs must be flushed and invalidated before a switch */
> > > >         ret = ring->flush(ring, I915_GEM_GPU_DOMAINS, 
> > > > I915_GEM_GPU_DOMAINS);
> > > >         if (ret)
> > > > @@ -768,29 +741,10 @@ static int hsw_mm_switch(struct 
> > > > i915_hw_ppgtt *ppgtt,  }
> > > >  
> > > >  static int gen7_mm_switch(struct i915_hw_ppgtt *ppgtt,
> > > > -                         struct intel_engine_cs *ring,
> > > > -                         bool synchronous)
> > > > +                         struct intel_engine_cs *ring)
> > > >  {
> > > > -       struct drm_device *dev = ppgtt->base.dev;
> > > > -       struct drm_i915_private *dev_priv = dev->dev_private;
> > > >         int ret;
> > > >  
> > > > -       /* If we're in reset, we can assume the GPU is sufficiently 
> > > > idle to
> > > > -        * manually frob these bits. Ideally we could use the ring 
> > > > functions,
> > > > -        * except our error handling makes it quite difficult (can't use
> > > > -        * intel_ring_begin, ring->flush, or intel_ring_advance)
> > > > -        *
> > > > -        * FIXME: We should try not to special case reset
> > > > -        */
> > > > -       if (synchronous ||
> > > > -           i915_reset_in_progress(&dev_priv->gpu_error)) {
> > > > -               WARN_ON(ppgtt != dev_priv->mm.aliasing_ppgtt);
> > > > -               I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G);
> > > > -               I915_WRITE(RING_PP_DIR_BASE(ring), 
> > > > get_pd_offset(ppgtt));
> > > > -               POSTING_READ(RING_PP_DIR_BASE(ring));
> > > > -               return 0;
> > > > -       }
> > > > -
> > > >         /* NB: TLBs must be flushed and invalidated before a switch */
> > > >         ret = ring->flush(ring, I915_GEM_GPU_DOMAINS, 
> > > > I915_GEM_GPU_DOMAINS);
> > > >         if (ret)
> > > > @@ -819,14 +773,11 @@ static int gen7_mm_switch(struct 
> > > > i915_hw_ppgtt *ppgtt,  }
> > > >  
> > > >  static int gen6_mm_switch(struct i915_hw_ppgtt *ppgtt,
> > > > -                         struct intel_engine_cs *ring,
> > > > -                         bool synchronous)
> > > > +                         struct intel_engine_cs *ring)
> > > >  {
> > > >         struct drm_device *dev = ppgtt->base.dev;
> > > >         struct drm_i915_private *dev_priv = dev->dev_private;
> > > >  
> > > > -       if (!synchronous)
> > > > -               return 0;
> > > >  
> > > >         I915_WRITE(RING_PP_DIR_DCLV(ring), PP_DIR_DCLV_2G);
> > > >         I915_WRITE(RING_PP_DIR_BASE(ring), get_pd_offset(ppgtt)); @@
> > > > -852,7 +803,7 @@ static int gen8_ppgtt_enable(struct i915_hw_ppgtt 
> > > > *ppgtt)
> > > >                 if (USES_FULL_PPGTT(dev))
> > > >                         continue;
> > > >  
> > > > -               ret = ppgtt->switch_mm(ppgtt, ring, true);
> > > > +               ret = ppgtt->switch_mm(ppgtt, ring);
> > > >                 if (ret)
> > > >                         goto err_out;
> > > >         }
> > > > @@ -897,7 +848,7 @@ static int gen7_ppgtt_enable(struct i915_hw_ppgtt 
> > > > *ppgtt)
> > > >                 if (USES_FULL_PPGTT(dev))
> > > >                         continue;
> > > >  
> > > > -               ret = ppgtt->switch_mm(ppgtt, ring, true);
> > > > +               ret = ppgtt->switch_mm(ppgtt, ring);
> > > >                 if (ret)
> > > >                         return ret;
> > > >         }
> > > > @@ -926,7 +877,7 @@ static int gen6_ppgtt_enable(struct i915_hw_ppgtt 
> > > > *ppgtt)
> > > >         I915_WRITE(GFX_MODE, _MASKED_BIT_ENABLE(GFX_PPGTT_ENABLE));
> > > >  
> > > >         for_each_ring(ring, dev_priv, i) {
> > > > -               int ret = ppgtt->switch_mm(ppgtt, ring, true);
> > > > +               int ret = ppgtt->switch_mm(ppgtt, ring);
> > > >                 if (ret)
> > > >                         return ret;
> > > >         }
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > index 8d6f7c1..bf1e4fc 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.h
> > > > @@ -262,8 +262,7 @@ struct i915_hw_ppgtt {
> > > >  
> > > >         int (*enable)(struct i915_hw_ppgtt *ppgtt);
> > > >         int (*switch_mm)(struct i915_hw_ppgtt *ppgtt,
> > > > -                        struct intel_engine_cs *ring,
> > > > -                        bool synchronous);
> > > > +                        struct intel_engine_cs *ring);
> > > >         void (*debug_dump)(struct i915_hw_ppgtt *ppgtt, struct seq_file 
> > > > *m);  };
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > index 599709e..e33c2e1 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > @@ -1832,7 +1832,9 @@ int intel_ring_begin(struct intel_engine_cs 
> > > > *ring,
> > > >  
> > > >         ret = i915_gem_check_wedge(&dev_priv->gpu_error,
> > > >                                    dev_priv->mm.interruptible);
> > > > -       if (ret)
> > > > +
> > > > +       /* -EAGAIN means a reset is in progress, it is Ok to continue */
> > > > +       if (ret && (ret != -EAGAIN))
> > > >                 return ret;
> > 
> > Oh, I guess that's the tricky bit why the old approach never worked - 
> > because reset_in_progress is set we failed the context/ppgtt loading 
> > through the rings and screwed up.
> > 
> > Problem with your approach is that we want to bail out here if a reset is 
> > in progress, so we can't just eat the EAGAIN. If we do that we potentially 
> > deadlock or overflow the ring.
> > 
> > I think we need a different hack here, and a few layers down (i.e. at the 
> > place where we actually generate that offending -EAGAIN).
> > 
> > - Around the re-init sequence in the reset function we set
> >   dev_priv->mm.reload_in_reset or similar. Since we hold dev->struct_mutex
> >   no one will see that, as long as we never leak it out of the critical
> >   section.
> > 
> > - In the ring_begin code that checks for gpu hangs we ignore
> >   reset_in_progress if this bit is set.
> > 
> > - Both places need fairly big comments to explain what exactly is going
> >   on.
> > 
> > Thanks, Daniel
> > 
> > > >  
> > > >         ret = __intel_ring_prepare(ring, num_dwords * sizeof(uint32_t));
> > > > --
> > > > 2.0.0
> > > > 
> > > > _______________________________________________
> > > > Intel-gfx mailing list
> > > > Intel-gfx@lists.freedesktop.org
> > > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > 
> > > --
> > > Ben Widawsky, Intel Open Source Technology Center 
> > > _______________________________________________
> > > Intel-gfx mailing list
> > > Intel-gfx@lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > 
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> 
> --
> Ben Widawsky, Intel Open Source Technology Center

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

Reply via email to