On Sun, Sep 08, 2013 at 09:57:13PM +0200, Daniel Vetter wrote:
> My g33 here seems to be shockingly good at hitting them all. This time
> around kms_flip/flip-vs-panning-vs-hang blows up:
> 
> intel_crtc_wait_for_pending_flips correctly checks for gpu hangs and
> if a gpu hang is pending aborts the wait for outstanding flips so that
> the setcrtc call will succeed and release the crtc mutex. And the gpu
> hang handler needs that lock in intel_display_handle_reset to be able
> to complete outstanding flips.
> 
> The problem is that we can race in two ways:
> - Waiters on the dev_priv->pending_flip_queue aren't woken up after
>   we've the reset as pending, but before we actually start the reset
>   work. This means that the waiter doesn't notice the pending reset
>   and hence will keep on hogging the locks.
> 
>   Like with dev->struct_mutex and the ring->irq_queue wait queues we
>   there need to wake up everyone that potentially holds a lock which
>   the reset handler needs.
> 
> - intel_display_handle_reset was called _after_ we've already
>   signalled the completion of the reset work. Which means a waiter
>   could sneak in, grab the lock and never release it (since the
>   pageflips won't ever get released).
> 
>   Similar to resetting the gem state all the reset work must complete
>   before we update the reset counter. Contrary to the gem reset we
>   don't need to have a second explicit wake up call since that will
>   have happened already when completing the pageflips. We also don't
>   have any issues that the completion happens while the reset state is
>   still pending - wait_for_pending_flips is only there to ensure we
>   display the right frame. After a gpu hang&reset events such
>   guarantees are out the window anyway. This is in contrast to the gem
>   code where too-early wake-up would result in unnecessary restarting
>   of ioctls.
> 
> Also, since we've gotten these various deadlocks and ordering
> constraints wrong so often throw copious amounts of comments at the
> code.
> 
> This deadlock regression has been introduced in the commit which added
> the pageflip reset logic to the gpu hang work:
> 
> commit 96a02917a0131e52efefde49c2784c0421d6c439
> Author: Ville Syrjälä <ville.syrj...@linux.intel.com>
> Date:   Mon Feb 18 19:08:49 2013 +0200
> 
>     drm/i915: Finish page flips and update primary planes after a GPU reset
> 
> v2:
> - Add comments to explain how the wake_up serves as memory barriers
>   for the atomic_t reset counter.
> - Improve the comments a bit as suggested by Chris Wilson.
> - Extract the wake_up calls before/after the reset into a little
>   i915_error_wake_up and unconditionally wake up the
>   pending_flip_queue waiters, again as suggested by Chris Wilson.
> 
> v3: Throw copious amounts of comments at i915_error_wake_up as
> suggested by Chris Wilson.
> 
> Cc: sta...@vger.kernel.org
> Cc: Ville Syrjälä <ville.syrj...@linux.intel.com>
> Cc: Chris Wilson <ch...@chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <ch...@chris-wilson.co.uk>
> Signed-off-by: Daniel Vetter <daniel.vet...@ffwll.ch>

Ok, smashed onto -fixes. Thanks a lot for your critical review.
-Daniel

> ---
>  drivers/gpu/drm/i915/i915_irq.c | 68 
> ++++++++++++++++++++++++++++++++---------
>  1 file changed, 54 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 83cce0c..4b91228 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1469,6 +1469,34 @@ static irqreturn_t ironlake_irq_handler(int irq, void 
> *arg)
>       return ret;
>  }
>  
> +static void i915_error_wake_up(struct drm_i915_private *dev_priv,
> +                            bool reset_completed)
> +{
> +     struct intel_ring_buffer *ring;
> +     int i;
> +
> +     /*
> +      * Notify all waiters for GPU completion events that reset state has
> +      * been changed, and that they need to restart their wait after
> +      * checking for potential errors (and bail out to drop locks if there is
> +      * a gpu reset pending so that i915_error_work_func can acquire them).
> +      */
> +
> +     /* Wake up __wait_seqno, potentially holding dev->struct_mutex. */
> +     for_each_ring(ring, dev_priv, i)
> +             wake_up_all(&ring->irq_queue);
> +
> +     /* Wake up intel_crtc_wait_for_pending_flips, holding crtc->mutex. */
> +     wake_up_all(&dev_priv->pending_flip_queue);
> +
> +     /*
> +      * Signal tasks blocked in i915_gem_wait_for_error that the pending
> +      * reset state is cleared.
> +      */
> +     if (reset_completed)
> +             wake_up_all(&dev_priv->gpu_error.reset_queue);
> +}
> +
>  /**
>   * i915_error_work_func - do process context error handling work
>   * @work: work struct
> @@ -1483,11 +1511,10 @@ static void i915_error_work_func(struct work_struct 
> *work)
>       drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t,
>                                                   gpu_error);
>       struct drm_device *dev = dev_priv->dev;
> -     struct intel_ring_buffer *ring;
>       char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>       char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>       char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -     int i, ret;
> +     int ret;
>  
>       kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event);
>  
> @@ -1506,8 +1533,16 @@ static void i915_error_work_func(struct work_struct 
> *work)
>               kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE,
>                                  reset_event);
>  
> +             /*
> +              * All state reset _must_ be completed before we update the
> +              * reset counter, for otherwise waiters might miss the reset
> +              * pending state and not properly drop locks, resulting in
> +              * deadlocks with the reset work.
> +              */
>               ret = i915_reset(dev);
>  
> +             intel_display_handle_reset(dev);
> +
>               if (ret == 0) {
>                       /*
>                        * After all the gem state is reset, increment the reset
> @@ -1528,12 +1563,11 @@ static void i915_error_work_func(struct work_struct 
> *work)
>                       atomic_set(&error->reset_counter, I915_WEDGED);
>               }
>  
> -             for_each_ring(ring, dev_priv, i)
> -                     wake_up_all(&ring->irq_queue);
> -
> -             intel_display_handle_reset(dev);
> -
> -             wake_up_all(&dev_priv->gpu_error.reset_queue);
> +             /*
> +              * Note: The wake_up also serves as a memory barrier so that
> +              * waiters see the update value of the reset counter atomic_t.
> +              */
> +             i915_error_wake_up(dev_priv, true);
>       }
>  }
>  
> @@ -1642,8 +1676,6 @@ static void i915_report_and_clear_eir(struct drm_device 
> *dev)
>  void i915_handle_error(struct drm_device *dev, bool wedged)
>  {
>       struct drm_i915_private *dev_priv = dev->dev_private;
> -     struct intel_ring_buffer *ring;
> -     int i;
>  
>       i915_capture_error_state(dev);
>       i915_report_and_clear_eir(dev);
> @@ -1653,11 +1685,19 @@ void i915_handle_error(struct drm_device *dev, bool 
> wedged)
>                               &dev_priv->gpu_error.reset_counter);
>  
>               /*
> -              * Wakeup waiting processes so that the reset work item
> -              * doesn't deadlock trying to grab various locks.
> +              * Wakeup waiting processes so that the reset work function
> +              * i915_error_work_func doesn't deadlock trying to grab various
> +              * locks. By bumping the reset counter first, the woken
> +              * processes will see a reset in progress and back off,
> +              * releasing their locks and then wait for the reset completion.
> +              * We must do this for _all_ gpu waiters that might hold locks
> +              * that the reset work needs to acquire.
> +              *
> +              * Note: The wake_up serves as the required memory barrier to
> +              * ensure that the waiters see the updated value of the reset
> +              * counter atomic_t.
>                */
> -             for_each_ring(ring, dev_priv, i)
> -                     wake_up_all(&ring->irq_queue);
> +             i915_error_wake_up(dev_priv, false);
>       }
>  
>       /*
> -- 
> 1.8.4.rc3
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

Reply via email to