On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
> Add a helper function to set recovery method. The recovery
> method has to be set before declaring the device wedged and sending the
> drm wedged uevent. If no method is set, default unbind/re-bind method
> will be set
> 
> Signed-off-by: Riana Tauro <riana.ta...@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>  drivers/gpu/drm/xe/xe_device.h       |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>  3 files changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 660b0c5126dc..3fd604ebdc6e 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device 
> *drm, void *arg)
>       xe_pm_runtime_put(xe);
>  }
>  
> +/**
> + * xe_device_set_wedged_method - Set wedged recovery method
> + * @xe: xe device instance

Missing @method

> + *
> + * Set wedged recovery method to be sent using drm wedged uevent.
> + */
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
> +{
> +     xe->wedged.method = method;
> +}
> +
>  /**
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> - * re-probe (unbind + bind).
> - * In this state every IOCTL will be blocked so the GT cannot be used.
> + * This is a final state that can only be cleared with the method specified
> + * in the drm wedged uevent. The method needs to be set using 
> xe_device_set_wedged_method
> + * before declaring the device as wedged or the default method of reprobe 
> (unbind/re-bind)
> + * will be sent. In this state every IOCTL will be blocked so the GT cannot 
> be used.

The file convention seems like 80 characters for kernel doc, so let's
stick to it.

>   * In general it will be called upon any critical error such as gt reset
> - * failure or guc loading failure. Userspace will be notified of this state
> - * through device wedged uevent.
> + * failure or guc loading failure or firmware failure.
> + * Userspace will be notified of this state through device wedged uevent.
>   * If xe.wedged module parameter is set to 2, this function will be called
>   * on every single execution timeout (a.k.a. GPU hang) right after 
> devcoredump
>   * snapshot capture. In this mode, GT reset won't be attempted so the state 
> of
> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>               return;
>       }
>  
> +     /* If no wedge recovery method is set, use default */
> +     if (!xe->wedged.method)
> +             xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
> +                                         | DRM_WEDGE_RECOVERY_BUS_RESET);

Although there are no strict rules about this, we usually don't begin a
new line with a symbol.

> +
>       if (!atomic_xchg(&xe->wedged.flag, 1)) {
>               xe->needs_flr_on_fini = true;
>               drm_err(&xe->drm,
> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>                       dev_name(xe->drm.dev));
>  
>               /* Notify userspace of wedged device */
> -             drm_dev_wedged_event(&xe->drm,
> -                                  DRM_WEDGE_RECOVERY_REBIND | 
> DRM_WEDGE_RECOVERY_BUS_RESET);
> +             drm_dev_wedged_event(&xe->drm, xe->wedged.method);

I was a bit late to realize it when I originally added this. The event
call should be after xe_gt_declare_wedged() to comply with wedging rules.
We notify userspace *after* we're done with driver cleanup.

Raag

>       }
>  
>       for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0bc3bc8e6803..06350740aac5 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>  }
>  
>  void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>  
>  struct xe_file *xe_file_get(struct xe_file *xef);
>  void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
> b/drivers/gpu/drm/xe/xe_device_types.h
> index b93c04466637..fb3617956d63 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -559,6 +559,8 @@ struct xe_device {
>               atomic_t flag;
>               /** @wedged.mode: Mode controlled by kernel parameter and 
> debugfs */
>               int mode;
> +             /** @wedged.method: Recovery method to be sent in the drm 
> device wedged uevent */
> +             unsigned long method;
>       } wedged;
>  
>       /** @bo_device: Struct to control async free of BOs */
> -- 
> 2.47.1
> 

Reply via email to