On Tue, Jun 03, 2025 at 01:43:58PM +0530, Riana Tauro wrote:
> Add a helper function to set recovery method. The recovery
> method has to be set before declaring the device wedged and sending the
> drm wedged uevent. If no method is set, default unbind/re-bind method
> will be set
>
> Signed-off-by: Riana Tauro <riana.ta...@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       | 30 +++++++++++++++++++++-------
>  drivers/gpu/drm/xe/xe_device.h       |  1 +
>  drivers/gpu/drm/xe/xe_device_types.h |  2 ++
>  3 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 660b0c5126dc..3fd604ebdc6e 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -1120,16 +1120,28 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
>  	xe_pm_runtime_put(xe);
>  }
>
> +/**
> + * xe_device_set_wedged_method - Set wedged recovery method
> + * @xe: xe device instance

Missing @method

> + *
> + * Set wedged recovery method to be sent using drm wedged uevent.
> + */
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method)
> +{
> +	xe->wedged.method = method;
> +}
> +
>  /**
>   * xe_device_declare_wedged - Declare device wedged
>   * @xe: xe device instance
>   *
> - * This is a final state that can only be cleared with a module
> - * re-probe (unbind + bind).
> - * In this state every IOCTL will be blocked so the GT cannot be used.
> + * This is a final state that can only be cleared with the method specified
> + * in the drm wedged uevent. The method needs to be set using xe_device_set_wedged_method
> + * before declaring the device as wedged or the default method of reprobe (unbind/re-bind)
> + * will be sent. In this state every IOCTL will be blocked so the GT cannot be used.

The file convention seems like 80 characters for kernel doc, so let's stick to it.

>   * In general it will be called upon any critical error such as gt reset
> - * failure or guc loading failure. Userspace will be notified of this state
> - * through device wedged uevent.
> + * failure or guc loading failure or firmware failure.
> + * Userspace will be notified of this state through device wedged uevent.
>   * If xe.wedged module parameter is set to 2, this function will be called
>   * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
>   * snapshot capture. In this mode, GT reset won't be attempted so the state of
> @@ -1152,6 +1164,11 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  		return;
>  	}
>
> +	/* If no wedge recovery method is set, use default */
> +	if (!xe->wedged.method)
> +		xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_REBIND
> +					    | DRM_WEDGE_RECOVERY_BUS_RESET);

Although there are no strict rules about this, we usually don't begin a new line with a symbol.

> +
>  	if (!atomic_xchg(&xe->wedged.flag, 1)) {
>  		xe->needs_flr_on_fini = true;
>  		drm_err(&xe->drm,
> @@ -1161,8 +1178,7 @@ void xe_device_declare_wedged(struct xe_device *xe)
>  			dev_name(xe->drm.dev));
>
>  		/* Notify userspace of wedged device */
> -		drm_dev_wedged_event(&xe->drm,
> -				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
> +		drm_dev_wedged_event(&xe->drm, xe->wedged.method);

I was a bit late to realize it when I originally added this. The event call
should be after xe_gt_declare_wedged() to comply with wedging rules. We notify
userspace *after* we're done with driver cleanup.

Raag

>  	}
>
>  	for_each_gt(gt, xe, id)
> diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> index 0bc3bc8e6803..06350740aac5 100644
> --- a/drivers/gpu/drm/xe/xe_device.h
> +++ b/drivers/gpu/drm/xe/xe_device.h
> @@ -191,6 +191,7 @@ static inline bool xe_device_wedged(struct xe_device *xe)
>  }
>
>  void xe_device_declare_wedged(struct xe_device *xe);
> +void xe_device_set_wedged_method(struct xe_device *xe, unsigned long method);
>
>  struct xe_file *xe_file_get(struct xe_file *xef);
>  void xe_file_put(struct xe_file *xef);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index b93c04466637..fb3617956d63 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -559,6 +559,8 @@ struct xe_device {
>  		atomic_t flag;
>  		/** @wedged.mode: Mode controlled by kernel parameter and debugfs */
>  		int mode;
> +		/** @wedged.method: Recovery method to be sent in the drm device wedged uevent */
> +		unsigned long method;
>  	} wedged;
>
>  	/** @bo_device: Struct to control async free of BOs */
> --
> 2.47.1
>
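
For illustration only, the ordering asked for in the review (per-GT cleanup first, uevent last) together with the default-method fallback can be sketched as a small userspace mock. This is not the driver code: every mock_* name, the MOCK_RECOVERY_* flag values and the event log are hypothetical stand-ins, and how the event call would actually be moved relative to the atomic_xchg() check is an assumption.

```c
#include <assert.h>

/* Hypothetical stand-ins for the DRM recovery flags; values are illustrative. */
#define MOCK_RECOVERY_REBIND	(1UL << 0)
#define MOCK_RECOVERY_BUS_RESET	(1UL << 1)

#define MOCK_MAX_EVENTS 8

enum mock_event { MOCK_GT_WEDGED, MOCK_UEVENT };

/* Hypothetical, much reduced model of the device state touched by the patch. */
struct mock_device {
	int wedged_flag;		/* models xe->wedged.flag */
	unsigned long wedged_method;	/* models xe->wedged.method */
	unsigned long notified_method;	/* method carried by the mock uevent */
	int num_gts;
	enum mock_event log[MOCK_MAX_EVENTS];	/* order in which steps ran */
	int log_len;
};

static void mock_log(struct mock_device *dev, enum mock_event ev)
{
	assert(dev->log_len < MOCK_MAX_EVENTS);
	dev->log[dev->log_len++] = ev;
}

static void mock_set_wedged_method(struct mock_device *dev, unsigned long method)
{
	dev->wedged_method = method;
}

static void mock_declare_wedged(struct mock_device *dev)
{
	int first, i;

	/* If no recovery method is set, fall back to the default. */
	if (!dev->wedged_method)
		mock_set_wedged_method(dev, MOCK_RECOVERY_REBIND |
					    MOCK_RECOVERY_BUS_RESET);

	/* Remember whether this call is the one that wedges the device. */
	first = !dev->wedged_flag;
	dev->wedged_flag = 1;

	/* Driver-side cleanup: wedge every GT before telling userspace. */
	for (i = 0; i < dev->num_gts; i++)
		mock_log(dev, MOCK_GT_WEDGED);

	/* Notify userspace only once, and only after cleanup is done. */
	if (first) {
		dev->notified_method = dev->wedged_method;
		mock_log(dev, MOCK_UEVENT);
	}
}
```

A caller wanting a non-default method would call mock_set_wedged_method() before mock_declare_wedged(); otherwise the fallback above applies. The continuation line ends with the `|` operator rather than starting with it, matching the style note in the review.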