On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote: > On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote: > > On 09.07.25 15:41, Simona Vetter wrote: > > > On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote: > > >> Certain errors can cause the device to be wedged and may > > >> require a vendor specific recovery method to restore normal > > >> operation. > > >> > > >> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors > > >> must provide additional recovery documentation if this method > > >> is used. > > >> > > >> v2: fix documentation (Raag) > > >> > > >> Cc: André Almeida <andrealm...@igalia.com> > > >> Cc: Christian König <christian.koe...@amd.com> > > >> Cc: David Airlie <airl...@gmail.com> > > >> Cc: <dri-devel@lists.freedesktop.org> > > >> Suggested-by: Raag Jadav <raag.ja...@intel.com> > > >> Signed-off-by: Riana Tauro <riana.ta...@intel.com> > > > > > > I'm not really understanding what this is useful for, maybe concrete > > > example in the form of driver code that uses this, and some tool or > > > documentation steps that should be taken for recovery?
The case here is when FW underneath identified something badly corrupted on FW land and decided that only a firmware-flashing could solve the day and raise interrupt to the driver. At that point we want to wedge, but immediately hint the admin the recommended action. > > > > The recovery method for this particular case is to flash in a new firmware. > > > > > The issues I'm seeing here is that eventually we'll get different > > > vendor-specific recovery steps, and maybe even on the same device, and > > > that leads us to an enumeration issue. Since it's just a string and an > > > enum I think it'd be better to just allocate a new one every time there's > > > a new strange recovery method instead of this opaque approach. > > > > That is exactly the opposite of what we discussed so far. > > > > The original idea was to add a firmware-flush recovery method which looked > > a bit wage since it didn't give any information on what to do exactly. > > > > That's why I suggested to add a more generic vendor-specific event with > > refers to the documentation and system log to see what actually needs to be > > done. > > > > Otherwise we would end up with events like firmware-flash, update FW image > > A, update FW image B, FW version mismatch etc.... > > Agree. Any newly allocated method that is specific to a vendor is going to > be opaque anyway, since it can't be generic for all drivers. This just helps > reduce the noise in DRM core. > > And yes, there could be different vendor-specific cases for the same driver > and the driver should be able to provide the means to distinguish between > them. Sim, what's your take on this then? Should we get back to the original idea of firmware-flash? > > Raag > > > >> --- > > >> Documentation/gpu/drm-uapi.rst | 9 +++++---- > > >> drivers/gpu/drm/drm_drv.c | 2 ++ > > >> include/drm/drm_device.h | 4 ++++ > > >> 3 files changed, 11 insertions(+), 4 deletions(-) > > >> > > >> diff --git a/Documentation/gpu/drm-uapi.rst > > >> b/Documentation/gpu/drm-uapi.rst > > >> index 263e5a97c080..c33070bdb347 100644 > > >> --- a/Documentation/gpu/drm-uapi.rst > > >> +++ b/Documentation/gpu/drm-uapi.rst > > >> @@ -421,10 +421,10 @@ Recovery > > >> Current implementation defines three recovery methods, out of which, > > >> drivers > > >> can use any one, multiple or none. Method(s) of choice will be sent in > > >> the > > >> uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of > > >> less to > > >> -more side-effects. If driver is unsure about recovery or method is > > >> unknown > > >> -(like soft/hard system reboot, firmware flashing, physical device > > >> replacement > > >> -or any other procedure which can't be attempted on the fly), > > >> ``WEDGED=unknown`` > > >> -will be sent instead. > > >> +more side-effects. If recovery method is specific to vendor > > >> +``WEDGED=vendor-specific`` will be sent and userspace should refer to > > >> vendor > > >> +specific documentation for further recovery steps. If driver is unsure > > >> about > > >> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead > > >> > > >> Userspace consumers can parse this event and attempt recovery as per the > > >> following expectations. > > >> @@ -435,6 +435,7 @@ following expectations. > > >> none optional telemetry collection > > >> rebind unbind + bind driver > > >> bus-reset unbind + bus reset/re-enumeration + bind > > >> + vendor-specific vendor specific recovery method > > >> unknown consumer policy > > >> =============== ======================================== > > >> > > >> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c > > >> index cdd591b11488..0ac723a46a91 100644 > > >> --- a/drivers/gpu/drm/drm_drv.c > > >> +++ b/drivers/gpu/drm/drm_drv.c > > >> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned > > >> int opt) > > >> return "rebind"; > > >> case DRM_WEDGE_RECOVERY_BUS_RESET: > > >> return "bus-reset"; > > >> + case DRM_WEDGE_RECOVERY_VENDOR: > > >> + return "vendor-specific"; > > >> default: > > >> return NULL; > > >> } > > >> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h > > >> index 08b3b2467c4c..08a087f149ff 100644 > > >> --- a/include/drm/drm_device.h > > >> +++ b/include/drm/drm_device.h > > >> @@ -26,10 +26,14 @@ struct pci_controller; > > >> * Recovery methods for wedged device in order of less to more > > >> side-effects. > > >> * To be used with drm_dev_wedged_event() as recovery @method. Callers > > >> can > > >> * use any one, multiple (or'd) or none depending on their needs. > > >> + * > > >> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst > > >> for more > > >> + * details. > > >> */ > > >> #define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry > > >> collection */ > > >> #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver > > >> */ > > >> #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus > > >> device + bind */ > > >> +#define DRM_WEDGE_RECOVERY_VENDOR BIT(3) /* vendor specific > > >> recovery method */ > > >> > > >> /** > > >> * struct drm_wedge_task_info - information about the guilty task of a > > >> wedge dev > > >> -- > > >> 2.47.1 > > >> > > > > >