On 10.07.25 11:01, Simona Vetter wrote: > On Wed, Jul 09, 2025 at 12:52:05PM -0400, Rodrigo Vivi wrote: >> On Wed, Jul 09, 2025 at 05:18:54PM +0300, Raag Jadav wrote: >>> On Wed, Jul 09, 2025 at 04:09:20PM +0200, Christian König wrote: >>>> On 09.07.25 15:41, Simona Vetter wrote: >>>>> On Wed, Jul 09, 2025 at 04:50:13PM +0530, Riana Tauro wrote: >>>>>> Certain errors can cause the device to be wedged and may >>>>>> require a vendor specific recovery method to restore normal >>>>>> operation. >>>>>> >>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors. Vendors >>>>>> must provide additional recovery documentation if this method >>>>>> is used. >>>>>> >>>>>> v2: fix documentation (Raag) >>>>>> >>>>>> Cc: André Almeida <andrealm...@igalia.com> >>>>>> Cc: Christian König <christian.koe...@amd.com> >>>>>> Cc: David Airlie <airl...@gmail.com> >>>>>> Cc: <dri-devel@lists.freedesktop.org> >>>>>> Suggested-by: Raag Jadav <raag.ja...@intel.com> >>>>>> Signed-off-by: Riana Tauro <riana.ta...@intel.com> >>>>> >>>>> I'm not really understanding what this is useful for, maybe concrete >>>>> example in the form of driver code that uses this, and some tool or >>>>> documentation steps that should be taken for recovery? >> >> The case here is when FW underneath identified something badly corrupted on >> FW land and decided that only a firmware-flashing could solve the day and >> raise interrupt to the driver. At that point we want to wedge, but >> immediately >> hint the admin the recommended action. >> >>>> >>>> The recovery method for this particular case is to flash in a new firmware. >>>> >>>>> The issues I'm seeing here is that eventually we'll get different >>>>> vendor-specific recovery steps, and maybe even on the same device, and >>>>> that leads us to an enumeration issue. Since it's just a string and an >>>>> enum I think it'd be better to just allocate a new one every time there's >>>>> a new strange recovery method instead of this opaque approach. >>>> >>>> That is exactly the opposite of what we discussed so far. > > Sorry, I missed that context. > >>>> The original idea was to add a firmware-flush recovery method which >>>> looked a bit wage since it didn't give any information on what to do >>>> exactly. >>>> >>>> That's why I suggested to add a more generic vendor-specific event >>>> with refers to the documentation and system log to see what actually >>>> needs to be done. >>>> >>>> Otherwise we would end up with events like firmware-flash, update FW >>>> image A, update FW image B, FW version mismatch etc.... > > Yeah, that's kinda what I expect to happen, and we have enough numbers for > this all to not be an issue. > >>> Agree. Any newly allocated method that is specific to a vendor is going to >>> be opaque anyway, since it can't be generic for all drivers. This just helps >>> reduce the noise in DRM core. >>> >>> And yes, there could be different vendor-specific cases for the same driver >>> and the driver should be able to provide the means to distinguish between >>> them. >> >> Sim, what's your take on this then? >> >> Should we get back to the original idea of firmware-flash? > > Maybe intel-firmware-flash or something, meaning prefix with the vendor? > > The reason I think it should be specific is because I'm assuming you want > to script this. And if you have a big fleet with different vendors, then > "vendor-specific" doesn't tell you enough. But if it's something like > $vendor-$magic_step then it does become scriptable, and we do have have a > place to put some documentation on what you should do instead. > > If the point of this interface isn't that it's scriptable, then I'm not > sure why it needs to be an uevent?
You should probably read up on the previous discussion, cause that is exactly what I asked as well :) And no, it should *not* be scripted. That would be a bit brave for a firmware update where you should absolutely not power down the system for example. In my understanding the new value "vendor-specific" basically means it is a known issue with a documented solution, while "unknown" means the driver has no idea how to solve it. Regards, Christian. > I guess if you all want to stick with vendor-specific then I think that's > ok with me too, but the docs should at least explain how to figure out > from the uevent which vendor you're on with a small example. What I'm > worried is that if we have this on multiple drivers userspace will > otherwise make a complete mess and might want to run the wrong recovery > steps. > > I think ideally, no matter what, we'd have a concrete driver patch which > then also comes with the documentation for what exactly you're supposed to > do as something you can script. And not just this stand-alone patch here. > > Cheers, Sima >> >>> >>> Raag >>> >>>>>> --- >>>>>> Documentation/gpu/drm-uapi.rst | 9 +++++---- >>>>>> drivers/gpu/drm/drm_drv.c | 2 ++ >>>>>> include/drm/drm_device.h | 4 ++++ >>>>>> 3 files changed, 11 insertions(+), 4 deletions(-) >>>>>> >>>>>> diff --git a/Documentation/gpu/drm-uapi.rst >>>>>> b/Documentation/gpu/drm-uapi.rst >>>>>> index 263e5a97c080..c33070bdb347 100644 >>>>>> --- a/Documentation/gpu/drm-uapi.rst >>>>>> +++ b/Documentation/gpu/drm-uapi.rst >>>>>> @@ -421,10 +421,10 @@ Recovery >>>>>> Current implementation defines three recovery methods, out of which, >>>>>> drivers >>>>>> can use any one, multiple or none. Method(s) of choice will be sent in >>>>>> the >>>>>> uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of >>>>>> less to >>>>>> -more side-effects. If driver is unsure about recovery or method is >>>>>> unknown >>>>>> -(like soft/hard system reboot, firmware flashing, physical device >>>>>> replacement >>>>>> -or any other procedure which can't be attempted on the fly), >>>>>> ``WEDGED=unknown`` >>>>>> -will be sent instead. >>>>>> +more side-effects. If recovery method is specific to vendor >>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to >>>>>> vendor >>>>>> +specific documentation for further recovery steps. If driver is unsure >>>>>> about >>>>>> +recovery or method is unknown, ``WEDGED=unknown`` will be sent instead >>>>>> >>>>>> Userspace consumers can parse this event and attempt recovery as per the >>>>>> following expectations. >>>>>> @@ -435,6 +435,7 @@ following expectations. >>>>>> none optional telemetry collection >>>>>> rebind unbind + bind driver >>>>>> bus-reset unbind + bus reset/re-enumeration + bind >>>>>> + vendor-specific vendor specific recovery method >>>>>> unknown consumer policy >>>>>> =============== ======================================== >>>>>> >>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c >>>>>> index cdd591b11488..0ac723a46a91 100644 >>>>>> --- a/drivers/gpu/drm/drm_drv.c >>>>>> +++ b/drivers/gpu/drm/drm_drv.c >>>>>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned >>>>>> int opt) >>>>>> return "rebind"; >>>>>> case DRM_WEDGE_RECOVERY_BUS_RESET: >>>>>> return "bus-reset"; >>>>>> + case DRM_WEDGE_RECOVERY_VENDOR: >>>>>> + return "vendor-specific"; >>>>>> default: >>>>>> return NULL; >>>>>> } >>>>>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h >>>>>> index 08b3b2467c4c..08a087f149ff 100644 >>>>>> --- a/include/drm/drm_device.h >>>>>> +++ b/include/drm/drm_device.h >>>>>> @@ -26,10 +26,14 @@ struct pci_controller; >>>>>> * Recovery methods for wedged device in order of less to more >>>>>> side-effects. >>>>>> * To be used with drm_dev_wedged_event() as recovery @method. Callers >>>>>> can >>>>>> * use any one, multiple (or'd) or none depending on their needs. >>>>>> + * >>>>>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst >>>>>> for more >>>>>> + * details. >>>>>> */ >>>>>> #define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry >>>>>> collection */ >>>>>> #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver >>>>>> */ >>>>>> #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus >>>>>> device + bind */ >>>>>> +#define DRM_WEDGE_RECOVERY_VENDOR BIT(3) /* vendor specific >>>>>> recovery method */ >>>>>> >>>>>> /** >>>>>> * struct drm_wedge_task_info - information about the guilty task of a >>>>>> wedge dev >>>>>> -- >>>>>> 2.47.1 >>>>>> >>>>> >>>> >