Re: [PATCH v7 1/5] drm: Introduce device wedged event

2024-10-19 Thread Raag Jadav
On Fri, Oct 18, 2024 at 12:58:09PM +0200, Christian König wrote:
> Am 17.10.24 um 18:43 schrieb Rodrigo Vivi:
> > On Thu, Oct 17, 2024 at 09:59:10AM +0200, Christian König wrote:
> > > > > Purpose of this implementation is to provide drivers a generic way to
> > > > > recover with the help of userspace intervention. Different drivers may
> > > > > have different ideas of a "wedged device" depending on their hardware
> > > > > implementation, and hence the vendor agnostic nature of the event.
> > > > > It is up to the drivers to decide when they see the need for recovery
> > > > > and how they want to recover from the available methods.
> > > > > 
> > > > > Current implementation defines three recovery methods, out of which,
> > > > > drivers can choose to support any one or multiple of them. Preferred
> > > > > recovery method will be sent in the uevent environment as 
> > > > > WEDGED=.
> > > > > Userspace consumers (sysadmin) can define udev rules to parse this 
> > > > > event
> > > > > and take respective action to recover the device.
> > > > > 
> > > > >   === ==
> > > > >   Recovery method Consumer expectations
> > > > >   === ==
> > > > >   rebind  unbind + rebind driver
> > > > >   bus-reset   unbind + reset bus device + rebind
> > > > >   reboot  reboot system
> > > > >   === ==
> > > Well that sounds like userspace would need to be involved in recovery.
> > > 
> > > That in turn is a complete no-go since we at least need to signal all
> > > dma_fences to unblock the kernel. In other words things like bus reset 
> > > needs
> > > to happen inside the kernel and *not* in userspace.
> > > 
> > > What we can do is to signal to userspace: Hey a bus reset of device X
> > > happened, maybe restart container, daemon, whatever service which was 
> > > using
> > > this device.
> > Well, when we declare device 'wedged' it is because we don't want to take
> > any drastic measures inside the kernel and want to leave it in a protected
> > and unusable state. In a way that users wouldn't lose display for instance,
> > or at least the device is in a debugable state.
> 
> Uff, that needs to be very very well documented or otherwise the whole
> approach is an absolutely clear NAK from my side as DMA-buf maintainer.
> 
> > 
> > Then, the instructions here is to tell what could possibly be attempted
> > from userspace to get the device to an usable state.
> > 
> > The 'wedge' mode (the one emiting this uevent) needs to be responsible
> > for signaling all the fences and everything needed for a clean unbind
> > and whatever next step might be indicated to userspace.
> > 
> > That should already be part of any wedged mode, regardless the uevent
> > to inform the userspace here.
> 
> You need to approach that from a different side. With the current patch set
> you are ignoring documented mandatory driver behavior as far as I can see.
> 
> So first of all describe in the documentation what the wedged mode is and
> what requirements a driver has to fulfill to enter it:
> https://docs.kernel.org/gpu/drm-uapi.html#device-reset
>
> Especially document that all system memory accesses of the device needs to
> be blocked by (for example) disabling DMA accesses in the PCI config space.
> 
> When it is guaranteed that the device can't access any system memory any
> more the device driver should signal all pending fences of this device.
> 
> And only after all of that is done the driver  can send an uevent to inform
> userspace that it can debug the hanged state.

Sure, will do.

> As far as I can see this makes the enum how to recover the device
> superfluous because you will most likely always need a bus reset to get out
> of this again.

That depends on the kind of fault the device has encountered and the bus it is
sitting on. There could be buses that don't support reset.

Raag


Re: [PATCH v7 1/5] drm: Introduce device wedged event

2024-10-19 Thread Raag Jadav
On Fri, Oct 18, 2024 at 02:54:38PM +0200, Christian König wrote:
> Am 18.10.24 um 14:46 schrieb Raag Jadav:
> > > As far as I can see this makes the enum how to recover the device
> > > superfluous because you will most likely always need a bus reset to get 
> > > out
> > > of this again.
> > That depends on the kind of fault the device has encountered and the bus it 
> > is
> > sitting on. There could be buses that don't support reset.
> 
> That is even more an argument to not expose this in the uevent.
> 
> Getting the device working again is strongly device dependent and can't be
> handled in a generic way.

My understanding is that the proposed methods can be handled in a generic way
and are useful for the devices that do support it. This way the userspace can
atleast have a hint about recovery.

For others we can have something like WEDGED=none (as proposed by Michal and
Lucas in other threads) and let admin/user decide how to deal with it.

Raag


Re: [PATCH v7 1/5] drm: Introduce device wedged event

2024-10-17 Thread Raag Jadav
On Mon, Sep 30, 2024 at 01:08:41PM +0530, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
> 
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
> 
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
> 
> === ==
> Recovery method Consumer expectations
> === ==
> rebind  unbind + rebind driver
> bus-reset   unbind + reset bus device + rebind
> reboot  reboot system
> === ==
> 
> v4: s/drm_dev_wedged/drm_dev_wedged_event
> Use drm_info() (Jani)
> Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
> Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
> Aesthetic adjustments (Andy)
> Handle invalid method cases
> 
> Signed-off-by: Raag Jadav 
> ---

Cc'ing amd, collabora and others as I found semi-related work at

https://lore.kernel.org/dri-devel/20230627132323.115440-1-andrealm...@igalia.com/
https://lore.kernel.org/amd-gfx/20240725150055.1991893-1-alexander.deuc...@amd.com/
https://lore.kernel.org/dri-devel/20241011225906.3789965-3-adrian.laru...@collabora.com/
https://lore.kernel.org/amd-gfx/CAAxE2A5v_RkZ9ex4=7jibskvb22_1faj0aanbcmktett5c3...@mail.gmail.com/


Please share feedback about usefulness and adoption of this.
Improvements are welcome.

Raag

>  drivers/gpu/drm/drm_drv.c | 77 +++
>  include/drm/drm_device.h  | 23 
>  include/drm/drm_drv.h |  3 ++
>  3 files changed, 103 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..cfe9600da2ee 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>  
> +#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -33,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -70,6 +73,42 @@ static struct dentry *drm_debugfs_root;
>  
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>  
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> + [DRM_WEDGE_RECOVERY_REBIND] = "rebind",
> + [DRM_WEDGE_RECOVERY_BUS_RESET] = "bus-reset",
> + [DRM_WEDGE_RECOVERY_REBOOT] = "reboot",
> +};
> +
> +static bool drm_wedge_recovery_is_valid(enum drm_wedge_recovery method)
> +{
> + static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == 
> DRM_WEDGE_RECOVERY_MAX);
> +
> + return method >= DRM_WEDGE_RECOVERY_REBIND && method < 
> DRM_WEDGE_RECOVERY_MAX;
> +}
> +
> +/**
> + * drm_wedge_recovery_name - provide wedge recovery name
> + * @method: method to be used for recovery
> + *
> + * This validates wedge recovery @method against the available ones in
> + * drm_wedge_recovery_opts[] and provides respective recovery name in string
> + * format if found valid.
> + *
> + * Returns: pointer to const recovery string on success, NULL otherwise.
> + */
> +const char *drm_wedge_recovery_name(enum drm_wedge_recovery method)
> +{
> + if (drm_wedge_recovery_is_valid(method))
> + return drm_wedge_recovery_opts[method];
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL(drm_wedge_recovery_name);
> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. 
> Each
> @@ -497,6 +536,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(

[PATCH v8 1/4] drm: Introduce device wedged event

2024-10-25 Thread Raag Jadav
Introduce device wedged event, which will notify userspace of wedged
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a reset and has become unrecoverable from driver
context. Purpose of this implementation is to provide drivers a generic
way to recover with the help of userspace intervention without taking
any drastic measures in the driver.

A 'wedged' device is basically a dead device that needs attention.
The uevent is the notification that is sent to userspace along with a
hint about what could possibly be attempted to recover the device and
bring it back to usable state. Different drivers may have different
ideas of a 'wedged' device depending on their hardware implementation,
and hence the vendor agnostic nature of the event. It is up to the
drivers to decide when they see the need for recovery and how they
want to recover from the available methods.

Recovery


Current implementation defines two recovery methods, out of which,
drivers can use any one, both or none. Method(s) of choice will be
sent in the uevent environment as ``WEDGED=[,]``
in order of less to more side-effects. If driver is unsure about
recovery or method is unknown (like soft/hard reboot, firmware
flashing, hardware replacement or any other procedure which can't
be attempted on the fly), ``WEDGED=none`` will be sent instead.

It is the responsibility of the driver to perform required cleanups
(like disabling system memory access or signalling dma_fences) and
prepare itself for the recovery before sending the event. Once the
event is sent, driver should block all IOCTLs with an error code.
This will signify the reason for wegeding which can be reported to
the application if needed.

Userspace consumers can parse this event and attempt recovery as per
below expectations.

=== ==
Recovery method Consumer expectations
=== ==
rebind  unbind + rebind driver
bus-reset   unbind + reset bus device + rebind
noneadmin/user policy
=== ==

Example for rebind
~~

Udev rule::

SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

#!/bin/sh

DEVPATH=$(readlink -f /sys/$1/device)
DEVICE=$(basename $DEVPATH)
DRIVER=$(readlink -f $DEVPATH/driver)

echo -n $DEVICE > $DRIVER/unbind
sleep 1
echo -n $DEVICE > $DRIVER/bind

Although scripts are simple enough for basic recovery, admin/users
can define customized policies around recovery action. For example if
the driver supports multiple recovery methods, consumers can opt for
the suitable one based on policy definition. Consumers can also take
additional steps like gathering telemetry information (devcoredump,
syslog), or have the device available for further debugging and data
collection before performing the recovery. This is useful especially
when the driver is unsure about recovery or method is unknown.

v4: s/drm_dev_wedged/drm_dev_wedged_event
Use drm_info() (Jani)
Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
Aesthetic adjustments (Andy)
Handle invalid method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
static_assert() globally (Andy)

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/drm_drv.c | 51 +++
 include/drm/drm_device.h  |  7 ++
 include/drm/drm_drv.h |  1 +
 3 files changed, 59 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..ded6327fc242 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -33,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -70,6 +73,16 @@ static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+   [ffs(DRM_WEDGE_RECOVERY_REBIND) - 1]= "rebind",
+   [ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
+};
+static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == 
ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +510,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_d

[PATCH v8 3/4] drm/xe: Use device wedged event

2024-10-25 Thread Raag Jadav
This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
 (drm)
ACTION=change
DEVPATH=/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)
v3: Generic re-implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/xe/xe_device.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 2da4affe4dfd..2477cf043397 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -999,11 +999,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, 
void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a mudule
+ * This is a final state that can only be cleared with a module
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * by a DRM uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1033,6 +1034,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
"IOCTLs and executions are blocked. Only a rebind may 
clear the failure\n"
"Please file a _new_ bug report at 
https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n";,
dev_name(xe->drm.dev));
+
+   /* Notify userspace of wedged device */
+   drm_dev_wedged_event(&xe->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
}
 
for_each_gt(gt, xe, id)
-- 
2.34.1



[PATCH v8 4/4] drm/i915: Use device wedged event

2024-10-25 Thread Raag Jadav
Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index 8f1ea95471ef..06bfd2dbb6c8 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1418,6 +1418,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
if (!test_bit(I915_WEDGED, >->reset.flags))
kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+   else
+   drm_dev_wedged_event(>->i915->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
-- 
2.34.1



[PATCH v8 0/4] Introduce DRM device wedged event

2024-10-25 Thread Raag Jadav
This series introduces device wedged event in DRM subsystem and uses
it in xe and i915 drivers. Detailed description in commit message.

This was earlier attempted as xe specific uevent in v1 and v2.
https://patchwork.freedesktop.org/series/136909/

Similar work by André Almeida.
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealm...@igalia.com/

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)

v3: Generic re-implementation in DRM subsystem (Lucas)

v4: s/drm_dev_wedged/drm_dev_wedged_event
Use drm_info() (Jani)
Kernel doc adjustment (Aravind)
Change authorship to Raag (Aravind)

v5: Send recovery method with uevent (Lina)
Expose supported recovery methods via sysfs (Lucas)

v6: Access wedge_recovery_opts[] using helper function (Jani)
Use snprintf() (Jani)

v7: Convert recovery helpers into regular functions (Andy, Jani)
Aesthetic adjustments (Andy)
Handle invalid method cases
Add documentation to drm-uapi.rst (Sima)

v8: Drop sysfs and allow sending multiple methods with uevent (Lucas, Michal)
Improve documentation (Christian, Rodrigo)
static_assert() globally (Andy)

Raag Jadav (4):
  drm: Introduce device wedged event
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst| 75 +++
 drivers/gpu/drm/drm_drv.c | 51 ++
 drivers/gpu/drm/i915/gt/intel_reset.c |  3 ++
 drivers/gpu/drm/xe/xe_device.c|  9 +++-
 include/drm/drm_device.h  |  7 +++
 include/drm/drm_drv.h |  1 +
 6 files changed, 144 insertions(+), 2 deletions(-)

-- 
2.34.1



[PATCH v8 2/4] drm/doc: Document device wedged event

2024-10-25 Thread Raag Jadav
Add documentation for device wedged event in a new 'Device wedging'
chapter. The describes basic definitions and consumer expectations
along with an example.

v8: Improve documentation (Christian, Rodrigo)

Signed-off-by: Raag Jadav 
---
 Documentation/gpu/drm-uapi.rst | 75 ++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 370d820be248..11a7446233b5 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -362,6 +362,81 @@ the first place. DRM devices should make use of 
devcoredump to store relevant
 information about the reset, so this information can be added to user bug
 reports.
 
+Device wedging
+==
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected even
+after a reset and has become unrecoverable from driver context. Purpose of
+this implementation is to provide drivers a generic way to recover with the
+help of userspace intervention without taking any drastic measures in the
+driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for recovery and how they want to recover from the available methods.
+
+Recovery
+
+
+Current implementation defines two recovery methods, out of which, drivers
+can use any one, both or none. Method(s) of choice will be sent in the uevent
+environment as ``WEDGED=[,]`` in order of less to more side
+effects. If driver is unsure about recovery or method is unknown (like reboot,
+firmware flashing, hardware replacement or any other procedure which can't be
+attempted on the fly), ``WEDGED=none`` will be sent instead.
+
+It is the responsibility of the driver to perform required cleanups (like
+disabling system memory access or signalling dma_fences) and prepare itself
+for the recovery before sending the event. Once the event is sent, driver
+should block all IOCTLs with an error code. This will signify the reason for
+wegeding which can be reported to the application if needed.
+
+Userspace consumers can parse this event and attempt recovery as per below
+expectations.
+
+=== ==
+Recovery method Consumer expectations
+=== ==
+rebind  unbind + rebind driver
+bus-reset   unbind + reset bus device + rebind
+noneadmin/user policy
+=== ==
+
+Example for rebind
+~~
+
+Udev rule::
+
+SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
+RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
+#!/bin/sh
+
+DEVPATH=$(readlink -f /sys/$1/device)
+DEVICE=$(basename $DEVPATH)
+DRIVER=$(readlink -f $DEVPATH/driver)
+
+echo -n $DEVICE > $DRIVER/unbind
+sleep 1
+echo -n $DEVICE > $DRIVER/bind
+
+Although scripts are simple enough for basic recovery, admin/users can define
+customized policies around recovery action. For example if the driver supports
+multiple recovery methods, consumers can opt for the suitable one based on
+policy definition. Consumers can also take additional steps like gathering
+telemetry information (devcoredump, syslog), or have the device available for
+further debugging and data collection before performing the recovery. This is
+useful especially when the driver is unsure about recovery or method is 
unknown.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
-- 
2.34.1



Re: [PATCH v8 1/4] drm: Introduce device wedged event

2024-10-26 Thread Raag Jadav
On Fri, Oct 25, 2024 at 05:45:59PM +0300, Andy Shevchenko wrote:
> On Fri, Oct 25, 2024 at 12:08:50PM +0300, Jani Nikula wrote:
> > On Fri, 25 Oct 2024, Raag Jadav  wrote:
> 
> ...
> 
> > > +/*
> > > + * Available recovery methods for wedged device. To be sent along with 
> > > device
> > > + * wedged uevent.
> > > + */
> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > + [ffs(DRM_WEDGE_RECOVERY_REBIND) - 1]= "rebind",
> > > + [ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
> > > +};
> > > +static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == 
> > > ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
> > 
> > This might work in most cases, but you also might end up finding that
> > there's an arch and compiler combo out there that just won't be able to
> > figure out ffs() at compile time, and the array initialization fails.
> 
> We have ilog2() macro for such cases, but it is rather fls() and not ffs(),
> and I have no idea why ffs() even being used here, especially in the index
> part of the array assignments. It's unreadable.

I initially had __builtin_ffs() in mind which is even more ugly.

> > If that happens, you'd have to either convert back to an enum (and call
> > the wedge event function with BIT(DRM_WEDGE_RECOVERY_REBIND) etc.), or

Which would confuse the users since that's not how enums are normally used.

> > make this a array of structs mapping the macro values to strings and
> > loop over it.

Why not just switch() it?

for_each_set_bit(opt, &method, size) {
switch (BIT(opt)) {
case DRM_WEDGE_RECOVERY_REBIND:
recovery = "rebind";
break;
case DRM_WEDGE_RECOVERY_BUS_RESET:
recovery = "bus-reset";
break;
}

...
}

I know we'll have to update it with new additions, but it'd be much simpler,
atleast compared to introducing and maintaining a new struct.

> > Also, the main point of the static assert was to ensure the array is
> > updated when a new recovery option is added, and there's no out of
> > bounds access. That no longer holds, and the static assert is pretty
> > much useless. You still have to manually find and update this.

With above in place this won't be needed.

Raag


Re: [PATCH v8 2/4] drm/doc: Document device wedged event

2024-11-01 Thread Raag Jadav
On Tue, Oct 29, 2024 at 10:51:34AM +0100, Christian König wrote:
> Am 25.10.24 um 10:48 schrieb Raag Jadav:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions and consumer expectations
> > along with an example.
> > 
> > v8: Improve documentation (Christian, Rodrigo)
> > 
> > Signed-off-by: Raag Jadav 
> > ---
> >   Documentation/gpu/drm-uapi.rst | 75 ++
> >   1 file changed, 75 insertions(+)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 370d820be248..11a7446233b5 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -362,6 +362,81 @@ the first place. DRM devices should make use of 
> > devcoredump to store relevant
> >   information about the reset, so this information can be added to user bug
> >   reports.
> > +Device wedging
> > +==
> > +
> > +Drivers can optionally make use of device wedged event (implemented as
> > +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
> > +(hanged/unusable) state of the DRM device through a uevent. This is useful
> > +especially in cases where the device is no longer operating as expected 
> > even
> > +after a reset and has become unrecoverable from driver context. Purpose of
> > +this implementation is to provide drivers a generic way to recover with the
> > +help of userspace intervention without taking any drastic measures in the
> > +driver.
> > +
> > +A 'wedged' device is basically a dead device that needs attention. The
> > +uevent is the notification that is sent to userspace along with a hint 
> > about
> > +what could possibly be attempted to recover the device and bring it back to
> > +usable state. Different drivers may have different ideas of a 'wedged' 
> > device
> > +depending on their hardware implementation, and hence the vendor agnostic
> > +nature of the event. It is up to the drivers to decide when they see the 
> > need
> > +for recovery and how they want to recover from the available methods.
> > +
> > +Recovery
> > +
> > +
> > +Current implementation defines two recovery methods, out of which, drivers
> > +can use any one, both or none. Method(s) of choice will be sent in the 
> > uevent
> > +environment as ``WEDGED=[,]`` in order of less to more 
> > side
> > +effects. If driver is unsure about recovery or method is unknown (like 
> > reboot,
> > +firmware flashing, hardware replacement or any other procedure which can't 
> > be
> > +attempted on the fly), ``WEDGED=none`` will be sent instead.
> > +
> > +It is the responsibility of the driver to perform required cleanups (like
> > +disabling system memory access or signalling dma_fences) and prepare itself
> > +for the recovery before sending the event.
> 
> That needs to be more like a warning and should have a bit more text. Maybe
> even a separate section.
> 
> Something like this maybe:
> 
> Prerequisites
> --
> 
> Drivers needs to make sure that the device won't harm the system as a
> whole by keeping it in a wedged state. Necessary actions must include
> disabling DMA to system memory as well as communication channels
> with other devices.
> Further drivers must ensure that all dma_fences are signaled and other
> state the core kernel might depend on are cleaned up.
> All ongoing waiting for hardware state changes must be aborted and
> new accesses to the hardware sufficiently blocked
> 
> 
> I'm not a native speaker of English, so feel free to re-phrase that. But the

Neither am I, but will try my best.

> general approach should be like don't do this before you have made sure a, b
> and c.

Sure, thanks.

> >   Once the event is sent, driver
> > +should block all IOCTLs with an error code.
> 
> Better define which error code that should be. I think -ENODEV similar to a
> hotplug case would be appropriate.

Why not leave it to the drivers to decide depending on the type of failure?

> >   This will signify the reason for
> > +wegeding which can be reported to the application if needed.
> > +
> > +Userspace consumers can parse this event and attempt recovery as per below
> > +expectations.
> > +
> > +=== ==
> > +Recovery method Consumer expectations
> > +=== ==
> > +rebind 

[PATCH v10 3/4] drm/xe: Use device wedged event

2024-11-29 Thread Raag Jadav
This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
 (drm)
ACTION=change
DEVPATH=/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)
v3: Generic re-implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav 
Reviewed-by: Aravind Iddamsetty 
---
 drivers/gpu/drm/xe/xe_device.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 930bb2750e2e..f04737854887 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -991,11 +991,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, 
void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a mudule
+ * This is a final state that can only be cleared with a module
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * by a DRM uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1025,6 +1026,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
"IOCTLs and executions are blocked. Only a rebind may 
clear the failure\n"
"Please file a _new_ bug report at 
https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n";,
dev_name(xe->drm.dev));
+
+   /* Notify userspace of wedged device */
+   drm_dev_wedged_event(&xe->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
}
 
for_each_gt(gt, xe, id)
-- 
2.34.1



[PATCH v10 2/4] drm/doc: Document device wedged event

2024-11-29 Thread Raag Jadav
Add documentation for device wedged event in a new 'Device wedging'
chapter. The describes basic definitions, prerequisites and consumer
expectations along with an example.

 v8: Improve documentation (Christian, Rodrigo)
 v9: Add prerequisites section (Christian)
v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)

Signed-off-by: Raag Jadav 
Reviewed-by: Christian König 
---
 Documentation/gpu/drm-uapi.rst | 112 -
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index b75cc9a70d1f..da2927cde53d 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -371,9 +371,115 @@ Reporting causes of resets
 
 Apart from propagating the reset through the stack so apps can recover, it's
 really useful for driver developers to learn more about what caused the reset 
in
-the first place. DRM devices should make use of devcoredump to store relevant
-information about the reset, so this information can be added to user bug
-reports.
+the first place. For this, drivers can make use of devcoredump to store 
relevant
+information about the reset and send device wedged event without recovery 
method
+(as explained in next chapter) to notify userspace, so this information can be
+collected and added to user bug reports.
+
+Device wedging
+==
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem), which notifies userspace of 'wedged'
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected and
+has become unrecoverable from driver context. Purpose of this implementation
+is to provide drivers a generic way to recover with the help of userspace
+intervention without taking any drastic measures in the driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for device recovery and how they want to recover from the available methods.
+
+Driver prerequisites
+
+
+The driver, before opting for recovery, needs to make sure that the 'wedged'
+device doesn't harm the system as a whole by taking care of the prerequisites.
+Necessary actions must include disabling DMA to system memory as well as any
+communication channels with other devices. Further, the driver must ensure
+that all dma_fences are signalled and any device state that the core kernel
+might depend on is cleaned up. Any existing mmap should be invalidated and
+page faults should be redirected to a dummy page. Once the event is sent, the
+device must be kept in 'wedged' state until the recovery is performed. New
+accesses to the device (IOCTLs) should be blocked, preferably with an error
+code that resembles the type of failure the device has encountered. This will
+signify the reason for wedging, which can be reported to the application if
+needed.
+
+Recovery
+
+
+Current implementation defines three recovery methods, out of which, drivers
+can use any one, multiple or none. Method(s) of choice will be sent in the
+uevent environment as ``WEDGED=[,]`` in order of less to
+more side-effects. If driver is unsure about recovery or method is unknown
+(like soft/hard reboot, firmware flashing, hardware replacement or any other
+procedure which can't be attempted on the fly), ``WEDGED=unknown`` will be
+sent instead.
+
+Userspace consumers can parse this event and attempt recovery as per the
+following expectations.
+
+=== 
+Recovery method Consumer expectations
+=== 
+noneoptional telemetry collection
+rebind  unbind + bind driver
+bus-reset   unbind + reset bus device + bind
+unknown admin/user policy
+=== 
+
+The only exception to this is ``WEDGED=none``, which signifies that the
+device was temporarily 'wedged' at some point but was able to recover using
+device specific methods like reset. No explicit device recovery is expected
+from the consumer in this case, but it can still take additional steps like
+gathering telemetry information (devcoredump, syslog). This is useful
+because the first hang is usually the most critical one which can result in
+consequential hangs or complete wedging.
+
+Consumer prerequisites
+--

[PATCH v10 4/4] drm/i915: Use device wedged event

2024-11-29 Thread Raag Jadav
Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav 
Reviewed-by: Aravind Iddamsetty 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index f42f21632306..18cf50a1e84d 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1418,6 +1418,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
if (!test_bit(I915_WEDGED, >->reset.flags))
kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+   else
+   drm_dev_wedged_event(>->i915->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
-- 
2.34.1



[PATCH v10 0/4] Introduce DRM device wedged event

2024-11-29 Thread Raag Jadav
This series introduces device wedged event in DRM subsystem and uses
it in xe and i915 drivers. Detailed description in commit message.

This was earlier attempted as xe specific uevent in v1 and v2.
https://patchwork.freedesktop.org/series/136909/

Similar work by André Almeida.
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealm...@igalia.com/

 v2: Change authorship to Himal (Aravind)
 Add uevent for all device wedged cases (Aravind)

 v3: Generic re-implementation in DRM subsystem (Lucas)

 v4: s/drm_dev_wedged/drm_dev_wedged_event
 Use drm_info() (Jani)
 Kernel doc adjustment (Aravind)
 Change authorship to Raag (Aravind)

 v5: Send recovery method with uevent (Lina)
 Expose supported recovery methods via sysfs (Lucas)

 v6: Access wedge_recovery_opts[] using helper function (Jani)
 Use snprintf() (Jani)

 v7: Convert recovery helpers into regular functions (Andy, Jani)
 Aesthetic adjustments (Andy)
 Handle invalid method cases
 Add documentation to drm-uapi.rst (Sima)

 v8: Drop sysfs and allow sending multiple methods with uevent (Lucas, Michal)
 Improve documentation (Christian, Rodrigo)
 static_assert() globally (Andy)

 v9: Document prerequisites section (Christian)
 Provide 'none' method for reset cases (Christian)
 Provide recovery opts using switch cases

v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)

Raag Jadav (4):
  drm: Introduce device wedged event
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst| 112 +-
 drivers/gpu/drm/drm_drv.c |  66 +++
 drivers/gpu/drm/i915/gt/intel_reset.c |   3 +
 drivers/gpu/drm/xe/xe_device.c|   9 ++-
 include/drm/drm_device.h  |   8 ++
 include/drm/drm_drv.h |   1 +
 6 files changed, 194 insertions(+), 5 deletions(-)

-- 
2.34.1



[PATCH v10 1/4] drm: Introduce device wedged event

2024-11-29 Thread Raag Jadav
id method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
static_assert() globally (Andy)
v9: Provide 'none' method for reset cases (Christian)
Provide recovery opts using switch cases

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/drm_drv.c | 66 +++
 include/drm/drm_device.h  |  8 +
 include/drm/drm_drv.h |  1 +
 3 files changed, 75 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index c2c172eb25df..e6583ebc6b05 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,7 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -33,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -497,6 +499,70 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *drm_get_wedge_recovery(unsigned int opt)
+{
+   switch (BIT(opt)) {
+   case DRM_WEDGE_RECOVERY_NONE:
+   return "none";
+   case DRM_WEDGE_RECOVERY_REBIND:
+   return "rebind";
+   case DRM_WEDGE_RECOVERY_BUS_RESET:
+   return "bus-reset";
+   default:
+   return NULL;
+   }
+}
+
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method\(s) of choice will be sent in the uevent environment as
+ * ``WEDGED=[,]`` in order of less to more side-effects. If
+ * caller is unsure about recovery or @method is unknown (0), 
``WEDGED=unknown``
+ * will be sent instead.
+ *
+ * Refer to 'Device Wedging' chapter in Documentation/gpu/drm-uapi.rst for more
+ * details.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+   const char *recovery = NULL;
+   unsigned int len, opt;
+   /* Event string length up to 28+ characters with available methods */
+   char event_string[32];
+   char *envp[] = { event_string, NULL };
+
+   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+   for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
+   recovery = drm_get_wedge_recovery(opt);
+   if (drm_WARN(dev, !recovery, "device wedged, invalid recovery 
method %u\n", opt))
+   break;
+
+   len += scnprintf(event_string + len, sizeof(event_string), 
"%s,", recovery);
+   }
+
+   if (recovery)
+   /* Get rid of trailing comma */
+   event_string[len - 1] = '\0';
+   else
+   /* Caller is unsure about recovery, do the best we can at this 
point. */
+   snprintf(event_string, sizeof(event_string), "%s", 
"WEDGED=unknown");
+
+   drm_info(dev, "device wedged, needs recovery\n");
+   return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..6ea54a578cda 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,14 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_NONEBIT(0)  /* optional telemetry 
collection */
+#define DRM_WEDGE_RECOVERY_REBIND  BIT(1)  /* unbind + bind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET   BIT(2)  /* unbind + reset bus device + 
bind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 1bbbcb8e2d23..f41a82839e28 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -479,6 +479,7 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
-- 
2.34.1



Re: [PATCH v9 1/4] drm: Introduce device wedged event

2024-11-22 Thread Raag Jadav
On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
> On 15/11/24 10:37, Raag Jadav wrote:
> > Introduce device wedged event, which notifies userspace of 'wedged'
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected and has become unrecoverable from driver context. Purpose of
> > this implementation is to provide drivers a generic way to recover with
> > the help of userspace intervention without taking any drastic measures
> > in the driver.
> >
> > A 'wedged' device is basically a dead device that needs attention. The
> > uevent is the notification that is sent to userspace along with a hint
> > about what could possibly be attempted to recover the device and bring
> > it back to usable state. Different drivers may have different ideas of
> > a 'wedged' device depending on their hardware implementation, and hence
> > the vendor agnostic nature of the event. It is up to the drivers to
> > decide when they see the need for recovery and how they want to recover
> > from the available methods.
> >
> > Prerequisites
> > -
> >
> > The driver, before opting for recovery, needs to make sure that the
> > 'wedged' device doesn't harm the system as a whole by taking care of the
> > prerequisites. Necessary actions must include disabling DMA to system
> > memory as well as any communication channels with other devices. Further,
> > the driver must ensure that all dma_fences are signalled and any device
> > state that the core kernel might depend on are cleaned up. Once the event
> > is sent, the device must be kept in 'wedged' state until the recovery is
> > performed. New accesses to the device (IOCTLs) should be blocked,
> > preferably with an error code that resembles the type of failure the
> > device has encountered. This will signify the reason for wegeding which
> > can be reported to the application if needed.
> 
> should we even drop the mmaps we created?

Whatever is required for a clean recovery, yes.

Although how would this play out? Do we risk loosing display?
Or any other possible side-effects?

Raag


Re: [PATCH v9 3/4] drm/xe: Use device wedged event

2024-11-20 Thread Raag Jadav
On Tue, Nov 19, 2024 at 10:25:10AM +0530, Ghimiray, Himal Prasad wrote:
> On 15-11-2024 10:37, Raag Jadav wrote:
> > This was previously attempted as xe specific reset uevent but dropped
> > in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
> > as part of refactoring.
> > 
> > Now that we have device wedged event provided by DRM core, make use
> > of it and support both driver rebind and bus-reset based recovery.
> > With this in place userspace will be notified of wedged device, on
> > the basis of which, userspace may take respective action to recover
> > the device.
> > 
> > $ udevadm monitor --property --kernel
> > monitor will print the received events for:
> > KERNEL - the kernel uevent
> > 
> > KERNEL[265.802982] change   
> > /devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
> >  (drm)
> > ACTION=change
> > DEVPATH=/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
> > SUBSYSTEM=drm
> > WEDGED=rebind,bus-reset
> > DEVNAME=/dev/dri/card0
> > DEVTYPE=drm_minor
> > SEQNUM=5208
> > MAJOR=226
> > MINOR=0
> 
> 
> The patch in itself looks good. Do we have IGT tests in place to validate
> this uevent ?

I unit tested it with documented example which seems to work. We can have an
igt in place once we have acceptance from the community.

Raag


[PATCH v9 2/4] drm/doc: Document device wedged event

2024-11-15 Thread Raag Jadav
Add documentation for device wedged event in a new 'Device wedging'
chapter. The describes basic definitions and consumer expectations
along with an example.

v8: Improve documentation (Christian, Rodrigo)
v9: Add prerequisites section (Christian)

Signed-off-by: Raag Jadav 
---
 Documentation/gpu/drm-uapi.rst | 102 -
 1 file changed, 99 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index b75cc9a70d1f..33d9c253d4d6 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -371,9 +371,105 @@ Reporting causes of resets
 
 Apart from propagating the reset through the stack so apps can recover, it's
 really useful for driver developers to learn more about what caused the reset 
in
-the first place. DRM devices should make use of devcoredump to store relevant
-information about the reset, so this information can be added to user bug
-reports.
+the first place. For this, drivers can make use of devcoredump to store 
relevant
+information about the reset and send device wedged event without recovery 
method
+(as explained in next chapter) to notify userspace, so this information can be
+collected and added to user bug reports.
+
+Device wedging
+==
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem), which notifies userspace of 'wedged'
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected and
+has become unrecoverable from driver context. Purpose of this implementation
+is to provide drivers a generic way to recover with the help of userspace
+intervention without taking any drastic measures in the driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for recovery and how they want to recover from the available methods.
+
+Prerequisites
+-
+
+The driver, before opting for recovery, needs to make sure that the 'wedged'
+device doesn't harm the system as a whole by taking care of the prerequisites.
+Necessary actions must include disabling DMA to system memory as well as any
+communication channels with other devices. Further, the driver must ensure
+that all dma_fences are signalled and any device state that the core kernel
+might depend on are cleaned up. Once the event is sent, the device must be
+kept in 'wedged' state until the recovery is performed. New accesses to the
+device (IOCTLs) should be blocked, preferably with an error code that
+resembles the type of failure the device has encountered. This will signify
+the reason for wegeding which can be reported to the application if needed.
+
+Recovery
+
+
+Current implementation defines three recovery methods, out of which, drivers
+can use any one, multiple or none. Method(s) of choice will be sent in the
+uevent environment as ``WEDGED=[,]`` in order of less to
+more side-effects. If driver is unsure about recovery or method is unknown
+(like soft/hard reboot, firmware flashing, hardware replacement or any other
+procedure which can't be attempted on the fly), ``WEDGED=unknown`` will be
+sent instead.
+
+Userspace consumers can parse this event and attempt recovery as per the
+following expectations.
+
+=== 
+Recovery method Consumer expectations
+=== 
+noneoptional telemetry collection
+rebind  unbind + bind driver
+bus-reset   unbind + reset bus device + bind
+unknown admin/user policy
+=== 
+
+The only exception to this is ``WEDGED=none``, which signifies that the
+device was temporarily 'wedged' at some point but was able to recover using
+device specific methods like reset. No explicit action is expected from
+userspace consumers in this case, but they can still take additional steps
+like gathering telemetry information (devcoredump, syslog). This is useful
+because the first hang is usually the most critical one which can result in
+consequential hangs or complete wedging.
+
+Example
+---
+
+Udev rule::
+
+SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
+RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
+#!/bin/sh
+
+DEVPATH=$(readlink -f /sys/$1/device)
+  

[PATCH v9 1/4] drm: Introduce device wedged event

2024-11-15 Thread Raag Jadav
Introduce device wedged event, which notifies userspace of 'wedged'
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected and has become unrecoverable from driver context. Purpose of
this implementation is to provide drivers a generic way to recover with
the help of userspace intervention without taking any drastic measures
in the driver.

A 'wedged' device is basically a dead device that needs attention. The
uevent is the notification that is sent to userspace along with a hint
about what could possibly be attempted to recover the device and bring
it back to usable state. Different drivers may have different ideas of
a 'wedged' device depending on their hardware implementation, and hence
the vendor agnostic nature of the event. It is up to the drivers to
decide when they see the need for recovery and how they want to recover
from the available methods.

Prerequisites
-

The driver, before opting for recovery, needs to make sure that the
'wedged' device doesn't harm the system as a whole by taking care of the
prerequisites. Necessary actions must include disabling DMA to system
memory as well as any communication channels with other devices. Further,
the driver must ensure that all dma_fences are signalled and any device
state that the core kernel might depend on are cleaned up. Once the event
is sent, the device must be kept in 'wedged' state until the recovery is
performed. New accesses to the device (IOCTLs) should be blocked,
preferably with an error code that resembles the type of failure the
device has encountered. This will signify the reason for wegeding which
can be reported to the application if needed.

Recovery


Current implementation defines three recovery methods, out of which,
drivers can use any one, multiple or none. Method(s) of choice will be
sent in the uevent environment as ``WEDGED=[,]`` in
order of less to more side-effects. If driver is unsure about recovery
or method is unknown (like soft/hard reboot, firmware flashing, hardware
replacement or any other procedure which can't be attempted on the fly),
``WEDGED=unknown`` will be sent instead.

Userspace consumers can parse this event and attempt recovery as per the
following expectations.

=== 
Recovery method Consumer expectations
=== 
noneoptional telemetry collection
rebind  unbind + bind driver
bus-reset   unbind + reset bus device + bind
unknown admin/user policy
=== 

The only exception to this is ``WEDGED=none``, which signifies that the
device was temporarily 'wedged' at some point but was able to recover
using device specific methods like reset. No explicit action is expected
from userspace consumers in this case, but they can still take additional
steps like gathering telemetry information (devcoredump, syslog). This is
useful because the first hang is usually the most critical one which can
result in consequential hangs or complete wedging.

Example
---

Udev rule::

SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

#!/bin/sh

DEVPATH=$(readlink -f /sys/$1/device)
DEVICE=$(basename $DEVPATH)
DRIVER=$(readlink -f $DEVPATH/driver)

echo -n $DEVICE > $DRIVER/unbind
sleep 1
echo -n $DEVICE > $DRIVER/bind

Customization
-

Although basic recovery is possible with a simple script, admin/users can
define custom policies around recovery action. For example, if the driver
supports multiple recovery methods, consumers can opt for the suitable one
based on policy definition. Consumers can also choose to have the device
available for debugging or additional data collection before performing
the recovery. This is useful especially when the driver is unsure about
recovery or method is unknown.

v4: s/drm_dev_wedged/drm_dev_wedged_event
Use drm_info() (Jani)
Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
Aesthetic adjustments (Andy)
Handle invalid method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
static_assert() globally (Andy)
v9: Provide 'none' method for reset cases (Christian)
Provide recovery opts using switch cases

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/drm_drv.c | 63 +++
 include/drm/drm_device.h  |  8 +
 include/drm/drm_drv.h |  1 +
 3 files changed, 72 insertions(+)

diff --git a/drive

[PATCH v9 0/4] Introduce DRM device wedged event

2024-11-15 Thread Raag Jadav
This series introduces device wedged event in DRM subsystem and uses
it in xe and i915 drivers. Detailed description in commit message.

This was earlier attempted as xe specific uevent in v1 and v2.
https://patchwork.freedesktop.org/series/136909/

Similar work by André Almeida.
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealm...@igalia.com/

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)

v3: Generic re-implementation in DRM subsystem (Lucas)

v4: s/drm_dev_wedged/drm_dev_wedged_event
Use drm_info() (Jani)
Kernel doc adjustment (Aravind)
Change authorship to Raag (Aravind)

v5: Send recovery method with uevent (Lina)
Expose supported recovery methods via sysfs (Lucas)

v6: Access wedge_recovery_opts[] using helper function (Jani)
Use snprintf() (Jani)

v7: Convert recovery helpers into regular functions (Andy, Jani)
Aesthetic adjustments (Andy)
Handle invalid method cases
Add documentation to drm-uapi.rst (Sima)

v8: Drop sysfs and allow sending multiple methods with uevent (Lucas, Michal)
Improve documentation (Christian, Rodrigo)
static_assert() globally (Andy)

v9: Document prerequisites section (Christian)
Provide 'none' method for reset cases (Christian)
Provide recovery opts using switch cases

Raag Jadav (4):
  drm: Introduce device wedged event
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst| 102 +-
 drivers/gpu/drm/drm_drv.c |  63 
 drivers/gpu/drm/i915/gt/intel_reset.c |   3 +
 drivers/gpu/drm/xe/xe_device.c|   9 ++-
 include/drm/drm_device.h  |   8 ++
 include/drm/drm_drv.h |   1 +
 6 files changed, 181 insertions(+), 5 deletions(-)

-- 
2.34.1



[PATCH v9 3/4] drm/xe: Use device wedged event

2024-11-15 Thread Raag Jadav
This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
 (drm)
ACTION=change
DEVPATH=/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)
v3: Generic re-implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/xe/xe_device.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 0e2dd691bdae..5878b331e35c 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -989,11 +989,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, 
void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a mudule
+ * This is a final state that can only be cleared with a module
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * by a DRM uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1023,6 +1024,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
"IOCTLs and executions are blocked. Only a rebind may 
clear the failure\n"
"Please file a _new_ bug report at 
https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n";,
dev_name(xe->drm.dev));
+
+   /* Notify userspace of wedged device */
+   drm_dev_wedged_event(&xe->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
}
 
for_each_gt(gt, xe, id)
-- 
2.34.1



[PATCH v9 4/4] drm/i915: Use device wedged event

2024-11-15 Thread Raag Jadav
Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index f42f21632306..18cf50a1e84d 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1418,6 +1418,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
if (!test_bit(I915_WEDGED, >->reset.flags))
kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+   else
+   drm_dev_wedged_event(>->i915->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
-- 
2.34.1



Re: [PATCH v10 1/4] drm: Introduce device wedged event

2024-12-02 Thread Raag Jadav
On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> Hi Raag,
> 
> Em 28/11/2024 12:37, Raag Jadav escreveu:
> > Introduce device wedged event, which notifies userspace of 'wedged'
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected and has become unrecoverable from driver context. Purpose of
> > this implementation is to provide drivers a generic way to recover with
> > the help of userspace intervention without taking any drastic measures
> > in the driver.
> > 
> > A 'wedged' device is basically a dead device that needs attention. The
> > uevent is the notification that is sent to userspace along with a hint
> > about what could possibly be attempted to recover the device and bring
> > it back to usable state. Different drivers may have different ideas of
> > a 'wedged' device depending on their hardware implementation, and hence
> > the vendor agnostic nature of the event. It is up to the drivers to
> > decide when they see the need for device recovery and how they want to
> > recover from the available methods.
> > 
> 
> Thank you for your work. Do you think you can add the optional PID
> parameter, as the PID of the app that caused the reset? For SteamOS use case
> it has been proved to be useful to kill the fault app as well. If the reset
> was caused by a kthread, no PID can be provided hence it's an optional
> parameter.

Hmm, I'm not sure if it really fits here since it doesn't seem like
a generic usecase.

I'd still be open for it if found useful by the drivers but perhaps
as an extended feature in a separate series.

Raag


Re: [PATCH v9 1/4] drm: Introduce device wedged event

2024-11-25 Thread Raag Jadav
On Fri, Nov 22, 2024 at 11:09:32AM +0100, Christian König wrote:
> Am 22.11.24 um 08:07 schrieb Raag Jadav:
> > On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
> > > On 15/11/24 10:37, Raag Jadav wrote:
> > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > useful especially in cases where the device is no longer operating as
> > > > expected and has become unrecoverable from driver context. Purpose of
> > > > this implementation is to provide drivers a generic way to recover with
> > > > the help of userspace intervention without taking any drastic measures
> > > > in the driver.
> > > > 
> > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > uevent is the notification that is sent to userspace along with a hint
> > > > about what could possibly be attempted to recover the device and bring
> > > > it back to usable state. Different drivers may have different ideas of
> > > > a 'wedged' device depending on their hardware implementation, and hence
> > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > decide when they see the need for recovery and how they want to recover
> > > > from the available methods.
> > > > 
> > > > Prerequisites
> > > > -
> > > > 
> > > > The driver, before opting for recovery, needs to make sure that the
> > > > 'wedged' device doesn't harm the system as a whole by taking care of the
> > > > prerequisites. Necessary actions must include disabling DMA to system
> > > > memory as well as any communication channels with other devices. 
> > > > Further,
> > > > the driver must ensure that all dma_fences are signalled and any device
> > > > state that the core kernel might depend on are cleaned up. Once the 
> > > > event
> > > > is sent, the device must be kept in 'wedged' state until the recovery is
> > > > performed. New accesses to the device (IOCTLs) should be blocked,
> > > > preferably with an error code that resembles the type of failure the
> > > > device has encountered. This will signify the reason for wegeding which
> > > > can be reported to the application if needed.
> > > should we even drop the mmaps we created?
> > Whatever is required for a clean recovery, yes.
> > 
> > Although how would this play out? Do we risk loosing display?
> > Or any other possible side-effects?
> 
> Before sending a wedge event all DMA transfers of the device have to be
> blocked.
> 
> So yes, all display, mmap() and file descriptor connections you had with the
> device would need to be re-created.

Does it mean we'd have to rely on userspace to unmap()?

Raag


Re: [PATCH v9 1/4] drm: Introduce device wedged event

2024-11-26 Thread Raag Jadav
On Mon, Nov 25, 2024 at 10:32:42AM +0100, Christian König wrote:
> Am 22.11.24 um 17:02 schrieb Raag Jadav:
> > On Fri, Nov 22, 2024 at 11:09:32AM +0100, Christian König wrote:
> > > Am 22.11.24 um 08:07 schrieb Raag Jadav:
> > > > On Mon, Nov 18, 2024 at 08:26:37PM +0530, Aravind Iddamsetty wrote:
> > > > > On 15/11/24 10:37, Raag Jadav wrote:
> > > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > > useful especially in cases where the device is no longer operating 
> > > > > > as
> > > > > > expected and has become unrecoverable from driver context. Purpose 
> > > > > > of
> > > > > > this implementation is to provide drivers a generic way to recover 
> > > > > > with
> > > > > > the help of userspace intervention without taking any drastic 
> > > > > > measures
> > > > > > in the driver.
> > > > > > 
> > > > > > A 'wedged' device is basically a dead device that needs attention. 
> > > > > > The
> > > > > > uevent is the notification that is sent to userspace along with a 
> > > > > > hint
> > > > > > about what could possibly be attempted to recover the device and 
> > > > > > bring
> > > > > > it back to usable state. Different drivers may have different ideas 
> > > > > > of
> > > > > > a 'wedged' device depending on their hardware implementation, and 
> > > > > > hence
> > > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > > decide when they see the need for recovery and how they want to 
> > > > > > recover
> > > > > > from the available methods.
> > > > > > 
> > > > > > Prerequisites
> > > > > > -
> > > > > > 
> > > > > > The driver, before opting for recovery, needs to make sure that the
> > > > > > 'wedged' device doesn't harm the system as a whole by taking care 
> > > > > > of the
> > > > > > prerequisites. Necessary actions must include disabling DMA to 
> > > > > > system
> > > > > > memory as well as any communication channels with other devices. 
> > > > > > Further,
> > > > > > the driver must ensure that all dma_fences are signalled and any 
> > > > > > device
> > > > > > state that the core kernel might depend on are cleaned up. Once the 
> > > > > > event
> > > > > > is sent, the device must be kept in 'wedged' state until the 
> > > > > > recovery is
> > > > > > performed. New accesses to the device (IOCTLs) should be blocked,
> > > > > > preferably with an error code that resembles the type of failure the
> > > > > > device has encountered. This will signify the reason for wegeding 
> > > > > > which
> > > > > > can be reported to the application if needed.
> > > > > should we even drop the mmaps we created?
> > > > Whatever is required for a clean recovery, yes.
> > > > 
> > > > Although how would this play out? Do we risk loosing display?
> > > > Or any other possible side-effects?
> > > Before sending a wedge event all DMA transfers of the device have to be
> > > blocked.
> > > 
> > > So yes, all display, mmap() and file descriptor connections you had with 
> > > the
> > > device would need to be re-created.
> > Does it mean we'd have to rely on userspace to unmap()?
> 
> Yes and no :)
> 
> The handling should be similar to how hotplug is handled. E.g. the device
> becomes inaccessible by normal applications all mappings become invalid.

Isn't that just unbind (which is already part of recovery)?

> But we don't send a SIGBUS or similar on access, instead all mappings
> redirected to a dummy page which basically shallows all writes and gives
> undefined data on reads.
> 
> On IOCTLs the applications should get an error code and eventually restart
> or at least unmap all their mappings.

Thanks for the detailed explanation.

Rethinking about this, the criteria set for prerequisites is to not do
anything that could possibly harm the system. So I think the important
question is,

with fences signalled and ioctls already blocked, is live mmap on a wedged
device capable of producing harmful behaviour or unintended side-effects
(atleast until the application has the opportunity to unmap() as part of
recovery)?

Raag


Re: [PATCH v10 1/4] drm: Introduce device wedged event

2024-12-03 Thread Raag Jadav
On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > Hi Raag,
> > 
> > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected and has become unrecoverable from driver context. Purpose of
> > > this implementation is to provide drivers a generic way to recover with
> > > the help of userspace intervention without taking any drastic measures
> > > in the driver.
> > > 
> > > A 'wedged' device is basically a dead device that needs attention. The
> > > uevent is the notification that is sent to userspace along with a hint
> > > about what could possibly be attempted to recover the device and bring
> > > it back to usable state. Different drivers may have different ideas of
> > > a 'wedged' device depending on their hardware implementation, and hence
> > > the vendor agnostic nature of the event. It is up to the drivers to
> > > decide when they see the need for device recovery and how they want to
> > > recover from the available methods.
> > > 
> > 
> > Thank you for your work. Do you think you can add the optional PID
> > parameter, as the PID of the app that caused the reset? For SteamOS use case
> > it has been proved to be useful to kill the fault app as well. If the reset
> > was caused by a kthread, no PID can be provided hence it's an optional
> > parameter.
> 
> Hmm, I'm not sure if it really fits here since it doesn't seem like
> a generic usecase.
> 
> I'd still be open for it if found useful by the drivers but perhaps
> as an extended feature in a separate series.

What do you think Chris, are we good to go with v10?

Raag


Re: [PATCH v10 1/4] drm: Introduce device wedged event

2024-12-04 Thread Raag Jadav
+ misc maintainers

On Tue, Dec 03, 2024 at 11:18:00AM +0100, Christian König wrote:
> Am 03.12.24 um 06:00 schrieb Raag Jadav:
> > On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> > > On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > > > Hi Raag,
> > > > 
> > > > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > > > Introduce device wedged event, which notifies userspace of 'wedged'
> > > > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > > > useful especially in cases where the device is no longer operating as
> > > > > expected and has become unrecoverable from driver context. Purpose of
> > > > > this implementation is to provide drivers a generic way to recover 
> > > > > with
> > > > > the help of userspace intervention without taking any drastic measures
> > > > > in the driver.
> > > > > 
> > > > > A 'wedged' device is basically a dead device that needs attention. The
> > > > > uevent is the notification that is sent to userspace along with a hint
> > > > > about what could possibly be attempted to recover the device and bring
> > > > > it back to usable state. Different drivers may have different ideas of
> > > > > a 'wedged' device depending on their hardware implementation, and 
> > > > > hence
> > > > > the vendor agnostic nature of the event. It is up to the drivers to
> > > > > decide when they see the need for device recovery and how they want to
> > > > > recover from the available methods.
> > > > > 
> > > > Thank you for your work. Do you think you can add the optional PID
> > > > parameter, as the PID of the app that caused the reset? For SteamOS use 
> > > > case
> > > > it has been proved to be useful to kill the fault app as well. If the 
> > > > reset
> > > > was caused by a kthread, no PID can be provided hence it's an optional
> > > > parameter.
> > > Hmm, I'm not sure if it really fits here since it doesn't seem like
> > > a generic usecase.
> > > 
> > > I'd still be open for it if found useful by the drivers but perhaps
> > > as an extended feature in a separate series.
> > What do you think Chris, are we good to go with v10?
> 
> I agree with Andre that the PID and maybe the new DRM client name would be
> really nice to have here.
> 
> We do have that in the device core dump we create, but if an application is
> supervised by daemon for example then that would be really useful.
> 
> On the other hand I think we should merge the documentation and code as is
> and then add the PID/name later on. That is essentially a separate
> discussion.

So how do we proceed, perhaps through misc tree?

Raag


[PATCH v12 3/5] drm/xe: Use device wedged event

2025-02-04 Thread Raag Jadav
This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
 (drm)
ACTION=change
DEVPATH=/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)
v3: Generic implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav 
Reviewed-by: Aravind Iddamsetty 
---
 drivers/gpu/drm/xe/xe_device.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index f30f8f668dee..1cacefcb5afe 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1120,7 +1120,8 @@ static void xe_device_wedged_fini(struct drm_device *drm, 
void *arg)
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * through device wedged uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1150,6 +1151,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
"IOCTLs and executions are blocked. Only a rebind may 
clear the failure\n"
"Please file a _new_ bug report at 
https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n";,
dev_name(xe->drm.dev));
+
+   /* Notify userspace of wedged device */
+   drm_dev_wedged_event(&xe->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
}
 
for_each_gt(gt, xe, id)
-- 
2.34.1



[PATCH v12 1/5] drm: Introduce device wedged event

2025-02-04 Thread Raag Jadav
ngs. This is useful
especially when the driver is unsure about recovery or method is unknown.

 v4: s/drm_dev_wedged/drm_dev_wedged_event
 Use drm_info() (Jani)
 Kernel doc adjustment (Aravind)
 v5: Send recovery method with uevent (Lina)
 v6: Access wedge_recovery_opts[] using helper function (Jani)
 Use snprintf() (Jani)
 v7: Convert recovery helpers into regular functions (Andy, Jani)
 Aesthetic adjustments (Andy)
 Handle invalid recovery method
 v8: Allow sending multiple methods with uevent (Lucas, Michal)
 static_assert() globally (Andy)
 v9: Provide 'none' method for device reset (Christian)
 Provide recovery opts using switch cases
v11: Log device reset (André)

Signed-off-by: Raag Jadav 
Reviewed-by: André Almeida 
---
 drivers/gpu/drm/drm_drv.c | 68 +++
 include/drm/drm_device.h  |  8 +
 include/drm/drm_drv.h |  1 +
 3 files changed, 77 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 3cf440eee8a2..17fc5dc708f4 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,7 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -34,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -498,6 +500,72 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *drm_get_wedge_recovery(unsigned int opt)
+{
+   switch (BIT(opt)) {
+   case DRM_WEDGE_RECOVERY_NONE:
+   return "none";
+   case DRM_WEDGE_RECOVERY_REBIND:
+   return "rebind";
+   case DRM_WEDGE_RECOVERY_BUS_RESET:
+   return "bus-reset";
+   default:
+   return NULL;
+   }
+}
+
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method\(s) of choice will be sent in the uevent environment as
+ * ``WEDGED=[,..,]`` in order of less to more side-effects.
+ * If caller is unsure about recovery or @method is unknown (0),
+ * ``WEDGED=unknown`` will be sent instead.
+ *
+ * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
+ * details.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+   const char *recovery = NULL;
+   unsigned int len, opt;
+   /* Event string length up to 28+ characters with available methods */
+   char event_string[32];
+   char *envp[] = { event_string, NULL };
+
+   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+   for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
+   recovery = drm_get_wedge_recovery(opt);
+   if (drm_WARN_ONCE(dev, !recovery, "invalid recovery method 
%u\n", opt))
+   break;
+
+   len += scnprintf(event_string + len, sizeof(event_string), 
"%s,", recovery);
+   }
+
+   if (recovery)
+   /* Get rid of trailing comma */
+   event_string[len - 1] = '\0';
+   else
+   /* Caller is unsure about recovery, do the best we can at this 
point. */
+   snprintf(event_string, sizeof(event_string), "%s", 
"WEDGED=unknown");
+
+   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
+"but recovered through reset" : "needs recovery");
+
+   return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..6ea54a578cda 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,14 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_NONEBIT(0)  /* optional telemetry 
collection */
+#define DRM_WEDGE_RECOVERY_REBIND  BIT(1)  /* unbind + bind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET   BIT(2)  /* unbind + reset bus device + 
bind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 9952b846c170..a43d707b5f36 100644
--- a/include

[PATCH v12 2/5] drm/doc: Document device wedged event

2025-02-04 Thread Raag Jadav
Add documentation for device wedged event in a new "Device wedging"
chapter. This describes basic definitions, prerequisites and consumer
expectations along with an example.

 v8: Improve introduction (Christian, Rodrigo)
 v9: Add prerequisites section (Christian)
v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
v11: Reference wedged event in device reset chapter (André)
v12: Refine consumer expectations and terminologies (Xaver, Pekka)

Signed-off-by: Raag Jadav 
Reviewed-by: Christian König 
Reviewed-by: André Almeida 
---
 Documentation/gpu/drm-uapi.rst | 116 -
 1 file changed, 113 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index b75cc9a70d1f..69f72e71a96e 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -371,9 +371,119 @@ Reporting causes of resets
 
 Apart from propagating the reset through the stack so apps can recover, it's
 really useful for driver developers to learn more about what caused the reset 
in
-the first place. DRM devices should make use of devcoredump to store relevant
-information about the reset, so this information can be added to user bug
-reports.
+the first place. For this, drivers can make use of devcoredump to store 
relevant
+information about the reset and send device wedged event with ``none`` recovery
+method (as explained in "Device Wedging" chapter) to notify userspace, so this
+information can be collected and added to user bug reports.
+
+Device Wedging
+==
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem), which notifies userspace of 'wedged'
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected and has
+become unrecoverable from driver context. Purpose of this implementation is to
+provide drivers a generic way to recover the device with the help of userspace
+intervention, without taking any drastic measures (like resetting or
+re-enumerating the full bus, on which the underlying physical device is 
sitting)
+in the driver.
+
+A 'wedged' device is basically a device that is declared dead by the driver
+after exhausting all possible attempts to recover it from driver context. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device from userspace and bring
+it back to usable state. Different drivers may have different ideas of a
+'wedged' device depending on hardware implementation of the underlying physical
+device, and hence the vendor agnostic nature of the event. It is up to the
+drivers to decide when they see the need for device recovery and how they want
+to recover from the available methods.
+
+Driver prerequisites
+
+
+The driver, before opting for recovery, needs to make sure that the 'wedged'
+device doesn't harm the system as a whole by taking care of the prerequisites.
+Necessary actions must include disabling DMA to system memory as well as any
+communication channels with other devices. Further, the driver must ensure
+that all dma_fences are signalled and any device state that the core kernel
+might depend on is cleaned up. All existing mmaps should be invalidated and
+page faults should be redirected to a dummy page. Once the event is sent, the
+device must be kept in 'wedged' state until the recovery is performed. New
+accesses to the device (IOCTLs) should be rejected, preferably with an error
+code that resembles the type of failure the device has encountered. This will
+signify the reason for wedging, which can be reported to the application if
+needed.
+
+Recovery
+
+
+Current implementation defines three recovery methods, out of which, drivers
+can use any one, multiple or none. Method(s) of choice will be sent in the
+uevent environment as ``WEDGED=[,..,]`` in order of less to
+more side-effects. If driver is unsure about recovery or method is unknown
+(like soft/hard system reboot, firmware flashing, physical device replacement
+or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
+will be sent instead.
+
+Userspace consumers can parse this event and attempt recovery as per the
+following expectations.
+
+=== 
+Recovery method Consumer expectations
+=== 
+noneoptional telemetry collection
+rebind  unbind + bind driver
+bus-reset   unbind + bus reset/re-enumeration + bind
+unknown consumer policy
+=== 
+
+The only exception to this is ``WEDGED=none``, which signifies that the device

[PATCH v12 5/5] drm/amdgpu: Use device wedged event

2025-02-04 Thread Raag Jadav
From: André Almeida 

Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry
capture.

In the future we might want to report a recovery method if the reset didn't
succeed.

Acked-by: Shashank Sharma 
Signed-off-by: André Almeida 
Reviewed-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d100bb7a137c..9e7219bff0c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6116,6 +6116,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
 
atomic_set(&adev->reset_domain->reset_res, r);
+
+   if (!r)
+   drm_dev_wedged_event(adev_to_drm(adev), 
DRM_WEDGE_RECOVERY_NONE);
+
return r;
 }
 
-- 
2.34.1



[PATCH v12 0/5] Introduce DRM device wedged event

2025-02-04 Thread Raag Jadav
This series introduces device wedged event in DRM subsystem and uses it
in xe, i915 and amdgpu drivers. Detailed description in commit message.

This was earlier attempted as xe specific uevent in v1 and v2 on [1].
Similar work by André Almeida on [2].
Wedged event support for amdgpu by André Almeida on [3].
Consumer implementation by Xaver Hugl on [4].

 [1] https://patchwork.freedesktop.org/series/136909/
 [2] 
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealm...@igalia.com/
 [3] 
https://lore.kernel.org/dri-devel/20241216162104.58241-1-andrealm...@igalia.com/
 [4] https://invent.kde.org/plasma/kwin/-/merge_requests/7027

 v2: Change authorship to Himal (Aravind)
 Add uevent for all device wedged cases (Aravind)

 v3: Generic implementation in DRM subsystem (Lucas)

 v4: s/drm_dev_wedged/drm_dev_wedged_event
 Use drm_info() (Jani)
 Kernel doc adjustment (Aravind)
 Change authorship to Raag (Aravind)

 v5: Send recovery method with uevent (Lina)
 Expose supported recovery methods via sysfs (Lucas)

 v6: Access wedge_recovery_opts[] using helper function (Jani)
 Use snprintf() (Jani)

 v7: Convert recovery helpers into regular functions (Andy, Jani)
 Aesthetic adjustments (Andy)
 Handle invalid recovery method
 Add documentation to drm-uapi.rst (Sima)

 v8: Drop sysfs and allow sending multiple methods with uevent (Lucas, Michal)
 Improve documentation (Christian, Rodrigo)
 static_assert() globally (Andy)

 v9: Document prerequisites section (Christian)
 Provide 'none' method for device reset (Christian)
 Provide recovery opts using switch cases

v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)

v11: Log device reset (André)
 Reference wedged event in device reset chapter (André)
 Wedged event support for amdgpu (André)

v12: Refine consumer expectations and terminologies (Xaver, Pekka)

André Almeida (1):
  drm/amdgpu: Use device wedged event

Raag Jadav (4):
  drm: Introduce device wedged event
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst | 116 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   4 +
 drivers/gpu/drm/drm_drv.c  |  68 
 drivers/gpu/drm/i915/gt/intel_reset.c  |   3 +
 drivers/gpu/drm/xe/xe_device.c |   7 +-
 include/drm/drm_device.h   |   8 ++
 include/drm/drm_drv.h  |   1 +
 7 files changed, 203 insertions(+), 4 deletions(-)

-- 
2.34.1



[PATCH v12 4/5] drm/i915: Use device wedged event

2025-02-04 Thread Raag Jadav
Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav 
Reviewed-by: Aravind Iddamsetty 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index aae5a081cb53..d6dc12fd87c1 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1422,6 +1422,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
if (!test_bit(I915_WEDGED, >->reset.flags))
kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+   else
+   drm_dev_wedged_event(>->i915->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
-- 
2.34.1



Re: [PATCH v10 1/4] drm: Introduce device wedged event

2024-12-12 Thread Raag Jadav
On Wed, Dec 11, 2024 at 06:14:12PM +0100, Maxime Ripard wrote:
> On Wed, Dec 04, 2024 at 01:17:17PM +0200, Raag Jadav wrote:
> > + misc maintainers
> > 
> > On Tue, Dec 03, 2024 at 11:18:00AM +0100, Christian König wrote:
> > > Am 03.12.24 um 06:00 schrieb Raag Jadav:
> > > > On Mon, Dec 02, 2024 at 10:07:59AM +0200, Raag Jadav wrote:
> > > > > On Fri, Nov 29, 2024 at 10:40:14AM -0300, André Almeida wrote:
> > > > > > Hi Raag,
> > > > > > 
> > > > > > Em 28/11/2024 12:37, Raag Jadav escreveu:
> > > > > > > Introduce device wedged event, which notifies userspace of 
> > > > > > > 'wedged'
> > > > > > > (hanged/unusable) state of the DRM device through a uevent. This 
> > > > > > > is
> > > > > > > useful especially in cases where the device is no longer 
> > > > > > > operating as
> > > > > > > expected and has become unrecoverable from driver context. 
> > > > > > > Purpose of
> > > > > > > this implementation is to provide drivers a generic way to 
> > > > > > > recover with
> > > > > > > the help of userspace intervention without taking any drastic 
> > > > > > > measures
> > > > > > > in the driver.
> > > > > > > 
> > > > > > > A 'wedged' device is basically a dead device that needs 
> > > > > > > attention. The
> > > > > > > uevent is the notification that is sent to userspace along with a 
> > > > > > > hint
> > > > > > > about what could possibly be attempted to recover the device and 
> > > > > > > bring
> > > > > > > it back to usable state. Different drivers may have different 
> > > > > > > ideas of
> > > > > > > a 'wedged' device depending on their hardware implementation, and 
> > > > > > > hence
> > > > > > > the vendor agnostic nature of the event. It is up to the drivers 
> > > > > > > to
> > > > > > > decide when they see the need for device recovery and how they 
> > > > > > > want to
> > > > > > > recover from the available methods.
> > > > > > > 
> > > > > > Thank you for your work. Do you think you can add the optional PID
> > > > > > parameter, as the PID of the app that caused the reset? For SteamOS 
> > > > > > use case
> > > > > > it has been proved to be useful to kill the fault app as well. If 
> > > > > > the reset
> > > > > > was caused by a kthread, no PID can be provided hence it's an 
> > > > > > optional
> > > > > > parameter.
> > > > > Hmm, I'm not sure if it really fits here since it doesn't seem like
> > > > > a generic usecase.
> > > > > 
> > > > > I'd still be open for it if found useful by the drivers but perhaps
> > > > > as an extended feature in a separate series.
> > > > What do you think Chris, are we good to go with v10?
> > > 
> > > I agree with Andre that the PID and maybe the new DRM client name would be
> > > really nice to have here.
> > > 
> > > We do have that in the device core dump we create, but if an application 
> > > is
> > > supervised by daemon for example then that would be really useful.
> > > 
> > > On the other hand I think we should merge the documentation and code as is
> > > and then add the PID/name later on. That is essentially a separate
> > > discussion.
> > 
> > So how do we proceed, perhaps through misc tree?
> 
> Provided it follows the usual rules (ie, Reviewed-by, open source
> userspace tools using it if it's a new uAPI, etc.) then yeah, we can
> merge it through drm-misc.

My understanding is that the core patches are to be reviewed by the
maintainers? The rest of it (patch 2 to 4) seems already reviewed.

We have a documented example (patch 2) with udev rule and a reference
script which can be setup to get this working. Does that qualify as
a consumer?

Raag


Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

2024-12-16 Thread Raag Jadav
On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote:
> Em 16/12/2024 10:10, Christian König escreveu:
> > Am 16.12.24 um 14:04 schrieb André Almeida:
> > > Em 16/12/2024 07:38, Lazar, Lijo escreveu:
> > > > 
> > > > 
> > > > On 12/16/2024 3:48 PM, Christian König wrote:
> > > > > Am 13.12.24 um 16:56 schrieb André Almeida:
> > > > > > Em 13/12/2024 11:36, Raag Jadav escreveu:
> > > > > > > On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> > > > > > > > Hi Christian,
> > > > > > > > 
> > > > > > > > Em 13/12/2024 04:34, Christian König escreveu:
> > > > > > > > > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > > > > > > > > Use DRM's device wedged event to notify userspace that a 
> > > > > > > > > > reset had
> > > > > > > > > > happened. For now, only use `none` method meant for 
> > > > > > > > > > telemetry
> > > > > > > > > > capture.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: André Almeida 
> > > > > > > > > > ---
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > > > > > > > > >     1 file changed, 3 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > index 96316111300a..19e1a5493778 100644
> > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > > > > > > > > amdgpu_device *adev,
> > > > > > > > > >     dev_info(adev->dev, "GPU
> > > > > > > > > > reset end with ret = %d\n", r);
> > > > > > > > > > atomic_set(&adev->reset_domain->reset_res, r);
> > > > > > > > > > +
> > > > > > > > > > +    drm_dev_wedged_event(adev_to_drm(adev),
> > > > > > > > > > DRM_WEDGE_RECOVERY_NONE);
> > > > > > > > > 
> > > > > > > > > That looks really good in general. I would just make the
> > > > > > > > > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Why depend or `r`? A reset was triggered anyway, regardless of 
> > > > > > > > the
> > > > > > > > success
> > > > > > > > of it, shouldn't we tell userspace?
> > > > > > > 
> > > > > > > A failed reset would perhaps result in wedging,
> > > > > > > atleast that's how i915
> > > > > > > is handling it.
> > > > > > > 
> > > > > > 
> > > > > > Right, and I think this raises the question of what wedge recovery
> > > > > > method should I add for amdgpu... Christian?
> > > > > > 
> > > > > 
> > > > > In theory a rebind should be enough to get the device going again, our
> > > > > BOCO does a bus reset on driver load anyway.
> > > > > 
> > > > 
> > > > The behavior varies between SOCs. In certain ones, if driver reset
> > > > fails, that means it's really in a bad state and it would need system
> > > > reboot.
> > > > 
> > > 
> > > Is this documented somewhere? Then I could even add a
> > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

This was present in drafts v5 through v7 but later dropped with the
understanding that it is unwise to let a drm device make system level
decisions and rather have something like WEDGED=unknown to let admin/user
decide how to deal with it.

https://patchwork.freedesktop.org/series/138069/

> > Not publicly as far as I know. But indeed a driver reset has basically
> > the same chan

Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

2024-12-16 Thread Raag Jadav
On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> Hi Christian,
> 
> Em 13/12/2024 04:34, Christian König escreveu:
> > Am 12.12.24 um 20:09 schrieb André Almeida:
> > > Use DRM's device wedged event to notify userspace that a reset had
> > > happened. For now, only use `none` method meant for telemetry
> > > capture.
> > > 
> > > Signed-off-by: André Almeida 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/ drm/amd/amdgpu/amdgpu_device.c
> > > index 96316111300a..19e1a5493778 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct
> > > amdgpu_device *adev,
> > >   dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> > >   atomic_set(&adev->reset_domain->reset_res, r);
> > > +
> > > +    drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > 
> > That looks really good in general. I would just make the
> > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
> > 
> 
> Why depend or `r`? A reset was triggered anyway, regardless of the success
> of it, shouldn't we tell userspace?

A failed reset would perhaps result in wedging, atleast that's how i915
is handling it.

Raag


Re: [PATCH v10 1/4] drm: Introduce device wedged event

2024-12-16 Thread Raag Jadav
On Thu, Dec 12, 2024 at 03:31:01PM -0300, André Almeida wrote:
> Hi Raag,
> 
> Thank you for your patch.
> 
> Em 28/11/2024 12:37, Raag Jadav escreveu:
> 
> [...]
> 
> > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > +{
> > +   const char *recovery = NULL;
> > +   unsigned int len, opt;
> > +   /* Event string length up to 28+ characters with available methods */
> > +   char event_string[32];
> > +   char *envp[] = { event_string, NULL };
> > +
> > +   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > +
> > +   for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
> > +   recovery = drm_get_wedge_recovery(opt);
> > +   if (drm_WARN(dev, !recovery, "device wedged, invalid recovery 
> > method %u\n", opt))
> > +   break;
> > +
> > +   len += scnprintf(event_string + len, sizeof(event_string), 
> > "%s,", recovery);
> > +   }
> > +
> > +   if (recovery)
> > +   /* Get rid of trailing comma */
> > +   event_string[len - 1] = '\0';
> > +   else
> > +   /* Caller is unsure about recovery, do the best we can at this 
> > point. */
> > +   snprintf(event_string, sizeof(event_string), "%s", 
> > "WEDGED=unknown");
> > +
> > +   drm_info(dev, "device wedged, needs recovery\n");
> 
> As documented in the commit message "No explicit device recovery is expected
> from the consumer in this case", I think this should be like this:
> 
> if (method != DRM_WEDGE_RECOVERY_NONE)
> drm_info(dev, "device wedged, needs recovery\n");
> 
> and maybe a note like this:
> 
> else
> drm_info(dev, "device reseted, but managed to recover\n");

Or perhaps

drm_info(dev, "device wedged, but recovered through reset\n");

> Either way, this patch is:
> 
> Reviewed-by: André Almeida 

Thanks for the review.

Raag


Re: [PATCH v10 2/4] drm/doc: Document device wedged event

2024-12-17 Thread Raag Jadav
On Thu, Dec 12, 2024 at 03:50:29PM -0300, André Almeida wrote:
> Em 28/11/2024 12:37, Raag Jadav escreveu:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions, prerequisites and consumer
> > expectations along with an example.
> > 
> >   v8: Improve documentation (Christian, Rodrigo)
> >   v9: Add prerequisites section (Christian)
> > v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
> > 
> > Signed-off-by: Raag Jadav 
> > Reviewed-by: Christian König 
> > ---
> >   Documentation/gpu/drm-uapi.rst | 112 -
> >   1 file changed, 109 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index b75cc9a70d1f..da2927cde53d 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -371,9 +371,115 @@ Reporting causes of resets
> 
> I think it's a good idea to add a note about "device wedged event" in the
> section "Device reset > Kernel Mode Driver" since the idea is to explain
> what should kernel developer add to their kernel drivers to be used when a
> device resets.

Wouldn't that be just a repetition of below? Also, I'm not sure if it's a hard
requirement.

> >   Apart from propagating the reset through the stack so apps can recover, 
> > it's
> >   really useful for driver developers to learn more about what caused the 
> > reset in
> > -the first place. DRM devices should make use of devcoredump to store 
> > relevant
> > -information about the reset, so this information can be added to user bug
> > -reports.
> > +the first place. For this, drivers can make use of devcoredump to store 
> > relevant
> > +information about the reset and send device wedged event without recovery 
> > method
> 
> and send a device wedged event with recovery method as "none" (as explained
> in the chapter "Device wedging")

Sure.

> > +(as explained in next chapter) to notify userspace, so this information 
> > can be
> > +collected and added to user bug reports.
> > +
> 
> With those changes applied:
> 
> Reviewed-by: André Almeida 

Thanks.

Raag


Re: [PATCH v10 0/4] Introduce DRM device wedged event

2025-01-23 Thread Raag Jadav
On Tue, Jan 21, 2025 at 01:59:47AM +0100, Xaver Hugl wrote:
> Hi,
> 
> I experimented with using this in KWin, and
> https://invent.kde.org/plasma/kwin/-/merge_requests/7027/diffs?commit_id=6da40f1b9e2bc94615a436de4778880cee16f940
> makes it fall back to a software renderer when a rebind is required to
> recover the GPU.
> Making it also survive the rebind properly is more challenging
> (current version of the MR doesn't do it for you and just crashes if
> you do it with a udev rule or manually), but it's doable - and not a
> problem of the API.
> 
> I'd really like to have the PID of the client that triggered the GPU
> reset, so that we can kill it if multiple resets are triggered in a
> row (or switch to software rendering if it's KWin itself) and show a
> user-friendly notification about why their app(s) crashed, but that
> can be added later.

Excellent! While we have our consumer implementation in progress, it's
always good to have wider adoption.

Thank you for your contribution.

Raag


Re: [PATCH v10 2/4] drm/doc: Document device wedged event

2025-01-23 Thread Raag Jadav
On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > +It is the responsibility of the consumer to make sure that the device or
> > +its resources are not in use by any process before attempting recovery.
> I'm not convinced this is actually doable in practice, outside of
> killing all apps that aren't the one trying to recover the GPU.
> Is this just about not crashing those processes if they don't handle
> GPU hotunplugs well, about leaks, or something else?

Correct, all of it. And since the compositor is in charge of device resources,
this way it atleast has the opportunity to recover the device and recreate
context without all the userspace violence.

I'm not entirely aware of its feasibility though, perhaps something for the
consumers to experiment.

> > +With IOCTLs blocked and device already 'wedged', all device memory should
> > +be unmapped and file descriptors should be closed to prevent leaks.
> Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> hotplug with matching "remove" and "add" udev events. As long as the
> application cleans up resources related to the device when it receives
> the event, there should be no leaks with a normal hotunplug... Is this
> different enough that we can't have the same expectations?

The thing about "remove" event is that it is generated *after* we opt for an
unbind, and at that point it might be already too late if userspace doesn't
get enough time to clean things up while the device is removed with a live
client resulting in unknown consequences.

The idea here is to clean things up *before* we opt for an unbind leaving
no room for side effects.

Raag


[PATCH v11 1/5] drm: Introduce device wedged event

2025-01-24 Thread Raag Jadav
 Handle invalid method cases
 v8: Allow sending multiple methods with uevent (Lucas, Michal)
 static_assert() globally (Andy)
 v9: Provide 'none' method for device reset (Christian)
 Provide recovery opts using switch cases
v11: Log device reset (André)

Signed-off-by: Raag Jadav 
Reviewed-by: André Almeida 
---
 drivers/gpu/drm/drm_drv.c | 68 +++
 include/drm/drm_device.h  |  8 +
 include/drm/drm_drv.h |  1 +
 3 files changed, 77 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 3cf440eee8a2..9eb1f236a03f 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,7 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -34,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -498,6 +500,72 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *drm_get_wedge_recovery(unsigned int opt)
+{
+   switch (BIT(opt)) {
+   case DRM_WEDGE_RECOVERY_NONE:
+   return "none";
+   case DRM_WEDGE_RECOVERY_REBIND:
+   return "rebind";
+   case DRM_WEDGE_RECOVERY_BUS_RESET:
+   return "bus-reset";
+   default:
+   return NULL;
+   }
+}
+
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method\(s) of choice will be sent in the uevent environment as
+ * ``WEDGED=[,]`` in order of less to more side-effects. If
+ * caller is unsure about recovery or @method is unknown (0), 
``WEDGED=unknown``
+ * will be sent instead.
+ *
+ * Refer to 'Device Wedging' chapter in Documentation/gpu/drm-uapi.rst for more
+ * details.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+   const char *recovery = NULL;
+   unsigned int len, opt;
+   /* Event string length up to 28+ characters with available methods */
+   char event_string[32];
+   char *envp[] = { event_string, NULL };
+
+   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+   for_each_set_bit(opt, &method, BITS_PER_TYPE(method)) {
+   recovery = drm_get_wedge_recovery(opt);
+   if (drm_WARN_ONCE(dev, !recovery, "invalid recovery method 
%u\n", opt))
+   break;
+
+   len += scnprintf(event_string + len, sizeof(event_string), 
"%s,", recovery);
+   }
+
+   if (recovery)
+   /* Get rid of trailing comma */
+   event_string[len - 1] = '\0';
+   else
+   /* Caller is unsure about recovery, do the best we can at this 
point. */
+   snprintf(event_string, sizeof(event_string), "%s", 
"WEDGED=unknown");
+
+   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
+"but recovered through reset" : "needs recovery");
+
+   return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..6ea54a578cda 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,14 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_NONEBIT(0)  /* optional telemetry 
collection */
+#define DRM_WEDGE_RECOVERY_REBIND  BIT(1)  /* unbind + bind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET   BIT(2)  /* unbind + reset bus device + 
bind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 9952b846c170..a43d707b5f36 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -482,6 +482,7 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
-- 
2.34.1



[PATCH v11 2/5] drm/doc: Document device wedged event

2025-01-24 Thread Raag Jadav
Add documentation for device wedged event in a new 'Device wedging'
chapter. The describes basic definitions, prerequisites and consumer
expectations along with an example.

 v8: Improve documentation (Christian, Rodrigo)
 v9: Add prerequisites section (Christian)
v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
v11: Reference wedged event in device reset section (André)

Signed-off-by: Raag Jadav 
Reviewed-by: Christian König 
Reviewed-by: André Almeida 
---
 Documentation/gpu/drm-uapi.rst | 112 -
 1 file changed, 109 insertions(+), 3 deletions(-)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index b75cc9a70d1f..b1f9e213d71c 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -371,9 +371,115 @@ Reporting causes of resets
 
 Apart from propagating the reset through the stack so apps can recover, it's
 really useful for driver developers to learn more about what caused the reset 
in
-the first place. DRM devices should make use of devcoredump to store relevant
-information about the reset, so this information can be added to user bug
-reports.
+the first place. For this, drivers can make use of devcoredump to store 
relevant
+information about the reset and send device wedged event with ``none`` recovery
+method (as explained in "Device Wedging" chapter) to notify userspace, so this
+information can be collected and added to user bug reports.
+
+Device wedging
+==
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem), which notifies userspace of 'wedged'
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected and
+has become unrecoverable from driver context. Purpose of this implementation
+is to provide drivers a generic way to recover with the help of userspace
+intervention without taking any drastic measures in the driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for device recovery and how they want to recover from the available methods.
+
+Driver prerequisites
+
+
+The driver, before opting for recovery, needs to make sure that the 'wedged'
+device doesn't harm the system as a whole by taking care of the prerequisites.
+Necessary actions must include disabling DMA to system memory as well as any
+communication channels with other devices. Further, the driver must ensure
+that all dma_fences are signalled and any device state that the core kernel
+might depend on is cleaned up. Any existing mmap should be invalidated and
+page faults should be redirected to a dummy page. Once the event is sent, the
+device must be kept in 'wedged' state until the recovery is performed. New
+accesses to the device (IOCTLs) should be blocked, preferably with an error
+code that resembles the type of failure the device has encountered. This will
+signify the reason for wedging, which can be reported to the application if
+needed.
+
+Recovery
+
+
+Current implementation defines three recovery methods, out of which, drivers
+can use any one, multiple or none. Method(s) of choice will be sent in the
+uevent environment as ``WEDGED=[,]`` in order of less to
+more side-effects. If driver is unsure about recovery or method is unknown
+(like soft/hard reboot, firmware flashing, hardware replacement or any other
+procedure which can't be attempted on the fly), ``WEDGED=unknown`` will be
+sent instead.
+
+Userspace consumers can parse this event and attempt recovery as per the
+following expectations.
+
+=== 
+Recovery method Consumer expectations
+=== 
+noneoptional telemetry collection
+rebind  unbind + bind driver
+bus-reset   unbind + reset bus device + bind
+unknown admin/user policy
+=== 
+
+The only exception to this is ``WEDGED=none``, which signifies that the
+device was temporarily 'wedged' at some point but was able to recover using
+device specific methods like reset. No explicit device recovery is expected
+from the consumer in this case, but it can still take additional steps like
+gathering telemetry information (devcoredump, syslog). This is useful
+because the first hang is usually the most criti

[PATCH v11 5/5] drm/amdgpu: Use device wedged event

2025-01-24 Thread Raag Jadav
From: André Almeida 

Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry
capture.

In the future we might want to report a recovery method if the reset didn't
succeed.

Acked-by: Shashank Sharma 
Signed-off-by: André Almeida 
Reviewed-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 36053b3d48b3..83bc7f9faff1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6116,6 +6116,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
 
atomic_set(&adev->reset_domain->reset_res, r);
+
+   if (!r)
+   drm_dev_wedged_event(adev_to_drm(adev), 
DRM_WEDGE_RECOVERY_NONE);
+
return r;
 }
 
-- 
2.34.1



[PATCH v11 3/5] drm/xe: Use device wedged event

2025-01-24 Thread Raag Jadav
This was previously attempted as xe specific reset uevent but dropped
in commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now")
as part of refactoring.

Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place userspace will be notified of wedged device, on
the basis of which, userspace may take respective action to recover
the device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change   
/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
 (drm)
ACTION=change
DEVPATH=/devices/pci:00/:00:01.0/:01:00.0/:02:01.0/:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
Add uevent for all device wedged cases (Aravind)
v3: Generic implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav 
Reviewed-by: Aravind Iddamsetty 
---
 drivers/gpu/drm/xe/xe_device.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index bd6191e1ed3e..98981c0051c1 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -1114,7 +1114,8 @@ static void xe_device_wedged_fini(struct drm_device *drm, 
void *arg)
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * through device wedged uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1144,6 +1145,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
"IOCTLs and executions are blocked. Only a rebind may 
clear the failure\n"
"Please file a _new_ bug report at 
https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n";,
dev_name(xe->drm.dev));
+
+   /* Notify userspace of wedged device */
+   drm_dev_wedged_event(&xe->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
}
 
for_each_gt(gt, xe, id)
-- 
2.34.1



[PATCH v11 4/5] drm/i915: Use device wedged event

2025-01-24 Thread Raag Jadav
Now that we have device wedged event provided by DRM core, make use
of it and support both driver rebind and bus-reset based recovery.
With this in place, userspace will be notified of wedged device on
gt reset failure.

Signed-off-by: Raag Jadav 
Reviewed-by: Aravind Iddamsetty 
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
b/drivers/gpu/drm/i915/gt/intel_reset.c
index 5a625518d1a9..b33007cd1504 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1422,6 +1422,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
if (!test_bit(I915_WEDGED, >->reset.flags))
kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+   else
+   drm_dev_wedged_event(>->i915->drm,
+DRM_WEDGE_RECOVERY_REBIND | 
DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**
-- 
2.34.1



[PATCH v11 0/5] Introduce DRM device wedged event

2025-01-24 Thread Raag Jadav
This series introduces device wedged event in DRM subsystem and uses it
in xe, i915 and amdgpu drivers. Detailed description in commit message.

This was earlier attempted as xe specific uevent in v1 and v2 on [1].
Similar work by André Almeida on [2].
Wedged event support for amdgpu by André Almeida on [3].
Consumer implementation by Xaver Hugl on [4].

 [1] https://patchwork.freedesktop.org/series/136909/
 [2] 
https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealm...@igalia.com/
 [3] 
https://lore.kernel.org/dri-devel/20241216162104.58241-1-andrealm...@igalia.com/
 [4] https://invent.kde.org/plasma/kwin/-/merge_requests/7027

 v2: Change authorship to Himal (Aravind)
 Add uevent for all device wedged cases (Aravind)

 v3: Generic implementation in DRM subsystem (Lucas)

 v4: s/drm_dev_wedged/drm_dev_wedged_event
 Use drm_info() (Jani)
 Kernel doc adjustment (Aravind)
 Change authorship to Raag (Aravind)

 v5: Send recovery method with uevent (Lina)
 Expose supported recovery methods via sysfs (Lucas)

 v6: Access wedge_recovery_opts[] using helper function (Jani)
 Use snprintf() (Jani)

 v7: Convert recovery helpers into regular functions (Andy, Jani)
 Aesthetic adjustments (Andy)
 Handle invalid method cases
 Add documentation to drm-uapi.rst (Sima)

 v8: Drop sysfs and allow sending multiple methods with uevent (Lucas, Michal)
 Improve documentation (Christian, Rodrigo)
 static_assert() globally (Andy)

 v9: Document prerequisites section (Christian)
 Provide 'none' method for device reset (Christian)
 Provide recovery opts using switch cases

v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)

v11: Log device reset (André)
 Reference wedged event in device reset section (André)
 Wedged event support for amdgpu (André)

André Almeida (1):
  drm/amdgpu: Use device wedged event

Raag Jadav (4):
  drm: Introduce device wedged event
  drm/doc: Document device wedged event
  drm/xe: Use device wedged event
  drm/i915: Use device wedged event

 Documentation/gpu/drm-uapi.rst | 112 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   4 +
 drivers/gpu/drm/drm_drv.c  |  68 +
 drivers/gpu/drm/i915/gt/intel_reset.c  |   3 +
 drivers/gpu/drm/xe/xe_device.c |   7 +-
 include/drm/drm_device.h   |   8 ++
 include/drm/drm_drv.h  |   1 +
 7 files changed, 199 insertions(+), 4 deletions(-)

-- 
2.34.1



Re: [PATCH v10 2/4] drm/doc: Document device wedged event

2025-01-28 Thread Raag Jadav
On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> On Wed, 22 Jan 2025 07:22:25 +0200
> Raag Jadav  wrote:
> 
> > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:
> > > > +It is the responsibility of the consumer to make sure that the device 
> > > > or
> > > > +its resources are not in use by any process before attempting 
> > > > recovery.  
> > > I'm not convinced this is actually doable in practice, outside of
> > > killing all apps that aren't the one trying to recover the GPU.
> > > Is this just about not crashing those processes if they don't handle
> > > GPU hotunplugs well, about leaks, or something else?  
> > 
> > Correct, all of it. And since the compositor is in charge of device 
> > resources,
> > this way it atleast has the opportunity to recover the device and recreate
> > context without all the userspace violence.
> 
> Hi Raag,
> 
> sorry, I haven't followed this series, so I wonder, why should
> userspace be part of recovering the device? Why doesn't the kernel
> automatically load a new driver instance with a new DRM device node?

There are things like bus level reset (PCI SBR) and re-enumeration that are
not possible from driver context (or atleast I'm not aware of it), so a new
instance is just as useful/less as the old one.

> Of course userspace needs to deal with stuff suddenly erroring out, and
> destroy existing related resources, then wait for a working device
> to appear and rebuild all state. The kernel driver already needs to
> make the existing open stuff inert and harmless, why does it need an
> acknowledgement from userspace to unbind and re-bind?

Rebind is kind of a stepping stone to the above.

> > I'm not entirely aware of its feasibility though, perhaps something for the
> > consumers to experiment.
> 
> If consumers mean userspace, then no, not reliably. But the kernel can
> do it.

Can you please elaborate or refer to an example?

> I see in the commit message written:
> 
>   "For example, if the driver supports multiple recovery methods,
>   consumers can opt for the suitable one based on policy
>   definition."
> 
> How could consumers know what to do? How can they guess what would be
> enough to recover the device? Isn't that the kernel driver's job to
> know?

Yes, 'WEDGED=' value are the known methods that are expected to work. The
policy is how the consumer can decide which one to opt for depending on the
scenario. For example, the less drastic method could work in most cases, but
you'd probably want to opt for a more drastic method for repeat offences or
perhaps if something more serious is discovered from "optional telemetry
collection".

> (More important for userspace would be know if dmabuf fds remain
> pointing to valid memory retaining its contents or if the contents are
> lost. Userspace cannot tell which device a dmabuf originates from,
> AFAIK, so this would need to be added in the generic dmabuf UAPI.)

Not sure if I understand, perhaps Christian can shed some light here.

>   "Consumers can also choose to have the device available for
>   debugging or additional data collection before performing the
>   recovery."
> 
> Couldn't the wedged driver instance remain detached from the hardware
> while a new driver instance initializes? Then debug data remains until
> the wedged device is fully closed from userspace, or maybe devcore dump
> retains it.
> 
> I presume that WEDGED=none case should retain the debug data somehow as
> well.

Indeed, but it's optional so depends on the driver.

> > > > +With IOCTLs blocked and device already 'wedged', all device memory 
> > > > should
> 
> btw. when I see "blocked" I think of the function call not returning
> yet. But in this patch "blocked" seems to be synonymous for "returns
> an error immediately". Would it be possible to avoid the word "blocked"
> for this?

It is meant as "blocking the access", but fair enough. We can have a quick
fix later on if it's not too big of a concern.

> > > > +be unmapped and file descriptors should be closed to prevent leaks.  
> > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > hotplug with matching "remove" and "add" udev events. As long as the
> > > application cleans up resources related to the device when it receives
> > > the event, there should be no leaks with a normal hotunplug... Is this
> > > different enough that we can't have the same expectations?  
> &g

Re: [PATCH v10 2/4] drm/doc: Document device wedged event

2025-01-29 Thread Raag Jadav
On Tue, Jan 28, 2025 at 01:38:09PM +0200, Pekka Paalanen wrote:
> On Tue, 28 Jan 2025 11:37:53 +0200
> Raag Jadav  wrote:
> 
> > On Mon, Jan 27, 2025 at 12:23:28PM +0200, Pekka Paalanen wrote:
> > > On Wed, 22 Jan 2025 07:22:25 +0200
> > > Raag Jadav  wrote:
> > >   
> > > > On Tue, Jan 21, 2025 at 02:14:56AM +0100, Xaver Hugl wrote:  
> > > > > > +It is the responsibility of the consumer to make sure that the 
> > > > > > device or
> > > > > > +its resources are not in use by any process before attempting 
> > > > > > recovery.
> > > > > I'm not convinced this is actually doable in practice, outside of
> > > > > killing all apps that aren't the one trying to recover the GPU.
> > > > > Is this just about not crashing those processes if they don't handle
> > > > > GPU hotunplugs well, about leaks, or something else?
> > > > 
> > > > Correct, all of it. And since the compositor is in charge of device 
> > > > resources,
> > > > this way it atleast has the opportunity to recover the device and 
> > > > recreate
> > > > context without all the userspace violence.  
> > > 
> > > Hi Raag,
> > > 
> > > sorry, I haven't followed this series, so I wonder, why should
> > > userspace be part of recovering the device? Why doesn't the kernel
> > > automatically load a new driver instance with a new DRM device node?  
> > 
> > There are things like bus level reset (PCI SBR) and re-enumeration that are
> > not possible from driver context (or atleast I'm not aware of it), so a new
> > instance is just as useful/less as the old one.
> 
> Ok, "not possible from driver context" is a key revelation. I wonder if
> starting an overview section with that in the documentation would help
> getting the right mindset.

Not "not possible" in a literal sense, but rather allowing something that
drastic and convoluted that is probably beyond the scope of the driver.

> Did I miss that somewhere?

The first two paragraphs are meant as an introduction. Let me know if
something's not translating so well.

> I thought bus level reset meant resetting the PCI device by some bus
> API. Clearly mistaken, I suppose you mean resetting the whole bus
> including everything on it?

I'm no PCI expert but yes, it is atleast my understanding.

...

> > > (More important for userspace would be know if dmabuf fds remain
> > > pointing to valid memory retaining its contents or if the contents are
> > > lost. Userspace cannot tell which device a dmabuf originates from,
> > > AFAIK, so this would need to be added in the generic dmabuf UAPI.)  
> > 
> > Not sure if I understand, perhaps Christian can shed some light here.
> 
> A system might have multiple GPUs, and one GPU going down may leave all
> the rest working as usual. A Wayland compositor would want to tell the
> difference between still-good and possibly or definitely lost dmabufs
> it received from its clients.

Usually 'DEVNAME=' and 'DEVPATH=' values refer to the device that generates
the event.

> But this is off-topic in this thread I believe, nothing to the series
> at hand.

...

> > > > > > +be unmapped and file descriptors should be closed to prevent 
> > > > > > leaks.
> > > > > Afaiu from a userspace POV, a rebind is just like a GPU hotunplug +
> > > > > hotplug with matching "remove" and "add" udev events. As long as the
> > > > > application cleans up resources related to the device when it receives
> > > > > the event, there should be no leaks with a normal hotunplug... Is this
> > > > > different enough that we can't have the same expectations?
> > > > 
> > > > The thing about "remove" event is that it is generated *after* we opt 
> > > > for an
> > > > unbind, and at that point it might be already too late if userspace 
> > > > doesn't
> > > > get enough time to clean things up while the device is removed with a 
> > > > live
> > > > client resulting in unknown consequences.
> > > > 
> > > > The idea here is to clean things up *before* we opt for an unbind 
> > > > leaving
> > > > no room for side effects.  
> > > 
> > > Something here feels fragile. There should not be a deadline for
> > > userspace to finish cleaning up. What was described for K

Re: [PATCH v12 0/5] Introduce DRM device wedged event

2025-02-12 Thread Raag Jadav
On Tue, Feb 04, 2025 at 12:35:23PM +0530, Raag Jadav wrote:
> This series introduces device wedged event in DRM subsystem and uses it
> in xe, i915 and amdgpu drivers. Detailed description in commit message.
> 
> This was earlier attempted as xe specific uevent in v1 and v2 on [1].
> Similar work by André Almeida on [2].
> Wedged event support for amdgpu by André Almeida on [3].
> Consumer implementation by Xaver Hugl on [4].
> 
>  [1] https://patchwork.freedesktop.org/series/136909/
>  [2] 
> https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealm...@igalia.com/
>  [3] 
> https://lore.kernel.org/dri-devel/20241216162104.58241-1-andrealm...@igalia.com/
>  [4] https://invent.kde.org/plasma/kwin/-/merge_requests/7027

Bump. Anything I can do to move this forward?

Raag


Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info

2025-03-03 Thread Raag Jadav
On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> To notify userspace about which app (if any) made the device get in a
> wedge state, make use of drm_wedge_app_info parameter, filling it with
> the app PID and name.
> 
> Signed-off-by: André Almeida 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  6 +-
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 00b9b87dafd8..e06adf6f34fd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>  
>   atomic_set(&adev->reset_domain->reset_res, r);
>  
> - if (!r)
> - drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE, NULL);
> + if (!r) {
> + struct drm_wedge_app_info aux, *info = NULL;
> +
> + if (job) {
> + struct amdgpu_task_info *ti;
> +
> + ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> + if (ti) {
> + aux.pid = ti->pid;
> + aux.comm = ti->process_name;
> + info = &aux;
> + amdgpu_vm_put_task_info(ti);
> + }
> + }

Is this guaranteed to be guilty app and not some scheduled worker?

Raag


Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-03 Thread Raag Jadav
On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 2/28/25 11:20, Raag Jadav wrote:
> > Cc: Lucas
> > 
> > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > When a device get wedged, it might be caused by a guilty application.
> > > For userspace, knowing which app was the cause can be useful for some
> > > situations, like for implementing a policy, logs or for giving a chance
> > > for the compositor to let the user know what app caused the problem.
> > > This is an optional argument, when `PID=-1` there's no information about
> > > the app caused the problem, or if any app was involved during the hang.
> > > 
> > > Sometimes just the PID isn't enough giving that the app might be already
> > > dead by the time userspace will try to check what was this PID's name,
> > > so to make the life easier also notify what's the app's name in the user
> > > event.
> > > 
> > > Signed-off-by: André Almeida 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  2 +-
> > >   drivers/gpu/drm/drm_drv.c  | 16 +---
> > >   drivers/gpu/drm/i915/gt/intel_reset.c  |  3 ++-
> > >   drivers/gpu/drm/xe/xe_device.c |  3 ++-
> > >   include/drm/drm_device.h   |  8 
> > >   include/drm/drm_drv.h  |  3 ++-
> > >   7 files changed, 29 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 24ba52d76045..00b9b87dafd8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> > > *adev,
> > >   atomic_set(&adev->reset_domain->reset_res, r);
> > >   if (!r)
> > > - drm_dev_wedged_event(adev_to_drm(adev), 
> > > DRM_WEDGE_RECOVERY_NONE);
> > > + drm_dev_wedged_event(adev_to_drm(adev), 
> > > DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   return r;
> > >   }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index ef1b77f1e88f..3ed9cbcab1ad 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat 
> > > amdgpu_job_timedout(struct drm_sched_job *s_job)
> > >   amdgpu_fence_driver_force_completion(ring);
> > >   if (amdgpu_ring_sched_ready(ring))
> > >   drm_sched_start(&ring->sched, 0);
> > > - drm_dev_wedged_event(adev_to_drm(adev), 
> > > DRM_WEDGE_RECOVERY_NONE);
> > > + drm_dev_wedged_event(adev_to_drm(adev), 
> > > DRM_WEDGE_RECOVERY_NONE, NULL);
> > >   dev_err(adev->dev, "Ring %s reset succeeded\n", 
> > > ring->sched.name);
> > >   goto exit;
> > >   }
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index 17fc5dc708f4..48faafd82a99 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned 
> > > int opt)
> > >* drm_dev_wedged_event - generate a device wedged uevent
> > >* @dev: DRM device
> > >* @method: method(s) to be used for recovery
> > > + * @info: optional information about the guilty app
> > >*
> > >* This generates a device wedged uevent for the DRM device specified 
> > > by @dev.
> > >* Recovery @method\(s) of choice will be sent in the uevent 
> > > environment as
> > > @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned 
> > > int opt)
> > >*
> > >* Returns: 0 on success, negative error code otherwise.
> > >*/
> > > -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> > > +  struct drm_wedge_app_info *info)
&g

Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info

2025-03-03 Thread Raag Jadav
On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 2/28/25 11:58, Raag Jadav wrote:
> > On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> > > To notify userspace about which app (if any) made the device get in a
> > > wedge state, make use of drm_wedge_app_info parameter, filling it with
> > > the app PID and name.
> > > 
> > > Signed-off-by: André Almeida 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  6 +-
> > >   2 files changed, 22 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 00b9b87dafd8..e06adf6f34fd 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> > > *adev,
> > >   atomic_set(&adev->reset_domain->reset_res, r);
> > > - if (!r)
> > > - drm_dev_wedged_event(adev_to_drm(adev), 
> > > DRM_WEDGE_RECOVERY_NONE, NULL);
> > > + if (!r) {
> > > + struct drm_wedge_app_info aux, *info = NULL;
> > > +
> > > + if (job) {
> > > + struct amdgpu_task_info *ti;
> > > +
> > > + ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> > > + if (ti) {
> > > + aux.pid = ti->pid;
> > > + aux.comm = ti->process_name;
> > > + info = &aux;
> > > + amdgpu_vm_put_task_info(ti);
> > > + }
> > > + }
> > Is this guaranteed to be guilty app and not some scheduled worker?
> 
> This is how amdgpu decides which app is the guilty one earlier in the code
> as in the print:
> 
>     ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
> 
>     "Process information: process %s pid %d thread %s pid %d\n"
> 
> So I think it's consistent with what the driver thinks it's the guilty
> process.

Sure, but with something like app_info we're kind of hinting to userspace
that an application was _indeed_ involved with reset. Is that also guaranteed?

Is it possible that an application needlessly suffers from a false positive
scenario (reset due to other factors)?

Raag


Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-03 Thread Raag Jadav
Cc: Lucas

On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which app was the cause can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what app caused the problem.
> This is an optional argument, when `PID=-1` there's no information about
> the app caused the problem, or if any app was involved during the hang.
> 
> Sometimes just the PID isn't enough giving that the app might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the app's name in the user
> event.
> 
> Signed-off-by: André Almeida 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  2 +-
>  drivers/gpu/drm/drm_drv.c  | 16 +---
>  drivers/gpu/drm/i915/gt/intel_reset.c  |  3 ++-
>  drivers/gpu/drm/xe/xe_device.c |  3 ++-
>  include/drm/drm_device.h   |  8 
>  include/drm/drm_drv.h  |  3 ++-
>  7 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 24ba52d76045..00b9b87dafd8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   atomic_set(&adev->reset_domain->reset_res, r);
>  
>   if (!r)
> - drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE, NULL);
>  
>   return r;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index ef1b77f1e88f..3ed9cbcab1ad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
> drm_sched_job *s_job)
>   amdgpu_fence_driver_force_completion(ring);
>   if (amdgpu_ring_sched_ready(ring))
>   drm_sched_start(&ring->sched, 0);
> - drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE, NULL);
>   dev_err(adev->dev, "Ring %s reset succeeded\n", 
> ring->sched.name);
>   goto exit;
>   }
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 17fc5dc708f4..48faafd82a99 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   * drm_dev_wedged_event - generate a device wedged uevent
>   * @dev: DRM device
>   * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty app
>   *
>   * This generates a device wedged uevent for the DRM device specified by 
> @dev.
>   * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   *
>   * Returns: 0 on success, negative error code otherwise.
>   */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_app_info *info)
>  {
>   const char *recovery = NULL;
>   unsigned int len, opt;
>   /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> + char *envp[] = { event_string, pid_string, comm_string, NULL };
>  
>   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>  
> @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, 
> unsigned long method)
>   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>"but recovered through reset" : "needs recovery");
>  
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", 
> info->comm);
> + } else {
> + snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> + snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> + }

This is not much use for wedge cases that needs recovery, since at that point
the userspace will need to clean house anyway.

Which leaves us with only 'none' case and perhaps the need for standardizati

Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-13 Thread Raag Jadav
On Wed, Mar 12, 2025 at 06:59:33PM -0300, André Almeida wrote:
> Em 12/03/2025 07:06, Raag Jadav escreveu:
> > On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
> > > On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> > > > Em 01/03/2025 02:53, Raag Jadav escreveu:
> > > > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > > > > Hi Raag,
> > > > > > 
> > > > > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > > > > Cc: Lucas
> > > > > > > 
> > > > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > > > > When a device get wedged, it might be caused by a guilty 
> > > > > > > > application.
> > > > > > > > For userspace, knowing which app was the cause can be useful 
> > > > > > > > for some
> > > > > > > > situations, like for implementing a policy, logs or for giving 
> > > > > > > > a chance
> > > > > > > > for the compositor to let the user know what app caused the 
> > > > > > > > problem.
> > > > > > > > This is an optional argument, when `PID=-1` there's no 
> > > > > > > > information about
> > > > > > > > the app caused the problem, or if any app was involved during 
> > > > > > > > the hang.
> > > > > > > > 
> > > > > > > > Sometimes just the PID isn't enough giving that the app might 
> > > > > > > > be already
> > > > > > > > dead by the time userspace will try to check what was this 
> > > > > > > > PID's name,
> > > > > > > > so to make the life easier also notify what's the app's name in 
> > > > > > > > the user
> > > > > > > > event.
> > > > > > > > 
> > > > > > > > Signed-off-by: André Almeida 
> > > > 
> > > > [...]
> > > > 
> > > > > > > > len = scnprintf(event_string, sizeof(event_string), 
> > > > > > > > "%s", "WEDGED=");
> > > > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device 
> > > > > > > > *dev, unsigned long method)
> > > > > > > > drm_info(dev, "device wedged, %s\n", method == 
> > > > > > > > DRM_WEDGE_RECOVERY_NONE ?
> > > > > > > >  "but recovered through reset" : "needs 
> > > > > > > > recovery");
> > > > > > > > +   if (info) {
> > > > > > > > +   snprintf(pid_string, sizeof(pid_string), 
> > > > > > > > "PID=%u", info->pid);
> > > > > > > > +   snprintf(comm_string, sizeof(comm_string), 
> > > > > > > > "APP=%s", info->comm);
> > > > > > > > +   } else {
> > > > > > > > +   snprintf(pid_string, sizeof(pid_string), "%s", 
> > > > > > > > "PID=-1");
> > > > > > > > +   snprintf(comm_string, sizeof(comm_string), 
> > > > > > > > "%s", "APP=none");
> > > > > > > > +   }
> > > > > > > This is not much use for wedge cases that needs recovery, since 
> > > > > > > at that point
> > > > > > > the userspace will need to clean house anyway.
> > > > > > > 
> > > > > > > Which leaves us with only 'none' case and perhaps the need for 
> > > > > > > standardization
> > > > > > > of "optional telemetry collection".
> > > > > > > 
> > > > > > > Thoughts?
> > > > > > 
> > > > > > I had the feeling that 'none' was already meant to be used for 
> > > > > > that. Do you
> > > > > > think we should move to another naming? Given that we didn't reach 
> > > > > > the merge
> > > > > > window yet we could potentially change that name without much 
> > > > > > damage.
> 

Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-12 Thread Raag Jadav
On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> Em 01/03/2025 02:53, Raag Jadav escreveu:
> > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > Hi Raag,
> > > 
> > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > Cc: Lucas
> > > > 
> > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > When a device get wedged, it might be caused by a guilty application.
> > > > > For userspace, knowing which app was the cause can be useful for some
> > > > > situations, like for implementing a policy, logs or for giving a 
> > > > > chance
> > > > > for the compositor to let the user know what app caused the problem.
> > > > > This is an optional argument, when `PID=-1` there's no information 
> > > > > about
> > > > > the app caused the problem, or if any app was involved during the 
> > > > > hang.
> > > > > 
> > > > > Sometimes just the PID isn't enough giving that the app might be 
> > > > > already
> > > > > dead by the time userspace will try to check what was this PID's name,
> > > > > so to make the life easier also notify what's the app's name in the 
> > > > > user
> > > > > event.
> > > > > 
> > > > > Signed-off-by: André Almeida 
> 
> [...]
> 
> > > > >   len = scnprintf(event_string, sizeof(event_string), "%s", 
> > > > > "WEDGED=");
> > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, 
> > > > > unsigned long method)
> > > > >   drm_info(dev, "device wedged, %s\n", method == 
> > > > > DRM_WEDGE_RECOVERY_NONE ?
> > > > >"but recovered through reset" : "needs recovery");
> > > > > + if (info) {
> > > > > + snprintf(pid_string, sizeof(pid_string), "PID=%u", 
> > > > > info->pid);
> > > > > + snprintf(comm_string, sizeof(comm_string), "APP=%s", 
> > > > > info->comm);
> > > > > + } else {
> > > > > + snprintf(pid_string, sizeof(pid_string), "%s", 
> > > > > "PID=-1");
> > > > > + snprintf(comm_string, sizeof(comm_string), "%s", 
> > > > > "APP=none");
> > > > > + }
> > > > This is not much use for wedge cases that needs recovery, since at that 
> > > > point
> > > > the userspace will need to clean house anyway.
> > > > 
> > > > Which leaves us with only 'none' case and perhaps the need for 
> > > > standardization
> > > > of "optional telemetry collection".
> > > > 
> > > > Thoughts?
> > > 
> > > I had the feeling that 'none' was already meant to be used for that. Do 
> > > you
> > > think we should move to another naming? Given that we didn't reach the 
> > > merge
> > > window yet we could potentially change that name without much damage.
> > 
> > No, I meant thoughts on possible telemetry data that the drivers might
> > think is useful for userspace (along with PID) and can be presented in
> > a vendor agnostic manner (just like wedged event).
> 
> I'm not if I agree that this will only be used for telemetry and for the
> `none` use case. As stated by Xaver, there's use case to know which app
> caused the device to get wedged (like switching to software rendering) and
> to display something for the user after the recovery is done (e.g. "The game
>  stopped working and Plasma has reset").

Sure, but since this information is already available in coredump, I was
hoping to have something like a standardized DRM level coredump with both
vendor specific and agnostic sections, which the drivers can (and hopefully
transition to) use in conjunction with wedged event to provide wider
telemetry and is useful for all wedge cases.

Would that serve the usecase here?

Raag


Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info

2025-03-12 Thread Raag Jadav
On Wed, Mar 12, 2025 at 09:25:08AM +0100, Christian König wrote:
>Am 11.03.25 um 18:13 schrieb Raag Jadav:
>> On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
>>> On Mon, Mar 10, 2025 at 5:54 PM André Almeida  
>>> wrote:
>>>> Em 01/03/2025 03:04, Raag Jadav escreveu:
>>>>> On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
>>>>>> Hi Raag,
>>>>>>
>>>>>> On 2/28/25 11:58, Raag Jadav wrote:
>>>>>>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>>>>>>>> To notify userspace about which app (if any) made the device get in a
>>>>>>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
>>>>>>>> the app PID and name.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida 
>>>>>>>> ---
>>>>>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
>>>>>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  6 +-
>>>>>>>>2 files changed, 22 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 00b9b87dafd8..e06adf6f34fd 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>>> amdgpu_device *adev,
>>>>>>>>atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> -  if (!r)
>>>>>>>> -  drm_dev_wedged_event(adev_to_drm(adev), 
>>>>>>>> DRM_WEDGE_RECOVERY_NONE, NULL);
>>>>>>>> +  if (!r) {
>>>>>>>> +  struct drm_wedge_app_info aux, *info = NULL;
>>>>>>>> +
>>>>>>>> +  if (job) {
>>>>>>>> +  struct amdgpu_task_info *ti;
>>>>>>>> +
>>>>>>>> +  ti = amdgpu_vm_get_task_info_pasid(adev, 
>>>>>>>> job->pasid);
>>>>>>>> +  if (ti) {
>>>>>>>> +  aux.pid = ti->pid;
>>>>>>>> +  aux.comm = ti->process_name;
>>>>>>>> +  info = &aux;
>>>>>>>> +  amdgpu_vm_put_task_info(ti);
>>>>>>>> +  }
>>>>>>>> +  }
>>>>>>> Is this guaranteed to be guilty app and not some scheduled worker?
>>>>>> This is how amdgpu decides which app is the guilty one earlier in the 
>>>>>> code
>>>>>> as in the print:
>>>>>>
>>>>>>  ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
>>>>>>
>>>>>>  "Process information: process %s pid %d thread %s pid %d\n"
>>>>>>
>>>>>> So I think it's consistent with what the driver thinks it's the guilty
>>>>>> process.
>>>>> Sure, but with something like app_info we're kind of hinting to userspace
>>>>> that an application was _indeed_ involved with reset. Is that also 
>>>>> guaranteed?
>>>>>
>>>>> Is it possible that an application needlessly suffers from a false 
>>>>> positive
>>>>> scenario (reset due to other factors)?
>>>>>
>>>> I asked Alex Deucher in IRC about that and yes, there's a chance that
>>>> this is a false positive. However, for the majority of cases this is the
>>>> right app that caused the hang. This is what amdgpu is doing for GL
>>>> robustness as well and devcoredump, so it's very consistent with how
>>>> amdgpu deals with this scenario even if the mechanism is still not perfect.
>>> It's usually the guilty one, but it's not guaranteed.  For example,
>>> say you have a ROCm user queue and a gfx job submitted to a kernel
>>> queue.  The actual guilty job may be the ROCm user queue, but the
>>> dri

Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-12 Thread Raag Jadav
On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
> On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> > Em 01/03/2025 02:53, Raag Jadav escreveu:
> > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > > Hi Raag,
> > > > 
> > > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > > Cc: Lucas
> > > > > 
> > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > > When a device get wedged, it might be caused by a guilty 
> > > > > > application.
> > > > > > For userspace, knowing which app was the cause can be useful for 
> > > > > > some
> > > > > > situations, like for implementing a policy, logs or for giving a 
> > > > > > chance
> > > > > > for the compositor to let the user know what app caused the problem.
> > > > > > This is an optional argument, when `PID=-1` there's no information 
> > > > > > about
> > > > > > the app caused the problem, or if any app was involved during the 
> > > > > > hang.
> > > > > > 
> > > > > > Sometimes just the PID isn't enough giving that the app might be 
> > > > > > already
> > > > > > dead by the time userspace will try to check what was this PID's 
> > > > > > name,
> > > > > > so to make the life easier also notify what's the app's name in the 
> > > > > > user
> > > > > > event.
> > > > > > 
> > > > > > Signed-off-by: André Almeida 
> > 
> > [...]
> > 
> > > > > > len = scnprintf(event_string, sizeof(event_string), "%s", 
> > > > > > "WEDGED=");
> > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device 
> > > > > > *dev, unsigned long method)
> > > > > > drm_info(dev, "device wedged, %s\n", method == 
> > > > > > DRM_WEDGE_RECOVERY_NONE ?
> > > > > >  "but recovered through reset" : "needs recovery");
> > > > > > +   if (info) {
> > > > > > +   snprintf(pid_string, sizeof(pid_string), "PID=%u", 
> > > > > > info->pid);
> > > > > > +   snprintf(comm_string, sizeof(comm_string), "APP=%s", 
> > > > > > info->comm);
> > > > > > +   } else {
> > > > > > +   snprintf(pid_string, sizeof(pid_string), "%s", 
> > > > > > "PID=-1");
> > > > > > +   snprintf(comm_string, sizeof(comm_string), "%s", 
> > > > > > "APP=none");
> > > > > > +   }
> > > > > This is not much use for wedge cases that needs recovery, since at 
> > > > > that point
> > > > > the userspace will need to clean house anyway.
> > > > > 
> > > > > Which leaves us with only 'none' case and perhaps the need for 
> > > > > standardization
> > > > > of "optional telemetry collection".
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > I had the feeling that 'none' was already meant to be used for that. Do 
> > > > you
> > > > think we should move to another naming? Given that we didn't reach the 
> > > > merge
> > > > window yet we could potentially change that name without much damage.
> > > 
> > > No, I meant thoughts on possible telemetry data that the drivers might
> > > think is useful for userspace (along with PID) and can be presented in
> > > a vendor agnostic manner (just like wedged event).
> > 
> > I'm not if I agree that this will only be used for telemetry and for the
> > `none` use case. As stated by Xaver, there's use case to know which app
> > caused the device to get wedged (like switching to software rendering) and
> > to display something for the user after the recovery is done (e.g. "The game
> >  stopped working and Plasma has reset").
> 
> Sure, but since this information is already available in coredump, I was
> hoping to have something like a standardized DRM level coredump with both
> vendor specific and agnostic sections, which the drivers can (and hopefully
> transition to) use in conjunction with wedged event to provide wider
> telemetry and is useful for all wedge cases.

This is more useful because,

1. It gives drivers an opportunity to present the telemetry that they are
   interested in without needing to add a new event string (like PID or APP)
   for their case.

2. When we consider wedging as a usecase, there's a lot more that goes
   into it than an application that might be behaving strangely. So a wider
   telemetry is what I would hope to look at in such a scenario.

Raag


Re: [PATCH 2/2] drm/amdgpu: Make use of drm_wedge_app_info

2025-03-12 Thread Raag Jadav
On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
> On Mon, Mar 10, 2025 at 5:54 PM André Almeida  wrote:
> >
> > Em 01/03/2025 03:04, Raag Jadav escreveu:
> > > On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
> > >> Hi Raag,
> > >>
> > >> On 2/28/25 11:58, Raag Jadav wrote:
> > >>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
> > >>>> To notify userspace about which app (if any) made the device get in a
> > >>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
> > >>>> the app PID and name.
> > >>>>
> > >>>> Signed-off-by: André Almeida 
> > >>>> ---
> > >>>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +--
> > >>>>drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  6 +-
> > >>>>2 files changed, 22 insertions(+), 3 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>>> index 00b9b87dafd8..e06adf6f34fd 100644
> > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > >>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct 
> > >>>> amdgpu_device *adev,
> > >>>>atomic_set(&adev->reset_domain->reset_res, r);
> > >>>> -  if (!r)
> > >>>> -  drm_dev_wedged_event(adev_to_drm(adev), 
> > >>>> DRM_WEDGE_RECOVERY_NONE, NULL);
> > >>>> +  if (!r) {
> > >>>> +  struct drm_wedge_app_info aux, *info = NULL;
> > >>>> +
> > >>>> +  if (job) {
> > >>>> +  struct amdgpu_task_info *ti;
> > >>>> +
> > >>>> +  ti = amdgpu_vm_get_task_info_pasid(adev, 
> > >>>> job->pasid);
> > >>>> +  if (ti) {
> > >>>> +  aux.pid = ti->pid;
> > >>>> +  aux.comm = ti->process_name;
> > >>>> +  info = &aux;
> > >>>> +  amdgpu_vm_put_task_info(ti);
> > >>>> +  }
> > >>>> +  }
> > >>> Is this guaranteed to be guilty app and not some scheduled worker?
> > >>
> > >> This is how amdgpu decides which app is the guilty one earlier in the 
> > >> code
> > >> as in the print:
> > >>
> > >>  ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
> > >>
> > >>  "Process information: process %s pid %d thread %s pid %d\n"
> > >>
> > >> So I think it's consistent with what the driver thinks it's the guilty
> > >> process.
> > >
> > > Sure, but with something like app_info we're kind of hinting to userspace
> > > that an application was _indeed_ involved with reset. Is that also 
> > > guaranteed?
> > >
> > > Is it possible that an application needlessly suffers from a false 
> > > positive
> > > scenario (reset due to other factors)?
> > >
> >
> > I asked Alex Deucher in IRC about that and yes, there's a chance that
> > this is a false positive. However, for the majority of cases this is the
> > right app that caused the hang. This is what amdgpu is doing for GL
> > robustness as well and devcoredump, so it's very consistent with how
> > amdgpu deals with this scenario even if the mechanism is still not perfect.
> 
> It's usually the guilty one, but it's not guaranteed.  For example,
> say you have a ROCm user queue and a gfx job submitted to a kernel
> queue.  The actual guilty job may be the ROCm user queue, but the
> driver may not detect that the ROCm queue was hung until some other
> event (e.g., memory pressure).  However, the timer for the gfx job may
> timeout before that happens on the ROCm queue so in that case the gfx
> job would be incorrectly considered guilty.

So it boils down to what are the chances of that happening and whether
it's significant enough to open the door for API abuse.

Considering this is amd specific accuracy, it's still an open question
how other drivers are/will be managing it.

Raag


Re: [PATCH v2 2/3] drm/doc: Add a section about "App information" for the wedge API

2025-05-13 Thread Raag Jadav
On Sun, May 11, 2025 at 07:47:44PM -0300, André Almeida wrote:
> Add a section about "App information" for the wedge API.
> 
> Signed-off-by: André Almeida 
> ---
>  Documentation/gpu/drm-uapi.rst | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 69f72e71a96e..826abe265a24 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -446,6 +446,21 @@ telemetry information (devcoredump, syslog). This is 
> useful because the first
>  hang is usually the most critical one which can result in consequential 
> hangs or
>  complete wedging.
>  
> +App information
> +---
> +
> +The information about which application (if any) caused the device to get in 
> the

I'm wondering if we should change the wording to "application involved in device
wedging", or can we guarantee it to be the cause?

My limited understanding is that we'd still need the full dump to find the 
cause,
if it's possible to also note here.

Raag

> +wedge state is useful for userspace if they want to notify the user about 
> what
> +happened (e.g. the compositor display a message to the user "The 
> +caused a graphical error and the system recovered") or to implement policies
> +(e.g. the daemon may "ban" an app that keeps resetting the device). If the 
> app
> +information is not available, the uevent will display as ``PID=-1`` and
> +``APP=none``. Otherwise, ``PID`` and ``APP`` will advertise about the guilty
> +app.
> +
> +The reliability of this information is driver and hardware specific, and 
> should
> +be taken with a caution regarding it's precision.
> +
>  Consumer prerequisites
>  --
>  
> -- 
> 2.49.0
> 


Re: [PATCH v4 1/3] drm: Create an app info option for wedge events

2025-05-21 Thread Raag Jadav
On Mon, May 19, 2025 at 07:03:30PM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which app was the cause can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what app caused the problem.
> This is an optional argument, when the app info is not available, the
> PID and TASK string won't appear in the event string.
> 
> Sometimes just the PID isn't enough giving that the app might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the app's name in the user
> event.
> 
> Acked-by: Rodrigo Vivi  (for i915 and xe)
> Reviewed-by: Krzysztof Karas 
> Signed-off-by: André Almeida 
> ---
> v4: s/APP/TASK

With this in place I think it's better to use the same terminology
everywhere else for consistency (code, doc, commit message and subject).

> v3: Make comm_string and pid_string empty when there's no app info
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  2 +-
>  drivers/gpu/drm/drm_drv.c  | 16 
>  drivers/gpu/drm/i915/gt/intel_reset.c  |  3 ++-
>  drivers/gpu/drm/xe/xe_device.c |  3 ++-
>  include/drm/drm_device.h   | 11 +++
>  include/drm/drm_drv.h  |  3 ++-
>  7 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d1b54f58495..d27091d5929c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -6363,7 +6363,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
> *adev,
>   atomic_set(&adev->reset_domain->reset_res, r);
>  
>   if (!r)
> - drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE, NULL);
>  
>   return r;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index acb21fc8b3ce..a47b2eb301e5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -166,7 +166,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
> drm_sched_job *s_job)
>   if (amdgpu_ring_sched_ready(ring))
>   drm_sched_start(&ring->sched, 0);
>   dev_err(adev->dev, "Ring %s reset succeeded\n", 
> ring->sched.name);
> - drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE);
> + drm_dev_wedged_event(adev_to_drm(adev), 
> DRM_WEDGE_RECOVERY_NONE, NULL);
>   goto exit;
>   }
>   dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 3dc7acd56b1d..c428d05a8e7c 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -542,6 +542,7 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   * drm_dev_wedged_event - generate a device wedged uevent
>   * @dev: DRM device
>   * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty app
>   *
>   * This generates a device wedged uevent for the DRM device specified by 
> @dev.
>   * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -554,13 +555,13 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   *
>   * Returns: 0 on success, negative error code otherwise.
>   */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_app_info *info)
>  {
>   const char *recovery = NULL;
>   unsigned int len, opt;
> - /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[WEDGE_STR_LEN], pid_string[PID_LEN] = "", 
> comm_string[TASK_COMM_LEN] = "";
> + char *envp[] = { event_string, NULL, NULL, NULL };
>  
>   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>  
> @@ -582,6 +583,13 @@ int drm_dev_wedged_event(struct drm_device *dev, 
> unsigned long method)
>   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>"but recovered through reset" : "needs recovery");
>  
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "TASK=%s", 
> info->comm);

Are we allowing callers to pass empty members?

> + envp[1

Re: [PATCH v3 1/3] drm: Create an app info option for wedge events

2025-05-19 Thread Raag Jadav
On Mon, May 12, 2025 at 05:34:35PM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which app was the cause can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what app caused the problem.
> This is an optional argument, when the app info is not available, the
> PID and APP string won't appear in the event string.
> 
> Sometimes just the PID isn't enough giving that the app might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the app's name in the user
> event.

...

> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 3dc7acd56b1d..e30efa4ac330 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -542,6 +542,7 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   * drm_dev_wedged_event - generate a device wedged uevent
>   * @dev: DRM device
>   * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty app
>   *
>   * This generates a device wedged uevent for the DRM device specified by 
> @dev.
>   * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -554,13 +555,14 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   *
>   * Returns: 0 on success, negative error code otherwise.
>   */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_app_info *info)
>  {
>   const char *recovery = NULL;
>   unsigned int len, opt;
>   /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[32], pid_string[15] = "", comm_string[TASK_COMM_LEN] 
> = "";

Let's make these into proper defines for consistency,

#define WEDGE_STR_LEN
#define PID_LEN

and drop redundant comment.

> + char *envp[] = { event_string, NULL, NULL, NULL };
>  
>   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>  
> @@ -582,6 +584,13 @@ int drm_dev_wedged_event(struct drm_device *dev, 
> unsigned long method)
>   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>"but recovered through reset" : "needs recovery");
>  
> + if (info) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "APP=%s", 
> info->comm);

Since this is a core helper, we'd want to make sure these fields are valid
and not being abused.

Also s/APP/TASK IMHO, but upto you.

Raag


Re: [PATCH v3 2/3] drm/doc: Add a section about "App information" for the wedge API

2025-05-16 Thread Raag Jadav
On Mon, May 12, 2025 at 05:34:36PM -0300, André Almeida wrote:
> Add a section about "App information" for the wedge API.
> 
> Signed-off-by: André Almeida 
> ---
> v3:
>  - Change "app that caused ..." to "app involved ..."
>  - Clarify that devcoredump have more information about what happened
>  - Update that PID and APP will be empty if there's no app info
> ---
>  Documentation/gpu/drm-uapi.rst | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 69f72e71a96e..3300a928d8ef 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -446,6 +446,23 @@ telemetry information (devcoredump, syslog). This is 
> useful because the first
>  hang is usually the most critical one which can result in consequential 
> hangs or
>  complete wedging.
>  
> +App information
> +---
> +
> +The information about which application (if any) was involved in the device
> +wedging is useful for userspace if they want to notify the user about what
> +happened (e.g. the compositor display a message to the user "The 
> +caused a graphical error and the system recovered") or to implement policies
> +(e.g. the daemon may "ban" an app that keeps resetting the device). If the 
> app
> +information is available, the uevent will display as ``PID=`` and
> +``APP=``. Otherwise, ``PID`` and ``APP`` will not appear in the 
> event

Personally I'd use Linux specific naming for consistency.

s/APP/TASK

But in any case,

Reviewed-by: Raag Jadav 

> +string.
> +
> +The reliability of this information is driver and hardware specific, and 
> should
> +be taken with a caution regarding it's precision. To have a big picture of 
> what
> +happened,

Nit: what *really* happened

> the devcoredump file provides should have much more detailed
> +information about the device state and about the event.
> +
>  Consumer prerequisites
>  --
>  
> -- 
> 2.49.0
> 


Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

2025-05-22 Thread Raag Jadav
On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello wrote:
> On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello wrote:
> > > On 5/20/2025 12:22 PM, Denis Benato wrote:
> > > > On 5/20/25 17:49, Mario Limonciello wrote:
> > > > > On 5/20/2025 10:47 AM, Raag Jadav wrote:
> > > > > > On Tue, May 20, 2025 at 10:23:57AM -0500, Mario Limonciello wrote:
> > > > > > > On 5/20/2025 4:48 AM, Raag Jadav wrote:
> > > > > > > > On Mon, May 19, 2025 at 11:42:31PM +0200, Denis Benato wrote:
> > > > > > > > > On 5/19/25 12:41, Raag Jadav wrote:
> > > > > > > > > > On Mon, May 19, 2025 at 03:58:08PM +0530, Raag Jadav wrote:
> > > > > > > > > > > If error status is set on an AER capable device, most 
> > > > > > > > > > > likely either the
> > > > > > > > > > > device recovery is in progress or has already failed. 
> > > > > > > > > > > Neither of the
> > > > > > > > > > > cases are well suited for power state transition of the 
> > > > > > > > > > > device, since
> > > > > > > > > > > this can lead to unpredictable consequences like resume 
> > > > > > > > > > > failure, or in
> > > > > > > > > > > worst case the device is lost because of it. Leave the 
> > > > > > > > > > > device in its
> > > > > > > > > > > existing power state to avoid such issues.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Raag Jadav 
> > > > > > > > > > > ---
> > > > > > > > > > > 
> > > > > > > > > > > v2: Synchronize AER handling with PCI PM (Rafael)
> > > > > > > > > > > v3: Move pci_aer_in_progress() to 
> > > > > > > > > > > pci_set_low_power_state() (Rafael)
> > > > > > > > > > >     Elaborate "why" (Bjorn)
> > > > > > > > > > > v4: Rely on error status instead of device status
> > > > > > > > > > >     Condense comment (Lukas)
> > > > > > > > > > Since pci_aer_in_progress() is changed I've not included 
> > > > > > > > > > Rafael's tag with
> > > > > > > > > > my understanding of this needing a revisit. If this was a 
> > > > > > > > > > mistake, please
> > > > > > > > > > let me know.
> > > > > > > > > > 
> > > > > > > > > > Denis, Mario, does this fix your issue?
> > > > > > > > > > 
> > > > > > > > > Hello,
> > > > > > > > > 
> > > > > > > > > Unfortunately no, I have prepared a dmesg but had to remove 
> > > > > > > > > the bootup process because it was too long of a few kb: 
> > > > > > > > > https://pastebin.com/1uBEA1FL
> > > > > > > > 
> > > > > > > > Thanks for the test. It seems there's no hotplug event this 
> > > > > > > > time around
> > > > > > > > and endpoint device is still intact without any PCI related 
> > > > > > > > failure.
> > > > > > > > 
> > > > > > > > Also,
> > > > > > > > 
> > > > > > > > amdgpu :09:00.0: PCI PM: Suspend power state: D3hot
> > > > > > > > 
> > > > > > > > Which means whatever you're facing is either not related to 
> > > > > > > > this patch,
> > > > > > > > or at best exposed some nasty side-effect that's not handled 
> > > > > > > > correctly
> > > > > > > > by the driver.
> > > > > > > > 
> > > > > > > > I'd say amdgpu folks would be of better help for your case.
> > > > > > > > 
> > > > > > > > Raag
> > > > > > > 
> > > > > > > So according to the logs Denis share

Re: [PATCH v5 1/3] drm: Create a task info option for wedge events

2025-05-22 Thread Raag Jadav
On Tue, May 20, 2025 at 01:32:41PM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which task was the cause can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what task caused the problem.
> This is an optional argument, when the task info is not available, the
> PID and TASK string won't appear in the event string.
> 
> Sometimes just the PID isn't enough giving that the task might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the task's name in the user
> event.

...

> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_task_info *info)
>  {
>   const char *recovery = NULL;
>   unsigned int len, opt;
> - /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[WEDGE_STR_LEN], pid_string[PID_LEN] = "", 
> comm_string[TASK_COMM_LEN] = "";
> + char *envp[] = { event_string, NULL, NULL, NULL };
>  
>   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>  
> @@ -582,6 +586,13 @@ int drm_dev_wedged_event(struct drm_device *dev, 
> unsigned long method)
>   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>"but recovered through reset" : "needs recovery");
>  
> + if (info && ((info->comm && info->comm[0] != '\0'))) {

Thanks for adding this. Should we check if pid > 0?

Also, I was wondering what if the driver only has info on one of the
given members? Should we allow it to be flagged independently?

> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "TASK=%s", 
> info->comm);
> + envp[1] = pid_string;
> + envp[2] = comm_string;
> + }
> +
>   return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>  }
>  EXPORT_SYMBOL(drm_dev_wedged_event);

...

> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index e2f894f1b90a..c13fe85210f2 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -30,6 +30,14 @@ struct pci_controller;
>  #define DRM_WEDGE_RECOVERY_REBINDBIT(1)  /* unbind + bind driver */
>  #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2)  /* unbind + reset bus device + 
> bind */
>  
> +/**
> + * struct drm_wedge_task_info - information about the guilty app of a wedge 
> dev

s/app/task, missed an instance ;)

> + */
> +struct drm_wedge_task_info {
> + pid_t pid;
> + char *comm;
> +};

Raag


Re: [PATCH v6 1/3] drm: Create a task info option for wedge events

2025-05-23 Thread Raag Jadav
On Wed, May 21, 2025 at 12:33:21PM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which task was the cause can be useful for some

s/cause/involved

> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what task caused the problem.

Ditto

> This is an optional argument, when the task info is not available, the
> PID and TASK string won't appear in the event string.
> 
> Sometimes just the PID isn't enough giving that the task might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the task's name in the user
> event.

...

> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_task_info *info)
>  {
>   const char *recovery = NULL;
>   unsigned int len, opt;
> - /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[WEDGE_STR_LEN], pid_string[PID_LEN] = "", 
> comm_string[TASK_COMM_LEN] = "";

Most likely there's no need to initialize these.

With above changes,

Reviewed-by: Raag Jadav 


Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

2025-05-31 Thread Raag Jadav
On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki wrote:
> On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki  wrote:
> > On Wed, May 21, 2025 at 10:54 AM Raag Jadav  wrote:
> > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello wrote:
> > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello wrote:
> > > > > > On 5/20/2025 12:22 PM, Denis Benato wrote:
> > > > > > > On 5/20/25 17:49, Mario Limonciello wrote:
> > > > > > > > On 5/20/2025 10:47 AM, Raag Jadav wrote:
> > > > > > > > > On Tue, May 20, 2025 at 10:23:57AM -0500, Mario Limonciello 
> > > > > > > > > wrote:
> > > > > > > > > > On 5/20/2025 4:48 AM, Raag Jadav wrote:
> > > > > > > > > > > On Mon, May 19, 2025 at 11:42:31PM +0200, Denis Benato 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On 5/19/25 12:41, Raag Jadav wrote:
> > > > > > > > > > > > > On Mon, May 19, 2025 at 03:58:08PM +0530, Raag Jadav 
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > If error status is set on an AER capable device, 
> > > > > > > > > > > > > > most likely either the
> > > > > > > > > > > > > > device recovery is in progress or has already 
> > > > > > > > > > > > > > failed. Neither of the
> > > > > > > > > > > > > > cases are well suited for power state transition of 
> > > > > > > > > > > > > > the device, since
> > > > > > > > > > > > > > this can lead to unpredictable consequences like 
> > > > > > > > > > > > > > resume failure, or in
> > > > > > > > > > > > > > worst case the device is lost because of it. Leave 
> > > > > > > > > > > > > > the device in its
> > > > > > > > > > > > > > existing power state to avoid such issues.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Raag Jadav 
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > v2: Synchronize AER handling with PCI PM (Rafael)
> > > > > > > > > > > > > > v3: Move pci_aer_in_progress() to 
> > > > > > > > > > > > > > pci_set_low_power_state() (Rafael)
> > > > > > > > > > > > > > Elaborate "why" (Bjorn)
> > > > > > > > > > > > > > v4: Rely on error status instead of device status
> > > > > > > > > > > > > > Condense comment (Lukas)
> > > > > > > > > > > > > Since pci_aer_in_progress() is changed I've not 
> > > > > > > > > > > > > included Rafael's tag with
> > > > > > > > > > > > > my understanding of this needing a revisit. If this 
> > > > > > > > > > > > > was a mistake, please
> > > > > > > > > > > > > let me know.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis, Mario, does this fix your issue?
> > > > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > Unfortunately no, I have prepared a dmesg but had to 
> > > > > > > > > > > > remove the bootup process because it was too long of a 
> > > > > > > > > > > > few kb: https://pastebin.com/1uBEA1FL
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the test. It seems there's no hotplug event 
> > > > > > > > > > > this time around
> > > > > > > > > > 

Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

2025-06-20 Thread Raag Jadav
On Tue, Jun 10, 2025 at 03:53:10PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jun 10, 2025 at 3:44 PM Raag Jadav  wrote:
> > On Thu, Jun 05, 2025 at 02:26:05PM +0200, Rafael J. Wysocki wrote:
> > > On Thu, Jun 5, 2025 at 1:44 PM Raag Jadav  wrote:
> > > > On Wed, Jun 04, 2025 at 08:19:34PM +0200, Rafael J. Wysocki wrote:
> > > > > On Wed, Jun 4, 2025 at 5:43 PM Raag Jadav  
> > > > > wrote:
> > > > > > On Fri, May 30, 2025 at 07:49:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > On Fri, May 30, 2025 at 7:23 PM Raag Jadav  
> > > > > > > wrote:
> > > > > > > > On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki 
> > > > > > > > wrote:
> > > > > > > > > On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki 
> > > > > > > > >  wrote:
> > > > > > > > > > On Wed, May 21, 2025 at 10:54 AM Raag Jadav 
> > > > > > > > > >  wrote:
> > > > > > > > > > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario 
> > > > > > > > > > > Limonciello wrote:
> > > > > > > > > > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > > > > > > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario 
> > > > > > > > > > > > > Limonciello wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > > > > > > From the driver perspective it does have expectations 
> > > > > > > > > > > > that the parts outside
> > > > > > > > > > > > the driver did the right thing.  If the driver was 
> > > > > > > > > > > > expecting the root port
> > > > > > > > > > > > to be powered down at suspend and it wasn't there are 
> > > > > > > > > > > > hardware components
> > > > > > > > > > > > that didn't power cycle and that's what we're seeing 
> > > > > > > > > > > > here.
> > > > > > > > > > >
> > > > > > > > > > > Which means the expectation set by the driver is the 
> > > > > > > > > > > opposite of the
> > > > > > > > > > > purpose of this patch, and it's going to fail if any kind 
> > > > > > > > > > > of error is
> > > > > > > > > > > detected under root port during suspend.
> > > > > > > > > >
> > > > > > > > > > And IMV this driver's expectation is questionable at least.
> > > > > > > > > >
> > > > > > > > > > There is no promise whatsoever that the device will always 
> > > > > > > > > > be put into
> > > > > > > > > > D3cold during system suspend.
> > > > > > > > >
> > > > > > > > > For instance, user space may disable D3cold for any PCI 
> > > > > > > > > device via the
> > > > > > > > > d3cold_allowed attribute in sysfs.
> > > > > > > > >
> > > > > > > > > If the driver cannot handle this, it needs to be fixed.
> > > > > > > >
> > > > > > > > Thanks for confirming. So should we consider this patch to be 
> > > > > > > > valid
> > > > > > > > and worth moving forward?
> > > > > > >
> > > > > > > It doesn't do anything that would be invalid in principle IMV.
> > > > > > >
> > > > > > > You need to consider one more thing, though: It may be necessary 
> > > > > > > to
> > > > > > > power-cycle the device in order to kick it out of the erroneous 
> > > > > > > state
> > > > > > > and the patch effectively blocks this if I'm not mistaken.
> > > > > > >
> > > > > > > But admittedly I'm not sure if this really matters.
> > > > > >
> > > > > > Wouldn't something like bus reset (SBR) be more predictable?
> > > > >
> > > > > Maybe.
> > > > >
> > > > > The device state is most likely inconsistent in that case, so it 
> > > > > depends.
> > > >
> > > > My limited understanding is that if SBR doesn't help, at that point all
> > > > bets are off including PMCSR configuration and probably a cold boot is
> > > > needed.
> > >
> > > I'm not talking about PMCSR, I'm talking about power removal (D3cold).
> > > This should be equivalent to a cold boot for the particular device
> > > except that cold boot would also reset the hierarchy above it.
> >
> > Sure. But for D3cold we rely on everything else under root port to also
> > be suspended, which we can't predict while in endpoint suspend path. And
> > with that we're back to "no guarantees" problem, which was the case either
> > way. The patch might just prevent any further damage than what's already
> > done.
> 
> Fair enough.

So anything I can do to move this forward?
Sorry I didn't include your tag since I changed the core logic.

Raag


Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

2025-06-13 Thread Raag Jadav
On Thu, Jun 05, 2025 at 02:26:05PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 5, 2025 at 1:44 PM Raag Jadav  wrote:
> > On Wed, Jun 04, 2025 at 08:19:34PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jun 4, 2025 at 5:43 PM Raag Jadav  wrote:
> > > > On Fri, May 30, 2025 at 07:49:26PM +0200, Rafael J. Wysocki wrote:
> > > > > On Fri, May 30, 2025 at 7:23 PM Raag Jadav  
> > > > > wrote:
> > > > > > On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki wrote:
> > > > > > > On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki 
> > > > > > >  wrote:
> > > > > > > > On Wed, May 21, 2025 at 10:54 AM Raag Jadav 
> > > > > > > >  wrote:
> > > > > > > > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello 
> > > > > > > > > wrote:
> > > > > > > > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > > > > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario 
> > > > > > > > > > > Limonciello wrote:
> > > >
> > > > ...
> > > >
> > > > > > > > > > From the driver perspective it does have expectations that 
> > > > > > > > > > the parts outside
> > > > > > > > > > the driver did the right thing.  If the driver was 
> > > > > > > > > > expecting the root port
> > > > > > > > > > to be powered down at suspend and it wasn't there are 
> > > > > > > > > > hardware components
> > > > > > > > > > that didn't power cycle and that's what we're seeing here.
> > > > > > > > >
> > > > > > > > > Which means the expectation set by the driver is the opposite 
> > > > > > > > > of the
> > > > > > > > > purpose of this patch, and it's going to fail if any kind of 
> > > > > > > > > error is
> > > > > > > > > detected under root port during suspend.
> > > > > > > >
> > > > > > > > And IMV this driver's expectation is questionable at least.
> > > > > > > >
> > > > > > > > There is no promise whatsoever that the device will always be 
> > > > > > > > put into
> > > > > > > > D3cold during system suspend.
> > > > > > >
> > > > > > > For instance, user space may disable D3cold for any PCI device 
> > > > > > > via the
> > > > > > > d3cold_allowed attribute in sysfs.
> > > > > > >
> > > > > > > If the driver cannot handle this, it needs to be fixed.
> > > > > >
> > > > > > Thanks for confirming. So should we consider this patch to be valid
> > > > > > and worth moving forward?
> > > > >
> > > > > It doesn't do anything that would be invalid in principle IMV.
> > > > >
> > > > > You need to consider one more thing, though: It may be necessary to
> > > > > power-cycle the device in order to kick it out of the erroneous state
> > > > > and the patch effectively blocks this if I'm not mistaken.
> > > > >
> > > > > But admittedly I'm not sure if this really matters.
> > > >
> > > > Wouldn't something like bus reset (SBR) be more predictable?
> > >
> > > Maybe.
> > >
> > > The device state is most likely inconsistent in that case, so it depends.
> >
> > My limited understanding is that if SBR doesn't help, at that point all
> > bets are off including PMCSR configuration and probably a cold boot is
> > needed.
> 
> I'm not talking about PMCSR, I'm talking about power removal (D3cold).
> This should be equivalent to a cold boot for the particular device
> except that cold boot would also reset the hierarchy above it.

Sure. But for D3cold we rely on everything else under root port to also
be suspended, which we can't predict while in endpoint suspend path. And
with that we're back to "no guarantees" problem, which was the case either
way. The patch might just prevent any further damage than what's already
done.

Raag


Re: [PATCH v7 2/5] drm: Create a task info option for wedge events

2025-06-16 Thread Raag Jadav
On Fri, Jun 13, 2025 at 03:43:45PM -0300, André Almeida wrote:
> When a device get wedged, it might be caused by a guilty application.
> For userspace, knowing which task was involved can be useful for some
> situations, like for implementing a policy, logs or for giving a chance
> for the compositor to let the user know what task was involved in the
> problem.  This is an optional argument, when the task info is not
> available, the PID and TASK string won't appear in the event string.
> 
> Sometimes just the PID isn't enough giving that the task might be already
> dead by the time userspace will try to check what was this PID's name,
> so to make the life easier also notify what's the task's name in the user
> event.
> 
> Acked-by: Rodrigo Vivi  (for i915 and xe)
> Reviewed-by: Krzysztof Karas 
> Reviewed-by: Raag Jadav 

Although I'm okay with this version, a few aesthetic nits below.

> Signed-off-by: André Almeida 

...

> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 56dd61f8e05a..eba99a081ec1 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -538,10 +538,15 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   }
>  }
>  
> +#define WEDGE_STR_LEN 32
> +#define PID_STR_LEN 15
> +#define COMM_STR_LEN (TASK_COMM_LEN + 5)

Align the values using tabs for readability, and since you're using
TASK_COMM_LEN here please include sched.h instead of relying on
intermediate header which may not guarantee it for other archs and
randconfigs.

> +
>  /**
>   * drm_dev_wedged_event - generate a device wedged uevent
>   * @dev: DRM device
>   * @method: method(s) to be used for recovery
> + * @info: optional information about the guilty task
>   *
>   * This generates a device wedged uevent for the DRM device specified by 
> @dev.
>   * Recovery @method\(s) of choice will be sent in the uevent environment as
> @@ -554,13 +559,13 @@ static const char *drm_get_wedge_recovery(unsigned int 
> opt)
>   *
>   * Returns: 0 on success, negative error code otherwise.
>   */
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_task_info *info)
>  {
>   const char *recovery = NULL;
>   unsigned int len, opt;
> - /* Event string length up to 28+ characters with available methods */
> - char event_string[32];
> - char *envp[] = { event_string, NULL };
> + char event_string[WEDGE_STR_LEN], pid_string[PID_STR_LEN], 
> comm_string[COMM_STR_LEN];
> + char *envp[] = { event_string, NULL, NULL, NULL };

Let's make this reverse xmas order and be consistent with other helpers
in this file.

>   len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
>  
> @@ -582,6 +587,13 @@ int drm_dev_wedged_event(struct drm_device *dev, 
> unsigned long method)
>   drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
>"but recovered through reset" : "needs recovery");
>  
> + if (info && (info->comm[0] != '\0') && (info->pid >= 0)) {
> + snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> + snprintf(comm_string, sizeof(comm_string), "TASK=%s", 
> info->comm);
> + envp[1] = pid_string;
> + envp[2] = comm_string;
> + }
> +
>   return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>  }
>  EXPORT_SYMBOL(drm_dev_wedged_event);

...

> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index e2f894f1b90a..729e1c6da138 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -30,6 +30,14 @@ struct pci_controller;
>  #define DRM_WEDGE_RECOVERY_REBINDBIT(1)  /* unbind + bind driver */
>  #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2)  /* unbind + reset bus device + 
> bind */
>  
> +/**
> + * struct drm_wedge_task_info - information about the guilty task of a wedge 
> dev
> + */
> +struct drm_wedge_task_info {
> + pid_t pid;
> + char comm[TASK_COMM_LEN];

Ditto for sched.h.

Raag

> +};
> +
>  /**
>   * enum switch_power_state - power state of drm device
>   */
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 63b51942d606..3f76a32d6b84 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -487,7 +487,8 @@ void drm_put_dev(struct drm_device *dev);
>  bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
> -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> +  struct drm_wedge_task_info *info);
>  
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
> -- 
> 2.49.0
> 


Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

2025-06-05 Thread Raag Jadav
On Fri, May 30, 2025 at 07:49:26PM +0200, Rafael J. Wysocki wrote:
> On Fri, May 30, 2025 at 7:23 PM Raag Jadav  wrote:
> > On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki  
> > > wrote:
> > > > On Wed, May 21, 2025 at 10:54 AM Raag Jadav  
> > > > wrote:
> > > > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello wrote:
> > > > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello wrote:

...

> > > > > > From the driver perspective it does have expectations that the 
> > > > > > parts outside
> > > > > > the driver did the right thing.  If the driver was expecting the 
> > > > > > root port
> > > > > > to be powered down at suspend and it wasn't there are hardware 
> > > > > > components
> > > > > > that didn't power cycle and that's what we're seeing here.
> > > > >
> > > > > Which means the expectation set by the driver is the opposite of the
> > > > > purpose of this patch, and it's going to fail if any kind of error is
> > > > > detected under root port during suspend.
> > > >
> > > > And IMV this driver's expectation is questionable at least.
> > > >
> > > > There is no promise whatsoever that the device will always be put into
> > > > D3cold during system suspend.
> > >
> > > For instance, user space may disable D3cold for any PCI device via the
> > > d3cold_allowed attribute in sysfs.
> > >
> > > If the driver cannot handle this, it needs to be fixed.
> >
> > Thanks for confirming. So should we consider this patch to be valid
> > and worth moving forward?
> 
> It doesn't do anything that would be invalid in principle IMV.
> 
> You need to consider one more thing, though: It may be necessary to
> power-cycle the device in order to kick it out of the erroneous state
> and the patch effectively blocks this if I'm not mistaken.
> 
> But admittedly I'm not sure if this really matters.

Wouldn't something like bus reset (SBR) be more predictable?

Raag


Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

2025-06-05 Thread Raag Jadav
On Wed, Jun 04, 2025 at 08:19:34PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jun 4, 2025 at 5:43 PM Raag Jadav  wrote:
> > On Fri, May 30, 2025 at 07:49:26PM +0200, Rafael J. Wysocki wrote:
> > > On Fri, May 30, 2025 at 7:23 PM Raag Jadav  wrote:
> > > > On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki wrote:
> > > > > On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki  
> > > > > wrote:
> > > > > > On Wed, May 21, 2025 at 10:54 AM Raag Jadav  
> > > > > > wrote:
> > > > > > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello wrote:
> > > > > > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello 
> > > > > > > > > wrote:
> >
> > ...
> >
> > > > > > > > From the driver perspective it does have expectations that the 
> > > > > > > > parts outside
> > > > > > > > the driver did the right thing.  If the driver was expecting 
> > > > > > > > the root port
> > > > > > > > to be powered down at suspend and it wasn't there are hardware 
> > > > > > > > components
> > > > > > > > that didn't power cycle and that's what we're seeing here.
> > > > > > >
> > > > > > > Which means the expectation set by the driver is the opposite of 
> > > > > > > the
> > > > > > > purpose of this patch, and it's going to fail if any kind of 
> > > > > > > error is
> > > > > > > detected under root port during suspend.
> > > > > >
> > > > > > And IMV this driver's expectation is questionable at least.
> > > > > >
> > > > > > There is no promise whatsoever that the device will always be put 
> > > > > > into
> > > > > > D3cold during system suspend.
> > > > >
> > > > > For instance, user space may disable D3cold for any PCI device via the
> > > > > d3cold_allowed attribute in sysfs.
> > > > >
> > > > > If the driver cannot handle this, it needs to be fixed.
> > > >
> > > > Thanks for confirming. So should we consider this patch to be valid
> > > > and worth moving forward?
> > >
> > > It doesn't do anything that would be invalid in principle IMV.
> > >
> > > You need to consider one more thing, though: It may be necessary to
> > > power-cycle the device in order to kick it out of the erroneous state
> > > and the patch effectively blocks this if I'm not mistaken.
> > >
> > > But admittedly I'm not sure if this really matters.
> >
> > Wouldn't something like bus reset (SBR) be more predictable?
> 
> Maybe.
> 
> The device state is most likely inconsistent in that case, so it depends.

My limited understanding is that if SBR doesn't help, at that point all
bets are off including PMCSR configuration and probably a cold boot is
needed.

Please correct me if I've misunderstood.

Raag