[PATCH 0/8] drm/sched: Allow drivers to skip the reset with DRM_GPU_SCHED_STAT_RUNNING

Maíra Canal Sat, 03 May 2025 14:01:09 -0700

When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job may still be running, and there may be no valid reason to
reset the hardware. This can occur in two situations:


  1. The GPU exposes some mechanism that shows whether it is still making
     progress. By checking this mechanism, the driver can safely skip the
     reset, rearm the timeout, and let the job run to completion. This is
     the case for v3d and Etnaviv.

  2. Timeout detection and recovery (TDR) fires before the IRQ that signals
     the fence. The job has actually finished, but the timeout is triggered
     before the completion fence is signaled.
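
As an illustration only, the two situations can be sketched as a userspace
model of a driver's timeout handler. Every name below (`mock_gpu`,
`mock_job`, `mock_timedout_job`, the `MOCK_SCHED_STAT_*` values, the
`progress_reg` field) is hypothetical and is not taken from v3d, Etnaviv, or
the scheduler; only the idea of returning a "still running" status comes
from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the kernel's enum drm_gpu_sched_stat. */
enum mock_sched_stat {
	MOCK_SCHED_STAT_NOMINAL, /* timeout handled, here by resetting the GPU */
	MOCK_SCHED_STAT_RUNNING, /* job is fine: skip the reset */
};

struct mock_gpu {
	uint32_t progress_reg;  /* assumed register that advances as the job runs */
	unsigned int resets;    /* number of resets performed so far */
};

struct mock_job {
	uint32_t last_progress; /* progress_reg value seen at the previous timeout */
	bool done;              /* hardware finished; completion IRQ not handled yet */
};

static enum mock_sched_stat mock_timedout_job(struct mock_gpu *gpu,
					      struct mock_job *job)
{
	assert(gpu && job);

	/* Situation 2: the job already finished, only the IRQ is late. */
	if (job->done)
		return MOCK_SCHED_STAT_RUNNING;

	/* Situation 1: progress since the last timeout means no hang. */
	if (gpu->progress_reg != job->last_progress) {
		job->last_progress = gpu->progress_reg;
		return MOCK_SCHED_STAT_RUNNING;
	}

	/* No progress at all: treat it as a hang and reset. */
	gpu->resets++;
	return MOCK_SCHED_STAT_NOMINAL;
}
```

In this model a second timeout with an unchanged `progress_reg` falls
through to the reset path, so a job that is truly stuck is still recovered.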

These two scenarios are problematic because we remove the job from the
`sched->pending_list` before calling `sched->ops->timedout_job()`. This
means that when the job finally signals completion (e.g. in the IRQ
handler), the scheduler won't call `sched->ops->free_job()`. As a result,
the job and its resources won't be freed, leading to a memory leak.

For v3d specifically, we have observed that these memory leaks can be
significant in certain scenarios, as reported by users in [1][2]. To
address this situation, I submitted a patch similar to commit 704d3d60fec4
("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3].
This patch has already landed in drm-misc-fixes and successfully resolved
the users' issues.

However, as I mentioned in [3], exposing the scheduler's internals within
the drivers isn't ideal, and I believe this specific situation can be
addressed within the DRM scheduler framework.

This series aims to resolve this issue by adding a new DRM sched status
that allows a driver to skip the reset. The new status tells the scheduler
to reinsert the job into the pending list, since the driver will still
signal its completion.
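
To make the intended behaviour concrete, here is a minimal userspace model
of the pending-list handling. It is a sketch under stated assumptions, not
the code from patch 1: the `mock_*` types and functions and the
`MOCK_SCHED_STAT_*` values are hypothetical, and the NOMINAL case stands for
a driver that returns from its timeout handler without resetting and without
putting the job back itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's enum drm_gpu_sched_stat. */
enum mock_sched_stat {
	MOCK_SCHED_STAT_NOMINAL,
	MOCK_SCHED_STAT_RUNNING,
};

struct mock_job {
	struct mock_job *next;
	bool freed;
};

struct mock_sched {
	struct mock_job *pending; /* singly linked stand-in for pending_list */
};

static void mock_pending_add(struct mock_sched *s, struct mock_job *j)
{
	j->next = s->pending;
	s->pending = j;
}

/* Unlink a job; returns true if it was on the pending list. */
static bool mock_pending_del(struct mock_sched *s, struct mock_job *j)
{
	struct mock_job **p;

	for (p = &s->pending; *p; p = &(*p)->next) {
		if (*p == j) {
			*p = j->next;
			j->next = NULL;
			return true;
		}
	}
	return false;
}

/*
 * Timeout path: the job leaves the pending list before the driver callback
 * runs (driver_status models its return value). It is put back only when
 * the driver reports that the job is still running.
 */
static void mock_job_timedout(struct mock_sched *s, struct mock_job *j,
			      enum mock_sched_stat driver_status)
{
	if (!mock_pending_del(s, j))
		return;

	if (driver_status == MOCK_SCHED_STAT_RUNNING)
		mock_pending_add(s, j);
}

/* Completion path: only jobs found on the pending list get freed. */
static void mock_job_done(struct mock_sched *s, struct mock_job *j)
{
	if (mock_pending_del(s, j)) {
		assert(!j->freed);
		j->freed = true; /* stands in for sched->ops->free_job() */
	}
}
```

With the NOMINAL status and no reset, a later `mock_job_done()` finds
nothing to free, which is the leak described above. With the RUNNING status
the job is back on the list when it completes and is freed as usual.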

The series can be divided into three parts:

  * Patch 1: Implementation of the new status in the DRM scheduler.
  * Patches 2-4: Some fixes to the DRM scheduler KUnit tests and the
    addition of a test for the new status.
  * Patches 5-8: Usage of the new status in four different drivers.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
[2] https://github.com/raspberrypi/linux/issues/6817
[3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mca...@igalia.com/T/

Best Regards,
- Maíra

---
Maíra Canal (8):
      drm/sched: Allow drivers to skip the reset and keep on running
      drm/sched: Always free the job after the timeout
      drm/sched: Reduce scheduler's timeout for timeout tests
      drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
      drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/etnaviv: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset

 drivers/gpu/drm/etnaviv/etnaviv_sched.c          | 12 ++---
 drivers/gpu/drm/panfrost/panfrost_job.c          |  8 ++--
 drivers/gpu/drm/scheduler/sched_main.c           | 14 ++++++
 drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 13 ++++++
 drivers/gpu/drm/scheduler/tests/tests_basic.c    | 57 ++++++++++++++++++++++--
 drivers/gpu/drm/v3d/v3d_sched.c                  |  4 +-
 drivers/gpu/drm/xe/xe_guc_submit.c               |  8 +---
 include/drm/gpu_scheduler.h                      |  2 +
 8 files changed, 94 insertions(+), 24 deletions(-)
---
base-commit: 760e296124ef3b6e14cd1d940f2a01c5ed7c0dac
change-id: 20250502-sched-skip-reset-bf7c163233da
