When a job times out in the DRM scheduler, the GPU isn't necessarily hung; the job may still be running, and there may be no valid reason to reset the hardware. This can occur in two situations:

  1. The GPU exposes some mechanism that lets the driver check whether the GPU is still making progress. By checking this mechanism, we can safely skip the reset, rearm the timeout, and allow the job to continue running until completion. This is the case for v3d and Etnaviv.

  2. TDR (timeout detection and recovery) has fired before the IRQ that signals the fence. Consequently, the job actually finishes, but it triggers a timeout before signaling the completion fence.

These two scenarios are problematic because we remove the job from the `sched->pending_list` before calling `sched->ops->timedout_job()`. This means that when the job finally signals completion (e.g. in the IRQ handler), the scheduler won't call `sched->ops->free_job()`. As a result, the job and its resources won't be freed, leading to a memory leak.

For v3d specifically, we have observed that these memory leaks can be significant in certain scenarios, as reported by users in [1][2]. To address this situation, I submitted a patch similar to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when GPU is still active") for v3d [3]. This patch has already landed in drm-misc-fixes and successfully resolved the users' issues.

However, as I mentioned in [3], exposing the scheduler's internals within the drivers isn't ideal, and I believe this specific situation can be addressed within the DRM scheduler framework.

This series aims to resolve this issue by adding a new DRM sched status that allows a driver to skip the reset. This new status will indicate that the job should be reinserted into the pending list, and the driver will still signal its completion.

The series can be divided into three parts:

  * Patch 1: Implementation of the new status in the DRM scheduler.
  * Patches 2-4: Some fixes to the DRM scheduler KUnit tests and the addition of a test for the new status.
  * Patches 5-8: Use of the new status in four different drivers.
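
To make the leak and the proposed fix easier to picture, here is a minimal userspace sketch of the flow described above. It is a toy model under stated assumptions, not the scheduler code: `toy_job`, `toy_stat`, `toy_timeout()`, `toy_complete()` and `toy_run()` are hypothetical names invented for illustration, and the pending list is reduced to a single flag per job. `TOY_STAT_RUNNING` stands in for the new status (named DRM_GPU_SCHED_STAT_RUNNING in the patch titles).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_stat {
	TOY_STAT_RESET,   /* hypothetical: driver reset the hardware */
	TOY_STAT_RUNNING, /* hypothetical: driver skipped the reset, job still runs */
};

struct toy_job {
	bool pending; /* is the job on the scheduler's pending list? */
	bool done;    /* did the hardware finish the job? */
	bool freed;   /* did the free_job step run for this job? */
};

/*
 * Timeout path: the job is taken off the pending list before the driver
 * is asked what happened. With reinsert_on_running set, a "still running"
 * answer puts the job back on the list.
 */
static void toy_timeout(struct toy_job *job, enum toy_stat driver_answer,
			bool reinsert_on_running)
{
	assert(job->pending);
	job->pending = false;

	if (driver_answer == TOY_STAT_RUNNING && reinsert_on_running)
		job->pending = true;
}

/*
 * Completion path (e.g. from the IRQ handler): the job is only freed if
 * the scheduler still finds it on the pending list.
 */
static void toy_complete(struct toy_job *job)
{
	job->done = true;

	if (job->pending) {
		job->pending = false;
		job->freed = true;
	}
}

/* Returns true if the job finished but was never freed, i.e. it leaked. */
static bool toy_run(bool reinsert_on_running)
{
	struct toy_job job = { .pending = true };

	toy_timeout(&job, TOY_STAT_RUNNING, reinsert_on_running);
	toy_complete(&job);

	return job.done && !job.freed;
}
```

In this model, `toy_run(false)` reports a leak because the completed job is no longer on the pending list when it signals, while `toy_run(true)` does not, because the job was reinserted after the driver reported that it was still running.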

[1] https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227
[2] https://github.com/raspberrypi/linux/issues/6817
[3] https://lore.kernel.org/dri-devel/20250430210643.57924-1-mca...@igalia.com/T/

Best Regards,
- Maíra

---
Maíra Canal (8):
      drm/sched: Allow drivers to skip the reset and keep on running
      drm/sched: Always free the job after the timeout
      drm/sched: Reduce scheduler's timeout for timeout tests
      drm/sched: Add new test for DRM_GPU_SCHED_STAT_RUNNING
      drm/v3d: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/etnaviv: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/xe: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset
      drm/panfrost: Use DRM_GPU_SCHED_STAT_RUNNING to skip the reset

 drivers/gpu/drm/etnaviv/etnaviv_sched.c          | 12 ++---
 drivers/gpu/drm/panfrost/panfrost_job.c          |  8 ++--
 drivers/gpu/drm/scheduler/sched_main.c           | 14 ++++++
 drivers/gpu/drm/scheduler/tests/mock_scheduler.c | 13 ++++++
 drivers/gpu/drm/scheduler/tests/tests_basic.c    | 57 ++++++++++++++++++++++--
 drivers/gpu/drm/v3d/v3d_sched.c                  |  4 +-
 drivers/gpu/drm/xe/xe_guc_submit.c               |  8 +---
 include/drm/gpu_scheduler.h                      |  2 +
 8 files changed, 94 insertions(+), 24 deletions(-)
---
base-commit: 760e296124ef3b6e14cd1d940f2a01c5ed7c0dac
change-id: 20250502-sched-skip-reset-bf7c163233da