drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

jesse.zhang Wed, 12 Feb 2025 21:48:46 -0800

From: "jesse.zh...@amd.com" <jesse.zh...@amd.com>

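Update `amdgpu_job_timedout` to check whether the ring is actually
guilty of causing the timeout. When GPU recovery is enabled and the
ring provides an `is_guilty` callback that reports the ring as not
guilty, log the timeout and skip error handling and fence completion.
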
Suggested-by: Alex Deucher <alexander.deuc...@amd.com>
Signed-off-by: Jesse Zhang <jesse.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
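
The control flow added by this patch can be sketched in userspace as
below. This is only an illustration under stated assumptions, not the
driver code: `struct ring`, `struct ring_funcs`, `demo_is_guilty`,
`handle_timeout`, the `hung` field and the `timeout_action` enum are
all hypothetical stand-ins. Only the shape of the check (recovery
enabled, optional callback present, callback says "not guilty") mirrors
the hunk below.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
struct ring;

struct ring_funcs {
	/* Optional callback: returns true if this ring caused the hang. */
	bool (*is_guilty)(struct ring *ring);
};

struct ring {
	const char *name;
	const struct ring_funcs *funcs;
	bool hung;	/* illustrative state; a real callback would query hardware */
};

enum timeout_action {
	TIMEOUT_SKIP_RECOVERY,	/* corresponds to the "goto exit" path */
	TIMEOUT_RUN_RECOVERY,	/* fall through to the existing handling */
};

/* Example callback for the sketch: report guilt from the fake state. */
static bool demo_is_guilty(struct ring *ring)
{
	return ring->hung;
}

static enum timeout_action handle_timeout(struct ring *ring,
					   bool recovery_enabled)
{
	/* Only consult the callback if recovery is on and the ring has one. */
	if (recovery_enabled && ring->funcs->is_guilty) {
		if (!ring->funcs->is_guilty(ring)) {
			fprintf(stderr, "ring %s timeout, but not guilty\n",
				ring->name);
			return TIMEOUT_SKIP_RECOVERY;
		}
	}

	return TIMEOUT_RUN_RECOVERY;
}
```

Note that a ring without the callback, or a setup with recovery
disabled, keeps the previous behaviour in this sketch: the timeout is
handled as before.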

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f04475943..f94c876db72b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                /* Effectively the job is aborted as the device is gone */
                return DRM_GPU_SCHED_STAT_ENODEV;
        }
+       /* Check if the ring is actually guilty of causing the timeout.
+        * If not, skip error handling and fence completion.
+        */
+       if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
+               if (!ring->funcs->is_guilty(ring)) {
+                       dev_err(adev->dev, "ring %s timeout, but not guilty\n",
+                               s_job->sched->name);
+                       goto exit;
+               }
+       }
 
        /*
         * Do the coredump immediately after a job timeout to get a very
-- 
2.25.1

[PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty