[PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery

jesse.zh...@amd.com Tue, 14 Jan 2025 22:52:37 -0800

When a GPU job times out, the driver attempts to recover by restarting
the scheduler. Previously, the scheduler was restarted with an error
code of 0, which does not distinguish between a full GPU reset and a
queue reset. This patch changes the error code to -ENODATA for queue
resets, while -ECANCELED or -ETIME is used for full GPU resets.


This change improves error handling by:
1. Clearly differentiating between queue resets and full GPU resets.
2. Providing more specific error codes for better debugging and recovery.
3. Aligning with kernel best practices for error reporting.

The related commit b2ef808786d93df3658 ("drm/sched: add optional errno
to drm_sched_start()") introduced support for passing an error code to
drm_sched_start(), enabling this improvement.
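
For illustration only, here is a minimal userspace sketch of how a consumer
could tell the two recovery paths apart from the error code described above.
It is not kernel code: struct mock_fence, mock_fence_set_error(),
classify_reset() and enum reset_kind are hypothetical names invented for this
example, and the mapping simply mirrors the codes named in this commit
message (-ENODATA for a queue reset, -ECANCELED or -ETIME for a full GPU
reset).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification for this sketch; not a kernel type. */
enum reset_kind {
	RESET_NONE,     /* no error recorded */
	RESET_QUEUE,    /* queue (ring) reset: -ENODATA */
	RESET_FULL_GPU, /* full GPU reset: -ECANCELED or -ETIME */
	RESET_UNKNOWN,  /* any other error code */
};

/* Hypothetical stand-in for an object that carries the recovery errno. */
struct mock_fence {
	int error;
};

/* Record the errno handed over when the (mock) scheduler is restarted. */
static void mock_fence_set_error(struct mock_fence *fence, int err)
{
	/* Kernel-style error codes are zero or negative. */
	assert(err <= 0);
	fence->error = err;
}

/* Map the recorded errno to the kind of reset it stands for. */
static enum reset_kind classify_reset(const struct mock_fence *fence)
{
	switch (fence->error) {
	case 0:
		return RESET_NONE;
	case -ENODATA:
		return RESET_QUEUE;
	case -ECANCELED:
	case -ETIME:
		return RESET_FULL_GPU;
	default:
		return RESET_UNKNOWN;
	}
}
```

With an error code of 0 the two cases would both fall into RESET_NONE in this
sketch, which is the ambiguity the patch removes.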

Signed-off-by: Vitaly Prosyak <vitaly.pros...@amd.com>
Signed-off-by: Jesse Zhang <jesse.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f04475943..b18b316872a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                        atomic_inc(&ring->adev->gpu_reset_counter);
                        amdgpu_fence_driver_force_completion(ring);
                        if (amdgpu_ring_sched_ready(ring))
-                               drm_sched_start(&ring->sched, 0);
+                               drm_sched_start(&ring->sched, -ENODATA);
                        goto exit;
                }
                dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
-- 
2.25.1
