Adding Hawking to comment on RAS.

On 05.02.25 at 16:33, Tvrtko Ursulin wrote:
The helper copies code from the existing amdgpu_job_stop_all_jobs_on_sched,
with the purpose of reducing the amount of driver code which directly
touches scheduler internals.

If or when amdgpu manages to change the approach for handling the
permanently wedged state, this helper can be removed.

When RAS indicates a problem and reset is disabled, we shouldn't mess with the scheduler internals, but rather mark the device as unplugged and clear the PCIe DMA bits.

In other words, enter the wedged state, which is now well documented.

This way all submissions will run into -ENODEV errors and be cleaned up immediately on submission by the scheduler. Applications will then just wait for their existing submissions and get an error if they try to send new ones.
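For illustration, a minimal and untested sketch of that flow. drm_dev_unplug(), drm_dev_enter()/drm_dev_exit() and pci_clear_master() are the existing helpers; the handler name below is made up:

#include <drm/drm_drv.h>
#include <linux/pci.h>

/* Hypothetical fatal RAS handler; only the name is invented, the
 * helpers it calls are the real ones.
 */
static void amdgpu_ras_enter_wedged(struct amdgpu_device *adev)
{
	/* From here on drm_dev_enter() fails, so ioctls return -ENODEV. */
	drm_dev_unplug(adev_to_drm(adev));

	/* Clear the PCIe DMA bits so the device can't write memory anymore. */
	pci_clear_master(adev->pdev);
}

Any submission path guarded by drm_dev_enter() then fails cleanly:

	int idx;

	if (!drm_dev_enter(adev_to_drm(adev), &idx))
		return -ENODEV;
	/* ... queue the job as usual ... */
	drm_dev_exit(idx);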

Stopping the scheduler and then messing with its internals is basically just a really ugly hack and never made sense in the first place, as far as I can see.

See below for more comments.


Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: Danilo Krummrich <d...@kernel.org>
Cc: Matthew Brost <matthew.br...@intel.com>
Cc: Philipp Stanner <pha...@kernel.org>
---
  drivers/gpu/drm/scheduler/sched_main.c | 44 ++++++++++++++++++++++++++
  include/drm/gpu_scheduler.h            |  1 +
  2 files changed, 45 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index a48be16ab84f..0363655db22d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -703,6 +703,50 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, int errno)
  }
  EXPORT_SYMBOL(drm_sched_start);
+/**
+ * drm_sched_cancel_all_jobs - Cancel all queued and scheduled jobs
+ *
+ * @sched: scheduler instance
+ * @errno: error value to set on signaled fences
+ *
+ * Signal all queued and scheduled jobs and set them to error state.
+ *
+ * Scheduler must be stopped before calling this.
+ */
+void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno)
+{
+       struct drm_sched_entity *entity;
+       struct drm_sched_fence *s_fence;
+       struct drm_sched_job *job;
+       enum drm_sched_priority p;
+
+       drm_WARN_ON_ONCE(sched, !sched->pause_submit);
+
+       /* Signal all jobs not yet scheduled */
+       for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
+               struct drm_sched_rq *rq = sched->sched_rq[p];
+
+               spin_lock(&rq->lock);
+               list_for_each_entry(entity, &rq->entities, list) {
+                       while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
+                               s_fence = job->s_fence;
+                               dma_fence_signal(&s_fence->scheduled);
+                               dma_fence_set_error(&s_fence->finished, errno);
+                               dma_fence_signal(&s_fence->finished);
+                       }
+               }
+               spin_unlock(&rq->lock);
+       }
+
+       /* Signal all jobs already scheduled to HW */
+       list_for_each_entry(job, &sched->pending_list, list) {
+               s_fence = job->s_fence;
+               dma_fence_set_error(&s_fence->finished, errno);
+               dma_fence_signal(&s_fence->finished);
+       }

This is in the wrong order: already scheduled jobs need to signal first, and only then the not yet scheduled ones. Otherwise you violate the dma_fence ordering rules.
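Untested, but fixing that should just be a matter of swapping the two blocks from the patch above, i.e. walking the pending_list first:

	/* Signal all jobs already scheduled to HW first ... */
	list_for_each_entry(job, &sched->pending_list, list) {
		s_fence = job->s_fence;
		dma_fence_set_error(&s_fence->finished, errno);
		dma_fence_signal(&s_fence->finished);
	}

	/* ... and only then the jobs still sitting in the entity queues. */
	for (p = DRM_SCHED_PRIORITY_KERNEL; p < sched->num_rqs; p++) {
		struct drm_sched_rq *rq = sched->sched_rq[p];

		spin_lock(&rq->lock);
		list_for_each_entry(entity, &rq->entities, list) {
			while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
				s_fence = job->s_fence;
				dma_fence_signal(&s_fence->scheduled);
				dma_fence_set_error(&s_fence->finished, errno);
				dma_fence_signal(&s_fence->finished);
			}
		}
		spin_unlock(&rq->lock);
	}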

In addition to that, this is racy as hell: even when we hit an RAS error it is perfectly possible that submissions finish normally.

Regards,
Christian.

+}
+EXPORT_SYMBOL(drm_sched_cancel_all_jobs);
+
  /**
   * drm_sched_resubmit_jobs - Deprecated, don't use in new code!
   *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index a0ff08123f07..298513f8c327 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -579,6 +579,7 @@ void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
  void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
  void drm_sched_start(struct drm_gpu_scheduler *sched, int errno);
+void drm_sched_cancel_all_jobs(struct drm_gpu_scheduler *sched, int errno);
  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
  void drm_sched_increase_karma(struct drm_sched_job *bad);
  void drm_sched_reset_karma(struct drm_sched_job *bad);
