On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote: > Tearing down the scheduler with jobs still on the pending list can > lead to use after free issues. Add a warning if drivers try to > destroy a scheduler which still has work pushed to the HW.
Did you have time yet to look into my proposed waitque-solution? > > When there are still entities with jobs the situation is even worse > since the dma_fences for those jobs can never signal we can just > choose between potentially locking up core memory management and > random memory corruption. When drivers really mess it up that well > let them run into a BUG_ON(). > > Signed-off-by: Christian König <christian.koe...@amd.com> > --- > drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++- > 1 file changed, 18 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > b/drivers/gpu/drm/scheduler/sched_main.c > index f093616fe53c..8a46fab5cdc8 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler > *sched) I agree with Sima that it should first be documented in the function's docstring what the user is expected to have done before calling the function. P. > > drm_sched_wqueue_stop(sched); > > + /* > + * Tearing down the scheduler wile there are still > unprocessed jobs can > + * lead to use after free issues in the scheduler fence. > + */ > + WARN_ON(!list_empty(&sched->pending_list)); > + > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) > { > struct drm_sched_rq *rq = sched->sched_rq[i]; > > spin_lock(&rq->lock); > - list_for_each_entry(s_entity, &rq->entities, list) > + list_for_each_entry(s_entity, &rq->entities, list) { > + /* > + * The justification for this BUG_ON() is > that tearing > + * down the scheduler while jobs are pending > leaves > + * dma_fences unsignaled. Since we have > dependencies > + * from the core memory management to > eventually signal > + * dma_fences this can trivially lead to a > system wide > + * stop because of a locked up memory > management. > + */ > + BUG_ON(spsc_queue_count(&s_entity- > >job_queue)); > + > /* > * Prevents reinsertion and marks job_queue > as idle, > * it will removed from rq in > drm_sched_entity_fini > * eventually > */ > s_entity->stopped = true; > + } > spin_unlock(&rq->lock); > kfree(sched->sched_rq[i]); > }