On Wed, 2025-08-13 at 14:58 +0200, Danilo Krummrich wrote: > On Wed Aug 13, 2025 at 10:56 AM CEST, Philipp Stanner wrote: > > In drm_sched_fini() all entities are marked as stopped - without taking > > the appropriate lock, because that would deadlock. That means that > > drm_sched_fini() and drm_sched_entity_push_job() can race against each > > other. > > > > This should most likely be fixed by establishing the rule that all > > entities associated with a scheduler must be torn down first. Then, > > however, the locking should be removed from drm_sched_fini() alltogether > > with an appropriate comment. > > > > Reported-by: James Flowers <bold.zone2...@fastmail.com> > > Link: > > https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2...@fastmail.com/ > > Signed-off-by: Philipp Stanner <pha...@kernel.org>
Applied to drm-misc-next > > --- > > Changes in v2: > > - Fix typo. > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++ > > 1 file changed, 16 insertions(+) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > b/drivers/gpu/drm/scheduler/sched_main.c > > index 5a550fd76bf0..46119aacb809 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) > > * Prevents reinsertion and marks job_queue as idle, > > * it will be removed from the rq in > > drm_sched_entity_fini() > > * eventually > > + * > > + * FIXME: > > + * This lacks the proper spin_lock(&s_entity->lock) and > > + * is, therefore, a race condition. Most notably, it > > + * can race with drm_sched_entity_push_job(). The lock > > + * cannot be taken here, however, because this would > > + * lead to lock inversion -> deadlock. > > + * > > + * The best solution probably is to enforce the life > > + * time rule of all entities having to be torn down > > + * before their scheduler. Then, however, locking could > > + * be dropped alltogether from this function. > > "Enforce the rule" is correct, since factually it's there, as a dependency in > the code. > > Do we know which drivers violate this lifetime rule? I've got no idea :( > > @Christian: What about amdgpu (for which the below was added to begin with)? +1 P. > > > + * For now, this remains a potential race in all > > + * drivers that keep entities alive for longer than > > + * the scheduler. > > */ > > s_entity->stopped = true; > > spin_unlock(&rq->lock); > > -- > > 2.49.0 >