On 14.07.25 15:27, Philipp Stanner wrote:
> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>
>> On 14.07.25 14:46, Philipp Stanner wrote:
>>> regarding the patch subject: the prefix we use for the scheduler
>>> is:
>>>
>>> drm/sched:
>>>
>>>
>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>
>>>> When Application A submits jobs (a1, a2, a3) and application B
>>>> submits
>>>
>>> s/Application/application
>>>
>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>> application A
>>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to
>>>> force
>>>> signal all jobs sequentially. However, due to missing
>>>> work_run_job or
>>>> work_free_job
>>>
>>> You probably mean "due to the work items work_run_job and
>>> work_free_job
>>> not being started […]".
>>>
>>> However, the call you add, drm_sched_wakeup(), immediately only
>>> triggers work_run_job. It's not clear to me why you mention
>>> work_free_job, because drm_sched_entity_kill_jobs_work() directly
>>> invokes the free_job() callback.
>>
>> Yeah the description is rather confusing, took me more than one try
>> to understand what's going on here as well. Let me try to explain it
>> in my words:
>>
>> When an application A is killed the pending submissions from it are
>> not executed, but rather torn down by
>> drm_sched_entity_kill_jobs_work().
>>
>> If now a submission from application B depends on something
>> application A wanted to do on the same scheduler instance the
>> function drm_sched_entity_add_dependency_cb() would have optimized
>> this dependency and assumed that the scheduler work would already run
>> and try to pick a job.
>>
>> But that isn't the case when the submissions from application A are
>> all killed. So make sure to start the scheduler work from
>> drm_sched_entity_kill_jobs_work() to handle that case.
>
> Alright, so the bug then depends on B setting a dependency on A _after_
> A was killed, doesn't it?
> Because the optimization in _add_dependency_cb() can only have an
> effect after A's jobs have been torn down.
No, application A and application B submit right behind each other.
Application A is then killed later on. The optimization in
_add_dependency_cb() just waits for A's submission to be handled by the
scheduler, but that never happens because it was killed.

> If there is such a timing order problem, the commit message should
> mention it.
>
> The commit message definitely also needs to mention this optimization
> in drm_sched_entity_add_dependency_cb() since it's essential for
> understanding the bug.

+1

Christian.

> Danke
> P.
>
>> Regards,
>> Christian.
>>
>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>> state, causing application B hang.
>>>
>>> So the issue is that if a1 never ran, the scheduler will not
>>> continue
>>> with the jobs of application B, correct?
>>>
>>>> Add drm_sched_wakeup() when entity_kill_job_work() to preventing
>>>
>>> s/to preventing/to prevent
>>>
>>>> scheduler sleep and subsequent application hangs.
>>>>
>>>> v2:
>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>
>>> Changelogs are cool and useful, but move them below the Signed-off
>>> area with another --- please, like so:
>>>
>>> Signed-off by: …
>>> ---
>>> v2:
>>> …
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>
>>>> Signed-off-by: Lin.Cao <linca...@amd.com>
>>>
>>> This fixes a bug. Even a racy bug, it seems. So we need a Fixes tag
>>> and put the stable kernel in CC.
>>>
>>> Please figure out with git blame, git log and git tag --contains
>>> which commit your patch fixes and which is the oldest kernel that
>>> contains the bug. Then add a Fixes: tag and Cc: the stable kernel.
>>> You'll find lots of examples for that in git log.
>>>
>>>
>>> Thx
>>> P.
>>>
>>>> ---
>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> index e671aa241720..66f2a43c58fd 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>> 	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>>>
>>>> 	drm_sched_fence_scheduled(job->s_fence, NULL);
>>>> +	drm_sched_wakeup(job->sched);
>>>> 	drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>> 	WARN_ON(job->s_fence->parent);
>>>> 	job->sched->ops->free_job(job);
>>>
>>
>
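P.S.: For readers trying to follow the race, here is a toy model of the
sequence Christian describes. This is purely illustrative Python, not the
real DRM scheduler; all names (Scheduler, Fence, run_work, kill_jobs_work,
clear_dep) are made up for the sketch. It models the key point: the
same-scheduler dependency optimization only *clears* B's dependency when
A's scheduled fence signals, without waking the scheduler, so if that fence
is signaled from the kill-jobs path instead of from the running scheduler
work, B hangs unless the kill path also issues a wakeup.

```python
from collections import deque

class Fence:
    """Toy fence: signaling fires registered callbacks, like dma_fence."""
    def __init__(self):
        self.signaled = False
        self.callbacks = []

    def signal(self):
        self.signaled = True
        for cb in self.callbacks:
            cb()

class Job:
    def __init__(self, name, entity, dep=None):
        self.name = name
        self.entity = entity
        self.dep = dep              # fence this job waits on, if any
        self.scheduled = Fence()    # models the job's scheduled fence
        self.ran = False

class Scheduler:
    def __init__(self):
        self.queue = deque()
        self.killed = set()         # entities whose jobs must not run
        self.work_queued = False    # models the run-job work item

    def submit(self, job):
        if job.dep is not None:
            # Models the same-scheduler optimization in
            # drm_sched_entity_add_dependency_cb(): the fence callback only
            # clears the dependency, it does NOT wake the scheduler,
            # trusting that the scheduler work is already running.
            def clear_dep(j=job):
                j.dep = None
            job.dep.callbacks.append(clear_dep)
        self.queue.append(job)
        self.work_queued = True     # submission always wakes the scheduler
        return job

    def run_work(self):
        """Models work_run_job: executes only if a wakeup queued it."""
        if not self.work_queued:
            return
        self.work_queued = False
        while self.queue:
            job = self.queue[0]
            if job.entity in self.killed:
                self.queue.popleft()    # torn down by the kill work instead
                continue
            if job.dep is not None and not job.dep.signaled:
                break                   # head job blocked -> go idle
            self.queue.popleft()
            job.ran = True
            job.scheduled.signal()

    def kill_jobs_work(self, jobs, with_fix):
        """Models drm_sched_entity_kill_jobs_work() for a killed entity."""
        for job in jobs:
            job.scheduled.signal()      # drm_sched_fence_scheduled()
            if with_fix:
                self.work_queued = True # the proposed drm_sched_wakeup()

def scenario(with_fix):
    sched = Scheduler()
    a1 = sched.submit(Job("a1", "A"))
    b1 = sched.submit(Job("b1", "B", dep=a1.scheduled))
    sched.killed.add("A")                 # A killed before run_job(a1)
    sched.run_work()                      # b1 still blocked -> scheduler idles
    sched.kill_jobs_work([a1], with_fix)  # clears b1's dep via the callback
    sched.run_work()                      # runs only if something woke it
    return b1.ran
```

Running `scenario(False)` leaves b1 unrun (the hang), while
`scenario(True)` lets b1 execute, which is exactly what adding
drm_sched_wakeup() to the kill-jobs work is meant to achieve.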