On 15.07.25 14:20, Philipp Stanner wrote:
> On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
>> On 15.07.25 12:27, Philipp Stanner wrote:
>>> On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
>>>>
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Hi Philipp,
>>>>
>>>> Thanks for your review, let me try to clarify why I added
>>>> drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
>>>>
>>>> When application A submits jobs (a1, a2, a3) and application B
>>>> submits job b1 with a dependency on a1's scheduled fence, the
>>>> normal execution flow is (function drm_sched_run_job_work()):
>>>> 1. a1 gets popped from the entity by the scheduler
>>>> 2. run_job(a1) executes
>>>> 3. a1's scheduled fence gets signaled
>>>> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
>>>> 5. The scheduler wakes up and re-selects entities to pop jobs
>>>> 6. Since b1's dependency is cleared, the scheduler can select b1
>>>> and continue the same flow
>>>>
>>>> However, if application A is killed before a1 gets popped by the
>>>> scheduler, then a1, a2, a3 are killed sequentially by
>>>> drm_sched_entity_kill_jobs_cb(). During the kill process, their
>>>> scheduled fences are still signaled, but the kill process itself
>>>> never queues work_run_job. This means b1's dependency gets
>>>> cleared, but there's no work_run_job to drive the scheduler to
>>>> continue running; in other words, step 4 in the normal flow is
>>>> missing, so the scheduler enters sleep state, which causes
>>>> application B to hang.
>>>> So if we add drm_sched_wakeup() in
>>>> drm_sched_entity_kill_jobs_work() after
>>>> drm_sched_fence_scheduled(), the scheduler can be woken up, and
>>>> application B can continue running after a1's scheduled fence is
>>>> force signaled.
>>>
>>> Thanks for the detailed explanation. Makes sense.
>>> Maybe you can detail in v3 that this "optimization" consists of
>>> the work item not being scheduled.
>>
>> Yeah, removing this "optimization" would also be a valid solution
>> to the problem.
>
> If you at AMD are willing to work on that, that'd definitely be
> fine. Removing code is usually better than adding code.
I fear I won't have time for that before my retirement :)

> Do you think being afraid of a performance regression is realistic
> here, though?

No, absolutely not. It was just a micro optimization done long ago.

On the other hand, with the scheduler code base I'm not sure of
anything any more.

Regards,
Christian.

>
>
> P.
>
>>
>> Christian.
>>
>>> I think that was the piece of the puzzle I was missing.
>>>
>>> I / DRM tools will also include a link to this thread, so I think
>>> that will then be sufficient.
>>>
>>> Thx
>>> P.
>>>
>>>>
>>>> Thanks,
>>>> Lin
>>>>
>>>> From: Philipp Stanner <pha...@mailbox.org>
>>>> Sent: Tuesday, July 15, 2025 17:12
>>>> To: cao, lin <lin....@amd.com>; Koenig, Christian
>>>> <christian.koe...@amd.com>; pha...@kernel.org <pha...@kernel.org>;
>>>> dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>>> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily
>>>> <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>;
>>>> matthew.br...@intel.com <matthew.br...@intel.com>
>>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
>>>> killing app with dependent jobs
>>>>
>>>> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>>>>>
>>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>>
>>>>> Hi Christian, Philipp,
>>>>>
>>>>> I modified the commit msg, and I hope it makes more sense:
>>>>>
>>>>> drm/sched: wake up scheduler when killing jobs to prevent hang
>>>>
>>>> nit:
>>>> s/wake/Wake
>>>>
>>>>>
>>>>> When application A submits jobs (a1, a2, a3) and application B
>>>>> submits job b1 with a dependency on a2's scheduler fence,
>>>>> killing application A before run_job(a1) causes
>>>>> drm_sched_entity_kill_jobs_work() to force signal all jobs
>>>>> sequentially. However, the optimization in
>>>>> drm_sched_entity_add_dependency_cb() waits for the fence to be
>>>>> scheduled by adding a callback (drm_sched_entity_clear_dep)
>>>>> instead of immediately waking up the scheduler. When A is
>>>>> killed before its jobs can run, the callback gets triggered,
>>>>> but drm_sched_entity_clear_dep() only clears the dependency
>>>>> without waking up the scheduler, causing the scheduler to
>>>>> enter sleep state and application B to hang.
>>>>
>>>> That now reads as if drm_sched_entity_clear_dep() is supposed to
>>>> wake up the scheduler. But you add the wakeup to a different
>>>> function (the work item).
>>>>
>>>> Also, the code actually looks like this:
>>>>
>>>>                 xa_erase(&job->dependencies, index);
>>>>                 if (f && !dma_fence_add_callback(f, &job->finish_cb,
>>>>                                                  drm_sched_entity_kill_jobs_cb))
>>>>                         return;
>>>>
>>>>                 dma_fence_put(f);
>>>>         }
>>>>
>>>>         INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>>>>         schedule_work(&job->work);
>>>> }
>>>>
>>>> So if adding that callback succeeds we.. return immediately, but
>>>> we.. do that for the first dependency, not for all of them?
>>>>
>>>> Oh boy. That code base. We(tm) should look into pimping that up.
>>>> Also lacks documentation everywhere.
>>>>
>>>> Anyways. If this solves a bug for you, the patch looks fine
>>>> (i.e., not potentially harmful) by me, and if the tests succeed
>>>> we can merge it. However, I'd feel better if you could clarify
>>>> more why that function is the right place to solve the bug.
>>>>
>>>> Thanks,
>>>> P.
>>>>
>>>>
>>>>>
>>>>> Add drm_sched_wakeup() in entity_kill_job_work() to prevent
>>>>> scheduler sleep and subsequent application hangs.
>>>>>
>>>>> v2:
>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>
>>>>> Thanks,
>>>>> Lin
>>>>>
>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>> Sent: Monday, July 14, 2025 21:39
>>>>> To: pha...@kernel.org <pha...@kernel.org>; cao, lin <lin....@amd.com>;
>>>>> dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>>>> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily
>>>>> <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>;
>>>>> matthew.br...@intel.com <matthew.br...@intel.com>
>>>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
>>>>> killing app with dependent jobs
>>>>>
>>>>> On 14.07.25 15:27, Philipp Stanner wrote:
>>>>>> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>>>>>>
>>>>>>> On 14.07.25 14:46, Philipp Stanner wrote:
>>>>>>>> regarding the patch subject: the prefix we use for the
>>>>>>>> scheduler is:
>>>>>>>> drm/sched:
>>>>>>>>
>>>>>>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>>>>>>
>>>>>>>>> When Application A submits jobs (a1, a2, a3) and
>>>>>>>>> application B submits
>>>>>>>>
>>>>>>>> s/Application/application
>>>>>>>>
>>>>>>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>>>>>>> application A before run_job(a1) causes
>>>>>>>>> drm_sched_entity_kill_jobs_work() to force
>>>>>>>>> signal all jobs sequentially. However, due to missing
>>>>>>>>> work_run_job or work_free_job
>>>>>>>>
>>>>>>>> You probably mean "due to the work items work_run_job and
>>>>>>>> work_free_job not being started […]".
>>>>>>>>
>>>>>>>> However, the call you add, drm_sched_wakeup(), immediately
>>>>>>>> only triggers work_run_job. It's not clear to me why you
>>>>>>>> mention work_free_job, because
>>>>>>>> drm_sched_entity_kill_jobs_work() directly invokes the
>>>>>>>> free_job() callback.
>>>>>>>
>>>>>>> Yeah, the description is rather confusing; it took me more
>>>>>>> than one try to understand what's going on here as well. Let
>>>>>>> me try to explain it in my words:
>>>>>>>
>>>>>>> When an application A is killed, the pending submissions from
>>>>>>> it are not executed, but rather torn down by
>>>>>>> drm_sched_entity_kill_jobs_work().
>>>>>>>
>>>>>>> If now a submission from application B depends on something
>>>>>>> application A wanted to do on the same scheduler instance,
>>>>>>> the function drm_sched_entity_add_dependency_cb() would have
>>>>>>> optimized this dependency and assumed that the scheduler work
>>>>>>> would already run and try to pick a job.
>>>>>>>
>>>>>>> But that isn't the case when the submissions from application
>>>>>>> A are all killed. So make sure to start the scheduler work
>>>>>>> from drm_sched_entity_kill_jobs_work() to handle that case.
>>>>>>
>>>>>> Alright, so the bug then depends on B setting a dependency on
>>>>>> A _after_ A was killed, doesn't it? Because the optimization
>>>>>> in _add_dependency_cb() can only have an effect after A's jobs
>>>>>> have been torn down.
>>>>>
>>>>> No, application A and application B submit right behind each
>>>>> other. Application A is then killed later on.
>>>>>
>>>>> The optimization in _add_dependency_cb() just waits for A's
>>>>> submission to be handled by the scheduler, but that never
>>>>> happens because it was killed.
>>>>>
>>>>>> If there is such a timing order problem, the commit message
>>>>>> should mention it.
>>>>>>
>>>>>> The commit message definitely also needs to mention this
>>>>>> optimization in drm_sched_entity_add_dependency_cb() since
>>>>>> it's essential for understanding the bug.
>>>>>
>>>>> +1
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Danke
>>>>>> P.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>>>>>>> state, causing application B to hang.
>>>>>>>>
>>>>>>>> So the issue is that if a1 never ran, the scheduler will not
>>>>>>>> continue with the jobs of application B, correct?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Add drm_sched_wakeup() when entity_kill_job_work() to
>>>>>>>>> preventing
>>>>>>>>
>>>>>>>> s/to preventing/to prevent
>>>>>>>>
>>>>>>>>> scheduler sleep and subsequent application hangs.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>>>>
>>>>>>>> Changelogs are cool and useful, but please move them below
>>>>>>>> the Signed-off area with another ---, like so:
>>>>>>>>
>>>>>>>> Signed-off by: …
>>>>>>>> ---
>>>>>>>> v2:
>>>>>>>> …
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Lin.Cao <linca...@amd.com>
>>>>>>>>
>>>>>>>> This fixes a bug. Even a racy bug, it seems. So we need a
>>>>>>>> Fixes tag and to put the stable kernel in CC.
>>>>>>>>
>>>>>>>> Please figure out with git blame, git log and git tag
>>>>>>>> --contains which commit your patch fixes and which is the
>>>>>>>> oldest kernel that contains the bug. Then add a Fixes: tag
>>>>>>>> and Cc: the stable kernel. You'll find lots of examples for
>>>>>>>> that in git log.
>>>>>>>>
>>>>>>>> Thx
>>>>>>>> P.
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>>>>  1 file changed, 1 insertion(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> index e671aa241720..66f2a43c58fd 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>>>>>>>         struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>>>>>>>>
>>>>>>>>>         drm_sched_fence_scheduled(job->s_fence, NULL);
>>>>>>>>> +       drm_sched_wakeup(job->sched);
>>>>>>>>>         drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>>>>>>>         WARN_ON(job->s_fence->parent);
>>>>>>>>>         job->sched->ops->free_job(job);
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
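For reference, below is a short sketch of the two functions this thread
revolves around. drm_sched_entity_clear_dep() is the callback installed
by the optimization in drm_sched_entity_add_dependency_cb(); it is
reproduced roughly as it appears in recent kernels and may differ in
detail across versions. Note that it only clears the entity's dependency
pointer and never queues scheduler work; normally the wakeup comes from
drm_sched_run_job_work() after the dependency's job actually ran, which
is exactly the step the kill path was missing. The second function shows
drm_sched_entity_kill_jobs_work() with the one-line fix from the patch
above applied.

	/*
	 * Callback used by the same-scheduler dependency optimization
	 * (sketch, recent kernels). It clears the dependency but does
	 * not wake the scheduler.
	 */
	static void drm_sched_entity_clear_dep(struct dma_fence *f,
					       struct dma_fence_cb *cb)
	{
		struct drm_sched_entity *entity =
			container_of(cb, struct drm_sched_entity, cb);

		entity->dependency = NULL;
		dma_fence_put(f);
	}

	/*
	 * Kill-path work item with the fix applied: jobs that are torn
	 * down instead of run still signal their fences, and the added
	 * drm_sched_wakeup() ensures that entities whose dependencies
	 * just cleared (application B above) are picked up again.
	 */
	static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
	{
		struct drm_sched_job *job = container_of(wrk, typeof(*job), work);

		drm_sched_fence_scheduled(job->s_fence, NULL);
		drm_sched_wakeup(job->sched);	/* the added call */
		drm_sched_fence_finished(job->s_fence, -ESRCH);
		WARN_ON(job->s_fence->parent);
		job->sched->ops->free_job(job);
	}

Of the two options discussed above (adding the wakeup versus removing
the optimization entirely), the thread settles on the one-line wakeup as
the less invasive fix.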