On 15.07.25 14:20, Philipp Stanner wrote:
> On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
>> On 15.07.25 12:27, Philipp Stanner wrote:
>>> On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
>>>>
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Hi Philipp,
>>>>
>>>> Thanks for your review, let me try to clarify why I added
>>>> drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
>>>>
>>>> When application A submits jobs (a1, a2, a3) and application B
>>>> submits job b1 with a dependency on a1's scheduled fence, the
>>>> normal execution flow is (function drm_sched_run_job_work()):
>>>> 1. a1 gets popped from the entity by the scheduler
>>>> 2. run_job(a1) executes
>>>> 3. a1's scheduled fence gets signaled
>>>> 4. drm_sched_run_job_work() calls drm_sched_run_job_queue() at the end
>>>> 5. The scheduler wakes up and re-selects entities to pop jobs
>>>> 6. Since b1's dependency is cleared, the scheduler can select b1
>>>> and continue the same flow
>>>>
>>>> However, if application A is killed before a1 gets popped by the
>>>> scheduler, then a1, a2, a3 are killed sequentially by
>>>> drm_sched_entity_kill_jobs_cb(). During the kill process, their
>>>> scheduled fences are still signaled, but the kill process itself
>>>> never queues work_run_job. This means b1's dependency gets
>>>> cleared, but there's no work_run_job to drive the scheduler to
>>>> continue running; in other words, step 4 in the normal flow is
>>>> missing, so the scheduler enters sleep state, which causes
>>>> application B to hang.
>>>> So if we add drm_sched_wakeup() in
>>>> drm_sched_entity_kill_jobs_work() after
>>>> drm_sched_fence_scheduled(), the scheduler can be woken up, and
>>>> application B can continue running after a1's scheduled fence is
>>>> force signaled.
>>>
>>> Thanks for the detailed explanation. Makes sense.
>>> Maybe you can detail in v3 that this "optimization" consists of
>>> the work item not being scheduled.
>>
>> Yeah, removing this "optimization" would also be a valid solution
>> to the problem.
>
> If you at AMD are willing to work on that, that'd definitely be
> fine. Removing code is usually better than adding code.
I fear I won't have time for that before my retirement :)

> Do you think being afraid of a performance regression is realistic
> here, though?

No, absolutely not. It was just a micro optimization done long ago.

On the other hand, with the scheduler code base I'm not sure of
anything any more.

Regards,
Christian.

>
>
> P.
>
>>
>> Christian.
>>
>>> I think that was the piece of the puzzle I was missing.
>>>
>>> I / DRM tools will also include a link to this thread, so I think
>>> that will then be sufficient.
>>>
>>> Thx
>>> P.
>>>
>>>>
>>>> Thanks,
>>>> Lin
>>>>
>>>> From: Philipp Stanner <pha...@mailbox.org>
>>>> Sent: Tuesday, July 15, 2025 17:12
>>>> To: cao, lin <lin....@amd.com>; Koenig, Christian
>>>> <christian.koe...@amd.com>; pha...@kernel.org <pha...@kernel.org>;
>>>> dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>>> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily
>>>> <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>;
>>>> matthew.br...@intel.com <matthew.br...@intel.com>
>>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
>>>> killing app with dependent jobs
>>>>
>>>> On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
>>>>>
>>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>>
>>>>> Hi Christian, Philipp,
>>>>>
>>>>> I modified the commit msg, and I hope it makes more sense:
>>>>>
>>>>> drm/sched: wake up scheduler when killing jobs to prevent hang
>>>>
>>>> nit:
>>>> s/wake/Wake
>>>>
>>>>>
>>>>> When application A submits jobs (a1, a2, a3) and application B
>>>>> submits job b1 with a dependency on a2's scheduler fence,
>>>>> killing application A before run_job(a1) causes
>>>>> drm_sched_entity_kill_jobs_work() to force signal all jobs
>>>>> sequentially. However, the optimization in
>>>>> drm_sched_entity_add_dependency_cb() waits for the fence to be
>>>>> scheduled by adding a callback (drm_sched_entity_clear_dep)
>>>>> instead of immediately waking up the scheduler. When A is
>>>>> killed before its jobs can run, the callback gets triggered,
>>>>> but drm_sched_entity_clear_dep() only clears the dependency
>>>>> without waking up the scheduler, causing the scheduler to
>>>>> enter sleep state and application B to hang.
>>>>
>>>> That now reads as if drm_sched_entity_clear_dep() is supposed to
>>>> wake up the scheduler. But you add the wakeup to a different
>>>> function (the work item).
>>>>
>>>> Also, the code actually looks like this:
>>>>
>>>>                 xa_erase(&job->dependencies, index);
>>>>                 if (f && !dma_fence_add_callback(f, &job->finish_cb,
>>>>                                                  drm_sched_entity_kill_jobs_cb))
>>>>                         return;
>>>>
>>>>                 dma_fence_put(f);
>>>>         }
>>>>
>>>>         INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>>>>         schedule_work(&job->work);
>>>> }
>>>>
>>>> So if adding that callback succeeds we.. return immediately, but
>>>> we.. do that for the first dependency, not for all of them?
>>>>
>>>> Oh boy. That code base. We(tm) should look into pimping that up.
>>>> Also lacks documentation everywhere.
>>>>
>>>> Anyways. If this solves a bug for you, the patch looks fine
>>>> (i.e., not potentially harmful) by me, and if the tests succeed
>>>> we can merge it. However, I'd feel better if you could clarify
>>>> more why that function is the right place to solve the bug.
>>>>
>>>> Thanks,
>>>> P.
>>>>
>>>>
>>>>>
>>>>> Add drm_sched_wakeup() in entity_kill_job_work() to prevent
>>>>> scheduler sleep and subsequent application hangs.
>>>>>
>>>>> v2:
>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>
>>>>> Thanks,
>>>>> Lin
>>>>>
>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>> Sent: Monday, July 14, 2025 21:39
>>>>> To: pha...@kernel.org <pha...@kernel.org>; cao, lin <lin....@amd.com>;
>>>>> dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
>>>>> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily
>>>>> <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>;
>>>>> matthew.br...@intel.com <matthew.br...@intel.com>
>>>>> Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
>>>>> killing app with dependent jobs
>>>>>
>>>>> On 14.07.25 15:27, Philipp Stanner wrote:
>>>>>> On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
>>>>>>>
>>>>>>> On 14.07.25 14:46, Philipp Stanner wrote:
>>>>>>>> regarding the patch subject: the prefix we use for the
>>>>>>>> scheduler is:
>>>>>>>> drm/sched:
>>>>>>>>
>>>>>>>> On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
>>>>>>>>
>>>>>>>>> When Application A submits jobs (a1, a2, a3) and
>>>>>>>>> application B submits
>>>>>>>>
>>>>>>>> s/Application/application
>>>>>>>>
>>>>>>>>> job b1 with a dependency on a2's scheduler fence, killing
>>>>>>>>> application A before run_job(a1) causes
>>>>>>>>> drm_sched_entity_kill_jobs_work() to force
>>>>>>>>> signal all jobs sequentially. However, due to missing
>>>>>>>>> work_run_job or work_free_job
>>>>>>>>
>>>>>>>> You probably mean "due to the work items work_run_job and
>>>>>>>> work_free_job not being started […]".
>>>>>>>>
>>>>>>>> However, the call you add, drm_sched_wakeup(), immediately
>>>>>>>> only triggers work_run_job. It's not clear to me why you
>>>>>>>> mention work_free_job, because
>>>>>>>> drm_sched_entity_kill_jobs_work() directly invokes the
>>>>>>>> free_job() callback.
>>>>>>>
>>>>>>> Yeah, the description is rather confusing; it took me more
>>>>>>> than one try to understand what's going on here as well. Let
>>>>>>> me try to explain it in my words:
>>>>>>>
>>>>>>> When an application A is killed, the pending submissions from
>>>>>>> it are not executed, but rather torn down by
>>>>>>> drm_sched_entity_kill_jobs_work().
>>>>>>>
>>>>>>> If now a submission from application B depends on something
>>>>>>> application A wanted to do on the same scheduler instance,
>>>>>>> the function drm_sched_entity_add_dependency_cb() would have
>>>>>>> optimized this dependency and assumed that the scheduler work
>>>>>>> would already run and try to pick a job.
>>>>>>>
>>>>>>> But that isn't the case when the submissions from application
>>>>>>> A are all killed. So make sure to start the scheduler work
>>>>>>> from drm_sched_entity_kill_jobs_work() to handle that case.
>>>>>>
>>>>>> Alright, so the bug then depends on B setting a dependency on
>>>>>> A _after_ A was killed, doesn't it? Because the optimization
>>>>>> in _add_dependency_cb() can only have an effect after A's jobs
>>>>>> have been torn down.
>>>>>
>>>>> No, application A and application B submit right behind each
>>>>> other. Application A is then killed later on.
>>>>>
>>>>> The optimization in _add_dependency_cb() just waits for A's
>>>>> submission to be handled by the scheduler, but that never
>>>>> happens because it was killed.
>>>>>
>>>>>> If there is such a timing order problem, the commit message
>>>>>> should mention it.
>>>>>>
>>>>>> The commit message definitely also needs to mention this
>>>>>> optimization in drm_sched_entity_add_dependency_cb() since
>>>>>> it's essential for understanding the bug.
>>>>>
>>>>> +1
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Danke
>>>>>> P.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>> in entity_kill_job_work(), the scheduler enters sleep
>>>>>>>>> state, causing application B to hang.
>>>>>>>>
>>>>>>>> So the issue is that if a1 never ran, the scheduler will not
>>>>>>>> continue with the jobs of application B, correct?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Add drm_sched_wakeup() when entity_kill_job_work() to
>>>>>>>>> preventing
>>>>>>>>
>>>>>>>> s/to preventing/to prevent
>>>>>>>>
>>>>>>>>> scheduler sleep and subsequent application hangs.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> - Move drm_sched_wakeup() to after drm_sched_fence_scheduled()
>>>>>>>>
>>>>>>>> Changelogs are cool and useful, but please move them below
>>>>>>>> the Signed-off area with another ---, like so:
>>>>>>>>
>>>>>>>> Signed-off by: …
>>>>>>>> ---
>>>>>>>> v2:
>>>>>>>> …
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Signed-off-by: Lin.Cao <linca...@amd.com>
>>>>>>>>
>>>>>>>> This fixes a bug. Even a racy bug, it seems. So we need a
>>>>>>>> Fixes tag and to put the stable kernel in CC.
>>>>>>>>
>>>>>>>> Please figure out with git blame, git log and git tag
>>>>>>>> --contains which commit your patch fixes and which is the
>>>>>>>> oldest kernel that contains the bug. Then add a Fixes: tag
>>>>>>>> and Cc: the stable kernel. You'll find lots of examples for
>>>>>>>> that in git log.
>>>>>>>>
>>>>>>>> Thx
>>>>>>>> P.
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>>>>>>>  1 file changed, 1 insertion(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> index e671aa241720..66f2a43c58fd 100644
>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>>>>>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>>>>>>>         struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>>>>>>>>
>>>>>>>>>         drm_sched_fence_scheduled(job->s_fence, NULL);
>>>>>>>>> +       drm_sched_wakeup(job->sched);
>>>>>>>>>         drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>>>>>>>         WARN_ON(job->s_fence->parent);
>>>>>>>>>         job->sched->ops->free_job(job);
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
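For reference, below is a short sketch of the two functions this thread
revolves around. drm_sched_entity_clear_dep() is the callback installed
by the optimization in drm_sched_entity_add_dependency_cb(); it is
reproduced roughly as it appears in recent kernels and may differ in
detail across versions. Note that it only clears the entity's dependency
pointer and never queues scheduler work; normally the wakeup comes from
drm_sched_run_job_work() after the dependency's job actually ran, which
is exactly the step the kill path was missing. The second function shows
drm_sched_entity_kill_jobs_work() with the one-line fix from the patch
above applied.

	/*
	 * Callback used by the same-scheduler dependency optimization
	 * (sketch, recent kernels). It clears the dependency but does
	 * not wake the scheduler.
	 */
	static void drm_sched_entity_clear_dep(struct dma_fence *f,
					       struct dma_fence_cb *cb)
	{
		struct drm_sched_entity *entity =
			container_of(cb, struct drm_sched_entity, cb);

		entity->dependency = NULL;
		dma_fence_put(f);
	}

	/*
	 * Kill-path work item with the fix applied: jobs that are torn
	 * down instead of run still signal their fences, and the added
	 * drm_sched_wakeup() ensures that entities whose dependencies
	 * just cleared (application B above) are picked up again.
	 */
	static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
	{
		struct drm_sched_job *job = container_of(wrk, typeof(*job), work);

		drm_sched_fence_scheduled(job->s_fence, NULL);
		drm_sched_wakeup(job->sched);	/* the added call */
		drm_sched_fence_finished(job->s_fence, -ESRCH);
		WARN_ON(job->s_fence->parent);
		job->sched->ops->free_job(job);
	}

Of the two options discussed above (adding the wakeup versus removing
the optimization entirely), the thread settles on the one-line wakeup as
the less invasive fix.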