[AMD Official Use Only - AMD Internal Distribution Only]

Hi Philipp, Christian,
Thanks for the kind words! :)

I've reconsidered the approach and decided to go with the "removing
code" solution instead. I'll drop the v3 patch and submit a new v4
that removes the problematic optimization in
drm_sched_entity_add_dependency_cb().

Thanks,
Lin

________________________________
From: Philipp Stanner <pha...@mailbox.org>
Sent: Tuesday, July 15, 2025 21:07
To: Koenig, Christian <christian.koe...@amd.com>; pha...@kernel.org <pha...@kernel.org>; cao, lin <lin....@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>
Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>; matthew.br...@intel.com <matthew.br...@intel.com>
Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when killing app with dependent jobs

On Tue, 2025-07-15 at 14:32 +0200, Christian König wrote:
> On 15.07.25 14:20, Philipp Stanner wrote:
> > On Tue, 2025-07-15 at 12:52 +0200, Christian König wrote:
> > > On 15.07.25 12:27, Philipp Stanner wrote:
> > > > On Tue, 2025-07-15 at 09:51 +0000, cao, lin wrote:
> > > > >
> > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > >
> > > > > Hi Philipp,
> > > > >
> > > > > Thanks for your review, let me try to clarify why I added
> > > > > drm_sched_wakeup() to drm_sched_entity_kill_jobs_work():
> > > > >
> > > > > When application A submits jobs (a1, a2, a3) and application
> > > > > B submits job b1 with a dependency on a1's scheduled fence,
> > > > > the normal execution flow is (function
> > > > > drm_sched_run_job_work()):
> > > > > 1. a1 gets popped from the entity by the scheduler
> > > > > 2. run_job(a1) executes
> > > > > 3. a1's scheduled fence gets signaled
> > > > > 4. drm_sched_run_job_work() calls drm_sched_run_job_queue()
> > > > >    at the end
> > > > > 5. The scheduler wakes up and re-selects entities to pop jobs
> > > > > 6. Since b1's dependency is cleared, the scheduler can select
> > > > >    b1 and continue the same flow
> > > > >
> > > > > However, if application A is killed before a1 gets popped by
> > > > > the scheduler, then a1, a2, a3 are killed sequentially by
> > > > > drm_sched_entity_kill_jobs_cb(). During the kill process,
> > > > > their scheduled fences are still signaled, but the kill
> > > > > process itself lacks work_run_job. This means b1's dependency
> > > > > gets cleared, but there's no work_run_job to drive the
> > > > > scheduler to continue running, or we can say step 4 in the
> > > > > normal flow is missing, and the scheduler enters sleep state,
> > > > > which causes application B to hang.
> > > > >
> > > > > So if we add drm_sched_wakeup() in
> > > > > drm_sched_entity_kill_jobs_work() after
> > > > > drm_sched_fence_scheduled(), the scheduler can be woken up,
> > > > > and application B can continue running after a1's scheduled
> > > > > fence is force signaled.
> > > >
> > > > Thanks for the detailed explanation. Makes sense.
> > > > Maybe you can detail in v3 that this "optimization" consists
> > > > of the work item not being scheduled.
> > >
> > > Yeah, removing this "optimization" would also be a valid
> > > solution to the problem.
> >
> > If you at AMD are willing to work on that, that'd definitely be
> > fine. Removing code is usually better than adding code.
>
> I fear I won't have time for that before my retirement :)

You've got fresh, powerful folks like Lin! :)
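For readers joining the thread here: the two dependency callbacks at
the heart of this discussion look roughly like this. This is a
condensed sketch of the drivers/gpu/drm/scheduler/sched_entity.c code
being debated, not verbatim source, and details vary between kernel
versions:

    /* Sketch: the "optimization" arms this callback on the
     * dependency's scheduled fence. It only clears the entity's
     * dependency pointer; nothing here queues work_run_job again. */
    static void drm_sched_entity_clear_dep(struct dma_fence *f,
                                           struct dma_fence_cb *cb)
    {
            struct drm_sched_entity *entity =
                    container_of(cb, struct drm_sched_entity, cb);

            entity->dependency = NULL;
            dma_fence_put(f);
    }

    /* Sketch: the non-optimized path arms this callback instead,
     * which clears the dependency and additionally wakes the
     * scheduler so it re-evaluates its entities. */
    static void drm_sched_entity_wakeup(struct dma_fence *f,
                                        struct dma_fence_cb *cb)
    {
            struct drm_sched_entity *entity =
                    container_of(cb, struct drm_sched_entity, cb);

            drm_sched_entity_clear_dep(f, cb);
            drm_sched_wakeup(entity->rq->sched);
    }

Removing the optimization essentially means always taking the
drm_sched_entity_wakeup() path, so a wakeup happens even when the
producer's jobs are torn down instead of run.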
But either solution is fine. If you just want it fixed, we can merge
the existing approach.

> > Do you think being afraid of a performance regression is realistic
> > here, though?
>
> No, absolutely not. It was just a micro optimization done long ago.
>
> On the other hand, with the scheduler code base I'm not sure of
> anything any more.

"In my subsystem you have to run as fast as you can just to remain at
the same place", said the Red Queen to Alice. :)

P.

>
> Regards,
> Christian.
>
> >
> > P.
> >
> > >
> > > Christian.
> > >
> > > > I think that was the piece of the puzzle I was missing.
> > > >
> > > > I / DRM tools will also include a link to this thread, so I
> > > > think that will then be sufficient.
> > > >
> > > > Thx
> > > > P.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Lin
> > > > >
> > > > > From: Philipp Stanner <pha...@mailbox.org>
> > > > > Sent: Tuesday, July 15, 2025 17:12
> > > > > To: cao, lin <lin....@amd.com>; Koenig, Christian
> > > > > <christian.koe...@amd.com>; pha...@kernel.org
> > > > > <pha...@kernel.org>; dri-devel@lists.freedesktop.org
> > > > > <dri-devel@lists.freedesktop.org>
> > > > > Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily
> > > > > <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>;
> > > > > matthew.br...@intel.com <matthew.br...@intel.com>
> > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > killing app with dependent jobs
> > > > >
> > > > > On Tue, 2025-07-15 at 03:38 +0000, cao, lin wrote:
> > > > > >
> > > > > > [AMD Official Use Only - AMD Internal Distribution Only]
> > > > > >
> > > > > > Hi Christian, Philipp,
> > > > > >
> > > > > > I modified the commit msg, and I hope it makes more sense:
> > > > > >
> > > > > > drm/sched: wake up scheduler when killing jobs to prevent
> > > > > > hang
> > > > >
> > > > > nit:
> > > > > s/wake/Wake
> > > > >
> > > > > > When application A submits jobs (a1, a2, a3) and
> > > > > > application B submits job b1 with a dependency on a2's
> > > > > > scheduler fence, killing application A before run_job(a1)
> > > > > > causes drm_sched_entity_kill_jobs_work() to force signal
> > > > > > all jobs sequentially. However, the optimization in
> > > > > > drm_sched_entity_add_dependency_cb() waits for the fence
> > > > > > to be scheduled by adding a callback
> > > > > > (drm_sched_entity_clear_dep) instead of immediately waking
> > > > > > up the scheduler. When A is killed before its jobs can run,
> > > > > > the callback gets triggered but
> > > > > > drm_sched_entity_clear_dep() only clears the dependency
> > > > > > without waking up the scheduler, causing the scheduler to
> > > > > > enter sleep state and application B to hang.
> > > > >
> > > > > That now reads as if drm_sched_entity_clear_dep() is
> > > > > supposed to wake up the scheduler. But you add the wakeup to
> > > > > a different function (the work item).
> > > > >
> > > > > Also the code actually looks like that:
> > > > >
> > > > >                 xa_erase(&job->dependencies, index);
> > > > >                 if (f && !dma_fence_add_callback(f, &job->finish_cb,
> > > > >                                                  drm_sched_entity_kill_jobs_cb))
> > > > >                         return;
> > > > >
> > > > >                 dma_fence_put(f);
> > > > >         }
> > > > >
> > > > >         INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
> > > > >         schedule_work(&job->work);
> > > > > }
> > > > >
> > > > > So if adding that callback succeeds we.. return immediately
> > > > > but we.. do that for the first dependency, not for all of
> > > > > them?
> > > > >
> > > > > Oh boy. That code base. We(tm) should look into pimping that
> > > > > up. Also lacks documentation everywhere.
> > > > >
> > > > > Anyways. If this solves a bug for you the patch looks fine
> > > > > (i.e., not potentially harmful) by me and if the tests
> > > > > succeed we can merge it. However, I'd feel better if you
> > > > > could clarify more why that function is the right place to
> > > > > solve the bug.
> > > > >
> > > > > Thanks,
> > > > > P.
> > > > >
> > > > > >
> > > > > > Add drm_sched_wakeup() in entity_kill_job_work() to
> > > > > > prevent scheduler sleep and subsequent application hangs.
> > > > > >
> > > > > > v2:
> > > > > > - Move drm_sched_wakeup() to after
> > > > > >   drm_sched_fence_scheduled()
> > > > > >
> > > > > > Thanks,
> > > > > > Lin
> > > > > >
> > > > > > From: Koenig, Christian <christian.koe...@amd.com>
> > > > > > Sent: Monday, July 14, 2025 21:39
> > > > > > To: pha...@kernel.org <pha...@kernel.org>; cao, lin
> > > > > > <lin....@amd.com>; dri-devel@lists.freedesktop.org
> > > > > > <dri-devel@lists.freedesktop.org>
> > > > > > Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily
> > > > > > <emily.d...@amd.com>; d...@kernel.org <d...@kernel.org>;
> > > > > > matthew.br...@intel.com <matthew.br...@intel.com>
> > > > > > Subject: Re: [PATCH v2] drm/scheduler: Fix sched hang when
> > > > > > killing app with dependent jobs
> > > > > >
> > > > > > On 14.07.25 15:27, Philipp Stanner wrote:
> > > > > > > On Mon, 2025-07-14 at 15:08 +0200, Christian König wrote:
> > > > > > > >
> > > > > > > > On 14.07.25 14:46, Philipp Stanner wrote:
> > > > > > > > > regarding the patch subject: the prefix we use for
> > > > > > > > > the scheduler is:
> > > > > > > > > drm/sched:
> > > > > > > > >
> > > > > > > > > On Mon, 2025-07-14 at 14:23 +0800, Lin.Cao wrote:
> > > > > > > > > > When Application A submits jobs (a1, a2, a3) and
> > > > > > > > > > application B submits
> > > > > > > > >
> > > > > > > > > s/Application/application
> > > > > > > > >
> > > > > > > > > > job b1 with a dependency on a2's scheduler fence,
> > > > > > > > > > killing application A before run_job(a1) causes
> > > > > > > > > > drm_sched_entity_kill_jobs_work() to force signal
> > > > > > > > > > all jobs sequentially.
> > > > > > > > > > However, due to missing work_run_job or
> > > > > > > > > > work_free_job
> > > > > > > > >
> > > > > > > > > You probably mean "due to the work items work_run_job
> > > > > > > > > and work_free_job not being started […]".
> > > > > > > > >
> > > > > > > > > However, the call you add, drm_sched_wakeup(),
> > > > > > > > > immediately only triggers work_run_job. It's not
> > > > > > > > > clear to me why you mention work_free_job, because
> > > > > > > > > drm_sched_entity_kill_jobs_work() directly invokes
> > > > > > > > > the free_job() callback.
> > > > > > > >
> > > > > > > > Yeah, the description is rather confusing, took me more
> > > > > > > > than one try to understand what's going on here as
> > > > > > > > well. Let me try to explain it in my words:
> > > > > > > >
> > > > > > > > When an application A is killed, the pending
> > > > > > > > submissions from it are not executed, but rather torn
> > > > > > > > down by drm_sched_entity_kill_jobs_work().
> > > > > > > >
> > > > > > > > If now a submission from application B depends on
> > > > > > > > something application A wanted to do on the same
> > > > > > > > scheduler instance, the function
> > > > > > > > drm_sched_entity_add_dependency_cb() would have
> > > > > > > > optimized this dependency and assumed that the
> > > > > > > > scheduler work would already run and try to pick a job.
> > > > > > > >
> > > > > > > > But that isn't the case when the submissions from
> > > > > > > > application A are all killed. So make sure to start the
> > > > > > > > scheduler work from drm_sched_entity_kill_jobs_work()
> > > > > > > > to handle that case.
> > > > > > >
> > > > > > > Alright, so the bug then depends on B setting a
> > > > > > > dependency on A _after_ A was killed, doesn't it? Because
> > > > > > > the optimization in _add_dependency_cb() can only have an
> > > > > > > effect after A's jobs have been torn down.
> > > > > >
> > > > > > No, application A and application B submit right behind
> > > > > > each other. Application A is then killed later on.
> > > > > >
> > > > > > The optimization in _add_dependency_cb() just waits for A's
> > > > > > submission to be handled by the scheduler, but that never
> > > > > > happens because it was killed.
> > > > > >
> > > > > > > If there is such a timing order problem, the commit
> > > > > > > message should mention it.
> > > > > > >
> > > > > > > The commit message definitely also needs to mention this
> > > > > > > optimization in drm_sched_entity_add_dependency_cb()
> > > > > > > since it's essential for understanding the bug.
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Christian.
> > > > > >
> > > > > > > Danke
> > > > > > > P.
> > > > > > >
> > > > > > > > Regards,
> > > > > > > > Christian.
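To make the optimization Christian describes concrete, here is a
condensed sketch of the same-scheduler fast path in
drm_sched_entity_add_dependency_cb() (paraphrased from
sched_entity.c; names and conditions may differ slightly between
kernel versions):

    /* Sketch: if the dependency comes from the same scheduler, only
     * wait for it to be *scheduled* and merely clear the dependency,
     * assuming the scheduler is already running and will pick the
     * next job by itself. */
    s_fence = to_drm_sched_fence(fence);
    if (!fence->error && s_fence && s_fence->sched == sched) {
            fence = dma_fence_get(&s_fence->scheduled);
            dma_fence_put(entity->dependency);
            entity->dependency = fence;
            if (!dma_fence_add_callback(fence, &entity->cb,
                                        drm_sched_entity_clear_dep))
                    return true;

            /* Ignore it when it is already scheduled. */
            dma_fence_put(fence);
            return false;
    }

    /* Sketch: any other fence gets the wakeup callback, which also
     * queues the scheduler work once the fence signals. */
    if (!dma_fence_add_callback(entity->dependency, &entity->cb,
                                drm_sched_entity_wakeup))
            return true;

When application A is killed, drm_sched_entity_kill_jobs_work()
force-signals a1's scheduled fence, so drm_sched_entity_clear_dep()
fires and b1 becomes runnable, but nothing queues work_run_job and
the scheduler stays asleep. That is the broken assumption Christian
refers to.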
> > > > > > > > > > in entity_kill_job_work(), the scheduler enters
> > > > > > > > > > sleep state, causing application B hang.
> > > > > > > > >
> > > > > > > > > So the issue is that if a1 never ran, the scheduler
> > > > > > > > > will not continue with the jobs of application B,
> > > > > > > > > correct?
> > > > > > > > >
> > > > > > > > > > Add drm_sched_wakeup() when entity_kill_job_work()
> > > > > > > > > > to preventing
> > > > > > > > >
> > > > > > > > > s/to preventing/to prevent
> > > > > > > > >
> > > > > > > > > > scheduler sleep and subsequent application hangs.
> > > > > > > > > >
> > > > > > > > > > v2:
> > > > > > > > > > - Move drm_sched_wakeup() to after
> > > > > > > > > >   drm_sched_fence_scheduled()
> > > > > > > > >
> > > > > > > > > Changelogs are cool and useful, but move them below
> > > > > > > > > the Signed-off area with another --- please, like so:
> > > > > > > > >
> > > > > > > > > Signed-off by: …
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > …
> > > > > > > > > ---
> > > > > > > > >  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > >
> > > > > > > > > > Signed-off-by: Lin.Cao <linca...@amd.com>
> > > > > > > > >
> > > > > > > > > This fixes a bug. Even a racy bug, it seems. So we
> > > > > > > > > need a Fixes tag and put the stable kernel in CC.
> > > > > > > > >
> > > > > > > > > Please figure out with git blame, git log and git tag
> > > > > > > > > --contains which commit your patch fixes and which is
> > > > > > > > > the oldest kernel that contains the bug. Then add a
> > > > > > > > > Fixes: tag and Cc: the stable kernel. You'll find
> > > > > > > > > lots of examples for that in git log.
> > > > > > > > >
> > > > > > > > > Thx
> > > > > > > > > P.
> > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > index e671aa241720..66f2a43c58fd 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > > > > > > > @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > > > > > > > >         struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
> > > > > > > > > >
> > > > > > > > > >         drm_sched_fence_scheduled(job->s_fence, NULL);
> > > > > > > > > > +       drm_sched_wakeup(job->sched);
> > > > > > > > > >         drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > > > > > > > >         WARN_ON(job->s_fence->parent);
> > > > > > > > > >         job->sched->ops->free_job(job);
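A closing note on the snippet Philipp quotes further up: it is the
tail of drm_sched_entity_kill_jobs_cb(), which acts as both the
dma-fence callback and the loop that walks a job's dependencies. A
trimmed reading (simplified from sched_entity.c, with the
scheduled-fence special case elided and comments added):

    static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
                                              struct dma_fence_cb *cb)
    {
            struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
                                                     finish_cb);
            unsigned long index;

            dma_fence_put(f);       /* drop the signaled fence's reference */

            /* Walk the remaining dependencies one at a time. */
            xa_for_each(&job->dependencies, index, f) {
                    /* ... (scheduled-fence handling elided) ... */

                    xa_erase(&job->dependencies, index);
                    if (f && !dma_fence_add_callback(f, &job->finish_cb,
                                                     drm_sched_entity_kill_jobs_cb))
                            /* Re-armed: this function runs again when
                             * f signals and resumes the walk. */
                            return;

                    dma_fence_put(f);
            }

            /* No dependencies left: tear the job down from the work
             * item, i.e. the function patched above. */
            INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
            schedule_work(&job->work);
    }

So the early return Philipp wonders about is how the loop makes
progress: returning after the first still-pending dependency re-arms
the same callback on it, and the walk resumes on the next signal
until the xarray is empty.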