On Fri, 2025-05-16 at 02:21 +0000, cao, lin wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > Hi Philipp, > > > I updated the commit message to better clarify the issue. Would you > mind reviewing if this revised description adequately explains the > bug and the rationale behind the fix? > > > When an entity from application B is killed, drm_sched_entity_kill() > > removes all jobs belonging to that entity through > drm_sched_entity_kill_jobs_work(). If application A's job depends on > a > scheduled fence from application B's job, and that fence is not > properly > signaled during the killing process, application A's dependency > cannot be > cleared. > > This leads to application A hanging indefinitely while waiting for a > > dependency that will never be resolved. Fix this issue by ensuring > that > scheduled fences are properly signaled when an entity is killed, > allowing > dependent applications to continue execution.
That sounds perfect, yes, Thx. Reviewed-by: Philipp Stanner <pha...@kernel.org> P. > > Thanks, > Lin > > > > > > From: Koenig, Christian <christian.koe...@amd.com> > Sent: Thursday, May 15, 2025 20:39 > To: pha...@kernel.org <pha...@kernel.org>; cao, lin > <lin....@amd.com>; dri-devel@lists.freedesktop.org > <dri-devel@lists.freedesktop.org>; aamd-...@lists.freedesktop.org > <aamd-...@lists.freedesktop.org> > Cc: Chang, HaiJun <haijun.ch...@amd.com>; Yin, ZhenGuo (Chris) > <zhenguo....@amd.com>; Danilo Krummrich <d...@kernel.org> > Subject: Re: [PATCH] drm/scheduler: signal scheduled fence when kill > job > > > On 5/15/25 11:05, Philipp Stanner wrote: > > On Thu, 2025-05-15 at 10:48 +0200, Christian König wrote: > > > Explicitly adding the scheduler maintainers. > > > > > > On 5/15/25 04:07, Lin.Cao wrote: > > > > Previously we only signaled finished fence which may cause some > > > > submission's dependency cannot be cleared the cause benchmark > > > > hang. > > > > Signal both scheduled fence and finished fence could fix this > > > > issue. > > > > Code seems legit to me; but be so kind and also pimp up the commit > > message a bit, Christian. It's not very clear what the bug is and > > why > > setting the parent to NULL solves it. Or is the issue simply that > > the > > fence might be dropped unsignaled, being a bug by definition? Needs > > to > > be written down. > > The later, we simply forgot to signal the scheduled fence when an > application was killed. > > > Grammar is also a bit too broken. > > > > And running the unit tests before pushing is probably also a good > > idea. > > And maybe even writing a new unit test for that. > > Regards, > Christian. > > > > > > > > > > > Signed-off-by: Lin.Cao <linca...@amd.com> > > > > Acked-by: Philipp Stanner <pha...@kernel.org> > > > > > > > > Reviewed-by: Christian König <christian.koe...@amd.com> > > > > > > Danilo & Philipp can we quickly get an rb for that? I'm > > > volunteering > > > to push it to drm-misc-fixes and add the necessary stable tags > > > since > > > this is a fix for a rather ugly bug. > > > > > > Regards, > > > Christian. > > > > > > > > > > --- > > > > drivers/gpu/drm/scheduler/sched_entity.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c > > > > b/drivers/gpu/drm/scheduler/sched_entity.c > > > > index bd39db7bb240..e671aa241720 100644 > > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c > > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > > > > @@ -176,6 +176,7 @@ static void > > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk) > > > > { > > > > struct drm_sched_job *job = container_of(wrk, > > > > typeof(*job), work); > > > > > > > > + drm_sched_fence_scheduled(job->s_fence, NULL); > > > > drm_sched_fence_finished(job->s_fence, -ESRCH); > > > > WARN_ON(job->s_fence->parent); > > > > job->sched->ops->free_job(job); > > > > > >