On Fri, 2025-05-16 at 02:21 +0000, cao, lin wrote:
> 
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> 
> 
> Hi Philipp,
> 
> 
> I updated the commit message to better clarify the issue. Would you
> mind reviewing if this revised description adequately explains the
> bug and the rationale behind the fix?
> 
> 
> When an entity from application B is killed, drm_sched_entity_kill()
>  
> removes all jobs belonging to that entity through  
> drm_sched_entity_kill_jobs_work(). If application A's job depends on
> a  
> scheduled fence from application B's job, and that fence is not
> properly  
> signaled during the killing process, application A's dependency
> cannot be  
> cleared.  
>  
> This leads to application A hanging indefinitely while waiting for a
>  
> dependency that will never be resolved. Fix this issue by ensuring
> that  
> scheduled fences are properly signaled when an entity is killed,
> allowing  
> dependent applications to continue execution.

That sounds perfect, yes, Thx.

Reviewed-by: Philipp Stanner <pha...@kernel.org>


P.

> 
> Thanks,
> Lin
> 
> 
> 
> 
> 
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Thursday, May 15, 2025 20:39
> To: pha...@kernel.org <pha...@kernel.org>; cao, lin
> <lin....@amd.com>; dri-devel@lists.freedesktop.org
> <dri-devel@lists.freedesktop.org>; aamd-...@lists.freedesktop.org
> <aamd-...@lists.freedesktop.org>
> Cc: Chang, HaiJun <haijun.ch...@amd.com>; Yin, ZhenGuo (Chris)
> <zhenguo....@amd.com>; Danilo Krummrich <d...@kernel.org>
> Subject: Re: [PATCH] drm/scheduler: signal scheduled fence when kill
> job 
> 
> 
> On 5/15/25 11:05, Philipp Stanner wrote:
> > On Thu, 2025-05-15 at 10:48 +0200, Christian König wrote:
> > > Explicitly adding the scheduler maintainers.
> > > 
> > > On 5/15/25 04:07, Lin.Cao wrote:
> > > > Previously we only signaled finished fence which may cause some
> > > > submission's dependency cannot be cleared the cause benchmark
> > > > hang.
> > > > Signal both scheduled fence and finished fence could fix this
> > > > issue.
> > 
> > Code seems legit to me; but be so kind and also pimp up the commit
> > message a bit, Christian. It's not very clear what the bug is and
> > why
> > setting the parent to NULL solves it. Or is the issue simply that
> > the
> > fence might be dropped unsignaled, being a bug by definition? Needs
> > to
> > be written down.
> 
> The later, we simply forgot to signal the scheduled fence when an
> application was killed.
> 
> > Grammar is also a bit too broken.
> > 
> > And running the unit tests before pushing is probably also a good
> > idea.
> 
> And maybe even writing a new unit test for that.
> 
> Regards,
> Christian.
> 
> > 
> > > > 
> > > > Signed-off-by: Lin.Cao <linca...@amd.com>
> > 
> > Acked-by: Philipp Stanner <pha...@kernel.org>
> > 
> > > 
> > > Reviewed-by: Christian König <christian.koe...@amd.com>
> > > 
> > > Danilo & Philipp can we quickly get an rb for that? I'm
> > > volunteering
> > > to push it to drm-misc-fixes and add the necessary stable tags
> > > since
> > > this is a fix for a rather ugly bug.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > 
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index bd39db7bb240..e671aa241720 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -176,6 +176,7 @@ static void
> > > > drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > > >  {
> > > >     struct drm_sched_job *job = container_of(wrk,
> > > > typeof(*job), work);
> > > >  
> > > > +   drm_sched_fence_scheduled(job->s_fence, NULL);
> > > >     drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > >     WARN_ON(job->s_fence->parent);
> > > >     job->sched->ops->free_job(job);
> > > 
> > 
> 

Reply via email to