On Tue, 2025-08-12 at 08:58 +0200, Christian König wrote:
> On 12.08.25 08:37, Liu01, Tong (Esther) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Christian,
> >
> > If a job is submitted to a stopped entity, then in addition to an error
> > log it will also cause the task to hang and time out
>
> Oh, that's really ugly and needs to get fixed.
And we agree that the proposed fix is to stop the driver from submitting
to killed entities, don't we?

P.

> > , and subsequently generate a call trace since the fence of the submitted
> > job is not signaled. Moreover, the refcount of amdgpu will not decrease
> > because killing the process fails, resulting in the inability to unload
> > amdgpu.
> >
> > [Tue Aug 5 11:05:20 2025] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
> > [Tue Aug 5 11:07:43 2025] INFO: task kworker/u17:0:117 blocked for more than 122 seconds.
> > [Tue Aug 5 11:07:43 2025] Tainted: G OE 6.8.0-45-generic #45-Ubuntu
> > [Tue Aug 5 11:07:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Aug 5 11:07:43 2025] task:kworker/u17:0 state:D stack:0 pid:117 tgid:117 ppid:2 flags:0x00004000
> > [Tue Aug 5 11:07:43 2025] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
> > [Tue Aug 5 11:07:43 2025] Call Trace:
> > [Tue Aug 5 11:07:43 2025] <TASK>
> > [Tue Aug 5 11:07:43 2025] __schedule+0x27c/0x6b0
> > [Tue Aug 5 11:07:43 2025] schedule+0x33/0x110
> > [Tue Aug 5 11:07:43 2025] schedule_timeout+0x157/0x170
> > [Tue Aug 5 11:07:43 2025] dma_fence_default_wait+0x1e1/0x220
> > [Tue Aug 5 11:07:43 2025] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] dma_fence_wait_timeout+0x116/0x140
> > [Tue Aug 5 11:07:43 2025] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
> > [Tue Aug 5 11:07:43 2025] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
> > [Tue Aug 5 11:07:43 2025] process_one_work+0x16f/0x350
> > [Tue Aug 5 11:07:43 2025] worker_thread+0x306/0x440
> > [Tue Aug 5 11:07:43 2025] ? __pfx_worker_thread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] kthread+0xf2/0x120
> > [Tue Aug 5 11:07:43 2025] ? __pfx_kthread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] ret_from_fork+0x47/0x70
> > [Tue Aug 5 11:07:43 2025] ? __pfx_kthread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] ret_from_fork_asm+0x1b/0x30
> > [Tue Aug 5 11:07:43 2025] </TASK>
> >
> > Checking whether the vm entity is stopped in amdgpu_vm_ready() can avoid
> > submitting jobs to a stopped entity. But as I understand it, there is
> > still a risk of memory and resource leaks, since amdgpu_vm_clear_freed()
> > is skipped while the process is being killed. amdgpu_vm_clear_freed()
> > updates the page table to remove the mappings and frees the mapping
> > structures. If this cleanup is skipped, the page table entries remain in
> > VRAM pointing to freed buffer objects, and the mapping structures stay
> > allocated but are never freed. Please correct me if I have any
> > misunderstanding.
>
> No, your understanding is correct, but that the page tables are not
> cleared is completely harmless.
>
> The application is killed and can't submit anything any more. We should
> just make sure that we check amdgpu_vm_ready() in the submit path as well.
>
> Regards,
> Christian.
>
> >
> > Kind regards,
> > Esther
> >
> > -----Original Message-----
> > From: Koenig, Christian <christian.koe...@amd.com>
> > Sent: Monday, August 11, 2025 8:17 PM
> > To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
> > Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
> >
> > Hi Esther,
> >
> > but that is harmless and potentially only gives a warning in the system
> > log.
> >
> > You could adjust amdgpu_vm_ready() if necessary.
> >
> > Regards,
> > Christian.
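Just to illustrate the direction being discussed here, a minimal sketch of
how the readiness check could take the entity state into account. This
assumes vm->delayed is the scheduler entity used for the page table updates
and that sampling its stopped flag unlocked is good enough for an advisory
check; it is not the actual patch:

bool amdgpu_vm_ready(struct amdgpu_vm *vm)
{
	/* ... existing eviction/evicted checks stay as they are ... */

	/* Sketch: refuse further submissions once the page table update
	 * entity has been killed by drm_sched_entity_flush() on process
	 * exit. The same check would apply to vm->immediate if needed.
	 */
	if (READ_ONCE(vm->delayed.stopped))
		return false;

	return true;
}

A caller that bails out when amdgpu_vm_ready() returns false would then
skip the amdgpu_vm_clear_freed() submission instead of pushing a job to
the killed entity.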
> > On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > Hi Christian,
> > >
> > > The real issue is a race condition during process exit, introduced by patch
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2.
> > > That patch changed amdgpu_vm_wait_idle to use drm_sched_entity_flush
> > > instead of dma_resv_wait_timeout. Here is what happens:
> > >
> > > do_exit
> > >  |
> > >  exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ...
> > >    drm_sched_entity_flush (kills entity)
> > >  ...
> > >  exit_task_work(tsk) ... amdgpu_gem_object_close ...
> > >    amdgpu_vm_clear_freed (tries to submit to killed entity)
> > >
> > > The entity gets killed in amdgpu_vm_wait_idle(), but
> > > amdgpu_vm_clear_freed(), called from exit_task_work(), still tries to
> > > submit jobs.
> > >
> > > Kind regards,
> > > Esther
> > >
> > > -----Original Message-----
> > > From: Koenig, Christian <christian.koe...@amd.com>
> > > Sent: Monday, August 11, 2025 4:25 PM
> > > To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
> > > Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>; cao, lin <lin....@amd.com>
> > > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
> > >
> > > On 11.08.25 09:20, Liu01 Tong wrote:
> > > > During process kill, drm_sched_entity_flush() will kill the vm
> > > > entities. The following job submissions of this process will fail
> > >
> > > Well, when the process is killed, how can it still make job submissions?
> > >
> > > Regards,
> > > Christian.
> > >
> > > > , and
> > > > the resources of these jobs have not been released, nor have their
> > > > fences been signalled, causing tasks to hang.
> > > >
> > > > Fix this by not doing job init when the entity is stopped. And when
> > > > the job has already been submitted, free the job's resources if the
> > > > entity is stopped.
> > > >
> > > > Signed-off-by: Liu01 Tong <tong.li...@amd.com>
> > > > Signed-off-by: Lin.Cao <linca...@amd.com>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
> > > >  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
> > > >  2 files changed, 12 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index ac678de7fe5e..1e744b2eb2db 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > >  	bool first;
> > > >  	ktime_t submit_ts;
> > > >  
> > > > +	if (entity->stopped) {
> > > > +		DRM_ERROR("Trying to push job to a killed entity\n");
> > > > +		INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
> > > > +		schedule_work(&sched_job->work);
> > > > +		return;
> > > > +	}
> > > > +
> > > >  	trace_drm_sched_job(sched_job, entity);
> > > >  	atomic_inc(entity->rq->sched->score);
> > > >  	WRITE_ONCE(entity->last_user, current->group_leader);
> > > > @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > >  
> > > >  	/* Add the entity to the run queue */
> > > >  	spin_lock(&entity->lock);
> > > > -	if (entity->stopped) {
> > > > -		spin_unlock(&entity->lock);
> > > > -
> > > > -		DRM_ERROR("Trying to push to a killed entity\n");
> > > > -		return;
> > > > -	}
> > > >  
> > > >  	rq = entity->rq;
> > > >  	sched = rq->sched;
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index bfea608a7106..c15b17d9ffe3 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
> > > >  		return -ENOENT;
> > > >  	}
> > > >  
> > > > +	if (unlikely(entity->stopped)) {
> > > > +		pr_err("*ERROR* %s: entity is stopped!\n", __func__);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > >  	if (unlikely(!credits)) {
> > > >  		pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
> > > >  		return -EINVAL;
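As a side note on the drm_sched_job_init() hunk: a driver submit path then
has to treat a stopped entity like any other init failure and clean up
locally, because no scheduler fence will ever signal for that job. A
hypothetical caller (not actual amdgpu code, just the expected shape on top
of the patch above) would look roughly like this:

#include <drm/gpu_scheduler.h>

/* Hypothetical example: with the hunk above, drm_sched_job_init()
 * returns -EINVAL for a stopped entity, so the caller must release
 * whatever it allocated for the job and report the error instead of
 * waiting on a fence that never signals.
 */
static int example_submit(struct drm_sched_job *job,
			  struct drm_sched_entity *entity,
			  u32 credits, void *owner)
{
	int r;

	r = drm_sched_job_init(job, entity, credits, owner);
	if (r)
		return r;	/* caller frees its own job allocation */

	drm_sched_job_arm(job);
	drm_sched_entity_push_job(job);

	return 0;
}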