Hi Christian,

If a job is submitted to a stopped entity, then in addition to the error log 
below, the task hangs and times out and subsequently produces a call trace, 
since the fence of the submitted job is never signaled. Moreover, the refcount 
of amdgpu does not drop because killing the process fails, which makes it 
impossible to unload amdgpu.

[Tue Aug  5 11:05:20 2025] [drm:amddrm_sched_entity_push_job [amd_sched]] 
*ERROR* Trying to push to a killed entity
[Tue Aug  5 11:07:43 2025] INFO: task kworker/u17:0:117 blocked for more than 
122 seconds.
[Tue Aug  5 11:07:43 2025]       Tainted: G           OE      6.8.0-45-generic 
#45-Ubuntu
[Tue Aug  5 11:07:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[Tue Aug  5 11:07:43 2025] task:kworker/u17:0   state:D stack:0     pid:117   
tgid:117   ppid:2      flags:0x00004000
[Tue Aug  5 11:07:43 2025] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
[Tue Aug  5 11:07:43 2025] Call Trace:
[Tue Aug  5 11:07:43 2025]  <TASK>
[Tue Aug  5 11:07:43 2025]  __schedule+0x27c/0x6b0
[Tue Aug  5 11:07:43 2025]  schedule+0x33/0x110
[Tue Aug  5 11:07:43 2025]  schedule_timeout+0x157/0x170
[Tue Aug  5 11:07:43 2025]  dma_fence_default_wait+0x1e1/0x220
[Tue Aug  5 11:07:43 2025]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
[Tue Aug  5 11:07:43 2025]  dma_fence_wait_timeout+0x116/0x140
[Tue Aug  5 11:07:43 2025]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
[Tue Aug  5 11:07:43 2025]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
[Tue Aug  5 11:07:43 2025]  process_one_work+0x16f/0x350
[Tue Aug  5 11:07:43 2025]  worker_thread+0x306/0x440
[Tue Aug  5 11:07:43 2025]  ? __pfx_worker_thread+0x10/0x10
[Tue Aug  5 11:07:43 2025]  kthread+0xf2/0x120
[Tue Aug  5 11:07:43 2025]  ? __pfx_kthread+0x10/0x10
[Tue Aug  5 11:07:43 2025]  ret_from_fork+0x47/0x70
[Tue Aug  5 11:07:43 2025]  ? __pfx_kthread+0x10/0x10
[Tue Aug  5 11:07:43 2025]  ret_from_fork_asm+0x1b/0x30
[Tue Aug  5 11:07:43 2025]  </TASK>

Checking whether the vm entity is stopped in amdgpu_vm_ready() can avoid 
submitting jobs to a stopped entity. But as I understand it, there is still a 
risk of memory and resource leaks, since amdgpu_vm_clear_freed() is then 
skipped while the process is being killed. amdgpu_vm_clear_freed() updates the 
page tables to remove the mappings and frees the mapping structures. If this 
cleanup is skipped, the page table entries remain in VRAM pointing to freed 
buffer objects, and the mapping structures stay allocated but are never freed. 
Please correct me if I have misunderstood anything.
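
To make the suggestion concrete, below is a minimal sketch of the check I have 
in mind (not a tested patch). It assumes vm->delayed is the entity that 
amdgpu_vm_clear_freed() submits its page-table updates to, and that reading 
entity->stopped under entity->lock is sufficient; amdgpu_vm_ready() would then 
also return false once the entity has been killed:

static bool amdgpu_vm_entity_stopped(struct amdgpu_vm *vm)
{
	/* Sketch only: assumes the delayed entity is the one used for the
	 * freed-mapping updates during process teardown.
	 */
	struct drm_sched_entity *entity = &vm->delayed;
	bool stopped;

	spin_lock(&entity->lock);
	stopped = entity->stopped;
	spin_unlock(&entity->lock);

	return stopped;
}

With amdgpu_vm_ready() returning false in that case, amdgpu_gem_object_close() 
would simply skip the submission instead of pushing to the killed entity, but 
that is exactly the path where the freed-mapping cleanup gets skipped, which 
is the leak concern described above.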

Kind regards,
Esther

-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Monday, August 11, 2025 8:17 PM
To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang 
<gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission 
during process kill

Hi Esther,

but that is harmless and potentially only gives a warning in the system log.

You could adjust amdgpu_vm_ready() if necessary.

Regards,
Christian.

On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Christian,
>
> The real issue is a race condition during process exit after patch 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2.
>  This patch changed amdgpu_vm_wait_idle to use drm_sched_entity_flush instead 
> of dma_resv_wait_timeout. Here is what happens:
>
> do_exit
>     |
>     exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ... 
> drm_sched_entity_flush (kills entity)
>     ...
>     exit_task_work(tsk) ...amdgpu_gem_object_close  ...
> amdgpu_vm_clear_freed (tries to submit to killed entity)
>
> The entity gets killed in amdgpu_vm_wait_idle(), but amdgpu_vm_clear_freed() 
> called by exit_task_work() still tries to submit jobs.
>
> Kind regards,
> Esther
>
> -----Original Message-----
> From: Koenig, Christian <christian.koe...@amd.com>
> Sent: Monday, August 11, 2025 4:25 PM
> To: Liu01, Tong (Esther) <tong.li...@amd.com>;
> dri-devel@lists.freedesktop.org
> Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba,
> Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin
> <lin....@amd.com>; cao, lin <lin....@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job
> submission during process kill
>
> On 11.08.25 09:20, Liu01 Tong wrote:
>> During process kill, drm_sched_entity_flush() will kill the vm
>> entities. The following job submissions of this process will fail
>
> Well when the process is killed how can it still make job submissions?
>
> Regards,
> Christian.
>
>> , and
>> the resources of these jobs have not been released, nor have the
>> fences  been signalled, causing tasks to hang.
>>
>> Fix by not doing job init when the entity is stopped. And when the
>> job is already submitted, free the job resource if the entity is stopped.
>>
>> Signed-off-by: Liu01 Tong <tong.li...@amd.com>
>> Signed-off-by: Lin.Cao <linca...@amd.com>
>> ---
>>  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
>>  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index ac678de7fe5e..1e744b2eb2db 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>       bool first;
>>       ktime_t submit_ts;
>>
>> +     if (entity->stopped) {
>> +             DRM_ERROR("Trying to push job to a killed entity\n");
>> +             INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
>> +             schedule_work(&sched_job->work);
>> +             return;
>> +     }
>> +
>>       trace_drm_sched_job(sched_job, entity);
>>       atomic_inc(entity->rq->sched->score);
>>       WRITE_ONCE(entity->last_user, current->group_leader);
>> @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>
>>               /* Add the entity to the run queue */
>>               spin_lock(&entity->lock);
>> -             if (entity->stopped) {
>> -                     spin_unlock(&entity->lock);
>> -
>> -                     DRM_ERROR("Trying to push to a killed entity\n");
>> -                     return;
>> -             }
>>
>>               rq = entity->rq;
>>               sched = rq->sched;
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index bfea608a7106..c15b17d9ffe3 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>               return -ENOENT;
>>       }
>>
>> +     if (unlikely(entity->stopped)) {
>> +             pr_err("*ERROR* %s: entity is stopped!\n", __func__);
>> +             return -EINVAL;
>> +     }
>> +
>>       if (unlikely(!credits)) {
>>               pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
>>               return -EINVAL;
>
