On 12.08.25 14:52, Philipp Stanner wrote:
> On Tue, 2025-08-12 at 08:58 +0200, Christian König wrote:
>> On 12.08.25 08:37, Liu01, Tong (Esther) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> Hi Christian,
>>>
>>> If a job is submitted to a stopped entity, then in addition to an error log
>>> it will also cause the task to hang and time out
>>
>> Oh that's really ugly and needs to get fixed.
> 
> And we agree that the proposed fix is to stop the driver from
> submitting to killed entities, don't we?

Yes, absolutely.

It's basically the cleanup procedure that tries to update the page tables of a
killed process and is missing a check to skip doing that.
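
Something along those lines in amdgpu_vm.c is what I have in mind. Completely
untested sketch: the helper is made up, and the existing body of
amdgpu_vm_ready() as well as the field names (vm->immediate, vm->delayed,
entity->stopped, entity->lock) need to be double checked against the tree. If
I remember correctly amdgpu_gem_object_close() already bails out when
amdgpu_vm_ready() returns false, so teaching that function about stopped
entities should cover the cleanup path:

/* Sketch only: returns true when the scheduler entity was already stopped. */
static bool amdgpu_vm_entity_stopped(struct drm_sched_entity *entity)
{
	bool stopped;

	spin_lock(&entity->lock);
	stopped = entity->stopped;
	spin_unlock(&entity->lock);

	return stopped;
}

bool amdgpu_vm_ready(struct amdgpu_vm *vm)
{
	bool empty;
	bool ready;

	/*
	 * Page table updates can't be submitted any more once
	 * drm_sched_entity_flush() has stopped the VM entities during
	 * process teardown, so report the VM as not ready and let the
	 * callers skip their updates.
	 */
	if (amdgpu_vm_entity_stopped(&vm->immediate) ||
	    amdgpu_vm_entity_stopped(&vm->delayed))
		return false;

	amdgpu_vm_eviction_lock(vm);
	ready = !vm->evicting;
	amdgpu_vm_eviction_unlock(vm);

	spin_lock(&vm->status_lock);
	empty = list_empty(&vm->evicted);
	spin_unlock(&vm->status_lock);

	return ready && empty;
}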

Regards,
Christian.

> 
> P.
> 
>>
>>> , and subsequently generate a call trace, since the fence of the submitted
>>> job is never signaled. Moreover, the refcount of amdgpu will not drop
>>> because the process kill fails to complete, making it impossible to unload amdgpu.
>>>
>>> [Tue Aug  5 11:05:20 2025] [drm:amddrm_sched_entity_push_job [amd_sched]] 
>>> *ERROR* Trying to push to a killed entity
>>> [Tue Aug  5 11:07:43 2025] INFO: task kworker/u17:0:117 blocked for more 
>>> than 122 seconds.
>>> [Tue Aug  5 11:07:43 2025]       Tainted: G           OE      
>>> 6.8.0-45-generic #45-Ubuntu
>>> [Tue Aug  5 11:07:43 2025] "echo 0 > 
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [Tue Aug  5 11:07:43 2025] task:kworker/u17:0   state:D stack:0     pid:117 
>>>   tgid:117   ppid:2      flags:0x00004000
>>> [Tue Aug  5 11:07:43 2025] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
>>> [Tue Aug  5 11:07:43 2025] Call Trace:
>>> [Tue Aug  5 11:07:43 2025]  <TASK>
>>> [Tue Aug  5 11:07:43 2025]  __schedule+0x27c/0x6b0
>>> [Tue Aug  5 11:07:43 2025]  schedule+0x33/0x110
>>> [Tue Aug  5 11:07:43 2025]  schedule_timeout+0x157/0x170
>>> [Tue Aug  5 11:07:43 2025]  dma_fence_default_wait+0x1e1/0x220
>>> [Tue Aug  5 11:07:43 2025]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
>>> [Tue Aug  5 11:07:43 2025]  dma_fence_wait_timeout+0x116/0x140
>>> [Tue Aug  5 11:07:43 2025]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
>>> [Tue Aug  5 11:07:43 2025]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
>>> [Tue Aug  5 11:07:43 2025]  process_one_work+0x16f/0x350
>>> [Tue Aug  5 11:07:43 2025]  worker_thread+0x306/0x440
>>> [Tue Aug  5 11:07:43 2025]  ? __pfx_worker_thread+0x10/0x10
>>> [Tue Aug  5 11:07:43 2025]  kthread+0xf2/0x120
>>> [Tue Aug  5 11:07:43 2025]  ? __pfx_kthread+0x10/0x10
>>> [Tue Aug  5 11:07:43 2025]  ret_from_fork+0x47/0x70
>>> [Tue Aug  5 11:07:43 2025]  ? __pfx_kthread+0x10/0x10
>>> [Tue Aug  5 11:07:43 2025]  ret_from_fork_asm+0x1b/0x30
>>> [Tue Aug  5 11:07:43 2025]  </TASK>
>>>
>>> Checking whether the vm entity is stopped in amdgpu_vm_ready() can avoid
>>> submitting jobs to a stopped entity. But as I understand it there is still a
>>> risk of memory and resource leaks, since amdgpu_vm_clear_freed() is skipped
>>> while the process is being killed. amdgpu_vm_clear_freed() updates the page
>>> tables to remove mappings and frees the mapping structures. If this cleanup
>>> is skipped, the page table entries remain in VRAM pointing to freed buffer
>>> objects, and the mapping structures stay allocated but are never freed.
>>> Please correct me if I have any misunderstanding.
>>
>> No, your understanding is correct, but the fact that the page tables are not
>> cleared is completely harmless.
>>
>> The application is killed and can't submit anything any more. We should just 
>> make sure that we check amdgpu_vm_ready() in the submit path as well.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Kind regards,
>>> Esther
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <christian.koe...@amd.com>
>>> Sent: Monday, August 11, 2025 8:17 PM
>>> To: Liu01, Tong (Esther) <tong.li...@amd.com>; 
>>> dri-devel@lists.freedesktop.org
>>> Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang 
>>> <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission 
>>> during process kill
>>>
>>> Hi Esther,
>>>
>>> but that is harmless and potentially only gives a warning in the system log.
>>>
>>> You could adjust amdgpu_vm_ready() if necessary.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Hi Christian,
>>>>
>>>> The real issue is a race condition during process exit after patch 
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2.
>>>>  This patch changed amdgpu_vm_wait_idle to use drm_sched_entity_flush 
>>>> instead of dma_resv_wait_timeout. Here is what happens:
>>>>
>>>> do_exit
>>>>     |
>>>>     exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ... drm_sched_entity_flush (kills entity)
>>>>     ...
>>>>     exit_task_work(tsk) ... amdgpu_gem_object_close ... amdgpu_vm_clear_freed (tries to submit to killed entity)
>>>>
>>>> The entity gets killed in amdgpu_vm_wait_idle(), but 
>>>> amdgpu_vm_clear_freed() called by exit_task_work() still tries to submit 
>>>> jobs.
>>>>
>>>> Kind regards,
>>>> Esther
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>> Sent: Monday, August 11, 2025 4:25 PM
>>>> To: Liu01, Tong (Esther) <tong.li...@amd.com>;
>>>> dri-devel@lists.freedesktop.org
>>>> Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba,
>>>> Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin
>>>> <lin....@amd.com>; cao, lin <lin....@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job
>>>> submission during process kill
>>>>
>>>> On 11.08.25 09:20, Liu01 Tong wrote:
>>>>> During process kill, drm_sched_entity_flush() will kill the vm
>>>>> entities. Subsequent job submissions from this process will fail
>>>>
>>>> Well, when the process is killed, how can it still make job submissions?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> , and
>>>>> the resources of these jobs are never released, nor are their
>>>>> fences signalled, causing tasks to hang.
>>>>>
>>>>> Fix this by not initializing the job when the entity is stopped. And when
>>>>> the job has already been submitted, free the job's resources if the entity is stopped.
>>>>>
>>>>> Signed-off-by: Liu01 Tong <tong.li...@amd.com>
>>>>> Signed-off-by: Lin.Cao <linca...@amd.com>
>>>>> ---
>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
>>>>>  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
>>>>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index ac678de7fe5e..1e744b2eb2db 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>>       bool first;
>>>>>       ktime_t submit_ts;
>>>>>
>>>>> +     if (entity->stopped) {
>>>>> +             DRM_ERROR("Trying to push job to a killed entity\n");
>>>>> +             INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
>>>>> +             schedule_work(&sched_job->work);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>>       trace_drm_sched_job(sched_job, entity);
>>>>>       atomic_inc(entity->rq->sched->score);
>>>>>       WRITE_ONCE(entity->last_user, current->group_leader);
>>>>> @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>>
>>>>>               /* Add the entity to the run queue */
>>>>>               spin_lock(&entity->lock);
>>>>> -             if (entity->stopped) {
>>>>> -                     spin_unlock(&entity->lock);
>>>>> -
>>>>> -                     DRM_ERROR("Trying to push to a killed entity\n");
>>>>> -                     return;
>>>>> -             }
>>>>>
>>>>>               rq = entity->rq;
>>>>>               sched = rq->sched;
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index bfea608a7106..c15b17d9ffe3 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>>>               return -ENOENT;
>>>>>       }
>>>>>
>>>>> +     if (unlikely(entity->stopped)) {
>>>>> +             pr_err("*ERROR* %s: entity is stopped!\n", __func__);
>>>>> +             return -EINVAL;
>>>>> +     }
>>>>> +
>>>>>       if (unlikely(!credits)) {
>>>>>               pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
>>>>>               return -EINVAL;
>>>>
>>>
>>
> 
