On 12.08.25 14:52, Philipp Stanner wrote:
> On Tue, 2025-08-12 at 08:58 +0200, Christian König wrote:
>> On 12.08.25 08:37, Liu01, Tong (Esther) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> Hi Christian,
>>>
>>> If a job is submitted to a stopped entity, in addition to an error log,
>>> it will also cause the task to hang and time out
>>
>> Oh that's really ugly and needs to get fixed.
>
> And we agree that the proposed fix is to stop the driver from
> submitting to killed entities, don't we?
Yes, absolutely. It's basically the cleanup procedure which tries to update
the page tables of a killed process and is missing a check to not do that.

Regards,
Christian.

>
> P.
>
>>
>>> , and subsequently generate a call trace since the fence of the submitted
>>> job is not signaled. Moreover, the refcnt of amdgpu will not decrease
>>> because killing the process fails, resulting in the inability to unload amdgpu.
>>>
>>> [Tue Aug 5 11:05:20 2025] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
>>> [Tue Aug 5 11:07:43 2025] INFO: task kworker/u17:0:117 blocked for more than 122 seconds.
>>> [Tue Aug 5 11:07:43 2025] Tainted: G OE 6.8.0-45-generic #45-Ubuntu
>>> [Tue Aug 5 11:07:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [Tue Aug 5 11:07:43 2025] task:kworker/u17:0 state:D stack:0 pid:117 tgid:117 ppid:2 flags:0x00004000
>>> [Tue Aug 5 11:07:43 2025] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
>>> [Tue Aug 5 11:07:43 2025] Call Trace:
>>> [Tue Aug 5 11:07:43 2025]  <TASK>
>>> [Tue Aug 5 11:07:43 2025]  __schedule+0x27c/0x6b0
>>> [Tue Aug 5 11:07:43 2025]  schedule+0x33/0x110
>>> [Tue Aug 5 11:07:43 2025]  schedule_timeout+0x157/0x170
>>> [Tue Aug 5 11:07:43 2025]  dma_fence_default_wait+0x1e1/0x220
>>> [Tue Aug 5 11:07:43 2025]  ? __pfx_dma_fence_default_wait_cb+0x10/0x10
>>> [Tue Aug 5 11:07:43 2025]  dma_fence_wait_timeout+0x116/0x140
>>> [Tue Aug 5 11:07:43 2025]  amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
>>> [Tue Aug 5 11:07:43 2025]  ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
>>> [Tue Aug 5 11:07:43 2025]  process_one_work+0x16f/0x350
>>> [Tue Aug 5 11:07:43 2025]  worker_thread+0x306/0x440
>>> [Tue Aug 5 11:07:43 2025]  ? __pfx_worker_thread+0x10/0x10
>>> [Tue Aug 5 11:07:43 2025]  kthread+0xf2/0x120
>>> [Tue Aug 5 11:07:43 2025]  ? __pfx_kthread+0x10/0x10
>>> [Tue Aug 5 11:07:43 2025]  ret_from_fork+0x47/0x70
>>> [Tue Aug 5 11:07:43 2025]  ? __pfx_kthread+0x10/0x10
>>> [Tue Aug 5 11:07:43 2025]  ret_from_fork_asm+0x1b/0x30
>>> [Tue Aug 5 11:07:43 2025]  </TASK>
>>>
>>> Checking whether the vm entity is stopped in amdgpu_vm_ready() can avoid
>>> submitting jobs to a stopped entity. But as I understand it, there is still
>>> a risk of memory and resource leaks, since amdgpu_vm_clear_freed() is
>>> skipped while the process is being killed. amdgpu_vm_clear_freed() updates
>>> the page tables to remove the mappings and frees the mapping structures.
>>> If this cleanup is skipped, the page table entries remain in VRAM pointing
>>> to freed buffer objects, and the mapping structures are allocated but never
>>> freed. Please correct me if I have any misunderstanding.
>>
>> No, your understanding is correct, but that the page tables are not cleared
>> is completely harmless.
>>
>> The application is killed and can't submit anything any more. We should just
>> make sure that we check amdgpu_vm_ready() in the submit path as well.
>>
>> Regards,
>> Christian.
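For reference, a minimal sketch of the check Christian suggests for
amdgpu_vm_ready() could look like the following. This is not the actual
amdgpu code: the existing evicting/evicted checks are paraphrased from
memory, and it assumes the VM's page-table update entity (vm->delayed
here) exposes its stopped flag under entity->lock; the real field and
helper names may differ.

bool amdgpu_vm_ready(struct amdgpu_vm *vm)
{
	bool ready, stopped;

	/* Existing checks: the VM must not be evicting and must have no
	 * evicted BOs pending validation. */
	amdgpu_vm_eviction_lock(vm);
	ready = !vm->evicting;
	amdgpu_vm_eviction_unlock(vm);

	spin_lock(&vm->status_lock);
	ready &= list_empty(&vm->evicted);
	spin_unlock(&vm->status_lock);

	/* Sketch of the new check: once drm_sched_entity_flush() has marked
	 * the VM entity as stopped during process exit, report the VM as not
	 * ready so callers such as amdgpu_vm_clear_freed() skip page-table
	 * updates whose fences could never signal. */
	spin_lock(&vm->delayed.lock);
	stopped = vm->delayed.stopped;
	spin_unlock(&vm->delayed.lock);

	return ready && !stopped;
}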
>>
>>>
>>> Kind regards,
>>> Esther
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <christian.koe...@amd.com>
>>> Sent: Monday, August 11, 2025 8:17 PM
>>> To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
>>> Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
>>>
>>> Hi Esther,
>>>
>>> but that is harmless and potentially only gives a warning in the system log.
>>>
>>> You could adjust amdgpu_vm_ready() if necessary.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Hi Christian,
>>>>
>>>> The real issue is a race condition during process exit, introduced by patch
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2.
>>>> That patch changed amdgpu_vm_wait_idle() to use drm_sched_entity_flush()
>>>> instead of dma_resv_wait_timeout(). Here is what happens:
>>>>
>>>> do_exit
>>>>   |
>>>>   exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ... drm_sched_entity_flush (kills entity)
>>>>   ...
>>>>   exit_task_work(tsk) ... amdgpu_gem_object_close ... amdgpu_vm_clear_freed (tries to submit to killed entity)
>>>>
>>>> The entity gets killed in amdgpu_vm_wait_idle(), but amdgpu_vm_clear_freed(),
>>>> called from exit_task_work(), still tries to submit jobs.
>>>>
>>>> Kind regards,
>>>> Esther
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>> Sent: Monday, August 11, 2025 4:25 PM
>>>> To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
>>>> Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>; cao, lin <lin....@amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
>>>>
>>>> On 11.08.25 09:20, Liu01 Tong wrote:
>>>>> During process kill, drm_sched_entity_flush() will kill the vm
>>>>> entities. The following job submissions of this process will fail
>>>>
>>>> Well, when the process is killed how can it still make job submissions?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> , and
>>>>> the resources of these jobs are not released, nor are their
>>>>> fences signalled, causing tasks to hang.
>>>>>
>>>>> Fix this by not doing job init when the entity is stopped. And when the
>>>>> job has already been submitted, free the job's resources if the entity is stopped.
>>>>>
>>>>> Signed-off-by: Liu01 Tong <tong.li...@amd.com>
>>>>> Signed-off-by: Lin.Cao <linca...@amd.com>
>>>>> ---
>>>>>  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
>>>>>  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
>>>>>  2 files changed, 12 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> index ac678de7fe5e..1e744b2eb2db 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>>>> @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>>  	bool first;
>>>>>  	ktime_t submit_ts;
>>>>>
>>>>> +	if (entity->stopped) {
>>>>> +		DRM_ERROR("Trying to push job to a killed entity\n");
>>>>> +		INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
>>>>> +		schedule_work(&sched_job->work);
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>>  	trace_drm_sched_job(sched_job, entity);
>>>>>  	atomic_inc(entity->rq->sched->score);
>>>>>  	WRITE_ONCE(entity->last_user, current->group_leader);
>>>>> @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>>>>
>>>>>  	/* Add the entity to the run queue */
>>>>>  	spin_lock(&entity->lock);
>>>>> -	if (entity->stopped) {
>>>>> -		spin_unlock(&entity->lock);
>>>>> -
>>>>> -		DRM_ERROR("Trying to push to a killed entity\n");
>>>>> -		return;
>>>>> -	}
>>>>>
>>>>>  	rq = entity->rq;
>>>>>  	sched = rq->sched;
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index bfea608a7106..c15b17d9ffe3 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>>>>  		return -ENOENT;
>>>>>  	}
>>>>>
>>>>> +	if (unlikely(entity->stopped)) {
>>>>> +		pr_err("*ERROR* %s: entity is stopped!\n", __func__);
>>>>> +		return -EINVAL;
>>>>> +	}
>>>>> +
>>>>>  	if (unlikely(!credits)) {
>>>>>  		pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
>>>>>  		return -EINVAL;
>>>>
>>>
>>
>
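As a caller-side illustration of the sched_main.c hunk above: once
drm_sched_job_init() rejects stopped entities with -EINVAL, a driver
submission helper can simply unwind on that error instead of pushing a
job whose fence would never signal. The sketch below is hypothetical;
the helper name is made up, and the exact drm_sched_job_init()
signature differs between kernel versions.

static int example_vm_submit(struct drm_sched_job *job,
			     struct drm_sched_entity *entity,
			     void *owner)
{
	int r;

	/* With the hunk above this also fails when the entity was already
	 * killed by drm_sched_entity_flush(). */
	r = drm_sched_job_init(job, entity, 1, owner);
	if (r)
		return r;

	drm_sched_job_arm(job);
	drm_sched_entity_push_job(job);

	return 0;
}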