On Tue, 2025-08-12 at 08:58 +0200, Christian König wrote:
> On 12.08.25 08:37, Liu01, Tong (Esther) wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Christian,
> >
> > If a job is submitted to a stopped entity, then in addition to an error
> > log it will also cause the task to hang and time out
>
> Oh, that's really ugly and needs to get fixed.
And we agree that the proposed fix is to stop the driver from submitting
to killed entities, don't we?

P.

> > , and subsequently generate a call trace since the fence of the submitted
> > job is not signaled. Moreover, the refcount of amdgpu will not decrease
> > because killing the process fails, resulting in the inability to unload
> > amdgpu.
> >
> > [Tue Aug 5 11:05:20 2025] [drm:amddrm_sched_entity_push_job [amd_sched]] *ERROR* Trying to push to a killed entity
> > [Tue Aug 5 11:07:43 2025] INFO: task kworker/u17:0:117 blocked for more than 122 seconds.
> > [Tue Aug 5 11:07:43 2025] Tainted: G OE 6.8.0-45-generic #45-Ubuntu
> > [Tue Aug 5 11:07:43 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Tue Aug 5 11:07:43 2025] task:kworker/u17:0 state:D stack:0 pid:117 tgid:117 ppid:2 flags:0x00004000
> > [Tue Aug 5 11:07:43 2025] Workqueue: ttm ttm_bo_delayed_delete [amdttm]
> > [Tue Aug 5 11:07:43 2025] Call Trace:
> > [Tue Aug 5 11:07:43 2025] <TASK>
> > [Tue Aug 5 11:07:43 2025] __schedule+0x27c/0x6b0
> > [Tue Aug 5 11:07:43 2025] schedule+0x33/0x110
> > [Tue Aug 5 11:07:43 2025] schedule_timeout+0x157/0x170
> > [Tue Aug 5 11:07:43 2025] dma_fence_default_wait+0x1e1/0x220
> > [Tue Aug 5 11:07:43 2025] ? __pfx_dma_fence_default_wait_cb+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] dma_fence_wait_timeout+0x116/0x140
> > [Tue Aug 5 11:07:43 2025] amddma_resv_wait_timeout+0x7f/0xf0 [amdkcl]
> > [Tue Aug 5 11:07:43 2025] ttm_bo_delayed_delete+0x2a/0xc0 [amdttm]
> > [Tue Aug 5 11:07:43 2025] process_one_work+0x16f/0x350
> > [Tue Aug 5 11:07:43 2025] worker_thread+0x306/0x440
> > [Tue Aug 5 11:07:43 2025] ? __pfx_worker_thread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] kthread+0xf2/0x120
> > [Tue Aug 5 11:07:43 2025] ? __pfx_kthread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] ret_from_fork+0x47/0x70
> > [Tue Aug 5 11:07:43 2025] ? __pfx_kthread+0x10/0x10
> > [Tue Aug 5 11:07:43 2025] ret_from_fork_asm+0x1b/0x30
> > [Tue Aug 5 11:07:43 2025] </TASK>
> >
> > Checking whether the vm entity is stopped in amdgpu_vm_ready() can avoid
> > submitting jobs to a stopped entity. But as I understand it, there is
> > still a risk of memory and resource leaks, since amdgpu_vm_clear_freed()
> > is skipped while the process is being killed. amdgpu_vm_clear_freed()
> > updates the page table to remove the mappings and frees the mapping
> > structures. If this cleanup is skipped, the page table entries remain in
> > VRAM pointing to freed buffer objects, and the mapping structures stay
> > allocated but are never freed. Please correct me if I have any
> > misunderstanding.
>
> No, your understanding is correct, but that the page tables are not
> cleared is completely harmless.
>
> The application is killed and can't submit anything any more. We should
> just make sure that we check amdgpu_vm_ready() in the submit path as well.
>
> Regards,
> Christian.
>
> >
> > Kind regards,
> > Esther
> >
> > -----Original Message-----
> > From: Koenig, Christian <christian.koe...@amd.com>
> > Sent: Monday, August 11, 2025 8:17 PM
> > To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
> > Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
> >
> > Hi Esther,
> >
> > but that is harmless and potentially only gives a warning in the system
> > log.
> >
> > You could adjust amdgpu_vm_ready() if necessary.
> >
> > Regards,
> > Christian.
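Just to illustrate the direction being discussed here, a minimal sketch of
how the readiness check could take the entity state into account. This
assumes vm->delayed is the scheduler entity used for the page table updates
and that sampling its stopped flag unlocked is good enough for an advisory
check; it is not the actual patch:

bool amdgpu_vm_ready(struct amdgpu_vm *vm)
{
	/* ... existing eviction/evicted checks stay as they are ... */

	/* Sketch: refuse further submissions once the page table update
	 * entity has been killed by drm_sched_entity_flush() on process
	 * exit. The same check would apply to vm->immediate if needed.
	 */
	if (READ_ONCE(vm->delayed.stopped))
		return false;

	return true;
}

A caller that bails out when amdgpu_vm_ready() returns false would then
skip the amdgpu_vm_clear_freed() submission instead of pushing a job to
the killed entity.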
> > On 11.08.25 11:05, Liu01, Tong (Esther) wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > Hi Christian,
> > >
> > > The real issue is a race condition during process exit, introduced by patch
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2.
> > > That patch changed amdgpu_vm_wait_idle to use drm_sched_entity_flush
> > > instead of dma_resv_wait_timeout. Here is what happens:
> > >
> > > do_exit
> > >  |
> > >  exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ...
> > >    drm_sched_entity_flush (kills entity)
> > >  ...
> > >  exit_task_work(tsk) ... amdgpu_gem_object_close ...
> > >    amdgpu_vm_clear_freed (tries to submit to killed entity)
> > >
> > > The entity gets killed in amdgpu_vm_wait_idle(), but
> > > amdgpu_vm_clear_freed(), called from exit_task_work(), still tries to
> > > submit jobs.
> > >
> > > Kind regards,
> > > Esther
> > >
> > > -----Original Message-----
> > > From: Koenig, Christian <christian.koe...@amd.com>
> > > Sent: Monday, August 11, 2025 4:25 PM
> > > To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org
> > > Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>; cao, lin <lin....@amd.com>
> > > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill
> > >
> > > On 11.08.25 09:20, Liu01 Tong wrote:
> > > > During process kill, drm_sched_entity_flush() will kill the vm
> > > > entities. The following job submissions of this process will fail
> > >
> > > Well, when the process is killed, how can it still make job submissions?
> > >
> > > Regards,
> > > Christian.
> > >
> > > > , and
> > > > the resources of these jobs have not been released, nor have their
> > > > fences been signalled, causing tasks to hang.
> > > >
> > > > Fix this by not doing job init when the entity is stopped. And when
> > > > the job has already been submitted, free the job's resources if the
> > > > entity is stopped.
> > > >
> > > > Signed-off-by: Liu01 Tong <tong.li...@amd.com>
> > > > Signed-off-by: Lin.Cao <linca...@amd.com>
> > > > ---
> > > >  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
> > > >  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
> > > >  2 files changed, 12 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index ac678de7fe5e..1e744b2eb2db 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > >  	bool first;
> > > >  	ktime_t submit_ts;
> > > >  
> > > > +	if (entity->stopped) {
> > > > +		DRM_ERROR("Trying to push job to a killed entity\n");
> > > > +		INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
> > > > +		schedule_work(&sched_job->work);
> > > > +		return;
> > > > +	}
> > > > +
> > > >  	trace_drm_sched_job(sched_job, entity);
> > > >  	atomic_inc(entity->rq->sched->score);
> > > >  	WRITE_ONCE(entity->last_user, current->group_leader);
> > > > @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > >  
> > > >  	/* Add the entity to the run queue */
> > > >  	spin_lock(&entity->lock);
> > > > -	if (entity->stopped) {
> > > > -		spin_unlock(&entity->lock);
> > > > -
> > > > -		DRM_ERROR("Trying to push to a killed entity\n");
> > > > -		return;
> > > > -	}
> > > >  
> > > >  	rq = entity->rq;
> > > >  	sched = rq->sched;
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index bfea608a7106..c15b17d9ffe3 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
> > > >  		return -ENOENT;
> > > >  	}
> > > >  
> > > > +	if (unlikely(entity->stopped)) {
> > > > +		pr_err("*ERROR* %s: entity is stopped!\n", __func__);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > >  	if (unlikely(!credits)) {
> > > >  		pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
> > > >  		return -EINVAL;
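As a side note on the drm_sched_job_init() hunk: a driver submit path then
has to treat a stopped entity like any other init failure and clean up
locally, because no scheduler fence will ever signal for that job. A
hypothetical caller (not actual amdgpu code, just the expected shape on top
of the patch above) would look roughly like this:

#include <drm/gpu_scheduler.h>

/* Hypothetical example: with the hunk above, drm_sched_job_init()
 * returns -EINVAL for a stopped entity, so the caller must release
 * whatever it allocated for the job and report the error instead of
 * waiting on a fence that never signals.
 */
static int example_submit(struct drm_sched_job *job,
			  struct drm_sched_entity *entity,
			  u32 credits, void *owner)
{
	int r;

	r = drm_sched_job_init(job, entity, credits, owner);
	if (r)
		return r;	/* caller frees its own job allocation */

	drm_sched_job_arm(job);
	drm_sched_entity_push_job(job);

	return 0;
}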