Hi Esther, but that is harmless and potentially only gives a warning in the system log.
You could adjust amdgpu_vm_ready() if necessary. Regards, Christian. On 11.08.25 11:05, Liu01, Tong (Esther) wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > Hi Christian, > > The real issue is a race condition during process exit after patch > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1f02f2044bda1db1fd995bc35961ab075fa7b5a2. > This patch changed amdgpu_vm_wait_idle to use drm_sched_entity_flush instead > of dma_resv_wait_timeout. Here is what happens: > > do_exit > | > exit_files(tsk) ... amdgpu_flush ... amdgpu_vm_wait_idle ... > drm_sched_entity_flush (kills entity) > ... > exit_task_work(tsk) ...amdgpu_gem_object_close ... > amdgpu_vm_clear_freed (tries to submit to killed entity) > > The entity gets killed in amdgpu_vm_wait_idle(), but amdgpu_vm_clear_freed() > called by exit_task_work() still tries to submit jobs. > > Kind regards, > Esther > > -----Original Message----- > From: Koenig, Christian <christian.koe...@amd.com> > Sent: Monday, August 11, 2025 4:25 PM > To: Liu01, Tong (Esther) <tong.li...@amd.com>; dri-devel@lists.freedesktop.org > Cc: pha...@kernel.org; d...@kernel.org; matthew.br...@intel.com; Ba, Gang > <gang...@amd.com>; matthew.schwa...@linux.dev; cao, lin <lin....@amd.com>; > cao, lin <lin....@amd.com> > Subject: Re: [PATCH] drm/amdgpu: fix task hang from failed job submission > during process kill > > On 11.08.25 09:20, Liu01 Tong wrote: >> During process kill, drm_sched_entity_flush() will kill the vm >> entities. The following job submissions of this process will fail > > Well when the process is killed how can it still make job submissions? > > Regards, > Christian. > >> , and >> the resources of these jobs have not been released, nor have the >> fences been signalled, causing tasks to hang. >> >> Fix by not doing job init when the entity is stopped. And when the job >> is already submitted, free the job resource if the entity is stopped. >> >> Signed-off-by: Liu01 Tong <tong.li...@amd.com> >> Signed-off-by: Lin.Cao <linca...@amd.com> >> --- >> drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------ >> drivers/gpu/drm/scheduler/sched_main.c | 5 +++++ >> 2 files changed, 12 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c >> b/drivers/gpu/drm/scheduler/sched_entity.c >> index ac678de7fe5e..1e744b2eb2db 100644 >> --- a/drivers/gpu/drm/scheduler/sched_entity.c >> +++ b/drivers/gpu/drm/scheduler/sched_entity.c >> @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job >> *sched_job) >> bool first; >> ktime_t submit_ts; >> >> + if (entity->stopped) { >> + DRM_ERROR("Trying to push job to a killed entity\n"); >> + INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work); >> + schedule_work(&sched_job->work); >> + return; >> + } >> + >> trace_drm_sched_job(sched_job, entity); >> atomic_inc(entity->rq->sched->score); >> WRITE_ONCE(entity->last_user, current->group_leader); @@ -589,12 >> +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job >> *sched_job) >> >> /* Add the entity to the run queue */ >> spin_lock(&entity->lock); >> - if (entity->stopped) { >> - spin_unlock(&entity->lock); >> - >> - DRM_ERROR("Trying to push to a killed entity\n"); >> - return; >> - } >> >> rq = entity->rq; >> sched = rq->sched; >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c >> b/drivers/gpu/drm/scheduler/sched_main.c >> index bfea608a7106..c15b17d9ffe3 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job, >> return -ENOENT; >> } >> >> + if (unlikely(entity->stopped)) { >> + pr_err("*ERROR* %s: entity is stopped!\n", __func__); >> + return -EINVAL; >> + } >> + >> if (unlikely(!credits)) { >> pr_err("*ERROR* %s: credits cannot be 0!\n", __func__); >> return -EINVAL; >