Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill

Christian König Mon, 11 Aug 2025 01:25:27 -0700

On 11.08.25 09:20, Liu01 Tong wrote:
> During process kill, drm_sched_entity_flush() will kill the vm
> entities. The following job submissions of this process will fail


Well when the process is killed how can it still make job submissions?

Regards,
Christian.

>, and
> the resources of these jobs have not been released, nor have the fences
> been signalled, causing tasks to hang.
> 
> Fix by not doing job init when the entity is stopped. And when the job
> is already submitted, free the job resource if the entity is stopped.
> 
> Signed-off-by: Liu01 Tong <tong.li...@amd.com>
> Signed-off-by: Lin.Cao <linca...@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++------
>  drivers/gpu/drm/scheduler/sched_main.c   |  5 +++++
>  2 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> b/drivers/gpu/drm/scheduler/sched_entity.c
> index ac678de7fe5e..1e744b2eb2db 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -570,6 +570,13 @@ void drm_sched_entity_push_job(struct drm_sched_job 
> *sched_job)
>       bool first;
>       ktime_t submit_ts;
>  
> +     if (entity->stopped) {
> +             DRM_ERROR("Trying to push job to a killed entity\n");
> +             INIT_WORK(&sched_job->work, drm_sched_entity_kill_jobs_work);
> +             schedule_work(&sched_job->work);
> +             return;
> +     }
> +
>       trace_drm_sched_job(sched_job, entity);
>       atomic_inc(entity->rq->sched->score);
>       WRITE_ONCE(entity->last_user, current->group_leader);
> @@ -589,12 +596,6 @@ void drm_sched_entity_push_job(struct drm_sched_job 
> *sched_job)
>  
>               /* Add the entity to the run queue */
>               spin_lock(&entity->lock);
> -             if (entity->stopped) {
> -                     spin_unlock(&entity->lock);
> -
> -                     DRM_ERROR("Trying to push to a killed entity\n");
> -                     return;
> -             }
>  
>               rq = entity->rq;
>               sched = rq->sched;
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> b/drivers/gpu/drm/scheduler/sched_main.c
> index bfea608a7106..c15b17d9ffe3 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -795,6 +795,11 @@ int drm_sched_job_init(struct drm_sched_job *job,
>               return -ENOENT;
>       }
>  
> +     if (unlikely(entity->stopped)) {
> +             pr_err("*ERROR* %s: entity is stopped!\n", __func__);
> +             return -EINVAL;
> +     }
> +
>       if (unlikely(!credits)) {
>               pr_err("*ERROR* %s: credits cannot be 0!\n", __func__);
>               return -EINVAL;

Re: [PATCH] drm/amdgpu: fix task hang from failed job submission during process kill

Reply via email to