Re: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang detection and recovery

Alex Deucher Fri, 07 Nov 2025 06:07:47 -0800

On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
>
> This patch ensures the MES is properly resumed
> after detecting and recovering from a user queue hang condition.
>
> Key changes:
> 1. Track when a hung user queue is detected using found_hung_queue flag
> 2. Call amdgpu_mes_resume() to restart MES scheduling after completing
>    the hang recovery process
> 3. This complements the existing recovery steps (fence force completion
>    and device wedging) by ensuring the scheduler can process new work
>
> Without this resume call, the MES scheduler may remain in a paused state
> even after the hung queue has been handled, preventing newly submitted
> work from being processed and leading to system stalls.
>
> Signed-off-by: Jesse Zhang <[email protected]>


Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> index b1ee9473d628..6d1af72916a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
>         unsigned int hung_db_num = 0;
>         unsigned long queue_id;
>         u32 db_array[8];
> +       bool found_hung_queue = false;
>         int r, i;
>
>         if (db_array_size > 8) {
> @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
>                                 for (i = 0; i < hung_db_num; i++) {
>                                         if (queue->doorbell_index == db_array[i]) {
>                                                 queue->state = AMDGPU_USERQ_STATE_HUNG;
> +                                               found_hung_queue = true;
>                                                 atomic_inc(&adev->gpu_reset_counter);
>                                                 amdgpu_userq_fence_driver_force_completion(queue);
>                                                 drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> @@ -241,6 +243,11 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
>                 }
>         }
>
> +       if (found_hung_queue) {
> +               /* Resume scheduling after hang recovery */
> +               r = amdgpu_mes_resume(adev);
> +       }
> +
>         return r;
>  }
>
> --
> 2.49.0
>
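
For readers following the thread, the control flow added by the quoted patch
can be modeled outside the kernel. What follows is a minimal userspace sketch
under stated assumptions, not the driver code: struct userq, struct sched,
sched_resume() and detect_and_reset() are hypothetical stand-ins for the
amdgpu queue, the MES state, amdgpu_mes_resume() and
mes_userq_detect_and_reset(). It only illustrates the pattern of recording a
match in a flag inside the loop and resuming once afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the amdgpu definitions. */
enum userq_state { USERQ_STATE_MAPPED, USERQ_STATE_HUNG };

struct userq {
	uint32_t doorbell_index;
	enum userq_state state;
};

struct sched {
	bool paused;      /* models scheduling being suspended */
	int resume_calls; /* counts how often resume was requested */
};

/* Stand-in for a resume call; returns 0 on success. */
static int sched_resume(struct sched *s)
{
	s->paused = false;
	s->resume_calls++;
	return 0;
}

/*
 * Mark every queue whose doorbell appears in hung_db[] as hung, then
 * resume the scheduler once, and only if at least one queue matched.
 */
static int detect_and_reset(struct sched *s, struct userq *queues, size_t nq,
			    const uint32_t *hung_db, size_t nhung)
{
	bool found_hung_queue = false;
	size_t q, i;
	int r = 0;

	assert(s != NULL);

	for (q = 0; q < nq; q++) {
		for (i = 0; i < nhung; i++) {
			if (queues[q].doorbell_index == hung_db[i]) {
				queues[q].state = USERQ_STATE_HUNG;
				found_hung_queue = true;
				break; /* this queue is handled */
			}
		}
	}

	/* Resume after the loop so it runs once per detection pass. */
	if (found_hung_queue)
		r = sched_resume(s);

	return r;
}
```

In this model a pass with no matching doorbell leaves the scheduler state
untouched, and a pass with one or more matches calls the resume helper
exactly once, however many queues were marked hung.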
