RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang detection and recovery

Liang, Prike Sun, 09 Nov 2025 17:49:18 -0800

[Public]

OK, so it looks like the queue is not in a functional state, as I pointed out
in the last meeting.
We might need to check whether the queue is remapped or active before
performing the queue resume.
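
As a rough illustration of that kind of guard, here is a minimal standalone
sketch, not driver code. The enum, the struct, and both helper names are
hypothetical stand-ins invented for this example; the real amdgpu user queue
state tracking may look different, and the actual resume call from the patch
is deliberately left out. The assumption is simply that a resume is only
worthwhile when at least one queue is mapped (active) again.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real amdgpu definitions. */
enum userq_state {
	USERQ_STATE_UNMAPPED,
	USERQ_STATE_MAPPED,
	USERQ_STATE_HUNG,
};

struct userq {
	enum userq_state state;
};

/* A queue counts as schedulable only when it is mapped (active). */
static bool userq_is_schedulable(const struct userq *q)
{
	return q->state == USERQ_STATE_MAPPED;
}

/*
 * Return true if at least one queue is mapped, i.e. resuming the
 * scheduler would have something functional to run.
 */
static bool should_resume_scheduling(const struct userq *queues, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (userq_is_schedulable(&queues[i]))
			return true;
	}
	return false;
}
```

Under that assumption, the caller would check should_resume_scheduling()
first and skip the resume when every queue is still hung or unmapped.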



Regards,
      Prike

> -----Original Message-----
> From: amd-gfx <[email protected]> On Behalf Of Alex
> Deucher
> Sent: Friday, November 7, 2025 10:07 PM
> To: Zhang, Jesse(Jie) <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang
> detection and recovery
>
> On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
> >
> > This patch ensures the MES is properly resumed after detecting and
> > recovering from a user queue hang condition.
> >
> > Key changes:
> > 1. Track when a hung user queue is detected using found_hung_queue
> >    flag
> > 2. Call amdgpu_mes_resume() to restart MES scheduling after completing
> >    the hang recovery process
> > 3. This complements the existing recovery steps (fence force completion
> >    and device wedging) by ensuring the scheduler can process new work
> >
> > Without this resume call, the MES scheduler may remain in a paused
> > state even after the hung queue has been handled, preventing newly
> > submitted work from being processed and leading to system stalls.
> >
> > Signed-off-by: Jesse Zhang <[email protected]>
>
> Acked-by: Alex Deucher <[email protected]>
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > index b1ee9473d628..6d1af72916a5 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> >         unsigned int hung_db_num = 0;
> >         unsigned long queue_id;
> >         u32 db_array[8];
> > +       bool found_hung_queue = false;
> >         int r, i;
> >
> >         if (db_array_size > 8) {
> > @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> >                                 for (i = 0; i < hung_db_num; i++) {
> >                                         if (queue->doorbell_index == db_array[i]) {
> >                                                 queue->state = AMDGPU_USERQ_STATE_HUNG;
> > +                                               found_hung_queue = true;
> >                                                 atomic_inc(&adev->gpu_reset_counter);
> >                                                 amdgpu_userq_fence_driver_force_completion(queue);
> >                                                 drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > @@ -241,6 +243,11 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> >                 }
> >         }
> >
> > +       if (found_hung_queue) {
> > +               /* Resume scheduling after hang recovery */
> > +               r = amdgpu_mes_resume(adev);
> > +       }
> > +
> >         return r;
> >  }
> >
> > --
> > 2.49.0
> >
