RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang detection and recovery

Liang, Prike Sun, 09 Nov 2025 23:58:50 -0800

[Public]

Regards,
      Prike


> -----Original Message-----
> From: Zhang, Jesse(Jie) <[email protected]>
> Sent: Monday, November 10, 2025 1:38 PM
> To: Liang, Prike <[email protected]>; Alex Deucher
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang
> detection and recovery
>
> [Public]
>
> > -----Original Message-----
> > From: Liang, Prike <[email protected]>
> > Sent: Monday, November 10, 2025 9:49 AM
> > To: Alex Deucher <[email protected]>; Zhang, Jesse(Jie)
> > <[email protected]>
> > Cc: [email protected]; Deucher, Alexander
> > <[email protected]>; Koenig, Christian
> > <[email protected]>
> > Subject: RE: [PATCH] drm/amdgpu: resume MES scheduling after user
> > queue hang detection and recovery
> >
> > [Public]
> >
> > OK, so it looks like the queue is not in a functional state, as I
> > pointed out in the last meeting.
> > We might need to check whether the queue is remapped or active before
> > performing the queue resume.
> [Zhang, Jesse(Jie)]  The MES firmware will check the registers
> regCP_GFX_HQD_ACTIVE/regCP_HQD_ACTIVE_ACTIVE after a reset.
If all the queues are required to be active outside of the firmware, then
it's fine to always keep the queue resume in the driver.

> Thanks
> Jesse
> >
> >
> > Regards,
> >       Prike
> >
> > > -----Original Message-----
> > > From: amd-gfx <[email protected]> On Behalf Of
> > > Alex Deucher
> > > Sent: Friday, November 7, 2025 10:07 PM
> > > To: Zhang, Jesse(Jie) <[email protected]>
> > > Cc: [email protected]; Deucher, Alexander
> > > <[email protected]>; Koenig, Christian
> > > <[email protected]>
> > > Subject: Re: [PATCH] drm/amdgpu: resume MES scheduling after user
> > > queue hang detection and recovery
> > >
> > > On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
> > > >
> > > > This patch ensures the MES is properly resumed after detecting and
> > > > recovering from a user queue hang condition.
> > > >
> > > > Key changes:
> > > > > 1. Track when a hung user queue is detected using found_hung_queue
> > > > >    flag
> > > > > 2. Call amdgpu_mes_resume() to restart MES scheduling after completing
> > > > >    the hang recovery process
> > > > > 3. This complements the existing recovery steps (fence force completion
> > > > >    and device wedging) by ensuring the scheduler can process new work
> > > >
> > > > Without this resume call, the MES scheduler may remain in a paused
> > > > state even after the hung queue has been handled, preventing newly
> > > > submitted work from being processed and leading to system stalls.
> > > >
> > > > Signed-off-by: Jesse Zhang <[email protected]>
> > >
> > > Acked-by: Alex Deucher <[email protected]>
> > >
> > > > ---
> > > >  drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > index b1ee9473d628..6d1af72916a5 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct
> > > amdgpu_device *adev,
> > > >         unsigned int hung_db_num = 0;
> > > >         unsigned long queue_id;
> > > >         u32 db_array[8];
> > > > +       bool found_hung_queue =false;
> > > >         int r, i;
> > > >
> > > >         if (db_array_size > 8) {
> > > > @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct
> > > amdgpu_device *adev,
> > > >                                 for (i = 0; i < hung_db_num; i++) {
> > > >                                         if (queue->doorbell_index == 
> > > > db_array[i]) {
> > > >                                                 queue->state =
> > > > AMDGPU_USERQ_STATE_HUNG;
> > > > +                                               found_hung_queue =
> > > > + true;
> > > >
> > > > atomic_inc(&adev->gpu_reset_counter);
> > > >
> > > amdgpu_userq_fence_driver_force_completion(queue);
> > > >
> > > > drm_dev_wedged_event(adev_to_drm(adev),
> > > DRM_WEDGE_RECOVERY_NONE, NULL); @@ -241,6 +243,11 @@ static int
> > > mes_userq_detect_and_reset(struct amdgpu_device *adev,
> > > >                 }
> > > >         }
> > > >
> > > > +       if (found_hung_queue) {
> > > > +               /* Resume scheduling after hang recovery */
> > > > +               r = amdgpu_mes_resume(adev);
> > > > +       }
> > > > +
> > > >         return r;
> > > >  }
> > > >
> > > > --
> > > > 2.49.0
> > > >
>
