RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang detection and recovery

Zhang, Jesse(Jie) Sun, 09 Nov 2025 21:38:04 -0800

[Public]

> -----Original Message-----
> From: Liang, Prike <[email protected]>
> Sent: Monday, November 10, 2025 9:49 AM
> To: Alex Deucher <[email protected]>; Zhang, Jesse(Jie)
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang
> detection and recovery
>
> [Public]
>
> OK, so it looks like the queue is not in a functional state, as I pointed out
> in the last meeting.
> We might need to check whether the queue is remapped or active before
> performing the queue resume.
[Zhang, Jesse(Jie)]  The MES firmware will check the registers 
regCP_GFX_HQD_ACTIVE/regCP_HQD_ACTIVE_ACTIVE after a reset.


Thanks
Jesse
>
>
> Regards,
>       Prike
>
> > -----Original Message-----
> > From: amd-gfx <[email protected]> On Behalf Of
> > Alex Deucher
> > Sent: Friday, November 7, 2025 10:07 PM
> > To: Zhang, Jesse(Jie) <[email protected]>
> > Cc: [email protected]; Deucher, Alexander
> > <[email protected]>; Koenig, Christian
> > <[email protected]>
> > Subject: Re: [PATCH] drm/amdgpu: resume MES scheduling after user
> > queue hang detection and recovery
> >
> > On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
> > >
> > > This patch ensures the MES is properly resumed after detecting and
> > > recovering from a user queue hang condition.
> > >
> > > Key changes:
> > > > 1. Track when a hung user queue is detected using found_hung_queue
> > > >    flag
> > > > 2. Call amdgpu_mes_resume() to restart MES scheduling after completing
> > > >    the hang recovery process
> > > > 3. This complements the existing recovery steps (fence force completion
> > > >    and device wedging) by ensuring the scheduler can process new work
> > >
> > > Without this resume call, the MES scheduler may remain in a paused
> > > state even after the hung queue has been handled, preventing newly
> > > submitted work from being processed and leading to system stalls.
> > >
> > > Signed-off-by: Jesse Zhang <[email protected]>
> >
> > Acked-by: Alex Deucher <[email protected]>
> >
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > index b1ee9473d628..6d1af72916a5 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct
> > amdgpu_device *adev,
> > >         unsigned int hung_db_num = 0;
> > >         unsigned long queue_id;
> > >         u32 db_array[8];
> > > > +       bool found_hung_queue = false;
> > >         int r, i;
> > >
> > >         if (db_array_size > 8) {
> > > @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct
> > amdgpu_device *adev,
> > >                                 for (i = 0; i < hung_db_num; i++) {
> > >                                         if (queue->doorbell_index == 
> > > db_array[i]) {
> > >                                                 queue->state =
> > > AMDGPU_USERQ_STATE_HUNG;
> > > +                                               found_hung_queue =
> > > + true;
> > >
> > > atomic_inc(&adev->gpu_reset_counter);
> > >
> > amdgpu_userq_fence_driver_force_completion(queue);
> > >
> > > drm_dev_wedged_event(adev_to_drm(adev),
> > DRM_WEDGE_RECOVERY_NONE, NULL); @@ -241,6 +243,11 @@ static int
> > mes_userq_detect_and_reset(struct amdgpu_device *adev,
> > >                 }
> > >         }
> > >
> > > +       if (found_hung_queue) {
> > > +               /* Resume scheduling after hang recovery */
> > > +               r = amdgpu_mes_resume(adev);
> > > +       }
> > > +
> > >         return r;
> > >  }
> > >
> > > --
> > > 2.49.0
> > >
