[Public]
Regards,
Prike
> -----Original Message-----
> From: Zhang, Jesse(Jie) <[email protected]>
> Sent: Monday, November 10, 2025 1:38 PM
> To: Liang, Prike <[email protected]>; Alex Deucher
> <[email protected]>
> Cc: [email protected]; Deucher, Alexander
> <[email protected]>; Koenig, Christian <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang
> detection and recovery
>
> [Public]
>
> > -----Original Message-----
> > From: Liang, Prike <[email protected]>
> > Sent: Monday, November 10, 2025 9:49 AM
> > To: Alex Deucher <[email protected]>; Zhang, Jesse(Jie)
> > <[email protected]>
> > Cc: [email protected]; Deucher, Alexander
> > <[email protected]>; Koenig, Christian
> > <[email protected]>
> > Subject: RE: [PATCH] drm/amdgpu: resume MES scheduling after user
> > queue hang detection and recovery
> >
> > [Public]
> >
> > OK, so it looks like the queue is not in a functional state, as I
> > pointed out in the last meeting.
> > We might need to check whether the queue is remapped or active before
> > performing the queue resume.
> [Zhang, Jesse(Jie)] The MES firmware will check the registers
> regCP_GFX_HQD_ACTIVE/regCP_HQD_ACTIVE_ACTIVE after a reset.
If all the queues are required to be active outside of the firmware, then
it's fine to always keep the queue resume in the driver.
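
For illustration only, the alternative of gating the resume on queue state
could be sketched roughly as below. This is a standalone model, not amdgpu
code: struct fake_queue, struct fake_sched, fake_sched_resume() and
fake_detect_and_resume() are all hypothetical names, and the "any queue
still mapped" condition is an assumption standing in for whatever check
the driver or firmware would really perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a user queue; not the amdgpu structure. */
enum fake_queue_state { FAKE_QUEUE_MAPPED, FAKE_QUEUE_UNMAPPED, FAKE_QUEUE_HUNG };

struct fake_queue {
	unsigned int doorbell_index;
	enum fake_queue_state state;
};

/* Hypothetical scheduler model: records whether resume was requested. */
struct fake_sched {
	bool paused;
	unsigned int resume_calls;
};

static int fake_sched_resume(struct fake_sched *s)
{
	s->paused = false;
	s->resume_calls++;
	return 0;
}

/*
 * Mark every queue whose doorbell is listed in db_array as hung, then
 * resume the scheduler only if a hung queue was found AND at least one
 * queue is still mapped (assumed gating condition, for illustration).
 * Returns 0 on success or when no resume was needed.
 */
static int fake_detect_and_resume(struct fake_sched *s,
				  struct fake_queue *q, size_t nq,
				  const unsigned int *db_array, size_t ndb)
{
	bool found_hung_queue = false;
	bool any_mapped = false;
	size_t i, j;

	assert(s && (q || nq == 0) && (db_array || ndb == 0));

	for (i = 0; i < nq; i++) {
		for (j = 0; j < ndb; j++) {
			if (q[i].doorbell_index == db_array[j]) {
				q[i].state = FAKE_QUEUE_HUNG;
				found_hung_queue = true;
				break;
			}
		}
		if (q[i].state == FAKE_QUEUE_MAPPED)
			any_mapped = true;
	}

	if (found_hung_queue && any_mapped)
		return fake_sched_resume(s);

	return 0;
}
```

With the check in the firmware, as Jesse describes, the driver side would
not need the any_mapped condition and could resume unconditionally once a
hung queue has been handled.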
> Thanks
> Jesse
> >
> >
> > Regards,
> > Prike
> >
> > > -----Original Message-----
> > > From: amd-gfx <[email protected]> On Behalf Of
> > > Alex Deucher
> > > Sent: Friday, November 7, 2025 10:07 PM
> > > To: Zhang, Jesse(Jie) <[email protected]>
> > > Cc: [email protected]; Deucher, Alexander
> > > <[email protected]>; Koenig, Christian
> > > <[email protected]>
> > > Subject: Re: [PATCH] drm/amdgpu: resume MES scheduling after user
> > > queue hang detection and recovery
> > >
> > > On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
> > > >
> > > > This patch ensures the MES is properly resumed after detecting and
> > > > recovering from a user queue hang condition.
> > > >
> > > > Key changes:
> > > > > 1. Track when a hung user queue is detected using found_hung_queue
> > > > > flag
> > > > > 2. Call amdgpu_mes_resume() to restart MES scheduling after
> > > > > completing the hang recovery process
> > > > 3. This complements the existing recovery steps (fence force completion
> > > > and device wedging) by ensuring the scheduler can process new
> > > > work
> > > >
> > > > Without this resume call, the MES scheduler may remain in a paused
> > > > state even after the hung queue has been handled, preventing newly
> > > > submitted work from being processed and leading to system stalls.
> > > >
> > > > Signed-off-by: Jesse Zhang <[email protected]>
> > >
> > > Acked-by: Alex Deucher <[email protected]>
> > >
> > > > ---
> > > > drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
> > > > 1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > index b1ee9473d628..6d1af72916a5 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > > @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct
> > > amdgpu_device *adev,
> > > > unsigned int hung_db_num = 0;
> > > > unsigned long queue_id;
> > > > u32 db_array[8];
> > > > > + bool found_hung_queue = false;
> > > > int r, i;
> > > >
> > > > if (db_array_size > 8) {
> > > > @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct
> > > amdgpu_device *adev,
> > > > for (i = 0; i < hung_db_num; i++) {
> > > > > if (queue->doorbell_index == db_array[i]) {
> > > > > queue->state = AMDGPU_USERQ_STATE_HUNG;
> > > > > + found_hung_queue = true;
> > > > > atomic_inc(&adev->gpu_reset_counter);
> > > > > amdgpu_userq_fence_driver_force_completion(queue);
> > > > > drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > > > > @@ -241,6 +243,11 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> > > > }
> > > > }
> > > >
> > > > + if (found_hung_queue) {
> > > > + /* Resume scheduling after hang recovery */
> > > > + r = amdgpu_mes_resume(adev);
> > > > + }
> > > > +
> > > > return r;
> > > > }
> > > >
> > > > --
> > > > 2.49.0
> > > >
>