[Public]

> -----Original Message-----
> From: Liang, Prike <[email protected]>
> Sent: Monday, November 10, 2025 9:49 AM
> To: Alex Deucher <[email protected]>; Zhang, Jesse(Jie) <[email protected]>
> Cc: [email protected]; Deucher, Alexander <[email protected]>; Koenig, Christian <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang detection and recovery
>
> [Public]
>
> OK, so it looks like the queue is not in a functional state, as I pointed out in the last meeting.
> We might need to check whether the queue is remapped or active before performing the queue resume.

[Zhang, Jesse(Jie)] The MES firmware will check the registers regCP_GFX_HQD_ACTIVE/regCP_HQD_ACTIVE_ACTIVE after a reset.

Thanks
Jesse

> Regards,
>       Prike
>
> > -----Original Message-----
> > From: amd-gfx <[email protected]> On Behalf Of Alex Deucher
> > Sent: Friday, November 7, 2025 10:07 PM
> > To: Zhang, Jesse(Jie) <[email protected]>
> > Cc: [email protected]; Deucher, Alexander <[email protected]>; Koenig, Christian <[email protected]>
> > Subject: Re: [PATCH] drm/amdgpu: resume MES scheduling after user queue hang detection and recovery
> >
> > On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
> > >
> > > This patch ensures the MES is properly resumed after detecting and
> > > recovering from a user queue hang condition.
> > >
> > > Key changes:
> > > 1. Track when a hung user queue is detected using found_hung_queue flag
> > > 2. Call amdgpu_mes_resume() to restart MES scheduling after completing
> > >    the hang recovery process
> > > 3. This complements the existing recovery steps (fence force completion
> > >    and device wedging) by ensuring the scheduler can process new work
> > >
> > > Without this resume call, the MES scheduler may remain in a paused
> > > state even after the hung queue has been handled, preventing newly
> > > submitted work from being processed and leading to system stalls.
> > >
> > > Signed-off-by: Jesse Zhang <[email protected]>
> >
> > Acked-by: Alex Deucher <[email protected]>
> >
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > index b1ee9473d628..6d1af72916a5 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> > > @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> > >         unsigned int hung_db_num = 0;
> > >         unsigned long queue_id;
> > >         u32 db_array[8];
> > > +       bool found_hung_queue =false;
> > >         int r, i;
> > >
> > >         if (db_array_size > 8) {
> > > @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> > >                         for (i = 0; i < hung_db_num; i++) {
> > >                                 if (queue->doorbell_index == db_array[i]) {
> > >                                         queue->state = AMDGPU_USERQ_STATE_HUNG;
> > > +                                       found_hung_queue = true;
> > >                                         atomic_inc(&adev->gpu_reset_counter);
> > >                                         amdgpu_userq_fence_driver_force_completion(queue);
> > >                                         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > > @@ -241,6 +243,11 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
> > >                 }
> > >         }
> > >
> > > +       if (found_hung_queue) {
> > > +               /* Resume scheduling after hang recovery */
> > > +               r = amdgpu_mes_resume(adev);
> > > +       }
> > > +
> > >         return r;
> > >  }
> > >
> > > --
> > > 2.49.0
> > >
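
For readers following the thread, the control flow the patch adds can be modelled outside the kernel. The sketch below is a standalone approximation, not the amdgpu code: `struct user_queue`, `struct scheduler`, `scheduler_resume()` and `detect_and_recover()` are hypothetical stand-ins for the driver's queue object, the MES state, `amdgpu_mes_resume()` and `mes_userq_detect_and_reset()`. It only shows the pattern described in the commit message: mark matching queues as hung, remember that at least one was found, and resume the scheduler once after the scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HUNG_DOORBELLS 8 /* mirrors the db_array[8] bound in the patch */

/* Hypothetical stand-ins; none of these are amdgpu types. */
enum queue_state { QUEUE_MAPPED, QUEUE_HUNG };

struct user_queue {
    uint32_t doorbell_index;
    enum queue_state state;
};

struct scheduler {
    bool paused;
    int resume_calls;
};

/* Stand-in for amdgpu_mes_resume(): clears the paused flag, returns 0. */
static int scheduler_resume(struct scheduler *s)
{
    s->paused = false;
    s->resume_calls++;
    return 0;
}

/*
 * Mark every queue whose doorbell appears in hung_db as hung, then
 * resume the scheduler once if at least one hung queue was found.
 */
static int detect_and_recover(struct scheduler *sched,
                              struct user_queue *queues, size_t n_queues,
                              const uint32_t *hung_db, size_t n_hung)
{
    bool found_hung_queue = false;
    int r = 0;

    assert(n_hung <= MAX_HUNG_DOORBELLS);

    for (size_t q = 0; q < n_queues; q++) {
        for (size_t i = 0; i < n_hung; i++) {
            if (queues[q].doorbell_index == hung_db[i]) {
                queues[q].state = QUEUE_HUNG;
                found_hung_queue = true;
                break;
            }
        }
    }

    /* Resume only when recovery actually happened, as in the patch. */
    if (found_hung_queue)
        r = scheduler_resume(sched);

    return r;
}
```

In this model `r` starts at 0, so the return value is either 0 or the result of the resume call. In the patch, `r` may already hold the result of an earlier step when the resume call assigns to it; whether that earlier value should be preserved is not discussed in the thread.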
