On Fri, Nov 7, 2025 at 6:27 AM Jesse.Zhang <[email protected]> wrote:
>
> This patch ensures the MES is properly resumed
> after detecting and recovering from a user queue hang condition.
>
> Key changes:
> 1. Track when a hung user queue is detected using found_hung_queue flag
> 2. Call amdgpu_mes_resume() to restart MES scheduling after completing
>    the hang recovery process
> 3. This complements the existing recovery steps (fence force completion
>    and device wedging) by ensuring the scheduler can process new work
>
> Without this resume call, the MES scheduler may remain in a paused state
> even after the hung queue has been handled, preventing newly submitted
> work from being processed and leading to system stalls.
>
> Signed-off-by: Jesse Zhang <[email protected]>

Acked-by: Alex Deucher <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/mes_userqueue.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> index b1ee9473d628..6d1af72916a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_userqueue.c
> @@ -208,6 +208,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
>  	unsigned int hung_db_num = 0;
>  	unsigned long queue_id;
>  	u32 db_array[8];
> +	bool found_hung_queue =false;
>  	int r, i;
>
>  	if (db_array_size > 8) {
> @@ -232,6 +233,7 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
>  				for (i = 0; i < hung_db_num; i++) {
>  					if (queue->doorbell_index == db_array[i]) {
>  						queue->state = AMDGPU_USERQ_STATE_HUNG;
> +						found_hung_queue = true;
>  						atomic_inc(&adev->gpu_reset_counter);
>  						amdgpu_userq_fence_driver_force_completion(queue);
>  						drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> @@ -241,6 +243,11 @@ static int mes_userq_detect_and_reset(struct amdgpu_device *adev,
>  		}
>  	}
>
> +	if (found_hung_queue) {
> +		/* Resume scheduling after hang recovery */
> +		r = amdgpu_mes_resume(adev);
> +	}
> +
>  	return r;
>  }
>
> --
> 2.49.0
>
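
For anyone following the control flow, here is a minimal standalone sketch of the pattern the commit message describes: mark each queue whose doorbell appears in the hung list, remember that at least one was found, and resume the scheduler once after the scan. Every name in it (struct queue, struct sched, sched_resume, detect_and_reset) is a hypothetical stand-in for illustration, not an amdgpu type or function, and the fence completion and wedge event steps are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the amdgpu types or API. */
enum queue_state { QUEUE_ACTIVE, QUEUE_HUNG };

struct queue {
	unsigned int doorbell_index;
	enum queue_state state;
};

struct sched {
	bool paused;
	unsigned int resume_calls;
};

/* Stand-in for the resume step; returns 0 on success. */
static int sched_resume(struct sched *s)
{
	s->paused = false;
	s->resume_calls++;
	return 0;
}

/*
 * Mark every queue whose doorbell is in hung_db[] as hung, then resume
 * the scheduler once if at least one hung queue was found.
 */
static int detect_and_reset(struct sched *s, struct queue *queues, size_t nq,
			    const unsigned int *hung_db, size_t nhung)
{
	bool found_hung_queue = false;
	int r = 0;
	size_t q, i;

	for (q = 0; q < nq; q++) {
		for (i = 0; i < nhung; i++) {
			if (queues[q].doorbell_index == hung_db[i]) {
				queues[q].state = QUEUE_HUNG;
				found_hung_queue = true;
				break;
			}
		}
	}

	/* One resume call after the scan, regardless of how many queues hung. */
	if (found_hung_queue)
		r = sched_resume(s);

	return r;
}
```

In this model the resume happens once per detection pass rather than once per hung queue, and a pass that finds nothing leaves the scheduler state untouched.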
