Instead of having MES do the detection and the driver do the reset, this series implements compute queue/pipe reset with both detection and reset done in MES.
When REMOVE_QUEUE fails, the driver treats it as a sign that at least one queue has hung. The driver sends SUSPEND to suspend all queues, then RESET to reset the hung queues. MES unmaps the hung queues and stores their information in the doorbell array and hqd_info for the driver. The driver finds each valid doorbell offset in the doorbell array and looks up hqd_info for that hung queue's information. Next, the driver cleans up the hung queues and sends RESUME to resume the healthy queues.

Amber Lin (8):
  drm/amdgpu: Fix gfx_hqd_mask in mes 12.1
  drm/amdgpu: Fixup boost mes detect hang array size
  drm/amdgpu: Fixup detect and reset
  drm/amdgpu: Create hqd info structure
  drm/amdgpu: Missing multi-XCC support in MES
  drm/amdgpu: Enable suspend/resume gang in mes 12.1
  drm/amdkfd: Add detect+reset hangs to GC 12.1
  drm/amdkfd: Reset queue/pipe in MES

 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       |  89 ++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       |  23 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        |   2 +-
 drivers/gpu/drm/amd/amdgpu/mes_userqueue.c    |   2 +-
 drivers/gpu/drm/amd/amdgpu/mes_v12_1.c        |  98 ++++++++----
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 151 +++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |   1 +
 8 files changed, 306 insertions(+), 61 deletions(-)

-- 
2.43.0
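For readers who want the ordering of the recovery sequence in one place, the following is a minimal userspace sketch of the SUSPEND, RESET, cleanup, RESUME flow described in this cover letter. It is not the code from the patches: every `sketch_` name, the fixed queue count, and the use of `UINT32_MAX` as the "empty doorbell slot" marker are assumptions made for illustration, and MES itself is replaced by a small in-memory simulation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes and sentinel; the real driver values may differ. */
#define SKETCH_MAX_QUEUES 4
#define SKETCH_INVALID_DOORBELL UINT32_MAX

/* Hypothetical stand-in for the per-queue hqd_info record. */
struct sketch_hqd_info {
	uint32_t pipe;
	uint32_t queue;
};

struct sketch_queue {
	uint32_t doorbell;
	struct sketch_hqd_info hqd;
	bool hung;      /* set by the caller to simulate a hang */
	bool mapped;
	bool suspended;
};

/* Simulated MES: queue state plus the two arrays RESET reports back. */
struct sketch_mes {
	struct sketch_queue q[SKETCH_MAX_QUEUES];
	uint32_t hung_doorbell[SKETCH_MAX_QUEUES];
	struct sketch_hqd_info hung_hqd[SKETCH_MAX_QUEUES];
};

/* SUSPEND: stop all queues. */
static void sketch_mes_suspend(struct sketch_mes *mes)
{
	for (size_t i = 0; i < SKETCH_MAX_QUEUES; i++)
		mes->q[i].suspended = true;
}

/* RESET: unmap hung queues and report them via doorbell array + hqd info. */
static void sketch_mes_reset(struct sketch_mes *mes)
{
	size_t slot = 0;

	for (size_t i = 0; i < SKETCH_MAX_QUEUES; i++)
		mes->hung_doorbell[i] = SKETCH_INVALID_DOORBELL;

	for (size_t i = 0; i < SKETCH_MAX_QUEUES; i++) {
		if (!mes->q[i].hung || !mes->q[i].mapped)
			continue;
		mes->q[i].mapped = false;
		mes->hung_doorbell[slot] = mes->q[i].doorbell;
		mes->hung_hqd[slot] = mes->q[i].hqd;
		slot++;
	}
}

/* RESUME: restart only the queues that are still mapped (healthy). */
static void sketch_mes_resume(struct sketch_mes *mes)
{
	for (size_t i = 0; i < SKETCH_MAX_QUEUES; i++)
		if (mes->q[i].mapped)
			mes->q[i].suspended = false;
}

/*
 * Driver side, entered after a REMOVE_QUEUE failure. Copies the hqd info of
 * each reported hung queue into 'cleaned' (up to 'cap') and returns the count.
 */
static size_t sketch_driver_recover(struct sketch_mes *mes,
				    struct sketch_hqd_info *cleaned, size_t cap)
{
	size_t n = 0;

	assert(mes && cleaned);
	sketch_mes_suspend(mes);
	sketch_mes_reset(mes);

	/* A valid doorbell offset marks a slot whose hqd info is meaningful. */
	for (size_t i = 0; i < SKETCH_MAX_QUEUES; i++) {
		if (mes->hung_doorbell[i] == SKETCH_INVALID_DOORBELL)
			continue;
		if (n < cap)
			cleaned[n++] = mes->hung_hqd[i];
	}

	sketch_mes_resume(mes);
	return n;
}
```

In this sketch the doorbell array doubles as the validity index for the hqd info array, which is one plausible reading of "finds valid doorbell offset ... and looks up hqd_info"; the actual layout is defined by the patches themselves.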
