Instead of having MES detect hangs and the driver perform the reset, this
series implements compute queue/pipe reset with both detection and reset
done in MES.

When REMOVE_QUEUE fails, the driver takes it to mean that at least one
queue has hung. The driver sends SUSPEND to suspend all queues, then RESET
to reset the hung queues. MES unmaps the hung queues and stores their
information in the doorbell array and hqd_info for the driver. The driver
finds each valid doorbell offset in the doorbell array and looks up
hqd_info for the corresponding hung queue's information. Next, the driver
cleans up the hung queues and sends RESUME to resume the healthy queues.
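
The flow above can be sketched as a small user-space simulation. This is
only an illustration of the ordering (SUSPEND, RESET, scan the doorbell
array, look up hqd_info, clean up, RESUME), not the code in this series.
Every name here (sim_queue, mes_reset_result, the mock_mes_* helpers,
MAX_HUNG, DOORBELL_INVALID) is hypothetical, and treating a zero doorbell
offset as an empty slot is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_HUNG 8           /* hypothetical size of the hang-report arrays */
#define DOORBELL_INVALID 0u  /* assumption: 0 marks an unused slot */

/* Hypothetical per-queue record reported for each hung queue. */
struct hqd_info_entry {
	uint32_t pipe;
	uint32_t queue;
};

/* Hypothetical stand-in for what MES hands back after RESET. */
struct mes_reset_result {
	uint32_t doorbell[MAX_HUNG];
	struct hqd_info_entry hqd_info[MAX_HUNG];
};

/* Simulated compute queue as the driver would track it. */
struct sim_queue {
	uint32_t doorbell;  /* doorbell offset, nonzero in this sketch */
	uint32_t pipe;
	uint32_t slot;
	bool hung;          /* ground truth, visible only to the mock MES */
	bool running;
	bool cleaned;       /* set once the driver has torn the queue down */
};

/* Mock SUSPEND: stop every queue. */
static void mock_mes_suspend(struct sim_queue *q, size_t n)
{
	for (size_t i = 0; i < n; i++)
		q[i].running = false;
}

/* Mock RESET: report each hung queue via doorbell[] and hqd_info[]. */
static void mock_mes_reset(const struct sim_queue *q, size_t n,
			   struct mes_reset_result *res)
{
	size_t used = 0;

	memset(res, 0, sizeof(*res));
	for (size_t i = 0; i < n && used < MAX_HUNG; i++) {
		if (!q[i].hung)
			continue;
		res->doorbell[used] = q[i].doorbell;
		res->hqd_info[used].pipe = q[i].pipe;
		res->hqd_info[used].queue = q[i].slot;
		used++;
	}
}

/* Mock RESUME: restart every queue that was not cleaned up. */
static void mock_mes_resume(struct sim_queue *q, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!q[i].cleaned)
			q[i].running = true;
}

/* Driver-side recovery after a failed REMOVE_QUEUE; returns queues cleaned. */
static size_t recover_from_remove_failure(struct sim_queue *q, size_t n)
{
	struct mes_reset_result res;
	size_t cleaned = 0;

	assert(q != NULL);
	mock_mes_suspend(q, n);
	mock_mes_reset(q, n, &res);

	for (size_t s = 0; s < MAX_HUNG; s++) {
		if (res.doorbell[s] == DOORBELL_INVALID)
			continue;  /* slot not used by the mock MES */
		for (size_t i = 0; i < n; i++) {
			if (q[i].doorbell == res.doorbell[s] &&
			    q[i].pipe == res.hqd_info[s].pipe &&
			    q[i].slot == res.hqd_info[s].queue) {
				q[i].cleaned = true;
				cleaned++;
			}
		}
	}

	mock_mes_resume(q, n);
	return cleaned;
}
```

Calling recover_from_remove_failure() on an array with one hung queue
leaves that queue cleaned and stopped while the remaining queues are
running again, which mirrors the intent described above.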

Amber Lin (8):
  drm/amdgpu: Fix gfx_hqd_mask in mes 12.1
  drm/amdgpu: Fixup boost mes detect hang array size
  drm/amdgpu: Fixup detect and reset
  drm/amdgpu: Create hqd info structure
  drm/amdgpu: Missing multi-XCC support in MES
  drm/amdgpu: Enable suspend/resume gang in mes 12.1
  drm/amdkfd: Add detect+reset hangs to GC 12.1
  drm/amdkfd: Reset queue/pipe in MES

 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c       |  89 ++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h       |  23 ++-
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c        |   2 +-
 drivers/gpu/drm/amd/amdgpu/mes_userqueue.c    |   2 +-
 drivers/gpu/drm/amd/amdgpu/mes_v12_1.c        |  98 ++++++++----
 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 151 +++++++++++++++++-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |   1 +
 8 files changed, 306 insertions(+), 61 deletions(-)

-- 
2.43.0
