This set improves per-queue reset support for a number of IPs.
When we reset a queue, its contents are lost, so we need
to re-emit the unprocessed state from subsequent submissions.
To make sure we restore the correct unprocessed state, we
need to enable legacy enforce isolation.  If we don't,
multiple jobs can run in parallel and we may not end up
resetting the correct one.  This is similar to how
Windows handles queues.  This also gives us correct guilty
tracking for GC.

Tested on GC 10 and 11 chips with a game running and
then running hang tests.  The game pauses when the
hang happens, then continues after the queue reset.

I tried this same approach on GC8 and 9, but it
was not as reliable as soft recovery.  As such, I've dropped
the KGQ reset code for pre-GC10.

The same approach is extended to SDMA and VCN.
They don't need enforce isolation because those engines
are single-threaded, so they always operate serially.

Rework re-emit to signal the seq number of the bad job and
use that to verify that the reset worked, then re-emit the
rest of the non-guilty state.  This way we are not waiting on
the rest of the state to complete, and if the subsequent state
also contains a bad job, we'll end up in queue reset again rather
than adapter reset.
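
The sequence above can be sketched as a small user-space model.
This is only an illustration under stated assumptions, not the
code in the series: model_job, model_ring, model_emit_fence() and
model_ring_reset() are hypothetical names, the "hardware" is a
flag, and jobs are kept in a plain array in submission order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these are amdgpu symbols. */
struct model_job {
	uint32_t seq;    /* fence sequence number assigned at submission */
	uint32_t ctx;    /* id of the submitting context */
	bool reemitted;  /* replayed on the queue after the reset */
	bool skipped;    /* dropped because its context is guilty */
};

struct model_ring {
	uint32_t signaled_seq; /* last fence value the "hardware" wrote */
	bool hw_alive;         /* assumption: true if the reset revived the queue */
};

/* Fence writes only land if the queue is actually running again. */
static void model_emit_fence(struct model_ring *ring, uint32_t seq)
{
	if (ring->hw_alive)
		ring->signaled_seq = seq;
}

/*
 * Handle a timeout of jobs[bad]; jobs[] is in submission order.
 * Returns 0 on success, or -1 if the reset could not be verified,
 * where a real driver would fall back to a heavier reset.
 */
static int model_ring_reset(struct model_ring *ring, struct model_job *jobs,
			    size_t njobs, size_t bad)
{
	uint32_t guilty_ctx;
	size_t i;

	assert(bad < njobs);
	guilty_ctx = jobs[bad].ctx;

	/* Step 1: signal only the bad job's seq on the freshly reset queue. */
	model_emit_fence(ring, jobs[bad].seq);

	/* Step 2: the reset worked only if that seq became visible. */
	if (ring->signaled_seq != jobs[bad].seq)
		return -1;
	jobs[bad].skipped = true;

	/*
	 * Step 3: replay the later, unprocessed jobs without waiting for
	 * them to finish. Jobs from the guilty context are skipped.
	 */
	for (i = bad + 1; i < njobs; i++) {
		if (jobs[i].ctx == guilty_ctx)
			jobs[i].skipped = true;
		else
			jobs[i].reemitted = true;
	}
	return 0;
}
```

Because step 3 only resubmits work and does not wait on it, a
second bad job among the replayed state would simply time out
on its own and go through the same queue reset path again.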

Git tree:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/kq_resets?ref_type=heads

v4: Drop explicit padding patches
    Drop new timeout macro
    Rework re-emit sequence
v5: Add a helper for reemit
    Convert VCN, JPEG, SDMA to use new helpers
v6: Update SDMA 4.4.2 to use new helpers
    Move ptr tracking to amdgpu_fence
    Skip all jobs from the bad context on the ring
v7: Rework the backup logic
    Move and clean up the guilty logic for engine resets
    Integrate suggestions from Christian
    Add JPEG 4.0.5 support

Alex Deucher (28):
  drm/amdgpu: enable legacy enforce isolation by default
  drm/amdgpu/gfx7: drop reset_kgq
  drm/amdgpu/gfx8: drop reset_kgq
  drm/amdgpu/gfx9: drop reset_kgq
  drm/amdgpu: switch job hw_fence to amdgpu_fence
  drm/amdgpu: update ring reset function signature
  drm/amdgpu: move force completion into ring resets
  drm/amdgpu: move guilty handling into ring resets
  drm/amdgpu: track ring state associated with a job
  drm/amdgpu/gfx9: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx9.4.3: re-emit unprocessed state on kcq reset
  drm/amdgpu/gfx10: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx11: re-emit unprocessed state on ring reset
  drm/amdgpu/gfx12: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma6: re-emit unprocessed state on ring reset
  drm/amdgpu/sdma7: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg2.5: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/jpeg4.0.5: add queue reset
  drm/amdgpu/jpeg5: add queue reset
  drm/amdgpu/jpeg5.0.1: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.3: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn4.0.5: re-emit unprocessed state on ring reset
  drm/amdgpu/vcn5: re-emit unprocessed state on ring reset

Christian König (1):
  drm/amdgpu: rework queue reset scheduler interaction

 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c   | 120 ++++++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c      |   8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |  59 ++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c    |  27 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    |  35 +++++-
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c      |  66 ++++++-----
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c      |  61 ++++++----
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c      |  61 ++++++----
 drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c       |  71 ------------
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c       |  71 ------------
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c       |  67 +++--------
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c     |  23 +++-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_0.c      |  21 +++-
 drivers/gpu/drm/amd/amdgpu/jpeg_v2_5.c      |  21 +++-
 drivers/gpu/drm/amd/amdgpu/jpeg_v3_0.c      |  21 +++-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0.c      |  21 +++-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c    |  21 +++-
 drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_5.c    |  25 ++++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_0.c    |  28 +++++
 drivers/gpu/drm/amd/amdgpu/jpeg_v5_0_1.c    |  21 +++-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c    |  61 +++++-----
 drivers/gpu/drm/amd/amdgpu/sdma_v5_0.c      |  33 +++++-
 drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c      |  35 +++++-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c      |  22 +++-
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c      |  22 +++-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c       |  19 +++-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c     |  20 +++-
 drivers/gpu/drm/amd/amdgpu/vcn_v4_0_5.c     |  20 +++-
 drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c     |  20 +++-
 32 files changed, 710 insertions(+), 400 deletions(-)

-- 
2.49.0
