On Wed, May 28, 2025 at 9:48 AM Christian König
<christian.koe...@amd.com> wrote:
>
> On 5/28/25 15:38, Alex Deucher wrote:
> > On Wed, May 28, 2025 at 7:40 AM Christian König
> > <christian.koe...@amd.com> wrote:
> >>
> >> On 5/28/25 06:19, Alex Deucher wrote:
> >>> Re-emit the unprocessed state after resetting the queue.
> >>>
> >>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 19 ++++++++++++-------
> >>>  1 file changed, 12 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> index 3193eb88b6889..f6e04cf21abcc 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> >>> @@ -9537,6 +9537,7 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>>         struct amdgpu_kiq *kiq = &adev->gfx.kiq[0];
> >>>         struct amdgpu_ring *kiq_ring = &kiq->ring;
> >>>         unsigned long flags;
> >>> +       unsigned int i;
> >>>         u32 tmp;
> >>>         u64 addr;
> >>>         int r;
> >>> @@ -9571,10 +9572,8 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>>                                    SOC15_REG_OFFSET(GC, 0, mmCP_VMID_RESET), 0, 0xffffffff);
> >>>         kiq->pmf->kiq_map_queues(kiq_ring, ring);
> >>>         amdgpu_ring_commit(kiq_ring);
> >>> -
> >>> -       spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>> -
> >>>         r = amdgpu_ring_test_ring(kiq_ring);
> >>
> >> I don't think we should do a ring test on the KIQ here. That basically
> >> doesn't tell us much and might cause additional problems.
> >
> > We need some way to wait for the KIQ submission to complete. This is
> > a simple way to accomplish that.
>
> Ok, that makes sense.
>
> >
> >>
> >>> +       spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>>         if (r)
> >>>                 return r;
> >>>
> >>> @@ -9584,7 +9583,15 @@ static int gfx_v10_0_reset_kgq(struct amdgpu_ring *ring, unsigned int vmid)
> >>>                 return r;
> >>>         }
> >>>
> >>> -       return amdgpu_ring_test_ring(ring);
> >>> +       if (amdgpu_ring_alloc(ring, 8 + ring->ring_backup_entries_to_copy))
> >>> +               return -ENOMEM;
> >>> +       amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
> >>> +                              ring->ring_backup_seq, 0);
> >>> +       for (i = 0; i < ring->ring_backup_entries_to_copy; i++)
> >>> +               amdgpu_ring_write(ring, ring->ring_backup[i]);
> >>> +       amdgpu_ring_commit(ring);
> >>
> >> I'm not sure if the commands are always relocatable. We should probably
> >> just instruct the ring to re-start with the original RPTR/WPTR.
> >>
> >> That would also avoid the need to save/restore the ring content with the
> >> CPU.
> >
> > I tried that originally, but I couldn't make it work for a few reasons:
> > 1. We need to have enforce isolation enabled, otherwise we almost
> > always reset the wrong VMID, and then when we execute the rest of the
> > pipeline, we hang again.
> > 2. When enforce isolation is enabled, we need to signal the fence
> > associated with the guilty job first, otherwise we get stuck waiting on
> > the pipeline sync when we execute the rest of the pipeline.
>
> So we execute the guilty job submission again? What prevents that one from
> running?
Without enforce isolation we usually end up resetting the wrong job. I
re-emit everything from the ring after the bad job. What usually happens
is that a game is running and then you start an app which hangs the GPU
(e.g., an infinite loop in a shader). Most of the time we reset the game
rather than the bad app: the scheduler times out on the game first, so
we reset the queue and re-emit the following jobs, one of which is the
bad app. The queue reset then fails and we fall back to a full adapter
reset. With enforce isolation, we reset the bad app instead.

Alex

> Christian.
>
> >
> > Alex
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>> +
> >>> +       return gfx_v10_0_ring_test_ib(ring, AMDGPU_QUEUE_RESET_TIMEOUT);
> >>>  }
> >>>
> >>>  static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> >>> @@ -9612,9 +9619,8 @@ static int gfx_v10_0_reset_kcq(struct amdgpu_ring *ring,
> >>>         kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES,
> >>>                                    0, 0);
> >>>         amdgpu_ring_commit(kiq_ring);
> >>> -       spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>> -
> >>>         r = amdgpu_ring_test_ring(kiq_ring);
> >>> +       spin_unlock_irqrestore(&kiq->ring_lock, flags);
> >>>         if (r)
> >>>                 return r;
> >>>
> >>> @@ -9891,7 +9897,6 @@ static const struct amdgpu_ring_funcs gfx_v10_0_ring_funcs_gfx = {
> >>>         .emit_wreg = gfx_v10_0_ring_emit_wreg,
> >>>         .emit_reg_wait = gfx_v10_0_ring_emit_reg_wait,
> >>>         .emit_reg_write_reg_wait = gfx_v10_0_ring_emit_reg_write_reg_wait,
> >>> -       .soft_recovery = gfx_v10_0_ring_soft_recovery,
> >>>         .emit_mem_sync = gfx_v10_0_emit_mem_sync,
> >>>         .reset = gfx_v10_0_reset_kgq,
> >>>         .emit_cleaner_shader = gfx_v10_0_ring_emit_cleaner_shader,
> >>
>