On Tue, May 27, 2025 at 12:47 PM Rodrigo Siqueira <sique...@igalia.com> wrote: > > On 05/22, Alex Deucher wrote: > > From: Christian König <ckoenig.leichtzumer...@gmail.com> > > > > Stopping the scheduler for queue reset is generally a good idea because > > it prevents any worker from touching the ring buffer. > > > > But using amdgpu_fence_driver_force_completion() before restarting it was > > a really bad idea because it marked fences as failed while the work was > > potentially still running. > > > > Stop doing that and cleanup the comment a bit. > > > > v2: keep amdgpu_fence_driver_force_completion() for non-gfx rings > > Why keep this amdgpu_fence_driver_force_completion() for non-gfx is ok? > From the commit descriptions, sounds like we want to avoid > amdgpu_fence_driver_force_completion() before the driver restarts the > queue.
It depends on what the queue reset actually does. If the entire queue is lost, then amdgpu_fence_driver_force_completion() is what we want. If the queue reset is able to reset the queue and restore the non-guilty state, then there is no need to call this because the subsequent fences should signal and the rest of the jobs should naturally complete. Alex > > > > > Reviewed-by: Alex Deucher <alexander.deuc...@amd.com> > > Signed-off-by: Christian König <christian.koe...@amd.com> > > Signed-off-by: Alex Deucher <alexander.deuc...@amd.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 26 +++++++++++++------------ > > 1 file changed, 14 insertions(+), 12 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > index acb21fc8b3ce5..e57401ef85140 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > @@ -136,10 +136,12 @@ static enum drm_gpu_sched_stat > > amdgpu_job_timedout(struct drm_sched_job *s_job) > > } else if (amdgpu_gpu_recovery && ring->funcs->reset) { > > bool is_guilty; > > > > - dev_err(adev->dev, "Starting %s ring reset\n", > > s_job->sched->name); > > - /* stop the scheduler, but don't mess with the > > - * bad job yet because if ring reset fails > > - * we'll fall back to full GPU reset. > > + dev_err(adev->dev, "Starting %s ring reset\n", > > + s_job->sched->name); > > + > > + /* > > + * Stop the scheduler to prevent anybody else from touching > > the > > + * ring buffer. > > */ > > drm_sched_wqueue_stop(&ring->sched); > > > > @@ -157,19 +159,19 @@ static enum drm_gpu_sched_stat > > amdgpu_job_timedout(struct drm_sched_job *s_job) > > > > r = amdgpu_ring_reset(ring, job->vmid); > > if (!r) { > > - if (amdgpu_ring_sched_ready(ring)) > > - drm_sched_stop(&ring->sched, s_job); > > if (is_guilty) { > > atomic_inc(&ring->adev->gpu_reset_counter); > > - amdgpu_fence_driver_force_completion(ring); > > + if (ring->funcs->type != AMDGPU_RING_TYPE_GFX) > > + > > amdgpu_fence_driver_force_completion(ring); > > } > > - if (amdgpu_ring_sched_ready(ring)) > > - drm_sched_start(&ring->sched, 0); > > - dev_err(adev->dev, "Ring %s reset succeeded\n", > > ring->sched.name); > > - drm_dev_wedged_event(adev_to_drm(adev), > > DRM_WEDGE_RECOVERY_NONE); > > + drm_sched_wqueue_start(&ring->sched); > > + dev_err(adev->dev, "Ring %s reset succeeded\n", > > + ring->sched.name); > > + drm_dev_wedged_event(adev_to_drm(adev), > > + DRM_WEDGE_RECOVERY_NONE); > > goto exit; > > } > > - dev_err(adev->dev, "Ring %s reset failure\n", > > ring->sched.name); > > + dev_err(adev->dev, "Ring %s reset failed\n", > > ring->sched.name); > > } > > dma_fence_set_error(&s_job->s_fence->finished, -ETIME); > > > > -- > > 2.49.0 > > > > -- > Rodrigo Siqueira