Am 25.03.25 um 16:24 schrieb Srinivasan Shanmugam:
> This commit addresses the issue where the cleaner shader was not
> correctly executed during gang submissions due to improper handling of
> the isolation spearhead.
>
> - Enhanced the `amdgpu_gfx_run_cleaner_shader_job` function to
>   initialize `isolation->spearhead` with the job's scheduled fence for
>   cleaner shader calls.
> - Updated the `amdgpu_vm_flush` function to properly initialize
>   `isolation->spearhead` to the job's scheduled fence when the cleaner
>   shader is required.
>
> Cc: Christian König <christian.koe...@amd.com>
> Cc: Alex Deucher <alexander.deuc...@amd.com>
> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmu...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 4 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 4 +++-
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 72af5e5a894a..807f17093006 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -1436,6 +1436,7 @@ static ssize_t 
> amdgpu_gfx_get_available_compute_partition(struct device *dev,
>  static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring)
>  {
>       struct amdgpu_device *adev = ring->adev;
> +     struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id];
>       struct drm_gpu_scheduler *sched = &ring->sched;
>       struct drm_sched_entity entity;
>       struct dma_fence *f;
> @@ -1464,6 +1465,9 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct 
> amdgpu_ring *ring)
>               ib->ptr[i] = ring->funcs->nop;
>       ib->length_dw = ring->funcs->align_mask + 1;
>  
> +     if (job->base.s_fence)
> +             isolation->spearhead = 
> dma_fence_get(&job->base.s_fence->scheduled);
> +

Apart from being very risky because of not grabbing locks that will leak the 
previous isolation->spearhead fence.

>       f = amdgpu_job_submit(job);
>  
>       r = dma_fence_wait(f, false);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index b5ddfcbbc9fc..e23400b53489 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -692,8 +692,10 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
> amdgpu_job *job,
>       if (need_pipe_sync)
>               amdgpu_ring_emit_pipeline_sync(ring);
>  
> -     if (cleaner_shader_needed)
> +     if (cleaner_shader_needed) {
> +             isolation->spearhead = 
> dma_fence_get(&job->base.s_fence->scheduled);

Same here.

Over all this change doesn't seem to make much sense to me.

Why exactly is isolation->spearhead not pointing to the dummy kernel job we 
submit?

Regards,
Christian.

>               ring->funcs->emit_cleaner_shader(ring);
> +     }
>  
>       if (vm_flush_needed) {
>               trace_amdgpu_vm_flush(ring, job->vmid, job->vm_pd_addr);

Reply via email to