Am 26.03.25 um 14:55 schrieb Alex Deucher: > On Wed, Mar 26, 2025 at 4:13 AM Christian König > <christian.koe...@amd.com> wrote: >> Am 25.03.25 um 16:24 schrieb Srinivasan Shanmugam: >>> This commit addresses the issue where the cleaner shader was not >>> correctly executed during gang submissions due to improper handling of >>> the isolation spearhead. >>> >>> - Enhanced the `amdgpu_gfx_run_cleaner_shader_job` function to >>> initialize `isolation->spearhead` with the job's scheduled fence for >>> cleaner shader calls. >>> - Updated the `amdgpu_vm_flush` function to properly initialize >>> `isolation->spearhead` to the job's scheduled fence when the cleaner >>> shader is required. >>> >>> Cc: Christian König <christian.koe...@amd.com> >>> Cc: Alex Deucher <alexander.deuc...@amd.com> >>> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmu...@amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 4 ++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++- >>> 2 files changed, 7 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>> index 72af5e5a894a..807f17093006 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>> @@ -1436,6 +1436,7 @@ static ssize_t >>> amdgpu_gfx_get_available_compute_partition(struct device *dev, >>> static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring) >>> { >>> struct amdgpu_device *adev = ring->adev; >>> + struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id]; >>> struct drm_gpu_scheduler *sched = &ring->sched; >>> struct drm_sched_entity entity; >>> struct dma_fence *f; >>> @@ -1464,6 +1465,9 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct >>> amdgpu_ring *ring) >>> ib->ptr[i] = ring->funcs->nop; >>> ib->length_dw = ring->funcs->align_mask + 1; >>> >>> + if (job->base.s_fence) >>> + isolation->spearhead = >>> dma_fence_get(&job->base.s_fence->scheduled); >>> + >> Apart from being very risky because of not grabbing locks that will leak the >> previous isolation->spearhead fence. >> >>> f = amdgpu_job_submit(job); >>> >>> r = dma_fence_wait(f, false); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> index b5ddfcbbc9fc..e23400b53489 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >>> @@ -692,8 +692,10 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct >>> amdgpu_job *job, >>> if (need_pipe_sync) >>> amdgpu_ring_emit_pipeline_sync(ring); >>> >>> - if (cleaner_shader_needed) >>> + if (cleaner_shader_needed) { >>> + isolation->spearhead = >>> dma_fence_get(&job->base.s_fence->scheduled); >> Same here. >> >> Over all this change doesn't seem to make much sense to me. >> >> Why exactly is isolation->spearhead not pointing to the dummy kernel job we >> submit? > Does the owner check or gang_submit check in > amdgpu_device_enforce_isolation() fail to set up the spearhead?
I'm currently debugging exactly that. Good news is that I can reproduce the problem. Regards, Christian. > > if (isolation->owner != owner) { > if (!job->gang_submit) { > dep = amdgpu_device_get_gang(adev); > if (!dma_fence_is_signaled(dep)) > goto out_return_dep; > dma_fence_put(dep); > } > > > Alex > >> Regards, >> Christian. >> >>> ring->funcs->emit_cleaner_shader(ring); >>> + } >>> >>> if (vm_flush_needed) { >>> trace_amdgpu_vm_flush(ring, job->vmid, job->vm_pd_addr);