On Wed, Mar 26, 2025 at 4:13 AM Christian König <christian.koe...@amd.com> wrote: > > Am 25.03.25 um 16:24 schrieb Srinivasan Shanmugam: > > This commit addresses the issue where the cleaner shader was not > > correctly executed during gang submissions due to improper handling of > > the isolation spearhead. > > > > - Enhanced the `amdgpu_gfx_run_cleaner_shader_job` function to > > initialize `isolation->spearhead` with the job's scheduled fence for > > cleaner shader calls. > > - Updated the `amdgpu_vm_flush` function to properly initialize > > `isolation->spearhead` to the job's scheduled fence when the cleaner > > shader is required. > > > > Cc: Christian König <christian.koe...@amd.com> > > Cc: Alex Deucher <alexander.deuc...@amd.com> > > Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmu...@amd.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 4 ++++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +++- > > 2 files changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > > index 72af5e5a894a..807f17093006 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c > > @@ -1436,6 +1436,7 @@ static ssize_t > > amdgpu_gfx_get_available_compute_partition(struct device *dev, > > static int amdgpu_gfx_run_cleaner_shader_job(struct amdgpu_ring *ring) > > { > > struct amdgpu_device *adev = ring->adev; > > + struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id]; > > struct drm_gpu_scheduler *sched = &ring->sched; > > struct drm_sched_entity entity; > > struct dma_fence *f; > > @@ -1464,6 +1465,9 @@ static int amdgpu_gfx_run_cleaner_shader_job(struct > > amdgpu_ring *ring) > > ib->ptr[i] = ring->funcs->nop; > > ib->length_dw = ring->funcs->align_mask + 1; > > > > + if (job->base.s_fence) > > + isolation->spearhead = > > dma_fence_get(&job->base.s_fence->scheduled); > > + > > Apart from being very risky because of not grabbing locks that will leak the > previous isolation->spearhead fence. > > > f = amdgpu_job_submit(job); > > > > r = dma_fence_wait(f, false); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > index b5ddfcbbc9fc..e23400b53489 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c > > @@ -692,8 +692,10 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct > > amdgpu_job *job, > > if (need_pipe_sync) > > amdgpu_ring_emit_pipeline_sync(ring); > > > > - if (cleaner_shader_needed) > > + if (cleaner_shader_needed) { > > + isolation->spearhead = > > dma_fence_get(&job->base.s_fence->scheduled); > > Same here. > > Over all this change doesn't seem to make much sense to me. > > Why exactly is isolation->spearhead not pointing to the dummy kernel job we > submit?
Does the owner check or gang_submit check in amdgpu_device_enforce_isolation() fail to set up the spearhead? if (isolation->owner != owner) { if (!job->gang_submit) { dep = amdgpu_device_get_gang(adev); if (!dma_fence_is_signaled(dep)) goto out_return_dep; dma_fence_put(dep); } Alex > > Regards, > Christian. > > > ring->funcs->emit_cleaner_shader(ring); > > + } > > > > if (vm_flush_needed) { > > trace_amdgpu_vm_flush(ring, job->vmid, job->vm_pd_addr); >