On 4/22/2025 2:57 PM, Christian König wrote:
Am 22.04.25 um 11:14 schrieb Liang, Prike:
[Public]
-----Original Message-----
From: Christian König <ckoenig.leichtzumer...@gmail.com>
Sent: Thursday, April 17, 2025 3:40 PM
To: Liang, Prike <prike.li...@amd.com>; Koenig, Christian
<christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu: free the evf when the attached bo release
Am 16.04.25 um 16:47 schrieb Liang, Prike:
[Public]
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Wednesday, April 16, 2025 7:07 PM
To: Liang, Prike <prike.li...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>
Subject: Re: [PATCH 4/4] drm/amdgpu: free the evf when the attached
bo release
Am 16.04.25 um 10:50 schrieb Prike Liang:
Free the evf when the attached bo released. The evf still be
dependent on and referred to by the attached bo that is scheduled by
the kernel queue SDMA or gfx after the evf signalled.
Signed-off-by: Prike Liang <prike.li...@amd.com>
---
.../drm/amd/amdgpu/amdgpu_eviction_fence.c | 31 ++++++++++++++++--
-
.../drm/amd/amdgpu/amdgpu_eviction_fence.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 1 +
3 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
index b34225bbd85d..60be1ac5047d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
@@ -27,6 +27,7 @@
#define work_to_evf_mgr(w, name) container_of(w, struct
amdgpu_eviction_fence_mgr, name) #define evf_mgr_to_fpriv(e)
container_of(e, struct amdgpu_fpriv, evf_mgr)
+#define fence_to_evf(f) container_of(f, struct
+amdgpu_eviction_fence, base)
static const char *
amdgpu_eviction_fence_get_driver_name(struct dma_fence *fence) @@
-47,7 +48,7 @@ int amdgpu_eviction_fence_replace_fence(struct
amdgpu_eviction_fence_mgr *evf_mgr,
struct drm_exec *exec) {
- struct amdgpu_eviction_fence *old_ef, *new_ef;
+ struct amdgpu_eviction_fence *new_ef;
struct drm_gem_object *obj;
unsigned long index;
int ret;
@@ -72,7 +73,6 @@ amdgpu_eviction_fence_replace_fence(struct
amdgpu_eviction_fence_mgr *evf_mgr,
/* Update the eviction fence now */
spin_lock(&evf_mgr->ev_fence_lock);
- old_ef = evf_mgr->ev_fence;
evf_mgr->ev_fence = new_ef;
spin_unlock(&evf_mgr->ev_fence_lock);
@@ -102,9 +102,6 @@ amdgpu_eviction_fence_replace_fence(struct
amdgpu_eviction_fence_mgr *evf_mgr,
}
}
- /* Free old fence */
- if (old_ef)
- dma_fence_put(&old_ef->base);
That change looks completely incorrect to me, you will now leak the old fence.
The eviction fence is attached and shared by all the restored validated VM BOs
during UQ restore, and at this placement the eviction fence is only detached
from
one of the BOs. Using amdgpu_userq_remove_all_eviction_fences() will walk over
the resv objects and detach the fence from the resv objs when freeing the BO.
Yeah, but that doesn't justify this change here. See you're completely messing
up
the fence reference count with that.
But there's a problem: even though dropping all the evf attached to VM BOs with
this patch, the evf still referred to by the SDMA and GFX kernel queue jobs at
the
case when enabling the kq and uq at the same time. Thoughts?
Mhm, the eviction fence is always added as bookmark isn't it? As long as the GFX
and SDMA jobs are not for evicting something then they should only depend on
fences with usage < bookmark.
Can you dig up when they are added to the dependencies of the job?
When the eviction fence was added to the user queue VM BOs reservation and then
updated the BO page table, which will add the eviction fence to the VM sync at
amdgpu_sync_resv()
Stop, wait a second. That shouldn't happen at the first place. Why is the
eviction fence considered a dependency for page table updates?
When it is added only as bookkeep then we should never consider that here.
Looks like something in the sync obj is messed up.
It is like this. Here, amdgpu_sync_resv is using DMA_RESV_USAGE_BOOKKEEP.
int amdgpu_sync_resv() {
..
/* TODO: Use DMA_RESV_USAGE_READ here */
dma_resv_for_each_fence(&cursor, resv,
DMA_RESV_USAGE_BOOKKEEP, f) {
dma_fence_chain_for_each(f, f) {
..
}
during PT update amdgpu_vm_bo_update() is using sync to moving
fences(Eviction fence) before mapping anything. Because of this Eviction
fence will act as dependency.
~arvind
Regards,
Christian.
, and then the eviction fence will be added as a dependent fence by propagating
with amdgpu_sync_push_to_job(). With removing the eviction fence from the VM
sync at amdgpu_sync_resv(), then the eviction fence can be released properly.
Thanks,
Prike
Thanks,
Christian.
PS: Please stop calling the eviction fence evf.
return 0;
free_err:
@@ -237,6 +234,30 @@ void amdgpu_eviction_fence_detach(struct
amdgpu_eviction_fence_mgr *evf_mgr,
dma_fence_put(stub);
}
+void amdgpu_userq_remove_all_eviction_fences(struct amdgpu_bo *bo)
Please name that amdgpu_eviction_fence_remove_all().
Noted.
Regards,
Christian.
+{
+ struct dma_resv *resv = &bo->tbo.base._resv;
+ struct dma_fence *fence, *stub;
+ struct dma_resv_iter cursor;
+
+ dma_resv_assert_held(resv);
+
+ stub = dma_fence_get_stub();
+ dma_resv_for_each_fence(&cursor, resv,
DMA_RESV_USAGE_BOOKKEEP, fence) {
+ struct amdgpu_eviction_fence *ev_fence;
+
+ ev_fence = fence_to_evf(fence);
+ if (!ev_fence || !dma_fence_is_signaled(&ev_fence->base))
+ continue;
+
+ dma_resv_replace_fences(resv, fence->context, stub,
+ DMA_RESV_USAGE_BOOKKEEP);
+
+ }
+
+ dma_fence_put(stub);
+}
+
int amdgpu_eviction_fence_init(struct amdgpu_eviction_fence_mgr
*evf_mgr) {
/* This needs to be done one time per open */ diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.h
index fcd867b7147d..da99ac322a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.h
@@ -66,4 +66,5 @@ amdgpu_eviction_fence_signal(struct
amdgpu_eviction_fence_mgr *evf_mgr, int
amdgpu_eviction_fence_replace_fence(struct amdgpu_eviction_fence_mgr
*evf_mgr,
struct drm_exec *exec);
+void amdgpu_userq_remove_all_eviction_fences(struct amdgpu_bo *bo);
#endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 1e73ce30d4d7..f001018a01eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1392,6 +1392,7 @@ void amdgpu_bo_release_notify(struct
ttm_buffer_object *bo)
amdgpu_vram_mgr_set_cleared(bo->resource);
dma_resv_add_fence(&bo->base._resv, fence,
DMA_RESV_USAGE_KERNEL);
dma_fence_put(fence);
+ amdgpu_userq_remove_all_eviction_fences(abo);
out:
dma_resv_unlock(&bo->base._resv);