Reviewed-by: Arvind Yadav <arvind.ya...@amd.com>

On 5/15/2025 6:49 PM, Liang, Prike wrote:
[Public]

I didn't clean up the userq resource destruction in the postclose callback in my
last patch, so please remove the duplicated userq destroy here. With that, the
change in the patch is
Reviewed-by: Prike Liang <prike.li...@amd.com>

Regards,
       Prike

-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of
Jesse.Zhang
Sent: Thursday, May 15, 2025 3:07 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Koenig, Christian
<christian.koe...@amd.com>; Zhang, Jesse(Jie) <jesse.zh...@amd.com>
Subject: [PATCH] drm/amdgpu: Fix eviction fence worker race during fd close

The current cleanup order during file descriptor close can lead to a race
condition where the eviction fence worker attempts to access a destroyed
mutex from the user queue manager:

[  517.294055] DEBUG_LOCKS_WARN_ON(lock->magic != lock) [  517.294060]
WARNING: CPU: 8 PID: 2030 at kernel/locking/mutex.c:564 [  517.294094]
Workqueue: events amdgpu_eviction_fence_suspend_worker [amdgpu]

The issue occurs because:
1. We destroy the user queue manager (including its mutex) first
2. Then we try to destroy eviction fences which may have pending work
3. The eviction fence worker may try to access the already-destroyed mutex

Fix this by reordering the cleanup to:
1. First mark the fd as closing and destroy eviction fences,
    which flushes any pending work
2. Then safely destroy the user queue manager after we're certain
    no more fence work will be executed

Signed-off-by: Jesse Zhang <jesse.zh...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4ddd08ce8885..4db92e0a60da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2913,8 +2913,8 @@ static int amdgpu_drm_release(struct inode *inode, struct file *filp)

       if (fpriv) {
               fpriv->evf_mgr.fd_closing = true;
-             amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
               amdgpu_eviction_fence_destroy(&fpriv->evf_mgr);
+             amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
       }

       return drm_release(inode, filp);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 9fbb04aee97b..1fec3713fbf2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1504,8 +1504,8 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,

       if (!fpriv->evf_mgr.fd_closing) {
               fpriv->evf_mgr.fd_closing = true;
-             amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
               amdgpu_eviction_fence_destroy(&fpriv->evf_mgr);
+             amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
       }
       amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
       amdgpu_vm_fini(adev, &fpriv->vm);
--
2.49.0