From: Yifan Zhang <yifan1.zh...@amd.com>

[ Upstream commit 90237b16ec1d7afa16e2173cc9a664377214cdd9 ]

This patch is to fix a kfd_process ref leak.

Signed-off-by: Yifan Zhang <yifan1.zh...@amd.com>
Reviewed-by: Philip Yang <philip.y...@amd.com>
Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
Signed-off-by: Sasha Levin <sas...@kernel.org>
---

**YES** This commit should be backported to stable kernel trees. Here's my detailed analysis:

## Bug Analysis

This patch fixes a clear **resource leak bug**: a reference count is not decremented on an error path. Specifically:

1. **The leak occurs in `kfd_signal_poison_consumed_event()`** at line 1334, where the function returns early without calling `kfd_unref_process(p)`:

```c
if (unlikely(user_gpu_id == -EINVAL)) {
	WARN_ONCE(1, "Could not get user_gpu_id from dev->id:%x\n", dev->id);
	return; // BUG: Missing kfd_unref_process(p)
}
```

2. **The function `kfd_lookup_process_by_pasid()` increments the reference count** on the kfd_process object (as confirmed by examining the kernel source). Every successful call must therefore be paired with `kfd_unref_process()`.

3. **The fix is minimal and correct**: It adds the missing `kfd_unref_process(p)` call before the early return, so the reference taken by the lookup is dropped on this path as well.

## Why This Should Be Backported

1. **It's a clear bug fix**: This fixes a reference count leak. Each time the error path is hit, one reference is never dropped, so the kfd_process object can never be freed. Over time, this could lead to significant memory exhaustion.

2. **The fix is minimal and contained**: Only one line is added, so the regression risk is low. The change doesn't alter any other logic or introduce new features.

3. **It affects user-visible behavior**: When GPU errors occur and this code path is triggered repeatedly, the memory leak could impact system stability and performance.

4. **Similar to previously backported commits**: Looking at the historical examples, commit #4, which also fixed a memory leak in create_process failure, was marked for backporting and included "Cc: sta...@vger.kernel.org".

5. **The bug is in error handling code**: While the error condition might be rare (invalid GPU ID), the leak happens every time it does occur. Error handling bugs are particularly important to fix because they can accumulate unnoticed.

6. **No architectural changes**: This is a simple bug fix that doesn't change any APIs or data structures, and doesn't introduce new functionality.

The commit follows the stable kernel rules: it fixes a real bug, is minimal in scope, has clear consequences if not fixed (memory leak), and doesn't introduce new features or risky changes.

 drivers/gpu/drm/amd/amdkfd/kfd_events.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
index fecdb67940750..3a926eb82379b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -1331,6 +1331,7 @@ void kfd_signal_poison_consumed_event(struct kfd_node *dev, u32 pasid)
 	user_gpu_id = kfd_process_get_user_gpu_id(p, dev->id);
 	if (unlikely(user_gpu_id == -EINVAL)) {
 		WARN_ONCE(1, "Could not get user_gpu_id from dev->id:%x\n", dev->id);
+		kfd_unref_process(p);
 		return;
 	}

-- 
2.39.5
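
For illustration only, the lookup/unref pairing rule described in the analysis can be sketched in plain userspace C. Everything here is hypothetical and is not kernel code: `struct proc`, `proc_lookup()`, `proc_unref()` and `signal_event()` are stand-ins for the kfd_process object, `kfd_lookup_process_by_pasid()`, `kfd_unref_process()` and the event handler. The sketch routes the error path through a single `out_unref` label, which is a different style from the upstream fix (the patch adds the unref call directly before the early return), but it enforces the same invariant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a refcounted kernel object. */
struct proc {
	int refcount;
	int gpu_id_valid;	/* 0 simulates the -EINVAL error path */
};

/* Like a lookup helper: returns the object with one extra reference held. */
static struct proc *proc_lookup(struct proc *entry)
{
	if (entry)
		entry->refcount++;
	return entry;
}

/* Drops one reference; every successful proc_lookup() needs exactly one. */
static void proc_unref(struct proc *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/* Returns 0 on success and -1 on error; the reference is dropped either way. */
int signal_event(struct proc *entry)
{
	int ret = 0;
	struct proc *p = proc_lookup(entry);

	if (!p)
		return -1;	/* no reference was taken, nothing to drop */

	if (!p->gpu_id_valid) {
		ret = -1;
		goto out_unref;	/* the error path still releases the reference */
	}

	/* ... normal event delivery would happen here ... */

out_unref:
	proc_unref(p);
	return ret;
}
```

Calling `signal_event()` on an object whose `refcount` is 1 and whose `gpu_id_valid` is 0 returns -1 and leaves `refcount` at 1. Without the unref on the error path, each such call would leave the count one higher, which is the leak the patch removes.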