in_recovery is earlier than ras feature

Bob Zhou Fri, 27 Oct 2023 03:03:23 -0700

Checking ras->in_recovery is earlier than ras feature that causes the
below null pointer issue. So update the check order to fix it.


BUG: kernel NULL pointer dereference, address: 00000000000000e8
RIP: 0010:amdgpu_ras_reset_error_count+0xf6/0x190 [amdgpu]
Call Trace:
 <TASK>
 ? show_regs+0x72/0x90
 ? __die+0x25/0x80
 ? page_fault_oops+0x79/0x190
 ? do_user_addr_fault+0x30c/0x640
 ? __wake_up_klogd.part.0+0x40/0x70
 ? exc_page_fault+0x81/0x1b0
 ? asm_exc_page_fault+0x27/0x30
 ? amdgpu_ras_reset_error_count+0xf6/0x190 [amdgpu]
 ? __pfx_gmc_v9_0_late_init+0x10/0x10 [amdgpu]
 gmc_v9_0_late_init+0x97/0xe0 [amdgpu]

Fixes: be5c7eb10406 ("drm/amdgpu: bypass RAS error reset in some conditions")

Signed-off-by: Bob Zhou <bob.z...@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 303fbb6a48b6..3af50754800d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1229,15 +1229,15 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device 
*adev,
                return -EOPNOTSUPP;
        }
 
+       if (!amdgpu_ras_is_supported(adev, block) ||
+           !amdgpu_ras_get_mca_debug_mode(adev))
+               return -EOPNOTSUPP;
+
        /* skip ras error reset in gpu reset */
        if ((amdgpu_in_reset(adev) || atomic_read(&ras->in_recovery)) &&
            mca_funcs && mca_funcs->mca_set_debug_mode)
                return -EOPNOTSUPP;
 
-       if (!amdgpu_ras_is_supported(adev, block) ||
-           !amdgpu_ras_get_mca_debug_mode(adev))
-               return -EOPNOTSUPP;
-
        if (block_obj->hw_ops->reset_ras_error_count)
                block_obj->hw_ops->reset_ras_error_count(adev);
 
-- 
2.34.1

[PATCH] drm/amdgpu: fix check order ras->in_recovery is earlier than ras feature

Reply via email to