From: Guchun Chen <[email protected]>

[ Upstream commit 12c17b9d62663c14a5343d6742682b3e67280754 ]

When running RAS uncorrectable error injection and triggering a GPU
reset on sGPU, the issue below is observed. It is caused by accessing
the device list before it has been initialized.

[   80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
[   80.047300] #PF: supervisor write access in kernel mode
[   80.047351] #PF: error_code(0x0003) - permissions violation
[   80.047477] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
[   80.047477] Oops: 0003 [#1] SMP PTI
[   80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G           OE     5.4.0-rc7-guchchen #1
[   80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[   80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]

Signed-off-by: Guchun Chen <[email protected]>
Reviewed-by: John Clements <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
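
Note (illustration only, not part of the patch): the failure mode can be
sketched in user space. The code below is a minimal re-implementation of
the circular list idea, not the kernel's <linux/list.h>; the function
single_device_list_length() and the variable node are hypothetical
stand-ins, and it assumes device_list is a local variable that nothing
else initializes. list_add_tail() reads head->prev, so a head that was
never passed through INIT_LIST_HEAD() sends the write to whatever
address happens to be stored there.

```c
/* User-space sketch; NOT the kernel's <linux/list.h>, only the same idea. */
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* An empty circular list is a head that points to itself. */
static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Insert entry just before head, i.e. at the tail of the list.
 * This dereferences head->prev, so head must already be initialized.
 */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
}

/* Count the entries linked on head (the head itself is not counted). */
static size_t list_length(const struct list_head *head)
{
	size_t n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Hypothetical stand-in for the single-GPU branch touched by the patch. */
static size_t single_device_list_length(void)
{
	struct list_head device_list;	/* on-stack: contents indeterminate */
	struct list_head node;		/* stands in for adev->gmc.xgmi.head */

	INIT_LIST_HEAD(&device_list);	/* the step the patch adds */
	list_add_tail(&node, &device_list);

	assert(device_list.next == &node && device_list.prev == &node);
	return list_length(&device_list);
}
```

With the INIT_LIST_HEAD() call in place the list holds exactly one
entry. Dropping that call would make list_add_tail() follow an
indeterminate head->prev pointer, which is undefined behavior in this
sketch.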

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index b0aa4e1ed4df7..cd18596b47d33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1444,9 +1444,10 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
        struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, false);
 
        /* Build list of devices to query RAS related errors */
-       if  (hive && adev->gmc.xgmi.num_physical_nodes > 1) {
+       if  (hive && adev->gmc.xgmi.num_physical_nodes > 1)
                device_list_handle = &hive->device_list;
-       } else {
+       else {
+               INIT_LIST_HEAD(&device_list);
                list_add_tail(&adev->gmc.xgmi.head, &device_list);
                device_list_handle = &device_list;
        }
-- 
2.25.1


