On 1/19/21 7:22 AM, Horace Chen wrote:
Fix a racing issue when jobs on 2 rings timeout simultaneously.

If 2 rings timed out at the same time, the amdgpu_device_gpu_recover
will be reentered. Then the adev->gmc.xgmi.head will be grabbed
by 2 local linked list, which may cause wild pointer issue in
iterating.

lock the device earily to prevent the node be added to 2 different
lists.

Signed-off-by: Horace Chen <horace.c...@amd.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
  1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d434803fb49..9574da3abc32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        int i, r = 0;
        bool need_emergency_restart = false;
        bool audio_suspended = false;
+       bool get_dev_lock = false;
/*
         * Special case: RAS triggered and full reset isn't supported
@@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
         * Build list of devices to reset.
         * In case we are in XGMI hive mode, resort the device list
         * to put adev in the 1st position.
+        *
+        * lock the device before we try to operate the linked list
+        * if didn't get the device lock, don't touch the linked list since
+        * others may iterating it.
         */
        INIT_LIST_HEAD(&device_list);
        if (adev->gmc.xgmi.num_physical_nodes > 1) {
                if (!hive)
                        return -ENODEV;
-               if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
-                       list_rotate_to_front(&adev->gmc.xgmi.head, 
&hive->device_list);
-               device_list_handle = &hive->device_list;
+
+               list_for_each_entry(tmp_adev, &hive->device_list, 
gmc.xgmi.head) {
+                       get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
+                       if (!get_dev_lock)
+                               break;


What about unlocking back all the devices you already locked if the break
happens in the middle of the iteration ?
Note that at skip_recovery: we don't do it. BTW, i see this issue is already in the current code.

Also, maybe now it's better to separate the actual locking in amdgpu_device_lock_adev from the other stuff going on there since I don't think you would wont to toggle stuff like adev->mp1_state back and forth and also the function name is not descriptive of
the other stuff going on there anyway.

Andrey


+               }
+               if (get_dev_lock) {
+                       if (!list_is_first(&adev->gmc.xgmi.head, 
&hive->device_list))
+                               list_rotate_to_front(&adev->gmc.xgmi.head, 
&hive->device_list);
+                       device_list_handle = &hive->device_list;
+               }
        } else {
-               list_add_tail(&adev->gmc.xgmi.head, &device_list);
-               device_list_handle = &device_list;
+               get_dev_lock = amdgpu_device_lock_adev(adev, hive);
+               tmp_adev = adev;
+               if (get_dev_lock) {
+                       list_add_tail(&adev->gmc.xgmi.head, &device_list);
+                       device_list_handle = &device_list;
+               }
+       }
+
+       if (!get_dev_lock) {
+               dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another 
already in progress",
+                                       job ? job->base.id : -1);
+               r = 0;
+               /* even we skipped this reset, still need to set the job to 
guilty */
+               goto skip_recovery;
        }
/* block all schedulers and reset given job's ring */
        list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-               if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-                       dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as 
another already in progress",
-                                 job ? job->base.id : -1);
-                       r = 0;
-                       goto skip_recovery;
-               }
-
                /*
                 * Try to put the audio codec into suspend state
                 * before gpu reset started.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to