On 1/19/21 7:22 AM, Horace Chen wrote:
Fix a racing issue when jobs on 2 rings timeout simultaneously.
If 2 rings timed out at the same time, the amdgpu_device_gpu_recover
will be reentered. Then the adev->gmc.xgmi.head will be grabbed
by 2 local linked list, which may cause wild pointer issue in
iterating.
lock the device earily to prevent the node be added to 2 different
lists.
Signed-off-by: Horace Chen <horace.c...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
1 file changed, 30 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4d434803fb49..9574da3abc32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+ bool get_dev_lock = false;
/*
* Special case: RAS triggered and full reset isn't supported
@@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
*adev,
* Build list of devices to reset.
* In case we are in XGMI hive mode, resort the device list
* to put adev in the 1st position.
+ *
+ * lock the device before we try to operate the linked list
+ * if didn't get the device lock, don't touch the linked list since
+ * others may iterating it.
*/
INIT_LIST_HEAD(&device_list);
if (adev->gmc.xgmi.num_physical_nodes > 1) {
if (!hive)
return -ENODEV;
- if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
- list_rotate_to_front(&adev->gmc.xgmi.head,
&hive->device_list);
- device_list_handle = &hive->device_list;
+
+ list_for_each_entry(tmp_adev, &hive->device_list,
gmc.xgmi.head) {
+ get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
+ if (!get_dev_lock)
+ break;
What about unlocking back all the devices you already locked if the break
happens in the middle of the iteration ?
Note that at skip_recovery: we don't do it. BTW, i see this issue is already in
the current code.
Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev
from the other stuff going on there since I don't think you would wont to toggle
stuff
like adev->mp1_state back and forth and also the function name is not
descriptive of
the other stuff going on there anyway.
Andrey
+ }
+ if (get_dev_lock) {
+ if (!list_is_first(&adev->gmc.xgmi.head,
&hive->device_list))
+ list_rotate_to_front(&adev->gmc.xgmi.head,
&hive->device_list);
+ device_list_handle = &hive->device_list;
+ }
} else {
- list_add_tail(&adev->gmc.xgmi.head, &device_list);
- device_list_handle = &device_list;
+ get_dev_lock = amdgpu_device_lock_adev(adev, hive);
+ tmp_adev = adev;
+ if (get_dev_lock) {
+ list_add_tail(&adev->gmc.xgmi.head, &device_list);
+ device_list_handle = &device_list;
+ }
+ }
+
+ if (!get_dev_lock) {
+ dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another
already in progress",
+ job ? job->base.id : -1);
+ r = 0;
+ /* even we skipped this reset, still need to set the job to
guilty */
+ goto skip_recovery;
}
/* block all schedulers and reset given job's ring */
list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
- if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
- dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as
another already in progress",
- job ? job->base.id : -1);
- r = 0;
- goto skip_recovery;
- }
-
/*
* Try to put the audio codec into suspend state
* before gpu reset started.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx