I'm OK to separate the locking in amdgpu_device_lock_adev here; I'll do some
tests and update the code later.
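
For example, the split could look something like the sketch below (the helper
names are just placeholders, and it assumes the existing
amdgpu_device_lock_adev() both takes the in_gpu_reset lock and switches
adev->mp1_state):

    static bool amdgpu_device_trylock_reset(struct amdgpu_device *adev)
    {
            /* Pure locking half: succeed only when no reset is already in
             * flight (assumes adev->in_gpu_reset is the atomic flag the
             * current helper uses). */
            return atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) == 0;
    }

    static void amdgpu_device_set_mp1_state(struct amdgpu_device *adev)
    {
            /* State handling kept out of the trylock so that merely probing
             * the lock doesn't toggle adev->mp1_state back and forth. */
            switch (amdgpu_asic_reset_method(adev)) {
            case AMD_RESET_METHOD_MODE1:
                    adev->mp1_state = PP_MP1_STATE_SHUTDOWN;
                    break;
            case AMD_RESET_METHOD_BACO:
                    adev->mp1_state = PP_MP1_STATE_RESET;
                    break;
            default:
                    adev->mp1_state = PP_MP1_STATE_NONE;
                    break;
            }
    }

The reset path would then only call the trylock helper while probing the
devices and switch the MP1 state once every lock is actually held.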
Thanks & Regards,
Horace.
--------------------------------------------------------------------------------
*From:* Grodzovsky, Andrey <andrey.grodzov...@amd.com>
*Sent:* January 19, 2021 22:33
*To:* Chen, Horace <horace.c...@amd.com>; amd-gfx@lists.freedesktop.org
*Cc:* Quan, Evan <evan.q...@amd.com>; Tuikov, Luben <luben.tui...@amd.com>;
Koenig, Christian <christian.koe...@amd.com>; Deucher, Alexander
<alexander.deuc...@amd.com>; Xiao, Jack <jack.x...@amd.com>; Zhang, Hawking
<hawking.zh...@amd.com>; Liu, Monk <monk....@amd.com>; Xu, Feifei
<feifei...@amd.com>; Wang, Kevin(Yang) <kevin1.w...@amd.com>; Xiaojie Yuan
<xiaojie.y...@amd.com>
*Subject:* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a race issue when jobs on 2 rings time out simultaneously.
>
> If 2 rings time out at the same time, amdgpu_device_gpu_recover
> will be re-entered. Then adev->gmc.xgmi.head will be grabbed
> by 2 local linked lists, which may cause a wild-pointer issue
> while iterating.
>
> Lock the device early to prevent the node from being added to
> 2 different lists.
>
> Signed-off-by: Horace Chen <horace.c...@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
> 1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> int i, r = 0;
> bool need_emergency_restart = false;
> bool audio_suspended = false;
> + bool get_dev_lock = false;
>
> /*
> * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> * Build list of devices to reset.
> * In case we are in XGMI hive mode, resort the device list
> * to put adev in the 1st position.
> + *
> + * Lock the device before we try to operate on the linked list;
> + * if we didn't get the device lock, don't touch the linked list
> + * since others may be iterating over it.
> */
> INIT_LIST_HEAD(&device_list);
> if (adev->gmc.xgmi.num_physical_nodes > 1) {
> if (!hive)
> return -ENODEV;
> - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> - device_list_handle = &hive->device_list;
> +
> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> + if (!get_dev_lock)
> + break;
What about unlocking all the devices you already locked if the break
happens in the middle of the iteration? Note that at skip_recovery: we
don't do it. BTW, I see this issue already exists in the current code.
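
Something like the rough sketch below could drop the locks already taken
before bailing out (illustration only, reusing the loop from the patch and
assuming amdgpu_device_unlock_adev() is the matching unlock helper):

    list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
            get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
            if (!get_dev_lock) {
                    struct amdgpu_device *locked_adev;

                    /* Walk the list again from the head and unlock every
                     * device we managed to lock before this one. */
                    list_for_each_entry(locked_adev, &hive->device_list,
                                        gmc.xgmi.head) {
                            if (locked_adev == tmp_adev)
                                    break;
                            amdgpu_device_unlock_adev(locked_adev);
                    }
                    break;
            }
    }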
Also, maybe it's now better to separate the actual locking in
amdgpu_device_lock_adev from the other things going on there, since I
don't think you would want to toggle state like adev->mp1_state back and
forth; besides, the function name isn't descriptive of that other work
anyway.
Andrey
> + }
> + if (get_dev_lock) {
> + if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> + list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> + device_list_handle = &hive->device_list;
> + }
> } else {
> - list_add_tail(&adev->gmc.xgmi.head, &device_list);
> - device_list_handle = &device_list;
> + get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> + tmp_adev = adev;
> + if (get_dev_lock) {
> + list_add_tail(&adev->gmc.xgmi.head, &device_list);
> + device_list_handle = &device_list;
> + }
> + }
> +
> + if (!get_dev_lock) {
> + dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> + job ? job->base.id : -1);
> + r = 0;
> + /* even if we skipped this reset, we still need to mark the job as guilty */
> + goto skip_recovery;
> }
>
> /* block all schedulers and reset given job's ring */
> list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> - if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> - dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> - job ? job->base.id : -1);
> - r = 0;
> - goto skip_recovery;
> - }
> -
> /*
> * Try to put the audio codec into suspend state
> * before gpu reset started.