amdgpu: race issue when jobs on 2 ring timeout

Andrey Grodzovsky Tue, 19 Jan 2021 09:15:38 -0800

Well, it shouldn't happen with the hive locked as I am browsing the code butthen your code shouldreflect that and if you do fail to lock particular adev AFTER the hive is lockedyou should not silently breakiteration but throw an error, WARN_ON or BUG_ON then. Or alternatively bail outwith unlocking all already

locked devices.


Andrey


On 1/19/21 12:09 PM, Chen, Horace wrote:

[AMD Official Use Only - Internal Distribution Only]
OK, I understand. You mean one device in the hive may be locked upindependently without locking up the whole hive.
It could happen, I'll change my code.

Thanks & Regards,
Horace.

--------------------------------------------------------------------------------
*发件人:* Grodzovsky, Andrey <andrey.grodzov...@amd.com>
*发送时间:* 2021年1月20日 0:58
*收件人:* Chen, Horace <horace.c...@amd.com>; amd-gfx@lists.freedesktop.org<amd-gfx@lists.freedesktop.org>*抄送:* Quan, Evan <evan.q...@amd.com>; Tuikov, Luben <luben.tui...@amd.com>;Koenig, Christian <christian.koe...@amd.com>; Deucher, Alexander<alexander.deuc...@amd.com>; Xiao, Jack <jack.x...@amd.com>; Zhang, Hawking<hawking.zh...@amd.com>; Liu, Monk <monk....@amd.com>; Xu, Feifei<feifei...@amd.com>; Wang, Kevin(Yang) <kevin1.w...@amd.com>; Xiaojie Yuan<xiaojie.y...@amd.com>
*主题:* Re: 回复: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout


On 1/19/21 11:39 AM, Chen, Horace wrote:
[AMD Official Use Only - Internal Distribution Only]


Hi Andrey,
I think the list in the XGMI hive won't be break in the middle if we lock thedevice before we change the list. Because if 2 devices in 1 hive went intothe function, it will follow the same sequence to lock the devices. So one ofthem will definately break at the first device. I add iterate devices here isjust to lock all device in the hive since we will change the device sequencein the hive soon after.
I didn't mean break in a sense of breaking the list itself, I just meant theliteral 'break' instruction
to terminate the iteration once you failed to lock a particular device.
The reason to break the interation in the middle is that the list is changedduring the iteration without taking any lock. It is quite bad since I'mfixing one of this issue. And for XGMI hive, there are 2 locks protecting thelist, one is the device lock I changed here, the other one is in front of mychange, there is a hive->lock to protect the hive.
Even the bad thing really happened, I think moving back through the list isalso very dengerous since we don't know what the list finally be, Unless westack the devices we have iterated through a mirrored list. That can be a bigchange.
Not sure we are on the same page, my concern is let's sat your XGMI hiveconsists of 2 devices, you manged to call successfully doamdgpu_device_lock_adev for dev1 but then failed for dev2, in this case youwill bail out without releasing dev1, no ?
Andrey
I'm ok to seperate the locking in amdgpu_device_lock_adev here, I'll do sometest and update the code later.
Thanks & Regards,
Horace.
--------------------------------------------------------------------------------
*发件人:* Grodzovsky, Andrey <andrey.grodzov...@amd.com><mailto:andrey.grodzov...@amd.com>
*发送时间:* 2021年1月19日 22:33
*收件人:* Chen, Horace <horace.c...@amd.com> <mailto:horace.c...@amd.com>;amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org><amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org>*抄送:* Quan, Evan <evan.q...@amd.com> <mailto:evan.q...@amd.com>; Tuikov,Luben <luben.tui...@amd.com> <mailto:luben.tui...@amd.com>; Koenig, Christian<christian.koe...@amd.com> <mailto:christian.koe...@amd.com>; Deucher,Alexander <alexander.deuc...@amd.com> <mailto:alexander.deuc...@amd.com>;Xiao, Jack <jack.x...@amd.com> <mailto:jack.x...@amd.com>; Zhang, Hawking<hawking.zh...@amd.com> <mailto:hawking.zh...@amd.com>; Liu, Monk<monk....@amd.com> <mailto:monk....@amd.com>; Xu, Feifei <feifei...@amd.com><mailto:feifei...@amd.com>; Wang, Kevin(Yang) <kevin1.w...@amd.com><mailto:kevin1.w...@amd.com>; Xiaojie Yuan <xiaojie.y...@amd.com><mailto:xiaojie.y...@amd.com>
*主题:* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

On 1/19/21 7:22 AM, Horace Chen wrote:
> Fix a racing issue when jobs on 2 rings timeout simultaneously.
>
> If 2 rings timed out at the same time, the amdgpu_device_gpu_recover
> will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> by 2 local linked list, which may cause wild pointer issue in
> iterating.
>
> lock the device earily to prevent the node be added to 2 different
> lists.
>
> Signed-off-by: Horace Chen <horace.c...@amd.com> <mailto:horace.c...@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
>   1 file changed, 30 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.cb/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4d434803fb49..9574da3abc32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
*adev,
>        int i, r = 0;
>        bool need_emergency_restart = false;
>        bool audio_suspended = false;
> +     bool get_dev_lock = false;
>
>        /*
>         * Special case: RAS triggered and full reset isn't supported
> @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device*adev,
>         * Build list of devices to reset.
>         * In case we are in XGMI hive mode, resort the device list
>         * to put adev in the 1st position.
> +      *
> +      * lock the device before we try to operate the linked list
> +      * if didn't get the device lock, don't touch the linked list since
> +      * others may iterating it.
>         */
>        INIT_LIST_HEAD(&device_list);
>        if (adev->gmc.xgmi.num_physical_nodes > 1) {
>                if (!hive)
>                        return -ENODEV;
> -             if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> -             device_list_handle = &hive->device_list;
> +
> + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> +                     get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> +                     if (!get_dev_lock)
> +                             break;


What about unlocking back all the devices you already locked if the break
happens in the middle of the iteration ?
Note that at skip_recovery: we don't do it. BTW, i see this issue is already in
the current code.

Also, maybe now it's better to separate the actual locking in
amdgpu_device_lock_adev
from the other stuff going on there since I don't think you would wont to toggle
stuff
like adev->mp1_state back and forth and also the function name is not
descriptive of
the other stuff going on there anyway.

Andrey


> +             }
> +             if (get_dev_lock) {
> + if (!list_is_first(&adev->gmc.xgmi.head,&hive->device_list))
> + list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> +                     device_list_handle = &hive->device_list;
> +             }
>        } else {
> - list_add_tail(&adev->gmc.xgmi.head, &device_list);
> -             device_list_handle = &device_list;
> +             get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> +             tmp_adev = adev;
> +             if (get_dev_lock) {
> + list_add_tail(&adev->gmc.xgmi.head, &device_list);
> +                     device_list_handle = &device_list;
> +             }
> +     }
> +
> +     if (!get_dev_lock) {
> + dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, asanother already in progress",
> +                                     job ? job->base.id : -1);
> +             r = 0;
> + /* even we skipped this reset, still need to set the job toguilty */
> +             goto skip_recovery;
>        }
>
>        /* block all schedulers and reset given job's ring */
>        list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -             if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> - dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as anotheralready in progress",
> -                               job ? job->base.id : -1);
> -                     r = 0;
> -                     goto skip_recovery;
> -             }
> -
>                /*
>                 * Try to put the audio codec into suspend state
>                 * before gpu reset started.

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: 回复: 回复: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

Reply via email to