Hi Christian,
I'm not sure what do you mean by "we need to change SRIOV not the driver".

Do you mean we should change the reset sequence in SRIOV? This will be a huge 
change for our SRIOV solution.

>From my point of view, we can directly use
amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of 
try_lock since no one will conflict with this thread with reset_domain 
introduced.
But we do need the reset_sem and adev->in_gpu_reset to keep device untouched 
via user space.

Best Regards,
Jingwen Chen

On 2022/1/3 下午6:17, Christian König wrote:
> Please don't. This patch is vital to the cleanup of the reset procedure.
>
> If SRIOV doesn't work with that we need to change SRIOV and not the driver.
>
> Christian.
>
> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>> Sure, I guess i can drop this patch then.
>>
>> Andrey
>>
>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>> I do agree with shaoyun, if the host find the gpu engine hangs first, and 
>>> do the flr, guest side thread may not know this and still try to access 
>>> HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the 
>>> reset status). And this may lead to very bad result.
>>>
>>> On 2021/12/24 下午4:58, Deng, Emily wrote:
>>>> These patches look good to me. JingWen will pull these patches and do some 
>>>> basic TDR test on sriov environment, and give feedback.
>>>>
>>>> Best wishes
>>>> Emily Deng
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Liu, Monk <monk....@amd.com>
>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey
>>>>> <andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd-
>>>>> g...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen,
>>>>> JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>> Cc: dan...@ffwll.ch
>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>> protection
>>>>> for SRIOV
>>>>>
>>>>> [AMD Official Use Only]
>>>>>
>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>
>>>>> Please take a review on Andrey's patch
>>>>>
>>>>> Thanks
>>>>> -------------------------------------------------------------------
>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>> -------------------------------------------------------------------
>>>>> we are hiring software manager for CVS core team
>>>>> -------------------------------------------------------------------
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-
>>>>> de...@lists.freedesktop.org; amd-...@lists.freedesktop.org
>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace
>>>>> <horace.c...@amd.com>
>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset 
>>>>> protection
>>>>> for SRIOV
>>>>>
>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>> Since now flr work is serialized against  GPU resets there is no need
>>>>>> for this.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>    drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>    2 files changed, 22 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        struct amdgpu_device *adev = container_of(virt, struct
>>>>> amdgpu_device, virt);
>>>>>>        int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>
>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>> -     * the VF FLR.
>>>>>> -     */
>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>> -        return;
>>>>>> -
>>>>>>        amdgpu_virt_fini_data_exchange(adev);
>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>
>>>>>>        xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>
>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        } while (timeout > 1);
>>>>>>
>>>>>>    flr_done:
>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>> -    up_write(&adev->reset_sem);
>>>>>> -
>>>>>>        /* Trigger recovery for world switch failure if no TDR */
>>>>>>        if (amdgpu_device_should_recover_gpu(adev)
>>>>>>            && (!amdgpu_device_has_job_running(adev) || diff --git
>>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        struct amdgpu_device *adev = container_of(virt, struct
>>>>> amdgpu_device, virt);
>>>>>>        int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>
>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>> -     * the VF FLR.
>>>>>> -     */
>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>> -        return;
>>>>>> -
>>>>>>        amdgpu_virt_fini_data_exchange(adev);
>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>
>>>>>>        xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>
>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct
>>>>> work_struct *work)
>>>>>>        } while (timeout > 1);
>>>>>>
>>>>>>    flr_done:
>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>> -    up_write(&adev->reset_sem);
>>>>>> -
>>>>>>        /* Trigger recovery for world switch failure if no TDR */
>>>>>>        if (amdgpu_device_should_recover_gpu(adev)
>>>>>>            && (!amdgpu_device_has_job_running(adev) ||
>

Reply via email to