[AMD Official Use Only]

I see, I didn't notice you already have this implemented. So the flr_work routine itself is synced now; in that case I agree it should be safe to remove the in_gpu_reset and reset_sem handling in flr_work.
Regards,
Shaoyun.liu

-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Tuesday, January 4, 2022 3:55 PM
To: Liu, Shaoyun <shaoyun....@amd.com>; Koenig, Christian <christian.koe...@amd.com>; Liu, Monk <monk....@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Deng, Emily <emily.d...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>
Cc: dan...@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

On 2022-01-04 12:13 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only]
>
> I mostly agree with the sequence Christian described. Just one thing
> might need to be discussed here. For an FLR notified from the host, in the
> new sequence as described, the driver needs to reply READY_TO_RESET from a
> work item on the reset work queue, which means that inside flr_work the
> driver cannot reply to the host directly but needs to queue yet another
> work item.

Can you clarify why the 'driver can not directly reply to host but need to queue another workqueue'? To my understanding, all of steps 3-6 in Christian's description happen serially from the same single wq thread.

> For the current code, the flr_work for SRIOV is itself a work item queued
> from the ISR. I think we should try to respond to the host driver as soon
> as possible. Queueing another work item from inside the work item doesn't
> sound efficient to me.

Check patch 5 please [1] - I just replaced schedule_work(&adev->virt.flr_work) with queue_work(adev->reset_domain.wq, &adev->virt.flr_work), so there is no extra requeue here; instead of going to the system_wq the work is sent to the dedicated reset wq.

[1] - https://lore.kernel.org/all/20211222221400.790842-1-andrey.grodzov...@amd.com/

Andrey

> Anyway, what we need is a working solution for our project.
> So if we need
> to change the sequence, we need to make sure it's been tested first and
> won't break functionality before the code lands in the branch.
>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Tuesday, January 4, 2022 6:36 AM
> To: Liu, Monk <monk....@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>
> Cc: dan...@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of
>>>> signaling the need for a reset, similar to each job timeout on each queue.
>>>> Otherwise you have a race condition between the hypervisor and the
>>>> scheduler.
>> No it's not; the FLR from the hypervisor only notifies the guest that the
>> hw VF FLR is about to start or was already executed, and the host will do
>> the FLR anyway without waiting for the guest for too long.
>>
> Then we have a major design issue in the SRIOV protocol and really need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to the single workqueue.
> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
> 5. Clean up after the reset, eventually resubmit jobs etc.
> 6.
Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting
> for the clients for too long, but if we don't follow these general steps we
> will always have a race between the different components.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of
>>>> signaling the need for a reset, similar to each job timeout on each queue.
>>>> Otherwise you have a race condition between the hypervisor and the
>>>> scheduler.
>> No it's not; the FLR from the hypervisor only notifies the guest that the
>> hw VF FLR is about to start or was already executed, and the host will do
>> the FLR anyway without waiting for the guest for too long.
>>
>>>> In other words I strongly think that the current SRIOV reset
>>>> implementation is severely broken and what Andrey is doing is actually
>>>> fixing it.
>> It makes the code crash ... how could it be a fix?
>>
>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it
>> does not ruin the logic; Andrey or Jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <christian.koe...@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk <monk....@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>
>> Cc: dan...@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I doubt it, at least for now.
>>
>> See, the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but it should happen at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>> Hi Christian,
>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This would be a huge change for our SRIOV solution.
>>>
>>> From my point of view, we can directly use amdgpu_device_lock_adev and
>>> amdgpu_device_unlock_adev in flr_work instead of try_lock, since with
>>> reset_domain introduced no one will conflict with this thread.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device
>>> untouched by user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 6:17 p.m., Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that, we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>> Sure, I guess I can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hangs first
>>>>>> and does the FLR, the guest-side thread may not know this and still
>>>>>> try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem
>>>>>> extensively to identify the reset status), and this may lead to a
>>>>>> very bad result.
>>>>>>
>>>>>> On 2021/12/24 4:58 p.m., Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches, do
>>>>>>> some basic TDR tests in the SRIOV environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please review Andrey's patch.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>> -------------------------------------------------------------------
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>> Since flr work is now serialized against GPU resets there is
>>>>>>>>> no need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>   2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -	 * the VF FLR.
>>>>>>>>> -	 */
>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -		return;
>>>>>>>>> -
>>>>>>>>>   	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	} while (timeout > 1);
>>>>>>>>>
>>>>>>>>>   flr_done:
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>   	    && (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -	 * the VF FLR.
>>>>>>>>> -	 */
>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -		return;
>>>>>>>>> -
>>>>>>>>>   	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	} while (timeout > 1);
>>>>>>>>>
>>>>>>>>>   flr_done:
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>   	    && (!amdgpu_device_has_job_running(adev) ||