On 2022/1/7 11:57 AM, JingWen Chen wrote:
> On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>
>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>
>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>
>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
>>>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>>>> 5. Clean up after the reset, eventually resubmit jobs etc.
>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>
>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long', and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>>
>>>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for it to reset. All the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait. The users of the other VFs will be affected in this case.
>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
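To make the serialization in the numbered steps above concrete, here is a minimal standalone C model (not amdgpu code; the struct and function names are invented for illustration): every reset source queues a work item on one ordered queue, the first item to run performs the reset, and later items from other sources notice that the reset already happened and become no-ops, which is what step 6 amounts to.

    /* Standalone model of single-queue reset serialization; not amdgpu code. */
    #include <pthread.h>
    #include <stdio.h>

    struct reset_domain {
        pthread_mutex_t lock;        /* stands in for the single ordered workqueue */
        unsigned long   reset_count; /* how many resets have completed so far */
    };

    struct reset_work {
        struct reset_domain *domain;
        unsigned long        seen;   /* domain->reset_count when the source fired */
        const char          *source;
    };

    /* Executed by the (single-threaded) reset queue, one item at a time. */
    static void do_reset_work(struct reset_work *w)
    {
        pthread_mutex_lock(&w->domain->lock);
        if (w->domain->reset_count != w->seen) {
            /* Step 6: another source already handled this hang. */
            printf("%s: reset already done, dropping work item\n", w->source);
        } else {
            /* Steps 3-5: prepare, reset (or tell the hypervisor we are
             * ready), clean up and resubmit jobs. */
            printf("%s: performing the reset\n", w->source);
            w->domain->reset_count++;
        }
        pthread_mutex_unlock(&w->domain->lock);
    }

    int main(void)
    {
        struct reset_domain dom = { .lock = PTHREAD_MUTEX_INITIALIZER, .reset_count = 0 };

        /* Two sources detect the same hang at roughly the same time (steps 1/2). */
        struct reset_work timeout = { &dom, 0, "job timeout" };
        struct reset_work flr     = { &dom, 0, "hypervisor FLR" };

        do_reset_work(&timeout); /* runs first, does the reset */
        do_reset_work(&flr);     /* runs second, turns into a no-op */
        return 0;
    }

The mutex only models the ordered queue; the important part is the shared reset counter that lets a late work item recognize an already-handled hang.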
>>>>>>
>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>>
>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>
>>>>>> Christian.
>>>>> So what's the end conclusion here regarding dropping this particular patch? Seems to me we still need to drop it to prevent the driver's MMIO access to the GPU during reset from various places in the code.
>>>>>
>>>>> Andrey
>>>>>
>>>> Hi Andrey & Christian,
>>>>
>>>> I have ported your patch (drop the reset_sem and in_gpu_reset in flr work) and ran some tests. If an engine hangs during an OCL benchmark (using KFD), we can see the logs below:
>>
>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV'?
>>
>>
> I ported the entire patchset
>>>> [ 397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 428.400582] [drm] clean up the vf2pf work item
>>>> [ 428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>> [ 428.527576] amdgpu 0000:00:07.0: amdgpu: in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>> [ 437.531392] amdgpu: qcm fence wait loop timeout expired
>>>> [ 437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [ 437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>> [ 438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>
>>>> As KFD relies on these to check whether the GPU is in reset, dropping them will hit page faults and fence errors very easily.
>>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset, just not as easily as when dropping them.
>> Are you saying that the entire patch set, with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV', is causing this GPUVM page fault during the engine-hang test while running the benchmark?
>>
>> Do you never observe this page fault when running this test with the original tree, without the new patch set?
>>
>> Andrey
>>
> I think this page fault issue can be seen even on the original tree. It's just that dropping the concurrent GPU reset protection will hit it more easily.
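The handshake Christian describes earlier in this message (the hypervisor announces the reset, gives the guest a bounded time to quiesce and acknowledge, then resets regardless) can be sketched as a timed wait. This is only a userspace model under that assumption: the thread roles and the 500 ms grace period are made up, and the real mailbox message (IDH_READY_TO_RESET) appears only in a comment.

    /* Standalone sketch of a bounded host/guest reset handshake; not amdgpu
     * or hypervisor code. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool guest_ready;              /* set once the guest has quiesced */

    /* Guest side: stop processing, then acknowledge (possibly late or never). */
    static void *guest_thread(void *arg)
    {
        struct timespec work = { 0, 100 * 1000 * 1000 }; /* pretend to stop schedulers */
        nanosleep(&work, NULL);
        pthread_mutex_lock(&lock);
        guest_ready = true;               /* analogous to sending IDH_READY_TO_RESET */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t guest;
        struct timespec deadline;

        pthread_create(&guest, NULL, guest_thread, NULL);

        /* Host side: give the guest up to 500 ms, then reset regardless. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 500 * 1000 * 1000;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&lock);
        while (!guest_ready) {
            if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0)
                break;                    /* timed out: no more grace for the guest */
        }
        printf("host: doing VF FLR (guest %s)\n",
               guest_ready ? "acknowledged" : "did not respond in time");
        pthread_mutex_unlock(&lock);

        pthread_join(guest, NULL);
        return 0;
    }

The point of the model is that the host's decision to reset never depends on the guest actually answering, which matches the concern above about one hung VF stalling all the others.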
>
> We may need a new way to protect the reset in SRIOV.
>
Hi Andrey
Actually, I would like to propose an RFC based on your patch, which will move the waiting logic in the SRIOV flr work into amdgpu_device_gpu_recover_imp: the host will wait a certain time until the pre-reset work is done and the guest has sent back its response, and then actually do the VF FLR. Hopefully this will help solve the page fault issue.

JingWen
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>
>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>>>>
>>>>>>>>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic. Andrey or Jingwen can try it if needed.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk <monk....@amd.com>; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>
>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>
>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>
>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so, at least for now.
>>>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>
>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
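A rough sketch of the ordering the RFC above suggests: the FLR notification no longer takes reset_sem or waits by itself, it only feeds the common recovery path, and the guest acknowledges only after its pre-reset work is done, so the host's bounded wait covers the whole guest-side preparation. Every function below is an invented printf stub; the real amdgpu entry points mentioned in the thread (xgpu_nv_mailbox_flr_work, amdgpu_device_gpu_recover_imp, amdgpu_virt_fini_data_exchange) appear only in the comments.

    /* Userspace-only sketch of the proposed SRIOV recovery ordering;
     * not amdgpu code. */
    #include <stdio.h>

    static void stop_schedulers(void)       { printf("guest: stop schedulers, block new jobs\n"); }
    static void fini_data_exchange(void)    { printf("guest: stop vf2pf data exchange\n"); }
    static void send_ready_to_reset(void)   { printf("guest: mailbox -> READY_TO_RESET\n"); }
    static void wait_for_flr_complete(void) { printf("guest: poll mailbox for the FLR COMPLETE message\n"); }
    static void reinit_and_resubmit(void)   { printf("guest: re-init IP blocks, resubmit jobs\n"); }

    /* All pre-reset work and the acknowledgement live in the common recovery
     * path (the role amdgpu_device_gpu_recover_imp would play), so the host's
     * bounded wait covers the whole guest-side preparation. */
    static void gpu_recover(const char *source)
    {
        printf("recovery triggered by %s\n", source);
        stop_schedulers();
        fini_data_exchange();     /* what amdgpu_virt_fini_data_exchange does today */
        send_ready_to_reset();    /* host may now start the VF FLR */
        wait_for_flr_complete();
        reinit_and_resubmit();
    }

    /* The FLR notification handler (the role of xgpu_nv_mailbox_flr_work) no
     * longer grabs reset_sem or waits itself; it only feeds the common
     * recovery path, like any other reset source. */
    static void flr_notify_handler(void)
    {
        gpu_recover("hypervisor FLR notification");
    }

    int main(void)
    {
        flr_notify_handler();
        return 0;
    }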
>>>>>>>>>>
>>>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>>>>> Hi Christian,
>>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV, not the driver".
>>>>>>>>>>>
>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>>
>>>>>>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread with reset_domain introduced. But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched via user space.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>
>>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>
>>>>>>>>>>>> If SRIOV doesn't work with that, we need to change SRIOV and not the driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>>>>> Sure, I guess I can drop this patch then.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and may still try to access HW (e.g. KFD is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to a very bad result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd-g...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-de...@lists.freedesktop.org; amd-...@lists.freedesktop.org
>>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>
>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>>>> Since now flr work is serialized against GPU resets there is no need for this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>   2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>  	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -	 * the VF FLR.
>>>>>>>>>>>>>>>>> -	 */
>>>>>>>>>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -		return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	} while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  flr_done:
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>>>>>  	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>  		&& (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>  	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -	 * the VF FLR.
>>>>>>>>>>>>>>>>> -	 */
>>>>>>>>>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -		return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	} while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  flr_done:
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>>>>>  	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>  		&& (!amdgpu_device_has_job_running(adev) ||