On 2022/1/7 11:57 AM, JingWen Chen wrote:
> On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
>> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>>> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>
>>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>>
>>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>>
>>>>>>>>> As far as I can see the procedure should be:
>>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
>>>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>>>> 5. Clean up after the reset, eventually resubmit jobs etc.
>>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>>
>>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long', and there is no strict waiting from the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>>
>>>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for it to reset. All the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also wait. The users of the other VFs will be affected in this case.
>>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
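To make the serialization in the numbered steps above concrete, here is a minimal standalone C model (not amdgpu code; the struct and function names are invented for illustration): every reset source queues a work item on one ordered queue, the first item to run performs the reset, and later items from other sources notice that the reset already happened and become no-ops, which is what step 6 amounts to.

    /* Standalone model of single-queue reset serialization; not amdgpu code. */
    #include <pthread.h>
    #include <stdio.h>

    struct reset_domain {
        pthread_mutex_t lock;        /* stands in for the single ordered workqueue */
        unsigned long   reset_count; /* how many resets have completed so far */
    };

    struct reset_work {
        struct reset_domain *domain;
        unsigned long        seen;   /* domain->reset_count when the source fired */
        const char          *source;
    };

    /* Executed by the (single-threaded) reset queue, one item at a time. */
    static void do_reset_work(struct reset_work *w)
    {
        pthread_mutex_lock(&w->domain->lock);
        if (w->domain->reset_count != w->seen) {
            /* Step 6: another source already handled this hang. */
            printf("%s: reset already done, dropping work item\n", w->source);
        } else {
            /* Steps 3-5: prepare, reset (or tell the hypervisor we are
             * ready), clean up and resubmit jobs. */
            printf("%s: performing the reset\n", w->source);
            w->domain->reset_count++;
        }
        pthread_mutex_unlock(&w->domain->lock);
    }

    int main(void)
    {
        struct reset_domain dom = { .lock = PTHREAD_MUTEX_INITIALIZER, .reset_count = 0 };

        /* Two sources detect the same hang at roughly the same time (steps 1/2). */
        struct reset_work timeout = { &dom, 0, "job timeout" };
        struct reset_work flr     = { &dom, 0, "hypervisor FLR" };

        do_reset_work(&timeout); /* runs first, does the reset */
        do_reset_work(&flr);     /* runs second, turns into a no-op */
        return 0;
    }

The mutex only models the ordered queue; the important part is the shared reset counter that lets a late work item recognize an already-handled hang.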
>>>>>>
>>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>>
>>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>>
>>>>>> Christian.
>>>>> So what's the end conclusion here regarding dropping this particular patch? Seems to me we still need to drop it to prevent the driver's MMIO access to the GPU during reset from various places in the code.
>>>>>
>>>>> Andrey
>>>>>
>>>> Hi Andrey & Christian,
>>>>
>>>> I have ported your patch (drop the reset_sem and in_gpu_reset in flr work) and ran some tests. If an engine hangs during an OCL benchmark (using KFD), we can see the logs below:
>>
>> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV'?
>>
>>
> I ported the entire patchset
>>>> [ 397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>>> [ 428.400582] [drm] clean up the vf2pf work item
>>>> [ 428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>>> [ 428.527576] amdgpu 0000:00:07.0: amdgpu: in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>>> [ 437.531392] amdgpu: qcm fence wait loop timeout expired
>>>> [ 437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>>> [ 437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>>> [ 438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>>
>>>> As KFD relies on these to check whether the GPU is in reset, dropping them will hit page faults and fence errors very easily.
>>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset, just not as easily as when dropping them.
>> Are you saying that the entire patch set, with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV', is causing this GPUVM page fault during the engine-hang test while running the benchmark?
>>
>> Do you never observe this page fault when running this test with the original tree, without the new patch set?
>>
>> Andrey
>>
> I think this page fault issue can be seen even on the original tree. It's just that dropping the concurrent GPU reset protection will hit it more easily.
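The handshake Christian describes earlier in this message (the hypervisor announces the reset, gives the guest a bounded time to quiesce and acknowledge, then resets regardless) can be sketched as a timed wait. This is only a userspace model under that assumption: the thread roles and the 500 ms grace period are made up, and the real mailbox message (IDH_READY_TO_RESET) appears only in a comment.

    /* Standalone sketch of a bounded host/guest reset handshake; not amdgpu
     * or hypervisor code. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool guest_ready;              /* set once the guest has quiesced */

    /* Guest side: stop processing, then acknowledge (possibly late or never). */
    static void *guest_thread(void *arg)
    {
        struct timespec work = { 0, 100 * 1000 * 1000 }; /* pretend to stop schedulers */
        nanosleep(&work, NULL);
        pthread_mutex_lock(&lock);
        guest_ready = true;               /* analogous to sending IDH_READY_TO_RESET */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t guest;
        struct timespec deadline;

        pthread_create(&guest, NULL, guest_thread, NULL);

        /* Host side: give the guest up to 500 ms, then reset regardless. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 500 * 1000 * 1000;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&lock);
        while (!guest_ready) {
            if (pthread_cond_timedwait(&cond, &lock, &deadline) != 0)
                break;                    /* timed out: no more grace for the guest */
        }
        printf("host: doing VF FLR (guest %s)\n",
               guest_ready ? "acknowledged" : "did not respond in time");
        pthread_mutex_unlock(&lock);

        pthread_join(guest, NULL);
        return 0;
    }

The point of the model is that the host's decision to reset never depends on the guest actually answering, which matches the concern above about one hung VF stalling all the others.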
>
> We may need a new way to protect the reset in SRIOV.
>
Hi Andrey
Actually, I would like to propose an RFC based on your patch, which will move the waiting logic in the SRIOV flr work into amdgpu_device_gpu_recover_imp: the host will wait a certain time until the pre-reset work is done and the guest has sent back its response, and then actually do the VF FLR. Hopefully this will help solve the page fault issue.

JingWen
>>>>>>>> Andrey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>
>>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>> No it's not, FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long
>>>>>>>>>>
>>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>> It makes the code crash ... how could it be a fix?
>>>>>>>>>>
>>>>>>>>>> I'm afraid the patch is a NAK from me, but it is welcome if the cleanup does not ruin the logic. Andrey or Jingwen can try it if needed.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk <monk....@amd.com>; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>
>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>
>>>>>>>>>> Hi Jingwen,
>>>>>>>>>>
>>>>>>>>>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>>
>>>>>>>>>> Could be that the reset sequence is questionable in general, but I doubt so, at least for now.
>>>>>>>>>>
>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>>
>>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but should happen at a central place and not in the SRIOV specific code.
>>>>>>>>>>
>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
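A rough sketch of the ordering the RFC above suggests: the FLR notification no longer takes reset_sem or waits by itself, it only feeds the common recovery path, and the guest acknowledges only after its pre-reset work is done, so the host's bounded wait covers the whole guest-side preparation. Every function below is an invented printf stub; the real amdgpu entry points mentioned in the thread (xgpu_nv_mailbox_flr_work, amdgpu_device_gpu_recover_imp, amdgpu_virt_fini_data_exchange) appear only in the comments.

    /* Userspace-only sketch of the proposed SRIOV recovery ordering;
     * not amdgpu code. */
    #include <stdio.h>

    static void stop_schedulers(void)       { printf("guest: stop schedulers, block new jobs\n"); }
    static void fini_data_exchange(void)    { printf("guest: stop vf2pf data exchange\n"); }
    static void send_ready_to_reset(void)   { printf("guest: mailbox -> READY_TO_RESET\n"); }
    static void wait_for_flr_complete(void) { printf("guest: poll mailbox for the FLR COMPLETE message\n"); }
    static void reinit_and_resubmit(void)   { printf("guest: re-init IP blocks, resubmit jobs\n"); }

    /* All pre-reset work and the acknowledgement live in the common recovery
     * path (the role amdgpu_device_gpu_recover_imp would play), so the host's
     * bounded wait covers the whole guest-side preparation. */
    static void gpu_recover(const char *source)
    {
        printf("recovery triggered by %s\n", source);
        stop_schedulers();
        fini_data_exchange();     /* what amdgpu_virt_fini_data_exchange does today */
        send_ready_to_reset();    /* host may now start the VF FLR */
        wait_for_flr_complete();
        reinit_and_resubmit();
    }

    /* The FLR notification handler (the role of xgpu_nv_mailbox_flr_work) no
     * longer grabs reset_sem or waits itself; it only feeds the common
     * recovery path, like any other reset source. */
    static void flr_notify_handler(void)
    {
        gpu_recover("hypervisor FLR notification");
    }

    int main(void)
    {
        flr_notify_handler();
        return 0;
    }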
>>>>>>>>>>
>>>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>>>>> Hi Christian,
>>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV, not the driver".
>>>>>>>>>>>
>>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution.
>>>>>>>>>>>
>>>>>>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread with reset_domain introduced. But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched via user space.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Jingwen Chen
>>>>>>>>>>>
>>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>>
>>>>>>>>>>>> If SRIOV doesn't work with that, we need to change SRIOV and not the driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>>>>> Sure, I guess I can drop this patch then.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrey
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and may still try to access HW (e.g. KFD is using a lot of amdgpu_in_reset and reset_sem to identify the reset status). And this may lead to a very bad result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in an SRIOV environment, and give feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd-g...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please take a review on Andrey's patch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-de...@lists.freedesktop.org; amd-...@lists.freedesktop.org
>>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>
>>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>>>> Since now flr work is serialized against GPU resets there is no need for this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>>   2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>  	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -	 * the VF FLR.
>>>>>>>>>>>>>>>>> -	 */
>>>>>>>>>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -		return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	} while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  flr_done:
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>>>>>  	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>  		&& (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>>  	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>>> -	 * the VF FLR.
>>>>>>>>>>>>>>>>> -	 */
>>>>>>>>>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>>> -		return;
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>>  	} while (timeout > 1);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  flr_done:
>>>>>>>>>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>  	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>>>>>  	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>>  		&& (!amdgpu_device_has_job_running(adev) ||