On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote:
>
> On 2022-01-06 12:18 a.m., JingWen Chen wrote:
>> On 2022/1/6 12:59 PM, JingWen Chen wrote:
>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote:
>>>> On 2022-01-05 2:59 a.m., Christian König wrote:
>>>>> On 05.01.22 at 08:34, JingWen Chen wrote:
>>>>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote:
>>>>>>> On 2022-01-04 6:36 a.m., Christian König wrote:
>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>> No it's not. The FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed; the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>>>
>>>>>>>> Then we have a major design issue in the SRIOV protocol and really need to question this.
>>>>>>>>
>>>>>>>> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>>>>>>>>
>>>>>>>> As far as I can see the procedure should be:
>>>>>>>> 1. We detect that a reset is necessary, either because of a fault, a timeout, or a signal from the hypervisor.
>>>>>>>> 2. For each of those potential reset sources a work item is sent to the single workqueue.
>>>>>>>> 3. One of those work items executes first and prepares the reset.
>>>>>>>> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
>>>>>>>> 5. Clean up after the reset, eventually resubmit jobs etc.
>>>>>>>> 6. Cancel work items which might have been scheduled from other reset sources.
>>>>>>>>
>>>>>>>> It does make sense that the hypervisor resets the hardware without waiting for the clients for too long, but if we don't follow these general steps we will always have a race between the different components.
>>>>>>> Monk, just to add to this - if indeed, as you say, 'FLR from hypervisor is just to notify guest the hw VF FLR is about to start or was already executed, but host will do FLR anyway without waiting for guest too long', and there is no strict waiting by the hypervisor for IDH_READY_TO_RESET to be received from the guest before starting the reset, then setting in_gpu_reset and locking reset_sem from the guest side is not really foolproof protection from MMIO accesses by the guest - it only truly helps if the hypervisor waits for that message before initiating the HW reset.
>>>>>>>
>>>>>> Hi Andrey, this cannot be done. If somehow the guest kernel hangs and never has the chance to send the response back, then the other VFs will have to wait for its reset, and all the VFs will hang in this case. Or sometimes the mailbox has some delay and the other VFs will also have to wait. The users of the other VFs would be affected in this case.
>>>>> Yeah, agree completely with JingWen. The hypervisor is the one in charge here, not the guest.
>>>>>
>>>>> What the hypervisor should do (and it already seems to be designed that way) is to send the guest a message that a reset is about to happen and give it some time to respond appropriately.
>>>>>
>>>>> The guest on the other hand then tells the hypervisor that all processing has stopped and it is ready to restart. If that doesn't happen in time, the hypervisor should eliminate the guest and probably trigger even more severe consequences, e.g. restart the whole VM etc...
>>>>>
>>>>> Christian.
>>>> So what's the end conclusion here regarding dropping this particular patch?
>>>> Seems to me we still need to drop it to prevent the driver's MMIO accesses to the GPU during reset from various places in the code.
>>>>
>>>> Andrey
>>>>
>>> Hi Andrey & Christian,
>>>
>>> I have ported your patch (drop the reset_sem and in_gpu_reset in flr work) and run some tests. If an engine hangs during an OCL benchmark (using kfd), we can see the logs below:
>
> Did you port the entire patchset or just 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV'?
>
I ported the entire patchset.
>>>
>>> [ 397.190727] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.301496] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.406601] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.532343] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.642251] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.746634] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.850761] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 397.960544] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 398.065218] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 398.182173] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 398.288264] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 398.394712] amdgpu 0000:00:07.0: amdgpu: wait for kiq fence error: 0.
>>> [ 428.400582] [drm] clean up the vf2pf work item
>>> [ 428.500528] amdgpu 0000:00:07.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32771, for process xgemmStandalone pid 3557 thread xgemmStandalone pid 3557)
>>> [ 428.527576] amdgpu 0000:00:07.0: amdgpu: in page starting at address 0x00007fc991c04000 from client 0x1b (UTCL2)
>>> [ 437.531392] amdgpu: qcm fence wait loop timeout expired
>>> [ 437.535738] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
>>> [ 437.537191] amdgpu 0000:00:07.0: amdgpu: GPU reset begin!
>>> [ 438.087443] [drm] RE-INIT-early: nv_common succeeded
>>>
>>> As kfd relies on these to check whether the GPU is in reset, dropping them makes it very easy to hit page faults and fence errors.
>> To be clear, we can also hit the page fault with the reset_sem and in_gpu_reset in place, just not as easily as when dropping them.
>
> Are you saying that the entire patch-set, with and without the patch 'drm/amd/virt: Drop concurrent GPU reset protection for SRIOV', is causing this GPUVM page fault when testing an engine hang while running the benchmark?
>
> Do you ever observe this page fault when running this test with the original tree, without the new patch-set?
>
> Andrey
>
I think this page fault issue can be seen even on the original tree; it's just that dropping the concurrent GPU reset protection makes it easier to hit.
We may need a new way to protect the reset in SRIOV.
>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 04.01.22 at 11:49, Liu, Monk wrote:
>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>
>>>>>>>>>>> See the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>> No it's not. The FLR from the hypervisor is just to notify the guest that the HW VF FLR is about to start or was already executed; the host will do the FLR anyway without waiting too long for the guest.
>>>>>>>>>
>>>>>>>>>>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>> It makes the code crash ... how could that be a fix?
>>>>>>>>>
>>>>>>>>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it does not ruin the logic; Andrey or Jingwen can try it if needed.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>> Sent: Tuesday, January 4, 2022 6:19 PM
>>>>>>>>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk <monk....@amd.com>; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>
>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>
>>>>>>>>> Hi Jingwen,
>>>>>>>>>
>>>>>>>>> well, what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>>>>>>>>
>>>>>>>>> It could be that the reset sequence is questionable in general, but I doubt it, at least for now.
>>>>>>>>>
>>>>>>>>> See, the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>>>>>>>>
>>>>>>>>> Properly setting in_gpu_reset is indeed mandatory, but it should happen at a central place and not in the SRIOV-specific code.
>>>>>>>>>
>>>>>>>>> In other words, I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>> On 04.01.22 at 10:07, JingWen Chen wrote:
>>>>>>>>>> Hi Christian,
>>>>>>>>>> I'm not sure what you mean by "we need to change SRIOV, not the driver".
>>>>>>>>>>
>>>>>>>>>> Do you mean we should change the reset sequence in SRIOV? This would be a huge change for our SRIOV solution.
>>>>>>>>>>
>>>>>>>>>> From my point of view, we can directly use amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread with reset_domain introduced. But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space.
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Jingwen Chen
>>>>>>>>>>
>>>>>>>>>> On 2022/1/3 6:17 PM, Christian König wrote:
>>>>>>>>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>>>>>>>>
>>>>>>>>>>> If SRIOV doesn't work with that, we need to change SRIOV and not the driver.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>> On 30.12.21 at 19:45, Andrey Grodzovsky wrote:
>>>>>>>>>>>> Sure, I guess I can drop this patch then.
>>>>>>>>>>>>
>>>>>>>>>>>> Andrey
>>>>>>>>>>>>
>>>>>>>>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>>>>>>>>> I do agree with shaoyun: if the host finds the GPU engine hang first and does the FLR, the guest-side thread may not know this and still try to access HW (e.g. kfd uses amdgpu_in_reset and reset_sem a lot to identify the reset status). And this may lead to very bad results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021/12/24 4:58 PM, Deng, Emily wrote:
>>>>>>>>>>>>>> These patches look good to me. JingWen will pull these patches, do some basic TDR tests in the SRIOV environment, and give feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>> Emily Deng
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AMD Official Use Only]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please review Andrey's patch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>>>>>>>>> -------------------------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org
>>>>>>>>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>
>>>>>>>>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset
>>>>>>>>>>>>>>> protection for SRIOV
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 22.12.21 at 23:14, Andrey Grodzovsky wrote:
>>>>>>>>>>>>>>>> Since now flr work is serialized against GPU resets there is no need for this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>>>>>>>>   2 files changed, 22 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>       struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>       int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>       amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>       } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>       /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>>>>       if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>           && (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>       struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>>>>>>>>       int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -    /* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>>>>>>>>> -     * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>>>>>>>>> -     * the VF FLR.
>>>>>>>>>>>>>>>> -     */
>>>>>>>>>>>>>>>> -    if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>>>>>>>>> -        return;
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>       amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>>>>>>>>       } while (timeout > 1);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> flr_done:
>>>>>>>>>>>>>>>> -    atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>>>>>>>>> -    up_write(&adev->reset_sem);
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>       /* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>>>>>>>>       if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>>>>>>>>           && (!amdgpu_device_has_job_running(adev) ||