[AMD Official Use Only]

I see, I didn't notice you already have this implemented. So the flr_work routine itself is synced now; in that case I agree it should be safe to remove the in_gpu_reset and reset_sem handling in flr_work.
Regards,
Shaoyun.liu

-----Original Message-----
From: Grodzovsky, Andrey <andrey.grodzov...@amd.com>
Sent: Tuesday, January 4, 2022 3:55 PM
To: Liu, Shaoyun <shaoyun....@amd.com>; Koenig, Christian <christian.koe...@amd.com>; Liu, Monk <monk....@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Deng, Emily <emily.d...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>
Cc: dan...@ffwll.ch
Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

On 2022-01-04 12:13 p.m., Liu, Shaoyun wrote:
> [AMD Official Use Only]
>
> I mostly agree with the sequence Christian described. Just one thing
> might need to be discussed here. For an FLR notified from the host, in the
> new sequence as described, the driver needs to reply READY_TO_RESET from a
> work item on the reset work queue, which means that inside flr_work the
> driver cannot reply to the host directly but needs to queue yet another
> work item.

Can you clarify why the 'driver can not directly reply to host but need to queue another workqueue'? To my understanding, all of steps 3-6 in Christian's description happen serially from the same single wq thread.

> For the current code, the flr_work for SRIOV is itself a work item queued
> from the ISR. I think we should try to respond to the host driver as soon
> as possible. Queueing another work item from inside the work item doesn't
> sound efficient to me.

Check patch 5 please [1] - I just replaced schedule_work(&adev->virt.flr_work) with queue_work(adev->reset_domain.wq, &adev->virt.flr_work), so there is no extra requeue here; instead of going to the system_wq the work is sent to the dedicated reset wq.

[1] - https://lore.kernel.org/all/20211222221400.790842-1-andrey.grodzov...@amd.com/

Andrey

> Anyway, what we need is a working solution for our project.
> So if we need
> to change the sequence, we need to make sure it's been tested first and
> won't break functionality before the code lands in the branch.
>
> Regards
> Shaoyun.liu
>
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Tuesday, January 4, 2022 6:36 AM
> To: Liu, Monk <monk....@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>
> Cc: dan...@ffwll.ch
> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of
>>>> signaling the need for a reset, similar to each job timeout on each queue.
>>>> Otherwise you have a race condition between the hypervisor and the
>>>> scheduler.
>> No it's not; the FLR from the hypervisor only notifies the guest that the
>> hw VF FLR is about to start or was already executed, and the host will do
>> the FLR anyway without waiting for the guest for too long.
>>
> Then we have a major design issue in the SRIOV protocol and really need to question this.
>
> How do you want to prevent a race between the hypervisor resetting the hardware and the client trying the same because of a timeout?
>
> As far as I can see the procedure should be:
> 1. We detect that a reset is necessary, either because of a fault, a timeout or a signal from the hypervisor.
> 2. For each of those potential reset sources a work item is sent to the single workqueue.
> 3. One of those work items executes first and prepares the reset.
> 4. We either do the reset ourselves or notify the hypervisor that we are ready for the reset.
> 5. Clean up after the reset, eventually resubmit jobs etc.
> 6.
Cancel work items which might have been scheduled from other reset sources.
>
> It does make sense that the hypervisor resets the hardware without waiting
> for the clients for too long, but if we don't follow these general steps we
> will always have a race between the different components.
>
> Regards,
> Christian.
>
> Am 04.01.22 um 11:49 schrieb Liu, Monk:
>> [AMD Official Use Only]
>>
>>>> See the FLR request from the hypervisor is just another source of
>>>> signaling the need for a reset, similar to each job timeout on each queue.
>>>> Otherwise you have a race condition between the hypervisor and the
>>>> scheduler.
>> No it's not; the FLR from the hypervisor only notifies the guest that the
>> hw VF FLR is about to start or was already executed, and the host will do
>> the FLR anyway without waiting for the guest for too long.
>>
>>>> In other words I strongly think that the current SRIOV reset
>>>> implementation is severely broken and what Andrey is doing is actually
>>>> fixing it.
>> It makes the code crash ... how could it be a fix?
>>
>> I'm afraid the patch is a NAK from me, but the cleanup is welcome if it
>> does not ruin the logic; Andrey or Jingwen can try it if needed.
>>
>> Thanks
>> -------------------------------------------------------------------
>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>> -------------------------------------------------------------------
>> we are hiring software manager for CVS core team
>> -------------------------------------------------------------------
>>
>> -----Original Message-----
>> From: Koenig, Christian <christian.koe...@amd.com>
>> Sent: Tuesday, January 4, 2022 6:19 PM
>> To: Chen, JingWen <jingwen.ch...@amd.com>; Christian König <ckoenig.leichtzumer...@gmail.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Deng, Emily <emily.d...@amd.com>; Liu, Monk <monk....@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>
>> Cc: dan...@ffwll.ch
>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>
>> Hi Jingwen,
>>
>> well what I mean is that we need to adjust the implementation in amdgpu to actually match the requirements.
>>
>> Could be that the reset sequence is questionable in general, but I doubt it, at least for now.
>>
>> See, the FLR request from the hypervisor is just another source of signaling the need for a reset, similar to each job timeout on each queue. Otherwise you have a race condition between the hypervisor and the scheduler.
>>
>> Properly setting in_gpu_reset is indeed mandatory, but it should happen at a central place and not in the SRIOV specific code.
>>
>> In other words I strongly think that the current SRIOV reset implementation is severely broken and what Andrey is doing is actually fixing it.
>>
>> Regards,
>> Christian.
>>
>> Am 04.01.22 um 10:07 schrieb JingWen Chen:
>>> Hi Christian,
>>> I'm not sure what you mean by "we need to change SRIOV not the driver".
>>>
>>> Do you mean we should change the reset sequence in SRIOV? This would be a huge change for our SRIOV solution.
>>>
>>> From my point of view, we can directly use amdgpu_device_lock_adev and
>>> amdgpu_device_unlock_adev in flr_work instead of try_lock, since with
>>> reset_domain introduced no one will conflict with this thread.
>>> But we do need the reset_sem and adev->in_gpu_reset to keep the device
>>> untouched by user space.
>>>
>>> Best Regards,
>>> Jingwen Chen
>>>
>>> On 2022/1/3 6:17 p.m., Christian König wrote:
>>>> Please don't. This patch is vital to the cleanup of the reset procedure.
>>>>
>>>> If SRIOV doesn't work with that, we need to change SRIOV and not the driver.
>>>>
>>>> Christian.
>>>>
>>>> Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky:
>>>>> Sure, I guess I can drop this patch then.
>>>>>
>>>>> Andrey
>>>>>
>>>>> On 2021-12-24 4:57 a.m., JingWen Chen wrote:
>>>>>> I do agree with Shaoyun: if the host finds the GPU engine hangs first
>>>>>> and does the FLR, the guest-side thread may not know this and still
>>>>>> try to access the HW (e.g. KFD uses amdgpu_in_reset and reset_sem
>>>>>> extensively to identify the reset status), and this may lead to a
>>>>>> very bad result.
>>>>>>
>>>>>> On 2021/12/24 4:58 p.m., Deng, Emily wrote:
>>>>>>> These patches look good to me. JingWen will pull these patches, do
>>>>>>> some basic TDR tests in the SRIOV environment, and give feedback.
>>>>>>>
>>>>>>> Best wishes
>>>>>>> Emily Deng
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Liu, Monk <monk....@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 6:14 PM
>>>>>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com>
>>>>>>>> Cc: dan...@ffwll.ch
>>>>>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> [AMD Official Use Only]
>>>>>>>>
>>>>>>>> @Chen, Horace @Chen, JingWen @Deng, Emily
>>>>>>>>
>>>>>>>> Please review Andrey's patch.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD
>>>>>>>> -------------------------------------------------------------------
>>>>>>>> we are hiring software manager for CVS core team
>>>>>>>> -------------------------------------------------------------------
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <christian.koe...@amd.com>
>>>>>>>> Sent: Thursday, December 23, 2021 4:42 PM
>>>>>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>
>>>>>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV
>>>>>>>>
>>>>>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky:
>>>>>>>>> Since flr work is now serialized against GPU resets there is
>>>>>>>>> no need for this.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>>> Acked-by: Christian König <christian.koe...@amd.com>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
>>>>>>>>>   2 files changed, 22 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> index 487cd654b69e..7d59a66e3988 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>>>>>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -	 * the VF FLR.
>>>>>>>>> -	 */
>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -		return;
>>>>>>>>> -
>>>>>>>>>   	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>   	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	} while (timeout > 1);
>>>>>>>>>
>>>>>>>>>   flr_done:
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>   	    && (!amdgpu_device_has_job_running(adev) ||
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> index e3869067a31d..f82c066c8e8d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>>>>>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>>>>>>>>>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>>>>>>>>>
>>>>>>>>> -	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
>>>>>>>>> -	 * otherwise the mailbox msg will be ruined/reseted by
>>>>>>>>> -	 * the VF FLR.
>>>>>>>>> -	 */
>>>>>>>>> -	if (!down_write_trylock(&adev->reset_sem))
>>>>>>>>> -		return;
>>>>>>>>> -
>>>>>>>>>   	amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 1);
>>>>>>>>>
>>>>>>>>>   	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
>>>>>>>>>
>>>>>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>>>>>>>>>   	} while (timeout > 1);
>>>>>>>>>
>>>>>>>>>   flr_done:
>>>>>>>>> -	atomic_set(&adev->in_gpu_reset, 0);
>>>>>>>>> -	up_write(&adev->reset_sem);
>>>>>>>>> -
>>>>>>>>>   	/* Trigger recovery for world switch failure if no TDR */
>>>>>>>>>   	if (amdgpu_device_should_recover_gpu(adev)
>>>>>>>>>   	    && (!amdgpu_device_has_job_running(adev) ||