Hi Christian, I'm not sure what do you mean by "we need to change SRIOV not the driver".
Do you mean we should change the reset sequence in SRIOV? This will be a huge change for our SRIOV solution. >From my point of view, we can directly use amdgpu_device_lock_adev and amdgpu_device_unlock_adev in flr_work instead of try_lock since no one will conflict with this thread with reset_domain introduced. But we do need the reset_sem and adev->in_gpu_reset to keep device untouched via user space. Best Regards, Jingwen Chen On 2022/1/3 下午6:17, Christian König wrote: > Please don't. This patch is vital to the cleanup of the reset procedure. > > If SRIOV doesn't work with that we need to change SRIOV and not the driver. > > Christian. > > Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky: >> Sure, I guess i can drop this patch then. >> >> Andrey >> >> On 2021-12-24 4:57 a.m., JingWen Chen wrote: >>> I do agree with shaoyun, if the host find the gpu engine hangs first, and >>> do the flr, guest side thread may not know this and still try to access >>> HW(e.g. kfd is using a lot of amdgpu_in_reset and reset_sem to identify the >>> reset status). And this may lead to very bad result. >>> >>> On 2021/12/24 下午4:58, Deng, Emily wrote: >>>> These patches look good to me. JingWen will pull these patches and do some >>>> basic TDR test on sriov environment, and give feedback. >>>> >>>> Best wishes >>>> Emily Deng >>>> >>>> >>>> >>>>> -----Original Message----- >>>>> From: Liu, Monk <monk....@amd.com> >>>>> Sent: Thursday, December 23, 2021 6:14 PM >>>>> To: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey >>>>> <andrey.grodzov...@amd.com>; dri-devel@lists.freedesktop.org; amd- >>>>> g...@lists.freedesktop.org; Chen, Horace <horace.c...@amd.com>; Chen, >>>>> JingWen <jingwen.ch...@amd.com>; Deng, Emily <emily.d...@amd.com> >>>>> Cc: dan...@ffwll.ch >>>>> Subject: RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset >>>>> protection >>>>> for SRIOV >>>>> >>>>> [AMD Official Use Only] >>>>> >>>>> @Chen, Horace @Chen, JingWen @Deng, Emily >>>>> >>>>> Please take a review on Andrey's patch >>>>> >>>>> Thanks >>>>> ------------------------------------------------------------------- >>>>> Monk Liu | Cloud GPU & Virtualization Solution | AMD >>>>> ------------------------------------------------------------------- >>>>> we are hiring software manager for CVS core team >>>>> ------------------------------------------------------------------- >>>>> >>>>> -----Original Message----- >>>>> From: Koenig, Christian <christian.koe...@amd.com> >>>>> Sent: Thursday, December 23, 2021 4:42 PM >>>>> To: Grodzovsky, Andrey <andrey.grodzov...@amd.com>; dri- >>>>> de...@lists.freedesktop.org; amd-...@lists.freedesktop.org >>>>> Cc: dan...@ffwll.ch; Liu, Monk <monk....@amd.com>; Chen, Horace >>>>> <horace.c...@amd.com> >>>>> Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset >>>>> protection >>>>> for SRIOV >>>>> >>>>> Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky: >>>>>> Since now flr work is serialized against GPU resets there is no need >>>>>> for this. >>>>>> >>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com> >>>>> Acked-by: Christian König <christian.koe...@amd.com> >>>>> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 ----------- >>>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 ----------- >>>>>> 2 files changed, 22 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c >>>>>> index 487cd654b69e..7d59a66e3988 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c >>>>>> @@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct >>>>> work_struct *work) >>>>>> struct amdgpu_device *adev = container_of(virt, struct >>>>> amdgpu_device, virt); >>>>>> int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT; >>>>>> >>>>>> - /* block amdgpu_gpu_recover till msg FLR COMPLETE received, >>>>>> - * otherwise the mailbox msg will be ruined/reseted by >>>>>> - * the VF FLR. >>>>>> - */ >>>>>> - if (!down_write_trylock(&adev->reset_sem)) >>>>>> - return; >>>>>> - >>>>>> amdgpu_virt_fini_data_exchange(adev); >>>>>> - atomic_set(&adev->in_gpu_reset, 1); >>>>>> >>>>>> xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0); >>>>>> >>>>>> @@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct >>>>> work_struct *work) >>>>>> } while (timeout > 1); >>>>>> >>>>>> flr_done: >>>>>> - atomic_set(&adev->in_gpu_reset, 0); >>>>>> - up_write(&adev->reset_sem); >>>>>> - >>>>>> /* Trigger recovery for world switch failure if no TDR */ >>>>>> if (amdgpu_device_should_recover_gpu(adev) >>>>>> && (!amdgpu_device_has_job_running(adev) || diff --git >>>>>> a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c >>>>>> index e3869067a31d..f82c066c8e8d 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c >>>>>> @@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct >>>>> work_struct *work) >>>>>> struct amdgpu_device *adev = container_of(virt, struct >>>>> amdgpu_device, virt); >>>>>> int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT; >>>>>> >>>>>> - /* block amdgpu_gpu_recover till msg FLR COMPLETE received, >>>>>> - * otherwise the mailbox msg will be ruined/reseted by >>>>>> - * the VF FLR. >>>>>> - */ >>>>>> - if (!down_write_trylock(&adev->reset_sem)) >>>>>> - return; >>>>>> - >>>>>> amdgpu_virt_fini_data_exchange(adev); >>>>>> - atomic_set(&adev->in_gpu_reset, 1); >>>>>> >>>>>> xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0); >>>>>> >>>>>> @@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct >>>>> work_struct *work) >>>>>> } while (timeout > 1); >>>>>> >>>>>> flr_done: >>>>>> - atomic_set(&adev->in_gpu_reset, 0); >>>>>> - up_write(&adev->reset_sem); >>>>>> - >>>>>> /* Trigger recovery for world switch failure if no TDR */ >>>>>> if (amdgpu_device_should_recover_gpu(adev) >>>>>> && (!amdgpu_device_has_job_running(adev) || >