RE: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Liu, Shaoyun
he re-submission for all kinds of reset, since the kernel already signals the reset event to user level (at least for the compute stack)? Regards, Shaoyun.liu -----Original Message----- From: Koenig, Christian Sent: Wednesday, October 26, 2022 1:27 PM To: Liu, Shaoyun; Tuikov, Luben; Prosyak, Vitaly; De

RE: [PATCH 2/5] drm/amdgpu: stop resubmitting jobs for bare metal reset

2022-10-26 Thread Liu, Shaoyun
[AMD Official Use Only - General] User space shouldn't care whether it runs on SRIOV or not, so I don't think we need to keep the re-submission for SRIOV either. A reset under SRIOV could trigger the host to do a whole-GPU reset, which will have the same issue as bare metal. Regards, Shaoyun.liu -

RE: [RFC v4 04/11] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2022-02-08 Thread Liu, Shaoyun
, Andrey Sent: Tuesday, February 8, 2022 7:23 PM To: dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org Cc: Koenig, Christian; dan...@ffwll.ch; Liu, Monk; Chen, Horace; Lazar, Lijo; Chen, JingWen; Grodzovsky, Andrey; Liu, Shaoyun Subject: [RFC v4 04/11] drm/amd/virt: For SRIOV

RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread Liu, Shaoyun
From: Grodzovsky, Andrey Sent: Tuesday, January 4, 2022 3:55 PM To: Liu, Shaoyun; Koenig, Christian; Liu, Monk; Chen, JingWen; Christian König; Deng, Emily; dri-devel@lists.freedesktop.org; amd-...@lists.freedesktop.org; Chen, Horace Cc: dan...@ffwll.ch Subject: Re: [RFC v2 8/8] drm/amd/virt:

RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread Liu, Shaoyun
[AMD Official Use Only] I mostly agree with the sequence Christian described. Just one thing may need discussion here. For an FLR notified from the host, in the new sequence as described, the driver needs to reply with READY_TO_RESET in the work item from a reset work queue, which means inside flr_

RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-23 Thread Liu, Shaoyun
[AMD Official Use Only] I had a discussion with Andrey about this offline. It seems dangerous to remove the in_gpu_reset and reset_sem directly inside the flr_work. In the case where the reset is triggered from the host side, the GPU needs to be locked while the host performs the reset after flr_work

RE: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-20 Thread Liu, Shaoyun
[AMD Official Use Only] Hi Andrey, I actually have some concerns about this change. 1. In an SRIOV configuration, the reset notification comes from the host, and the driver already triggers a work queue to handle the reset (check xgpu_*_mailbox_flr_work); is it a good idea to trigger another work queue

Re: [PATCH] drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning

2019-07-22 Thread Liu, Shaoyun
Reviewed-by:  shaoyunl On 2019-07-22 1:47 p.m., Gustavo A. R. Silva wrote: > In preparation to enabling -Wimplicit-fallthrough, this patch silences > the following warning: > > drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_mqd_manager_v10.c: In function > ‘mqd_manager_init_v10’: > ./include/linux/dyn

Re: [PATCH] drm/amdkfd/kfd_mqd_manager_v10: Fix missing break in switch statement

2019-07-22 Thread Liu, Shaoyun
PE_CP: > > case KFD_MQD_TYPE_CP: > case KFD_MQD_TYPE_COMPUTE: > pr_debug("%s@%i\n", __func__, __LINE__); > mqd->allocate_mqd = allocate_mqd; > > Thanks > -- > Gustavo > > >> Alex >> _

Re: [PATCH] drm/amdkfd/kfd_mqd_manager_v10: Fix missing break in switch statement

2019-07-22 Thread Liu, Shaoyun
This one is probably on purpose. The mqd init for CP and COMPUTE will have the same routine. Regards, shaoyun.liu On 2019-07-21 6:59 p.m., Gustavo A. R. Silva wrote: > Add missing break statement in order to prevent the code from falling > through to case KFD_MQD_TYPE_COMPUTE. > > This bug was
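The point made in this reply, that two queue types can intentionally share one init routine, can be sketched in standalone C. This is only an illustration under stated assumptions: the enum, struct, and function names below are hypothetical stand-ins, not the real amdkfd definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KFD MQD types; not the kernel definitions. */
enum mqd_type { MQD_TYPE_CP, MQD_TYPE_COMPUTE, MQD_TYPE_SDMA };

struct mqd_manager {
    int (*init_mqd)(void);   /* callback selected per queue type */
};

static int init_mqd_compute(void) { return 1; }
static int init_mqd_sdma(void)    { return 2; }

/* Select callbacks; CP deliberately reuses the COMPUTE routine. */
static int mqd_manager_setup(struct mqd_manager *mqd, enum mqd_type type)
{
    switch (type) {
    case MQD_TYPE_CP:
        /* fall through: CP shares the COMPUTE init routine on purpose */
    case MQD_TYPE_COMPUTE:
        mqd->init_mqd = init_mqd_compute;
        break;
    case MQD_TYPE_SDMA:
        mqd->init_mqd = init_mqd_sdma;
        break;
    default:
        return -1;           /* unknown type */
    }
    return 0;
}
```

Because no statement sits between the two case labels here, GCC's -Wimplicit-fallthrough does not warn. The warning only fires when a statement precedes the fall-through, in which case an explicit annotation is needed to mark it as intentional.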

Re: [PATCH] drm/amdkfd: fix potential null pointer dereference on pointer peer_dev

2019-07-02 Thread Liu, Shaoyun
From the comment, "we will loop GPUs that already be processed (with lower value of proximity_domain)", the device should already have been added to the topology_device_list. So in this case, kfd_topology_device_by_proximity_domain will not return a NULL pointer. If you really get the nu
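The argument in this reply rests on an ordering invariant: peers with a lower proximity_domain were added to the list before the current device is processed, so looking them up cannot fail. A minimal standalone sketch of that reasoning follows. The struct and helper names are hypothetical simplifications, not the amdkfd code, and it assumes devices are inserted in increasing proximity_domain order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a topology device entry. */
struct topo_dev {
    unsigned int proximity_domain;
    struct topo_dev *next;
};

/* Return the entry with the given proximity domain, or NULL if absent. */
static struct topo_dev *dev_by_proximity_domain(struct topo_dev *head,
                                                unsigned int domain)
{
    for (struct topo_dev *d = head; d != NULL; d = d->next)
        if (d->proximity_domain == domain)
            return d;
    return NULL;
}

/*
 * Count how many already-processed peers (domains below `current`) are
 * present. If the ordering assumption holds, this equals `current`.
 */
static unsigned int count_known_peers(struct topo_dev *head,
                                      unsigned int current)
{
    unsigned int found = 0;

    for (unsigned int peer = 0; peer < current; peer++)
        if (dev_by_proximity_domain(head, peer) != NULL)
            found++;
    return found;
}
```

If the ordering assumption were ever violated, the lookup would return NULL, which is the case the patch under discussion guards against.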