RE: [PATCH] drm/amdkfd: fixed page fault when enable MES shader debugger

2024-12-19 Thread Li, Yunxiang (Teddy)
[Public] > From: jesse.zh...@amd.com > Sent: Wednesday, December 18, 2024 6:26 PM > Subject: [PATCH] drm/amdkfd: fixed page fault when enable MES shader debugger > > Initialize the process context address before setting the shader debugger. I think it would make sense to pull this into it's own

RE: [PATCH v12 5/5] drm/amdgpu: track bo memory stats at runtime

2024-12-19 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Thursday, December 19, 2024 5:07 > Am 16.12.24 um 18:49 schrieb Yunxiang Li: > > Before, every time fdinfo is queried we try to lock all the BOs in the > > VM and calculate memory usage from scratch. This works okay if the > > fdinfo is rarely read and t

RE: [Bug report] Regression with kernel v6.13-rc2

2024-12-18 Thread Li, Yunxiang (Teddy)
[Public] > From: Tobias Klausmann > Sent: Wednesday, December 18, 2024 10:54 > Hi! > > I have been hitting kernel messages from AMDGPU since v6.13-rc2, for > example: > > [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks > suppressed [Wed Dec 18 15:56:24 2024] amdgpu :03:00

RE: [Bug report] Regression with kernel v6.13-rc2

2024-12-18 Thread Li, Yunxiang (Teddy)
[Public] +Felix > From: Tobias Klausmann > Sent: Wednesday, December 18, 2024 10:54 > > I have been hitting kernel messages from AMDGPU since v6.13-rc2, for > example: > > [Wed Dec 18 15:56:24 2024] gmc_v11_0_process_interrupt: 10 callbacks > suppressed [Wed Dec 18 15:56:24 2024] amdgpu :03:

RE: [PATCH v10 4/6] drm: consider GEM object shared when it is exported

2024-12-12 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Thursday, December 12, 2024 4:25 > Am 11.12.24 um 17:14 schrieb Li, Yunxiang (Teddy): > > [Public] > > > >> From: Koenig, Christian > >> Sent: Wednesday, December 11, 2024 10:03 Am 11.12.24 um 15:02 schrieb &

RE: [PATCH v10 4/6] drm: consider GEM object shared when it is exported

2024-12-11 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Wednesday, December 11, 2024 10:03 > Am 11.12.24 um 15:02 schrieb Li, Yunxiang (Teddy): > > [Public] > > > >> From: Koenig, Christian > >> Sent: Wednesday, December 11, 2024 3:16 Am 10.12.24 um 18:59 schrieb >

RE: [PATCH v10 4/6] drm: consider GEM object shared when it is exported

2024-12-11 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Wednesday, December 11, 2024 3:16 > Am 10.12.24 um 18:59 schrieb Yunxiang Li: > > Tracking the state of a GEM object for shared stats is quite difficult > > since the handle_count is managed behind driver's back. So instead > > considers GEM object share

RE: [PATCH v9 5/5] drm/amdgpu: track bo memory stats at runtime

2024-12-03 Thread Li, Yunxiang (Teddy)
[Public] > From: Tvrtko Ursulin > Sent: Tuesday, December 3, 2024 10:16 > On 03/12/2024 15:03, Li, Yunxiang (Teddy) wrote: > > [Public] > > > >> From: Tvrtko Ursulin > >> Sent: Tuesday, December 3, 2024 9:42 > >> On 28/11/2024 18:54, Yunxiang Li w

RE: [PATCH v9 5/5] drm/amdgpu: track bo memory stats at runtime

2024-12-03 Thread Li, Yunxiang (Teddy)
[Public] > From: Tvrtko Ursulin > Sent: Tuesday, December 3, 2024 9:42 > On 28/11/2024 18:54, Yunxiang Li wrote: > > Before, every time fdinfo is queried we try to lock all the BOs in the > > VM and calculate memory usage from scratch. This works okay if the > > fdinfo is rarely read and the VMs

RE: [PATCH v7 2/4] drm: make drm-active- stats optional

2024-11-18 Thread Li, Yunxiang (Teddy)
[Public] > From: Tvrtko Ursulin > Sent: Monday, November 11, 2024 5:30 > On 10/11/2024 15:41, Yunxiang Li wrote: > > Make drm-active- optional just like drm-resident- and drm-purgeable-. > > As Jani has already commented the commit message needs some work. > > > Signed-off-by: Yunxiang Li > > CC

RE: [PATCH v8 2/5] drm: make drm-active- stats optional

2024-11-18 Thread Li, Yunxiang (Teddy)
[Public] Indeed I have, thanks for the reminder Teddy

RE: [PATCH v8 3/5] Documentation/gpu: Clarify drm memory stats definition

2024-11-18 Thread Li, Yunxiang (Teddy)
[Public] > From: Tvrtko Ursulin > Sent: Monday, November 18, 2024 9:38 > On 16/11/2024 04:44, Yunxiang Li wrote: > > Define how to handle buffers with multiple possible placement so we > > don't get incompatible implementations. Callout the resident > > requirement for drm-purgeable- explicitly.

RE: [PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

2024-11-14 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > From: Tvrtko Ursulin > Sent: Wednesday, November 13, 2024 12:31 > On 13/11/2024 17:01, Li, Yunxiang (Teddy) wrote: > > [Public] > > > >> From: Koenig, Christian > >> Sent: Wednesday, November 13,

RE: [PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

2024-11-13 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Wednesday, November 13, 2024 9:22 > Am 13.11.24 um 14:53 schrieb Li, Yunxiang (Teddy): > > [Public] > > > >> From: Koenig, Christian > >> Sent: Wednesday, November 13, 2024 6:39 Am 13.11.24 um 11:25 schrieb >

RE: [PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

2024-11-13 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Wednesday, November 13, 2024 3:49 > Am 12.11.24 um 19:16 schrieb Li, Yunxiang (Teddy): > >> From: Christian König > >> Sent: Tuesday, November 12, 2024 5:54 > >> Am 10.11.24 um 16:41 schrieb Yunxiang Li: > >&g

RE: [PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

2024-11-13 Thread Li, Yunxiang (Teddy)
[Public] > From: Koenig, Christian > Sent: Wednesday, November 13, 2024 6:39 > Am 13.11.24 um 11:25 schrieb Tvrtko Ursulin: > > On 13/11/2024 08:49, Christian König wrote: > >> Am 12.11.24 um 19:16 schrieb Li, Yunxiang (Teddy): > >>> [SNIP] > &g

RE: [PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

2024-11-12 Thread Li, Yunxiang (Teddy)
[Public] > From: Christian König > Sent: Tuesday, November 12, 2024 5:54 > Am 10.11.24 um 16:41 schrieb Yunxiang Li: > > @@ -310,6 +311,92 @@ static void amdgpu_vm_bo_reset_state_machine(struct > amdgpu_vm *vm) > > spin_unlock(&vm->status_lock); > > } > > > > +/** > > + * amdgpu_vm_update_s

RE: [PATCH v6 5/5] drm/amdgpu: track bo memory stats at runtime

2024-11-07 Thread Li, Yunxiang (Teddy)
[Public] > From: Tvrtko Ursulin > Sent: Thursday, November 7, 2024 5:48 > On 31/10/2024 13:48, Li, Yunxiang (Teddy) wrote: > > [Public] > > > >> From: Christian König > >> Sent: Thursday, October 31, 2024 8:54 Am 25.10.24 um 19:41 schrieb > >>

RE: [PATCH v6 4/5] drm: add drm_memory_stats_is_zero

2024-11-07 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > From: Tvrtko Ursulin > Sent: Thursday, November 7, 2024 5:41 > On 25/10/2024 18:41, Yunxiang Li wrote: > > Add a helper to check if the memory stats is zero, this will be used > > to check for memory accounting errors. > > > > Signed-off-

RE: [PATCH v6 5/5] drm/amdgpu: track bo memory stats at runtime

2024-10-31 Thread Li, Yunxiang (Teddy)
[Public] > From: Christian König > Sent: Thursday, October 31, 2024 8:54 > Am 25.10.24 um 19:41 schrieb Yunxiang Li: > > Before, every time fdinfo is queried we try to lock all the BOs in the > > VM and calculate memory usage from scratch. This works okay if the > > fdinfo is rarely read and the

RE: [PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

2024-10-23 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] Yeah it looks like I missed the whole active/purgeable thing as well... Teddy

RE: [PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

2024-10-23 Thread Li, Yunxiang (Teddy)
wrote: > >>> Am 23.10.24 um 09:38 schrieb Tvrtko Ursulin: > >>>> > >>>> On 22/10/2024 17:24, Christian König wrote: > >>>>> Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy): > >>>>>> [Public] > >>>>>&

RE: [PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

2024-10-22 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] It sounds like it makes the most sense to ignore the BOs that have no placement or private placements then, it would simplify the code too. Teddy

RE: [PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

2024-10-22 Thread Li, Yunxiang (Teddy)
[Public] I suppose we could add a field like amd-memory-private: to cover the private placements. When would a BO not have a placement, is it when it is being moved? Since we are tracking the state changes, I wonder if such situations can be avoided now so whenever we call these stat update fun

RE: [PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

2024-10-22 Thread Li, Yunxiang (Teddy)
[Public] > > > > +static uint32_t fold_memtype(uint32_t memtype) { > > In general please add prefixes to even static functions, e.g. amdgpu_vm_ or > amdgpu_bo_. > > > + /* Squash private placements into 'cpu' to keep the legacy userspace > > view. > */ > > + switch (mem_type) { > > + case T

RE: [PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd

2024-10-11 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] Yes, my understanding is that autosuspend happens after a timeout, so it would only go into suspend if the pdd is not used for a while after it's created. Teddy

RE: [PATCH 1/2] drm/amdkfd: cleanup device pointer dereference chains

2024-10-10 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] Oops, lemme fix that real quick... > -Original Message- > From: Joshi, Mukul > Sent: Thursday, October 10, 2024 11:46 > To: Li, Yunxiang (Teddy) ; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Koenig, Ch

RE: [PATCH v2 2/2] drm/amdgpu: track bo memory stats at runtime

2024-09-10 Thread Li, Yunxiang (Teddy)
[Public] > Ok that looks extremely ugly. Please just add a separate function and call > that > from the TTM move function. Should I still remove the adev argument? It is never used and causes a few call sites having to find an adev unnecessarily. > Please either drop that or compare each memor

RE: [PATCH v2 2/2] drm/amdgpu: track bo memory stats at runtime

2024-06-24 Thread Li, Yunxiang (Teddy)
[Public] I'm not happy with the new helper names, and not quite satisfied with the function arguments either. But it's what I got right now, would appreciate it if anyone got better ideas. Teddy

RE: [PATCH] drm/amdgpu: track bo memory stats at runtime

2024-06-20 Thread Li, Yunxiang (Teddy)
[Public] > > + dma_resv_lock(bo->tbo.base.resv, NULL); > > Why do you grab the BO lock to update the stats? That doesn't seem to make > any sense. > > > + update_stats = !(bo->flags & AMDGPU_GEM_WAS_EXPORTED); > > + if (update_stats) > > + amdgpu_bo_add_memory(bo, &stats); > > +

RE: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks

2024-06-04 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] The trouble with taking the read side lock in the MES helper functions is that we use a lot of them during reset under the write lock. So either we need to duplicate the helper functions or we will get inconsistencies where a random subse

RE: [PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks

2024-05-30 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > It's not clear to me what's the requirement for reset domain locking around > MES calls. We have a lot more of them in kfd_device_queue_manager.c > (mostly calling adev->mes.funcs->... > directly). Do they all need to be wrapped individual

RE: [PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting

2024-05-29 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > One thing I could see going wrong is, that down_read_trylock(&dqm->dev- > >adev->reset_domain->sem) will not fail immediately when the reset is > scheduled. So there may be multipe attempts at HW access that detect an > error or time out,

RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

2024-05-29 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > Yeah, I know. That's one of the reason I've pointed out on the patch adding > that that this behavior is actually completely broken. > > If you run into issues with the MES because of this then please suggest a > revert of that patch. I t

RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

2024-05-29 Thread Li, Yunxiang (Teddy)
[Public] > The problem is that we don't force complete the non scheduler rings, e.g. MES, > KIQ etc... > > Try to remove this check here from the loop in > amdgpu_device_pre_asic_reset(): > > if (!amdgpu_ring_sched_ready(ring)) > continue; Ah, I see. Thou

RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

2024-05-29 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > I don't think trying to add some reset handling here makes sense in the first > place. > Part of the reset/recovery procedure is to signal all fence and that includes > the one we are waiting for here. > So this wait should return immedi

RE: [PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

2024-05-29 Thread Li, Yunxiang (Teddy)
[Public] > It's perfectly possible that the reset has already started before we enter > the function. Yeah, this could and does happen, but it just means we are back to the old behavior. I guess I could use "can I take the read side lock?" to test if the function is called outside of reset or

RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-24 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] Yes, the two places are 1. In debugfs and 2. In MI100's en/disable_debug_trap, and evidently someone is testing the debugfs interface because there's a bug fix for a race condition of it. Teddy

RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-24 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > If that is true you could in theory lower the locked area of the existing > lock, but adding a new one is strict no-go from my side. I'll try this, right now I see two places where this would be problematic because they are trying to su

RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Li, Yunxiang (Teddy)
[Public] > > Here is taking a different lock than the reset_domain->sem. It is a > > seperate reset_domain->gpu_sem that is only locked when we will actuall do > > reset, it is not taken in the skip_hw_reset path. > > Exactly that is what you should *not* do. Please don't add any new lock to >

RE: [PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-23 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - AMD Internal Distribution Only] > > +void amdgpu_lock_hw_access(struct amdgpu_device *adev); void > > +amdgpu_unlock_hw_access(struct amdgpu_device *adev); int > > +amdgpu_begin_hw_access(struct amdgpu_device *adev); void > > +amdgpu_end_hw_access(struct amdgpu_device *ade

RE: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row

2024-04-30 Thread Li, Yunxiang (Teddy)
[Public] Hi Christ, I got R-b from the SRIOV team for the rest of the patches, can you help review this last one? I think the concerns from the previous thread are all addressed https://patchwork.freedesktop.org/patch/590678/?series=132727 Regards, Teddy

RE: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic

2024-04-26 Thread Li, Yunxiang (Teddy)
[Public] > Why remove this? Oops it's a copy-paste error from the previous revision > Need to call amdgpu_virt_release_full_gpu(adev, true) before retry, and the > same as below. I thought we talked about if we call amdgpu_virt_{reset,request_full}_gpu again we don't need to release full gpu, I

RE: [PATCH] drm/amdgpu: Fix two reset triggered in a row

2024-04-25 Thread Li, Yunxiang (Teddy)
[Public] > Looks like that is handled by the scheduler work item now as well. See > function gfx_v9_0_fault() for an example. Cool so it is blocked by drm_sched_stop also. I think that covers everything.

RE: [PATCH] drm/amdgpu: Fix two reset triggered in a row

2024-04-24 Thread Li, Yunxiang (Teddy)
[Public] > We have the KFD, FLR, the per engine one in the scheduler and IIRC one more > for the CP (illegal operation and register write). > > I'm not sure about the CP one, but all others should be handled correctly > with the V2 patch as far as I can see. Where can I find the CP one? Nothing

RE: [PATCH] drm/amdgpu: Fix two reset triggered in a row

2024-04-22 Thread Li, Yunxiang (Teddy)
[Public] > We can't do this technically as there are cases where we skip full device > reset (even then amdgpu_in_reset will return true). The better thing to do is > to move amdgpu_device_stop_pending_resets() later in > gpu_recover()- if a device has undergone full reset, then cancel all pendi

RE: [PATCH] drm/amd/display: Fix vblank refcount in vrr transition

2022-10-14 Thread Li, Yunxiang (Teddy)
[AMD Official Use Only - General] > This patch results in a large number of compile errors if CONFIG_DEBUG_FS=n. > Reverting it fixes the problem. > > This is an architecture independent problem. > > Guenter Oops, seem to be because at amdgpu_dm.c:8328 the } should be inside the #endif not outs