[PATCH v12 5/5] drm/amdgpu: track bo memory stats at runtime

2024-12-16 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxia
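
The approach this snippet describes (update counters when a buffer changes state, then copy them out under a lock on query) can be sketched in plain C. This is only an illustration under stated assumptions: the names `vm_stats`, `vm_stats_move`, `vm_stats_snapshot` and the `placement` enum are hypothetical and are not the amdgpu data structures, and a C11 `atomic_flag` spinlock stands in for the status lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical placements for illustration; not the amdgpu definitions. */
enum placement { PLACE_NONE, PLACE_VRAM, PLACE_GTT, PLACE_COUNT };

struct vm_stats {
	atomic_flag lock;               /* stand-in for the status lock */
	uint64_t resident[PLACE_COUNT]; /* bytes resident per placement */
};

static void stats_lock(struct vm_stats *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		; /* spin until the flag is released */
}

static void stats_unlock(struct vm_stats *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

void vm_stats_init(struct vm_stats *s)
{
	atomic_flag_clear(&s->lock);
	for (int i = 0; i < PLACE_COUNT; i++)
		s->resident[i] = 0;
}

/* Called whenever a buffer changes state: adjust the counters in place. */
void vm_stats_move(struct vm_stats *s, uint64_t size,
		   enum placement from, enum placement to)
{
	stats_lock(s);
	if (from != PLACE_NONE) {
		/* An underflow here would indicate an accounting error. */
		assert(s->resident[from] >= size);
		s->resident[from] -= size;
	}
	if (to != PLACE_NONE)
		s->resident[to] += size;
	stats_unlock(s);
}

/* Called on query: copy the counters out without walking any buffers. */
void vm_stats_snapshot(struct vm_stats *s, uint64_t out[PLACE_COUNT])
{
	stats_lock(s);
	for (int i = 0; i < PLACE_COUNT; i++)
		out[i] = s->resident[i];
	stats_unlock(s);
}
```

In this sketch the query cost is a fixed-size copy, independent of how many buffers exist; the price is that every state change has to go through `vm_stats_move`.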

[PATCH v11 2/5] drm: make drm-active- stats optional

2024-12-12 Thread Yunxiang Li
ptional to enable amdgpu to switch to the second method. Signed-off-by: Yunxiang Li Reviewed-by: Tvrtko Ursulin CC: dri-de...@lists.freedesktop.org CC: intel-...@lists.freedesktop.org CC: amd-gfx@lists.freedesktop.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 1 + drivers/gpu/drm/

[PATCH v11 5/5] drm/amdgpu: track bo memory stats at runtime

2024-12-12 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxia

[PATCH v11 4/5] drm/amdgpu: remove unused function parameter

2024-12-12 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_

[PATCH v11 3/5] Documentation/gpu: Clarify drm memory stats definition

2024-12-12 Thread Yunxiang Li
having both is better for back-compat. Also re-order the paragraphs to flow better. Signed-off-by: Yunxiang Li Reviewed-by: Tvrtko Ursulin CC: dri-de...@lists.freedesktop.org --- Documentation/gpu/drm-usage-stats.rst | 54 +-- 1 file changed, 27 insertions(+), 27 del

[PATCH v11 1/5] drm: add drm_memory_stats_is_zero

2024-12-12 Thread Yunxiang Li
Add a helper to check whether the memory stats are all zero; this will be used to check for memory accounting errors. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/drm_file.c | 10 ++ include/drm/drm_file.h | 1 + 2 files
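
A minimal sketch of what such a zero-check helper could look like, written as standalone C. The struct `mem_stats`, its field names, and both function names are assumptions made for illustration; they are not copied from drm_file.h and may differ from the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-client memory stats record. */
struct mem_stats {
	uint64_t shared;
	uint64_t private_size;
	uint64_t resident;
	uint64_t purgeable;
	uint64_t active;
};

/* Returns true only if every counter in the record is zero. */
bool mem_stats_is_zero(const struct mem_stats *s)
{
	return s->shared == 0 && s->private_size == 0 && s->resident == 0 &&
	       s->purgeable == 0 && s->active == 0;
}

/*
 * Example use: once all buffers are gone, a non-zero counter means some
 * add/subtract pair was unbalanced, i.e. an accounting error.
 */
void mem_stats_check_teardown(const struct mem_stats *s)
{
	assert(mem_stats_is_zero(s));
}
```

Checking all counters in one place keeps the teardown check short and makes it harder to forget a field when the record grows.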

[PATCH v11 0/5] rework bo mem stats tracking

2024-12-12 Thread Yunxiang Li
umentation, minor tweaks, and some bug fixes found during testing v9: documentation fix as suggested, no functional change v10: change how gem objects shared via flink are counted, and fix a race between fdinfo read and buffer move v11: drop the v10 flink changes and instead hook into gem open/close Y

[PATCH v10 6/6] drm/amdgpu: track bo memory stats at runtime

2024-12-10 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxia

[PATCH v10 1/6] drm: add drm_memory_stats_is_zero

2024-12-10 Thread Yunxiang Li
Add a helper to check whether the memory stats are all zero; this will be used to check for memory accounting errors. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/drm_file.c | 10 ++ include/drm/drm_file.h | 1 + 2 files

[PATCH v10 3/6] Documentation/gpu: Clarify drm memory stats definition

2024-12-10 Thread Yunxiang Li
having both is better for back-compat. Also re-order the paragraphs to flow better. Signed-off-by: Yunxiang Li Reviewed-by: Tvrtko Ursulin CC: dri-de...@lists.freedesktop.org --- Documentation/gpu/drm-usage-stats.rst | 54 +-- 1 file changed, 27 insertions(+), 27 del

[PATCH v10 5/6] drm/amdgpu: remove unused function parameter

2024-12-10 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_

[PATCH v10 4/6] drm: consider GEM object shared when it is exported

2024-12-10 Thread Yunxiang Li
o get notified when GEM object is being shared. Signed-off-by: Yunxiang Li CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/drm_gem.c | 3 +++ drivers/gpu/drm/drm_prime.c | 3 +++ include/drm/drm_gem.h | 12 +++- 3 files changed, 17 insertions(+), 1 deletion(-) diff --

[PATCH v10 2/6] drm: make drm-active- stats optional

2024-12-10 Thread Yunxiang Li
ptional to enable amdgpu to switch to the second method. Signed-off-by: Yunxiang Li Reviewed-by: Tvrtko Ursulin CC: dri-de...@lists.freedesktop.org CC: intel-...@lists.freedesktop.org CC: amd-gfx@lists.freedesktop.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 1 + drivers/gpu/drm/

[PATCH v10 0/6] rework bo mem stats tracking

2024-12-10 Thread Yunxiang Li
umentation, minor tweaks, and some bug fixes found during testing v9: documentation fix as suggested, no functional change v10: change how gem objects shared via flink are counted, and fix a race between fdinfo read and buffer move Yunxiang Li (6): drm: add drm_memory_stats_is_zero drm: make

[PATCH v3 2/2] drm/amdkfd: pause autosuspend when creating pdd

2024-12-03 Thread Yunxiang Li
the KMS node) first and have not yet gone to sleep. Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3") Signed-off-by: Yunxiang Li --- v3: remove the cleanup in kfd_bind_process_to_device and document why this issue doesn't always happen drivers/gpu/drm/amd/amdkfd/kf

[PATCH v3 1/2] drm/amdkfd: cleanup device pointer dereference chains

2024-12-03 Thread Yunxiang Li
Pull out some duplicated dereference chains into variables, and in some cases grab struct device pointer directly from amdgpu_device instead of via drm_device. Signed-off-by: Yunxiang Li Reviewed-by: Mukul Joshi --- v3: no change drivers/gpu/drm/amd/amdkfd/kfd_process.c | 29

[PATCH v9 3/5] Documentation/gpu: Clarify drm memory stats definition

2024-11-28 Thread Yunxiang Li
y- is legacy, amdgpu only behavior. Re-order the paragraphs to flow better as well. Signed-off-by: Yunxiang Li CC: dri-de...@lists.freedesktop.org --- Documentation/gpu/drm-usage-stats.rst | 54 +-- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/Documentation/gp

[PATCH v9 4/5] drm/amdgpu: remove unused function parameter

2024-11-28 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_

[PATCH v9 5/5] drm/amdgpu: track bo memory stats at runtime

2024-11-28 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxia

[PATCH v9 1/5] drm: add drm_memory_stats_is_zero

2024-11-28 Thread Yunxiang Li
Add a helper to check whether the memory stats are all zero; this will be used to check for memory accounting errors. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/drm_file.c | 10 ++ include/drm/drm_file.h | 1 + 2 files

[PATCH v9 2/5] drm: make drm-active- stats optional

2024-11-28 Thread Yunxiang Li
ptional to enable amdgpu to switch to the second method. Signed-off-by: Yunxiang Li CC: dri-de...@lists.freedesktop.org CC: intel-...@lists.freedesktop.org CC: amd-gfx@lists.freedesktop.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 1 + drivers/gpu/drm/drm_file.c

[PATCH v9 0/5] rework bo mem stats tracking

2024-11-28 Thread Yunxiang Li
umentation, minor tweaks, and some bug fixes found during testing v9: documentation fix as suggested, no functional change Yunxiang Li (5): drm: add drm_memory_stats_is_zero drm: make drm-active- stats optional Documentation/gpu: Clarify drm memory stats definition drm/amdgpu: remove unuse

[PATCH v8 0/5] rework bo mem stats tracking

2024-11-15 Thread Yunxiang Li
umentation, minor tweaks, and some bug fixes found during testing Yunxiang Li (5): drm: add drm_memory_stats_is_zero drm: make drm-active- stats optional Documentation/gpu: Clarify drm memory stats definition drm/amdgpu: remove unused function parameter drm/amdgpu: track bo memory stats

[PATCH v8 1/5] drm: add drm_memory_stats_is_zero

2024-11-15 Thread Yunxiang Li
Add a helper to check whether the memory stats are all zero; this will be used to check for memory accounting errors. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/drm_file.c | 10 ++ include/drm/drm_file.h | 1 + 2 files

[PATCH v8 4/5] drm/amdgpu: remove unused function parameter

2024-11-15 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_

[PATCH v8 5/5] drm/amdgpu: track bo memory stats at runtime

2024-11-15 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxia

[PATCH v8 3/5] Documentation/gpu: Clarify drm memory stats definition

2024-11-15 Thread Yunxiang Li
having both is better for back-compat. Also re-order the paragraphs to flow better. Signed-off-by: Yunxiang Li CC: dri-de...@lists.freedesktop.org --- Documentation/gpu/drm-usage-stats.rst | 36 --- 1 file changed, 16 insertions(+), 20 deletions(-) diff --git a/Docume

[PATCH v8 2/5] drm: make drm-active- stats optional

2024-11-15 Thread Yunxiang Li
ptional to enable amdgpu to switch to the second method. Signed-off-by: Yunxiang Li CC: dri-de...@lists.freedesktop.org CC: intel-...@lists.freedesktop.org CC: amd-gfx@lists.freedesktop.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 1 + drivers/gpu/drm/drm_file.c

[PATCH v8 2/4] drm: make drm-active- stats optional

2024-11-11 Thread Yunxiang Li
drm-active- optional to enable amdgpu to switch to the second method. Signed-off-by: Yunxiang Li CC: dri-de...@lists.freedesktop.org CC: intel-...@lists.freedesktop.org CC: amd-gfx@lists.freedesktop.org --- drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 1 + drivers/gpu/drm/drm_file.c

[PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

2024-11-10 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. With this new approach however, we would no longer be able to track active buffers. Signed-off-by: Yunxia

[PATCH v7 0/4] rework bo mem stats tracking

2024-11-10 Thread Yunxiang Li
n a modern system all of VRAM can be mapped if needed. v5: rebase on top of the drm_print_memory_stats refactor v6: split the drm changes into a separate patch for drm-devel review, fix handling of drm-total- vs drm-resident- and handle drm-purgeable-. v7: make drm-active- optional Yunxiang Li

[PATCH v7 3/4] drm/amdgpu: remove unused function parameter

2024-11-10 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_

[PATCH v7 1/4] drm: add drm_memory_stats_is_zero

2024-11-10 Thread Yunxiang Li
Add a helper to check whether the memory stats are all zero; this will be used to check for memory accounting errors. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/drm_file.c | 10 ++ include/drm/drm_file.h | 1 + 2 files

[PATCH v6 2/5] drm/amdgpu: make drm-memory-* report resident memory

2024-10-25 Thread Yunxiang Li
The old behavior reported the resident memory usage for this key, and the documentation says so as well. However, this was accidentally changed to include buffers that were evicted. Fixes: a2529f67e2ed ("drm/amdgpu: Use drm_print_memory_stats helper from fdinfo") Signed-off-by: Yunxiang Li

[PATCH v6 0/5] rework bo mem stats tracking

2024-10-25 Thread Yunxiang Li
't know where would be a good place to add such info, especially how I could remove a BO's stat when its fence is signaled. Yunxiang Li (5): drm/amdgpu: remove unused function parameter drm/amdgpu: make drm-memory-* report resident memory drm/amdgpu: stop tracking v

[PATCH v6 3/5] drm/amdgpu: stop tracking visible memory stats

2024-10-25 Thread Yunxiang Li
Since all of vram can be made visible on modern systems anyway, drop tracking of how much memory is visible for now to simplify the new implementation. If this is really needed, we can add it back on top of the new implementation. Signed-off-by: Yunxiang Li Reviewed-by: Christian König

[PATCH v6 4/5] drm: add drm_memory_stats_is_zero

2024-10-25 Thread Yunxiang Li
Add a helper to check whether the memory stats are all zero; this will be used to check for memory accounting errors. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/drm_file.c | 9 + include/drm/drm_file.h | 1 + 2 files changed, 10 insertions(+) diff --git a/drivers/gpu/drm/drm_file.c b

[PATCH v6 1/5] drm/amdgpu: remove unused function parameter

2024-10-25 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_

[PATCH v6 5/5] drm/amdgpu: track bo memory stats at runtime

2024-10-25 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 14 +- drivers/gpu/dr

[PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

2024-10-18 Thread Yunxiang Li
ision, we track the BOs as they change states. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 14 +- drivers/gpu/dr

[PATCH v5 3/4] drm/amdgpu: stop tracking visible memory stats

2024-10-18 Thread Yunxiang Li
Since all of vram can be made visible on modern systems anyway, drop tracking of how much memory is visible for now to simplify the new implementation. If this is really needed, we can add it back on top of the new implementation, or just report all the BOs as visible. Signed-off-by: Yunxiang Li

[PATCH v5 2/4] drm/amdgpu: make drm-memory-* report resident memory

2024-10-18 Thread Yunxiang Li
The old behavior reported the resident memory usage for this key, and the documentation says so as well. However, this was accidentally changed to include buffers that were evicted. Fixes: a2529f67e2ed ("drm/amdgpu: Use drm_print_memory_stats helper from fdinfo") Signed-off-by: Y

[PATCH v5 1/4] drm/amdgpu: remove unused function parameter

2024-10-18 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 2 +- drivers/gpu/dr

[PATCH v5 0/4] rework bo mem stats tracking

2024-10-18 Thread Yunxiang Li
n a modern system all of VRAM can be mapped if needed. v5: rebase on top of the drm_print_memory_stats refactor Yunxiang Li (4): drm/amdgpu: remove unused function parameter drm/amdgpu: make drm-memory-* report resident memory drm/amdgpu: stop tracking visible memory stats drm/amdgpu: track

[PATCH v2 1/2] drm/amdkfd: cleanup device pointer dereference chains

2024-10-10 Thread Yunxiang Li
Pull out some duplicated dereference chains into variables, and in some cases grab struct device pointer directly from amdgpu_device instead of via drm_device. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 29 1 file changed, 15 insertions

[PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd

2024-10-10 Thread Yunxiang Li
pm_runtime_resume_and_get helper. Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3") Signed-off-by: Yunxiang Li --- It is unclear to me whether kfd_process_destroy_pdds also has this problem, or whether freeing gtt mem is guaranteed to succeed even with the GPU in suspend. drivers/gpu/drm/

[PATCH 1/2] drm/amdkfd: cleanup device pointer dereference chains

2024-10-10 Thread Yunxiang Li
Pull out some duplicated dereference chains into variables, and in some cases grab struct device pointer directly from amdgpu_device instead of via drm_device. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 32 +--- 1 file changed, 18 insertions

[PATCH 2/2] drm/amdkfd: pause autosuspend when creating pdd

2024-10-10 Thread Yunxiang Li
pm_runtime_resume_and_get helper. Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3") Signed-off-by: Yunxiang Li --- It is unclear to me whether kfd_process_destroy_pdds also has this problem, or whether freeing gtt mem is guaranteed to succeed even with the GPU in suspend. drivers/gpu/drm/

[PATCH v4] drm/amdgpu: track bo memory stats at runtime

2024-09-16 Thread Yunxiang Li
ision, we track the BOs as they change state. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. Signed-off-by: Yunxiang Li --- v3: add amdgpu_vm_bo_move function instead of changing amdgpu_vm_bo_inval

[PATCH v3 3/3] drm/amdgpu: track bo memory stats at runtime

2024-09-11 Thread Yunxiang Li
ision, we track the BOs as they change state. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. Signed-off-by: Yunxiang Li --- v3: add amdgpu_vm_bo_move function instead of changing amdgpu_vm_bo_inval

[PATCH v3 1/3] drm/amdgpu: stop tracking visible memory stats

2024-09-11 Thread Yunxiang Li
Since all of vram can be made visible on modern systems anyway, drop tracking of how much memory is visible for now to simplify the new implementation. If this is still needed, we can add it back on top of the new implementation. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu

[PATCH v3 2/3] drm/amdgpu: remove unused function parameter

2024-09-11 Thread Yunxiang Li
amdgpu_vm_bo_invalidate doesn't use the adev parameter and not all callers have a reference to adev handy, so remove it for cleanliness. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 2 +- drivers/gpu/dr

[PATCH v2 1/2] drm/amdgpu: stop tracking visible memory stats

2024-06-24 Thread Yunxiang Li
Since all of vram can be made visible on modern systems anyway, drop tracking of how much memory is visible for now to simplify the new implementation. If this is still needed, we can add it back on top of the new implementation. Signed-off-by: Yunxiang Li --- v2: split into two patches for

[PATCH v2 2/2] drm/amdgpu: track bo memory stats at runtime

2024-06-24 Thread Yunxiang Li
ision, we track the BOs as they change state. This way when the fdinfo is queried we only need to take the status lock and copy out the usage stats with minimal impact to the runtime performance. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +- drivers/gpu/dr

[PATCH] drm/amdgpu: track bo memory stats at runtime

2024-06-19 Thread Yunxiang Li
ce on modern systems all of vram can be visible anyways. Also we do not track "unsharing" a BO, the shared stat is only decreased when the BO is destroyed. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 26 ++- drivers/gpu/drm/amd/amdgpu/amdgpu

[PATCH v4 7/9] drm/amdgpu: fix locking scope when flushing tlb

2024-06-04 Thread Yunxiang Li
Which method is used to flush the tlb does not depend on whether a reset is in progress. We should skip the flush altogether if the GPU will get reset, so put both paths under the reset_domain read lock. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: sta...@vger.kernel.org --- drivers

[PATCH v4 5/9] drm/amdgpu: use helper in amdgpu_gart_unbind

2024-06-04 Thread Yunxiang Li
When the amdgpu_gart_invalidate_tlb helper was introduced, this part was left out of the conversion. Avoid the code duplication here. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff

[PATCH v4 3/9] drm/amdgpu/kfd: remove is_hws_hang and is_resetting

2024-06-04 Thread Yunxiang Li
set. If the on-going reset decided to skip GPU reset we have a bad time, otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check is_resetting flag compared to the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxi

[PATCH v4 8/9] drm/amdgpu: add lock in amdgpu_gart_invalidate_tlb

2024-06-04 Thread Yunxiang Li
We need to take the reset domain lock before flushing hdp. We can't put the lock inside amdgpu_device_flush_hdp itself because it is used during reset, where we already hold the write-side lock. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 6 +- 1 file chang

[PATCH v4 6/9] drm/amdgpu: call flush_gpu_tlb directly in gfxhub enable

2024-06-04 Thread Yunxiang Li
Since we are in reset here and already hold the reset_domain write-side lock, we can't use the flush tlb helper, which tries to take the read side. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 4 +--- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 2 +- drivers/gp

[PATCH v4 9/9] drm/amdgpu: add lock in kfd_process_dequeue_from_device

2024-06-04 Thread Yunxiang Li
We need to take the reset domain lock before talking to MES. In this case we could take the lock inside the mes helper, but we can't do so for most other mes helpers since they are used during reset. So for consistency's sake we add the lock here. Signed-off-by: Yunxiang Li --- drivers/gp

[PATCH v4 2/9] drm/amdgpu: fix sriov host flr handler

2024-06-04 Thread Yunxiang Li
ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also lets us tell how slow ready-to-reset actually is from the host. The ready-to-reset speed can be improved later. Signed-off-by: Yunxiang Li Acked-by: Christian König Reviewed-by: Emily Deng

[PATCH v4 4/9] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover

2024-06-04 Thread Yunxiang Li
At this point the gart is not set up, so there's no point in invalidating the tlb here, and it could even be harmful. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/a

[PATCH v4 1/9] drm/amdgpu: add skip_hw_access checks for sriov

2024-06-04 Thread Yunxiang Li
Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd

[PATCH v4 0/9] drm/amdgpu: prevent concurrent GPU access during reset

2024-06-04 Thread Yunxiang Li
fence poll if reset is started Revert "drm/amdgpu: Queue KFD reset workitem in VF FED" updated: drm/amdgpu: fix sriov host flr handler drm/amdgpu: fix missing reset domain locks Yunxiang Li (9): drm/amdgpu: add skip_hw_access checks for sriov drm/amdgpu:

[PATCH v3 2/8] drm/amdgpu: fix sriov host flr handler

2024-05-30 Thread Yunxiang Li
ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also lets us know how slow the reset actually is on the host. The pre-reset speed can thus be improved later. Signed-off-by: Yunxiang Li --- v3: still call amdgpu_virt_fini_data_exchange right away

[PATCH v3 3/8] drm/amdgpu/kfd: remove is_hws_hang and is_resetting

2024-05-30 Thread Yunxiang Li
set. If the on-going reset decided to skip GPU reset we have a bad time, otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check is_resetting flag compared to the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxi

[PATCH v3 0/8] drm/amdgpu: prevent concurrent GPU access during reset

2024-05-30 Thread Yunxiang Li
kun Gao (1): drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li (7): drm/amdgpu: add skip_hw_access checks for sriov drm/amdgpu: fix sriov host flr handler drm/amdgpu/kfd: remove is_hws_hang and is_resetting drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover drm/a

[PATCH v3 6/8] drm/amdgpu: use helper in amdgpu_gart_unbind

2024-05-30 Thread Yunxiang Li
When the amdgpu_gart_invalidate_tlb helper was introduced, this part was left out of the conversion. Avoid the code duplication here. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff

[PATCH v3 8/8] drm/amdgpu: fix missing reset domain locks

2024-05-30 Thread Yunxiang Li
These functions are missing the lock for reset domain. Signed-off-by: Yunxiang Li --- v3: only bracket amdgpu_device_flush_hdp with the read lock, amdgpu_gmc_flush_gpu_tlb already takes the lock drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 6 +- drivers/gpu/drm/amd/amdkfd

[PATCH v3 5/8] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover

2024-05-30 Thread Yunxiang Li
At this point the gart is not set up, so there's no point in invalidating the tlb here, and it could even be harmful. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/a

[PATCH v3 7/8] drm/amdgpu: fix locking scope when flushing tlb

2024-05-30 Thread Yunxiang Li
Which method is used to flush the tlb does not depend on whether a reset is in progress. We should skip the flush altogether if the GPU will get reset, so put both paths under the reset_domain read lock. Signed-off-by: Yunxiang Li Reviewed-by: Christian König CC: sta...@vger.kernel.org --- drivers

[PATCH v3 1/8] drm/amdgpu: add skip_hw_access checks for sriov

2024-05-30 Thread Yunxiang Li
Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it. Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd

[PATCH v3 4/8] drm/amd/amdgpu: remove unnecessary flush when enable gart

2024-05-30 Thread Yunxiang Li
From: Likun Gao Remove the hdp flush for gc v11/12 when enabling gart. Remove the tlb flush for gc v10/11/12 when enabling gart. The flush is done for each GART mapping when it is created. Signed-off-by: Likun Gao Signed-off-by: Yunxiang Li Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu

[PATCH v2 09/10] drm/amdgpu: fix missing reset domain locks

2024-05-28 Thread Yunxiang Li
These functions are missing the lock for reset domain. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 4 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 8 ++-- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++-- 3

[PATCH v2 04/10] drm/amdgpu/kfd: remove is_hws_hang and is_resetting

2024-05-28 Thread Yunxiang Li
set. If the on-going reset decided to skip GPU reset we have a bad time, otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check is_resetting flag compared to the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxi

[PATCH v2 06/10] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover

2024-05-28 Thread Yunxiang Li
At this point the gart is not set up, so there's no point in invalidating the tlb here, and it could even be harmful. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/driver

[PATCH v2 08/10] drm/amdgpu: fix locking scope when flushing tlb

2024-05-28 Thread Yunxiang Li
Which method is used to flush the tlb does not depend on whether a reset is in progress. We should skip the flush altogether if the GPU will get reset, so put both paths under the reset_domain read lock. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 66

[PATCH v2 10/10] Revert "drm/amdgpu: Queue KFD reset workitem in VF FED"

2024-05-28 Thread Yunxiang Li
This reverts commit 2149ee697a7a3091a16447c647d4a30f7468553a. The issue is already fixed by fa5a7f2ccb7e ("drm/amdgpu: Fix two reset triggered in a row") Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

[PATCH v2 07/10] drm/amdgpu: use helper in amdgpu_gart_unbind

2024-05-28 Thread Yunxiang Li
When the amdgpu_gart_invalidate_tlb helper was introduced, this part was left out of the conversion. Avoid the code duplication here. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd

[PATCH v2 03/10] drm/amdgpu: abort fence poll if reset is started

2024-05-28 Thread Yunxiang Li
If a reset is triggered, there's no point in waiting for the fence any longer; it just makes the reset code wait a long time for the reset_domain read lock to be dropped. This also makes our reply to host FLR fast enough that the host doesn't time out. Signed-off-by: Yunxiang Li --

[PATCH v2 05/10] drm/amd/amdgpu: remove unnecessary flush when enable gart

2024-05-28 Thread Yunxiang Li
From: Likun Gao Remove the hdp flush for gc v11/12 when enabling gart. Remove the tlb flush for gc v10/11/12 when enabling gart. Signed-off-by: Likun Gao Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 --- drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 3 --- drivers/gpu/drm/amd

[PATCH v2 02/10] drm/amdgpu: fix sriov host flr handler

2024-05-28 Thread Yunxiang Li
ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also lets us know how slow the reset actually is on the host. The pre-reset speed can thus be improved later. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2

[PATCH v2 01/10] drm/amdgpu: add skip_hw_access checks for sriov

2024-05-28 Thread Yunxiang Li
Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers

[PATCH v2 00/10] drm/amdgpu: prevent concurrent GPU access during reset

2024-05-28 Thread Yunxiang Li
reliable. This series hopes to address these bugs. Likun Gao (1): drm/amd/amdgpu: remove unnecessary flush when enable gart Yunxiang Li (9): drm/amdgpu: add skip_hw_access checks for sriov drm/amdgpu: fix sriov host flr handler drm/amdgpu: abort fence poll if reset is started drm/amdgpu

[PATCH 2/4] drm/amdgpu: use helper in amdgpu_gart_unbind

2024-05-22 Thread Yunxiang Li
When amdgpu_gart_invalidate_tlb helper is introduced this part was left out of the conversion. Avoid the code duplication here. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gart.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd

[PATCH 3/4] drm/amdgpu: remove tlb flush in amdgpu_gtt_mgr_recover

2024-05-22 Thread Yunxiang Li
At this point the gart is not set up, there's no point to invalidate tlb here and it could even be harmful. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c b/driver

[PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

2024-05-22 Thread Yunxiang Li
not take the read lock and deadlock itself, and normal access should avoid waiting on the reset to finish and should instead treat the hardware access as failed. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

[PATCH 1/4] drm/amd/amdgpu: remove unnecessary flush when enable gart

2024-05-22 Thread Yunxiang Li
From: Likun Gao Remove hdp flush for gc v11/12 when enable gart. Remove flush tlb for gc v10/11/12 when enable gart. Signed-off-by: Likun Gao Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 3 --- drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 3 --- drivers/gpu/drm/amd

[PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic

2024-04-26 Thread Yunxiang Li
lling amdgpu_device_pre_asic_reset each retry which properly free the resources from previous try by calling amdgpu_virt_fini_data_exchange. Signed-off-by: Yunxiang Li --- v2: put back release full access and the missed return drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++ 1

[PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR

2024-04-26 Thread Yunxiang Li
reset notification. Signed-off-by: Yunxiang Li --- v2: fix typo v3: pass reset_context directly v4: clear the flag in case we retry drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 - drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 1

[PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function

2024-04-25 Thread Yunxiang Li
This is part of the reset, move it into the reset function. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu

[PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic

2024-04-25 Thread Yunxiang Li
lling amdgpu_device_pre_asic_reset each retry which properly free the resources from previous try by calling amdgpu_virt_fini_data_exchange. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++ 1 file changed, 22 insertions(+), 28 deletions(-) diff --

[PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR

2024-04-25 Thread Yunxiang Li
reset notification. Signed-off-by: Yunxiang Li --- v2: fix typo v3: pass reset_context directly drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 1 + drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c

[PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row

2024-04-25 Thread Yunxiang Li
is supposedly in a good state, any reset scheduled after this point would be a legitimate reset. Remove unnecessary and incorrect checks for amdgpu_in_reset that was kinda serving this purpose. Signed-off-by: Yunxiang Li --- v2: instead of adding amdgpu_in_reset check, move when we cancel pending

[PATCH v2] drm/amdgpu: Fix two reset triggered in a row

2024-04-23 Thread Yunxiang Li
scheduled after that point would be legitimate. Remove unnecessary and incorrect checks for amdgpu_in_reset that was kinda serving this purpose. Signed-off-by: Yunxiang Li --- v2: instead of adding amdgpu_in_reset check, move when we cancel pending resets drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

[PATCH WIP] drm/amdgpu: Fix kfd_locked locking issue

2024-04-22 Thread Yunxiang Li
easier to reason about. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 76 +- 1 file changed, 30 insertions(+), 46 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 67da24e9f0a2

[PATCH v2] drm/amdgpu: Add reset_context flag for host FLR

2024-04-22 Thread Yunxiang Li
Using the job pointer to check if the FLR comes from the host is wrong, there are other reset triggers that pass NULL for job. So add a flag explicitly for host triggered reset. Signed-off-by: Yunxiang Li --- v2: fix typo drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++- drivers/gpu/drm/amd

[PATCH] drm/amdgpu: Fix two reset triggered in a row

2024-04-22 Thread Yunxiang Li
Reset request from KFD is missing a check for if a reset is already in progress, this causes a second reset to be triggered right after the previous one finishes. Add the check to align with the other reset sources. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2

[PATCH] drm/amdgpu: Add reset_context flag for host FLR

2024-04-22 Thread Yunxiang Li
Using the job pointer to check if the FLR comes from the host is wrong, there are other reset triggers that pass NULL for job. So add a flag explicitly for host triggered reset. Signed-off-by: Yunxiang Li --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++- drivers/gpu/drm/amd/amdgpu
