Re: [PATCH] drm/amdgpu: Avoid extra evict-restore process.

2025-07-09 Thread Philip Yang
On 2025-07-08 16:14, Gang Ba wrote: If the vm belongs to another process, this is an fclose after fork; the wait may enable signaling of the KFD eviction fence and cause the parent process's queues to be evicted. Signed-off-by: Gang Ba --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++ 1 file changed, 7 insertions

Re: [PATCH] drm/amdgpu: delete function amdgpu_flush

2025-07-02 Thread Philip Yang
On 2025-07-01 03:28, Christian König wrote: Clear NAK to removing this! The amdgpu_flush function is vital for correct operation. no fflush call from libdrm/amdgpu, so amdgpu_flush is only called from fclose -> filp_flush The intention is to block closing the file handle in child processes a

Re: [PATCH] drm/amdgpu: delete function amdgpu_flush

2025-06-27 Thread Philip Yang
On 2025-06-27 01:20, YuanShang Mao (River) wrote: [AMD Official Use Only - AMD Internal Distribution Only] Currently, amdgpu_flush is used to prevent new jobs from being submitted in the same context when a file descriptor is closed and to wait for existing jobs to complete. Additionally, if

Re: [PATCH] drm/amdkfd: Don't call mmput from MMU notifier callback

2025-06-24 Thread Philip Yang
On 2025-06-23 18:18, Chen, Xiaogang wrote: On 6/23/2025 11:59 AM, Philip Yang wrote: If the process is exiting, the mmput inside mmu notifier callback from compactd or fork or numa balancing could release the last reference of mm struct to call exit_mmap and free_pgtable, this triggers

[PATCH] drm/amdkfd: Don't call mmput from MMU notifier callback

2025-06-23 Thread Philip Yang
] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu] kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu] kfd_ioctl+0x29d/0x500 [amdgpu] Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 23 +++--

Re: [PATCH] drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequences

2025-06-16 Thread Philip Yang
On 2025-06-16 07:43, Jesse Zhang wrote: This commit makes two key fixes to SDMA v4.4.2 handling: 1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines by reading the current value before modifying UTC_L1_ENABLE bit. 2. Ensure UTC_L1_ENABLE is consistently managed by: - Ad

Re: [PATCH] drm/amdgpu: Add chain runlists support to GC9.4.2

2025-06-06 Thread Philip Yang
On 2025-06-05 12:11, Amber Lin wrote: Starting from MEC v97, GC 9.4.2 supports chain runlists of XNACK+/XNACK- processes. Signed-off-by: Amber Lin Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +++ drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 12

Re: [PATCH] drm/amdkfd: Fix kfd process ref leaking when userptr unmapping

2025-05-28 Thread Philip Yang
urce after application exit. NULL pointer check is also necessary as kfd_lookup_process_by_pid() may return NULL pointer if app process/task is already destroyed. Regards, Philip -Original Message- From: amd-gfx On Behalf Of Philip Yang Sent: Tuesday, May 27, 2025 11:35 AM T

Re: [PATCH 2/2] drm/amdkfd: add svm_migrate_successful_pages

2025-05-28 Thread Philip Yang
On 2025-05-28 13:19, James Zhu wrote: to get migration pages. When migrating pages from system to vram, needn't check bit MIGRATE_PFN_VALID, since the system page could be allocated, but not be accessed. I think the corner case is vram_pages becomes negative value when migrating prange from

Re: [PATCH 1/2] drm/amdkfd: remove unused code

2025-05-28 Thread Philip Yang
On 2025-05-28 13:19, James Zhu wrote: upages is assigned under cpages = 0, so it isn't really used in this function. Signed-off-by: James Zhu Reviewed-by: Philip.Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkf

[PATCH] drm/amdkfd: Fix kfd process ref leaking when userptr unmapping

2025-05-27 Thread Philip Yang
kfd_lookup_process_by_pid increases process ref, the refcount is leaking. Fixes: 7a566d7f56f4 ("amd/amdkfd: Trigger segfault for early userptr unmmapping") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++-- 1 file changed, 7 insert

Re: [PATCH] amd/amdkfd: fix a kfd_process ref leak

2025-05-26 Thread Philip Yang
On 2025-05-21 06:12, Yifan Zhang wrote: This patch is to fix a kfd_process ref leak. Signed-off-by: Yifan Zhang Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_events.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu

Re: [PATCH 3/3] drm/amdkfd: destroy_pdds release pdd->drm_file at end

2025-05-16 Thread Philip Yang
On 2025-05-15 17:31, Chen, Xiaogang wrote: On 5/15/2025 3:45 PM, Philip Yang wrote: On 2025-05-15 10:29, Chen, Xiaogang wrote: Does this patch fix a bug or just make code look more reasonable? kfd_process_destroy_pdds releases pdd related buffers, not related to operations on vm. So vm

Re: [PATCH 3/3] drm/amdkfd: destroy_pdds release pdd->drm_file at end

2025-05-15 Thread Philip Yang
ently, as fput(pdd->drm_file) to free vm is right between free vm mapping qpd->cwsr_mem, qpd->ib_mem and free kernel bo qpd->proc_doorbells, pdd->proc_ctx_bo, to make it clear for future change. Regards, Philip Regards Xiaogang On 5/14/2025 12:10 PM, Philip Yang wrote: Relea

Re: [PATCH 2/3] drm/amdgpu: amdgpu_vm_fini hold vm lock to access vm->va

2025-05-15 Thread Philip Yang
On 2025-05-15 10:40, Chen, Xiaogang wrote: On 5/14/2025 12:10 PM, Philip Yang wrote: Move vm root bo unreserve after vm->va mapping free because we should hold vm lock to access vm->va. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 1 file chan

[PATCH 3/3] drm/amdkfd: destroy_pdds release pdd->drm_file at end

2025-05-14 Thread Philip Yang
Releasing pdd->drm_file may free the vm if this is the last reference, so move it to the last step, after memory is unmapped. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/

[PATCH 1/3] drm/amdgpu: seq64 memory unmap uses uninterruptible lock

2025-05-14 Thread Philip Yang
g. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c index 3939761be31c..d45ebfb642ca 100644 --- a/drivers/gpu/drm/

[PATCH 0/3] Remove process exit error message

2025-05-14 Thread Philip Yang
This series fixes the dmesg error message "still active bo inside vm" and 2 potential races during process exit and vm cleanup. Philip Yang (3): drm/amdgpu: seq64 memory unmap uses uninterruptible lock drm/amdgpu: amdgpu_vm_fini hold vm lock to access vm->va drm/amdkfd: destroy_pdd

[PATCH 2/3] drm/amdgpu: amdgpu_vm_fini hold vm lock to access vm->va

2025-05-14 Thread Philip Yang
Move vm root bo unreserve after vm->va mapping free because we should hold vm lock to access vm->va. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_v

[PATCH] drm/amdgpu: csa unmap use uninterruptible lock

2025-05-07 Thread Philip Yang
+0x217/0x3c0 do_group_exit+0x3b/0xb0 get_signal+0x14a/0x8d0 arch_do_signal_or_restart+0xde/0x100 exit_to_user_mode_loop+0xc1/0x1a0 exit_to_user_mode_prepare+0xf4/0x100 syscall_exit_to_user_mode+0x17/0x40 do_syscall_64+0x69/0xc0 Signed-off-by: Philip Yang --- drivers/gpu/drm/amd

Re: [PATCH] drm/amdkfd: Fix some kfd related recover issues

2025-04-17 Thread Philip Yang
On 2025-03-21 19:35, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] -Original Message- From: Lazar, Lijo Sent: Friday, March 21, 2025 7:06 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject

Re: [PATCH] drm/amdgpu: Fix missing drain retry fault the last entry

2025-03-04 Thread Philip Yang
On 2025-03-03 19:44, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Ping.. Emily Deng Best Wishes -Original Message- From: Emily Deng Sent: Monday, March 3, 2025 5:35 PM To: amd-gfx@lists.f

Re: [PATCH] drm/amdkfd: Fix NULL Pointer Dereference in KFD queue

2025-02-28 Thread Philip Yang
idate queue cwsr area and eop buffer size") This patch is Reviewed-by: Philip Yang Signed-off-by: Andrew Martin --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/

Re: [PATCH] drm/amdkfd: clamp queue size to minimum

2025-02-26 Thread Philip Yang
On 2025-02-25 21:41, David Yat Sin wrote: If queue size is less than minimum, clamp it to minimum to prevent underflow when writing queue mqd. Signed-off-by: David Yat Sin --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 include/uapi/linux/kfd_ioctl.h | 2 ++ 2 files chang

[PATCH 0/5] Fix mode1 reset test failures

2025-02-26 Thread Philip Yang
reset_domain->wq to ensure ongoing mode1 reset is done or user queues are evicted, then free outstanding BOs. Philip Yang (5): drm/amdkfd: Remove kfd_process_hw_exception worker drm/amdkfd: KFD release_work possible circular locking drm/amdkfd: Fix mode1 reset crash issue drm/amdkfd:

[PATCH 1/5] drm/amdkfd: Remove kfd_process_hw_exception worker

2025-02-26 Thread Philip Yang
With GPU reset-domain worker implemented, KFD hw_exception worker is not needed any more, just call amdgpu_amdkfd_gpu_reset directly from kfd_hws_hang. Suggested-by: Felix Kuehling Signed-off-by: Philip Yang Reviewed-by: Lijo Lazar --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c

[PATCH 2/5] drm/amdkfd: KFD release_work possible circular locking

2025-02-26 Thread Philip Yang
pletion)(&p->release_work)); lock((wq_completion)amdgpu-reset-dev); To fix this, KFD create process move flush release work outside kfd_process_mutex. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 16 1 f

[PATCH 3/5] drm/amdkfd: Fix mode1 reset crash issue

2025-02-26 Thread Philip Yang
free outstanding BOs. Signed-off-by: Philip Yang Reviewed-by: Lijo Lazar --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 2715ca53e9da

[PATCH 5/5] drm/amdkfd: debugfs hang_hws skip GPU with MES

2025-02-26 Thread Philip Yang
debugfs hang_hws is used by the GPU reset test with HWS; for MES this crashes the kernel with a NULL pointer access because dqm->packet_mgr is not set up for the MES path. Skip GPUs with MES for now; the MES hang_hws debugfs interface will be supported later. Signed-off-by: Philip Yang Reviewed-by: Kent Russ

[PATCH 4/5] drm/amdkfd: Fix pqm_destroy_queue race with GPU reset

2025-02-26 Thread Philip Yang
If the GPU is in reset, destroy_queue returns -EIO; pqm_destroy_queue should delete the queue from process_queue_list and free the resource. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +- 1 file changed, 1 insertion(+), 1

Re: [PATCH] drm/amdkfd: Correct the postion of reserve and unreserve memory

2025-02-24 Thread Philip Yang
On 2025-02-20 06:59, Emily Deng wrote: Call amdgpu_amdkfd_reserve_mem_limit in svm_range_vram_node_new when creating a new SVM BO. Call amdgpu_amdkfd_unreserve_mem_limit in svm_range_bo_release when the SVM BO is deleted. Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.

Re: [PATCH] drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd

2025-02-18 Thread Philip Yang
On 2025-02-18 12:24, David Yat Sin wrote: When userspace applications call AMDKFD_IOC_UPDATE_QUEUE, preserve bitfields that do not need to be modified, as they contain flags to track queue states that are used by CP FW. Signed-off-by: David Yat Sin --- drivers/gpu/drm/amd/amdkfd/kfd_mqd_mana

Re: [PATCH] drm/amdkfd: Fix Circular Locking Dependency in 'svm_range_cpu_invalidate_pagetables'

2025-02-18 Thread Philip Yang
954] RDX: RSI: RDI: 01200011 [ 223.426965] RBP: R08: R09: [ 223.426975] R10: 7f4675e81a50 R11: 0246 R12: 0001 [ 223.426986] R13: 7fff5c3e5470 R14: 7fff5c3e53e0 R15: 7f

Re: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

2025-02-13 Thread Philip Yang
On 2025-02-12 23:33, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] *From:*Yang, Philip *Sent:* Wednesday, February 12, 2025 10:31 PM *To:* Deng, Emily ; Yang, Philip ; Chen, Xiaogang ; amd-gfx@lists.freedesktop.org *Subject:* Re: [PATCH] drm/amdkfd: Fix the de

Re: [PATCH v2 4/9] drm/amdkfd: Validate user queue buffers

2025-02-12 Thread Philip Yang
On 2025-02-12 17:42, Uwe Kleine-König wrote: #regzbot introduced: 68e599db7a549f010a329515f3508d8a8c3467a4 #regzbot monitor: https://bugs.debian.org/1093124 Hello, On Thu, Jul 18, 2024 at 05:05:53PM -0400, Philip Yang wrote: Find user queue

Re: [PATCH] drm/amdkfd: Fix user queue validation on Gfx7/8

2025-02-12 Thread Philip Yang
ping... On 2025-01-29 19:04, Philip Yang wrote: To work around the queue-full h/w issue on Gfx7/8, when an application creates AQL queues, the ring buffer bo is allocated with size queue_size/2 and mapped to the GPU twice using 2 attachments with the same ring_bo backing memory. For

Re: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

2025-02-12 Thread Philip Yang
On 2025-02-12 03:54, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Ping…… Emily Deng Best Wishes

Re: [PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-02-11 Thread Philip Yang
On 2025-02-11 05:34, Christian König wrote: Am 20.01.25 um 16:59 schrieb Philip Yang: On 2025-01-15 06:01, Christian König wrote: Am 14.01.25 um 15:53 schrieb Philip Yang

Re: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

2025-02-10 Thread Philip Yang
On 2025-02-10 02:51, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]

Re: [PATCH] drm/amdgpu: Set snoop bit for SDMA for MI series

2025-02-07 Thread Philip Yang
loop Modified function names based on review comments. Signed-off-by: Harish Kasiviswanathan with one nitpick fixed, this patch is Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 25 ++ drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c

Re: [PATCH v2] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-02-07 Thread Philip Yang
On 2025-02-07 05:17, Christian König wrote: Am 30.01.25 um 17:19 schrieb Philip Yang: On 2025-01-29 11:40, Christian König wrote: Am 23.01.25 um 21:39 schrieb Philip Yang: SVM migration

Re: [PATCH] drm/amdgpu: Set snoop bit for SDMA for MI series

2025-02-06 Thread Philip Yang
On 2025-02-05 22:07, Kasiviswanathan, Harish wrote: [Public] From: Yang, Philip S

Re: [PATCH] drm/amdgpu: Set snoop bit for SDMA for MI series

2025-02-05 Thread Philip Yang
On 2025-02-04 18:02, Harish Kasiviswanathan wrote: SDMA writes have to probe invalidate RW lines. Set snoop bit in mmhub for this to happen. Signed-off-by: Harish Kasiviswanathan --- drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 25 ++ drivers/gpu

Re: [PATCH 2/2] drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger

2025-01-31 Thread Philip Yang
On 2025-01-30 15:51, Alex Deucher wrote: If the user has configured a large carveout on a small APU, only use GTT for VRAM allocations if GTT is larger than VRAM. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 -- 1 file ch

Re: [PATCH v2] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-30 Thread Philip Yang
On 2025-01-29 11:40, Christian König wrote: Am 23.01.25 um 21:39 schrieb Philip Yang: SVM migration unmaps pages from the GPU and then updates the mapping to the GPU to recover from the page fault. Currently unmap clears the PDE entry for

[PATCH] drm/amdkfd: Fix user queue validation on Gfx7/8

2025-01-29 Thread Philip Yang
allocation and mapping size. Fixes: 68e599db7a54 ("drm/amdkfd: Validate user queue buffers") Suggested-by: Tomáš Trnka Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/

[PATCH v2] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-23 Thread Philip Yang
GPU performance. Update mapping to huge page will still free the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_add_list to catch if unmap to free the PTB. v2: Limit update fragment size, not hack entry_end (Christian) Signed-off-by:

Re: [PATCH] drm/amdkfd: Change page discontinuity handling at svm_migrate_copy_to_vram

2025-01-20 Thread Philip Yang
On 2025-01-15 16:40, Xiaogang.Chen wrote: From: Xiaogang Chen Current svm_migrate_copy_to_vram handles discontinuity of sys pages (src) and dst pages (vram) in different ways. When src got discontinuity migrates j pages that ith page is not migrated; When dst

Re: [PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-20 Thread Philip Yang
On 2025-01-15 06:01, Christian König wrote: Am 14.01.25 um 15:53 schrieb Philip Yang: SVM migration unmaps pages from the GPU and then updates the mapping to the GPU to recover from the page fault. Currently unmap clears the PDE entry for

[PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-14 Thread Philip Yang
the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the PTB. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 4 --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 4 --- d

Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
On 2025-01-10 11:23, Chen, Xiaogang wrote: On 1/10/2025 8:37 AM, Philip Yang wrote: On 2025-01-10 02:49, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-10 Thread Philip Yang
On 2025-01-09 12:14, Felix Kuehling wrote: On 2025-01-08 20:11, Philip Yang wrote: On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal

Re: [PATCH v5] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
On 2025-01-10 09:25, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
hat fixed, this patch is Reviewed-by: Philip Yang Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-08 Thread Philip Yang
On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Philip, It still has the deadlock, maybe the best way is

Re: [PATCH v3] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-08 08:19, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-07 19:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] From: Yan

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-07 Thread Philip Yang
On 2025-01-07 10:50, Chen, Xiaogang wrote: On 1/6/2025 8:02 PM, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-07 Thread Philip Yang
On 2025-01-07 07:30, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Felix, You are right, it is easy to hit a deadlock; I don't know why LOCKDEP doesn't catch this. Need to find another solution. Hi Philip, Do you have a sol

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-07 Thread Philip Yang
On 2025-01-06 21:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] From: Yang, Philip

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-06 Thread Philip Yang
On 2025-01-02 19:06, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

[PATCH 6/6] drm/amdgpu: Show warning message if IH ring overflow

2024-12-13 Thread Philip Yang
*_ih.c except ASICs older than Vega which has only one ih ring. Signed-off-by: Philip Yang Reviewed-by: Christian König Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h | 1 + drivers/gpu/drm/amd/amdgpu/navi10_ih.c

[PATCH 5/6] drm/amdkfd: Improve signal event slow path

2024-12-13 Thread Philip Yang
then driver process the first event interrupt, set_event and event slot is auto-reset, then for the second event interrupt, KFD goes to slow path as event is not signaled, just drop the second event interrupt because the application only need wakeup once. Signed-off-by: Philip Yang Reviewed-by:

[PATCH 3/6] drm/amdgpu: Optimize gfx v9 GPU page fault handling

2024-12-13 Thread Philip Yang
handle the gfx v9 path, cover retry on/off and CAM filter on/off cases. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 10 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++ drivers/gpu/drm/amd/amdkfd/kfd_device.c| 67

[PATCH 4/6] drm/amdkfd: Queue interrupt work to different CPU

2024-12-13 Thread Philip Yang
queue with number of workers equals to number of partitions, let queue_work select the next CPU round robin among the local CPUs of same NUMA. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c| 25 -- drivers/gpu/drm/amd

[PATCH 1/6] drm/amdgpu: Don't enable sdma 4.4.5 CTXEMPTY interrupt

2024-12-13 Thread Philip Yang
-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c index 4c8308b2878b..56507ae919b0 100644 --- a

[PATCH 2/6] drm/amdkfd: KFD interrupt access ih_fifo data in-place

2024-12-13 Thread Philip Yang
To handle 4 to 8 interrupts per second running CPX mode with 4 streams/queues per KFD node, KFD interrupt handler becomes the performance bottleneck. Remove the kfifo_out memcpy overhead by accessing ih_fifo data in-place and updating rptr with kfifo_skip_count. Signed-off-by: Philip

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-22 Thread Philip Yang
On 2024-10-21 04:12, Christian König wrote: Am 18.10.24 um 23:59 schrieb Philip Yang: On 2024-10-18 14:28, Felix Kuehling wrote: On 2024-10-17 04:34, Victor Zhao wrote: make sure

Re: [PATCH] Revert "drm/amdkfd: SMI report dropped event count"

2024-10-21 Thread Philip Yang
On 2024-10-21 13:46, Alex Deucher wrote: This reverts commit a3ab2d45b9887ee609cd3bea39f668236935774c. The userspace side for this code is not ready yet so revert for now. Signed-off-by: Alex Deucher Cc: Philip Yang Reviewed-by: Philip Yang

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-18 Thread Philip Yang
On 2024-10-18 14:28, Felix Kuehling wrote: On 2024-10-17 04:34, Victor Zhao wrote: make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status called, to avoid qcm fence timeout caused by incorrect ord

Re: [PATCH] drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling

2024-10-18 Thread Philip Yang
It is safe to access dqm->sched status inside dqm_lock, no race with gpu reset. Reviewed-by: Philip Yang On 2024-10-18 11:10, Shaoyun Liu wrote: From: shaoyunl Add back kfd queues in start scheduling that were originally removed on stop schedul

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-18 Thread Philip Yang
ordering. Signed-off-by: Victor Zhao Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling

2024-10-17 Thread Philip Yang
On 2024-10-17 12:12, Shaoyun Liu wrote: From: shaoyunl Add back kfd queues in start scheduling that were originally removed on stop scheduling. Signed-off-by: Shaoyun Liu --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 40 +-- 1 file change

[PATCH v3] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
pe because it is updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 26 +++

Re: [PATCH 1/2] drm/amdkfd: Save pdd to svm_bo to replace node

2024-10-11 Thread Philip Yang
Drop this patch series, as Felix pointed out, the forked process takes svm_bo device pages ref, svm_bo->pdd could refer to the process that doesn't exist any more. Regards, Philip On 2024-10-11 11:00, Philip Yang wrote: KFD process device

[PATCH 2/2] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
c64_t because it is updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++ 4

[PATCH 1/2] drm/amdkfd: Save pdd to svm_bo to replace node

2024-10-11 Thread Philip Yang
KFD process device data pdd will be used for VRAM usage accounting; save pdd to svm_bo to avoid searching for the pdd on every accounting update, and get the KFD node from pdd->dev. svm_bo->pdd will always be valid because KFD process release frees all svm_bo first, then destroys the process pdds. Signed-off-by:

Re: [PATCH] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
On 2024-10-09 17:20, Felix Kuehling wrote: On 2024-10-04 16:28, Philip Yang wrote: Per process device data pdd->vram_usage is used by rocm-smi to report VRAM usage, this is currently missing the svm_bo us

[PATCH] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-04 Thread Philip Yang
updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 22 ++ 4 f

[PATCH] drm/amdkfd: Copy wave state only for compute queue

2024-10-03 Thread Philip Yang
get_wave_state is not defined for sdma queues, so copy_context_work_handler crashes when it calls it for an sdma queue. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amdkfd: fix vm-pasid lookup for multiple partitions

2024-09-11 Thread Philip Yang
On 2024-09-11 02:54, Christian König wrote: Yeah, I completely agree with Xiaogang. The PASID is an identifier of an address space. And the idea of the KFD was that we can just use the same address space and with it the page ta

Re: [PATCH] drm/amdkfd: fix vm-pasid lookup for multiple partitions

2024-09-10 Thread Philip Yang
On 2024-09-09 14:46, Christian König wrote: Am 09.09.24 um 18:02 schrieb Kim, Jonathan: [Public] -Original Message- From: Christian König Sent: Thursday, September 5, 202

Re: [PATCH V5] drm/amdgpu: Surface svm_default_granularity, a RW module parameter

2024-09-04 Thread Philip Yang
ff-by: Ramesh Errabolu With 2 below nitpicks fixed, this patch is Reviewed-by: Philip Yang change subject to "drm/amdkfd: Add svm_default_granularity module parameter" --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/am

Re: [PATCH v2] drm/amdgpu: fix a call trace when unload amdgpu driver

2024-09-04 Thread Philip Yang
leased before ttm_resource_manager is finalized, drain the workqueue in ttm_device. v2: move drain_workqueue to amdgpu_ttm.c Fixes: d99fbd9aab62 ("drm/ttm: Always take the bo delayed cleanup path for imported bos") Suggested-by: Christian König Signed-off-by: Asher Song Acked-by: Ph

Re: [PATCH] drm/amdgpu: fix invalid fence handling in amdgpu_vm_tlb_flush

2024-09-04 Thread Philip Yang
On 2024-09-02 05:06, Christian König wrote: Am 02.09.24 um 05:03 schrieb Lang Yu: Fixes: 5a1c27951966 ("drm/amdgpu: implement TLB flush fence") Signed-off-by: Lang Yu Ah yes, that exp

Re: [PATCH V3] drm/amdgpu: Surface svm_default_granularity, a RW module parameter

2024-09-03 Thread Philip Yang
On 2024-08-29 18:31, Chen, Xiaogang wrote: On 8/29/2024 5:13 PM, Ramesh Errabolu wrote: Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or respon

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Philip Yang
On 2024-08-29 17:15, Felix Kuehling wrote: On 2024-08-23 15:49, Philip Yang wrote: If GPU reset kicks in while KFD restore_process_worker is running, this may cause different issues, for example the rcu stall warning below

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Philip Yang
On 2024-08-28 18:01, Felix Kuehling wrote: On 2024-08-23 15:49, Philip Yang wrote: If GPU reset kicks in while KFD restore_process_worker is running, this may cause different issues, for example below rcu stall

Re: [PATCH v2] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-28 Thread Philip Yang
On 2024-08-26 15:34, Ramesh Errabolu wrote: Enables users to update the default size of buffer used in migration either from Sysmem to VRAM or vice versa. The param GOBM refers to granularity of buffer migration, and is specified in terms of log(numPages(buff

[PATCH v3 0/4] Improve SVM migrate event report

2024-08-27 Thread Philip Yang
. v3: Simplify event drop count handling (James Zhu) Philip Yang (4): drm/amdkfd: Document and define SVM events message macro drm/amdkfd: Output migrate end event if migrate failed drm/amdkfd: Increase SMI event fifo size drm/amdkfd: SMI report dropped event count drivers/gpu/drm/amd

[PATCH v3 4/4] drm/amdkfd: SMI report dropped event count

2024-08-27 Thread Philip Yang
and reset drop count to zero. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 25 + include/uapi/linux/kfd_ioctl.h | 6 + 2 files changed, 27 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b

[PATCH] drm/amdkfd: SMI report dropped event count

2024-08-27 Thread Philip Yang
and reset drop count to zero. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 25 + include/uapi/linux/kfd_ioctl.h | 6 + 2 files changed, 27 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b

[PATCH v3 1/4] drm/amdkfd: Document and define SVM events message macro

2024-08-27 Thread Philip Yang
future. No functional changes. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 45 + include/uapi/linux/kfd_ioctl.h | 100 +--- 2 files changed, 109 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd

[PATCH v3 3/4] drm/amdkfd: Increase SMI event fifo size

2024-08-27 Thread Philip Yang
prefix to the macro name. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c index 1d94b445a060..9b8169761ec5

[PATCH v3 2/4] drm/amdkfd: Output migrate end event if migrate failed

2024-08-27 Thread Philip Yang
If page migration failed, also output a migrate end event to match the migrate start event, with the failure error_code added to the end of the migrate message macro. This will not break uAPI because applications using the old message macro with sscanf drop and ignore the error_code. Signed-off-by: Philip Yang

[PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-23 Thread Philip Yang
] Call Trace: update_process_times+0x94/0xd0 RIP: 0010:amdgpu_vm_handle_moved+0x9a/0x210 [amdgpu] amdgpu_amdkfd_gpuvm_restore_process_bos+0x3d6/0x7d0 [amdgpu] restore_process_helper+0x27/0x80 [amdgpu] Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 56

Re: [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size

2024-08-22 Thread Philip Yang
On 2024-08-22 10:34, James Zhu wrote: On 2024-07-30 16:15, Philip Yang wrote: SMI event fifo size 1KB was enough to report GPU vm fault or reset [JZ] There is a typo here. it should be NOT enough

Re: [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro

2024-08-22 Thread Philip Yang
On 2024-08-22 10:32, James Zhu wrote: On 2024-07-30 16:15, Philip Yang wrote: Document how to use SMI system management interface to enable and receive SVM events. Document SVM event triggers. Define SVM events message

Re: [PATCH v6] drm/amdkfd: Change kfd/svm page fault drain handling

2024-08-22 Thread Philip Yang
page faults at deferred work. So, the time period that kfd does not handle page faults is reduced and can be controlled. Signed-off-by: Xiaogang.Chen Some nitpicks below. This patch is Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +- drivers/gpu

Re: [PATCH] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-22 Thread Philip Yang
On 2024-08-21 19:22, Ramesh Errabolu wrote: KFD's design of unified memory (UM) does not allow users to configure the size of buffer used in migrating buffer either from Sysmem to VRAM or vice versa. This is not true, app can change range granularit
