[PATCH] drm/amdkfd: Fix false positive queue buffer free warning

2025-10-10 Thread Philip Yang
("drm/amdkfd: Validate user queue svm memory residency") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_s

Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount

2025-10-03 Thread Philip Yang
On 2025-10-03 17:46, Felix Kuehling wrote: On 2025-10-03 17:18, Philip Yang wrote: On 2025-10-03 17:05, Felix Kuehling wrote: On 2025-09-26 17:03, Philip Yang wrote: zone_device_page_init uses set_page_count to set vram page refcount to 1, there is race if step 2 happens between step 1

Re: [PATCH 2/3] drm/amdkfd: svm unmap use page aligned address

2025-10-03 Thread Philip Yang
On 2025-10-02 18:04, Chen, Xiaogang wrote: On 10/2/2025 12:43 PM, Philip Yang wrote: svm_range_unmap_from_gpus uses page aligned start, end address, the end address is inclusive. Fixes: 38c55f6719f7 ("drm/amdkfd: Handle lack of READ permissions in SVM mapping") Signed-off-by: P

Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount

2025-10-03 Thread Philip Yang
On 2025-10-03 17:05, Felix Kuehling wrote: On 2025-09-26 17:03, Philip Yang wrote: zone_device_page_init uses set_page_count to set vram page refcount to 1, there is race if step 2 happens between step 1 and 3. 1. CPU page fault handler get vram page, migrate the vram page to system page 2

[PATCH v2 2/3] drm/amdkfd: svm unmap use page aligned address

2025-10-03 Thread Philip Yang
svm_range_unmap_from_gpus uses page aligned start, end address, the end address is inclusive. Fixes: 38c55f6719f7 ("drm/amdkfd: Handle lack of READ permissions in SVM mapping") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 7 +++ 1 file changed, 3 insert
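
The page-aligned, inclusive-end convention described above can be sketched in userspace C. This is only an illustration under stated assumptions: DEMO_PAGE_SHIFT, struct demo_page_range and the demo_* helpers are hypothetical names, not the driver's code, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages. */
#define DEMO_PAGE_SHIFT 12

/* Hypothetical range type: both fields are page numbers, last is inclusive. */
struct demo_page_range {
    uint64_t start;
    uint64_t last;
};

/* Convert a byte range [addr, addr + size) to an inclusive page range. */
static struct demo_page_range demo_to_pages(uint64_t addr, uint64_t size)
{
    struct demo_page_range r;

    assert(size > 0); /* an empty range has no last page */
    r.start = addr >> DEMO_PAGE_SHIFT;
    r.last = (addr + size - 1) >> DEMO_PAGE_SHIFT;
    return r;
}

/* With an inclusive end, the page count needs the +1. */
static uint64_t demo_npages(struct demo_page_range r)
{
    return r.last - r.start + 1;
}
```

Mixing an exclusive end with an inclusive one is a typical source of off-by-one-page errors, which is why the sketch keeps the conversion in a single helper.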

[PATCH 1/3] drm/amdgpu: svm check hmm range kzalloc return NULL

2025-10-02 Thread Philip Yang
Add hmm_range kzalloc return NULL error check. In case the get_pages return failed, free and set hmm_range to NULL, to avoid double free in get_pages_done. Fixes: 29e6f5716115 ("drm/amdgpu: use user provided hmm_range buffer in amdgpu_ttm_tt_get_user_pages") Signed-off-by: P

[PATCH 3/3] drm/amdkfd: Don't stuck in svm restore worker

2025-10-02 Thread Philip Yang
ation hangs to wait for queue finish. svm restore work should unmap the memory range from GPUs then resume queues. If GPU page fault happens on the unmapped address, it is application use-after-free bug. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c

[PATCH 2/3] drm/amdkfd: svm unmap use page aligned address

2025-10-02 Thread Philip Yang
svm_range_unmap_from_gpus uses page aligned start, end address, the end address is inclusive. Fixes: 38c55f6719f7 ("drm/amdkfd: Handle lack of READ permissions in SVM mapping") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 ++-- 1 file changed, 2 insert

Re: [PATCH v4 2/2] amd/amdkfd: enhance kfd process check in switch partition

2025-10-01 Thread Philip Yang
esolve the race by adding an atomic kfd_process counter, kfd_processes_count, which is incremented when a kfd process is created and decremented when kfd_process_wq_release finishes. v2: Put kfd_processes_count per kfd_dev, move decrement to kfd_process_destroy_pdds and bug fix. (Philip Yang) [3966658.307702] divide
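
A minimal sketch of such a per-device process counter, using C11 atomics in userspace rather than the kernel's atomic_t. struct demo_dev and the demo_* functions are hypothetical stand-ins, and the partition-switch check is an assumption based on the thread title, not the patch itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; not the real struct kfd_dev. */
struct demo_dev {
    atomic_int processes_count; /* number of live processes on this device */
};

/* Called when a process is created for the device. */
static void demo_process_create(struct demo_dev *dev)
{
    atomic_fetch_add(&dev->processes_count, 1);
}

/* Called when the process's per-device data is destroyed. */
static void demo_process_release(struct demo_dev *dev)
{
    int old = atomic_fetch_sub(&dev->processes_count, 1);

    assert(old > 0); /* a release without a matching create is a bug */
}

/* Assumed check: only allow the switch when no process is left. */
static bool demo_can_switch_partition(struct demo_dev *dev)
{
    return atomic_load(&dev->processes_count) == 0;
}
```

The counter only makes the check itself race-free; whether a new process can appear between the check and the switch depends on locking that this sketch does not model.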

[PATCH] drm/amdkfd: Fix svm_bo and vram page refcount

2025-09-26 Thread Philip Yang
race bug. Add WARN_ONCE to report this issue early because the refcount bug is hard to investigate. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b

Re: [PATCH v4 1/2] amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw

2025-09-24 Thread Philip Yang
.a1i5000.a18.x86_64 #1 Signed-off-by: Yifan Zhang Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c index 349c351e242b

Re: [PATCH] drm/amdkfd: Fix mmap write lock not release

2025-09-20 Thread Philip Yang
On 2025-09-17 19:12, Chen, Xiaogang wrote: On 9/16/2025 5:41 PM, Philip Yang wrote: If mmap write lock is taken while draining retry fault, mmap write lock is not released because svm_range_restore_pages calls mmap_read_unlock then returns. This causes deadlock and system hang later because

[PATCH] drm/amdkfd: Fix mmap write lock not release

2025-09-16 Thread Philip Yang
draining retry fault fix this bug. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 6604a37b304f..fb02ff9ae62a 100644 --- a/drivers/gpu/drm/amd

Re: [PATCH v2 1/3] Revert "drm/amdkfd: return migration pages from copy function"

2025-09-11 Thread Philip Yang
MIGRATE_PFN_MIGRATE bit is cleared if it loses the race. Missing Signed-off-by tag, with tag added, this patch is Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 72 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd

Re: [PATCH v3 3/3] drm/amdkfd: free system struct pages when migration bit is cleared

2025-09-10 Thread Philip Yang
On 2025-09-09 16:43, James Zhu wrote: if destination is on system ram. migrate_vma_pages can fail if a CPU thread faults on the same page. However, the page table is locked and only one of the new pages will be inserted. The device driver will see that the MIGRATE_PFN_MIGRATE bit is cleared if

Re: [PATCH v3 2/3] drm/amdkfd: add function svm_migrate_successful_pages

2025-09-10 Thread Philip Yang
rc[i] & MIGRATE_PFN_MIGRATE) incorrect indent + mpages++; + } this is extra braces, you should see the compile error w/o patch 3/3. with those fixed, this patch is Reviewed-by: Philip Yang } - return upages; + return mpages; } static int

Re: [PATCH v2 3/3] drm/amdkfd: free system struct pages when migration bit is cleared

2025-09-09 Thread Philip Yang
On 2025-09-08 12:15, James Zhu wrote: if destination is on system ram. migrate_vma_pages can fail if a CPU thread faults on the same page. However, the page table is locked and only one of the new pages will be inserted. The device driver will see that the MIGRATE_PFN_MIGRATE bit is cleared if

Re: [PATCH v2 2/3] drm/amdkfd: add function svm_migrate_successful_pages

2025-09-09 Thread Philip Yang
On 2025-09-08 12:15, James Zhu wrote: to get migration pages. dst MIGRATE_PFN_VALID bit and src MIGRATE_PFN_MIGRATE bit should always be set when migration success. cpage includes src MIGRATE_PFN_MIGRATE bit set and MIGRATE_PFN_VALID bit unset pages for both ram and vram when memory is only al

Re: [PATCH] drm/amdgpu: Avoid extra evict-restore process.

2025-07-09 Thread Philip Yang
On 2025-07-08 16:14, Gang Ba wrote: If vm belongs to another process, this is fclose after fork, wait may enable signaling KFD eviction fence and cause parent process queue evicted. Signed-off-by: Gang Ba --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 7 +++ 1 file changed, 7 insertions

Re: [PATCH] drm/amdgpu: delete function amdgpu_flush

2025-07-02 Thread Philip Yang
On 2025-07-01 03:28, Christian König wrote: Clear NAK to removing this! The amdgpu_flush function is vital for correct operation. no fflush call from libdrm/amdgpu, so amdgpu_flush is only called from fclose -> filp_flush The intention is to block closing the file handle in child processes a

Re: [PATCH] drm/amdgpu: delete function amdgpu_flush

2025-06-27 Thread Philip Yang
On 2025-06-27 01:20, YuanShang Mao (River) wrote: [AMD Official Use Only - AMD Internal Distribution Only] Currently, amdgpu_flush is used to prevent new jobs from being submitted in the same context when a file descriptor is closed and to wait for existing jobs to complete. Additionally, if

Re: [PATCH] drm/amdkfd: Don't call mmput from MMU notifier callback

2025-06-24 Thread Philip Yang
On 2025-06-23 18:18, Chen, Xiaogang wrote: On 6/23/2025 11:59 AM, Philip Yang wrote: If the process is exiting, the mmput inside mmu notifier callback from compactd or fork or numa balancing could release the last reference of mm struct to call exit_mmap and free_pgtable, this triggers

[PATCH] drm/amdkfd: Don't call mmput from MMU notifier callback

2025-06-23 Thread Philip Yang
] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x543/0x7d0 [amdgpu] kfd_ioctl_alloc_memory_of_gpu+0x24c/0x4e0 [amdgpu] kfd_ioctl+0x29d/0x500 [amdgpu] Fixes: fa582c6f3684 ("drm/amdkfd: Use mmget_not_zero in MMU notifier") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 23 +++--

Re: [PATCH] drm/amdgpu: Fix SDMA UTC_L1 handling during start/stop sequences

2025-06-16 Thread Philip Yang
On 2025-06-16 07:43, Jesse Zhang wrote: This commit makes two key fixes to SDMA v4.4.2 handling: 1. disable UTC_L1 in sdma_cntl register when stopping SDMA engines by reading the current value before modifying UTC_L1_ENABLE bit. 2. Ensure UTC_L1_ENABLE is consistently managed by: - Ad

Re: [PATCH] drm/amdgpu: Add chain runlists support to GC9.4.2

2025-06-06 Thread Philip Yang
On 2025-06-05 12:11, Amber Lin wrote: Starting from MEC v97, GC 9.4.2 supports chain runlists of XNACK+/XNACK- processes. Signed-off-by: Amber Lin Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 +++ drivers/gpu/drm/amd/amdgpu/gfx_v9_4_2.c | 12

Re: [PATCH] drm/amdkfd: Fix kfd process ref leaking when userptr unmapping

2025-05-28 Thread Philip Yang
urce after application exit. NULL pointer check is also necessary as kfd_lookup_process_by_pid() may return NULL pointer if app process/task is already destroyed. Regards, Philip -Original Message- From: amd-gfx On Behalf Of Philip Yang Sent: Tuesday, May 27, 2025 11:35 AM T

Re: [PATCH 2/2] drm/amdkfd: add svm_migrate_successful_pages

2025-05-28 Thread Philip Yang
On 2025-05-28 13:19, James Zhu wrote: to get migration pages. When migrating pages from system to vram, needn't check bit MIGRATE_PFN_VALID, since the system page could be allocated, but not be accessed. I think the corner case is vram_pages becomes negative value when migrating prange from

Re: [PATCH 1/2] drm/amdkfd: remove unused code

2025-05-28 Thread Philip Yang
On 2025-05-28 13:19, James Zhu wrote: upages is assigned under cpages = 0, so it isn't really used in this function. Signed-off-by: James Zhu Reviewed-by: Philip.Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkf

[PATCH] drm/amdkfd: Fix kfd process ref leaking when userptr unmapping

2025-05-27 Thread Philip Yang
kfd_lookup_process_by_pid increases process ref, the refcount is leaking. Fixes: 7a566d7f56f4 ("amd/amdkfd: Trigger segfault for early userptr unmmapping") Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 +++-- 1 file changed, 7 insert
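
The leak pattern described here (a lookup that takes a reference which the caller never drops) can be sketched as follows. struct demo_process and the demo_* helpers are hypothetical; they only mirror the general lookup/put discipline, not the KFD functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process object with a plain reference count. */
struct demo_process {
    int pid;
    int refcount;
};

/* Lookup takes a reference on success; the caller must drop it. */
static struct demo_process *demo_lookup(struct demo_process *table,
                                        size_t n, int pid)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].pid == pid) {
            table[i].refcount++;
            return &table[i];
        }
    }
    return NULL; /* process already gone */
}

static void demo_put(struct demo_process *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Balanced user: NULL check first, then drop the reference on every path. */
static int demo_notify(struct demo_process *table, size_t n, int pid)
{
    struct demo_process *p = demo_lookup(table, n, pid);

    if (!p)
        return -1;
    /* ... use p here ... */
    demo_put(p);
    return 0;
}
```

Omitting the demo_put call in demo_notify would leave the count one higher after each call, which is the kind of leak the patch title refers to.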

Re: [PATCH] amd/amdkfd: fix a kfd_process ref leak

2025-05-26 Thread Philip Yang
On 2025-05-21 06:12, Yifan Zhang wrote: This patch is to fix a kfd_process ref leak. Signed-off-by: Yifan Zhang Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_events.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_events.c b/drivers/gpu

Re: [PATCH 3/3] drm/amdkfd: destroy_pdds release pdd->drm_file at end

2025-05-16 Thread Philip Yang
On 2025-05-15 17:31, Chen, Xiaogang wrote: On 5/15/2025 3:45 PM, Philip Yang wrote: On 2025-05-15 10:29, Chen, Xiaogang wrote: Does this patch fix a bug or just make code look more reasonable? kfd_process_destroy_pdds releases pdd related buffers, not related to operations on vm. So vm

Re: [PATCH 3/3] drm/amdkfd: destroy_pdds release pdd->drm_file at end

2025-05-15 Thread Philip Yang
ently, as fput(pdd->drm_file) to free vm is right between free vm mapping qpd->cwsr_mem, qpd->ib_mem and free kernel bo qpd->proc_doorbells, pdd->proc_ctx_bo, to make it clear for future change. Regards, Philip Regards Xiaogang On 5/14/2025 12:10 PM, Philip Yang wrote: Relea

Re: [PATCH 2/3] drm/amdgpu: amdgpu_vm_fini hold vm lock to access vm->va

2025-05-15 Thread Philip Yang
On 2025-05-15 10:40, Chen, Xiaogang wrote: On 5/14/2025 12:10 PM, Philip Yang wrote: Move vm root bo unreserve after vm->va mapping free because we should hold vm lock to access vm->va. Signed-off-by: Philip Yang ---   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8   1 file chan

[PATCH 3/3] drm/amdkfd: destroy_pdds release pdd->drm_file at end

2025-05-14 Thread Philip Yang
Release pdd->drm_file may free the vm if this is the last reference, move it to the last step after memory is unmapped. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/

[PATCH 1/3] drm/amdgpu: seq64 memory unmap uses uninterruptible lock

2025-05-14 Thread Philip Yang
g. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c index 3939761be31c..d45ebfb642ca 100644 --- a/drivers/gpu/drm/

[PATCH 0/3] Remove process exit error message

2025-05-14 Thread Philip Yang
This series fixes the dmesg error message "still active bo inside vm" and 2 potential races during process exit and vm cleanup. Philip Yang (3): drm/amdgpu: seq64 memory unmap uses uninterruptible lock drm/amdgpu: amdgpu_vm_fini hold vm lock to access vm->va drm/amdkfd: destroy_pdd

[PATCH 2/3] drm/amdgpu: amdgpu_vm_fini hold vm lock to access vm->va

2025-05-14 Thread Philip Yang
Move vm root bo unreserve after vm->va mapping free because we should hold vm lock to access vm->va. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_v

[PATCH] drm/amdgpu: csa unmap use uninterruptible lock

2025-05-07 Thread Philip Yang
+0x217/0x3c0 do_group_exit+0x3b/0xb0 get_signal+0x14a/0x8d0 arch_do_signal_or_restart+0xde/0x100 exit_to_user_mode_loop+0xc1/0x1a0 exit_to_user_mode_prepare+0xf4/0x100 syscall_exit_to_user_mode+0x17/0x40 do_syscall_64+0x69/0xc0 Signed-off-by: Philip Yang --- drivers/gpu/drm/amd

Re: [PATCH] drm/amdkfd: Fix some kfd related recover issues

2025-04-17 Thread Philip Yang
On 2025-03-21 19:35, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] -Original Message- From: Lazar, Lijo Sent: Friday, March 21, 2025 7:06 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject

Re: [PATCH] drm/amdgpu: Fix missing drain retry fault the last entry

2025-03-04 Thread Philip Yang
On 2025-03-03 19:44, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Ping.. Emily Deng Best Wishes -Original Message- From: Emily Deng Sent: Monday, March 3, 2025 5:35 PM To: amd-gfx@lists.f

Re: [PATCH] drm/amdkfd: Fix NULL Pointer Dereference in KFD queue

2025-02-28 Thread Philip Yang
idate queue cwsr area and eop buffer size") This patch is Reviewed-by: Philip Yang Signed-off-by: Andrew Martin --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/

Re: [PATCH] drm/amdkfd: clamp queue size to minimum

2025-02-26 Thread Philip Yang
On 2025-02-25 21:41, David Yat Sin wrote: If queue size is less than minimum, clamp it to minimum to prevent underflow when writing queue mqd. Signed-off-by: David Yat Sin --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 include/uapi/linux/kfd_ioctl.h | 2 ++ 2 files chang
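
To see why clamping matters, here is a hedged sketch. DEMO_MIN_QUEUE_SIZE and the "log2 of the dword count, minus one" encoding are assumptions chosen for illustration; the excerpt does not show the real constant or the exact MQD field layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimum, in bytes; the real constant is not shown above. */
#define DEMO_MIN_QUEUE_SIZE 1024u

static uint32_t demo_clamp_queue_size(uint32_t size_bytes)
{
    return size_bytes < DEMO_MIN_QUEUE_SIZE ? DEMO_MIN_QUEUE_SIZE
                                            : size_bytes;
}

/* Assumed encoding for illustration: floor(log2(size in dwords)) - 1. */
static uint32_t demo_encode_queue_size(uint32_t size_bytes)
{
    uint32_t dwords = size_bytes / 4;
    uint32_t order = 0;

    while ((UINT32_C(1) << (order + 1)) <= dwords)
        order++;
    return order - 1; /* wraps to UINT32_MAX when order is 0 */
}

/* Clamp first, then encode, so the subtraction cannot wrap. */
static uint32_t demo_queue_size_field(uint32_t requested_bytes)
{
    uint32_t field = demo_encode_queue_size(
        demo_clamp_queue_size(requested_bytes));

    assert(field < 32);
    return field;
}
```

Under these assumptions, a requested size of 0 encodes to UINT32_MAX without the clamp and to a small valid value with it.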

[PATCH 0/5] Fix mode1 reset test failures

2025-02-26 Thread Philip Yang
reset_domain->wq to ensure ongoing mode1 reset is done or user queues are evicted, then free outstanding BOs. Philip Yang (5): drm/amdkfd: Remove kfd_process_hw_exception worker drm/amdkfd: KFD release_work possible circular locking drm/amdkfd: Fix mode1 reset crash issue drm/amdkfd:

[PATCH 1/5] drm/amdkfd: Remove kfd_process_hw_exception worker

2025-02-26 Thread Philip Yang
With GPU reset-domain worker implemented, KFD hw_exception worker is not needed any more, just call amdgpu_amdkfd_gpu_reset directly from kfd_hws_hang. Suggested-by: Felix Kuehling Signed-off-by: Philip Yang Reviewed-by: Lijo Lazar --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c

[PATCH 2/5] drm/amdkfd: KFD release_work possible circular locking

2025-02-26 Thread Philip Yang
pletion)(&p->release_work)); lock((wq_completion)amdgpu-reset-dev); To fix this, KFD create process move flush release work outside kfd_process_mutex. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 16 1 f

[PATCH 3/5] drm/amdkfd: Fix mode1 reset crash issue

2025-02-26 Thread Philip Yang
free outstanding BOs. Signed-off-by: Philip Yang Reviewed-by: Lijo Lazar --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 2715ca53e9da

[PATCH 5/5] drm/amdkfd: debugfs hang_hws skip GPU with MES

2025-02-26 Thread Philip Yang
debugfs hang_hws is used by the GPU reset test with HWS. For MES, this crashes the kernel with a NULL pointer access because dqm->packet_mgr is not set up for the MES path. Skip GPUs with MES for now; the MES hang_hws debugfs interface will be supported later. Signed-off-by: Philip Yang Reviewed-by: Kent Russ

[PATCH 4/5] drm/amdkfd: Fix pqm_destroy_queue race with GPU reset

2025-02-26 Thread Philip Yang
If the GPU is in reset, destroy_queue returns -EIO; pqm_destroy_queue should delete the queue from process_queue_list and free the resource. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +- 1 file changed, 1 insertion(+), 1

Re: [PATCH] drm/amdkfd: Correct the postion of reserve and unreserve memory

2025-02-24 Thread Philip Yang
On 2025-02-20 06:59, Emily Deng wrote: Call amdgpu_amdkfd_reserve_mem_limit in svm_range_vram_node_new when creating a new SVM BO. Call amdgpu_amdkfd_unreserve_mem_limit in svm_range_bo_release when the SVM BO is deleted. Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.

Re: [PATCH] drm/amdkfd: Preserve cp_hqd_pq_control on update_mqd

2025-02-18 Thread Philip Yang
On 2025-02-18 12:24, David Yat Sin wrote: When userspace applications call AMDKFD_IOC_UPDATE_QUEUE. Preserve bitfields that do not need to be modified as they contain flags to track queue states that are used by CP FW. Signed-off-by: David Yat Sin --- drivers/gpu/drm/amd/amdkfd/kfd_mqd_mana

Re: [PATCH] drm/amdkfd: Fix Circular Locking Dependency in 'svm_range_cpu_invalidate_pagetables'

2025-02-18 Thread Philip Yang
954] RDX: RSI: RDI: 01200011 [ 223.426965] RBP: R08: R09: [ 223.426975] R10: 7f4675e81a50 R11: 0246 R12: 0001 [ 223.426986] R13: 7fff5c3e5470 R14: 7fff5c3e53e0 R15: 7f

Re: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

2025-02-13 Thread Philip Yang
On 2025-02-12 23:33, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] From: Yang, Philip Sent: Wednesday, February 12, 2025 10:31 PM To: Deng, Emily ; Yang, Philip ; Chen, Xiaogang ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amdkfd: Fix the de

Re: [PATCH v2 4/9] drm/amdkfd: Validate user queue buffers

2025-02-12 Thread Philip Yang
On 2025-02-12 17:42, Uwe Kleine-König wrote: #regzbot introduced: 68e599db7a549f010a329515f3508d8a8c3467a4 #regzbot monitor: https://bugs.debian.org/1093124 Hello, On Thu, Jul 18, 2024 at 05:05:53PM -0400, Philip Yang wrote: Find user queue

Re: [PATCH] drm/amdkfd: Fix user queue validation on Gfx7/8

2025-02-12 Thread Philip Yang
ping... On 2025-01-29 19:04, Philip Yang wrote: To work around the queue full h/w issue on Gfx7/8, when an application creates AQL queues, the ring buffer bo allocation size is queue_size/2 and it is mapped to the GPU twice using 2 attachments with the same ring_bo backing memory. For

Re: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

2025-02-12 Thread Philip Yang
On 2025-02-12 03:54, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Ping……   Emily Deng Best Wishes

Re: [PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-02-11 Thread Philip Yang
On 2025-02-11 05:34, Christian König wrote: Am 20.01.25 um 16:59 schrieb Philip Yang: On 2025-01-15 06:01, Christian König wrote: Am 14.01.25 um 15:53 schrieb Philip Yang

Re: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

2025-02-10 Thread Philip Yang
On 2025-02-10 02:51, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]

Re: [PATCH] drm/amdgpu: Set snoop bit for SDMA for MI series

2025-02-07 Thread Philip Yang
loop Modified function names based on review comments. Signed-off-by: Harish Kasiviswanathan with one nitpick fixed, this patch is Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 25 ++ drivers/gpu/drm/amd/amdgpu/mmhub_v1_8.c

Re: [PATCH v2] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-02-07 Thread Philip Yang
On 2025-02-07 05:17, Christian König wrote: Am 30.01.25 um 17:19 schrieb Philip Yang: On 2025-01-29 11:40, Christian König wrote: Am 23.01.25 um 21:39 schrieb Philip Yang: SVM migration

Re: [PATCH] drm/amdgpu: Set snoop bit for SDMA for MI series

2025-02-06 Thread Philip Yang
On 2025-02-05 22:07, Kasiviswanathan, Harish wrote: [Public]     From: Yang, Philip S

Re: [PATCH] drm/amdgpu: Set snoop bit for SDMA for MI series

2025-02-05 Thread Philip Yang
On 2025-02-04 18:02, Harish Kasiviswanathan wrote: SDMA writes have to probe invalidate RW lines. Set snoop bit in mmhub for this to happen. Signed-off-by: Harish Kasiviswanathan --- drivers/gpu/drm/amd/amdgpu/mmhub_v1_7.c | 25 ++ drivers/gpu

Re: [PATCH 2/2] drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger

2025-01-31 Thread Philip Yang
On 2025-01-30 15:51, Alex Deucher wrote: If the user has configured a large carveout on a small APU, only use GTT for VRAM allocations if GTT is larger than VRAM. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 -- 1 file ch

Re: [PATCH v2] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-30 Thread Philip Yang
On 2025-01-29 11:40, Christian König wrote: Am 23.01.25 um 21:39 schrieb Philip Yang: SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for

[PATCH] drm/amdkfd: Fix user queue validation on Gfx7/8

2025-01-29 Thread Philip Yang
allocation and mapping size. Fixes: 68e599db7a54 ("drm/amdkfd: Validate user queue buffers") Suggested-by: Tomáš Trnka Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/

[PATCH v2] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-23 Thread Philip Yang
GPU performance. Update mapping to huge page will still free the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_add_list to catch if unmap to free the PTB. v2: Limit update fragment size, not hack entry_end (Christian) Signed-off-by:

Re: [PATCH] drm/amdkfd: Change page discontinuity handling at svm_migrate_copy_to_vram

2025-01-20 Thread Philip Yang
On 2025-01-15 16:40, Xiaogang.Chen wrote: From: Xiaogang Chen Current svm_migrate_copy_to_vram handles sys pages(src) and dst pages (vram) discontinuation in different way. When src got discontinuity migrates j pages that ith page is not migrated; When dst

Re: [PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-20 Thread Philip Yang
On 2025-01-15 06:01, Christian König wrote: Am 14.01.25 um 15:53 schrieb Philip Yang: SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for

[PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-14 Thread Philip Yang
the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the PTB. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 4 --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 4 --- d

Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
On 2025-01-10 11:23, Chen, Xiaogang wrote: On 1/10/2025 8:37 AM, Philip Yang wrote: On 2025-01-10 02:49, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-10 Thread Philip Yang
On 2025-01-09 12:14, Felix Kuehling wrote: On 2025-01-08 20:11, Philip Yang wrote: On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal

Re: [PATCH v5] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
On 2025-01-10 09:25, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m
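
The point about scanning all npages entries rather than only the first cpages can be illustrated with a small sketch. DEMO_PFN_VALID and DEMO_PFN_MIGRATE are hypothetical flag values standing in for the kernel's migrate flags, and demo_count_migratable is not a driver function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, standing in for the kernel's migrate flags. */
#define DEMO_PFN_VALID   (1UL << 0)
#define DEMO_PFN_MIGRATE (1UL << 1)

/*
 * Count the pages selected for migration. In a partial migration the
 * selected pages can sit anywhere in the array, so the loop must cover
 * all npages slots rather than stop after the first cpages entries.
 */
static size_t demo_count_migratable(const unsigned long *src, size_t npages)
{
    size_t count = 0;

    assert(src != NULL || npages == 0);
    for (size_t i = 0; i < npages; i++) {
        if (src[i] & DEMO_PFN_MIGRATE)
            count++;
    }
    return count;
}
```

For example, with four slots of which only the first and third are selected, a loop bounded by the count of 2 would inspect slots 0 and 1 and miss slot 2.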

Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
hat fixed, this patch is Reviewed-by: Philip Yang Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-08 Thread Philip Yang
On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Philip, It still has the deadlock, maybe the best way is

Re: [PATCH v3] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-08 08:19, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-07 19:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]     From: Yan

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-07 Thread Philip Yang
On 2025-01-07 10:50, Chen, Xiaogang wrote: On 1/6/2025 8:02 PM, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-07 Thread Philip Yang
On 2025-01-07 07:30, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Felix, You are right, it is easy to hit a deadlock; I don't know why LOCKDEP doesn't catch this. Need to find another solution. Hi Philip, Do you have a sol

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-07 Thread Philip Yang
On 2025-01-06 21:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]   From: Yang, Philip

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-06 Thread Philip Yang
On 2025-01-02 19:06, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

[PATCH 6/6] drm/amdgpu: Show warning message if IH ring overflow

2024-12-13 Thread Philip Yang
*_ih.c except ASICs older than Vega which has only one ih ring. Signed-off-by: Philip Yang Reviewed-by: Christian König Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h | 1 + drivers/gpu/drm/amd/amdgpu/navi10_ih.c

[PATCH 5/6] drm/amdkfd: Improve signal event slow path

2024-12-13 Thread Philip Yang
then driver process the first event interrupt, set_event and event slot is auto-reset, then for the second event interrupt, KFD goes to slow path as event is not signaled, just drop the second event interrupt because the application only need wakeup once. Signed-off-by: Philip Yang Reviewed-by:

[PATCH 3/6] drm/amdgpu: Optimize gfx v9 GPU page fault handling

2024-12-13 Thread Philip Yang
handle the gfx v9 path, cover retry on/off and CAM filter on/off cases. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 10 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++ drivers/gpu/drm/amd/amdkfd/kfd_device.c| 67

[PATCH 4/6] drm/amdkfd: Queue interrupt work to different CPU

2024-12-13 Thread Philip Yang
queue with number of workers equals to number of partitions, let queue_work select the next CPU round robin among the local CPUs of same NUMA. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c| 25 -- drivers/gpu/drm/amd

[PATCH 1/6] drm/amdgpu: Don't enable sdma 4.4.5 CTXEMPTY interrupt

2024-12-13 Thread Philip Yang
-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c index 4c8308b2878b..56507ae919b0 100644 --- a

[PATCH 2/6] drm/amdkfd: KFD interrupt access ih_fifo data in-place

2024-12-13 Thread Philip Yang
To handle 4 to 8 interrupts per second running CPX mode with 4 streams/queues per KFD node, KFD interrupt handler becomes the performance bottleneck. Remove the kfifo_out memcpy overhead by accessing ih_fifo data in-place and updating rptr with kfifo_skip_count. Signed-off-by: Philip
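
The idea of reading entries in place and then advancing the read pointer, instead of copying each entry out, can be sketched with a tiny fixed-size ring. This is not the kernel kfifo API; struct demo_ring, the entry size and the demo_* helpers are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ENTRY_DWORDS 8
#define DEMO_RING_ENTRIES 4 /* power of two, so index wrap stays consistent */

/* Hypothetical ring of fixed-size interrupt entries. */
struct demo_ring {
    uint32_t data[DEMO_RING_ENTRIES][DEMO_ENTRY_DWORDS];
    unsigned int wptr; /* total entries written */
    unsigned int rptr; /* total entries consumed */
};

/* Producer side: copy one entry into the ring, or fail if it is full. */
static int demo_ring_push(struct demo_ring *r, const uint32_t *entry)
{
    uint32_t *slot;

    if (r->wptr - r->rptr == DEMO_RING_ENTRIES)
        return -1;
    slot = r->data[r->wptr % DEMO_RING_ENTRIES];
    for (size_t i = 0; i < DEMO_ENTRY_DWORDS; i++)
        slot[i] = entry[i];
    r->wptr++;
    return 0;
}

/* Consumer side: return a pointer into the ring, no copy. */
static const uint32_t *demo_ring_peek(const struct demo_ring *r)
{
    if (r->wptr == r->rptr)
        return NULL;
    return r->data[r->rptr % DEMO_RING_ENTRIES];
}

/* Release the entry returned by demo_ring_peek by advancing rptr. */
static void demo_ring_skip(struct demo_ring *r)
{
    assert(r->wptr != r->rptr);
    r->rptr++;
}
```

The consumer has to finish with the peeked pointer before calling demo_ring_skip, because the producer may reuse the slot afterwards.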

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-22 Thread Philip Yang
On 2024-10-21 04:12, Christian König wrote: Am 18.10.24 um 23:59 schrieb Philip Yang: On 2024-10-18 14:28, Felix Kuehling wrote: On 2024-10-17 04:34, Victor Zhao wrote: make sure

Re: [PATCH] Revert "drm/amdkfd: SMI report dropped event count"

2024-10-21 Thread Philip Yang
On 2024-10-21 13:46, Alex Deucher wrote: This reverts commit a3ab2d45b9887ee609cd3bea39f668236935774c. The userspace side for this code is not ready yet so revert for now. Signed-off-by: Alex Deucher Cc: Philip Yang Reviewed-by: Philip Yang

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-18 Thread Philip Yang
On 2024-10-18 14:28, Felix Kuehling wrote: On 2024-10-17 04:34, Victor Zhao wrote: make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status called, to avoid qcm fence timeout caused by incorrect ord

Re: [PATCH] drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling

2024-10-18 Thread Philip Yang
It is safe to access dqm->sched status inside dqm_lock, no race with gpu reset. Reviewed-by: Philip Yang On 2024-10-18 11:10, Shaoyun Liu wrote: From: shaoyunl Add back kfd queues in start scheduling that were originally removed on stop schedul

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-18 Thread Philip Yang
ordering. Signed-off-by: Victor Zhao Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling

2024-10-17 Thread Philip Yang
On 2024-10-17 12:12, Shaoyun Liu wrote: From: shaoyunl Add back kfd queues in start scheduling that were originally removed on stop scheduling. Signed-off-by: Shaoyun Liu --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 40 +-- 1 file change

[PATCH v3] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
pe because it is updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 26 +++

Re: [PATCH 1/2] drm/amdkfd: Save pdd to svm_bo to replace node

2024-10-11 Thread Philip Yang
Drop this patch series, as Felix pointed out, the forked process takes svm_bo device pages ref, svm_bo->pdd could refer to the process that doesn't exist any more. Regards, Philip On 2024-10-11 11:00, Philip Yang wrote: KFD process device

[PATCH 2/2] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
c64_t because it is updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++ 4

[PATCH 1/2] drm/amdkfd: Save pdd to svm_bo to replace node

2024-10-11 Thread Philip Yang
KFD process device data pdd will be used for VRAM usage accounting, save pdd to svm_bo to avoid searching pdd for every accounting, and get KFD node from pdd->dev. svm_bo->pdd will always be valid because KFD process release free all svm_bo first, then destroy process pdds. Signed-off-by:

Re: [PATCH] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
On 2024-10-09 17:20, Felix Kuehling wrote: On 2024-10-04 16:28, Philip Yang wrote: Per process device data pdd->vram_usage is used by rocm-smi to report VRAM usage, this is currently missing the svm_bo us

[PATCH] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-04 Thread Philip Yang
updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 22 ++ 4 f

[PATCH] drm/amdkfd: Copy wave state only for compute queue

2024-10-03 Thread Philip Yang
get_wave_state is not defined for SDMA queues, so copy_context_work_handler crashes when it calls it for an SDMA queue. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amdkfd: fix vm-pasid lookup for multiple partitions

2024-09-11 Thread Philip Yang
On 2024-09-11 02:54, Christian König wrote: Yeah, I completely agree with Xiaogang. The PASID is an identifier of an address space. And the idea of the KFD was that we can just use the same address space and with it the page ta

Re: [PATCH] drm/amdkfd: fix vm-pasid lookup for multiple partitions

2024-09-10 Thread Philip Yang
On 2024-09-09 14:46, Christian König wrote: Am 09.09.24 um 18:02 schrieb Kim, Jonathan: [Public] -Original Message- From: Christian König Sent: Thursday, September 5, 202
