[PATCH] drm/amdgpu: Unlocked unmap only clear page table leaves

2025-01-14 Thread Philip Yang
the PTB bo. With this change, the vm->pt_freed list and work is not needed. Add WARN_ON(unlocked) in amdgpu_vm_pt_free_dfs to catch if unmap to free the PTB. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c| 4 --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h| 4 --- d

Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
On 2025-01-10 11:23, Chen, Xiaogang wrote: On 1/10/2025 8:37 AM, Philip Yang wrote: On 2025-01-10 02:49, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-10 Thread Philip Yang
On 2025-01-09 12:14, Felix Kuehling wrote: On 2025-01-08 20:11, Philip Yang wrote: On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal

Re: [PATCH v5] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
On 2025-01-10 09:25, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

Re: [PATCH v4] drm/amdkfd: Fix partial migrate issue

2025-01-10 Thread Philip Yang
hat fixed, this patch is Reviewed-by: Philip Yang Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-08 Thread Philip Yang
On 2025-01-07 22:08, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Philip, It still has the deadlock, maybe the best way is

Re: [PATCH v3] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-08 08:19, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-08 Thread Philip Yang
On 2025-01-07 19:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] From: Yan

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-07 Thread Philip Yang
On 2025-01-07 10:50, Chen, Xiaogang wrote: On 1/6/2025 8:02 PM, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only]

Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

2025-01-07 Thread Philip Yang
On 2025-01-07 07:30, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] Hi Felix, You are right, it is easy to hit a deadlock, don't know why LOCKDEP doesn't catch this. Need to find another solution. Hi Philip, Do you have a sol

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-07 Thread Philip Yang
On 2025-01-06 21:31, Deng, Emily wrote: [AMD Official Use Only - AMD Internal Distribution Only] From: Yang, Philip

Re: [PATCH] drm/amdkfd: Fix partial migrate issue

2025-01-06 Thread Philip Yang
On 2025-01-02 19:06, Emily Deng wrote: For partial migrate from ram to vram, the migrate->cpages is not equal to migrate->npages, should use migrate->npages to check all needed migrate pages which could be copied or not. And only need to set those pages could be m

[PATCH 6/6] drm/amdgpu: Show warning message if IH ring overflow

2024-12-13 Thread Philip Yang
*_ih.c except ASICs older than Vega which has only one ih ring. Signed-off-by: Philip Yang Reviewed-by: Christian König Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h | 1 + drivers/gpu/drm/amd/amdgpu/navi10_ih.c

[PATCH 5/6] drm/amdkfd: Improve signal event slow path

2024-12-13 Thread Philip Yang
then driver process the first event interrupt, set_event and event slot is auto-reset, then for the second event interrupt, KFD goes to slow path as event is not signaled, just drop the second event interrupt because the application only need wakeup once. Signed-off-by: Philip Yang Reviewed-by:

[PATCH 3/6] drm/amdgpu: Optimize gfx v9 GPU page fault handling

2024-12-13 Thread Philip Yang
handle the gfx v9 path, cover retry on/off and CAM filter on/off cases. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 10 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++ drivers/gpu/drm/amd/amdkfd/kfd_device.c| 67

[PATCH 4/6] drm/amdkfd: Queue interrupt work to different CPU

2024-12-13 Thread Philip Yang
queue with number of workers equals to number of partitions, let queue_work select the next CPU round robin among the local CPUs of same NUMA. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c| 25 -- drivers/gpu/drm/amd

[PATCH 1/6] drm/amdgpu: Don't enable sdma 4.4.5 CTXEMPTY interrupt

2024-12-13 Thread Philip Yang
-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c index 4c8308b2878b..56507ae919b0 100644 --- a

[PATCH 2/6] drm/amdkfd: KFD interrupt access ih_fifo data in-place

2024-12-13 Thread Philip Yang
To handle 4 to 8 interrupts per second running CPX mode with 4 streams/queues per KFD node, KFD interrupt handler becomes the performance bottleneck. Remove the kfifo_out memcpy overhead by accessing ih_fifo data in-place and updating rptr with kfifo_skip_count. Signed-off-by: Philip

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-22 Thread Philip Yang
On 2024-10-21 04:12, Christian König wrote: Am 18.10.24 um 23:59 schrieb Philip Yang: On 2024-10-18 14:28, Felix Kuehling wrote: On 2024-10-17 04:34, Victor Zhao wrote: make sure

Re: [PATCH] Revert "drm/amdkfd: SMI report dropped event count"

2024-10-21 Thread Philip Yang
On 2024-10-21 13:46, Alex Deucher wrote: This reverts commit a3ab2d45b9887ee609cd3bea39f668236935774c. The userspace side for this code is not ready yet so revert for now. Signed-off-by: Alex Deucher Cc: Philip Yang Reviewed-by: Philip Yang

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-18 Thread Philip Yang
On 2024-10-18 14:28, Felix Kuehling wrote: On 2024-10-17 04:34, Victor Zhao wrote: make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status called, to avoid qcm fence timeout caused by incorrect ord

Re: [PATCH] drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling

2024-10-18 Thread Philip Yang
It is safe to access dqm->sched status inside dqm_lock, no race with gpu reset. Reviewed-by: Philip Yang On 2024-10-18 11:10, Shaoyun Liu wrote: From: shaoyunl Add back kfd queues in start scheduling that originally been removed on stop schedul

Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

2024-10-18 Thread Philip Yang
ordering. Signed-off-by: Victor Zhao Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling

2024-10-17 Thread Philip Yang
On 2024-10-17 12:12, Shaoyun Liu wrote: From: shaoyunl Add back kfd queues in start scheduling that originally been removed on stop scheduling. Signed-off-by: Shaoyun Liu --- .../drm/amd/amdkfd/kfd_device_queue_manager.c | 40 +-- 1 file change

[PATCH v3] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
pe because it is updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 26 +++

Re: [PATCH 1/2] drm/amdkfd: Save pdd to svm_bo to replace node

2024-10-11 Thread Philip Yang
Drop this patch series, as Felix pointed out, the forked process takes svm_bo device pages ref, svm_bo->pdd could refer to the process that doesn't exist any more. Regards, Philip On 2024-10-11 11:00, Philip Yang wrote: KFD process device

[PATCH 2/2] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
c64_t because it is updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++ 4

[PATCH 1/2] drm/amdkfd: Save pdd to svm_bo to replace node

2024-10-11 Thread Philip Yang
KFD process device data pdd will be used for VRAM usage accounting, save pdd to svm_bo to avoid searching pdd for every accounting, and get KFD node from pdd->dev. svm_bo->pdd will always be valid because KFD process release free all svm_bo first, then destroy process pdds. Signed-off-by:

Re: [PATCH] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-11 Thread Philip Yang
On 2024-10-09 17:20, Felix Kuehling wrote: On 2024-10-04 16:28, Philip Yang wrote: Per process device data pdd->vram_usage is used by rocm-smi to report VRAM usage, this is currently missing the svm_bo us

[PATCH] drm/amdkfd: Accounting pdd vram_usage for svm

2024-10-04 Thread Philip Yang
updated outside process mutex now. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 +++--- drivers/gpu/drm/amd/amdkfd/kfd_priv.h| 2 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 ++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 22 ++ 4 f

[PATCH] drm/amdkfd: Copy wave state only for compute queue

2024-10-03 Thread Philip Yang
get_wave_state is not defined for sdma queue, copy_context_work_handler calls it for sdma queue will crash. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amdkfd: fix vm-pasid lookup for multiple partitions

2024-09-11 Thread Philip Yang
On 2024-09-11 02:54, Christian König wrote: Yeah, I completely agree with Xiaogang. The PASID is an identifier of an address space. And the idea of the KFD was that we can just use the same address space and with it the page ta

Re: [PATCH] drm/amdkfd: fix vm-pasid lookup for multiple partitions

2024-09-10 Thread Philip Yang
On 2024-09-09 14:46, Christian König wrote: Am 09.09.24 um 18:02 schrieb Kim, Jonathan: [Public] -Original Message- From: Christian König Sent: Thursday, September 5, 202

Re: [PATCH V5] drm/amdgpu: Surface svm_default_granularity, a RW module parameter

2024-09-04 Thread Philip Yang
ff-by: Ramesh Errabolu With 2 below nitpicks fixed, this patch is Reviewed-by: Philip Yang change subject to "drm/amdkfd: Add svm_default_granularity module parameter" --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/am

Re: [PATCH v2] drm/amdgpu: fix a call trace when unload amdgpu driver

2024-09-04 Thread Philip Yang
leased before ttm_resource_manager is finalized, drain the workqueue in ttm_device. v2: move drain_workqueue to amdgpu_ttm.c Fixes: d99fbd9aab62 ("drm/ttm: Always take the bo delayed cleanup path for imported bos") Suggested-by: Christian König Signed-off-by: Asher Song Acked-by: Ph

Re: [PATCH] drm/amdgpu: fix invalid fence handling in amdgpu_vm_tlb_flush

2024-09-04 Thread Philip Yang
On 2024-09-02 05:06, Christian König wrote: Am 02.09.24 um 05:03 schrieb Lang Yu: Fixes: 5a1c27951966 ("drm/amdgpu: implement TLB flush fence") Signed-off-by: Lang Yu Ah yes, that exp

Re: [PATCH V3] drm/amdgpu: Surface svm_default_granularity, a RW module parameter

2024-09-03 Thread Philip Yang
On 2024-08-29 18:31, Chen, Xiaogang wrote: On 8/29/2024 5:13 PM, Ramesh Errabolu wrote: Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or respon

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Philip Yang
On 2024-08-29 17:15, Felix Kuehling wrote: On 2024-08-23 15:49, Philip Yang wrote: If GPU reset kicks in while KFD restore_process_worker running, this may cause different issues, for example below rcu stall warning

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Philip Yang
On 2024-08-28 18:01, Felix Kuehling wrote: On 2024-08-23 15:49, Philip Yang wrote: If GPU reset kicks in while KFD restore_process_worker running, this may cause different issues, for example below rcu stall

Re: [PATCH v2] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-28 Thread Philip Yang
On 2024-08-26 15:34, Ramesh Errabolu wrote: Enables users to update the default size of buffer used in migration either from Sysmem to VRAM or vice versa. The param GOBM refers to granularity of buffer migration, and is specified in terms of log(numPages(buff

[PATCH v3 0/4] Improve SVM migrate event report

2024-08-27 Thread Philip Yang
. v3: Simplify event drop count handling (James Zhu) Philip Yang (4): drm/amdkfd: Document and define SVM events message macro drm/amdkfd: Output migrate end event if migrate failed drm/amdkfd: Increase SMI event fifo size drm/amdkfd: SMI report dropped event count drivers/gpu/drm/amd

[PATCH v3 4/4] drm/amdkfd: SMI report dropped event count

2024-08-27 Thread Philip Yang
and reset drop count to zero. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 25 + include/uapi/linux/kfd_ioctl.h | 6 + 2 files changed, 27 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b

[PATCH] drm/amdkfd: SMI report dropped event count

2024-08-27 Thread Philip Yang
and reset drop count to zero. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 25 + include/uapi/linux/kfd_ioctl.h | 6 + 2 files changed, 27 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b

[PATCH v3 1/4] drm/amdkfd: Document and define SVM events message macro

2024-08-27 Thread Philip Yang
future. No functional changes. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 45 + include/uapi/linux/kfd_ioctl.h | 100 +--- 2 files changed, 109 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd

[PATCH v3 3/4] drm/amdkfd: Increase SMI event fifo size

2024-08-27 Thread Philip Yang
prefix to the macro name. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c index 1d94b445a060..9b8169761ec5

[PATCH v3 2/4] drm/amdkfd: Output migrate end event if migrate failed

2024-08-27 Thread Philip Yang
If page migration failed, also output migrate end event to match with migrate start event, with failure error_code added to the end of the migrate message macro. This will not break uAPI because application uses old message macro sscanf drop and ignore the error_code. Signed-off-by: Philip Yang

[PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-23 Thread Philip Yang
] Call Trace: update_process_times+0x94/0xd0 RIP: 0010:amdgpu_vm_handle_moved+0x9a/0x210 [amdgpu] amdgpu_amdkfd_gpuvm_restore_process_bos+0x3d6/0x7d0 [amdgpu] restore_process_helper+0x27/0x80 [amdgpu] Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 56

Re: [PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size

2024-08-22 Thread Philip Yang
On 2024-08-22 10:34, James Zhu wrote: On 2024-07-30 16:15, Philip Yang wrote: SMI event fifo size 1KB was enough to report GPU vm fault or reset [JZ] There is a typo here. it should be NOT enough

Re: [PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro

2024-08-22 Thread Philip Yang
On 2024-08-22 10:32, James Zhu wrote: On 2024-07-30 16:15, Philip Yang wrote: Document how to use SMI system management interface to enable and receive SVM events. Document SVM event triggers. Define SVM events message

Re: [PATCH v6] drm/amdkfd: Change kfd/svm page fault drain handling

2024-08-22 Thread Philip Yang
page faults at deferred work. So, the time period that kfd does not handle page faults is reduced and can be controlled. Signed-off-by: Xiaogang.Chen Some nitpicks below. This patch is Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 +- drivers/gpu

Re: [PATCH] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-22 Thread Philip Yang
On 2024-08-21 19:22, Ramesh Errabolu wrote: KFD's design of unified memory (UM) does not allow users to configure the size of buffer used in migrating buffer either from Sysmem to VRAM or vice versa. This is not true, app can change range granularit

[PATCH] drm/amdkfd: Handle queue destroy buffer access race

2024-08-02 Thread Philip Yang
, to handle error in case failed to hold vm lock. Then hold dqm_lock to remove queue from queue list and then release queue buffers. Restore process worker restore queue hold dqm_lock, will always find the queue with valid queue buffers. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd

Re: [PATCH] drm/amdgpu: change kgd2kfd_init_zone sequence during device_init

2024-07-31 Thread Philip Yang
On 2024-07-31 04:10, Shikang Fan wrote: Move kgd2kfd_init_zone_device() after release_full_gpu() as it takes long time for asics with huge bar size and it could potentially hit full access timeout for SRIOV during init. Signed-off-by: Shikang Fan --- drivers/gp

[PATCH v2 1/4] drm/amdkfd: Document and define SVM events message macro

2024-07-30 Thread Philip Yang
future. No functional changes. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 45 + include/uapi/linux/kfd_ioctl.h | 100 +--- 2 files changed, 109 insertions(+), 36 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd

[PATCH v2 4/4] drm/amdkfd: SMI report dropped event count

2024-07-30 Thread Philip Yang
events dropped, generate a dropped event record and add to kfifo. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 41 ++--- include/uapi/linux/kfd_ioctl.h | 6 +++ 2 files changed, 41 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm

[PATCH v2 3/4] drm/amdkfd: Increase SMI event fifo size

2024-07-30 Thread Philip Yang
SMI event fifo size 1KB was enough to report GPU vm fault or reset event, increase it to 8KB to store about 100 migrate events, less chance to drop the migrate events if lots of migration happened in the short period of time. Add KFD prefix to the macro name. Signed-off-by: Philip Yang

[PATCH v2 2/4] drm/amdkfd: Output migrate end event if migrate failed

2024-07-30 Thread Philip Yang
If page migration failed, also output migrate end event to match with migrate start event, with failure error_code added to the end of the migrate message macro. This will not break uAPI because application uses old message macro sscanf drop and ignore the error_code. Signed-off-by: Philip Yang

[PATCH v2 0/4] Improve SVM migrate event report

2024-07-30 Thread Philip Yang
. Philip Yang (4): drm/amdkfd: Document and define SVM events message macro drm/amdkfd: Output migrate end event if migrate failed drm/amdkfd: Increase SMI event fifo size drm/amdkfd: SMI report dropped event count drivers/gpu/drm/amd/amdkfd/kfd_migrate.c| 14 ++- drivers/gpu/drm/amd

Re: [PATCH] drm/amdkfd: Fix missing error code in kfd_queue_acquire_buffers

2024-07-26 Thread Philip Yang
The kfdtest user queue validation cases don't cover those error condition paths, thanks for catching it. This patch is Reviewed-by: Philip Yang On 2024-07-26 02:47, Srinivasan Shanmugam wrote: The fix involves setting 'err

Re: [PATCH v3] drm/amdkfd: Change kfd/svm page fault drain handling

2024-07-25 Thread Philip Yang
On 2024-07-25 14:19, Xiaogang.Chen wrote: From: Xiaogang Chen When app unmap vm ranges(munmap) kfd/svm starts drain pending page fault and not handle any incoming pages fault of this process until a deferred work item got executed by default system wq. The

[PATCH] drm/amdkfd: Fix compile error if HMM support not enabled

2024-07-25 Thread Philip Yang
mory residency") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202407252127.zvnxakra-...@intel.com/ Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/drivers/gpu/drm/am

Re: [PATCH v2] drm/amdkfd: Change kfd/svm page fault drain handling

2024-07-24 Thread Philip Yang
On 2024-07-24 09:58, Chen, Xiaogang wrote: On 7/23/2024 4:02 PM, Philip Yang wrote: On 2024-07-19 18:17, Xiaogang.Chen wrote: From: Xiaogang Chen

Re: [PATCH v2] drm/amdkfd: Change kfd/svm page fault drain handling

2024-07-23 Thread Philip Yang
On 2024-07-19 18:17, Xiaogang.Chen wrote: From: Xiaogang Chen When app unmap vm ranges(munmap) kfd/svm starts drain pending page fault and not handle any incoming pages fault of this process until a deferred work item got executed by default system wq. The

[PATCH v2 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-18 Thread Philip Yang
parameter from pqm_create_queue. Rename structure queue wptr_bo_gart to hold wptr_bo reference for GART mapping and umapping. Move MES wptr_bo_gart mapping to init_user_queue, the same location with queue ctx_bo GART mapping. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h

[PATCH v2 8/9] drm/amdkfd: Store queue cwsr area size to node properties

2024-07-18 Thread Philip Yang
KFD node properties, to remove the duplicate calculation code from Thunk. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/amd/amdkfd/kfd_queue.c| 75 +++ drivers/gpu/drm/amd/amdkfd/kfd_topology.c

[PATCH v2 2/9] drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-07-18 Thread Philip Yang
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu

[PATCH v2 4/9] drm/amdkfd: Validate user queue buffers

2024-07-18 Thread Philip Yang
Find user queue rptr, ring buf, eop buffer and cwsr area BOs, and check BOs are mapped on the GPU with correct size and take the BO reference. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 4 +++ drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 38

[PATCH v2 5/9] drm/amdkfd: Ensure user queue buffers residency

2024-07-18 Thread Philip Yang
. Signed-off-by: Philip Yang --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h| 6 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 3 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/amd/amdkfd

[PATCH v2 9/9] drm/amdkfd: Validate queue cwsr area and eop buffer size

2024-07-18 Thread Philip Yang
When creating KFD user compute queue, check if queue eop buffer size, cwsr area size, ctl stack size equal to the size of KFD node properties. Check the entire cwsr area which may split into multiple svm ranges aligned to granularity boundary. Signed-off-by: Philip Yang Reviewed-by: Felix

[PATCH v2 6/9] drm/amdkfd: Validate user queue svm memory residency

2024-07-18 Thread Philip Yang
unmap the CWSR area while queue is active, pr_warn message in dmesg log. To be safe, evict user queue. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 110 - drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 12 +++ drivers/gpu/drm

[PATCH v2 7/9] drm/amdkfd: Validate user queue update

2024-07-18 Thread Philip Yang
Ensure update queue new ring buffer is mapped on GPU with correct size. Decrease queue old ring_bo queue_refcount and increase new ring_bo queue_refcount. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- .../amd/amdkfd/kfd_process_queue_manager.c| 32 ++- 1 file

[PATCH v2 1/9] drm/amdkfd: kfd_bo_mapped_dev support partition

2024-07-18 Thread Philip Yang
Change amdgpu_amdkfd_bo_mapped_to_dev to use drm_priv as parameter instead of adev, to support spatial partition. This is only used by CRIU checkpoint restore now. No functional change. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h

[PATCH v2 0/9] KFD user queue validation

2024-07-18 Thread Philip Yang
free and unmap them when the queue is active, or evict the queue if SVM memory is unmapped and freed from CPU. Patch 1-2 is preparation work and general fix. v2: - patch 3/9, keep wptr_bo_gart in struct queue Philip Yang (9): drm/amdkfd: kfd_bo_mapped_dev support partition drm/amdkfd

Re: [PATCH 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-18 Thread Philip Yang
See inline. kfd_queue_release_buffers return value is handled in queue destroy path, to return -ERESTARTSYS if fail to hold vm lock to release buffers because signal is received. See inline. On 2024-07-15 08:34, Philip Yang wrote: Add helpe

Re: [PATCH 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-18 Thread Philip Yang
On 2024-07-17 16:10, Felix Kuehling wrote: @@ -603,8 +606,6 @@ struct queue {   void *gang_ctx_bo;   uint64_t gang_ctx_gpu_addr;   void *gang_ctx_cpu_ptr; -

[PATCH 9/9] drm/amdkfd: Validate queue cwsr area and eop buffer size

2024-07-15 Thread Philip Yang
When creating KFD user compute queue, check if queue eop buffer size, cwsr area size, ctl stack size equal to the size of KFD node properties. Check the entire cwsr area which may split into multiple svm ranges aligned to granularity boundary. Signed-off-by: Philip Yang Reviewed-by: Felix

[PATCH 4/9] drm/amdkfd: Validate user queue buffers

2024-07-15 Thread Philip Yang
Find user queue rptr, ring buf, eop buffer and cwsr area BOs, and check BOs are mapped on the GPU with correct size and take the BO reference. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 4 +++ drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 38

[PATCH 5/9] drm/amdkfd: Ensure user queue buffers residency

2024-07-15 Thread Philip Yang
. Signed-off-by: Philip Yang --- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h| 6 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 3 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/amd/amdkfd

[PATCH 8/9] drm/amdkfd: Store queue cwsr area size to node properties

2024-07-15 Thread Philip Yang
KFD node properties, to remove the duplicate calculation code from Thunk. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + drivers/gpu/drm/amd/amdkfd/kfd_queue.c| 75 +++ drivers/gpu/drm/amd/amdkfd/kfd_topology.c

[PATCH 7/9] drm/amdkfd: Validate user queue update

2024-07-15 Thread Philip Yang
Ensure update queue new ring buffer is mapped on GPU with correct size. Decrease queue old ring_bo queue_refcount and increase new ring_bo queue_refcount. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- .../amd/amdkfd/kfd_process_queue_manager.c| 32 ++- 1 file

[PATCH 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-15 Thread Philip Yang
, the same location with queue ctx_bo GART mapping. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 2 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 5 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 56 +++--- .../drm/amd/amdkfd

[PATCH 1/9] drm/amdkfd: kfd_bo_mapped_dev support partition

2024-07-15 Thread Philip Yang
Change amdgpu_amdkfd_bo_mapped_to_dev to use drm_priv as parameter instead of adev, to support spatial partition. This is only used by CRIU checkpoint restore now. No functional change. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h

[PATCH 6/9] drm/amdkfd: Validate user queue svm memory residency

2024-07-15 Thread Philip Yang
unmap the CWSR area while queue is active, pr_warn message in dmesg log. To be safe, evict user queue. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 110 - drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 12 +++ drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 1

[PATCH 0/9] KFD user queue validation

2024-07-15 Thread Philip Yang
free and unmap them when the queue is active, or evict the queue if SVM memory is unmapped and freed from CPU. Patch 1-2 is preparation work and general fix. Philip Yang (9): drm/amdkfd: kfd_bo_mapped_dev support partition drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer drm/amdkfd

[PATCH 2/9] drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-07-15 Thread Philip Yang
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 14

Re: [PATCH] drm/amdkfd: Correct svm prange overlapping handling at svm_range_set_attr ioctl

2024-06-24 Thread Philip Yang
On 2024-06-21 13:28, Xiaogang.Chen wrote: From: Xiaogang Chen When user adds new vm range that has overlapping with existing svm pranges current kfd clones new prange and remove existing pranges including all data associate with it. It is not necessary. We

[PATCH] drm/amdgpu: Show retry fault message if process xnack on

2024-05-07 Thread Philip Yang
amdgpu_vm_handle_fault and then to gmc interrupt handler to show vm fault message. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 7

Re: [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault

2024-05-02 Thread Philip Yang
On 2024-05-02 08:42, James Zhu wrote: On 2024-05-01 18:56, Philip Yang wrote: On system with khugepaged enabled and user cases with THP buffer, the hmm_range_fault may take > 15 seconds to return -EBUSY,

Re: [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault

2024-05-02 Thread Philip Yang
On 2024-05-02 00:09, Chen, Xiaogang wrote: On 5/1/2024 5:56 PM, Philip Yang wrote: Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding

[PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault

2024-05-01 Thread Philip Yang
urn EBUSY, then userspace libdrm and Thunk will call ioctl again. Change EAGAIN to debug message as this is not error. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 5 - drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 12 +++- drivers/gpu/drm/

Re: [PATCH] drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map()

2024-05-01 Thread Philip Yang
: Ramesh Errabolu Reviewed-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 386875e6eb96..dcb1d5d3f860 100644

[PATCH v6 1/5] drm/amdgpu: Support contiguous VRAM allocation

2024-04-24 Thread Philip Yang
TTM_PL_FLAG_CONTIGUOUS flag, and ask VRAM buddy allocator to get contiguous VRAM. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 include/uapi/linux/kfd_ioctl.h | 1 + 2 files changed, 5 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu

[PATCH v6 5/5] drm/amdkfd: Bump kfd version for contiguous VRAM allocation

2024-04-24 Thread Philip Yang
Bump the kfd ioctl minor version to declare the contiguous VRAM allocation flag support. Signed-off-by: Philip Yang --- include/uapi/linux/kfd_ioctl.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index

[PATCH v6 4/5] drm/amdkfd: Evict BO itself for contiguous allocation

2024-04-24 Thread Philip Yang
If the BO pages pinned for RDMA are not contiguous on VRAM, evict it to system memory first to free the VRAM space, then allocate contiguous VRAM space, and then move it from system memory back to VRAM. v6: user context should use interruptible call (Felix) Signed-off-by: Philip Yang

[PATCH v6 0/5] Best effort contiguous VRAM allocation

2024-04-24 Thread Philip Yang
prefix to the macro name. v6: use shorter flag name, use interruptible wait ctx, drop patch 5/6 (Felix) Philip Yang (5): drm/amdgpu: Support contiguous VRAM allocation drm/amdgpu: Handle sg size limit for contiguous allocation drm/amdgpu: Evict BOs from same process for contiguous alloca

[PATCH v6 3/5] drm/amdgpu: Evict BOs from same process for contiguous allocation

2024-04-24 Thread Philip Yang
from the same process, this will evict the user queues first, and restore the queues later after contiguous VRAM allocation. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a

[PATCH v6 2/5] drm/amdgpu: Handle sg size limit for contiguous allocation

2024-04-24 Thread Philip Yang
guous VRAM memory. To workaround the sg table segment size limit, allocate multiple segments if contiguous size is bigger than AMDGPU_MAX_SG_SEGMENT_SIZE. Signed-off-by: Philip Yang Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 12 ++-- 1 file chang

Re: [PATCH v5 1/6] drm/amdgpu: Support contiguous VRAM allocation

2024-04-24 Thread Philip Yang
On 2024-04-23 18:17, Felix Kuehling wrote: On 2024-04-23 11:28, Philip Yang wrote: RDMA device with limited scatter-gather ability requires contiguous VRAM buffer allocation for RDMA peer direct support

Re: [PATCH v5 4/6] drm/amdkfd: Evict BO itself for contiguous allocation

2024-04-24 Thread Philip Yang
On 2024-04-23 18:15, Felix Kuehling wrote: On 2024-04-23 11:28, Philip Yang wrote: If the BO pages pinned for RDMA are not contiguous on VRAM, evict it to system memory first to free the VRAM space, then allocate

[PATCH v5 6/6] drm/amdkfd: Bump kfd version for contiguous VRAM allocation

2024-04-23 Thread Philip Yang
Bump the kfd ioctl minor version to declare the contiguous VRAM allocation flag support. Signed-off-by: Philip Yang --- include/uapi/linux/kfd_ioctl.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index
