Re: [PATCHv3 2/3] drm/amdkfd: Update queue unmap after VM fault with MES

2024-08-16 Thread Felix Kuehling
-by: Felix Kuehling --- v1->v2: - Add MES FW version check. - Separate out the kfd_dqm_evict_pasid into another function. - Use amdgpu_mes_suspend/amdgpu_mes_resume to suspend/resume queues. v2->v3: - Use down_read_trylock/up_read instead of dqm->is_hws_hang. - Increase eviction cou

Re: [PATCHv3 3/3] drm/amdkfd: Update BadOpcode Interrupt handling with MES

2024-08-16 Thread Felix Kuehling
where unmapping of the bad queue can fail thereby causing a GPU reset. Signed-off-by: Mukul Joshi Acked-by: Harish Kasiviswanathan Acked-by: Alex Deucher Reviewed-by: Felix Kuehling --- v1->v2: - No change. v2->v3: - No change. .../drm/amd/amdkfd/kfd_device_queue_manager.

Re: [PATCH 2/3] drm/amdgpu: sync to KFD fences before clearing PTEs

2024-08-21 Thread Felix Kuehling
On 2024-08-21 08:03, Christian König wrote: This patch tries to solve the basic problem we also need to sync to the KFD fences of the BO because otherwise it can be that we clear PTEs while the KFD queues are still running. This is going to trigger a lot of phantom KFD evictions and will tank

Re: [PATCH 1/3] drm/amdgpu: re-work VM syncing

2024-08-21 Thread Felix Kuehling
On 2024-08-21 08:03, Christian König wrote: Rework how VM operations synchronize to submissions. Provide an amdgpu_sync container to the backends instead of a reservation object and fill in the amdgpu_sync object in the higher layers of the code. No intended functional change, just prepares f

Re: [PATCH] drm/amdkfd: fix missed queue reset on queue destroy

2024-08-21 Thread Felix Kuehling
On 2024-08-21 17:17, Jonathan Kim wrote: If a queue is being destroyed but causes a HWS hang on removal, the KFD may issue an unnecessary gpu reset if the destroyed queue can be fixed by a queue reset. This is because the queue has been removed from the KFD's queue list prior to the preemption

Re: [PATCH 2/2] drm/amdgpu/gfx9: put queue resets behind a debug option

2024-08-21 Thread Felix Kuehling
-specific code path intentionally? If you want this check to apply to all ASICs, you should put it into detect_queue_hang in kfd_device_queue_manager.c. But maybe the extended validation is HW-specific. Either way, the patch is Acked-by: Felix Kuehling kgd_gfx_v9_acquire_queue

Re: [PATCH v2] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-28 Thread Felix Kuehling
On 2024-08-26 15:34, Ramesh Errabolu wrote: Enables users to update the default size of buffer used in migration either from Sysmem to VRAM or vice versa. The param GOBM refers to granularity of buffer migration, and is specified in terms of log(numPages(buffer)). It facilitates users of unregi

Re: [PATCH v2] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-28 Thread Felix Kuehling
On 2024-08-28 16:34, Chen, Xiaogang wrote: On 8/28/2024 3:26 PM, Errabolu, Ramesh wrote: Responses inline Regards, Ramesh *From:*Chen, Xiaogang *Sent:* Wednesday, August 28, 2024 3:01 PM *To:* Errabolu, Ramesh ; amd-gfx@lists.freedesktop.org *Subject:* Re: [PATCH v2] drm/amdgpu: Surfac

Re: [PATCH] drm/amdkfd: fix missed queue reset on queue destroy

2024-08-28 Thread Felix Kuehling
is passes KFD queue tests on GPUs with HWS and MES. Other than that, this patch is Reviewed-by: Felix Kuehling if (q->properties.is_active) { decrement_queue_count(dqm, qpd, q); + q->properties.is_active = false; if (!dqm

Re: [PATCH] drm/amdgpu: revert "use CPU for page table update if SDMA is unavailable"

2024-08-28 Thread Felix Kuehling
reverts commit 23335f9577e0b509c20ad8d65d9fdedd14545b55. Signed-off-by: Christian König Acked-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-28 Thread Felix Kuehling
On 2024-08-23 15:49, Philip Yang wrote: If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the RCU stall warning below, because restore work may move BOs and evict queues under VRAM pressure. Fix this race by taking adev reset_domain read s

Re: [PATCH 1/3] drm/amdgpu: re-work VM syncing

2024-08-28 Thread Felix Kuehling
On 2024-08-22 03:28, Friedrich Vock wrote: On 21.08.24 22:46, Felix Kuehling wrote: On 2024-08-21 08:03, Christian König wrote: Rework how VM operations synchronize to submissions. Provide an amdgpu_sync container to the backends instead of a reservation object and fill in the amdgpu_sync

Re: [PATCH 2/3] drm/amdgpu: sync to KFD fences before clearing PTEs

2024-08-28 Thread Felix Kuehling
On 2024-08-22 05:07, Christian König wrote: On 21.08.24 at 22:01, Felix Kuehling wrote: On 2024-08-21 08:03, Christian König wrote: This patch tries to solve the basic problem we also need to sync to the KFD fences of the BO because otherwise it can be that we clear PTEs while the KFD

Re: [PATCH v2] drm/amdgpu: Surface svm_attr_gobm, a RW module parameter

2024-08-28 Thread Felix Kuehling
On 2024-08-28 17:38, Chen, Xiaogang wrote: On 8/28/2024 4:05 PM, Felix Kuehling wrote: On 2024-08-28 16:34, Chen, Xiaogang wrote: On 8/28/2024 3:26 PM, Errabolu, Ramesh wrote: Responses inline Regards, Ramesh *From:*Chen, Xiaogang *Sent:* Wednesday, August 28, 2024 3:01 PM *To

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Felix Kuehling
On 2024-08-23 15:49, Philip Yang wrote: If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the RCU stall warning below, because restore work may move BOs and evict queues under VRAM pressure. Fix this race by taking adev reset_domain read sem

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-30 Thread Felix Kuehling
On 2024-08-29 18:16, Philip Yang wrote: > > On 2024-08-29 17:15, Felix Kuehling wrote: >> On 2024-08-23 15:49, Philip Yang wrote: >>> If a GPU reset kicks in while the KFD restore_process_worker is running, this may >>> cause different issues, for example below rcu stal

[PATCH 1/2] drm/amdgpu: Reduce VA_RESERVED_BOTTOM to 64KB

2024-01-30 Thread Felix Kuehling
/vm/mmap_min_addr. Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 98a57192..2c4053b29bb3 100644 --- a/drivers

[PATCH 2/2] drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)

2024-01-30 Thread Felix Kuehling
NULL access with a small offset. v2: - Move it to the reserved space to avoid conflicts with Mesa - Add macros to make reserved space management easier Cc: Arunpravin Paneer Selvam Cc: Christian Koenig Signed-off-by: Jay Cornwall Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu

Re: [PATCH v3] drm/amdkfd: reserve the BO before validating it

2024-01-30 Thread Felix Kuehling
_64+0x3f/0x90 [ 41.709973] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Signed-off-by: Lang Yu Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 2 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 20 --- drivers/gpu/drm/amd/amdkfd/kfd_charde

Re: [PATCH] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-02 Thread Felix Kuehling
On 2024-02-01 13:54, Rajneesh Bhardwaj wrote: In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarios. Suggested-by: Joseph Greathouse Signe

Re: [PATCH] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-02 Thread Felix Kuehling
ks for checking. The patch is Reviewed-by: Felix Kuehling Thanks, -Joe Regards, Felix + m->compute_resource_limits = q->is_gws ? + COMPUTE_RESOURCE_LIMITS__FORCE_SIMD_DIST_MASK : 0; + q->is_active = QUEUE_IS_ACTIVE(*q); }

Re: [PATCH 1/2] drm/amdgpu: Unmap only clear the page table leaves

2024-02-02 Thread Felix Kuehling
On 2024-02-01 11:50, Philip Yang wrote: SVM migration unmap pages from GPU and then update mapping to GPU to recover page fault. Currently unmap clears the PDE entry for range length >= huge page and free PTB bo, update mapping to alloc new PT bo. There is race bug that the freed entry bo maybe

Re: [PATCH] drm/amdkfd: Initialize kfd_gpu_cache_info for KFD topology

2024-02-06 Thread Felix Kuehling
On 2024-02-06 15:55, Joseph Greathouse wrote: The current kfd_gpu_cache_info structure is only partially filled in for some architectures. This means that for devices where we do not fill in some fields, we can return uninitialized values through the KFD topology. Zero out the kfd_gpu_cache_

Re: [PATCH] drm/amdkfd: Don't divide L2 cache by partition mode

2024-02-06 Thread Felix Kuehling
On 2024-02-06 16:24, Kent Russell wrote: Partition mode only affects L3 cache size. After removing the L2 check in the previous patch, make sure we aren't dividing all cache sizes by partition mode, just L3. Fixes: a75bfb3c4045 ("drm/amdkfd: Fix L2 cache size reporting in GFX9.4.3") The fixes

Re: [PATCH v2] drm/amdkfd: Initialize kfd_gpu_cache_info for KFD topology

2024-02-06 Thread Felix Kuehling
kfd_gpu_cache_info before asking the remaining fields to be filled in by lower-level functions. Fixes: 04756ac9a24c ("drm/amdkfd: Add cache line sizes to KFD topology") Signed-off-by: Joseph Greathouse Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 1 + 1 file

Re: [Patch v2] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-08 Thread Felix Kuehling
On 2024-02-07 23:14, Rajneesh Bhardwaj wrote: In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarios. Suggested-by: Joseph Greathouse Signe

Re: [Patch v2] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-08 Thread Felix Kuehling
On 2024-02-08 15:01, Bhardwaj, Rajneesh wrote: On 2/8/2024 2:41 PM, Felix Kuehling wrote: On 2024-02-07 23:14, Rajneesh Bhardwaj wrote: In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set

Re: [PATCH 1/2] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-13 Thread Felix Kuehling
On 2024-02-09 20:49, Rajneesh Bhardwaj wrote: In certain cooperative group dispatch scenarios the default SPI resource allocation may cause reduced per-CU workgroup occupancy. Set COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang scenarios. Suggested-by: Joseph Greathouse Signe

Re: [PATCH 2/2] drm/amdgpu: Fix implicit assumtion in gfx11 debug flags

2024-02-13 Thread Felix Kuehling
: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c index d722cbd31783..826bc4f6c8a7 100644 --- a/drivers

Re: [Patch v2 1/2] drm/amdkfd: update SIMD distribution algo for GFXIP 9.4.2 onwards

2024-02-13 Thread Felix Kuehling
Signed-off-by: Rajneesh Bhardwaj Reviewed-by: Felix Kuehling --- * Change the enum bitfield to 4 to avoid ORing condition of previous member flags. * Incorporate review feedback from Felix from https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg102840.html and split one of the

[PATCH v3] drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole

2024-02-13 Thread Felix Kuehling
Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c| 6 +--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 11 +++- drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 29 ++-- 4 files changed, 27

Re: [PATCH 1/2] drm/amdkfd: Document and define SVM event tracing macro

2024-02-16 Thread Felix Kuehling
On 2024-02-15 10:18, Philip Yang wrote: Document how to use the SMI system management interface to receive SVM events. Define an SVM event message string format macro that user mode can use with sscanf to parse the event. Add it to the uAPI header file to make it obvious that is changing uAPI in fut

Re: [PATCH] drm/amdkfd: fix process reference drop on debug ioctl

2024-02-21 Thread Felix Kuehling
On 2024-02-21 05:54, Jonathan Kim wrote: Prevent dropping the KFD process reference at the end of a debug IOCTL call where the acquired process value is an error. Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 1 + 1 file

Re: [PATCH] drm/amdkfd: Increase the size of the memory reserved for the TBA

2024-02-23 Thread Felix Kuehling
+TMA reserved memory size to two pages. Signed-off-by: Laurent Morichetti Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 23 --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 6 +++--- 2 files changed, 19 insertions(+), 10 deletions(-) diff

Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven

2024-02-28 Thread Felix Kuehling
On 2024-02-28 01:41, Christian König wrote: On 28.02.24 at 06:04, Jesse.Zhang wrote: fix the issue when running clinfo: "amdgpu: Failed to create process VM object". When amdgpu is initialized, seq64 does the mapping and updates the bo mapping in the vm page table. But when clinfo runs, it also initializes a vm f

Re: [PATCH v3] drm/amdgpu: change vm->task_info handling

2024-03-01 Thread Felix Kuehling
put last in vm_fini() Cc: Christian Koenig Cc: Alex Deucher Cc: Felix Kuehling Signed-off-by: Shashank Sharma One nit-pick and one bug inline. With those fixed, the patch Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +- drivers/gpu/drm/a

Re: [PATCH V3] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute" for Raven

2024-03-01 Thread Felix Kuehling
On 2024-02-29 01:04, Jesse.Zhang wrote: fix the issue: "amdgpu: Failed to create process VM object". [Why] When amdgpu is initialized, seq64 does the mapping and updates the bo mapping in the vm page table. But when clinfo runs, it also initializes a vm for a process device through the function kfd_process_device

Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-04 Thread Felix Kuehling
On 2024-03-04 17:05, Ahmad Rehman wrote: In passthrough environment, when amdgpu is reloaded after unload, mode-1 is triggered after initializing the necessary IPs, That init does not include KFD, and KFD init waits until the reset is completed. KFD init is called in the reset handler, but in t

Re: [PATCH 2/3] drm/amdgpu: sdma support for sriov cpx mode

2024-03-04 Thread Felix Kuehling
On 2024-03-04 10:19, Samir Dhume wrote: Signed-off-by: Samir Dhume Please add a meaningful commit description to all the patches in the series. See one more comment below. --- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 34 +++- 1 file changed, 27 insertions(+), 7

Re: [PATCH] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-05 Thread Felix Kuehling
On 2024-03-04 19:20, Rehman, Ahmad wrote: [AMD Official Use Only - General] Hey, Due to mode-1 reset (pending_reset), the amdgpu_amdkfd_device_init will not be called and hence adev->kfd.init_complete will not be set. The function amdgpu_amdkfd_drm_client_create has condition: if (!adev-

Re: [PATCH] drm/amdkfd: make kfd_class constant

2024-03-05 Thread Felix Kuehling
nly memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman Suggested-by: Greg Kroah-Hartman Signed-off-by: Ricardo B. Marliere The patch looks good to me. Do you want me to apply this to Alex's amd-staging-drm-next? Reviewed-by: Felix Kuehling --

Re: [PATCH 2/3] drm/amdgpu: sdma support for sriov cpx mode

2024-03-05 Thread Felix Kuehling
On 2024-03-05 14:49, Dhume, Samir wrote: [AMD Official Use Only - General] -Original Message- From: Kuehling, Felix Sent: Monday, March 4, 2024 6:47 PM To: Dhume, Samir ; amd-gfx@lists.freedesktop.org Cc: Lazar, Lijo ; Wan, Gavin ; Liu, Leo ; Deucher, Alexander Subject: Re: [PATCH 2/3

Re: [PATCH v5 1/2] drm/amdgpu: implement TLB flush fence

2024-03-06 Thread Felix Kuehling
(f->dependency) in tlb_fence_work (Christian) - move the misplaced fence_create call to the end (Philip) V5: - free the f->dependency properly (Christian) Cc: Christian Koenig Cc: Felix Kuehling Cc: Rajneesh Bhardwaj Cc: Alex Deucher Reviewed-by: Shashank Sharma Signed-off-by:

Re: [PATCH v5 1/2] drm/amdgpu: implement TLB flush fence

2024-03-07 Thread Felix Kuehling
On 2024-03-07 1:39, Sharma, Shashank wrote: On 07/03/2024 00:54, Felix Kuehling wrote: On 2024-03-06 09:41, Shashank Sharma wrote: From: Christian König The problem is that when (for example) 4k pages are replaced with a single 2M page we need to wait for change to be flushed out by

Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-08 Thread Felix Kuehling
pr_err("Validating VMs failed, ret: %d\n", ret); I'd make this a pr_debug to avoid spamming the log. validation can fail intermittently and rescheduling the worker is there to handle it. With that fixed, the patch is Reviewed-by: Felix Kuehling

Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-11 Thread Felix Kuehling
On 2024-03-11 11:25, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Christian König Sent: Monday, March 11, 2024 2:50 AM To: Joshi, Mukul ; amd-gfx@lists.freedesktop.org Cc: Kuehling, Felix Subject: Re: [PATCH] drm/amdgpu: Handle duplicate BOs during pr

Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

2024-03-11 Thread Felix Kuehling
On 2024-03-11 12:33, Christian König wrote: On 11.03.24 at 16:33, Felix Kuehling wrote: On 2024-03-11 11:25, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Christian König Sent: Monday, March 11, 2024 2:50 AM To: Joshi, Mukul ; amd-gfx

Re: [PATCH v3] drm/amdgpu: Init zone device and drm client after mode-1 reset on reload

2024-03-12 Thread Felix Kuehling
causes VM clear to SDMA before SDMA init. Adding the condition to drm client creation, on top of v1, to guard against drm client creation being called multiple times. Signed-off-by: Ahmad Rehman Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 4 ++-- drivers/gpu/drm

Re: [PATCH] drm/amd/amdgpu: Enable IH Retry CAM by register read

2024-03-13 Thread Felix Kuehling
On 2024-03-13 13:43, Dewan Alam wrote: IH Retry CAM should be enabled by register reads instead of always being set to true. This explanation sounds odd. Your code is still writing the register first. What's the reason for reading back the register? I assume it's not needed for enabling the CA

Re: [PATCH] drm/amdgpu: Do a basic health check before reset

2024-03-13 Thread Felix Kuehling
On 2024-03-13 5:41, Lijo Lazar wrote: Check if the device is present in the bus before trying to recover. It could be that device itself is lost from the bus in some hang situations. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++ 1 fil

Re: [PATCH AUTOSEL 5.15 3/5] drm/amdgpu: Enable gpu reset for S3 abort cases on Raven series

2024-03-13 Thread Felix Kuehling
On 2024-03-11 11:14, Sasha Levin wrote: From: Prike Liang [ Upstream commit c671ec01311b4744b377f98b0b4c6d033fe569b3 ] Currently, GPU resets can now be performed successfully on the Raven series. While GPU reset is required for the S3 suspend abort case. So now can enable gpu reset for S3 abor

Re: [PATCH 2/2] drm/amdkfd: Check preemption status on all XCDs

2024-03-14 Thread Felix Kuehling
uint32_t inst) +{ + if (doorbell_id) { + struct device *dev = node->adev->dev; + + if (KFD_GC_VERSION(node) == IP_VERSION(9, 4, 3)) Could this be made more generic? E.g.: if (node->adev->xcp_mgr && node->adev->xcp_mgr->

Re: Proposal to add CRIU support to DRM render nodes

2024-03-14 Thread Felix Kuehling
On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix Kuehling wrote: Executive Summary: We need to add CRIU support to DRM render nodes in order to maintain CRIU support for ROCm application once they start relying on

Re: [PATCH 05/10] drivers: use new capable_any functionality

2024-03-15 Thread Felix Kuehling
On 2024-03-15 7:37, Christian Göttsche wrote: Use the new added capable_any function in appropriate cases, where a task is required to have any of two capabilities. Reorder CAP_SYS_ADMIN last. Signed-off-by: Christian Göttsche Acked-by: Alexander Gordeev (s390 portion) Acked-by: Felix

Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-18 Thread Felix Kuehling
On 2024-03-15 14:17, Mukul Joshi wrote: Check cgroup permissions when returning DMA-buf info and based on cgroup check return the id of the GPU that has access to the BO. Signed-off-by: Mukul Joshi --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 de

Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-20 Thread Felix Kuehling
On 2024-03-20 15:09, Joshi, Mukul wrote: [AMD Official Use Only - General] -Original Message- From: Kuehling, Felix Sent: Monday, March 18, 2024 4:13 PM To: Joshi, Mukul ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info On 2024

Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info

2024-03-20 Thread Felix Kuehling
On 2024-03-18 16:12, Felix Kuehling wrote: On 2024-03-15 14:17, Mukul Joshi wrote: Check cgroup permissions when returning DMA-buf info and based on cgroup check return the id of the GPU that has access to the BO. Signed-off-by: Mukul Joshi ---   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4

Re: [PATCH] drm/amdkfd: range check cp bad op exception interrupts

2024-03-21 Thread Felix Kuehling
Tested-by: Jesse Zhang Reviewed-by: Felix Kuehling --- .../gpu/drm/amd/amdkfd/kfd_int_process_v10.c| 3 ++- .../gpu/drm/amd/amdkfd/kfd_int_process_v11.c| 3 ++- drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 3 ++- include/uapi/linux/kfd_ioctl.h | 17

Re: [PATCH] drm/amdkfd: Cleanup workqueue during module unload

2024-03-21 Thread Felix Kuehling
On 2024-03-20 18:52, Mukul Joshi wrote: Destroy the high priority workqueue that handles interrupts during KFD node cleanup. Signed-off-by: Mukul Joshi Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a

Re: [PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device

2024-03-25 Thread Felix Kuehling
On 2024-03-22 15:57, Zhigang Luo wrote: it will cause page fault after device recovered if there is a process running. Signed-off-by: Zhigang Luo Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+) diff

Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-25 Thread Felix Kuehling
On 2024-03-22 12:49, shaoyunl wrote: From MES version 0x54, the log entry increased and require the log buffer size to be increased. The 16k is maximum size agreed What happens when you run the new firmware on an old kernel that only allocates 4KB? Regards,   Felix Signed-off-by: shao

Re: [PATCH 2/3] amd/amdgpu: wait no process running in kfd before resuming device

2024-03-26 Thread Felix Kuehling
On 2024-03-26 10:53, Philip Yang wrote: On 2024-03-25 14:45, Felix Kuehling wrote: On 2024-03-22 15:57, Zhigang Luo wrote: it will cause page fault after device recovered if there is a process running. Signed-off-by: Zhigang Luo Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd

Re: [PATCH] drm/amd/amdgpu: Enable IH Retry CAM by register read

2024-03-26 Thread Felix Kuehling
On 2024-03-26 12:04, Alam, Dewan wrote: [AMD Official Use Only - General] Looping in +@Zhang, Zhaochen CAM control register can only be written by PF. VF can only read the register. In SRIOV VF, the write won't work. In SRIOV case, CAM's enablement is controlled by the host. Hence, we think th

Re: [PATCH] drm/amdgpu : Increase the mes log buffer size as per new MES FW version

2024-03-26 Thread Felix Kuehling
On 2024-03-25 19:33, Liu, Shaoyun wrote: [AMD Official Use Only - General] It can cause a page fault when the log size exceeds the page size. I'd consider that a breaking change in the firmware that should be avoided. Is there a way the updated driver can tell the FW the log size that it

Re: [PATCH 1/2] drm/amdgpu: always allocate cleared VRAM for KFD allocations

2024-03-26 Thread Felix Kuehling
On 2024-03-26 11:52, Alex Deucher wrote: This adds allocation latency, but aligns better with user expectations. The latency should improve with the drm buddy clearing patches that Arun has been working on. If we submit this before the clear-page-tracking patches are in, this will cause una

Re: [PATCH] drm/amdgpu: use vm_update_mode=0 as default in sriov for gfx10.3 onwards

2024-03-28 Thread Felix Kuehling
fixed, the patch is Reviewed-by: Felix Kuehling + /* VF MMIO access (except mailbox range) from CPU +* will be blocked during sriov runtime +*/ + adev->virt.caps |= AMDGPU_VF_MMIO_ACCESS_PROTECT; + amdgpu_gmc_noretry_set(ade

Re: Proposal to add CRIU support to DRM render nodes

2024-03-28 Thread Felix Kuehling
ably going to be at least a few weeks. Regards,   Felix Regards, Tvrtko On 15/03/2024 18:36, Tvrtko Ursulin wrote: On 15/03/2024 02:33, Felix Kuehling wrote: On 2024-03-12 5:45, Tvrtko Ursulin wrote: On 11/03/2024 14:48, Tvrtko Ursulin wrote: Hi Felix, On 06/12/2023 21:23, Felix

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling
On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU plugin. It appears it relies on the KFD support being compiled in and /dev/kfd present

Re: Proposal to add CRIU support to DRM render nodes

2024-04-01 Thread Felix Kuehling
On 2024-04-01 12:56, Tvrtko Ursulin wrote: On 01/04/2024 17:37, Felix Kuehling wrote: On 2024-04-01 11:09, Tvrtko Ursulin wrote: On 28/03/2024 20:42, Felix Kuehling wrote: On 2024-03-28 12:03, Tvrtko Ursulin wrote: Hi Felix, I had one more thought while browsing around the amdgpu CRIU

Re: [PATCH 1/2] amd/amdkfd: sync all devices to wait all processes being evicted

2024-04-02 Thread Felix Kuehling
On 2024-04-01 17:53, Zhigang Luo wrote: If there are more than one device doing reset in parallel, the first device will call kfd_suspend_all_processes() to evict all processes on all devices, this call takes time to finish. other device will start reset and recover without waiting. if the proces

Re: [PATCH 1/2] amd/amdkfd: sync all devices to wait all processes being evicted

2024-04-03 Thread Felix Kuehling
process has not been evicted before doing recover, it will be restored, then caused page fault. Signed-off-by: Zhigang Luo This patch is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_device.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff

Re: [PATCH] drm/amdkfd: make sure VM is ready for updating operations

2024-04-09 Thread Felix Kuehling
On 2024-04-08 3:55, Christian König wrote: On 07.04.24 at 06:52, Lang Yu wrote: When VM is in evicting state, amdgpu_vm_update_range would return -EBUSY. Then restore_process_worker runs into a dead loop. Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs") Mhm,

[PATCH] drm/amdkfd: Fix memory leak in create_process failure

2024-04-10 Thread Felix Kuehling
Fix memory leak due to a leaked mmget reference on an error handling code path that is triggered when attempting to create KFD processes while a GPU reset is in progress. Fixes: 0ab2d7532b05 ("drm/amdkfd: prepare per-process debug enable and disable") CC: Xiaogang Chen Signed-off

Re: [PATCH v4 9/9] drm/amdgpu: add lock in kfd_process_dequeue_from_device

2024-06-06 Thread Felix Kuehling
igned-off-by: Yunxiang Li Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/a

Re: [PATCH 3/3] drm/amdgpu: nuke the VM PD/PT shadow handling

2024-06-06 Thread Felix Kuehling
case left is SVM and that is most likely not recoverable in any way when VRAM is lost. I agree. The series is Acked-by: Felix Kuehling Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 - drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 87

Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-06 Thread Felix Kuehling
On 2024-06-05 05:14, Christian König wrote: On 04.06.24 at 20:08, Felix Kuehling wrote: On 2024-06-03 22:13, Al Viro wrote: Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into descriptor table, only to have it looked up by file descriptor and remove it from descriptor

Re: [PATCH] drm/amdkfd: Update mm interval notifier tree without acquiring mm's mmap lock

2024-06-18 Thread Felix Kuehling
On 2024-06-12 16:11, Xiaogang.Chen wrote: From: Xiaogang Chen Current kfd/svm driver acquires mm's mmap write lock before update mm->notifier_subscriptions->itree. This tree is already protected by mm->notifier_subscriptions->lock at mmu notifier. Each process mm interval tree update from dif

Re: Re:Proposal to add CRIU support to DRM render nodes

2024-07-08 Thread Felix Kuehling
oduction for that? Hi David, This refers to the SVM API that has been in the upstream driver for a while now: https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732 Regards, Felix > > Thanks, > -David > > ---

Re: Reply: Re: Proposal to add CRIU support to DRM render nodes

2024-07-09 Thread Felix Kuehling
On 2024-07-09 5:30, 周春明(日月) wrote: > > > > > > > -- > From: Felix Kuehling > Sent: Tuesday, July 9, 2024 06:40 > To: 周春明(日月) ; Tvrtko Ursulin > ; dri-de...@lists.freedesktop.org > ; amd-gfx@li

Re: [PATCH] drm/amdgpu: Restore uncache behaviour on GFX12

2024-07-09 Thread Felix Kuehling
d by > shader code. > > Signed-off-by: David Belanger Reviewed-by: Felix Kuehling > --- > drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 21 ++--- > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 8 +--- > 2 files changed, 3 insertions(+), 26 deletions(-) >

Re: va range based memory management discussion (was: Reply: Reply: Re: Proposal to add CRIU support to DRM render nodes)

2024-07-10 Thread Felix Kuehling
On 2024-07-09 22:38, 周春明(日月) wrote: -- From: Felix Kuehling Sent: Wednesday, July 10, 2024 01:07 To: 周春明(日月) ; Tvrtko Ursulin ; dri-de...@lists.freedesktop.org ; amd-gfx@lists.freedesktop.org ; Dave Airlie ; Daniel Vetter ; criu Cc:

Re: [PATCH] drm/amdgpu: Mark amdgpu_bo as invalid after moved

2024-07-12 Thread Felix Kuehling
KFD eviction fences are triggered by the enable_signaling callback on the eviction fence. Any move operations scheduled by amdgpu_bo_move are held up by the GPU scheduler until the eviction fence is signaled by the KFD eviction handler, which only happens after the user mode queues are stopped.

Re: [PATCH] drm/amdgpu: Mark amdgpu_bo as invalid after moved

2024-07-17 Thread Felix Kuehling
also invalidate the PTEs? Regards, Felix IIRC we postponed looking into the issue until it really becomes a problem which is probably now :) Regards, Christian. On 12.07.24 at 16:56, Felix Kuehling wrote: KFD eviction fences are triggered by the enable_signaling callback on the evi

Re: [PATCH 2/9] drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer

2024-07-17 Thread Felix Kuehling
On 2024-07-15 08:34, Philip Yang wrote: Pass a pointer reference to amdgpu_bo_unref to clear the correct pointer. Otherwise amdgpu_bo_unref clears the local variable and the original pointer is not set to NULL, which could cause a use-after-free bug. Signed-off-by: Philip Yang Reviewed-by: Felix

Re: [PATCH 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-17 Thread Felix Kuehling
On 2024-07-15 08:34, Philip Yang wrote: Add helper function kfd_queue_acquire_buffers to get queue wptr_bo reference from queue write_ptr if it is mapped to the KFD node with expected size. Move wptr_bo to structure queue_properties from struct queue as queue is allocated after queue buffers a

Re: [PATCH 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-17 Thread Felix Kuehling
Sorry, I see that this patch still doesn't propagate errors returned from kfd_queue_release_buffers correctly. And the later patches in the series don't seem to fix it either. See inline. On 2024-07-15 08:34, Philip Yang wrote: Add helper function kfd_queue_acquire_buffers to get queue wptr_b

Re: [PATCH 6/9] drm/amdkfd: Validate user queue svm memory residency

2024-07-17 Thread Felix Kuehling
return value is ignored, if application unmap the CWSR area while queue is active, pr_warn message in dmesg log. To be safe, evict user queue. Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 110 - drivers/gpu/drm

Re: [PATCH 5/9] drm/amdkfd: Ensure user queue buffers residency

2024-07-17 Thread Felix Kuehling
On 2024-07-15 08:34, Philip Yang wrote: Add atomic queue_refcount to struct bo_va, return -EBUSY to fail unmap BO from the GPU if the bo_va queue_refcount is not zero. Create queue to increase the bo_va queue_refcount, destroy queue to decrease the bo_va queue_refcount, to ensure the queue buffe

Re: [PATCH 1/6] drm/amdgpu/gfx: add bad opcode interrupt

2024-07-17 Thread Felix Kuehling
On 2024-07-17 16:40, Alex Deucher wrote: Add the irq source for bad opcodes. Signed-off-by: Alex Deucher Looks like all the error IRQ handlers return 0, which means the interrupts will still get forwarded to KFD (which is good). The series is Acked-by: Felix Kuehling --- drivers

Re: [PATCH v2] drm/amdkfd: Correct svm prange overlapping handling at svm_range_set_attr ioctl

2024-07-17 Thread Felix Kuehling
On 2024-06-26 11:06, Xiaogang.Chen wrote: From: Xiaogang Chen When user adds new vm range that has overlapping with existing svm pranges current kfd creates a cloned prange and split it, then replaces original prange by it. That destroy original prange locks and the cloned prange locks do not

Re: [PATCH 3/9] drm/amdkfd: Refactor queue wptr_bo GART mapping

2024-07-18 Thread Felix Kuehling
On 2024-07-18 15:57, Philip Yang wrote: > > On 2024-07-17 16:16, Felix Kuehling wrote: >> Sorry, I see that this patch still doesn't propagate errors returned from >> kfd_queue_release_buffers correctly. And the later patches in the series >> don't

Re: [PATCH v2] drm/amdkfd: Correct svm prange overlapping handling at svm_range_set_attr ioctl

2024-07-18 Thread Felix Kuehling
On 2024-07-18 1:25, Chen, Xiaogang wrote: > > On 7/17/2024 6:02 PM, Felix Kuehling wrote: >> >> On 2024-06-26 11:06, Xiaogang.Chen wrote: >>> From: Xiaogang Chen >>> >>> When user adds new vm range that has overlapping with existing svm pranges >

Re: [PATCH v2 0/9] KFD user queue validation

2024-07-18 Thread Felix Kuehling
in struct queue The series is Reviewed-by: Felix Kuehling > > Philip Yang (9): > drm/amdkfd: kfd_bo_mapped_dev support partition > drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer > drm/amdkfd: Refactor queue wptr_bo GART mapping > drm/amdkfd: Validate use

Re: [PATCH] drm/amdkfd: allow users to target recommended SDMA engines

2024-07-19 Thread Felix Kuehling
On 2024-07-18 19:05, Jonathan Kim wrote: Certain GPUs have better copy performance over xGMI on specific SDMA engines depending on the source and destination GPU. Allow users to create SDMA queues on these recommended engines. Close to 2x overall performance has been observed with this optimizati

Re: [PATCH] drm/amdkfd: allow users to target recommended SDMA engines

2024-07-24 Thread Felix Kuehling
optimization. v2: remove unnecessary crat updates and refactor sdma resource bit setting logic. Signed-off-by: Jonathan Kim Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 16 ++ .../drm/amd/amdkfd/kfd_device_queue_manager.c | 38 +- drivers/gpu

Re: [PATCH v2] drm/amdkfd: Change kfd/svm page fault drain handling

2024-07-24 Thread Felix Kuehling
On 2024-07-19 18:17, Xiaogang.Chen wrote: From: Xiaogang Chen When an app unmaps vm ranges (munmap), kfd/svm starts draining pending page faults and does not handle any incoming page faults of this process until a deferred work item gets executed by the default system wq. The time period of "not handle page faul

Re: [PATCH 1/2] drm/amdkfd: support per-queue reset on gfx9

2024-07-24 Thread Felix Kuehling
On 2024-07-18 13:56, Jonathan Kim wrote: Support per-queue reset for GFX9. The recommendation is for the driver to target reset the HW queue via a SPI MMIO register write. Since this requires pipe and HW queue info and MEC FW is limited to doorbell reports of hung queues after an unmap failur

Re: [PATCH] drm/amdkfd: Fix compile error if HMM support not enabled

2024-07-26 Thread Felix Kuehling
te user queue svm memory residency") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202407252127.zvnxakra-...@intel.com/ Signed-off-by: Philip Yang Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 14 ++ 1 file chang

Re: [PATCH 1/2] drm/amdkfd: support per-queue reset on gfx9

2024-07-30 Thread Felix Kuehling
On 2024-07-26 11:30, Jonathan Kim wrote: > Support per-queue reset for GFX9. The recommendation is for the driver > to target reset the HW queue via a SPI MMIO register write. > > Since this requires pipe and HW queue info and MEC FW is limited to > doorbell reports of hung queues after an unma

Re: [PATCH 2/2] drm/amdkfd: support the debugger during per-queue reset

2024-07-30 Thread Felix Kuehling
On 2024-07-26 11:30, Jonathan Kim wrote: > In order to allow ROCm GDB to handle reset queues, raise an > EC_QUEUE_RESET exception so that the debugger can subscribe and > query this exception. > > Reset queues should still be considered suspendable with a status > flag of KFD_DBG_QUEUE_RESET_MA
