-by: Felix Kuehling
---
v1->v2:
- Add MES FW version check.
- Separate out the kfd_dqm_evict_pasid into another function.
- Use amdgpu_mes_suspend/amdgpu_mes_resume to suspend/resume queues.
v2->v3:
- Use down_read_trylock/up_read instead of dqm->is_hws_hang.
- Increase eviction cou
where unmapping
of the bad queue can fail thereby causing a GPU reset.
Signed-off-by: Mukul Joshi
Acked-by: Harish Kasiviswanathan
Acked-by: Alex Deucher
Reviewed-by: Felix Kuehling
---
v1->v2:
- No change.
v2->v3:
- No change.
.../drm/amd/amdkfd/kfd_device_queue_manager.
On 2024-08-21 08:03, Christian König wrote:
This patch tries to solve the basic problem that we also need to sync to
the KFD fences of the BO, because otherwise we could clear
PTEs while the KFD queues are still running.
This is going to trigger a lot of phantom KFD evictions and will tank
On 2024-08-21 08:03, Christian König wrote:
Rework how VM operations synchronize to submissions. Provide an
amdgpu_sync container to the backends instead of a reservation
object and fill in the amdgpu_sync object in the higher layers
of the code.
No intended functional change, just prepares f
On 2024-08-21 17:17, Jonathan Kim wrote:
If a queue is being destroyed but causes a HWS hang on removal, the KFD
may issue an unnecessary gpu reset if the destroyed queue can be fixed
by a queue reset.
This is because the queue has been removed from the KFD's queue list
prior to the preemption
-specific code path intentionally? If you want
this check to apply to all ASICs, you should put it into
detect_queue_hang in kfd_device_queue_manager.c. But maybe the extended
validation is HW-specific.
Either way, the patch is
Acked-by: Felix Kuehling
kgd_gfx_v9_acquire_queue
On 2024-08-26 15:34, Ramesh Errabolu wrote:
Enables users to update the default size of the buffer used
for migration, either from system memory to VRAM or vice versa.
The param GOBM refers to granularity of buffer migration,
and is specified in terms of log(numPages(buffer)). It
facilitates users of unregi
On 2024-08-28 16:34, Chen, Xiaogang wrote:
On 8/28/2024 3:26 PM, Errabolu, Ramesh wrote:
Responses inline
Regards,
Ramesh
*From:*Chen, Xiaogang
*Sent:* Wednesday, August 28, 2024 3:01 PM
*To:* Errabolu, Ramesh ;
amd-gfx@lists.freedesktop.org
*Subject:* Re: [PATCH v2] drm/amdgpu: Surfac
is passes KFD queue tests on GPUs with
HWS and MES.
Other than that, this patch is
Reviewed-by: Felix Kuehling
if (q->properties.is_active) {
decrement_queue_count(dqm, qpd, q);
+ q->properties.is_active = false;
if (!dqm
reverts commit 23335f9577e0b509c20ad8d65d9fdedd14545b55.
Signed-off-by: Christian König
Acked-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 6 --
1 file changed, 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
On 2024-08-23 15:49, Philip Yang wrote:
If a GPU reset kicks in while the KFD restore_process_worker is running, this may
cause different issues, for example the rcu stall warning below, because
restore work may move BOs and evict queues under VRAM pressure.
Fix this race by taking adev reset_domain read s
On 2024-08-22 03:28, Friedrich Vock wrote:
On 21.08.24 22:46, Felix Kuehling wrote:
On 2024-08-21 08:03, Christian König wrote:
Rework how VM operations synchronize to submissions. Provide an
amdgpu_sync container to the backends instead of a reservation
object and fill in the amdgpu_sync
On 2024-08-22 05:07, Christian König wrote:
On 2024-08-21 at 22:01, Felix Kuehling wrote:
On 2024-08-21 08:03, Christian König wrote:
This patch tries to solve the basic problem that we also need to sync to
the KFD fences of the BO, because otherwise we could clear
PTEs while the KFD
On 2024-08-28 17:38, Chen, Xiaogang wrote:
On 8/28/2024 4:05 PM, Felix Kuehling wrote:
On 2024-08-28 16:34, Chen, Xiaogang wrote:
On 8/28/2024 3:26 PM, Errabolu, Ramesh wrote:
Responses inline
Regards,
Ramesh
*From:*Chen, Xiaogang
*Sent:* Wednesday, August 28, 2024 3:01 PM
*To
On 2024-08-23 15:49, Philip Yang wrote:
If a GPU reset kicks in while the KFD restore_process_worker is running, this may
cause different issues, for example the rcu stall warning below, because
restore work may move BOs and evict queues under VRAM pressure.
Fix this race by taking adev reset_domain read sem
On 2024-08-29 18:16, Philip Yang wrote:
>
> On 2024-08-29 17:15, Felix Kuehling wrote:
>> On 2024-08-23 15:49, Philip Yang wrote:
>>> If GPU reset kick in while KFD restore_process_worker running, this may
>>> causes different issues, for example below rcu stal
/vm/mmap_min_addr.
Signed-off-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 98a57192..2c4053b29bb3 100644
--- a/drivers
NULL
access with a small offset.
v2:
- Move it to the reserved space to avoid conflicts with Mesa
- Add macros to make reserved space management easier
Cc: Arunpravin Paneer Selvam
Cc: Christian Koenig
Signed-off-by: Jay Cornwall
Signed-off-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu
_64+0x3f/0x90
[ 41.709973] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Signed-off-by: Lang Yu
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h| 2 +-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 20 ---
drivers/gpu/drm/amd/amdkfd/kfd_charde
On 2024-02-01 13:54, Rajneesh Bhardwaj wrote:
In certain cooperative group dispatch scenarios the default SPI resource
allocation may cause reduced per-CU workgroup occupancy. Set
COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang
scenarios.
Suggested-by: Joseph Greathouse
Signe
ks for checking. The patch is
Reviewed-by: Felix Kuehling
Thanks,
-Joe
Regards,
Felix
+ m->compute_resource_limits = q->is_gws ?
+ COMPUTE_RESOURCE_LIMITS__FORCE_SIMD_DIST_MASK : 0;
+
q->is_active = QUEUE_IS_ACTIVE(*q);
}
On 2024-02-01 11:50, Philip Yang wrote:
SVM migration unmap pages from GPU and then update mapping to GPU to
recover page fault. Currently unmap clears the PDE entry for range
length >= huge page and free PTB bo, update mapping to alloc new PT bo.
There is race bug that the freed entry bo maybe
On 2024-02-06 15:55, Joseph Greathouse wrote:
The current kfd_gpu_cache_info structure is only partially
filled in for some architectures. This means that for devices
where we do not fill in some fields, we can return
uninitialized values through the KFD topology.
Zero out the kfd_gpu_cache_
On 2024-02-06 16:24, Kent Russell wrote:
Partition mode only affects L3 cache size. After removing the L2 check in
the previous patch, make sure we aren't dividing all cache sizes by
partition mode, just L3.
Fixes: a75bfb3c4045 ("drm/amdkfd: Fix L2 cache size reporting in GFX9.4.3")
The fixes
kfd_gpu_cache_info before asking the remaining
fields to be filled in by lower-level functions.
Fixes: 04756ac9a24c ("drm/amdkfd: Add cache line sizes to KFD topology")
Signed-off-by: Joseph Greathouse
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 1 +
1 file
On 2024-02-07 23:14, Rajneesh Bhardwaj wrote:
In certain cooperative group dispatch scenarios the default SPI resource
allocation may cause reduced per-CU workgroup occupancy. Set
COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang
scenarios.
Suggested-by: Joseph Greathouse
Signe
On 2024-02-08 15:01, Bhardwaj, Rajneesh wrote:
On 2/8/2024 2:41 PM, Felix Kuehling wrote:
On 2024-02-07 23:14, Rajneesh Bhardwaj wrote:
In certain cooperative group dispatch scenarios the default SPI
resource
allocation may cause reduced per-CU workgroup occupancy. Set
On 2024-02-09 20:49, Rajneesh Bhardwaj wrote:
In certain cooperative group dispatch scenarios the default SPI resource
allocation may cause reduced per-CU workgroup occupancy. Set
COMPUTE_RESOURCE_LIMITS.FORCE_SIMD_DIST=1 to mitigate soft hang
scenarios.
Suggested-by: Joseph Greathouse
Signe
: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
b/drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_v11.c
index d722cbd31783..826bc4f6c8a7 100644
--- a/drivers
Signed-off-by: Rajneesh Bhardwaj
Reviewed-by: Felix Kuehling
---
* Change the enum bitfield to 4 to avoid ORing condition of previous
member flags.
* Incorporate review feedback from Felix from
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg102840.html
and split one of the
Signed-off-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c| 6 +---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 11 +++-
drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 29 ++--
4 files changed, 27
On 2024-02-15 10:18, Philip Yang wrote:
Document how to use SMI system management interface to receive SVM
events.
Define an SVM events message string format macro that user mode can use
with sscanf to parse the events. Add it to the uAPI header file to make
it obvious that this is changing uAPI in fut
On 2024-02-21 05:54, Jonathan Kim wrote:
Prevent dropping the KFD process reference at the end of a debug
IOCTL call where the acquired process value is an error.
Signed-off-by: Jonathan Kim
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 1 +
1 file
+TMA reserved memory size
to two pages.
Signed-off-by: Laurent Morichetti
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 23 ---
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 6 +++---
2 files changed, 19 insertions(+), 10 deletions(-)
diff
On 2024-02-28 01:41, Christian König wrote:
On 2024-02-28 at 06:04, Jesse.Zhang wrote:
fix the issue when running clinfo:
"amdgpu: Failed to create process VM object".
When amdgpu is initialized, seq64 does mapping and updates the BO mapping in
the VM page table.
But when clinfo runs, it also initializes a vm f
put last
in vm_fini()
Cc: Christian Koenig
Cc: Alex Deucher
Cc: Felix Kuehling
Signed-off-by: Shashank Sharma
One nit-pick and one bug inline. With those fixed, the patch
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +-
drivers/gpu/drm/a
On 2024-02-29 01:04, Jesse.Zhang wrote:
fix the issue:
"amdgpu: Failed to create process VM object".
[Why] When amdgpu is initialized, seq64 does mapping and updates the BO mapping in the vm
page table.
But when clinfo runs, it also initializes a vm for a process device through the
function kfd_process_device
On 2024-03-04 17:05, Ahmad Rehman wrote:
In a passthrough environment, when amdgpu is reloaded after unload, mode-1
reset is triggered after initializing the necessary IPs. That init does not
include KFD, and KFD init waits until the reset is completed. KFD init
is called in the reset handler, but in t
On 2024-03-04 10:19, Samir Dhume wrote:
Signed-off-by: Samir Dhume
Please add a meaningful commit description to all the patches in the
series. See one more comment below.
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 34 +++-
1 file changed, 27 insertions(+), 7
On 2024-03-04 19:20, Rehman, Ahmad wrote:
[AMD Official Use Only - General]
Hey,
Due to mode-1 reset (pending_reset), the amdgpu_amdkfd_device_init
will not be called and hence adev->kfd.init_complete will not be set.
The function amdgpu_amdkfd_drm_client_create has condition:
if (!adev-
nly memory, instead of having to be dynamically
allocated at boot time.
Cc: Greg Kroah-Hartman
Suggested-by: Greg Kroah-Hartman
Signed-off-by: Ricardo B. Marliere
The patch looks good to me. Do you want me to apply this to Alex's
amd-staging-drm-next?
Reviewed-by: Felix Kuehling
--
On 2024-03-05 14:49, Dhume, Samir wrote:
[AMD Official Use Only - General]
-Original Message-
From: Kuehling, Felix
Sent: Monday, March 4, 2024 6:47 PM
To: Dhume, Samir ; amd-gfx@lists.freedesktop.org
Cc: Lazar, Lijo ; Wan, Gavin ;
Liu, Leo ; Deucher, Alexander
Subject: Re: [PATCH 2/3
(f->dependency) in tlb_fence_work (Christian)
- move the misplaced fence_create call to the end (Philip)
V5: - free the f->dependency properly (Christian)
Cc: Christian Koenig
Cc: Felix Kuehling
Cc: Rajneesh Bhardwaj
Cc: Alex Deucher
Reviewed-by: Shashank Sharma
Signed-off-by:
On 2024-03-07 1:39, Sharma, Shashank wrote:
On 07/03/2024 00:54, Felix Kuehling wrote:
On 2024-03-06 09:41, Shashank Sharma wrote:
From: Christian König
The problem is that when (for example) 4k pages are replaced
with a single 2M page we need to wait for change to be flushed
out by
pr_err("Validating VMs failed, ret: %d\n", ret);
I'd make this a pr_debug to avoid spamming the log. Validation can fail
intermittently, and rescheduling the worker is there to handle it.
With that fixed, the patch is
Reviewed-by: Felix Kuehling
On 2024-03-11 11:25, Joshi, Mukul wrote:
[AMD Official Use Only - General]
-Original Message-
From: Christian König
Sent: Monday, March 11, 2024 2:50 AM
To: Joshi, Mukul ; amd-gfx@lists.freedesktop.org
Cc: Kuehling, Felix
Subject: Re: [PATCH] drm/amdgpu: Handle duplicate BOs during pr
On 2024-03-11 12:33, Christian König wrote:
On 2024-03-11 at 16:33, Felix Kuehling wrote:
On 2024-03-11 11:25, Joshi, Mukul wrote:
[AMD Official Use Only - General]
-Original Message-
From: Christian König
Sent: Monday, March 11, 2024 2:50 AM
To: Joshi, Mukul ; amd-gfx
causes VM clear to SDMA
before SDMA init. Adding the condition in drm client creation, on top of v1,
to guard against the drm client creation call happening multiple times.
Signed-off-by: Ahmad Rehman
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 4 ++--
drivers/gpu/drm
On 2024-03-13 13:43, Dewan Alam wrote:
IH Retry CAM should be enabled by register reads instead of always being set to
true.
This explanation sounds odd. Your code is still writing the register
first. What's the reason for reading back the register? I assume it's
not needed for enabling the CA
On 2024-03-13 5:41, Lijo Lazar wrote:
Check if the device is present in the bus before trying to recover. It
could be that device itself is lost from the bus in some hang
situations.
Signed-off-by: Lijo Lazar
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 24 ++
1 fil
On 2024-03-11 11:14, Sasha Levin wrote:
From: Prike Liang
[ Upstream commit c671ec01311b4744b377f98b0b4c6d033fe569b3 ]
Currently, GPU resets can be performed successfully on the Raven
series, while a GPU reset is required for the S3 suspend abort case.
So now we can enable GPU reset for the S3 abor
uint32_t inst)
+{
+ if (doorbell_id) {
+ struct device *dev = node->adev->dev;
+
+ if (KFD_GC_VERSION(node) == IP_VERSION(9, 4, 3))
Could this be made more generic? E.g.:
if (node->adev->xcp_mgr && node->adev->xcp_mgr->
On 2024-03-12 5:45, Tvrtko Ursulin wrote:
On 11/03/2024 14:48, Tvrtko Ursulin wrote:
Hi Felix,
On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes
in order to maintain CRIU support for ROCm applications once they
start relying on
On 2024-03-15 7:37, Christian Göttsche wrote:
Use the new added capable_any function in appropriate cases, where a
task is required to have any of two capabilities.
Reorder CAP_SYS_ADMIN last.
Signed-off-by: Christian Göttsche
Acked-by: Alexander Gordeev (s390 portion)
Acked-by: Felix
On 2024-03-15 14:17, Mukul Joshi wrote:
Check cgroup permissions when returning DMA-buf info and
based on cgroup check return the id of the GPU that has
access to the BO.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4 ++--
1 file changed, 2 insertions(+), 2 de
On 2024-03-20 15:09, Joshi, Mukul wrote:
[AMD Official Use Only - General]
-Original Message-
From: Kuehling, Felix
Sent: Monday, March 18, 2024 4:13 PM
To: Joshi, Mukul ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: Check cgroup when returning DMABuf info
On 2024
On 2024-03-18 16:12, Felix Kuehling wrote:
On 2024-03-15 14:17, Mukul Joshi wrote:
Check cgroup permissions when returning DMA-buf info and
based on cgroup check return the id of the GPU that has
access to the BO.
Signed-off-by: Mukul Joshi
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 4
Tested-by: Jesse Zhang
Reviewed-by: Felix Kuehling
---
.../gpu/drm/amd/amdkfd/kfd_int_process_v10.c| 3 ++-
.../gpu/drm/amd/amdkfd/kfd_int_process_v11.c| 3 ++-
drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 3 ++-
include/uapi/linux/kfd_ioctl.h | 17
On 2024-03-20 18:52, Mukul Joshi wrote:
Destroy the high priority workqueue that handles interrupts
during KFD node cleanup.
Signed-off-by: Mukul Joshi
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_interrupt.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a
On 2024-03-22 15:57, Zhigang Luo wrote:
it will cause a page fault after the device is recovered if there is a process running.
Signed-off-by: Zhigang Luo
Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
1 file changed, 2 insertions(+)
diff
On 2024-03-22 12:49, shaoyunl wrote:
From MES version 0x54, the log entries increased and require the log buffer
size to be increased. 16k is the maximum size agreed
What happens when you run the new firmware on an old kernel that only
allocates 4KB?
Regards,
Felix
Signed-off-by: shao
On 2024-03-26 10:53, Philip Yang wrote:
On 2024-03-25 14:45, Felix Kuehling wrote:
On 2024-03-22 15:57, Zhigang Luo wrote:
it will cause page fault after device recovered if there is a
process running.
Signed-off-by: Zhigang Luo
Change-Id: Ib1eddb56b69ecd41fe703abd169944154f48b0cd
On 2024-03-26 12:04, Alam, Dewan wrote:
[AMD Official Use Only - General]
Looping in +@Zhang, Zhaochen
CAM control register can only be written by PF. VF can only read the register.
In SRIOV VF, the write won't work.
In SRIOV case, CAM's enablement is controlled by the host. Hence, we think th
On 2024-03-25 19:33, Liu, Shaoyun wrote:
[AMD Official Use Only - General]
It can cause a page fault when the log size exceeds the page size.
I'd consider that a breaking change in the firmware that should be
avoided. Is there a way the updated driver can tell the FW the log size
that it
On 2024-03-26 11:52, Alex Deucher wrote:
This adds allocation latency, but aligns better with user
expectations. The latency should improve with the drm buddy
clearing patches that Arun has been working on.
If we submit this before the clear-page-tracking patches are in, this
will cause una
fixed, the patch is
Reviewed-by: Felix Kuehling
+ /* VF MMIO access (except mailbox range) from CPU
+* will be blocked during sriov runtime
+*/
+ adev->virt.caps |= AMDGPU_VF_MMIO_ACCESS_PROTECT;
+
amdgpu_gmc_noretry_set(ade
ably going to be at least a few weeks.
Regards,
Felix
Regards,
Tvrtko
On 15/03/2024 18:36, Tvrtko Ursulin wrote:
On 15/03/2024 02:33, Felix Kuehling wrote:
On 2024-03-12 5:45, Tvrtko Ursulin wrote:
On 11/03/2024 14:48, Tvrtko Ursulin wrote:
Hi Felix,
On 06/12/2023 21:23, Felix
On 2024-04-01 11:09, Tvrtko Ursulin wrote:
On 28/03/2024 20:42, Felix Kuehling wrote:
On 2024-03-28 12:03, Tvrtko Ursulin wrote:
Hi Felix,
I had one more thought while browsing around the amdgpu CRIU plugin.
It appears it relies on the KFD support being compiled in and
/dev/kfd present
On 2024-04-01 12:56, Tvrtko Ursulin wrote:
On 01/04/2024 17:37, Felix Kuehling wrote:
On 2024-04-01 11:09, Tvrtko Ursulin wrote:
On 28/03/2024 20:42, Felix Kuehling wrote:
On 2024-03-28 12:03, Tvrtko Ursulin wrote:
Hi Felix,
I had one more thought while browsing around the amdgpu CRIU
On 2024-04-01 17:53, Zhigang Luo wrote:
If there is more than one device doing reset in parallel, the first
device will call kfd_suspend_all_processes() to evict all processes
on all devices. This call takes time to finish. Other devices will
start reset and recover without waiting. If the proces
process has not been
evicted before doing recovery, it will be restored and then cause a page
fault.
Signed-off-by: Zhigang Luo
This patch is
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_device.c | 17 ++---
1 file changed, 6 insertions(+), 11 deletions(-)
diff
On 2024-04-08 3:55, Christian König wrote:
On 2024-04-07 at 06:52, Lang Yu wrote:
When the VM is in evicting state, amdgpu_vm_update_range would return
-EBUSY.
Then restore_process_worker runs into an endless loop.
Fixes: 2fdba514ad5a ("drm/amdgpu: Auto-validate DMABuf imports in
compute VMs")
Mhm,
Fix memory leak due to a leaked mmget reference on an error handling
code path that is triggered when attempting to create KFD processes
while a GPU reset is in progress.
Fixes: 0ab2d7532b05 ("drm/amdkfd: prepare per-process debug enable and disable")
CC: Xiaogang Chen
Signed-off
igned-off-by: Yunxiang Li
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 9 +++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
b/drivers/gpu/drm/amd/a
case left is
SVM and that is most likely not recoverable in any way when VRAM is
lost.
I agree. The series is
Acked-by: Felix Kuehling
Signed-off-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 -
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 87
On 2024-06-05 05:14, Christian König wrote:
On 2024-06-04 at 20:08, Felix Kuehling wrote:
On 2024-06-03 22:13, Al Viro wrote:
Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into
descriptor table, only to have it looked up by file descriptor and
remove it from descriptor
On 2024-06-12 16:11, Xiaogang.Chen wrote:
From: Xiaogang Chen
The current kfd/svm driver acquires the mm's mmap write lock before updating
mm->notifier_subscriptions->itree. This tree is already protected
by mm->notifier_subscriptions->lock at the mmu notifier. Each process mm interval
tree update from dif
oduction for that?
Hi David,
This refers to the SVM API that has been in the upstream driver for a while
now:
https://elixir.bootlin.com/linux/v6.9.8/source/include/uapi/linux/kfd_ioctl.h#L732
Regards,
Felix
>
> Thanks,
> -David
>
> ---
On 2024-07-09 5:30, 周春明(日月) wrote:
> --
> From: Felix Kuehling
> Sent: Tuesday, July 9, 2024, 06:40
> To: 周春明(日月) ; Tvrtko Ursulin
> ; dri-de...@lists.freedesktop.org
> ; amd-gfx@li
d by
> shader code.
>
> Signed-off-by: David Belanger
Reviewed-by: Felix Kuehling
> ---
> drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 21 ++---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 8 +---
> 2 files changed, 3 insertions(+), 26 deletions(-)
>
On 2024-07-09 22:38, 周春明(日月) wrote:
--
From: Felix Kuehling
Sent: Wednesday, July 10, 2024, 01:07
To: 周春明(日月) ; Tvrtko Ursulin
; dri-de...@lists.freedesktop.org
; amd-gfx@lists.freedesktop.org
; Dave Airlie ;
Daniel Vetter ; criu
Cc
KFD eviction fences are triggered by the enable_signaling callback on the
eviction fence. Any move operations scheduled by amdgpu_bo_move are held up by
the GPU scheduler until the eviction fence is signaled by the KFD eviction
handler, which only happens after the user mode queues are stopped.
also invalidate the PTEs?
Regards,
Felix
IIRC we postponed looking into the issue until it really becomes a
problem which is probably now :)
Regards,
Christian.
On 2024-07-12 at 16:56, Felix Kuehling wrote:
KFD eviction fences are triggered by the enable_signaling callback on
the evi
On 2024-07-15 08:34, Philip Yang wrote:
Pass a pointer reference to amdgpu_bo_unref to clear the correct pointer;
otherwise amdgpu_bo_unref clears the local variable and the original pointer
is not set to NULL, which could cause a use-after-free bug.
Signed-off-by: Philip Yang
Reviewed-by: Felix
On 2024-07-15 08:34, Philip Yang wrote:
Add helper function kfd_queue_acquire_buffers to get queue wptr_bo
reference from queue write_ptr if it is mapped to the KFD node with
expected size.
Move wptr_bo to structure queue_properties from struct queue as queue is
allocated after queue buffers a
Sorry, I see that this patch still doesn't propagate errors returned
from kfd_queue_release_buffers correctly. And the later patches in the
series don't seem to fix it either. See inline.
On 2024-07-15 08:34, Philip Yang wrote:
Add helper function kfd_queue_acquire_buffers to get queue wptr_b
return value is ignored. If the
application unmaps the CWSR area while the queue is active, a pr_warn message
appears in the dmesg log. To be safe, evict the user queue.
Signed-off-by: Philip Yang
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 110 -
drivers/gpu/drm
On 2024-07-15 08:34, Philip Yang wrote:
Add an atomic queue_refcount to struct bo_va, and return -EBUSY to fail unmapping a
BO from the GPU if the bo_va queue_refcount is not zero.
Queue creation increases the bo_va queue_refcount and queue destruction
decreases it, to ensure the queue buffe
On 2024-07-17 16:40, Alex Deucher wrote:
Add the irq source for bad opcodes.
Signed-off-by: Alex Deucher
Looks like all the error IRQ handlers return 0, which means the
interrupts will still get forwarded to KFD (which is good). The series is
Acked-by: Felix Kuehling
---
drivers
On 2024-06-26 11:06, Xiaogang.Chen wrote:
From: Xiaogang Chen
When a user adds a new vm range that overlaps with existing svm pranges,
the current kfd creates a cloned prange and splits it, then replaces the original prange
with it. That destroys the original prange locks, and the cloned prange locks do not
On 2024-07-18 15:57, Philip Yang wrote:
>
> On 2024-07-17 16:16, Felix Kuehling wrote:
>> Sorry, I see that this patch still doesn't propagate errors returned from
>> kfd_queue_release_buffers correctly. And the later patches in the series
>> don't
On 2024-07-18 1:25, Chen, Xiaogang wrote:
>
> On 7/17/2024 6:02 PM, Felix Kuehling wrote:
>>
>> On 2024-06-26 11:06, Xiaogang.Chen wrote:
>>> From: Xiaogang Chen
>>>
>>> When user adds new vm range that has overlapping with existing svm pranges
>
in struct queue
The series is
Reviewed-by: Felix Kuehling
>
> Philip Yang (9):
> drm/amdkfd: kfd_bo_mapped_dev support partition
> drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
> drm/amdkfd: Refactor queue wptr_bo GART mapping
> drm/amdkfd: Validate use
On 2024-07-18 19:05, Jonathan Kim wrote:
Certain GPUs have better copy performance over xGMI on specific
SDMA engines depending on the source and destination GPU.
Allow users to create SDMA queues on these recommended engines.
Close to 2x overall performance has been observed with this
optimizati
optimization.
v2: remove unnecessary crat updates and refactor sdma resource
bit setting logic.
Signed-off-by: Jonathan Kim
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 16 ++
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 38 +-
drivers/gpu
On 2024-07-19 18:17, Xiaogang.Chen wrote:
From: Xiaogang Chen
When an app unmaps vm ranges (munmap), kfd/svm starts draining pending page faults and
does not handle any incoming page faults of this process until a deferred work item
gets executed by the default system wq. The time period of "not handle page faul
On 2024-07-18 13:56, Jonathan Kim wrote:
Support per-queue reset for GFX9. The recommendation is for the driver
to target reset the HW queue via a SPI MMIO register write.
Since this requires pipe and HW queue info and MEC FW is limited to
doorbell reports of hung queues after an unmap failur
te user queue svm memory residency")
Reported-by: kernel test robot
Closes:
https://lore.kernel.org/oe-kbuild-all/202407252127.zvnxakra-...@intel.com/
Signed-off-by: Philip Yang
Reviewed-by: Felix Kuehling
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 14 ++
1 file chang
On 2024-07-26 11:30, Jonathan Kim wrote:
> Support per-queue reset for GFX9. The recommendation is for the driver
> to target reset the HW queue via a SPI MMIO register write.
>
> Since this requires pipe and HW queue info and MEC FW is limited to
> doorbell reports of hung queues after an unma
On 2024-07-26 11:30, Jonathan Kim wrote:
> In order to allow ROCm GDB to handle reset queues, raise an
> EC_QUEUE_RESET exception so that the debugger can subscribe and
> query this exception.
>
> Reset queues should still be considered suspendable with a status
> flag of KFD_DBG_QUEUE_RESET_MA