On 07/11/2025 15:12, Kim, Jonathan wrote:
-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Friday, November 7, 2025 9:36 AM
To: Kim, Jonathan <[email protected]>
Cc: [email protected]; Kuehling, Felix <[email protected]>;
Six, Lancelot <[email protected]>; Yang, Philip <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <[email protected]> wrote:
Over-allocation of the save area is not fatal; only under-allocation is.
Various ROCm components independently claim authority over the save
area size.
Unless KFD decides to claim single authority, relax the size checks and
warn on over-allocation.
Do we want any sort of upper limit?
Mmm. I'd expect an unreasonably large user request to fail early at the
allocation stage, before the acquire at queue-create time, and acquire
should be bound to what was registered/allocated.
So I'm not sure what purpose an upper-bound check at acquire would serve.
One thing to consider is that at the end of the CWSR area, there is
space dedicated for the debugger (used to implement displaced-stepping
buffers). Initially, when KFD did not check the CWSR area size, it was
just a userspace matter to increase that size if the debugger needed
extensions. With this new change, we can't extend this buffer without a
KFD update.
Having a "check that the buffer the thunk provided is big enough" policy
would still allow us to extend this allocation if necessary.
Best,
Lancelot.
Signed-off-by: Jonathan Kim <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index a65c67cf56ff..448043bc2937 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 		goto out_err_unreserve;
 	}
-	if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
-		pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
+	if (properties->ctx_save_restore_area_size < topo_dev->node_props.cwsr_size) {
+		pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size 0x%x\n",
 			properties->ctx_save_restore_area_size,
 			topo_dev->node_props.cwsr_size);
 		err = -EINVAL;
 		goto out_err_unreserve;
 	}
-	total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
-			  * NUM_XCC(pdd->dev->xcc_mask);
+	if (properties->ctx_save_restore_area_size > topo_dev->node_props.cwsr_size)
+		pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended node cwsr size 0x%x\n",
+				    properties->ctx_save_restore_area_size,
+				    topo_dev->node_props.cwsr_size);
We can probably drop the warning here.
Acked.
Thanks.
Jon
Alex
+
+	total_cwsr_size = (properties->ctx_save_restore_area_size +
+			   topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
 	total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);

 	err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
@@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct kfd_process_device *pdd, struct queue_prope
topo_dev = kfd_topology_device_by_id(pdd->dev->id);
if (!topo_dev)
return -EINVAL;
-	total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
-			  * NUM_XCC(pdd->dev->xcc_mask);
+	total_cwsr_size = (properties->ctx_save_restore_area_size +
+			   topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
 	total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
 	kfd_queue_buffer_svm_put(pdd, properties->ctx_save_restore_area_address, total_cwsr_size);
--
2.34.1