On 07/11/2025 15:12, Kim, Jonathan wrote:
-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Friday, November 7, 2025 9:36 AM
To: Kim, Jonathan <[email protected]>
Cc: [email protected]; Kuehling, Felix <[email protected]>;
Six, Lancelot <[email protected]>; Yang, Philip <[email protected]>
Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area
On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <[email protected]> wrote:
Over-allocation of the save area is not fatal; only under-allocation is.
Various ROCm components independently claim authority over the save
area size.
Unless KFD decides to claim single authority, relax the size checks and
warn on over-allocation.
Do we want any sort of upper limit?
Mmm. I'd expect an unreasonably large user request to fail early at the
allocation stage, before the acquire at queue-create time, and acquire
should be bound to what was registered/allocated.
So I'm not sure what purpose an upper-bound check at acquire would serve.
One thing to consider is that at the end of the CWSR area, there is
space dedicated for the debugger (used to implement displaced-stepping
buffers). Initially, when KFD did not check the CWSR area size, it was
just a userspace matter to increase that size if the debugger needed
extensions. With this new change, we can't extend this buffer without a
KFD update.
Having a "check that the buffer the thunk provided is big enough" policy
would still allow us to extend this allocation if necessary.
Best,
Lancelot.
Signed-off-by: Jonathan Kim <[email protected]>
---
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index a65c67cf56ff..448043bc2937 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 		goto out_err_unreserve;
 	}
-	if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
-		pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
+	if (properties->ctx_save_restore_area_size < topo_dev->node_props.cwsr_size) {
+		pr_debug("queue cwsr size 0x%x not sufficient for node cwsr size 0x%x\n",
 			properties->ctx_save_restore_area_size,
 			topo_dev->node_props.cwsr_size);
 		err = -EINVAL;
 		goto out_err_unreserve;
 	}
-	total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
-			  * NUM_XCC(pdd->dev->xcc_mask);
+	if (properties->ctx_save_restore_area_size > topo_dev->node_props.cwsr_size)
+		pr_warn_ratelimited("queue cwsr size 0x%x exceeds recommended node cwsr size 0x%x\n",
+				    properties->ctx_save_restore_area_size,
+				    topo_dev->node_props.cwsr_size);
We can probably drop the warning here.
Acked.
Thanks.
Jon
Alex
+
+	total_cwsr_size = (properties->ctx_save_restore_area_size +
+			   topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
 	total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);

 	err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
@@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct kfd_process_device *pdd, struct queue_prope
topo_dev = kfd_topology_device_by_id(pdd->dev->id);
if (!topo_dev)
return -EINVAL;
-	total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size)
-			  * NUM_XCC(pdd->dev->xcc_mask);
+	total_cwsr_size = (properties->ctx_save_restore_area_size +
+			   topo_dev->node_props.debug_memory_size) * NUM_XCC(pdd->dev->xcc_mask);
 	total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
 	kfd_queue_buffer_svm_put(pdd, properties->ctx_save_restore_area_address, total_cwsr_size);
--
2.34.1