On Fri, Nov 7, 2025 at 10:37 AM Lancelot SIX <[email protected]> wrote:
>
>
>
> On 07/11/2025 15:12, Kim, Jonathan wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Alex Deucher <[email protected]>
> >> Sent: Friday, November 7, 2025 9:36 AM
> >> To: Kim, Jonathan <[email protected]>
> >> Cc: [email protected]; Kuehling, Felix 
> >> <[email protected]>;
> >> Six, Lancelot <[email protected]>; Yang, Philip <[email protected]>
> >> Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save 
> >> area
> >>
> >> Caution: This message originated from an External Source. Use proper 
> >> caution
> >> when opening attachments, clicking links, or responding.
> >>
> >>
> >> On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <[email protected]> wrote:
> >>>
> >>> Over allocation of save area is not fatal, only under allocation is.
> >>> ROCm has various components that independently claim authority over save
> >>> area size.
> >>>
> >>> Unless KFD decides to claim single authority, relax size checks with a
> >>> warning on over allocation.
> >>
> >> Do we want any sort of upper limit?
> >
> > Mmm.  I'd expect early failure on unreasonable user request for over 
> > allocation at the allocation stage prior to acquire on queue create stage 
> > and acquire should be bound to what was registered/allocated.
> > So I'm not sure what an upper bound acquire limit check could serve.
> >
>
> One thing to consider is that at the end of the CWSR area, there is
> space dedicated for the debugger (used to implement displaced-stepping
> buffers).  Initially, when KFD did not check the CWSR area size, it was
> just a userspace matter to increase that size if the debugger needed
> extensions.  With this new change, we can't extend this buffer without a
> KFD update.
>
> Having a "check that the buffer the thunk provided is big enough" policy
> would still allow us to extend this allocation if necessary.
>

Fair enough.  We can probably drop the warning as well in that case.

Alex

> Best,
> Lancelot.
>
> >>
> >>>
> >>> Signed-off-by: Jonathan Kim <[email protected]>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------
> >>>   1 file changed, 11 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >>> index a65c67cf56ff..448043bc2937 100644
> >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
> >>> @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct
> >> kfd_process_device *pdd, struct queue_prope
> >>>                  goto out_err_unreserve;
> >>>          }
> >>>
> >>> -       if (properties->ctx_save_restore_area_size != topo_dev-
> >>> node_props.cwsr_size) {
> >>> -               pr_debug("queue cwsr size 0x%x not equal to node cwsr 
> >>> size 0x%x\n",
> >>> +       if (properties->ctx_save_restore_area_size < topo_dev-
> >>> node_props.cwsr_size) {
> >>> +               pr_debug("queue cwsr size 0x%x not sufficient for node 
> >>> cwsr size
> >> 0x%x\n",
> >>>                          properties->ctx_save_restore_area_size,
> >>>                          topo_dev->node_props.cwsr_size);
> >>>                  err = -EINVAL;
> >>>                  goto out_err_unreserve;
> >>>          }
> >>>
> >>> -       total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >>> node_props.debug_memory_size)
> >>> -                         * NUM_XCC(pdd->dev->xcc_mask);
> >>> +       if (properties->ctx_save_restore_area_size > topo_dev-
> >>> node_props.cwsr_size)
> >>> +               pr_warn_ratelimited("queue cwsr size 0x%x exceeds 
> >>> recommended
> >> node cwsr size 0x%x\n",
> >>> +                                   
> >>> properties->ctx_save_restore_area_size,
> >>> +                                   topo_dev->node_props.cwsr_size);
> >>
> >> We can probably drop the warning here.
> >
> > Acked.
> >
> > Thanks.
> >
> > Jon
> >
> >>
> >> Alex
> >>
> >>> +
> >>> +       total_cwsr_size = (properties->ctx_save_restore_area_size +
> >>> +                          topo_dev->node_props.debug_memory_size) * 
> >>> NUM_XCC(pdd-
> >>> dev->xcc_mask);
> >>>          total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
> >>>
> >>>          err = kfd_queue_buffer_get(vm, (void *)properties-
> >>> ctx_save_restore_area_address,
> >>> @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct
> >> kfd_process_device *pdd, struct queue_prope
> >>>          topo_dev = kfd_topology_device_by_id(pdd->dev->id);
> >>>          if (!topo_dev)
> >>>                  return -EINVAL;
> >>> -       total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev-
> >>> node_props.debug_memory_size)
> >>> -                         * NUM_XCC(pdd->dev->xcc_mask);
> >>> +       total_cwsr_size = (properties->ctx_save_restore_area_size +
> >>> +                          topo_dev->node_props.debug_memory_size) * 
> >>> NUM_XCC(pdd-
> >>> dev->xcc_mask);
> >>>          total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE);
> >>>
> >>>          kfd_queue_buffer_svm_put(pdd, properties-
> >>> ctx_save_restore_area_address, total_cwsr_size);
> >>> --
> >>> 2.34.1
> >>>
>

Reply via email to