On Fri, Nov 7, 2025 at 10:37 AM Lancelot SIX <[email protected]> wrote: > > > > On 07/11/2025 15:12, Kim, Jonathan wrote: > > [Public] > > > >> -----Original Message----- > >> From: Alex Deucher <[email protected]> > >> Sent: Friday, November 7, 2025 9:36 AM > >> To: Kim, Jonathan <[email protected]> > >> Cc: [email protected]; Kuehling, Felix > >> <[email protected]>; > >> Six, Lancelot <[email protected]>; Yang, Philip <[email protected]> > >> Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save > >> area > >> > >> Caution: This message originated from an External Source. Use proper > >> caution > >> when opening attachments, clicking links, or responding. > >> > >> > >> On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <[email protected]> wrote: > >>> > >>> Over allocation of save area is not fatal, only under allocation is. > >>> ROCm has various components that independently claim authority over save > >>> area size. > >>> > >>> Unless KFD decides to claim single authority, relax size checks with a > >>> warning on over allocation. > >> > >> Do we want any sort of upper limit? > > > > Mmm. I'd expect early failure on unreasonable user request for over > > allocation at the allocation stage prior to acquire on queue create stage > > and acquire should be bound to what was registered/allocated. > > So I'm not sure what an upper bound acquire limit check could serve. > > > > One thing to consider is that at the end of the CWSR area, there is > space dedicated for the debugger (used to implement displaced-stepping > buffers). Initially, when KFD did not check the CWSR area size, it was > just a userspace matter to increase that size if the debugger needed > extensions. With this new change, we can't extend this buffer without a > KFD update. > > Having a "check that the buffer the thunk provided is big enough" policy > would still allow us to extend this allocation if necessary. >
Fair enough. We can probably drop the warning as well in that case. Alex > Best, > Lancelot. > > >> > >>> > >>> Signed-off-by: Jonathan Kim <[email protected]> > >>> --- > >>> drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------ > >>> 1 file changed, 11 insertions(+), 6 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > >> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > >>> index a65c67cf56ff..448043bc2937 100644 > >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > >>> @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct > >> kfd_process_device *pdd, struct queue_prope > >>> goto out_err_unreserve; > >>> } > >>> > >>> - if (properties->ctx_save_restore_area_size != topo_dev- > >>> node_props.cwsr_size) { > >>> - pr_debug("queue cwsr size 0x%x not equal to node cwsr > >>> size 0x%x\n", > >>> + if (properties->ctx_save_restore_area_size < topo_dev- > >>> node_props.cwsr_size) { > >>> + pr_debug("queue cwsr size 0x%x not sufficient for node > >>> cwsr size > >> 0x%x\n", > >>> properties->ctx_save_restore_area_size, > >>> topo_dev->node_props.cwsr_size); > >>> err = -EINVAL; > >>> goto out_err_unreserve; > >>> } > >>> > >>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev- > >>> node_props.debug_memory_size) > >>> - * NUM_XCC(pdd->dev->xcc_mask); > >>> + if (properties->ctx_save_restore_area_size > topo_dev- > >>> node_props.cwsr_size) > >>> + pr_warn_ratelimited("queue cwsr size 0x%x exceeds > >>> recommended > >> node cwsr size 0x%x\n", > >>> + > >>> properties->ctx_save_restore_area_size, > >>> + topo_dev->node_props.cwsr_size); > >> > >> We can probably drop the warning here. > > > > Acked. > > > > Thanks. > > > > Jon > > > >> > >> Alex > >> > >>> + > >>> + total_cwsr_size = (properties->ctx_save_restore_area_size + > >>> + topo_dev->node_props.debug_memory_size) * > >>> NUM_XCC(pdd- > >>> dev->xcc_mask); > >>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE); > >>> > >>> err = kfd_queue_buffer_get(vm, (void *)properties- > >>> ctx_save_restore_area_address, > >>> @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct > >> kfd_process_device *pdd, struct queue_prope > >>> topo_dev = kfd_topology_device_by_id(pdd->dev->id); > >>> if (!topo_dev) > >>> return -EINVAL; > >>> - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev- > >>> node_props.debug_memory_size) > >>> - * NUM_XCC(pdd->dev->xcc_mask); > >>> + total_cwsr_size = (properties->ctx_save_restore_area_size + > >>> + topo_dev->node_props.debug_memory_size) * > >>> NUM_XCC(pdd- > >>> dev->xcc_mask); > >>> total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE); > >>> > >>> kfd_queue_buffer_svm_put(pdd, properties- > >>> ctx_save_restore_area_address, total_cwsr_size); > >>> -- > >>> 2.34.1 > >>> >
