[Public] > -----Original Message----- > From: Alex Deucher <[email protected]> > Sent: Friday, November 7, 2025 9:36 AM > To: Kim, Jonathan <[email protected]> > Cc: [email protected]; Kuehling, Felix <[email protected]>; > Six, Lancelot <[email protected]>; Yang, Philip <[email protected]> > Subject: Re: [PATCH] drm/amdkfd: relax checks for over allocation of save area > > Caution: This message originated from an External Source. Use proper caution > when opening attachments, clicking links, or responding. > > > On Thu, Nov 6, 2025 at 3:46 PM Jonathan Kim <[email protected]> wrote: > > > > Over allocation of save area is not fatal, only under allocation is. > > ROCm has various components that independently claim authority over save > > area size. > > > > Unless KFD decides to claim single authority, relax size checks with a > > warning on over allocation. > > Do we want any sort of upper limit?
Mmm. I'd expect early failure on unreasonable user request for over allocation at the allocation stage prior to acquire on queue create stage and acquire should be bound to what was registered/allocated. So I'm not sure what an upper bound acquire limit check could serve. > > > > > Signed-off-by: Jonathan Kim <[email protected]> > > --- > > drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 17 +++++++++++------ > > 1 file changed, 11 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > > index a65c67cf56ff..448043bc2937 100644 > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c > > @@ -297,16 +297,21 @@ int kfd_queue_acquire_buffers(struct > kfd_process_device *pdd, struct queue_prope > > goto out_err_unreserve; > > } > > > > - if (properties->ctx_save_restore_area_size != topo_dev- > >node_props.cwsr_size) { > > - pr_debug("queue cwsr size 0x%x not equal to node cwsr size > > 0x%x\n", > > + if (properties->ctx_save_restore_area_size < topo_dev- > >node_props.cwsr_size) { > > + pr_debug("queue cwsr size 0x%x not sufficient for node cwsr > > size > 0x%x\n", > > properties->ctx_save_restore_area_size, > > topo_dev->node_props.cwsr_size); > > err = -EINVAL; > > goto out_err_unreserve; > > } > > > > - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev- > >node_props.debug_memory_size) > > - * NUM_XCC(pdd->dev->xcc_mask); > > + if (properties->ctx_save_restore_area_size > topo_dev- > >node_props.cwsr_size) > > + pr_warn_ratelimited("queue cwsr size 0x%x exceeds > > recommended > node cwsr size 0x%x\n", > > + properties->ctx_save_restore_area_size, > > + topo_dev->node_props.cwsr_size); > > We can probably drop the warning here. Acked. Thanks. Jon > > Alex > > > + > > + total_cwsr_size = (properties->ctx_save_restore_area_size + > > + topo_dev->node_props.debug_memory_size) * > > NUM_XCC(pdd- > >dev->xcc_mask); > > total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE); > > > > err = kfd_queue_buffer_get(vm, (void *)properties- > >ctx_save_restore_area_address, > > @@ -352,8 +357,8 @@ int kfd_queue_release_buffers(struct > kfd_process_device *pdd, struct queue_prope > > topo_dev = kfd_topology_device_by_id(pdd->dev->id); > > if (!topo_dev) > > return -EINVAL; > > - total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev- > >node_props.debug_memory_size) > > - * NUM_XCC(pdd->dev->xcc_mask); > > + total_cwsr_size = (properties->ctx_save_restore_area_size + > > + topo_dev->node_props.debug_memory_size) * > > NUM_XCC(pdd- > >dev->xcc_mask); > > total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE); > > > > kfd_queue_buffer_svm_put(pdd, properties- > >ctx_save_restore_area_address, total_cwsr_size); > > -- > > 2.34.1 > >
