On 12/12/25 2:34 PM, Christian König wrote:
On 12/12/25 07:40, Donet Tom wrote:
The ctl_stack_size and wg_data_size values are used to compute the total
context save/restore buffer size and the control stack size. These buffers
are programmed into the GPU and are used to store the queue state during
context save and restore.

Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
memory waste because the GPU internally calculates and uses buffer sizes
aligned to a fixed 4K GPU page size.

Since the control stack and context save/restore buffers are consumed by
the GPU, their sizes should be aligned to the GPU page size (4K), not the
CPU page size. This patch updates the alignment of ctl_stack_size and
wg_data_size to prevent over-allocation on systems with larger CPU page
sizes.
As far as I know the problem is that the debugger needs to consume that stuff 
on the CPU side as well.


Thank you for your help.

As mentioned earlier, we were observing queue preemption failures and GPU hangs. To address this we introduced this patch, and with the 7/8 and 8/8 patches applied those issues have not been seen again.

While debugging the GPU hang issue, I made some additional observations.

On my system, I booted a kernel with a 4 KB system page size and modified both the ROCR runtime and the GPU driver to set the control stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB control stack size reliably reproduces the queue preemption failure when running RCCL unit tests on 8 GPUs. This suggests that the issue is not related to the system page size, but rather to the control stack size being exactly 64 KB.

When the control stack size is set to 64 KB ± 4 KB (i.e., 60 KB or 68 KB), the tests pass on both 4 KB and 64 KB system page-size configurations.

For gfxv9, is there any documented hardware limitation on the control stack size? Specifically, is it valid to use a control stack size of exactly 64 KB?



I need to double check that, but I think the alignment is correct as it is.


The control stack is part of the context save-restore buffer, and we configure it on the GPU as shown below:

m->cp_hqd_ctx_save_base_addr_lo = lower_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_base_addr_hi = upper_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
m->cp_hqd_wg_state_offset = q->ctl_stack_size;
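
To visualize the layout this programming implies (my reading of the register fields above; offsets are relative to ctx_save_restore_area_address):

  offset 0                      ctl_stack_size       ctx_save_restore_area_size
  +-----------------------------+-----------------------------+
  |        control stack        |          WG state           |
  +-----------------------------+-----------------------------+
  ^ cp_hqd_ctx_save_base_addr   ^ cp_hqd_cntl_stack_offset
                                  (= cp_hqd_wg_state_offset)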

The control stack occupies the region from cp_hqd_cntl_stack_offset down to 0 within the context save/restore area, and the remaining space is used for WG state. This buffer is fully managed by the GPU during preemption and restore operations. The control stack size is calculated from the hardware configuration (CU count and wave count); on gfxv9, for example, it is typically around 32 KB. If we align this size to the system page size (e.g., 64 KB), two issues arise, as the worked example after this list illustrates:

1. Unnecessary memory overhead.
2. Potential queue preemption issues.
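
A quick back-of-the-envelope example of issue 1, using purely illustrative numbers (the wave count and bytes-per-wave below are placeholders, not taken from any specific ASIC):

  wave_num       = 4096                        /* placeholder */
  bytes_per_wave = 8                           /* placeholder */
  ctl_stack_size = 4096 * 8 + 8 = 32776 bytes  (~32 KB)

  ALIGN(32776, 4 KB GPU page)  = 36 KB
  ALIGN(32776, 64 KB CPU page) = 64 KB

With a 64 KB CPU page size this wastes ~28 KB per queue and, notably, lands exactly on the 64 KB control stack size that reproduces the preemption failure described above.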

On the CPU side, we copy the control stack contents into other buffers for processing. Since the control stack size is derived from the hardware configuration, aligning it to the GPU page size seems more appropriate: aligning to the system page size only wastes memory, while GPU page-size alignment stays consistent with what the hardware actually uses.
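
For reference, a minimal sketch of that CPU-side consumption; the helper name and calling context here are hypothetical, not the actual kernel functions:

/* Hypothetical sketch: CPU-side processing needs only the exact
 * ctl_stack_size bytes. Nothing here requires the size to be
 * CPU-page aligned; the control stack occupies [0, ctl_stack_size)
 * of the CWSR area regardless of how the allocation was rounded up.
 */
static void copy_ctl_stack(void *dst, const void *cwsr_area, u32 ctl_stack_size)
{
	memcpy(dst, cwsr_area, ctl_stack_size);
}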

Would you agree that aligning the control stack size to the GPU page size is the right approach? Or do you see any concerns with this method?



Regards,
Christian.

Signed-off-by: Donet Tom <[email protected]>
---
  drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index dc857450fa16..00ab941c3e86 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
 		    : cu_num * 32;
-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+				AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);
 
 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
 	if (gfxv == 80002) /* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;
