On 12/12/25 2:34 PM, Christian König wrote:
On 12/12/25 07:40, Donet Tom wrote:
The ctl_stack_size and wg_data_size values are used to compute the total
context save/restore buffer size and the control stack size. These buffers
are programmed into the GPU and are used to store the queue state during
context save and restore.

Currently, both ctl_stack_size and wg_data_size are aligned to the CPU
PAGE_SIZE. On systems with a non-4K CPU page size, this causes unnecessary
memory waste because the GPU internally calculates and uses buffer sizes
aligned to a fixed 4K GPU page size.

Since the control stack and context save/restore buffers are consumed by
the GPU, their sizes should be aligned to the GPU page size (4K), not the
CPU page size. This patch updates the alignment of ctl_stack_size and
wg_data_size to prevent over-allocation on systems with larger CPU page
sizes.
As far as I know the problem is that the debugger needs to consume that stuff 
on the CPU side as well.


Thank you for your help.

As mentioned earlier, we were observing queue preemption failures and GPU hangs. To address this we introduced this patch, and with the 7/8 and 8/8 patches applied those issues have not been seen again.

While debugging the GPU hang issue, I made some additional observations.

On my system, I booted a kernel with a 4 KB system page size and modified both the ROCR runtime and the GPU driver to set the control stack size to 64 KB. Even on a 4 KB page-size system, using a 64 KB control stack size reliably reproduces the queue preemption failure when running RCCL unit tests on 8 GPUs. This suggests that the issue is not related to the system page size, but rather to the control stack size being exactly 64 KB.

When the control stack size is set to 64 KB ± 4 KB (i.e., 60 KB or 68 KB), the tests pass on both 4 KB and 64 KB system page-size configurations.

For gfxv9, is there any documented hardware limitation on the control stack size? Specifically, is it valid to use a control stack size of exactly 64 KB?



I need to double check that, but I think the alignment is correct as it is.


The control stack is part of the context save-restore buffer, and we configure it on the GPU as shown below:

m->cp_hqd_ctx_save_base_addr_lo = lower_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_base_addr_hi = upper_32_bits(q->ctx_save_restore_area_address);
m->cp_hqd_ctx_save_size = q->ctx_save_restore_area_size;
m->cp_hqd_cntl_stack_size = q->ctl_stack_size;
m->cp_hqd_cntl_stack_offset = q->ctl_stack_size;
m->cp_hqd_wg_state_offset = q->ctl_stack_size;
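
To visualize the layout this programming implies (my reading of the register fields above; offsets are relative to ctx_save_restore_area_address):

  offset 0                      ctl_stack_size       ctx_save_restore_area_size
  +-----------------------------+-----------------------------+
  |        control stack        |          WG state           |
  +-----------------------------+-----------------------------+
  ^ cp_hqd_ctx_save_base_addr   ^ cp_hqd_cntl_stack_offset
                                  (= cp_hqd_wg_state_offset)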

The control stack occupies the region from cp_hqd_cntl_stack_offset down to 0 within the context save/restore area, and the remaining space is used for WG state. This buffer is fully managed by the GPU during preemption and restore operations. The control stack size is calculated from the hardware configuration (CU count and wave count); on gfxv9, for example, it is typically around 32 KB. If we align this size to the system page size (e.g., 64 KB), two issues arise, as the worked example after this list illustrates:

1. Unnecessary memory overhead.
2. Potential queue preemption issues.
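
A quick back-of-the-envelope example of issue 1, using purely illustrative numbers (the wave count and bytes-per-wave below are placeholders, not taken from any specific ASIC):

  wave_num       = 4096                        /* placeholder */
  bytes_per_wave = 8                           /* placeholder */
  ctl_stack_size = 4096 * 8 + 8 = 32776 bytes  (~32 KB)

  ALIGN(32776, 4 KB GPU page)  = 36 KB
  ALIGN(32776, 64 KB CPU page) = 64 KB

With a 64 KB CPU page size this wastes ~28 KB per queue and, notably, lands exactly on the 64 KB control stack size that reproduces the preemption failure described above.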

On the CPU side, we copy the control stack contents into other buffers for processing. Since the control stack size is derived from the hardware configuration, aligning it to the GPU page size seems more appropriate: aligning to the system page size only wastes memory, while GPU page-size alignment stays consistent with what the hardware actually uses.
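
For reference, a minimal sketch of that CPU-side consumption; the helper name and calling context here are hypothetical, not the actual kernel functions:

/* Hypothetical sketch: CPU-side processing needs only the exact
 * ctl_stack_size bytes. Nothing here requires the size to be
 * CPU-page aligned; the control stack occupies [0, ctl_stack_size)
 * of the CWSR area regardless of how the allocation was rounded up.
 */
static void copy_ctl_stack(void *dst, const void *cwsr_area, u32 ctl_stack_size)
{
	memcpy(dst, cwsr_area, ctl_stack_size);
}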

Would you agree that aligning the control stack size to the GPU page size is the right approach? Or do you see any concerns with this method?



Regards,
Christian.

Signed-off-by: Donet Tom <[email protected]>
---
  drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
index dc857450fa16..00ab941c3e86 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -445,10 +445,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 		    min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512)
 		    : cu_num * 32;
-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+				AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);
 
 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -460,7 +461,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);
 	if (gfxv == 80002) /* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;
