On 3/23/26 05:28, Donet Tom wrote: > Currently, AMDGPU_VA_RESERVED_TRAP_SIZE is hardcoded to 8KB, while > KFD_CWSR_TBA_TMA_SIZE is defined as 2 * PAGE_SIZE. On systems with > 4K pages, both values match (8KB), so allocation and reserved space > are consistent. > > However, on 64K page-size systems, KFD_CWSR_TBA_TMA_SIZE becomes 128KB, > while the reserved trap area remains 8KB. This mismatch causes the > kernel to crash when running rocminfo or rccl unit tests. > > Kernel attempted to read user page (2) - exploit attempt? (uid: 1001) > BUG: Kernel NULL pointer dereference on read at 0x00000002 > Faulting instruction address: 0xc0000000002c8a64 > Oops: Kernel access of bad area, sig: 11 [#1] > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > CPU: 34 UID: 1001 PID: 9379 Comm: rocminfo Tainted: G E > 6.19.0-rc4-amdgpu-00320-gf23176405700 #56 VOLUNTARY > Tainted: [E]=UNSIGNED_MODULE > Hardware name: IBM,9105-42A POWER10 (architected) 0x800200 0xf000006 > of:IBM,FW1060.30 (ML1060_896) hv:phyp pSeries > NIP: c0000000002c8a64 LR: c00000000125dbc8 CTR: c00000000125e730 > REGS: c0000001e0957580 TRAP: 0300 Tainted: G E > MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24008268 > XER: 00000036 > CFAR: c00000000125dbc4 DAR: 0000000000000002 DSISR: 40000000 > IRQMASK: 1 > GPR00: c00000000125d908 c0000001e0957820 c0000000016e8100 > c00000013d814540 > GPR04: 0000000000000002 c00000013d814550 0000000000000045 > 0000000000000000 > GPR08: c00000013444d000 c00000013d814538 c00000013d814538 > 0000000084002268 > GPR12: c00000000125e730 c000007e2ffd5f00 ffffffffffffffff > 0000000000020000 > GPR16: 0000000000000000 0000000000000002 c00000015f653000 > 0000000000000000 > GPR20: c000000138662400 c00000013d814540 0000000000000000 > c00000013d814500 > GPR24: 0000000000000000 0000000000000002 c0000001e0957888 > c0000001e0957878 > GPR28: c00000013d814548 0000000000000000 c00000013d814540 > c0000001e0957888 > NIP [c0000000002c8a64] __mutex_add_waiter+0x24/0xc0 > LR [c00000000125dbc8] __mutex_lock.constprop.0+0x318/0xd00 > Call Trace: > 0xc0000001e0957890 (unreliable) > __mutex_lock.constprop.0+0x58/0xd00 > amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x6fc/0xb60 [amdgpu] > kfd_process_alloc_gpuvm+0x54/0x1f0 [amdgpu] > kfd_process_device_init_cwsr_dgpu+0xa4/0x1a0 [amdgpu] > kfd_process_device_init_vm+0xd8/0x2e0 [amdgpu] > kfd_ioctl_acquire_vm+0xd0/0x130 [amdgpu] > kfd_ioctl+0x514/0x670 [amdgpu] > sys_ioctl+0x134/0x180 > system_call_exception+0x114/0x300 > system_call_vectored_common+0x15c/0x2ec > > This patch changes AMDGPU_VA_RESERVED_TRAP_SIZE to 2 * PAGE_SIZE, > ensuring that the reserved trap area matches the allocation size > across all page sizes. > > cc: [email protected] > Fixes: 34a1de0f7935 ("drm/amdkfd: Relocate TBA/TMA to opposite side of VM > hole") > Reviewed-by: Ritesh Harjani (IBM) <[email protected]> > Signed-off-by: Donet Tom <[email protected]> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > index 139642eacdd0..a5eae49f9471 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h > @@ -173,7 +173,7 @@ struct amdgpu_bo_vm; > #define AMDGPU_VA_RESERVED_SEQ64_SIZE (2ULL << 20) > #define AMDGPU_VA_RESERVED_SEQ64_START(adev) > (AMDGPU_VA_RESERVED_CSA_START(adev) \ > - > AMDGPU_VA_RESERVED_SEQ64_SIZE) > -#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << 12) > +#define AMDGPU_VA_RESERVED_TRAP_SIZE (2ULL << PAGE_SHIFT)
Well using PAGE_SHIFT in amdgpu_vm.h looks quite broken to me. That makes the GPU VA reservation depend on the CPU page size and that is clearly not something we want to have. Where is KFD_CWSR_TBA_TMA_SIZE defined? Regards, Christian. > #define AMDGPU_VA_RESERVED_TRAP_START(adev) > (AMDGPU_VA_RESERVED_SEQ64_START(adev) \ > - AMDGPU_VA_RESERVED_TRAP_SIZE) > #define AMDGPU_VA_RESERVED_BOTTOM (1ULL << 16)
