On 4/9/26 16:19, Amir Shetaia wrote:
> KFD VRAM allocations only set AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE
> (clear on free) but not AMDGPU_GEM_CREATE_VRAM_CLEARED (clear on
> create). This means freshly allocated VRAM BOs contain stale data
> from prior use, which is observable by GPU compute kernels.
>
> The GEM ioctl path unconditionally sets VRAM_CLEARED, but the KFD
> path was missing this flag.
>
> This causes data corruption in applications that depend on
> VMM-allocated memory being zero-initialized, such as RCCL P2P
> transport where stale data in ptrExchange/head/tail fields leads
> to HSA_STATUS_ERROR_MEMORY_FAULT crashes.
>
> Signed-off-by: Amir Shetaia <[email protected]>

Reviewed-by: Christian König <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 8a869fe41acd..7c01492e69dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1735,7 +1735,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>  		alloc_domain = AMDGPU_GEM_DOMAIN_GTT;
>  		alloc_flags = 0;
>  	} else {
> -		alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
> +		alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE |
> +			AMDGPU_GEM_CREATE_VRAM_CLEARED;
>  		alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) ?
>  			AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
>
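
For anyone following along, here is a minimal userspace sketch of how the VRAM branch composes its flags once the patch is applied. It is not the driver code. The DEMO_* bit values are stand-ins chosen for illustration, not copied from the uapi headers, and demo_vram_alloc_flags() is a hypothetical helper that exists only in this example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in bit values for illustration only. The real flag definitions
 * live in the amdgpu and KFD uapi headers and are not reproduced here. */
#define DEMO_CPU_ACCESS_REQUIRED   (1u << 0)
#define DEMO_VRAM_CLEARED          (1u << 1)  /* clear on create */
#define DEMO_VRAM_WIPE_ON_RELEASE  (1u << 2)  /* clear on free */
#define DEMO_KFD_FLAGS_PUBLIC      (1u << 3)  /* caller asks for CPU-visible VRAM */

/* The two clearing flags must be independent bits for the OR to make sense. */
static_assert((DEMO_VRAM_CLEARED & DEMO_VRAM_WIPE_ON_RELEASE) == 0,
	      "clear-on-create and clear-on-free must be distinct bits");

/* Hypothetical helper mirroring the patched VRAM branch: every VRAM
 * allocation is cleared both on create and on release, and CPU access
 * is requested only when the caller passed the public flag. */
static uint64_t demo_vram_alloc_flags(uint32_t kfd_flags)
{
	uint64_t alloc_flags = DEMO_VRAM_WIPE_ON_RELEASE | DEMO_VRAM_CLEARED;

	alloc_flags |= (kfd_flags & DEMO_KFD_FLAGS_PUBLIC) ?
		DEMO_CPU_ACCESS_REQUIRED : 0;
	return alloc_flags;
}
```

Before the patch, the equivalent of demo_vram_alloc_flags(0) would have carried only the wipe-on-release bit, so a freshly created BO could still expose whatever the previous owner left behind.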
