KFD VRAM allocations set only AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE
(clear on free), not AMDGPU_GEM_CREATE_VRAM_CLEARED (clear on create).
As a result, freshly allocated VRAM BOs can contain stale data from
prior use, and that data is observable by GPU compute kernels.

The GEM ioctl path unconditionally sets VRAM_CLEARED, but the KFD path
was missing this flag. This causes data corruption in applications that
depend on VMM-allocated memory being zero-initialized, such as RCCL P2P
transport where stale data in ptrExchange/head/tail fields leads to
HSA_STATUS_ERROR_MEMORY_FAULT crashes.

Signed-off-by: Amir Shetaia <[email protected]>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 8a869fe41acd..7c01492e69dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1735,7 +1735,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 			alloc_domain = AMDGPU_GEM_DOMAIN_GTT;
 			alloc_flags = 0;
 		} else {
-			alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
+			alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE |
+				AMDGPU_GEM_CREATE_VRAM_CLEARED;
 			alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) ?
 			AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
-- 
2.43.0
