On 4/9/26 16:19, Amir Shetaia wrote:
> KFD VRAM allocations only set AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE
> (clear on free) but not AMDGPU_GEM_CREATE_VRAM_CLEARED (clear on
> create). This means freshly allocated VRAM BOs contain stale data
> from prior use, which is observable by GPU compute kernels.
> 
> The GEM ioctl path unconditionally sets VRAM_CLEARED, but the KFD
> path was missing this flag.
> 
> This causes data corruption in applications that depend on
> VMM-allocated memory being zero-initialized, such as RCCL P2P
> transport where stale data in ptrExchange/head/tail fields leads
> to HSA_STATUS_ERROR_MEMORY_FAULT crashes.
> 
> Signed-off-by: Amir Shetaia <[email protected]>

Reviewed-by: Christian König <[email protected]>

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 8a869fe41acd..7c01492e69dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1735,7 +1735,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>                       alloc_domain = AMDGPU_GEM_DOMAIN_GTT;
>                       alloc_flags = 0;
>               } else {
> -                     alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE;
> +                     alloc_flags = AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE |
> +                             AMDGPU_GEM_CREATE_VRAM_CLEARED;
>                       alloc_flags |= (flags & KFD_IOC_ALLOC_MEM_FLAGS_PUBLIC) ?
>                       AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED : 0;
>  
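
For readers following the flag logic in the hunk above, here is a minimal standalone sketch of how the VRAM branch composes its allocation flags once the patch is applied. It is an illustration only, not kernel code: the SKETCH_* names, their bit values, and the helper sketch_vram_alloc_flags() are hypothetical stand-ins, and the real definitions in the kernel headers are not reproduced here.

```c
#include <stdint.h>

/* Placeholder bit values for illustration only; these are assumptions,
 * not the values defined in the kernel uapi headers. */
#define SKETCH_CPU_ACCESS_REQUIRED   (1u << 0)
#define SKETCH_VRAM_CLEARED          (1u << 1)
#define SKETCH_VRAM_WIPE_ON_RELEASE  (1u << 2)
#define SKETCH_ALLOC_FLAGS_PUBLIC    (1u << 3)

/* Mirrors the patched VRAM branch: request clear-on-create and
 * wipe-on-release together, then add CPU access only when the caller
 * asked for a public allocation. */
static uint64_t sketch_vram_alloc_flags(uint32_t kfd_flags)
{
	uint64_t alloc_flags = SKETCH_VRAM_WIPE_ON_RELEASE |
			       SKETCH_VRAM_CLEARED;

	alloc_flags |= (kfd_flags & SKETCH_ALLOC_FLAGS_PUBLIC) ?
		SKETCH_CPU_ACCESS_REQUIRED : 0;
	return alloc_flags;
}
```

With these stand-ins, a call with no input flags yields both clear bits and no CPU-access bit, which is the behavior the commit message describes for freshly allocated VRAM BOs.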
