On 2023-01-09 15:23, Felix Kuehling wrote:
Am 2023-01-09 um 15:18 schrieb Philip Yang:

On 2023-01-09 14:27, Eric Huang wrote:
There will be data corruption on vram allocated by svm
if initialization is not being done. Adding sync is to
resolve this issue.

Signed-off-by: Eric Huang <jinhuieric.hu...@amd.com>
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 7 +++++++
  1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index b8c9753a4818..344e20306635 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -574,6 +574,13 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange,
          goto reserve_bo_failed;
      }

Thanks for catching this bug, we could optimize as clear VRAM is only needed for partial migration, ex pass the clear flag clear = (cpages != npages) from svm_migrate_vma_to_vram -> svm_migrate_copy_to_vram -> svm_range_vram_node_new.

I only see one call to svm_range_vram_node_new, and it passes "true" unconditionally. What am I missing?
I just realize that if all range pages are migrated to VRAM, we don't need clear VRAM before migration.

That said, if VRAM is not cleared, there will be no fence to wait for, so the amdgpu_bo_sync_wait call is basically a no-op with a little bit of overhead for creating and destroying an empty sync object.

We pass true unconditionally right now, change it to conditional for partial migration only, then sync will be a no-op if all pages of range are migrated.

Regards,

Philip


Regards,
  Felix



Regards,

Philip

+    r = amdgpu_bo_sync_wait(bo, AMDGPU_FENCE_OWNER_KFD, false);
+    if (r) {
+        pr_debug("failed %d to sync bo\n", r);
+        amdgpu_bo_unreserve(bo);
+        goto reserve_bo_failed;
+    }
+
      r = dma_resv_reserve_fences(amdkcl_ttm_resvp(&bo->tbo), 1);
      if (r) {
          pr_debug("failed %d to reserve bo\n", r);

Reply via email to