On 2022-11-02 10:58, Christian König wrote:
It can happen that we query the sequence value before the callback
had a chance to run.

Work around that by grabbing the fence lock and releasing it again.
Should be replaced by hw handling soon.

kfd_flush_tlb is always called after waiting for the map/unmap-to-GPU fence to signal. That means the callback has already executed and the sequence has already been incremented if a TLB flush is needed, so there is no such race from KFD.

I am not sure, but it seems the race does exist when amdgpu grabs the VM and schedules a job.

Acked-by: Philip Yang <[email protected]>

Signed-off-by: Christian König <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 9ecb7f663e19..e51a46c9582b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
   */
  static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
  {
+       unsigned long flags;
+       spinlock_t *lock;
+
+       /*
+        * Work around to stop racing between the fence signaling and handling
+        * the cb. The lock is static after initially setting it up, just make
+        * sure that the dma_fence structure isn't freed up.
+        */
+       rcu_read_lock();
+       lock = vm->last_tlb_flush->lock;
+       rcu_read_unlock();
+
+       spin_lock_irqsave(lock, flags);
+       spin_unlock_irqrestore(lock, flags);
+
        return atomic64_read(&vm->tlb_seq);
  }
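
For readers unfamiliar with the lock/unlock trick in the hunk above, here is a minimal userspace sketch of the same idea. It is not the kernel code. It assumes that the signaling side marks the fence as signaled and runs the callback that bumps the sequence while holding the fence lock. All `toy_*` names are hypothetical, and a pthread mutex stands in for the spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a fence: signaling and callbacks run under lock. */
struct toy_fence {
	pthread_mutex_t lock;
	atomic_bool signaled;
};

/* Hypothetical stand-in for the VM: only the fields the sketch needs. */
struct toy_vm {
	struct toy_fence *last_tlb_flush;
	atomic_uint_fast64_t tlb_seq;
};

/* Signal path: the flag becomes visible before the "callback" has run. */
static void toy_fence_signal(struct toy_vm *vm)
{
	struct toy_fence *f = vm->last_tlb_flush;

	pthread_mutex_lock(&f->lock);
	atomic_store(&f->signaled, true);
	atomic_fetch_add(&vm->tlb_seq, 1);	/* the callback's work */
	pthread_mutex_unlock(&f->lock);
}

/* Reader: taking and dropping the lock waits out an in-flight signal. */
static uint64_t toy_vm_tlb_seq(struct toy_vm *vm)
{
	pthread_mutex_t *lock = &vm->last_tlb_flush->lock;

	pthread_mutex_lock(lock);
	pthread_mutex_unlock(lock);

	return atomic_load(&vm->tlb_seq);
}

static void *toy_signaler(void *arg)
{
	toy_fence_signal(arg);
	return NULL;
}

/* Observe the signaled flag first, then query the sequence. */
static uint64_t toy_demo(void)
{
	struct toy_fence fence = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct toy_vm vm = { .last_tlb_flush = &fence };
	pthread_t thread;
	uint64_t seq;
	int rc;

	atomic_init(&fence.signaled, false);
	atomic_init(&vm.tlb_seq, 0);

	rc = pthread_create(&thread, NULL, toy_signaler, &vm);
	assert(rc == 0);
	(void)rc;

	while (!atomic_load(&fence.signaled))
		;	/* spin until the fence looks signaled */

	seq = toy_vm_tlb_seq(&vm);
	pthread_join(thread, NULL);
	return seq;
}
```

Without the lock/unlock pair in `toy_vm_tlb_seq()`, the reader could see the fence as signaled and still read the old sequence value. With it, the read cannot happen until the signaler has left its critical section.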
