On 2020-01-20 1:28 p.m., shaoyunl wrote:

On 2020-01-20 12:58 p.m., Felix Kuehling wrote:
On 2020-01-20 12:47 p.m., shaoyunl wrote:
comments in line .

On 2020-01-17 8:37 p.m., Felix Kuehling wrote:
Using a heavy-weight TLB flush once is not sufficient. Concurrent
memory accesses in the same TLB cache line can re-populate TLB entries
from stale texture cache (TC) entries while the heavy-weight TLB
flush is in progress. To fix this race condition, perform another TLB
flush after the heavy-weight one, when TC is known to be clean.

Move the workaround into the low-level TLB flushing functions. This way
they apply to amdgpu as well, and KIQ-based TLB flush only needs to
synchronize once.

CC: shaoyun....@amd.com
Signed-off-by: Felix Kuehling <felix.kuehl...@amd.com>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  6 +-
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 68 +++++++++++++++++-----
  2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 8609287620ea..5325f6b455f6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -647,13 +647,9 @@ int amdgpu_amdkfd_flush_gpu_tlb_vmid(struct kgd_dev *kgd, uint16_t vmid)   int amdgpu_amdkfd_flush_gpu_tlb_pasid(struct kgd_dev *kgd, uint16_t pasid)
  {
      struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
-    uint32_t flush_type = 0;
+    const uint32_t flush_type = 0;
      bool all_hub = false;
  -    if (adev->gmc.xgmi.num_physical_nodes &&
-        adev->asic_type == CHIP_VEGA20)
-        flush_type = 2;
-
      if (adev->family == AMDGPU_FAMILY_AI)
          all_hub = true;
  diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 90216abf14a4..e2a5e852bdb0 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -476,13 +476,26 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,
  {
      bool use_semaphore = gmc_v9_0_use_invalidate_semaphore(adev, vmhub);
      const unsigned eng = 17;
-    u32 j, inv_req, tmp;
+    u32 j, inv_req, inv_req2, tmp;
      struct amdgpu_vmhub *hub;
        BUG_ON(vmhub >= adev->num_vmhubs);
        hub = &adev->vmhub[vmhub];
-    inv_req = gmc_v9_0_get_invalidate_req(vmid, flush_type);
+    if (adev->gmc.xgmi.num_physical_nodes &&
+        adev->asic_type == CHIP_VEGA20) {
+        /* Vega20+XGMI caches PTEs in TC and TLB. Add a
+         * heavy-weight TLB flush (type 2), which flushes
+         * both. Due to a race condition with concurrent
+         * memory accesses using the same TLB cache line, we
+         * still need a second TLB flush after this.
+         */
+        inv_req = gmc_v9_0_get_invalidate_req(vmid, 2);
+        inv_req2 = gmc_v9_0_get_invalidate_req(vmid, flush_type);

[shaoyunl]  For the send invalidation in this situation ,can we use 0  for the flush type directly ? I think no matter what's the input flush_type for this function , heavy-weight + legacy invalidation should be enough for all of them .

I'm not sure that's true. In the case of the race condition, there was some concurrent memory access during the first heavy-weight invalidation. If that is now flushed in the second invalidation, and a heavy-weight invalidation was requested, we should also flush any TC cache lines associated with that access. So hard-coding flush_type 0 here is probably not safe for all cases.

Regards,
  Felix

[shaoyunl]   Originally we use the  heavy-weight invalidation for XGMI here is due to the HW issue which always use NC even for remote GPU memory access (this lead walker to load the TLB directly from TC with stale value) . The heavy-weight  will set invalidate bit for both TLB and  TC so this will make the walker to load from main memory . Your change is based on the assumption that after first heavy-weight invalidation , the TC already load with  correct contents which seems  should be true , so in this situation I think the light-weight or even legacy invalidation will be  enough since they will load from TC to TLB directly .

With this change, if you request a legacy invalidation (currently we always do), you'll get a heavy-weight followed by a legacy invalidation.

I'm working on other changes that will require a heavy-weight TLB flush even without this workaround. In this case I believe the second flush will need to be heavy-weight as well.

Regards,
  Felix



Regards

shaoyun.liu




+    } else {
+        inv_req = gmc_v9_0_get_invalidate_req(vmid, flush_type);
+        inv_req2 = 0;
+    }
        /* This is necessary for a HW workaround under SRIOV as well
       * as GFXOFF under bare metal
@@ -521,21 +534,27 @@ static void gmc_v9_0_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid,               DRM_ERROR("Timeout waiting for sem acquire in VM flush!\n");
      }
  -    WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, inv_req);
+    do {
+        WREG32_NO_KIQ(hub->vm_inv_eng0_req + eng, inv_req);
  -    /*
-     * Issue a dummy read to wait for the ACK register to be cleared
-     * to avoid a false ACK due to the new fast GRBM interface.
-     */
-    if (vmhub == AMDGPU_GFXHUB_0)
-        RREG32_NO_KIQ(hub->vm_inv_eng0_req + eng);
+        /*
+         * Issue a dummy read to wait for the ACK register to
+         * be cleared to avoid a false ACK due to the new fast
+         * GRBM interface.
+         */
+        if (vmhub == AMDGPU_GFXHUB_0)
+            RREG32_NO_KIQ(hub->vm_inv_eng0_req + eng);
  -    for (j = 0; j < adev->usec_timeout; j++) {
-        tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_ack + eng);
-        if (tmp & (1 << vmid))
-            break;
-        udelay(1);
-    }
+        for (j = 0; j < adev->usec_timeout; j++) {
+            tmp = RREG32_NO_KIQ(hub->vm_inv_eng0_ack + eng);
+            if (tmp & (1 << vmid))
+                break;
+            udelay(1);
+        }
+
+        inv_req = inv_req2;
+        inv_req2 = 0;
+    } while (inv_req);
        /* TODO: It needs to continue working on debugging with semaphore for GFXHUB as well. */
      if (use_semaphore)
@@ -577,9 +596,26 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
          return -EIO;
        if (ring->sched.ready) {
+        /* Vega20+XGMI caches PTEs in TC and TLB. Add a
+         * heavy-weight TLB flush (type 2), which flushes
+         * both. Due to a race condition with concurrent
+         * memory accesses using the same TLB cache line, we
+         * still need a second TLB flush after this.
+         */
+        bool vega20_xgmi_wa = (adev->gmc.xgmi.num_physical_nodes &&
+                       adev->asic_type == CHIP_VEGA20);
+        /* 2 dwords flush + 8 dwords fence */
+        unsigned int ndw = kiq->pmf->invalidate_tlbs_size + 8;
+
+        if (vega20_xgmi_wa)
+            ndw += kiq->pmf->invalidate_tlbs_size;
+
          spin_lock(&adev->gfx.kiq.ring_lock);
          /* 2 dwords flush + 8 dwords fence */
-        amdgpu_ring_alloc(ring, kiq->pmf->invalidate_tlbs_size + 8);
+        amdgpu_ring_alloc(ring, ndw);
+        if (vega20_xgmi_wa)
+            kiq->pmf->kiq_invalidate_tlbs(ring,
+                              pasid, 2, all_hub);
          kiq->pmf->kiq_invalidate_tlbs(ring,
                      pasid, flush_type, all_hub);
          amdgpu_fence_emit_polling(ring, &seq);
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to