On 2021-11-25 2:11 a.m., Christian König wrote:
Am 25.11.21 um 02:56 schrieb Felix Kuehling:
Am 2021-11-24 um 5:58 p.m. schrieb Philip Yang:
[SNIP]
  #define amdgpu_ih_get_wptr(adev, ih) (adev)->irq.ih_funcs->get_wptr((adev), (ih))
  #define amdgpu_ih_decode_iv(adev, iv) \
      (adev)->irq.ih_funcs->decode_iv((adev), (ih), (iv))
+#define amdgpu_ih_decode_iv_ts(adev, ih, rptr, offset) \
+    (WARN_ON_ONCE(!(adev)->irq.ih_funcs->decode_iv_ts) ? 0 : \
+    (adev)->irq.ih_funcs->decode_iv_ts((ih), (rptr), (offset)))
  #define amdgpu_ih_set_rptr(adev, ih) (adev)->irq.ih_funcs->set_rptr((adev), (ih))
    int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
@@ -89,10 +99,12 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih,
  void amdgpu_ih_ring_fini(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih);
  void amdgpu_ih_ring_write(struct amdgpu_ih_ring *ih, const uint32_t *iv,
                unsigned int num_dw);
-int amdgpu_ih_wait_on_checkpoint_process(struct amdgpu_device *adev,
-                    struct amdgpu_ih_ring *ih);
+int amdgpu_ih_wait_on_checkpoint_process_ts(struct amdgpu_device *adev,
+                        struct amdgpu_ih_ring *ih);
  int amdgpu_ih_process(struct amdgpu_device *adev, struct amdgpu_ih_ring *ih);
  void amdgpu_ih_decode_iv_helper(struct amdgpu_device *adev,
                  struct amdgpu_ih_ring *ih,
                  struct amdgpu_iv_entry *entry);
+uint64_t amdgpu_ih_decode_iv_ts_helper(struct amdgpu_ih_ring *ih, u32 rptr,
+                       signed int offset);
  #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 3ec5ff5a6dbe..b129898db433 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -119,6 +119,11 @@ static int gmc_v10_0_process_interrupt(struct amdgpu_device *adev,
              return 1;
          }
  +        /* Stale retry fault if timestamp goes backward */
+        if (entry->ih == &adev->irq.ih1 &&
+            amdgpu_ih_ts_after(entry->timestamp, entry->ih->processed_timestamp))
+            return 1;
+
This check should go before amdgpu_gmc_filter_faults. Otherwise
amdgpu_gmc_filter_faults may later drop a real fault because it added a
stale fault in its hash table.

I was already wondering if we shouldn't move all of this completely into amdgpu_gmc_filter_faults().

I mean essentially we are filtering faults here once more, just based on a different criteria.

It is good idea, it also removes duplicate code in different interrupt handler. And retry fault timestamp check for both ring0 and ring1.

Thanks,

Philip


Regards,
Christian.

Reply via email to