On 23.08.2018 22:21, Marek Olšák wrote:
On Thu, Aug 23, 2018 at 2:51 AM Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:
On 22.08.2018 21:32, Marek Olšák wrote:
On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeuc...@gmail.com> wrote:
On Wed, Aug 22, 2018 at 6:05 AM Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:
Instead of hammering hard on the GPU, try a soft recovery first.
v2: reorder code a bit
Signed-off-by: Christian König <christian.koe...@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++++
3 files changed, 34 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 265ff90f4e01..d93e31a5c4e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
	struct amdgpu_job *job = to_amdgpu_job(s_job);
+	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+		DRM_ERROR("ring %s timeout, but soft recovered\n",
+			  s_job->sched->name);
+		return;
+	}
I think we should still bubble up the error to userspace even if we
can recover. Data is lost when the wave is killed. We should treat
it like a GPU reset.
Yes, please increment gpu_reset_counter, so that we are compliant with
OpenGL. Being able to recover from infinite loops is great, but test
suites also expect this to be properly reported to userspace via the
per-context query.
Sure, that shouldn't be a problem.
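Roughly that would only need the timeout handler to bump the counter before bailing out, e.g. (untested sketch on top of this patch, reusing the adev->gpu_reset_counter the full reset path already increments):

	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
		/* Report it like a reset so the per-context query sees it. */
		atomic_inc(&ring->adev->gpu_reset_counter);
		DRM_ERROR("ring %s timeout, but soft recovered\n",
			  s_job->sched->name);
		return;
	}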
Also please bump the deadline to 1 second. Even if you kill all
shaders, the IB can also contain CP DMA, which may take longer than 1
ms.
Is there any way we can get feedback from the SQ on whether the kill was
successful?
I don't think so. The kill should be finished pretty quickly, but more
waves with infinite loops may be waiting to be launched, so you still
need to repeat the kill command. And we should ideally repeat it for 1
second.
The reason is that vertex shader waves take a lot of time to launch. A
very very very large draw call can keep launching new waves for 1
second with the same infinite loop. You would have to soft-reset all
VGTs to stop that.
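Concretely that is just a bigger deadline in the new helper, something
like the following (exact value still to be agreed on, using
ktime_add_ms() instead of the 1000 us in the patch):

	/* Keep re-issuing the kill while the VGT can still launch new waves. */
	ktime_t deadline = ktime_add_ms(ktime_get(), 1000);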
1 second is way too long, since in the case of a blocked MC we need to
start the hard reset relatively fast.
10 seconds (the job timeout) have already passed by the time we get here anyway.
That is a good argument, but I still find 1 second way too long.
Keep in mind that we hammer on the SQ here to kill the waves as soon as
they are launched.
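For reference, the per-ring callback behind this is expected to be
little more than a targeted wave kill through SQ_CMD; a rough
GFX9-style sketch of what the backend patch would wire up (not part of
this patch, field values taken from the gfx9 register headers):

static void gfx_v9_0_ring_soft_recovery(struct amdgpu_ring *ring, unsigned vmid)
{
	struct amdgpu_device *adev = ring->adev;
	uint32_t value = 0;

	/* Ask the SQ to kill all waves belonging to the hung VMID. */
	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
	WREG32_SOC15(GFX, 0, mmSQ_CMD, value);
}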
I think that some hangs from corrupted descriptors may still be
recoverable just by killing waves.
Agreed, especially since Vega10 the memory paths have become much more
robust with regard to error handling.
Christian.
Marek
Regards,
Christian.
Marek
Marek
Alex
+
	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
		  job->base.sched->name,
		  atomic_read(&ring->fence_drv.last_seq),
		  ring->fence_drv.sync_seq);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5dfd26be1eec..c045a4e38ad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
	amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
}
+/**
+ * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
+ *
+ * @ring: ring to try the recovery on
+ * @vmid: VMID we try to get going again
+ * @fence: timedout fence
+ *
+ * Tries to get a ring proceeding again when it is stuck.
+ */
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+			       struct dma_fence *fence)
+{
+	ktime_t deadline = ktime_add_us(ktime_get(), 1000);
+
+	if (!ring->funcs->soft_recovery)
+		return false;
+
+	while (!dma_fence_is_signaled(fence) &&
+	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
+		ring->funcs->soft_recovery(ring, vmid);
+
+	return dma_fence_is_signaled(fence);
+}
+
/*
* Debugfs info
*/
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 409fdd9b9710..9cc239968e40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
	/* priority functions */
	void (*set_priority) (struct amdgpu_ring *ring,
			      enum drm_sched_priority priority);
+	/* Try to soft recover the ring to make the fence signal */
+	void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
};
struct amdgpu_ring {
@@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
						uint32_t reg0, uint32_t val0,
						uint32_t reg1, uint32_t val1);
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+			       struct dma_fence *fence);
static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
{
--
2.14.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx