amdgpu: stop all rings before doing gpu recover

Andrey Grodzovsky Wed, 28 Feb 2018 09:42:33 -0800

After looking into this some more:

"Failed to initialize parser !" is expected, since jobs from the guilty context (the context that caused the GPU hang) are canceled.


Locking imbalance is actually gone with Monk's patches.

Thanks,

Andrey


On 02/28/2018 11:40 AM, Andrey Grodzovsky wrote:

No new issues found with those patches when testing GPU reset using the libdrm deadlock detection test on Ellesmere.


The patches are Tested-By: Andrey Grodzovsky <andrey.grodzov...@amd.com>

P.S

Noticed existing issues (before Monk's patches)

Multiple [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !


And an occasional unlock imbalance from amdgpu_cs_ioctl

DEBUG_LOCKS_WARN_ON(depth <= 0)

[ 93.069011 < 0.000017>] WARNING: CPU: 3 PID: 2215 at kernel/locking/lockdep.c:3682 lock_release+0x2e8/0x360


On CZ, a full reset hangs the system.

Gonna take a look at those issues.

Thanks,

Andrey


On 02/28/2018 08:31 AM, Liu, Monk wrote:

Already sent

-----Original Message-----
From: Grodzovsky, Andrey
Sent: February 28, 2018 21:31
To: Koenig, Christian <christian.koe...@amd.com>; Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover

Will do once Monk sends V2 for [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ


Andrey


On 02/28/2018 07:20 AM, Christian König wrote:

Andrey please give this set a good testing as well.

On 28.02.2018 at 08:21, Monk Liu wrote:

Found that recover_vram_from_shadow sometimes gets executed in parallel
with the SDMA scheduler; all schedulers should be stopped before doing
gpu reset/recover.

Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
Signed-off-by: Monk Liu <monk....@amd.com>

For now this patch is Reviewed-by: Christian König <christian.koe...@amd.com>.

Regards,
Christian.

---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +++++++++++-------------------
   1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 75d1733..e9d81a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

       /* block TTM */
       resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
+
       /* store modesetting */
       if (amdgpu_device_has_dc_support(adev))
           state = drm_atomic_helper_suspend(adev->ddev);

-    /* block scheduler */
+    /* block all schedulers and reset given job's ring */
       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
           struct amdgpu_ring *ring = adev->rings[i];

           if (!ring || !ring->sched.thread)
               continue;

-        /* only focus on the ring hit timeout if &job not NULL */
+        kthread_park(ring->sched.thread);
+
           if (job && job->ring->idx != i)
               continue;

-        kthread_park(ring->sched.thread);
           drm_sched_hw_job_reset(&ring->sched, &job->base);

           /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
@@ -2707,33 +2708,22 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
               }
               dma_fence_put(fence);
           }
+    }

-        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-            struct amdgpu_ring *ring = adev->rings[i];
-
-            if (!ring || !ring->sched.thread)
-                continue;
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+        struct amdgpu_ring *ring = adev->rings[i];

-            /* only focus on the ring hit timeout if &job not NULL */
-            if (job && job->ring->idx != i)
-                continue;
+        if (!ring || !ring->sched.thread)
+            continue;

+        /* only need recovery sched of the given job's ring
+         * or all rings (in the case @job is NULL)
+         * after above amdgpu_reset accomplished
+         */
+        if ((!job || job->ring->idx == i) && !r)
               drm_sched_job_recovery(&ring->sched);
-            kthread_unpark(ring->sched.thread);
-        }
-    } else {
-        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-            struct amdgpu_ring *ring = adev->rings[i];

-            if (!ring || !ring->sched.thread)
-                continue;
-
-            /* only focus on the ring hit timeout if &job not NULL */
-            if (job && job->ring->idx != i)
-                continue;
-
-            kthread_unpark(adev->rings[i]->sched.thread);
-        }
+        kthread_unpark(ring->sched.thread);
       }

       if (amdgpu_device_has_dc_support(adev)) {
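
For readers following the control flow rather than the hunks, the ordering this patch aims for can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct mock_ring, mock_gpu_recover(), MAX_RINGS and the boolean flags are hypothetical stand-ins, not kernel code. A negative guilty index models job == NULL, and reset_rc models the r value returned by the reset step.

```c
#include <stdbool.h>

#define MAX_RINGS 4 /* hypothetical stand-in for AMDGPU_MAX_RINGS */

/* Hypothetical mock of a ring; NOT the kernel's struct amdgpu_ring. */
struct mock_ring {
    bool present;   /* models "ring && ring->sched.thread" */
    bool parked;    /* models kthread_park() / kthread_unpark() state */
    bool hw_reset;  /* models drm_sched_hw_job_reset() having run */
    bool recovered; /* models drm_sched_job_recovery() having run */
};

/*
 * guilty < 0 models job == NULL (treat every ring as affected).
 * reset_rc models the return code r of the reset step.
 * Returns how many schedulers were parked while the reset ran.
 */
static int mock_gpu_recover(struct mock_ring *rings, int guilty, int reset_rc)
{
    int parked_during_reset = 0;
    int i;

    /* Phase 1: park every scheduler, reset hw jobs only where needed. */
    for (i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = &rings[i];

        if (!ring->present)
            continue;

        ring->parked = true; /* park before the guilty-ring check */

        if (guilty >= 0 && guilty != i)
            continue;

        ring->hw_reset = true;
    }

    /* The reset itself would run here, with no scheduler active. */
    for (i = 0; i < MAX_RINGS; ++i)
        if (rings[i].present && rings[i].parked)
            parked_during_reset++;

    /* Phase 2: recover only the affected ring(s), unpark all of them. */
    for (i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = &rings[i];

        if (!ring->present)
            continue;

        if ((guilty < 0 || guilty == i) && reset_rc == 0)
            ring->recovered = true;

        ring->parked = false;
    }

    return parked_during_reset;
}
```

With rings 0, 1 and 3 present and guilty set to 1, the model parks all three rings for the duration of the reset, resets and recovers only ring 1, and unparks all three afterwards. The point of the reordering is that the park call sits above the guilty-ring check, so no scheduler (SDMA included) can run alongside the recovery step.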


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
