After looking into this some more:

"Failed to initialize parser !" is expected, since jobs from the guilty context (the context that caused the GPU hang) are canceled.

Locking imbalance is actually gone with Monk's patches.

Thanks,

Andrey


On 02/28/2018 11:40 AM, Andrey Grodzovsky wrote:
No new issues found with those patches; I tested GPU reset using the libdrm deadlock detection test on Ellesmere.

The patches are Tested-By: Andrey Grodzovsky <andrey.grodzov...@amd.com>

P.S

Noticed existing issues (before Monk's patches)

Multiple [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !

And an occasional unlock imbalance from amdgpu_cs_ioctl

DEBUG_LOCKS_WARN_ON(depth <= 0)
[   93.069011 <    0.000017>] WARNING: CPU: 3 PID: 2215 at kernel/locking/lockdep.c:3682 lock_release+0x2e8/0x360

On CZ, a full reset hangs the system.

Gonna take a look at those issues.

Thanks,

Andrey


On 02/28/2018 08:31 AM, Liu, Monk wrote:
Already sent

-----Original Message-----
From: Grodzovsky, Andrey
Sent: February 28, 2018 21:31
To: Koenig, Christian <christian.koe...@amd.com>; Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover

Will do once Monk sends V2 for [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ

Andrey


On 02/28/2018 07:20 AM, Christian König wrote:
Andrey, please give this set a good testing as well.

On 28.02.2018 at 08:21, Monk Liu wrote:
Found that recover_vram_from_shadow sometimes gets executed in parallel with
the SDMA scheduler; we should stop all schedulers before doing gpu
reset/recover.

Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
Signed-off-by: Monk Liu <monk....@amd.com>
For now this patch is Reviewed-by: Christian König
<christian.koe...@amd.com>.

Regards,
Christian.

---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +++++++++++-------------------
   1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 75d1733..e9d81a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
         /* block TTM */
       resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
+
       /* store modesetting */
       if (amdgpu_device_has_dc_support(adev))
           state = drm_atomic_helper_suspend(adev->ddev);
   -    /* block scheduler */
+    /* block all schedulers and reset given job's ring */
       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
           struct amdgpu_ring *ring = adev->rings[i];
             if (!ring || !ring->sched.thread)
               continue;
   -        /* only focus on the ring hit timeout if &job not NULL */
+        kthread_park(ring->sched.thread);
+
           if (job && job->ring->idx != i)
               continue;
   -        kthread_park(ring->sched.thread);
           drm_sched_hw_job_reset(&ring->sched, &job->base);
             /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
@@ -2707,33 +2708,22 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
               }
               dma_fence_put(fence);
           }
+    }
   -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-            struct amdgpu_ring *ring = adev->rings[i];
-
-            if (!ring || !ring->sched.thread)
-                continue;
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+        struct amdgpu_ring *ring = adev->rings[i];
   -            /* only focus on the ring hit timeout if &job not NULL */
-            if (job && job->ring->idx != i)
-                continue;
+        if (!ring || !ring->sched.thread)
+            continue;
   +        /* only need recovery sched of the given job's ring
+         * or all rings (in the case @job is NULL)
+         * after above amdgpu_reset accomplished
+         */
+        if ((!job || job->ring->idx == i) && !r)
               drm_sched_job_recovery(&ring->sched);
-            kthread_unpark(ring->sched.thread);
-        }
-    } else {
-        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-            struct amdgpu_ring *ring = adev->rings[i];
   -            if (!ring || !ring->sched.thread)
-                continue;
-
-            /* only focus on the ring hit timeout if &job not NULL */
-            if (job && job->ring->idx != i)
-                continue;
-
-            kthread_unpark(adev->rings[i]->sched.thread);
-        }
+        kthread_unpark(ring->sched.thread);
       }
         if (amdgpu_device_has_dc_support(adev)) {
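
For readers following the thread, the control flow the patch aims for can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the driver code: `struct mock_ring`, `mock_gpu_recover()`, `MAX_RINGS` and `NO_JOB` are hypothetical stand-ins for `struct amdgpu_ring`, `amdgpu_device_gpu_recover()`, `AMDGPU_MAX_RINGS` and a NULL `job`, and the boolean flags merely record where the real code would call `kthread_park()`, `drm_sched_hw_job_reset()`, `drm_sched_job_recovery()` and `kthread_unpark()`.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RINGS 4    /* hypothetical; stands in for AMDGPU_MAX_RINGS */
#define NO_JOB    (-1) /* hypothetical; stands in for job == NULL */

/* Hypothetical stand-in for struct amdgpu_ring and its scheduler thread. */
struct mock_ring {
    bool present;   /* models ring && ring->sched.thread */
    bool parked;    /* models kthread_park() / kthread_unpark() */
    bool hw_reset;  /* models drm_sched_hw_job_reset() */
    bool recovered; /* models drm_sched_job_recovery() */
};

/*
 * Models the flow after the patch. job_ring is the index of the ring whose
 * job timed out, or NO_JOB. reset_result models the return value r of the
 * reset step (0 on success).
 */
static void mock_gpu_recover(struct mock_ring rings[MAX_RINGS],
                             int job_ring, int reset_result)
{
    int i;

    assert(job_ring == NO_JOB || (job_ring >= 0 && job_ring < MAX_RINGS));

    /* Phase 1: park every scheduler; reset hw jobs only on the targeted ring(s). */
    for (i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = &rings[i];

        if (!ring->present)
            continue;

        ring->parked = true;

        if (job_ring != NO_JOB && job_ring != i)
            continue;

        ring->hw_reset = true;
    }

    /* The reset and VRAM recovery would run here, with all schedulers parked. */

    /* Phase 2: recover only the targeted ring(s) on success; unpark every scheduler. */
    for (i = 0; i < MAX_RINGS; ++i) {
        struct mock_ring *ring = &rings[i];

        if (!ring->present)
            continue;

        if ((job_ring == NO_JOB || job_ring == i) && reset_result == 0)
            ring->recovered = true;

        ring->parked = false;
    }
}
```

Calling `mock_gpu_recover(rings, 1, 0)` on a set of rings marks only ring 1 as reset and recovered, while every present ring is parked during the reset window and unparked afterwards. That window is the property the commit message asks for, so that no other scheduler (for example SDMA) runs in parallel with the recovery.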


