I think the core of the problem is that we don't abort all entities of the process at the same time.

How about splitting drm_sched_entity_fini() into two functions?

The first one does the waiting, removes the entity from the run queue and returns an error when the process was killed.

During destruction this one is called first for all contexts as well as for the entity used for VM updates.

The second one then goes over the entity's queue and signals all remaining jobs with an error code.

This way no VM updates are executed any longer after the process is killed (something which doesn't make sense anyway and just costs us time).

Regards,
Christian.

Am 29.03.2018 um 06:14 schrieb Liu, Monk:
2) If a fence is signaled and tries to clear some entity's dependency, we should set this entity guilty to prevent its job from really running, since the dependency is only fake signaled.
Well, that is a clear NAK. It would mean that queues like the X server's or a Wayland compositor's get marked guilty because some client was killed.

And since unmapping/clear jobs don't have a guilty pointer, this should actually not have any effect on the issue.


[ML] Yeah, that's a good point. How about this instead: if a fence is fake signaled and tries to clear another entity's dependency, we only allow that entity to be marked guilty if it shares the same ctx (or even process?) as the entity of the job behind the fake-signaled fence.
This way a fake-signaled GFX fence of a certain process won't be able to wake another VCE/SDMA job.

/Monk

-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: March 28, 2018 19:57
To: Deng, Emily <[email protected]>; [email protected]
Cc: Liu, Monk <[email protected]>
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

Am 28.03.2018 um 10:07 schrieb Emily Deng:
issue:
A VMC page fault occurs if the app is force killed during a 3dmark test. The
cause is that in entity_fini() we manually signal all jobs still in the
entity's queue, which confuses the sync/dependency mechanism:

1) A page fault occurred in SDMA's clear job, which operates on the shadow
buffer: the shadow buffer's GART table was cleared by ttm_bo_release()
because the fence in its reservation object was fake signaled by
entity_fini() after SIGKILL was received.

2) A page fault occurred in a GFX job because during the lifetime of the GFX
job we manually fake signal all jobs from its entity in entity_fini(); thus
the unmapping/clear-PTE job depending on those result fences is satisfied,
SDMA starts clearing the PTEs, and this leads to the GFX page fault.
Nice catch, but the fixes won't work like this.

fix:
1) We should at least wait in entity_fini() for all already scheduled jobs
to complete if SIGKILL is the case.
Well, that is not a good idea, because when we kill a process we actually want to 
tear down the task as fast as possible and not wait for anything. That is 
actually the reason why we have this handling in the first place.

2) If a fence is signaled and tries to clear some entity's dependency, we should set this entity guilty to prevent its job from really running, since the dependency is only fake signaled.
Well, that is a clear NAK. It would mean that queues like the X server's or a Wayland compositor's get marked guilty because some client was killed.

And since unmapping/clear jobs don't have a guilty pointer, this should actually not have any effect on the issue.

Regards,
Christian.


related issue ticket:
http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1

Signed-off-by: Monk Liu <[email protected]>
---
   drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
   1 file changed, 36 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 2bd69c4..9b306d3 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
        return true;
   }
+static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
+                               struct drm_sched_entity *entity)
+{
+       struct drm_sched_job *last;
+       signed long r;
+
+       spin_lock(&sched->job_list_lock);
+       list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
+               if (last->s_fence->scheduled.context == entity->fence_context) {
+                       dma_fence_get(&last->s_fence->finished);
+                       break;
+               }
+       spin_unlock(&sched->job_list_lock);
+
+       if (&last->node != &sched->ring_mirror_list) {
+               r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
+               if (r == 0)
+                       DRM_WARN("wait on the fly job timeout\n");
+               dma_fence_put(&last->s_fence->finished);
+       }
+}
+
   /**
    * Destroy a context entity
    *
@@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
                        entity->dependency = NULL;
                }
+               /* Wait till all jobs from this entity really finished, otherwise the
+                * fake signaling below would kickstart SDMA's clear PTE jobs and
+                * lead to a VM fault
+                */
+               drm_sched_entity_wait_otf_signal(sched, entity);
+
                while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
                        struct drm_sched_fence *s_fence = job->s_fence;
                        drm_sched_fence_scheduled(s_fence);
@@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
   {
        struct drm_sched_entity *entity =
                container_of(cb, struct drm_sched_entity, cb);
+
+       /* set the entity guilty since its dependency is
+        * not really cleared but fake signaled (by SIGKILL
+        * or GPU recover)
+        */
+       if (f->error && entity->guilty)
+               atomic_set(entity->guilty, 1);
+
        entity->dependency = NULL;
        dma_fence_put(f);
        drm_sched_wakeup(entity->sched);

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx