Hi Christian,

> This way we won't even start running the unmapping/clear-pte job in the
> first place.
What if there is already an unmapping/clear-pte job running before you kill the app? E.g. the app is naturally releasing some resource and by coincidence you kill the app at that very moment?

/Monk

-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: 2018年3月29日 18:16
To: Liu, Monk <[email protected]>; Koenig, Christian <[email protected]>; Deng, Emily <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

Hi Monk,

well that isn't a problem. The idea is that we first stop ALL entities and then mark all fences as signaled with an error.

This way we won't even start running the unmapping/clear-pte job in the first place.

I mean, as I wrote: when the process is killed we should cancel ALL still pending jobs of that process, including pending submissions and page table updates.

Regards,
Christian.

Am 29.03.2018 um 12:11 schrieb Liu, Monk:
> First let's consider the shadow buffer case:
>
> After you signal all jobs with an error code, e.g. you signal an
> unmapping/clear-pte job on the sdma ring (it is running on sdma), the
> reservations are then all cleared. This way, during amdgpu_bo_undef() on
> the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT.
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月29日 17:05
> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>;
> [email protected]
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> I think the core of the problem is that we don't abort all entities of
> the process at the same time.
>
> How about splitting drm_sched_entity_fini() into two functions?
>
> The first one does the waiting, removes the entity from the runqueue and
> returns an error when the process was killed.
>
> During destruction this one is called first for all contexts as well as
> the entity for VM updates.
>
> The second one then goes over the entity and signals all jobs with an
> error code.
>
> This way no VM updates should be executed any longer after the process
> is killed (something which doesn't make sense anyway and just costs us
> time).
>
> Regards,
> Christian.
>
> Am 29.03.2018 um 06:14 schrieb Liu, Monk:
>>> 2) if a fence is signaled and tries to clear some entity's dependency,
>>> we should set this entity guilty to prevent its job from really
>>> running, since the dependency was fake signaled.
>> Well, that is a clear NAK. It would mean that things like the X server
>> or Wayland queue are marked guilty because some client is killed.
>>
>> And since unmapping/clear jobs don't have a guilty pointer it should
>> actually not have any effect on the issue.
>>
>>
>> [ML] Yeah, that's a good point. How about this way: if a fence is fake
>> signaled and tries to clear another entity's dependency, we only allow
>> that entity to be marked as guilty if it shares the same ctx (or even
>> process?) with the entity of the job of that fake signaled fence?
>> This way, for a certain process, a fake signaled GFX fence won't be
>> able to wake another VCE/SDMA job.
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:[email protected]]
>> Sent: 2018年3月28日 19:57
>> To: Deng, Emily <[email protected]>; [email protected]
>> Cc: Liu, Monk <[email protected]>
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Am 28.03.2018 um 10:07 schrieb Emily Deng:
>>> issue:
>>> VMC page faults occur if the APP is force killed during a 3dmark
>>> test. The cause is that in entity_fini() we manually signal all the
>>> jobs in the entity's queue, which confuses the sync/dep mechanism:
>>>
>>> 1) a page fault occurred in sdma's clear job, which operates on the
>>> shadow buffer; the shadow buffer's GART table had been cleaned by
>>> ttm_bo_release, since the fence in its reservation was fake signaled
>>> by entity_fini() after SIGKILL was received.
>>>
>>> 2) a page fault occurred in a gfx job because during the lifetime of
>>> the gfx job we manually fake signal all jobs from its entity in
>>> entity_fini(); thus the unmapping/clear-PTE job depending on those
>>> result fences is satisfied, sdma starts clearing the PTEs and that
>>> leads to a GFX page fault.
>> Nice catch, but the fixes won't work like this.
>>
>>> fix:
>>> 1) we should at least wait for all already scheduled jobs to complete
>>> in entity_fini() if SIGKILL is the case.
>> Well, that is not a good idea, because when we kill a process we
>> actually want to tear down the task as fast as possible and not wait
>> for anything. That is actually the reason why we have this handling.
>>
>>> 2) if a fence is signaled and tries to clear some entity's dependency,
>>> we should set this entity guilty to prevent its job from really
>>> running, since the dependency was fake signaled.
>> Well, that is a clear NAK. It would mean that things like the X server
>> or Wayland queue are marked guilty because some client is killed.
>>
>> And since unmapping/clear jobs don't have a guilty pointer it should
>> actually not have any effect on the issue.
>>
>> Regards,
>> Christian.
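
A minimal sketch of the two-function split Christian proposes in his 17:05 mail above, written against the scheduler internals visible in the patch quoted below. The names drm_sched_entity_do_release()/drm_sched_entity_cleanup() and the exact signatures are illustrative assumptions, not a confirmed API:

/* Phase one: stop the entity. Detach it from its runqueue so the
 * scheduler picks no further jobs from it, then wait for the jobs that
 * were already scheduled -- unless the process received SIGKILL, in
 * which case we bail out immediately instead of waiting.
 */
int drm_sched_entity_do_release(struct drm_gpu_scheduler *sched,
				struct drm_sched_entity *entity)
{
	drm_sched_entity_set_rq(entity, NULL);

	if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
		return -ERESTARTSYS;

	return wait_event_killable(sched->job_scheduled,
				   drm_sched_entity_is_idle(entity));
}

/* Phase two: flush the queue. Every job still pending is signaled with
 * -ESRCH so waiters are released, but none of them is executed.
 */
void drm_sched_entity_cleanup(struct drm_gpu_scheduler *sched,
			      struct drm_sched_entity *entity)
{
	struct drm_sched_job *job;

	if (entity->dependency) {
		dma_fence_remove_callback(entity->dependency, &entity->cb);
		dma_fence_put(entity->dependency);
		entity->dependency = NULL;
	}

	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
		struct drm_sched_fence *s_fence = job->s_fence;

		drm_sched_fence_scheduled(s_fence);
		dma_fence_set_error(&s_fence->finished, -ESRCH);
		drm_sched_fence_finished(s_fence);
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}
}

During destruction the first function would run for all contexts as well as the VM update entity before the second is called for any of them, so nothing gets fake signaled while an entity can still pick up work.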
>>
>>
>>> related issue ticket:
>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>
>>> Signed-off-by: Monk Liu <[email protected]>
>>> ---
>>>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>   1 file changed, 36 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> index 2bd69c4..9b306d3 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>   	return true;
>>>   }
>>>
>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>> +					     struct drm_sched_entity *entity)
>>> +{
>>> +	struct drm_sched_job *last;
>>> +	signed long r;
>>> +
>>> +	spin_lock(&sched->job_list_lock);
>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>> +			dma_fence_get(&last->s_fence->finished);
>>> +			break;
>>> +		}
>>> +	spin_unlock(&sched->job_list_lock);
>>> +
>>> +	if (&last->node != &sched->ring_mirror_list) {
>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,
>>> +					   msecs_to_jiffies(500));
>>> +		if (r == 0)
>>> +			DRM_WARN("wait on the fly job timeout\n");
>>> +		dma_fence_put(&last->s_fence->finished);
>>> +	}
>>> +}
>>> +
>>>   /**
>>>    * Destroy a context entity
>>>    *
>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>   		entity->dependency = NULL;
>>>   	}
>>>
>>> +	/* Wait till all jobs from this entity have really finished, otherwise
>>> +	 * the fake signaling below would kickstart sdma's clear-PTE jobs and
>>> +	 * lead to a VM fault
>>> +	 */
>>> +	drm_sched_entity_wait_otf_signal(sched, entity);
>>> +
>>>   	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>   		struct drm_sched_fence *s_fence = job->s_fence;
>>>   		drm_sched_fence_scheduled(s_fence);
>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>   {
>>>   	struct drm_sched_entity *entity =
>>>   		container_of(cb, struct drm_sched_entity, cb);
>>> +
>>> +	/* Set the entity guilty since its dependency is
>>> +	 * not really cleared but fake signaled (by SIGKILL
>>> +	 * or GPU recovery)
>>> +	 */
>>> +	if (f->error && entity->guilty)
>>> +		atomic_set(entity->guilty, 1);
>>> +
>>>   	entity->dependency = NULL;
>>>   	dma_fence_put(f);
>>>   	drm_sched_wakeup(entity->sched);

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
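
To tie the thread together: with a split like the one sketched above, the process teardown Christian describes in his 18:16 mail becomes two passes over all entities of the dying process. A rough sketch, where process_entities is a hypothetical per-process list of entities (amdgpu's real context manager is organized differently):

/* Two-pass teardown: pass 1 stops every entity of the dying process
 * (contexts and the VM update entity alike), pass 2 fake signals the
 * leftovers. Because every entity is stopped before anything is fake
 * signaled, no new unmapping/clear-PTE job can be picked up between
 * the two passes. 'process_entities' and the list walk below are
 * assumptions for this sketch, not amdgpu's actual code.
 */
static void teardown_process_entities(struct list_head *process_entities)
{
	struct drm_sched_entity *entity;

	list_for_each_entry(entity, process_entities, list)	/* pass 1 */
		drm_sched_entity_do_release(entity->sched, entity);

	list_for_each_entry(entity, process_entities, list)	/* pass 2 */
		drm_sched_entity_cleanup(entity->sched, entity);
}

Since the VM update entity is detached from its runqueue in the first pass, the scheduler can no longer pick up an unmapping/clear-PTE job by the time the second pass starts fake signaling fences.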
