This sdma job is still running, and you will eventually fake-signal all fences. That leads to this process's PD reservation object being freed, so amdgpu_bo_unref() can call amdgpu_gart_unbind() on the shadow buffer's ttm, and the sdma job then hits a VMC page fault on vmid0.
/Monk

-----Original Message-----
From: Koenig, Christian
Sent: March 29, 2018 19:24
To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

On 29.03.2018 13:14, Liu, Monk wrote:
> Hi Christian
>
>> This way we won't even start running the unmapping/clear-pte job in the
>> first place.
> What if there is already an unmapping/clear-pte job running before you
> kill the app? Like the app is naturally releasing some resource, and by
> coincidence you kill the app at the same time?

At least nothing problematic. Since the job is already running we won't do anything with its fence.

Christian.

> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:[email protected]]
> Sent: March 29, 2018 18:16
> To: Liu, Monk <[email protected]>; Koenig, Christian <[email protected]>;
> Deng, Emily <[email protected]>; [email protected]
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> Hi Monk,
>
> well that isn't a problem.
>
> The idea is that we first stop ALL entities and then mark all fences as
> signaled with an error. This way we won't even start running the
> unmapping/clear-pte job in the first place.
>
> I mean, as I wrote, when the process is killed we should cancel ALL still
> pending jobs of that process, including pending submissions and page table
> updates.
>
> Regards,
> Christian.
>
> On 29.03.2018 12:11, Liu, Monk wrote:
>> First let's consider the shadow buffer case:
>>
>> After you signal all jobs with an error code, e.g. you signal an
>> unmapping/clear-pte job on the sdma ring (it is running on sdma), the
>> reservations are then all cleared. This way, during amdgpu_bo_unref()
>> on the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT.
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: March 29, 2018 17:05
>> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>;
>> [email protected]
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> I think the core of the problem is that we don't abort all entities of the
>> process at the same time.
>>
>> How about splitting drm_sched_entity_fini() into two functions?
>>
>> The first one does the waiting, removes the entity from the runqueue and
>> returns an error when the process was killed.
>>
>> During destruction this one is called first for all contexts as well as the
>> entity for VM updates.
>>
>> The second one then goes over the entity and signals all jobs with an error
>> code.
>>
>> This way no VM updates should be executed any longer after the process is
>> killed (something which doesn't make sense anyway and just costs us time).
>>
>> Regards,
>> Christian.
>>
>> On 29.03.2018 06:14, Liu, Monk wrote:
>>>> 2) If a fence is signaled and tries to clear some entity's dependency,
>>>> we should set this entity guilty to prevent its job from really running,
>>>> since the dependency is fake signaled.
>>> Well, that is a clear NAK. It would mean that you mark things like the X
>>> server or Wayland queue guilty because some client is killed.
>>>
>>> And since unmapping/clear jobs don't have a guilty pointer it should
>>> actually not have any effect on the issue.
>>>
>>> [ML] Yeah, that's a good point. How about this way: if a fence is
>>> fake signaled and tries to clear another entity's dependency, we only
>>> allow the entity to be marked guilty if it shares the same ctx (or even
>>> process?) as the entity of the job of that fake-signaled fence?
>>> This way, for a certain process, a fake-signaled GFX fence won't be
>>> able to wake another VCE/SDMA job.
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:[email protected]]
>>> Sent: March 28, 2018 19:57
>>> To: Deng, Emily <[email protected]>; [email protected]
>>> Cc: Liu, Monk <[email protected]>
>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>
>>> On 28.03.2018 10:07, Emily Deng wrote:
>>>> issue:
>>>> There are VMC page faults if the APP is force-killed during a 3dmark
>>>> test. The cause is that in entity_fini we manually signal all those
>>>> jobs in the entity's queue, which confuses the sync/dep mechanism:
>>>>
>>>> 1) A page fault occurred in sdma's clear job, which operates on the
>>>> shadow buffer: the shadow buffer's GART table is cleared by
>>>> ttm_bo_release, since the fence in its reservation was fake signaled
>>>> by entity_fini() in the case of a SIGKILL being received.
>>>>
>>>> 2) A page fault occurred in the gfx job: during the lifetime of the
>>>> gfx job we manually fake signal all jobs from its entity in
>>>> entity_fini(), so the unmapping/clear-PTE job that depends on those
>>>> result fences is satisfied, and sdma starts clearing the PTEs,
>>>> leading to a GFX page fault.
>>> Nice catch, but the fixes won't work like this.
>>>
>>>> fix:
>>>> 1) We should at least wait for all already-scheduled jobs to complete
>>>> in entity_fini() if SIGKILL is the case.
>>> Well, that is not a good idea, because when we kill a process we actually
>>> want to tear down the task as fast as possible and not wait for anything.
>>> That is actually the reason why we have this handling.
>>>
>>>> 2) If a fence is signaled and tries to clear some entity's dependency,
>>>> we should set this entity guilty to prevent its job from really running,
>>>> since the dependency is fake signaled.
>>> Well, that is a clear NAK. It would mean that you mark things like the X
>>> server or Wayland queue guilty because some client is killed.
>>>
>>> And since unmapping/clear jobs don't have a guilty pointer it should
>>> actually not have any effect on the issue.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> related issue ticket:
>>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>>
>>>> Signed-off-by: Monk Liu <[email protected]>
>>>> ---
>>>>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>>  1 file changed, 36 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> index 2bd69c4..9b306d3 100644
>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>>  	return true;
>>>>  }
>>>>  
>>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>>> +					     struct drm_sched_entity *entity)
>>>> +{
>>>> +	struct drm_sched_job *last;
>>>> +	signed long r;
>>>> +
>>>> +	spin_lock(&sched->job_list_lock);
>>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>>> +			dma_fence_get(&last->s_fence->finished);
>>>> +			break;
>>>> +		}
>>>> +	spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +	if (&last->node != &sched->ring_mirror_list) {
>>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,
>>>> +					   msecs_to_jiffies(500));
>>>> +		if (r == 0)
>>>> +			DRM_WARN("wait on the fly job timeout\n");
>>>> +		dma_fence_put(&last->s_fence->finished);
>>>> +	}
>>>> +}
>>>> +
>>>>  /**
>>>>   * Destroy a context entity
>>>>   *
>>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>>  		entity->dependency = NULL;
>>>>  	}
>>>>  
>>>> +	/* Wait till all jobs from this entity have really finished, otherwise
>>>> +	 * the fake signaling below would kickstart sdma's clear-PTE jobs and
>>>> +	 * lead to a vm fault
>>>> +	 */
>>>> +	drm_sched_entity_wait_otf_signal(sched, entity);
>>>> +
>>>>  	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>>  		struct drm_sched_fence *s_fence = job->s_fence;
>>>>  		drm_sched_fence_scheduled(s_fence);
>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>  {
>>>>  	struct drm_sched_entity *entity =
>>>>  		container_of(cb, struct drm_sched_entity, cb);
>>>> +
>>>> +	/* Set the entity guilty, since its dependency was
>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>> +	 * or GPU recovery)
>>>> +	 */
>>>> +	if (f->error && entity->guilty)
>>>> +		atomic_set(entity->guilty, 1);
>>>> +
>>>>  	entity->dependency = NULL;
>>>>  	dma_fence_put(f);
>>>>  	drm_sched_wakeup(entity->sched);
>>
>> _______________________________________________
>> amd-gfx mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
