This sdma job is still running, and you will eventually fake-signal all fences. That leads to this process's PD reservation object being freed, so amdgpu_bo_unref() can call amdgpu_gart_unbind() on the shadow buffer's ttm, and the sdma job then hits a VMC page fault on vmid0.
/Monk

-----Original Message-----
From: Koenig, Christian
Sent: March 29, 2018 19:24
To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

On 29.03.2018 13:14, Liu, Monk wrote:
> Hi Christian
>
>> This way we won't even start running the unmapping/clear-pte job in the
>> first place.
> What if there is already an unmapping/clear-pte job running before you
> kill the app? Like the app is naturally releasing some resource, and by
> coincidence you kill the app at the same time?

At least nothing problematic. Since the job is already running we won't do anything with its fence.

Christian.

> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:[email protected]]
> Sent: March 29, 2018 18:16
> To: Liu, Monk <[email protected]>; Koenig, Christian <[email protected]>;
> Deng, Emily <[email protected]>; [email protected]
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> Hi Monk,
>
> well that isn't a problem.
>
> The idea is that we first stop ALL entities and then mark all fences as
> signaled with an error. This way we won't even start running the
> unmapping/clear-pte job in the first place.
>
> I mean, as I wrote, when the process is killed we should cancel ALL still
> pending jobs of that process, including pending submissions and page table
> updates.
>
> Regards,
> Christian.
>
> On 29.03.2018 12:11, Liu, Monk wrote:
>> First let's consider the shadow buffer case:
>>
>> After you signal all jobs with an error code, e.g. you signal an
>> unmapping/clear-pte job on the sdma ring (it is running on sdma), the
>> reservations are then all cleared. This way, during amdgpu_bo_unref()
>> on the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT.
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: March 29, 2018 17:05
>> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>;
>> [email protected]
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> I think the core of the problem is that we don't abort all entities of the
>> process at the same time.
>>
>> How about splitting drm_sched_entity_fini() into two functions?
>>
>> The first one does the waiting, removes the entity from the runqueue and
>> returns an error when the process was killed.
>>
>> During destruction this one is called first for all contexts as well as the
>> entity for VM updates.
>>
>> The second one then goes over the entity and signals all jobs with an error
>> code.
>>
>> This way no VM updates should be executed any longer after the process is
>> killed (something which doesn't make sense anyway and just costs us time).
>>
>> Regards,
>> Christian.
>>
>> On 29.03.2018 06:14, Liu, Monk wrote:
>>>> 2) If a fence is signaled and tries to clear some entity's dependency,
>>>> we should set this entity guilty to prevent its job from really running,
>>>> since the dependency is fake signaled.
>>> Well, that is a clear NAK. It would mean that you mark things like the X
>>> server or Wayland queue guilty because some client is killed.
>>>
>>> And since unmapping/clear jobs don't have a guilty pointer it should
>>> actually not have any effect on the issue.
>>>
>>> [ML] Yeah, that's a good point. How about this way: if a fence is
>>> fake signaled and tries to clear another entity's dependency, we only
>>> allow the entity to be marked guilty if it shares the same ctx (or even
>>> process?) as the entity of the job of that fake-signaled fence?
>>> This way, for a certain process, a fake-signaled GFX fence won't be
>>> able to wake another VCE/SDMA job.
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:[email protected]]
>>> Sent: March 28, 2018 19:57
>>> To: Deng, Emily <[email protected]>; [email protected]
>>> Cc: Liu, Monk <[email protected]>
>>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>>
>>> On 28.03.2018 10:07, Emily Deng wrote:
>>>> issue:
>>>> There are VMC page faults if the APP is force-killed during a 3dmark
>>>> test. The cause is that in entity_fini we manually signal all those
>>>> jobs in the entity's queue, which confuses the sync/dep mechanism:
>>>>
>>>> 1) A page fault occurred in sdma's clear job, which operates on the
>>>> shadow buffer: the shadow buffer's GART table is cleared by
>>>> ttm_bo_release, since the fence in its reservation was fake signaled
>>>> by entity_fini() in the case of a SIGKILL being received.
>>>>
>>>> 2) A page fault occurred in the gfx job: during the lifetime of the
>>>> gfx job we manually fake signal all jobs from its entity in
>>>> entity_fini(), so the unmapping/clear-PTE job that depends on those
>>>> result fences is satisfied, and sdma starts clearing the PTEs,
>>>> leading to a GFX page fault.
>>> Nice catch, but the fixes won't work like this.
>>>
>>>> fix:
>>>> 1) We should at least wait for all already-scheduled jobs to complete
>>>> in entity_fini() if SIGKILL is the case.
>>> Well, that is not a good idea, because when we kill a process we actually
>>> want to tear down the task as fast as possible and not wait for anything.
>>> That is actually the reason why we have this handling.
>>>
>>>> 2) If a fence is signaled and tries to clear some entity's dependency,
>>>> we should set this entity guilty to prevent its job from really running,
>>>> since the dependency is fake signaled.
>>> Well, that is a clear NAK. It would mean that you mark things like the X
>>> server or Wayland queue guilty because some client is killed.
>>>
>>> And since unmapping/clear jobs don't have a guilty pointer it should
>>> actually not have any effect on the issue.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> related issue ticket:
>>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>>
>>>> Signed-off-by: Monk Liu <[email protected]>
>>>> ---
>>>>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>>  1 file changed, 36 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> index 2bd69c4..9b306d3 100644
>>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>>  	return true;
>>>>  }
>>>>  
>>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>>> +					     struct drm_sched_entity *entity)
>>>> +{
>>>> +	struct drm_sched_job *last;
>>>> +	signed long r;
>>>> +
>>>> +	spin_lock(&sched->job_list_lock);
>>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>>> +			dma_fence_get(&last->s_fence->finished);
>>>> +			break;
>>>> +		}
>>>> +	spin_unlock(&sched->job_list_lock);
>>>> +
>>>> +	if (&last->node != &sched->ring_mirror_list) {
>>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,
>>>> +					   msecs_to_jiffies(500));
>>>> +		if (r == 0)
>>>> +			DRM_WARN("wait on the fly job timeout\n");
>>>> +		dma_fence_put(&last->s_fence->finished);
>>>> +	}
>>>> +}
>>>> +
>>>>  /**
>>>>   * Destroy a context entity
>>>>   *
>>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>>  		entity->dependency = NULL;
>>>>  	}
>>>>  
>>>> +	/* Wait till all jobs from this entity have really finished, otherwise
>>>> +	 * the fake signaling below would kickstart sdma's clear-PTE jobs and
>>>> +	 * lead to a vm fault
>>>> +	 */
>>>> +	drm_sched_entity_wait_otf_signal(sched, entity);
>>>> +
>>>>  	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>>  		struct drm_sched_fence *s_fence = job->s_fence;
>>>>  		drm_sched_fence_scheduled(s_fence);
>>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>>  {
>>>>  	struct drm_sched_entity *entity =
>>>>  		container_of(cb, struct drm_sched_entity, cb);
>>>> +
>>>> +	/* Set the entity guilty, since its dependency was
>>>> +	 * not really cleared but fake signaled (by SIGKILL
>>>> +	 * or GPU recovery)
>>>> +	 */
>>>> +	if (f->error && entity->guilty)
>>>> +		atomic_set(entity->guilty, 1);
>>>> +
>>>>  	entity->dependency = NULL;
>>>>  	dma_fence_put(f);
>>>>  	drm_sched_wakeup(entity->sched);
>>
>> _______________________________________________
>> amd-gfx mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
