Hi Christian,

> This way we won't even start running the unmapping/clear-pte job in the
> first place.
What if there is already an unmapping/clear-pte job running before you kill the app? E.g. the app is naturally releasing some resource and by coincidence you kill the app at that very moment?

/Monk

-----Original Message-----
From: Christian König [mailto:[email protected]]
Sent: 2018年3月29日 18:16
To: Liu, Monk <[email protected]>; Koenig, Christian <[email protected]>; Deng, Emily <[email protected]>; [email protected]
Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang

Hi Monk,

well that isn't a problem. The idea is that we first stop ALL entities and then mark all fences as signaled with an error.

This way we won't even start running the unmapping/clear-pte job in the first place.

I mean, as I wrote: when the process is killed we should cancel ALL still pending jobs of that process, including pending submissions and page table updates.

Regards,
Christian.

Am 29.03.2018 um 12:11 schrieb Liu, Monk:
> First let's consider the shadow buffer case:
>
> After you signal all jobs with an error code, e.g. you signal an
> unmapping/clear-pte job on the sdma ring (it is running on sdma), the
> reservations are then all cleared. This way, during amdgpu_bo_undef() on
> the SHADOW BUFFER, the above sdma job would hit a VMC PAGE FAULT.
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月29日 17:05
> To: Liu, Monk <[email protected]>; Deng, Emily <[email protected]>;
> [email protected]
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> I think the core of the problem is that we don't abort all entities of
> the process at the same time.
>
> How about splitting drm_sched_entity_fini() into two functions?
>
> The first one does the waiting, removes the entity from the runqueue and
> returns an error when the process was killed.
>
> During destruction this one is called first for all contexts as well as
> the entity for VM updates.
>
> The second one then goes over the entity and signals all jobs with an
> error code.
>
> This way no VM updates should be executed any longer after the process
> is killed (something which doesn't make sense anyway and just costs us
> time).
>
> Regards,
> Christian.
>
> Am 29.03.2018 um 06:14 schrieb Liu, Monk:
>>> 2) if a fence is signaled and tries to clear some entity's dependency,
>>> we should set this entity guilty to prevent its job from really
>>> running, since the dependency was fake signaled.
>> Well, that is a clear NAK. It would mean that things like the X server
>> or Wayland queue are marked guilty because some client is killed.
>>
>> And since unmapping/clear jobs don't have a guilty pointer it should
>> actually not have any effect on the issue.
>>
>>
>> [ML] Yeah, that's a good point. How about this way: if a fence is fake
>> signaled and tries to clear another entity's dependency, we only allow
>> that entity to be marked as guilty if it shares the same ctx (or even
>> process?) with the entity of the job of that fake signaled fence?
>> This way, for a certain process, a fake signaled GFX fence won't be
>> able to wake another VCE/SDMA job.
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:[email protected]]
>> Sent: 2018年3月28日 19:57
>> To: Deng, Emily <[email protected]>; [email protected]
>> Cc: Liu, Monk <[email protected]>
>> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>>
>> Am 28.03.2018 um 10:07 schrieb Emily Deng:
>>> issue:
>>> VMC page faults occur if the APP is force killed during a 3dmark
>>> test. The cause is that in entity_fini() we manually signal all the
>>> jobs in the entity's queue, which confuses the sync/dep mechanism:
>>>
>>> 1) a page fault occurred in sdma's clear job, which operates on the
>>> shadow buffer; the shadow buffer's GART table had been cleaned by
>>> ttm_bo_release, since the fence in its reservation was fake signaled
>>> by entity_fini() after SIGKILL was received.
>>>
>>> 2) a page fault occurred in a gfx job because during the lifetime of
>>> the gfx job we manually fake signal all jobs from its entity in
>>> entity_fini(); thus the unmapping/clear-PTE job depending on those
>>> result fences is satisfied, sdma starts clearing the PTEs and that
>>> leads to a GFX page fault.
>> Nice catch, but the fixes won't work like this.
>>
>>> fix:
>>> 1) we should at least wait for all already scheduled jobs to complete
>>> in entity_fini() if SIGKILL is the case.
>> Well, that is not a good idea, because when we kill a process we
>> actually want to tear down the task as fast as possible and not wait
>> for anything. That is actually the reason why we have this handling.
>>
>>> 2) if a fence is signaled and tries to clear some entity's dependency,
>>> we should set this entity guilty to prevent its job from really
>>> running, since the dependency was fake signaled.
>> Well, that is a clear NAK. It would mean that things like the X server
>> or Wayland queue are marked guilty because some client is killed.
>>
>> And since unmapping/clear jobs don't have a guilty pointer it should
>> actually not have any effect on the issue.
>>
>> Regards,
>> Christian.
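
A minimal sketch of the two-function split Christian proposes in his 17:05 mail above, written against the scheduler internals visible in the patch quoted below. The names drm_sched_entity_do_release()/drm_sched_entity_cleanup() and the exact signatures are illustrative assumptions, not a confirmed API:

/* Phase one: stop the entity. Detach it from its runqueue so the
 * scheduler picks no further jobs from it, then wait for the jobs that
 * were already scheduled -- unless the process received SIGKILL, in
 * which case we bail out immediately instead of waiting.
 */
int drm_sched_entity_do_release(struct drm_gpu_scheduler *sched,
				struct drm_sched_entity *entity)
{
	drm_sched_entity_set_rq(entity, NULL);

	if ((current->flags & PF_SIGNALED) && current->exit_code == SIGKILL)
		return -ERESTARTSYS;

	return wait_event_killable(sched->job_scheduled,
				   drm_sched_entity_is_idle(entity));
}

/* Phase two: flush the queue. Every job still pending is signaled with
 * -ESRCH so waiters are released, but none of them is executed.
 */
void drm_sched_entity_cleanup(struct drm_gpu_scheduler *sched,
			      struct drm_sched_entity *entity)
{
	struct drm_sched_job *job;

	if (entity->dependency) {
		dma_fence_remove_callback(entity->dependency, &entity->cb);
		dma_fence_put(entity->dependency);
		entity->dependency = NULL;
	}

	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
		struct drm_sched_fence *s_fence = job->s_fence;

		drm_sched_fence_scheduled(s_fence);
		dma_fence_set_error(&s_fence->finished, -ESRCH);
		drm_sched_fence_finished(s_fence);
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}
}

During destruction the first function would run for all contexts as well as the VM update entity before the second is called for any of them, so nothing gets fake signaled while an entity can still pick up work.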
>>
>>
>>> related issue ticket:
>>> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>>>
>>> Signed-off-by: Monk Liu <[email protected]>
>>> ---
>>>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>>>   1 file changed, 36 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> index 2bd69c4..9b306d3 100644
>>> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>>>   	return true;
>>>   }
>>>
>>> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
>>> +					     struct drm_sched_entity *entity)
>>> +{
>>> +	struct drm_sched_job *last;
>>> +	signed long r;
>>> +
>>> +	spin_lock(&sched->job_list_lock);
>>> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
>>> +		if (last->s_fence->scheduled.context == entity->fence_context) {
>>> +			dma_fence_get(&last->s_fence->finished);
>>> +			break;
>>> +		}
>>> +	spin_unlock(&sched->job_list_lock);
>>> +
>>> +	if (&last->node != &sched->ring_mirror_list) {
>>> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false,
>>> +					   msecs_to_jiffies(500));
>>> +		if (r == 0)
>>> +			DRM_WARN("wait on the fly job timeout\n");
>>> +		dma_fence_put(&last->s_fence->finished);
>>> +	}
>>> +}
>>> +
>>>   /**
>>>    * Destroy a context entity
>>>    *
>>> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>>>   		entity->dependency = NULL;
>>>   	}
>>>
>>> +	/* Wait till all jobs from this entity have really finished, otherwise
>>> +	 * the fake signaling below would kickstart sdma's clear-PTE jobs and
>>> +	 * lead to a VM fault
>>> +	 */
>>> +	drm_sched_entity_wait_otf_signal(sched, entity);
>>> +
>>>   	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>>>   		struct drm_sched_fence *s_fence = job->s_fence;
>>>   		drm_sched_fence_scheduled(s_fence);
>>> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
>>>   {
>>>   	struct drm_sched_entity *entity =
>>>   		container_of(cb, struct drm_sched_entity, cb);
>>> +
>>> +	/* Set the entity guilty since its dependency is
>>> +	 * not really cleared but fake signaled (by SIGKILL
>>> +	 * or GPU recovery)
>>> +	 */
>>> +	if (f->error && entity->guilty)
>>> +		atomic_set(entity->guilty, 1);
>>> +
>>>   	entity->dependency = NULL;
>>>   	dma_fence_put(f);
>>>   	drm_sched_wakeup(entity->sched);

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
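
To tie the thread together: with a split like the one sketched above, the process teardown Christian describes in his 18:16 mail becomes two passes over all entities of the dying process. A rough sketch, where process_entities is a hypothetical per-process list of entities (amdgpu's real context manager is organized differently):

/* Two-pass teardown: pass 1 stops every entity of the dying process
 * (contexts and the VM update entity alike), pass 2 fake signals the
 * leftovers. Because every entity is stopped before anything is fake
 * signaled, no new unmapping/clear-PTE job can be picked up between
 * the two passes. 'process_entities' and the list walk below are
 * assumptions for this sketch, not amdgpu's actual code.
 */
static void teardown_process_entities(struct list_head *process_entities)
{
	struct drm_sched_entity *entity;

	list_for_each_entry(entity, process_entities, list)	/* pass 1 */
		drm_sched_entity_do_release(entity->sched, entity);

	list_for_each_entry(entity, process_entities, list)	/* pass 2 */
		drm_sched_entity_cleanup(entity->sched, entity);
}

Since the VM update entity is detached from its runqueue in the first pass, the scheduler can no longer pick up an unmapping/clear-PTE job by the time the second pass starts fake signaling fences.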
