Attached.
If the general idea in the patch is OK I can think of a test (and maybe
add it to the libdrm amdgpu tests) to actually simulate this scenario
with 2 forked concurrent processes working on the same entity's job
queue, where one is dying while the other keeps pushing to the same
queue (rough sketch below). For now I only tested it with a normal boot
and running multiple glxgears instances concurrently - which doesn't
really exercise this code path, since I think each of them works on its
own FD.
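Roughly, the test I have in mind would look like the sketch below.
open_shared_context() and submit_one_job() are hypothetical stand-ins
for the real libdrm amdgpu setup and command submission calls
(amdgpu_device_initialize/amdgpu_cs_submit in the actual test); the
point is only the fork()/SIGKILL pattern that makes both processes hit
the same entity:

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helpers, standing in for the real libdrm amdgpu calls. */
void *open_shared_context(void);
void submit_one_job(void *ctx);

int main(void)
{
	/* Create the device/context before fork() so both processes inherit
	 * the same FD and therefore push to the same scheduler entity. */
	void *ctx = open_shared_context();
	pid_t child = fork();

	if (child == 0) {
		/* Child: keep pushing jobs until it is killed. */
		for (;;)
			submit_one_job(ctx);
	}

	/* Parent: kill the child mid-submission and keep pushing to the same
	 * entity, racing drm_sched_entity_push_job against the child's
	 * drm_sched_entity_flush on exit. */
	usleep(1000);
	kill(child, SIGKILL);
	for (int i = 0; i < 100000; i++)
		submit_one_job(ctx);

	waitpid(child, NULL, 0);
	return 0;
}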
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
Reviewed-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
But I still have questions about entity->last_user (didn't notice
this before) -
Looks to me there is a race condition with its current usage. Let's
say process A was preempted after doing
drm_sched_entity_flush->cmpxchg(...).
Now process B, working on the same (forked) entity, is inside
drm_sched_entity_push_job: it writes itself (current->group_leader) to
entity->last_user and also executes drm_sched_rq_add_entity. Now
process A runs again and executes drm_sched_rq_remove_entity,
inadvertently removing process B's entity from its scheduler rq.
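Spelled out, the interleaving I mean is (both processes share the same
drm_sched_entity through the forked FD):

  Process A (dying)                        Process B (still submitting)
  -----------------                        ----------------------------
  drm_sched_entity_flush()
    cmpxchg(&entity->last_user,
            A->group_leader, NULL)
    ... preempted ...
                                           drm_sched_entity_push_job()
                                             entity->last_user = B->group_leader
                                             drm_sched_rq_add_entity(rq, entity)
  drm_sched_rq_remove_entity(rq, entity)
    -> B's entity is dropped from the run queue even though B still has
       jobs queued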
Looks to me like instead we should lock entity->last_user accesses
together with adds/removals of the entity to/from the rq.
Andrey
On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
I forgot about this since we started discussing possible scenarios
of processes and threads.
In any case, this check is redundant. Acked-by: Nayan Deshmukh
<nayan26deshm...@gmail.com>
Nayan
On Mon, Aug 6, 2018 at 7:43 PM Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:
Ping. Any objections to that?
Christian.
Am 03.08.2018 um 13:08 schrieb Christian König:
> That is superfluous now.
>
> Signed-off-by: Christian König <christian.koe...@amd.com>
> ---
> drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 85908c7f913e..65078dd3c82c 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>  	if (first) {
>  		/* Add the entity to the run queue */
>  		spin_lock(&entity->rq_lock);
> -		if (!entity->rq) {
> -			DRM_ERROR("Trying to push to a killed entity\n");
> -			spin_unlock(&entity->rq_lock);
> -			return;
> -		}
>  		drm_sched_rq_add_entity(entity->rq, entity);
>  		spin_unlock(&entity->rq_lock);
>  		drm_sched_wakeup(entity->rq->sched);
From c581a410e1b3f82de2bb14746d8484db2162f82c Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
Date: Mon, 13 Aug 2018 12:33:29 -0400
Subject: drm/scheduler: Fix possible race condition.
Problem:
The possible race scenario is as follows:
Process A is preempted after doing drm_sched_entity_flush->cmpxchg(...).
Now process B, working on the same (forked) entity, is inside
drm_sched_entity_push_job; it writes itself (current->group_leader) to
entity->last_user and also executes drm_sched_rq_add_entity. Process A
then runs again and executes drm_sched_rq_remove_entity, inadvertently
removing process B's entity from its scheduler rq.

Fix:
Lock entity->last_user accesses together with adds/removals
of the entity to the rq.
Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
---
drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index f566405..5649c3d 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -311,10 +311,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 	/* For killed process disable any more IBs enqueue right now */
+	spin_lock(&entity->rq_lock);
 	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
 	if ((!last_user || last_user == current->group_leader) &&
 	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
 		drm_sched_rq_remove_entity(entity->rq, entity);
+	spin_unlock(&entity->rq_lock);
 
 	return ret;
 }
@@ -596,17 +598,23 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	trace_drm_sched_job(sched_job, entity);
 	atomic_inc(&entity->rq->sched->num_jobs);
-	WRITE_ONCE(entity->last_user, current->group_leader);
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
 
-	/* first job wakes up scheduler */
-	if (first) {
-		/* Add the entity to the run queue */
-		spin_lock(&entity->rq_lock);
+	/*
+	 * The entity might not be attached to the rq, either because this is
+	 * the first time we use it or because another process removed it from
+	 * the rq during its termination in drm_sched_entity_flush. In that
+	 * case we need to add the entity back to the rq.
+	 */
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
 		drm_sched_rq_add_entity(entity->rq, entity);
-		spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->rq_lock);
+
+	/* first job wakes up scheduler */
+	if (first)
 		drm_sched_wakeup(entity->rq->sched);
-	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
--
2.7.4