Hi Steven, Peter,

On 2017/9/26 11:18, Steven Rostedt wrote:
On Tue, 26 Sep 2017 09:23:20 +0800
zhouchengming<zhouchengmi...@huawei.com>  wrote:

On 2017/9/26 3:40, Steven Rostedt wrote:
On Mon, 11 Sep 2017 14:51:49 +0800
Zhou Chengming<zhouchengmi...@huawei.com>   wrote:

push_rt_task() pick the first pushable task and find an eligible
lowest_rq, then double_lock_balance(rq, lowest_rq). So if
double_lock_balance() unlock the rq (when double_lock_balance() return 1),
we have to check if this task is still on the rq.

The problem is that the check conditions are not sufficient:

if (unlikely(task_rq(task) != rq ||
             !cpumask_test_cpu(lowest_rq->cpu,&task->cpus_allowed) ||
             task_running(rq, task) ||
             !rt_task(task) ||
             !task_on_rq_queued(task))) {

cpu2                            cpu1                    cpu0
push_rt_task(rq1)
    pick task_A on rq1
    find rq0
      double_lock_balance(rq1, rq0)
        unlock(rq1)
                                rq1 __schedule
                                  pick task_A run
                                task_A sleep (dequeued)
        lock(rq0)
        lock(rq1)
      do_above_check(task_A)
        task_rq(task_A) == rq1
        cpus_allowed unchanged
        task_running == false
        rt_task(task_A) == true
                                                        try_to_wake_up(task_A)
                                                          select_cpu = cpu3
                                                          enqueue(rq3, task_A)
How can this happen? The try_to_wake_up(task_A) needs to grab the rq
that task A is on, and we have that rq lock.

/me confused.

-- Steve
Thanks for the reply!
After the task_A sleep on cpu1, the try_to_wake_up(task_A) on cpu0 select a 
different cpu3,
so it will grab the rq3 lock, not the rq1 lock.
Ah crap. This is caused by 7608dec2ce20 ("sched: Drop the rq argument
to sched_class::select_task_rq()"). Because this code depends on
try_to_wake_up() grabbing the task's rq lock. But it no longer does
that, and it causes this race.

OK, I need to look at this deeper when I'm not so jetlagged and typing
this because I can't sleep at 5am.

Thanks for pointing this out!

It may be fixed by simply grabbing the run queue lock on migration, as
that would sync things up.

Is there any new solution? I don't think grabbing the rq lock without the 
task->pi_lock
will fix this problem. And I think my patch is correct and the changes are 
small.

Thanks!

Peter?


-- Steve



.



Reply via email to