Hi Dietmar,

On 07/08/19 18:31, Dietmar Eggemann wrote:
> On 7/26/19 4:54 PM, Peter Zijlstra wrote:
> > 
> > 
> > Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> 
> [...]
> 
> > @@ -889,6 +891,8 @@ static void update_curr(struct cfs_rq *c
> >             trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> >             cgroup_account_cputime(curtask, delta_exec);
> >             account_group_exec_runtime(curtask, delta_exec);
> > +           if (curtask->server)
> > +                   dl_server_update(curtask->server, delta_exec);
> >     }
> 
> I get a lockdep_assert_held(&rq->lock) related warning in start_dl_timer()
> when running the full stack.
> 
> ...
> [    0.530216] root domain span: 0-5 (max cpu_capacity = 1024)
> [    0.538655] devtmpfs: initialized
> [    0.556485] update_curr: rq mismatch rq[0] != rq[4]
> [    0.561519] update_curr: rq mismatch rq[0] != rq[4]
> [    0.566497] update_curr: rq mismatch rq[0] != rq[4]
> [    0.571443] update_curr: rq mismatch rq[0] != rq[4]
> [    0.576762] update_curr: rq mismatch rq[2] != rq[4]
> [    0.581674] update_curr: rq mismatch rq[2] != rq[4]
> [    0.586569] ------------[ cut here ]------------
> [    0.591220] WARNING: CPU: 2 PID: 2 at kernel/sched/deadline.c:916 
> start_dl_timer+0x160/0x178
> [    0.599686] Modules linked in:
> [    0.602756] CPU: 2 PID: 2 Comm: kthreadd Tainted: G        W         
> 5.3.0-rc3-00013-ga33cf033cc99-dirty #64
> [    0.612620] Hardware name: ARM Juno development board (r0) (DT)
> [    0.618560] pstate: 60000085 (nZCv daIf -PAN -UAO)
> [    0.623369] pc : start_dl_timer+0x160/0x178
> [    0.627572] lr : start_dl_timer+0x160/0x178
> [    0.631768] sp : ffff000010013cb0
> ...
> [    0.715075] Call trace:
> [    0.717531]  start_dl_timer+0x160/0x178
> [    0.721382]  update_curr_dl_se+0x108/0x208
> [    0.725494]  dl_server_update+0x2c/0x38
> [    0.729348]  update_curr+0x1b4/0x3b8
> [    0.732934]  task_tick_fair+0x74/0xa88
> [    0.736698]  scheduler_tick+0x94/0x110
> [    0.740461]  update_process_times+0x48/0x60
> ...
> 
> Seems to be related to the fact that the rq can change:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4c14851a34c..5e3130a200ec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -891,8 +891,17 @@ static void update_curr(struct cfs_rq *cfs_rq)
>                 trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>                 cgroup_account_cputime(curtask, delta_exec);
>                 account_group_exec_runtime(curtask, delta_exec);
> -               if (curtask->server)
> +               if (curtask->server) {
> +                       struct rq *rq = rq_of(cfs_rq);
> +                       struct rq *rq2 = curtask->server->rq;
> +
> +                       if (rq != rq2) {
> +                               printk("update_curr: rq mismatch rq[%d] != 
> rq[%d]\n",
> +                                      cpu_of(rq), cpu_of(rq2));
> +                       }
> +
>                         dl_server_update(curtask->server, delta_exec);
> +               }
>         }
> 
> ...

Yeah, I actually noticed the same. Some debugging seems to point to
early boot spawning of kthreads. I can reliably for example attribute
this mismatch to ksoftirqd(s). It looks like they can avoid the
dl_server assignment in pick_next_task_dl() and this breaks things.
Still need to figure out why this happens and how to fix it, though.

Best,

Juri

Reply via email to