Hi Vincent,

Sorry for the delay; I just returned from the Chinese New Year holiday.
On 2021/1/25 22:51, Vincent Guittot wrote:
> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey <aubrey...@linux.intel.com> wrote:
>>
>> On 2021/1/25 18:56, Vincent Guittot wrote:
>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li <aubrey...@intel.com> wrote:
>>>>
>>>> A long-tail load balance cost is observed on the newly idle path;
>>>> it is caused by a race window between the first nr_running check
>>>> of the busiest runqueue and its nr_running recheck in detach_tasks.
>>>>
>>>> Before the busiest runqueue is locked, the tasks on the busiest
>>>> runqueue could be pulled by other CPUs and nr_running of the busiest
>>>> runqueue could drop to 1, which causes detach_tasks to break out with
>>>> LBF_ALL_PINNED still set.
>>>
>>> We should better detect that when trying to detach tasks, like below:
>>
>> This should be a compromise, in my understanding. If we give up the
>> load balance this time due to the race condition, we do reduce the load
>> balance cost on the newly idle path, but if there is indeed an imbalance
>> at the same sched_domain
>
> The redo path is there in case LB has found an imbalance but can't move
> some load from this busiest rq to the dest rq because of cpu affinity.
> So it tries to fix the imbalance by moving load onto another rq of the
> group. In your case, the imbalance has disappeared because it has
> already been pulled by another rq, so you don't have to try to find
> another imbalance. And I would even say you should not, in order to let
> other levels take a chance to spread the load.
>
>> level, we have to wait for the next softirq entry to handle that
>> imbalance. This means the tasks on the second busiest runqueue have to
>> stay longer, which could introduce tail latency as well. That's why I
>> introduced a variable to control the redo loops. I'll send this to the
>> benchmark queue to see if it makes any difference.
>
> TBH, I don't like multiplying the number of knobs

Sure, I can take your approach. :)

>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env)
>>>
>>>         lockdep_assert_held(&env->src_rq->lock);
>>>
>>> +       /*
>>> +        * Another CPU has emptied this runqueue in the meantime.
>>> +        * Just return and leave the load_balance properly.
>>> +        */
>>> +       if (env->src_rq->nr_running <= 1 && !env->loop) {

May I know why !env->loop is needed here? IIUC, if detach_tasks() is
re-entered after a LBF_NEED_BREAK break, env->loop could be non-zero,
but as long as src_rq's nr_running is <= 1, we should return immediately
with the LBF_ALL_PINNED flag cleared. (A small toy model of why the
cleared flag matters on this path is appended after my signature.)

How about the following change?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..1761d33accaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env)
          * We don't want to steal all, otherwise we may be treated likewise,
          * which could at worst lead to a livelock crash.
          */
-        if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+        if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) {
+                /* Clear the flag as we will not test any task */
+                env->flags &= ~LBF_ALL_PINNED;
                 break;
+        }

         p = list_last_entry(tasks, struct task_struct, se.group_node);

Thanks,
-Aubrey
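P.S. For anyone skimming the thread, here is a small userspace toy model
(my own simplification, not the kernel code; the names only mirror
kernel/sched/fair.c, and detach_tasks() is reduced to the single decision
discussed above) of why leaving LBF_ALL_PINNED set after the early
bail-out steers load_balance() into its all-pinned handling, while
clearing it lets the newly idle balance finish right away:

/*
 * Toy model: NOT kernel code.  src_nr_running stands in for
 * env->src_rq->nr_running, and balance_outcome() mimics only the
 * "what does load_balance() do when detach_tasks() returns 0" step.
 */
#include <stdio.h>

#define LBF_ALL_PINNED  0x01

struct lb_env {
        unsigned int flags;
        int src_nr_running;     /* stand-in for env->src_rq->nr_running */
};

/* Simplified detach_tasks(): returns the number of detached tasks. */
static int detach_tasks(struct lb_env *env, int clear_on_empty)
{
        if (env->src_nr_running <= 1) {
                /* The busiest rq was drained by another CPU while we raced. */
                if (clear_on_empty)
                        env->flags &= ~LBF_ALL_PINNED;  /* the proposed change */
                return 0;
        }

        /* A movable task exists, so the rq is clearly not all-pinned. */
        env->flags &= ~LBF_ALL_PINNED;
        return 1;
}

/* Simplified tail of load_balance(): react to what detach_tasks() left behind. */
static const char *balance_outcome(int clear_on_empty)
{
        struct lb_env env = {
                .flags = LBF_ALL_PINNED,        /* set before scanning tasks */
                .src_nr_running = 1,            /* the race window has hit */
        };

        if (!detach_tasks(&env, clear_on_empty) && (env.flags & LBF_ALL_PINNED))
                return "all-pinned handling: drop busiest CPU, possibly redo (long tail)";

        return "return immediately: the imbalance was already resolved";
}

int main(void)
{
        printf("flag left set: %s\n", balance_outcome(0));
        printf("flag cleared:  %s\n", balance_outcome(1));
        return 0;
}

In the real load_balance(), the all-pinned case also clears the busiest
CPU from the candidate mask and may redo the whole group scan, which is
exactly the extra cost we want to avoid on the newly idle path.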