Any more comments for this idea? :)

On 01/22/2014 03:40 PM, Alex Shi wrote:
> On 01/21/2014 10:57 PM, Peter Zijlstra wrote:
>> On Tue, Jan 21, 2014 at 10:04:26PM +0800, Alex Shi wrote:
>>>
>>> The current scheduler load balance works bottom-up: each CPU needs
>>> to initiate the balance by itself.
>>>
>>> 1. Take an integrated computer system with smt/core/cpu/numa, i.e.
>>> 4 levels of scheduler domains. If there are just 2 tasks in the
>>> whole system, both running on cpu0, the current load balance needs
>>> to pull a task to the other smt sibling in the smt domain, then to
>>> another core, then to another cpu, and finally to another numa
>>> node. In total it takes 4 task moves to get the system balanced.
>>
>> Except the idle load balancer, and esp. the newidle, can totally
>> bypass this.
>>
>> If you do the packing right in the newidle pass, you'd get there in
>> 1 step.
>
> It puts huge pressure on me to argue with a great expert like you. I
> am waiting for, and very much appreciate, any comments and
> corrections. :)
>
> Yes, a newidle balance will relieve this, but it cannot eliminate it.
> If the newidle happens in another numa group, it needs just 1 step.
> But if it happens in another smt group, it still needs 4 steps. So
> generally, we still need some more steps before the system is well
> balanced.
>
> In this example, if the newidle cpu is in the same smallest group,
> maybe we should wake up the remotest cpu in the system/llc instead,
> to avoid extra task moves in the near future, for best performance.
> And for power saving, maybe we had better pack the tasks into the
> smallest group and let the remote cpu groups stay idle. But the
> current newidle can do neither, because newidle is also bottom-up.
>>
>>> Generally, the task moving complexity is
>>> O(nm log n), n := nr_cpus, m := nr_tasks
>>>
>>> There is an excellent summary and explanation of this in
>>> kernel/sched/fair.c:4605
>>
>> Which is a perfectly fine scheme for a busy system.
>>
>>> Another weakness of the current LB is that every cpu needs to fetch
>>> the other cpus' load info repeatedly and try to figure out the
>>> busiest sched group/queue on every sched domain level. That often
>>> just wastes time, since it may not lead to any task move. One of
>>> the reasons is that a cpu can only pull tasks, not push them.
>>
>> This doesn't make sense.. and in fact, we do a limited amount of 3rd
>> party movements.
>
> Yes, but the 3rd party movement is too limited; it is only done for
> pinned tasks.
>>
>> Whatever you do, you have to repeat the information gathering
>> anyhow, because it constantly changes.
>
> Yes, it is fine to collect the load info once per balance. But if the
> balancing cpu is itself the busiest cpu, the current balance still
> keeps collecting every group's load info from bottom to top, and then
> does nothing about this imbalanced system. That is bad.
>
>> Trying to serialize that doesn't make any kind of sense. The only
>> thing you want is that the system converges.
>
> Sorry, would you like to give a bit more detail on why 'serializing'
> makes no sense?
>>
>> Skipped the rest because it seems built on a fundament I don't agree
>> with. That 4 move thing is just silly for an idle system, and we
>> shouldn't do that.
>>
>> I also very much do not want a single CPU balancing the entire
>> system, that's the antithesis of scalable.
>
> Sorry. IMHO, a single cpu can handle the balancing of 1000 cpus. And
> it is far more scalable than every cpu in the system doing its own
> balance, since only one cpu needs to collect the other cpus' load
> info.
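To make the idea concrete, here is a minimal userspace toy of the
top-down, single-balancer pass (plain C, not kernel code; every name
and constant is invented purely for illustration, and it assumes a
simple binary domain tree):

#include <stdio.h>

#define NR_CPUS 8
static int load[NR_CPUS];	/* runnable tasks per cpu */
static int moves;		/* task migrations performed */

static int sum(int lo, int hi)
{
	int s = 0;
	for (int i = lo; i < hi; i++)
		s += load[i];
	return s;
}

static int busiest(int lo, int hi)
{
	int b = lo;
	for (int i = lo; i < hi; i++)
		if (load[i] > load[b])
			b = i;
	return b;
}

static int idlest(int lo, int hi)
{
	int v = lo;
	for (int i = lo; i < hi; i++)
		if (load[i] < load[v])
			v = i;
	return v;
}

/*
 * One pass by a single balancer over a binary domain tree covering
 * cpus [lo, hi): push tasks across the boundary of the current level
 * first, then recurse downward into each half.
 */
static void top_down(int lo, int hi)
{
	int mid = (lo + hi) / 2;

	if (hi - lo < 2)
		return;
	while (sum(lo, mid) > sum(mid, hi) + 1) {
		load[busiest(lo, mid)]--;	/* third-party push, */
		load[idlest(mid, hi)]++;	/* not a local pull   */
		moves++;
	}
	while (sum(mid, hi) > sum(lo, mid) + 1) {
		load[busiest(mid, hi)]--;
		load[idlest(lo, mid)]++;
		moves++;
	}
	top_down(lo, mid);
	top_down(mid, hi);
}

int main(void)
{
	load[0] = 2;	/* the 2-tasks-on-cpu0 example above */
	top_down(0, NR_CPUS);
	/* prints moves=1: one push to the far half, not 4 hops */
	printf("moves=%d load[0]=%d\n", moves, load[0]);
	return 0;
}

With the 2-tasks-on-cpu0 input, the highest level moves one task
straight to the remote half, so the system is balanced after a single
migration instead of one per domain level.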
> BTW, there is currently no coordination among all the cpus'
> balancing. That's a bit of a mess. For example, if 2 cpus in one
> small cpu group both run a whole-system balance at the same time,
> both of them may decide their own group is light and want more load,
> so together they have a chance to over-pull load into their group.
> That is bad. A single balancer has no such problem.
>
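To make that over-pull race concrete, here is a self-contained toy
(userspace C with pthreads; the loads and names are all made up): two
balancer threads snapshot the same busy-group load, each computes a
pull that would be correct on its own, and together they strip the
busy group bare.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int busy_load = 4;	/* tasks in the loaded group */
static int my_load[2];		/* tasks on the two pulling cpus */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *balance(void *arg)
{
	int cpu = *(int *)arg;
	int snapshot, pull;

	/* Snapshot the remote load, as a real balance pass would. */
	pthread_mutex_lock(&lock);
	snapshot = busy_load;
	pthread_mutex_unlock(&lock);

	/* Pull enough to equalize this cpu with the busy group. */
	pull = (snapshot - my_load[cpu]) / 2;

	usleep(1000);	/* both cpus decide before either one moves */

	pthread_mutex_lock(&lock);
	busy_load -= pull;
	my_load[cpu] += pull;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	int id[2] = { 0, 1 };

	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, balance, &id[i]);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	/* Each pull of 2 was fine alone; with this timing both see
	 * the stale snapshot, giving 0/2/2 -- the busy group was
	 * over-pulled down to idle. */
	printf("busy group: %d, cpu0: %d, cpu1: %d\n",
	       busy_load, my_load[0], my_load[1]);
	return 0;
}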
--
Thanks
Alex