Since the CFS scheduler currently has no power-saving consideration, I have a very rough idea for enabling a new power-saving scheme in CFS.

It is based on the following assumptions:

1. If the system is crowded with tasks, keeping only a few domains' cpus running and leaving the other cpus idle cannot save power. Letting all cpus take the load, finish the tasks early, and then go idle will save more power and give a better user experience.

2. Schedule domains and schedule groups perfectly match the hardware and the power consumption units. So pulling tasks out of a domain potentially means that this power consumption unit can go idle.

So, following what Peter mentioned in commit 8e7fbcbc22c (sched: Remove stale power aware scheduling), this proposal adopts the sched_balance_policy concept and uses 2 kinds of policy: performance and power.

In scheduling, 2 places will care about the policy: load_balance(), and task fork/exec in select_task_rq_fair().

Here is some pseudo code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair():

load_balance() {
	update_sd_lb_stats(); //get busiest group, idlest group data.

	if (sd->nr_running > sd's capacity) {
		//power saving policy is not suitable for
		//this scenario, it runs like performance policy
		mv tasks from busiest cpu in busiest group to
			idlest cpu in idlest group;
	} else { //the sd has enough capacity to hold all tasks.
		if (sg->nr_running > sg's capacity) {
			//imbalanced between groups
			if (schedule policy == performance) {
				//when 2 busiest groups are at the same busy
				//degree, need to prefer the one that has
				//the softest group??
				move tasks from busiest group to
					idlest group;
			} else if (schedule policy == power)
				move tasks from busiest group to
					idlest group until busiest is just full
					of capacity.
				//the busiest group can balance
				//internally after the next LB.
		} else {
			//all groups have enough capacity for their tasks.
			if (schedule policy == performance)
				//all tasks may have enough cpu
				//resources to run.
				//mv tasks from busiest to idlest group?
				//no, at this time it's better to keep
				//the task on its current cpu.
				//so it is maybe better to balance
				//within each group
				for_each_imbalance_groups()
					move tasks from busiest cpu to
						idlest cpu in each group;
			else if (schedule policy == power) {
				if (no hard pin in idlest group)
					mv tasks from idlest group to
						busiest until busiest is full.
				else
					mv unpinned tasks to the biggest
						hard pin group.
			}
		}
	}
}

select_task_rq_fair() {
	for_each_domain(cpu, tmp) {
		if (policy == power && tmp_has_capacity &&
		    tmp->flags & sd_flag) {
			sd = tmp;
			//It is fine to get a cpu in this domain
			break;
		}
	}

	while (sd) {
		if (policy == power)
			find_busiest_and_capable_group()
		else
			find_idlest_group();
		if (!group) {
			sd = sd->child;
			continue;
		}
		...
	}
}

sub proposal:
1. Is it possible to balance tasks onto the idlest cpu instead of the appointed 'balance cpu'? If so, it may save one extra round of balancing. The idlest cpu can prefer a newly idle cpu, and is the least-loaded cpu.
2. se or task load is good for setting running time, but it should be the second basis in load balancing. The first basis of LB is the number of running tasks in the group/cpu, since whatever the weight of the groups is, if the number of tasks is less than the number of cpus, the group still has capacity to take more tasks. (This will consider SMT cpu power or other big/little cpu capacity on ARM.)

unsolved issues:
1. Like the current scheduler, it does not handle cpu affinity well in load_balance.
2. Task groups are not considered well in this rough proposal.

The idea is not fully thought through and may contain mistakes. So I am just sharing my ideas and hope they become better and workable through your comments and discussion.

Thanks
Alex
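
As a rough illustration of the load_balance() decision tree and the power-policy group selection described above, here is a minimal user-space C sketch. It is not kernel code: the types and names (balance_policy, lb_action, group_stat, choose_lb_action, find_busiest_capable_group) are hypothetical and exist only for this example, and "capacity" is assumed to be simply the number of tasks a group can run at once. Hard-pinned tasks and task groups are ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not the kernel's types. */
enum balance_policy { POLICY_PERFORMANCE, POLICY_POWER };

enum lb_action {
	LB_SPREAD_ACROSS_GROUPS,  /* busiest group -> idlest group */
	LB_DRAIN_TO_CAPACITY,     /* busiest -> idlest until busiest is just full */
	LB_BALANCE_WITHIN_GROUPS, /* keep tasks in their group, even out its cpus */
	LB_PACK_INTO_BUSIEST,     /* idlest group -> busiest until busiest is full */
};

struct group_stat {
	unsigned int nr_running; /* runnable tasks in the group */
	unsigned int capacity;   /* assumed: tasks the group can run at once */
};

/* Pick the balancing action for one domain made of nr_groups groups. */
enum lb_action choose_lb_action(enum balance_policy policy,
				const struct group_stat *groups,
				size_t nr_groups)
{
	unsigned int sd_running = 0, sd_capacity = 0;
	bool group_overloaded = false;

	for (size_t i = 0; i < nr_groups; i++) {
		sd_running += groups[i].nr_running;
		sd_capacity += groups[i].capacity;
		if (groups[i].nr_running > groups[i].capacity)
			group_overloaded = true;
	}

	/* Domain is overcommitted: power policy behaves like performance. */
	if (sd_running > sd_capacity)
		return LB_SPREAD_ACROSS_GROUPS;

	/* Domain fits, but at least one group is over its capacity. */
	if (group_overloaded)
		return policy == POLICY_POWER ? LB_DRAIN_TO_CAPACITY
					      : LB_SPREAD_ACROSS_GROUPS;

	/* Every group has room for its own tasks. */
	return policy == POLICY_POWER ? LB_PACK_INTO_BUSIEST
				      : LB_BALANCE_WITHIN_GROUPS;
}

/*
 * Power-policy placement: index of the busiest group that still has
 * spare capacity, or -1 if every group is full.
 */
int find_busiest_capable_group(const struct group_stat *groups,
			       size_t nr_groups)
{
	int best = -1;

	for (size_t i = 0; i < nr_groups; i++) {
		if (groups[i].nr_running >= groups[i].capacity)
			continue; /* no room left in this group */
		if (best < 0 ||
		    groups[i].nr_running > groups[best].nr_running)
			best = (int)i;
	}
	return best;
}
```

For example, with two groups of capacity 4 running 3 and 1 tasks, the power policy would pack into the busier group while the performance policy would only balance inside each group. When find_busiest_capable_group() returns -1, a caller would presumably descend to the child domain, as the select_task_rq_fair() pseudo code does when no group is found.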