On 11 October 2011 12:27, Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
> On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
>> On 11 October 2011 11:13, Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
>> > On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
>> >> I have several goals. The 1st one is that I need to put more load on
>> >> some cpus when I have packages with different cpu frequency.
>> >
>> > That should be rather easy.
>> >
>>
>> I agree, I was mainly wondering if I should use a [1-1024] or a
>> [1024-xxxx] range and it seems that both can be used: SMT
>> uses <1024 and x86 turbo mode uses >1024
>
> Well, turbo mode would typically only boost a cpu 25% or so, and only
> while idling other cores to keep under its thermal limit. So it's not
> sufficient to actually affect the capacity calculation much if at all.
>
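The point about turbo mode can be checked with a little arithmetic. Below is a minimal sketch, assuming that group capacity is cpu_power divided by 1024 and rounded to the nearest integer, which is how I understand the load balancer of that period to compute it. The names `group_capacity` and `div_round_closest` are illustrative only, not kernel functions.

```c
/* Assumption taken from the thread: a cpu_power of 1024 is one full CPU. */
#define SCHED_POWER_SCALE 1024UL

/* Round-to-nearest division for unsigned operands. */
unsigned long div_round_closest(unsigned long x, unsigned long d)
{
    return (x + d / 2) / d;
}

/* Hypothetical helper: how many tasks a group is considered able to run
 * before the load balancer sees it as out of capacity. */
unsigned long group_capacity(unsigned long group_power)
{
    return div_round_closest(group_power, SCHED_POWER_SCALE);
}
```

Under these assumptions a 25% boost gives a cpu_power of 1280, which still rounds to a capacity of 1. The capacity only becomes 2 once cpu_power reaches 1536.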
OK

>> >> Then, I have some use cases which have several running tasks but a low
>> >> cpu load. In this case, the small tasks are spread on several cpu by
>> >> the load_balance whereas they could be easily handled by one cpu
>> >> without significant performance modification.
>> >
>> > That shouldn't be done using cpu_power, we have sched_smt_power_savings
>> > and sched_mc_power_savings for stuff like that.
>> >
>>
>> sched_mc_power_savings works fine when we have more than 2 cpus but
>> can't apply on a dual core because it needs at least 2 sched_groups
>> and the nr_running of these sched_groups must be higher than 0 but
>> smaller than group_capacity which is 1 on a dual core system.
>
> SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the
> capacity iirc. And I know some IBM dudes were toying with the idea of
> playing tricks with the capacity numbers, but that never went anywhere.
>

Yes, but it's only a special case for 2 tasks on a dual core, and the
SD_WAKE_AFFINE flag and cpu_idle_sibling can override this decision.

>> > Although I would really like to kill all those different
>> > sched_*_power_savings knobs and reduce it to one.
>> >
>> >> If the cpu_power is
>> >> higher than 1024, the cpu is no more seen out of capacity by the
>> >> load_balance as soon as a short process is running and the main result
>> >> is that the small tasks will stay on the same cpu. This configuration
>> >> is mainly useful for ARM dual core system when we want to power gate
>> >> one cpu. I use cyclictest to simulate such use case.
>> >
>> > Yeah, but that's wrong.
>>
>> That's the only way I have found to gather small tasks without any
>> relationship on one cpu. Do you know any better solution ?
>
> How do you know the task is 'small' ?
>

I want to use cpufreq to be notified that we have a large/small cpu load.
If we have several tasks but the cpu uses the lowest frequency, it
"should" mean that we have small tasks that are running (less than
20ms*95% of added duration) and we could gather them on one cpu (by
increasing the cpu_power on a dual core).

> For that you would need to track a time-weighted effective load average
> of the task and we don't have that.
>

Yes, that's why I use cpufreq until a better option, like a
time-weighted load average, is available.

> [ how bad is all this u64 math on ARM btw? and when will ARM finally
> agree all this 32bit nonsense is a waste of time and silicon? ]
>
> But yeah, the whole nr_running vs capacity thing was traditionally to
> deal with spreading single tasks around. And traditional power aware
> scheduling was mostly about packing those on sockets (keeps other
> sockets idle) instead of spreading them around sockets (optimizes
> cache).
>
> Now I wouldn't at all mind you ripping out all that
> sched_*_power_savings crap and replacing it, I doubt it actually works
> anyway. I haven't got many patches on the subject, and I know I don't
> have the equipment to measure power usage.
>
> Also, the few patches I got mostly made the sched_*_power_savings mess
> bigger, which I refuse to do (what sysad wants to have a 27-state space
> to configure his power aware scheduling). This has mostly made people go
> away instead of fixing things up :-(
>
> As to what the replacement would have to look like, dunno, it's not
> something I've really thought much about, but maybe the time-weighted
> stuff is the only sane approach, that combined with options on how to
> spread tasks (core, socket, node, etc..).
>
> I really think changing the load-balancer is the right way to go about
> solving your power issue (hot-plugging a cpu really is an insane way to
> idle a core) and I'm open to discussing what would work for you.
>

Great.
My 1st goal was not to modify the load-balancer and sched_mc (or as
little as possible) and to study how I could tune the scheduler
parameters to have the best power consumption on ARM platforms. Now,
changing the load-balancer is probably a better solution.

> All I really ask is to not cobble something together, the load-balancer
> is a horridly complex thing already and the last thing it needs is more
> special cases that don't interact properly.
>

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
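For reference, the "time-weighted load average" idea discussed in this thread could look roughly like the sketch below. It is only an illustration under my own assumptions: the `task_load` structure, both function names, the 3/4 history weight and the threshold are all hypothetical, and nothing like this existed in the scheduler when the thread was written. It uses only unsigned long fixed-point math, so no u64 operations are needed on 32-bit ARM.

```c
/* Fixed-point scale: 1024 means the task ran for the whole period. */
#define LOAD_SCALE 1024UL

/* Hypothetical per-task tracking state, not an existing kernel structure. */
struct task_load {
    unsigned long avg;  /* decayed running fraction, 0..LOAD_SCALE */
};

/* Fold one sampling period into the average. Older periods keep 3/4 of
 * their weight at each step (an assumed decay factor). ran_us must stay
 * small enough that ran_us * LOAD_SCALE does not overflow, and period_us
 * must be non-zero. */
void task_load_update(struct task_load *tl,
                      unsigned long ran_us, unsigned long period_us)
{
    unsigned long sample = ran_us * LOAD_SCALE / period_us;

    tl->avg = (3 * tl->avg + sample) / 4;
}

/* A task counts as "small" while its average is below the threshold. */
int task_is_small(const struct task_load *tl, unsigned long threshold)
{
    return tl->avg < threshold;
}
```

With something like this, a task that only runs in short bursts keeps a low average and could be packed with other small tasks on one cpu, without having to infer task size from the cpufreq state.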