On Tue, May 10, 2016 at 05:26:05PM +0200, Mike Galbraith wrote:
> On Tue, 2016-05-10 at 09:49 +0200, Mike Galbraith wrote:
> 
> > Only whacking cfs_rq_runnable_load_avg() with a rock makes
> > schbench -m <sockets> -t <near socket size> -a work well. 'Course
> > a rock in its gearbox also rendered load balancing fairly busted
> > for the general case :)
> 
> Smaller rock doesn't injure heavy tbench, but more importantly, still
> demonstrates the issue when you want full spread.
> 
> schbench -m4 -t38 -a
> 
> cputime 30000 threads 38 p99 177
> cputime 30000 threads 39 p99 10160
> 
> LB_TIP_AVG_HIGH
> cputime 30000 threads 38 p99 193
> cputime 30000 threads 39 p99 184
> cputime 30000 threads 40 p99 203
> cputime 30000 threads 41 p99 202
> cputime 30000 threads 42 p99 205
> cputime 30000 threads 43 p99 218
> cputime 30000 threads 44 p99 237
> cputime 30000 threads 45 p99 245
> cputime 30000 threads 46 p99 262
> cputime 30000 threads 47 p99 296
> cputime 30000 threads 48 p99 3308
> 
> 47*4+4=nr_cpus yay

yay... and haha, "a perfect world"...
> ---
>  kernel/sched/fair.c     | 3 +++
>  kernel/sched/features.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3027,6 +3027,9 @@ void remove_entity_load_avg(struct sched
> 
>  static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
>  {
> +	if (sched_feat(LB_TIP_AVG_HIGH) && cfs_rq->load.weight > cfs_rq->runnable_load_avg*2)
> +		return cfs_rq->runnable_load_avg + min_t(unsigned long, NICE_0_LOAD,
> +							 cfs_rq->load.weight/2);
>  	return cfs_rq->runnable_load_avg;
>  }

cfs_rq->runnable_load_avg is certainly no greater than load.weight (in this case it is much less, maybe half of it), whereas load_avg is not necessarily a rock in the gearbox that only impedes speeding up; it impedes slowing down as well. But I really don't know what kind of load the references in select_task_rq() should use. So maybe the real issue is that the two are mixed, i.e., load balancing proper and just wanting an idle CPU are conflated?
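
For illustration only, here is a standalone sketch of the tipping arithmetic in the quoted hack (plain userspace C, not the kernel code; NICE_0_LOAD assumed to be 1024, and the sample weight/average values are invented):

/*
 * Standalone sketch of the quoted hack's arithmetic (userspace, not
 * kernel code).  NICE_0_LOAD is assumed to be 1024 here; the sample
 * load.weight / runnable_load_avg values are made up for illustration.
 */
#include <stdio.h>

#define NICE_0_LOAD	1024UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Mirrors the patched cfs_rq_runnable_load_avg() with LB_TIP_AVG_HIGH set. */
static unsigned long tipped_load(unsigned long weight, unsigned long runnable_avg)
{
	if (weight > runnable_avg * 2)
		return runnable_avg + min_ul(NICE_0_LOAD, weight / 2);
	return runnable_avg;
}

int main(void)
{
	/* avg lagging far behind weight (e.g. freshly woken tasks): tipped up */
	printf("weight 2048, avg  512 -> reported %lu\n", tipped_load(2048, 512));
	/* avg has caught up with weight: reported load unchanged */
	printf("weight 2048, avg 1800 -> reported %lu\n", tipped_load(2048, 1800));
	return 0;
}

That is, the reported load is only tipped upward while the decayed average trails load.weight by more than a factor of two, and the boost is capped at one nice-0 task's worth of load.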