On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
> >> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
> >> level sched groups.
> >
> > That'd be a NUMA box?
>
> I don't think it's NUMA. SD level are MC, DIE w/ # DIE sg's >> 2.
Ah, I can't read. I thought >2 DIEs.

> > So this is 4 * 18 * 2 = 144 cpus:
>
> Impressive ;-)

Takes forever to boot though :/

> > If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> > Then instantly they balance and I get 2x100% back.
>
> Yeah, could reproduce on IVB-EP (2x10x2).

OK, I have one of those. What should I do, because I didn't actually see
anything odd.

> > Anything else I need to reproduce? (other than maybe a slightly less
> > insane machine :-)
>
> I guess what Jeff is trying to avoid is that 'busiest->load_per_task',
> lowered to 'sds->avg_load' in case of an imbalanced busiest sg:
>
>     if (busiest->group_type == group_imbalanced)
>         busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);
>
> is so low that later fix_small_imbalance() won't be called and
> 'env->imbalance' stays so low that load-balance of one 50% task to the
> now idle cpu won't happen.
>
>     if (env->imbalance < busiest->load_per_task)
>         fix_small_imbalance(env, sds);
>
> Having really a lot of otherwise idle DIE sg's helps to keep
> 'sds->avg_load' low in comparison to 'busiest->load_per_task'.

Right, but the whole load_per_task thing is a bit wonky, and since
that's the basis of fix_small_imbalance() I'm very suspect.
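
FWIW, a quick userspace back-of-the-envelope of how hard that min() clamps
on the 144-cpu box. Rough model only: it assumes SCHED_CAPACITY_SCALE == 1024
and a nice-0 weight of 1024, ignores PELT and per-group capacity details, and
computes avg_load the way find_busiest_group() does
(SCHED_CAPACITY_SCALE * total_load / total_capacity):

    /* sketch, not kernel code: why lots of idle DIE groups sink avg_load */
    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024UL

    int main(void)
    {
            unsigned long nr_cpus = 144;                 /* 4 * 18 * 2 */
            unsigned long total_load = 3 * 1024;         /* three busy loops */
            unsigned long total_capacity = nr_cpus * SCHED_CAPACITY_SCALE;

            /* sds->avg_load ~ SCHED_CAPACITY_SCALE * total_load / total_capacity */
            unsigned long avg_load = SCHED_CAPACITY_SCALE * total_load / total_capacity;

            /* two 50% tasks sharing one cpu: ~512 apiece */
            unsigned long load_per_task = 512;

            printf("sds->avg_load         ~ %lu\n", avg_load);  /* ~21 */
            printf("clamped load_per_task ~ %lu\n",
                   avg_load < load_per_task ? avg_load : load_per_task);
            return 0;
    }

So the clamp takes load_per_task from ~512 down to ~21 there; all the
otherwise idle DIE groups are exactly what keeps sds->avg_load that small.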