On Wed, Jun 11, 2014 at 02:13:42PM +0800, Michael wang wrote:
> Hi, Peter
>
> Thanks for the reply :)
>
> On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
> [snip]
> >> Wake-affine for sure pull tasks together for workload like dbench, what make
> >> it difference when put dbench into a group one level deeper is the
> >> load-balance, which happened less.
> >
> > We load-balance less (frequently) or we migrate less tasks due to
> > load-balancing ?
>
> IMHO, when we put tasks one group deeper, in other word the totally
> weight of these tasks is 1024 (prev is 3072), the load become more
> balancing in root, which make bl-routine consider the system is
> balanced, which make we migrate less in lb-routine.

But how? The absolute value (1024 vs 3072) is of no effect to the
imbalance, the imbalance is computed from relative differences between
cpus.

> Our comparison is based on the same busy-system, all the two cases have
> the same workload running, the only difference is that we put the same
> workload (dbench + stress) one group deeper, it's like:
>
> Good case:
>                 root
>         l1-A    l1-B    l1-C
>         dbench  stress  stress
>
>         results:
>                 dbench got around 300%
>                 each stress got around 450%
>
> Bad case:
>                 root
>                 l1
>         l2-A    l2-B    l2-C
>         dbench  stress  stress
>
>         results:
>                 dbench got around 100% (throughout dropped too)
>                 each stress got around 550%
>
> Although the l1-group gain the same resources (1200%), it doesn't assign
> to l2-ABC correctly like the root-group did.

But in this case select_idle_sibling() should function identically, so
that cannot be the problem.

> > The second is adding the cgroup crap on.
> >
> >> However, in our cases the load balance could not help on that, since deeper
> >> the group is, less the load effect it means to root group.
> >
> > But since all actual load is on the same depth, the relative threshold
> > (imbalance pct) should work the same, the size of the values don't
> > matter, the relative ratios do.
>
> Exactly, however, when group is deep, the chance of it to make root
> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> imbalance and gain help from the routine, please note that although
> dbench and stress are the only workload in system, there are still other
> tasks serve for the system need to be wakeup (some very actively since
> the dbench...), compared to them, deep group load means nothing...

What tasks are these? And is it their interference that disturbs
load-balancing?

> >> By which means even tasks in deep group all gathered on one CPU, the load
> >> could still balanced from the view of root group, and the tasks lost the
> >> only chances (balance) to spread when they already on the same CPU...
> >
> > Sure, but see above.
>
> The lb-routine could not provide enough help for deep group, since the
> imbalance happened inside the group could not cause imbalance in root,
> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
> easily ignored, but inside the l2-group, the gathered case could already
> means imbalance like (1024 * 5) : 1024.

Your explanation is not making sense, we have 3 cgroups, so the total
root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

And again, the absolute value doesn't matter, with (istr) 12 cpus the
avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
scale.

Same with l2, total weight of 1024, giving a per task weight of ~56 and
a per-cpu weight of ~85, which is again significant.

Also, you said load-balance doesn't usually participate much because
dbench is too fast, so please make up your mind, does it or doesn't it
matter?

> > So I think that approach is wrong, select_idle_siblings() works because
> > we want to keep CPUs from being idle, but if they're not actually idle,
> > pretending like they are (in a cgroup) is actively wrong and can skew
> > load pretty bad.
>
> We only choose the timing when no idle cpu located, and flips is
> somewhat high, also the group is deep.

-enotmakingsense

> In such cases, select_idle_siblings() doesn't works anyway, it return
> the target even it is very busy, we just check twice to prevent it from
> making some obviously bad decision ;-)

-emakinglesssense

> > Furthermore, if as I expect, dbench sucks on a busy system, then the
> > proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> > alter behaviour like that.
>
> That's true and that's why we currently still need to shut down the
> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
> solve later...

more confusion..

> What we currently expect is that the cgroup assign the resource
> according to the share, it works well in l1-groups, so we expect it to
> work the same well in l2-groups...

Sure, but explain why it isn't? So far you're just saying words that
don't compute.
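
To spell out the weight arithmetic above, here is a minimal sketch in plain Python. It is illustrative only and is not how the scheduler computes load; relative_weights() is a made-up helper, and the 18 tasks and 12 cpus are simply the figures used in this thread (the cpu count being from memory).

```python
# Illustrative arithmetic only; this is not kernel code.
# relative_weights() is a hypothetical helper named for this example.

def relative_weights(total_weight, nr_tasks, nr_cpus):
    """Return (per-task weight, average per-cpu weight, their ratio)."""
    per_task = total_weight / nr_tasks
    per_cpu = total_weight / nr_cpus
    return per_task, per_cpu, per_task / per_cpu

NR_TASKS = 18  # task count used in this thread
NR_CPUS = 12   # cpu count as recalled in this thread (an assumption)

# Total root weight of 3072 (three 1024 groups) versus 1024 (one group).
for total in (3 * 1024, 1024):
    task, cpu, ratio = relative_weights(total, NR_TASKS, NR_CPUS)
    print(f"total {total}: per-task ~{int(task)}, "
          f"per-cpu ~{int(cpu)}, task/cpu ratio {ratio:.2f}")
```

The ratio of per-task weight to average per-cpu weight is nr_cpus/nr_tasks in both cases, so scaling the total weight from 3072 to 1024 does not change how significant a single task is relative to the average cpu load.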