On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
>> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
>>> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>>>> This patch introduces the sysctl for sched_domain based migration costs.
>>>> These in turn can be used for performance tuning of workloads.
>>>
>>> With this patch, we trade 1 completely bogus constant (cost is really
>>> highly variable) for 3, twiddling of which has zero effect unless you
>>> trigger a domain rebuild afterward, which is neither mentioned in the
>>> changelog, nor documented.
>>>
>>> bogo-numbers++ is kinda hard to love.
>>
>> Yup, the domain rebuild is missing.
>>
>> I am no fan of tunables, the fewer the better, but one of the several flaws
>> of the single figure for migration cost is that it ignores the very large
>> difference in cost when migrating between near vs far levels of the cache
>> hierarchy.
>> Migration between CPUs of the same core should be free, as they share L1
>> cache.
>> Rohit defined a tunable for it, but IMO it could be hard coded to 0.
>
> That cost is never really 0 in the context of load balancing, as the
> load balancing machinery is non-free. When the idle_balance() throttle
> was added, that was done to mitigate the (at that time) quite high cost
> to high frequency cross core scheduling ala localhost communication.

I was imprecise. The cache-loss component of cost as represented by
sched_migration_cost should be 0 in this case. The cost of the machinery
is non-zero and remains in the code, and can still prevent migration.

>> Migration
>> between CPUs in different sockets is the most expensive and is represented by
>> the existing sysctl_sched_migration_cost tunable. Migration between CPUs in
>> the same core cluster, or in the same socket, is somewhere in between, as
>> they share L2 or L3 cache. We could avoid a separate tunable by setting it
>> to sysctl_sched_migration_cost / 10.
>
> Shrug. It's bogus no matter what we do. Once Upon A Time, a cost
> number was generated via measurement, but the end result was just as
> bogus as a number pulled out of the ether. How much bandwidth you have
> when blasting data to/from wherever says nothing about misses you avoid
> vs those you generate.

Yes, yes and yes. I cannot make the original tunable less bogus. Using a
smaller cost for closer caches still makes logical sense and is supported
by the data.

- Steve