On Tue, May 03, 2016 at 04:32:25PM +0200, Peter Zijlstra wrote:
> On Mon, May 02, 2016 at 11:47:25AM -0400, Chris Mason wrote:
> > On Mon, May 02, 2016 at 04:58:17PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 02, 2016 at 04:50:04PM +0200, Mike Galbraith wrote:
> > > > Oh btw, did you know single socket boxen have no sd_busy? That
> > > > doesn't look right.
> > >
> > > I suspected; didn't bother looking at it yet. The 'problem' is that
> > > the LLC domain is the top-most, so it doesn't have a parent domain.
> > > I'm sure we can come up with something if we can get this all
> > > working right.
> > >
> > > And yes, I can get gains on various workloads with various options,
> > > I can even break all workloads, but I've so far completely failed
> > > on getting a win for everyone :/
> >
> > Adding in the task_hot() check to decide if scanning idle was a good
> > idea ended up being really important
>
> So I'm conflicted on this patch:
>
> +static int bounce_to_target(struct task_struct *p, int cpu)
> +{
> +        s64 delta;
> +
> +        /*
> +         * as the run queue gets bigger, it's more and more likely that
> +         * balance will have distributed things for us, and less likely
> +         * that scanning all our CPUs for an idle one will find one.
> +         * So, if nr_running > 1, just call this CPU good enough
> +         */
> +        if (cpu_rq(cpu)->cfs.nr_running > 1)
> +                return 1;
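To make it easier to play with, here's the whole helper (including the
task_hot()-style delta check quoted just below) as a stand-alone
userspace model. The structs, the clock, and the hard-coded 500us
default are stand-ins for the kernel bits, obviously not the real
implementation:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* stand-ins for the kernel structures the patch touches */
struct fake_rq { unsigned int nr_running; };
struct fake_task { int64_t exec_start_ns; };

/* stand-in for sysctl_sched_migration_cost (default 500us, in ns) */
static const int64_t migration_cost_ns = 500000;

static int64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/*
 * Model of bounce_to_target(): return 1 ("stay put") when either the
 * target rq is already busy, or the task ran recently enough that
 * task_hot() would call it cache hot.
 */
static int bounce_to_target(struct fake_task *p, struct fake_rq *rq)
{
        int64_t delta;

        if (rq->nr_running > 1)
                return 1;

        delta = now_ns() - p->exec_start_ns;
        return delta < migration_cost_ns;
}

int main(void)
{
        struct fake_rq rq = { .nr_running = 1 };
        struct fake_task p = { .exec_start_ns = now_ns() };

        /* just ran -> cache hot -> skip the idle scan */
        printf("hot task, quiet rq: %d\n", bounce_to_target(&p, &rq));

        /* a busy rq short-circuits the scan regardless of hotness */
        rq.nr_running = 3;
        p.exec_start_ns -= 2 * migration_cost_ns;
        printf("cold task, busy rq: %d\n", bounce_to_target(&p, &rq));
        return 0;
}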
The nr_running check is interesting. It is supposed to give the same
benefit as your "do we have anything idle?" variable, but without
having to constantly update a variable somewhere. I'll have to do a
few runs to verify (maybe an idle_scan_failed counter).

> +
> +        /* taken from task_hot() */
> +        delta = rq_clock_task(task_rq(p)) - p->se.exec_start;
> +        return delta < (s64)sysctl_sched_migration_cost;
> +}
>
> This will work for your schbench workload because it sleeps for 30ms
> while the migration_cost thingy is 500us, therefore you'll trigger
> the full LLC scan.

The task_hot checks don't do much for the sleeping schbench runs, but
they help a lot for this:

# pick a single core, in my case cpus 0,20 are the same core
# cpu_hog is any program that spins
#
# taskset -c 20 cpu_hog &
#
# schbench -p 4 means message passing mode with 4 byte messages (like
# pipe test), no sleeps, just bouncing as fast as it can.
#
# make the scheduler choose between the sibling of the hog and cpu 1
#
# taskset -c 0,1 schbench -p 4 -m 1 -t 1

Current mainline will stuff both schbench threads onto CPU 1, leaving
CPU 0 100% idle. My first patch with the minimal task_hot() checks
would sometimes pick CPU 0. My second patch, which just calls
task_hot() directly, sticks to CPU 1, which is ~3x faster than
spreading the pair across both CPUs.

The full task_hot() checks also really help tbench.

> _However_, the migration_cost is supposed to model the cost of
> leaving the LLC, so testing against that here seems wrong.
>
> Let me go play with something that measures the cost of doing that
> LLC scan and compares that against the sleepy time -- of course, now
> I need to go figure out how to do this clock thing without rq-lock
> pain.
>
> > +        if (package_sd && !bounce_to_target(p, target)) {
> > +                for_each_cpu_and(i, sched_domain_span(package_sd),
> > +                                 tsk_cpus_allowed(p)) {
> > +                        if (idle_cpu(i)) {
> > +                                target = i;
> > +                                break;
> > +                        }
> > +                }
> > +        }
>
> Also note your s/sd/package_sd/ rename is, strictly speaking, wrong.
> Sure, on your current Intel system the LLC is the entire package, but
> this is not true in general.
>
> Take for instance the Intel Core2Quad and AMD Bulldozer thingies,
> they had two dies in one package, and correspondingly two LLC domains
> in one package.
>
> (also, the Intel cluster-on-die thing can split the thing in two)
>
> There were also the old P6 era SMP boards which had external LLC,
> where you could have an LLC shared across multiple packages --
> although I'm thinking we'll never see that again, due to off package
> being far toooooo slooooooow these days.

Gotcha, makes sense. I'll switch to llc_sd ;)

-chris
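PS: for the rename, I'm thinking of something along these lines -- an
untested sketch, using the per-cpu sd_llc pointer the scheduler
already maintains for exactly this "top-most cache domain" lookup:

        struct sched_domain *llc_sd;
        int i;

        /* sd_llc is kept up to date by update_top_cache_domain() */
        llc_sd = rcu_dereference(per_cpu(sd_llc, target));
        if (llc_sd && !bounce_to_target(p, target)) {
                for_each_cpu_and(i, sched_domain_span(llc_sd),
                                 tsk_cpus_allowed(p)) {
                        if (idle_cpu(i)) {
                                target = i;
                                break;
                        }
                }
        }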