On Sun, Mar 21, 2021 at 03:03:58PM -0400, Rik van Riel wrote:
> Mel Gorman did some nice work in 9fe1f127b913
> ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel
> being more efficient at finding an idle CPU, and in tasks spending less
> time waiting to be run, both according to the schedstats run_delay
> numbers, and according to measured application latencies. Yay.
>

Other than the unusual indentation of the changelog, yay for part 1.
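As an aside, for anyone wanting to reproduce the run_delay comparison:
the schedstats figure is the second field of /proc/<pid>/schedstat,
the time spent waiting on a runqueue in nanoseconds (see
Documentation/scheduler/sched-stats.rst). A minimal reader, as a
sketch only:

	#include <stdio.h>

	/*
	 * Print a task's runqueue wait time. Per sched-stats.rst the
	 * three fields are: time spent on the CPU, time spent waiting
	 * on a runqueue, and the number of timeslices run.
	 */
	int main(int argc, char **argv)
	{
		unsigned long long on_cpu, run_delay, slices;
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/schedstat",
			 argc > 1 ? argv[1] : "self");
		f = fopen(path, "r");
		if (!f || fscanf(f, "%llu %llu %llu",
				 &on_cpu, &run_delay, &slices) != 3)
			return 1;
		printf("run_delay: %llu ns over %llu timeslices\n",
		       run_delay, slices);
		return 0;
	}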
> The flip side of this is that we see more task migrations (about
> 30% more), higher cache misses, higher memory bandwidth utilization,
> and higher CPU use, for the same number of requests/second.
>

I am having difficulty with this part, and with whether this patch
affects task migrations in particular.

> This is most pronounced on a memcache type workload, which saw
> a consistent 1-3% increase in total CPU use on the system, due
> to those increased task migrations leading to higher L2 cache
> miss numbers, and higher memory utilization. The exclusive L3
> cache on Skylake does us no favors there.
>

Out of curiosity, what is the load generator for memcache, or is this
based on analysis of a production workload? I ask because mutilate
(https://github.com/leverich/mutilate) is allegedly a load generator
that can simulate Facebook patterns, but it is old. I would be
interested in hearing whether mutilate is used and, if so, what
parameters the load generator is given.

> On our web serving workload, that effect is usually negligible,
> but occasionally as large as a 9% regression in the number of
> requests served, due to some interaction between scheduler latency
> and the web load balancer.
>
> It appears that the increased number of CPU migrations is generally
> a good thing, since it leads to lower cpu_delay numbers, reflecting
> the fact that tasks get to run faster. However, the reduced locality
> and the corresponding increase in L2 cache misses hurts a little.
>

This is understandable, because finding an idle CPU and incurring the
migration cost can be better than stacking a task on a busy CPU for
workloads that are sensitive to wakeup latency.

> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt
> with a twist: when a socket has no idle cores, check to see if the
> sibling of "prev" is idle, before searching all the other CPUs.
>

But this is the part I'm having a problem with. If the sibling of prev
is idle, then a task migration cost is still incurred. The main
difference is that the new CPU is more likely to share an L1 or L2
cache and, with no idle cores, some sharing of resources between HT
siblings is inevitable. Can you confirm that task migrations are still
higher with this patch?

> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
>
> With Mel's patches and this patch together, the p95 and p99 response
> times for the memcache type application improve by about 20% over what
> they were before Mel's patches got merged.
>

Again, I would be very interested in hearing how this conclusion was
reached, because I can update mmtests accordingly and wire it into the
"standard battery of scheduler workloads".

> Signed-off-by: Rik van Riel <r...@surriel.com>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..0c986972f4cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6098,6 +6098,28 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> 	return -1;
> }
>
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_smt_present))
> +		return -1;
> +
> +	for_each_cpu(cpu, cpu_smt_mask(target)) {
> +		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> +		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> +			continue;
> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> +			return cpu;
> +	}
> +
> +	return -1;
> +}
> +
> #else /* CONFIG_SCHED_SMT */
>

Take a look at Barry's patch around
"!static_branch_likely(&sched_smt_present)". If that is applied first,
then you can remove the static_branch_likely check here, as it will be
covered by an earlier test_idle_cores() call. As the ordering is not
known, just note that it is a potential follow-up if Barry's patch is
merged after yours.
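To illustrate, if Barry's change does go in first, the helper would
then reduce to something like this (a sketch only, untested):

	static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
	{
		int cpu;

		/*
		 * No sched_smt_present check needed here; per the note
		 * above, the earlier test_idle_cores() call would
		 * already have covered the !SMT case.
		 */
		for_each_cpu(cpu, cpu_smt_mask(target)) {
			if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
			    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
				continue;
			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}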
> static inline void set_idle_cores(int cpu, int val)
> @@ -6114,6 +6136,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
> 	return __select_idle_cpu(core);
> }
>
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	return -1;
> +}
> +
> #endif /* CONFIG_SCHED_SMT */
>
> /*
> @@ -6121,7 +6148,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
>  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>  * average idle time for this rq (as found in rq->avg_idle).
>  */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
> {
> 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6155,6 +6182,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> 		time = cpu_clock(this);
> 	}
>
> +	if (!smt && cpus_share_cache(prev, target)) {
> +		/* No idle core. Check if prev has an idle sibling. */
> +		i = select_idle_smt(p, sd, prev);
> +		if ((unsigned int)i < nr_cpumask_bits)
> +			return i;
> +	}
> +
> 	for_each_cpu_wrap(cpu, cpus, target) {
> 		if (smt) {
> 			i = select_idle_core(p, cpu, cpus, &idle_cpu);

Please consider moving this block inside the "sched_feat(SIS_PROP) &&
!smt" check above. It could become something like

	if (!smt) {
		/* No idle core. Check if prev has an idle sibling. */
		i = select_idle_smt(p, sd, prev);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;

		if (sched_feat(SIS_PROP)) {
			SIS_PROP STUFF
		}
	}

That way you avoid some calculations and a clock read if the HT
sibling is idle.

Second, select_idle_smt() does not use the cpus mask, so consider
moving the cpus initialisation to after select_idle_smt() has been
called. Specifically, this initialisation:

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

Alternatively, clear the bits in the SMT sibling scan to avoid
checking the siblings twice. It's a tradeoff, because initialising
and clearing bits is not free, and the cost is wasted if a sibling
is free.
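For that alternative, a rough sketch (untested, with a hypothetical
signature that takes the cpus mask instead of the sched_domain) might
look like:

	static int select_idle_smt(struct task_struct *p, struct cpumask *cpus, int target)
	{
		int cpu;

		for_each_cpu(cpu, cpu_smt_mask(target)) {
			/*
			 * cpus is already sched_domain_span(sd) masked
			 * with p->cpus_ptr, so one test replaces the
			 * two cpumask_test_cpu() checks.
			 */
			if (!cpumask_test_cpu(cpu, cpus))
				continue;
			/* Strike the sibling so the wider scan skips it. */
			cpumask_clear_cpu(cpu, cpus);
			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}

With that shape, the cpumask_and() initialisation has to stay before
the call, which is the tradeoff mentioned above.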
A third concern, although it is mild, is that the SMT scan ignores the
SIS_PROP limits on the depth of the search, so this patch may increase
the scan depth as a result. It's only a mild concern, as limiting the
depth of a search is a magic number anyway.

Otherwise, the biggest difference from historical behaviour is that we
are explicitly favouring SMT siblings when !idle_cores, and the
!idle_cores state can be inaccurate. That is going to be a "win some,
lose some" scenario, and it might show up in overloaded workloads that
rapidly idle (e.g. hackbench). I don't have a strong opinion on this
one, because *some* HT sibling is going to be used, and the impact is
workload- and hardware-dependent.

I think I'll run this through a short set of scheduler loads, but I'll
wait to see whether you decide to create another version based on this
review.

-- 
Mel Gorman
SUSE Labs