On 31-May-2019 05:08:16 PM, Julien Desfossez wrote: > > My first reaction is: when shell wakes up from sleep, it will > > fork date. If the script is untagged and those workloads are > > tagged and all available cores are already running workload > > threads, the forked date can lose to the running workload > > threads due to __prio_less() can't properly do vruntime comparison > > for tasks on different CPUs. So those idle siblings can't run > > date and are idled instead. See my previous post on this: > > > > https://lore.kernel.org/lkml/20190429033620.GA128241@aaronlu/ > > (Now that I re-read my post, I see that I didn't make it clear > > that se_bash and se_hog are assigned different tags(e.g. hog is > > tagged and bash is untagged). > > > > Siblings being forced idle is expected due to the nature of core > > scheduling, but when two tasks belonging to two siblings are > > fighting for schedule, we should let the higher priority one win. > > > > It used to work on v2 is probably due to we mistakenly > > allow different tagged tasks to schedule on the same core at > > the same time, but that is fixed in v3. > > I confirm this is indeed what is happening, we reproduced it with a > simple script that only uses one core (cpu 2 and 38 are sibling on this > machine): > > setup: > cgcreate -g cpu,cpuset:test > cgcreate -g cpu,cpuset:test/set1 > cgcreate -g cpu,cpuset:test/set2 > echo 2,38 > /sys/fs/cgroup/cpuset/test/cpuset.cpus > echo 0 > /sys/fs/cgroup/cpuset/test/cpuset.mems > echo 2,38 > /sys/fs/cgroup/cpuset/test/set1/cpuset.cpus > echo 2,38 > /sys/fs/cgroup/cpuset/test/set2/cpuset.cpus > echo 0 > /sys/fs/cgroup/cpuset/test/set1/cpuset.mems > echo 0 > /sys/fs/cgroup/cpuset/test/set2/cpuset.mems > echo 1 > /sys/fs/cgroup/cpu,cpuacct/test/set1/cpu.tag > > In one terminal: > sudo cgexec -g cpu,cpuset:test/set1 sysbench --threads=1 --time=30 > --test=cpu run > > In another one: > sudo cgexec -g cpu,cpuset:test/set2 date > > It's very clear that 'date' hangs until sysbench is done. > > We started experimenting with marking a task on the forced idle sibling > if normalized vruntimes are equal. That way, at the next compare, if the > normalized vruntimes are still equal, it prefers the task on the forced > idle sibling. It still needs more work, but in our early tests it helps.
As mentioned above, we have come up with a fix for the long starvation of untagged interactive threads competing for the same core with tagged threads at the same priority. The idea is to detect the stall and boost the stalling threads priority so that it gets a chance next time. Boosting is done by a new counter(min_vruntime_boost) for every task which we subtract from vruntime before comparison. The new logic looks like this: If we see that normalized runtimes are equal, we check the min_vruntimes of their runqueues and give a chance for the task in the runqueue with less min_vruntime. That will help it to progress its vruntime. While doing this, we boost the priority of the task in the sibling so that, we don’t starve the task in the sibling until the min_vruntime of this runqueue catches up. If min_vruntimes are also equal, we do as before and consider the task ‘a’ of higher priority. Here we boost the task ‘b’ so that it gets to run next time. The min_vruntime_boost is reset to zero once the task in on cpu. So only waiting tasks will have a non-zero value if it is starved while matching a task on the other sibling. The attached patch has a sched_feature to enable the above feature so that you can compare the results with and without this feature. What we observe with this patch is that it helps for untagged interactive tasks and fairness in general, but this increases the overhead of core scheduling when there is contention for the CPU with tasks of varying cpu usage. The general trend we see is that if there is a cpu intensive thread and multiple relatively idle threads in different tags, the cpu intensive tasks continuously yields to be fair to the relatively idle threads when it becomes runnable. And if the relatively idle threads make up for most of the tasks in a system and are tagged, the cpu intensive tasks sees a considerable drop in performance. If you have any feedback or creative ideas to help improve, let us know ! Thanks diff --git a/include/linux/sched.h b/include/linux/sched.h index 1a309e8..56cad0e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -642,6 +642,7 @@ struct task_struct { struct rb_node core_node; unsigned long core_cookie; unsigned int core_occupation; + unsigned int core_vruntime_boost; #endif #ifdef CONFIG_CGROUP_SCHED diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 73329da..c302853 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -92,6 +92,10 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b) int pa = __task_prio(a), pb = __task_prio(b); + trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n", + a->comm, a->pid, pa, a->se.vruntime, a->dl.deadline, + b->comm, b->pid, pb, b->se.vruntime, b->dl.deadline); + if (-pa < -pb) return true; @@ -102,21 +106,36 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b) return !dl_time_before(a->dl.deadline, b->dl.deadline); if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */ - u64 vruntime = b->se.vruntime; - - trace_printk("(%s/%d;%d,%Lu,%Lu) ?< (%s/%d;%d,%Lu,%Lu)\n", - a->comm, a->pid, pa, a->se.vruntime, task_cfs_rq(a)->min_vruntime, - b->comm, b->pid, pb, b->se.vruntime, task_cfs_rq(b)->min_vruntime); + u64 a_vruntime = a->se.vruntime - a->core_vruntime_boost; + u64 b_vruntime = b->se.vruntime - b->core_vruntime_boost; /* * Normalize the vruntime if tasks are in different cpus. */ if (task_cpu(a) != task_cpu(b)) { - vruntime -= task_cfs_rq(b)->min_vruntime; - vruntime += task_cfs_rq(a)->min_vruntime; + s64 min_vruntime_diff = task_cfs_rq(a)->min_vruntime - + task_cfs_rq(b)->min_vruntime; + b_vruntime += min_vruntime_diff; + + trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n", + a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime, + b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime); + + if (sched_feat(CORESCHED_STALL_FIX) && + a_vruntime == b_vruntime) { + bool less_prio = min_vruntime_diff > 0; + + if (less_prio) + a->core_vruntime_boost++; + else + b->core_vruntime_boost++; + + return less_prio; + + } } - return !((s64)(a->se.vruntime - vruntime) <= 0); + return !((s64)(a_vruntime - b_vruntime) <= 0); } return false; @@ -2456,6 +2475,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) #ifdef CONFIG_COMPACTION p->capture_control = NULL; #endif +#ifdef CONFIG_SCHED_CORE + p->core_vruntime_boost = 0UL; +#endif init_numa_balancing(clone_flags, p); } @@ -3723,6 +3745,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) next->comm, next->pid, next->core_cookie); + next->core_vruntime_boost = 0UL; return next; } @@ -3835,6 +3858,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie); if (old_max) { + if (old_max->core_vruntime_boost) + old_max->core_vruntime_boost--; + for_each_cpu(j, smt_mask) { if (j == i) continue; @@ -3905,6 +3931,7 @@ next_class:; done: set_next_task(rq, next); + next->core_vruntime_boost = 0UL; return next; } diff --git a/kernel/sched/features.h b/kernel/sched/features.h index 858589b..332a092 100644 --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -90,3 +90,9 @@ SCHED_FEAT(WA_BIAS, true) * UtilEstimation. Use estimated CPU utilization. */ SCHED_FEAT(UTIL_EST, true) + +/* + * Prevent task stall due to vruntime comparison limitation across + * cpus. + */ +SCHED_FEAT(CORESCHED_STALL_FIX, false)