From: Peter Puhov <peter.pu...@linaro.org>

In the slow path, when selecting the idlest group, if both groups have
type group_has_spare, only the idle_cpus count gets compared. As a
result, if multiple tasks are created in a tight loop and go back to
sleep immediately (while waiting for all tasks to be created), they may
all be scheduled on the same core, because the CPU is back to idle by
the time each new fork happens.
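The pattern can be sketched with a minimal pthread program (illustrative
only, not part of the patch): threads are created in a tight loop and
each goes back to sleep as soon as it starts, so the forking CPU is idle
again by the time the next thread is placed. Build with gcc -pthread;
sched_getcpu() reports where each thread landed.

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  static void *worker(void *arg)
  {
          /* Report where this thread was placed. */
          printf("thread %ld on CPU %d\n", (long)arg, sched_getcpu());
          /* Go back to sleep immediately, as if waiting for the
           * remaining threads to be created. */
          sleep(1);
          return NULL;
  }

  int main(void)
  {
          pthread_t t[4];
          long i;

          /* Tight creation loop: without the tie-break added by this
           * patch, all four threads may report the same CPU. */
          for (i = 0; i < 4; i++)
                  pthread_create(&t[i], NULL, worker, (void *)i);
          for (i = 0; i < 4; i++)
                  pthread_join(t[i], NULL);
          return 0;
  }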
For example:

  sudo perf record -e sched:sched_wakeup_new -- \
     sysbench threads --threads=4 run
  ...
  total number of events:              61582
  ...
  sudo perf script
  sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new: sysbench:129380 [120] success=1 CPU:007
  sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new: sysbench:129381 [120] success=1 CPU:007
  sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new: sysbench:129382 [120] success=1 CPU:007
  sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new: sysbench:129383 [120] success=1 CPU:007

This may have a negative impact on performance for workloads with
frequent creation of multiple threads.

In this patch we use group_util to select the idlest group if both
groups have an equal number of idle_cpus. In this case newly created
tasks are better distributed. It would be possible to use nr_running
instead of group_util, but the result is less predictable.

With this patch:

  sudo perf record -e sched:sched_wakeup_new -- \
     sysbench threads --threads=4 run
  ...
  total number of events:              74401
  ...
  sudo perf script
  sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new: sysbench:129457 [120] success=1 CPU:008
  sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new: sysbench:129458 [120] success=1 CPU:009
  sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new: sysbench:129459 [120] success=1 CPU:010
  sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new: sysbench:129460 [120] success=1 CPU:011

We tested this patch with the following benchmarks:

  perf bench -f simple sched pipe -l 4000000
  perf bench -f simple sched messaging -l 30000
  perf bench -f simple mem memset -s 3GB -l 15 -f default
  perf bench -f simple futex wake -s -t 640 -w 1
  sysbench cpu --threads=8 --cpu-max-prime=10000 run
  sysbench memory --memory-access-mode=rnd --threads=8 run
  sysbench threads --threads=8 run
  sysbench mutex --mutex-num=1 --threads=8 run
  hackbench --loops 20000
  hackbench --pipe --threads --loops 20000
  hackbench --pipe --threads --loops 20000 --datasize 4096

and found some performance improvements in:

  sysbench threads
  sysbench mutex
  perf bench futex wake

and no regressions in the others.

master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'

  $> sysbench threads --threads=16 run
  sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

  Running the test with following options:
  Number of threads: 16
  Initializing random number generator from current time

  Initializing worker threads...
  Threads started!

  General statistics:
      total time:                          10.0079s
      total number of events:              45526    << higher is better

  Latency (ms):
           min:                                  0.36
           avg:                                  3.52
           max:                                 54.22
           95th percentile:                     23.10
           sum:                             160044.33

  Threads fairness:
      events (avg/stddev):           2845.3750/94.18
      execution time (avg/stddev):   10.0028/0.00

With patch:

  $> sysbench threads --threads=16 run
  sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

  Running the test with following options:
  Number of threads: 16
  Initializing random number generator from current time

  Initializing worker threads...
  Threads started!

  General statistics:
      total time:                          10.0053s
      total number of events:              56567    << higher is better

  Latency (ms):
           min:                                  0.36
           avg:                                  2.83
           max:                                 27.65
           95th percentile:                     18.95
           sum:                             160003.83

  Threads fairness:
      events (avg/stddev):           3535.4375/147.38
      execution time (avg/stddev):   10.0002/0.00

master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'

  $> sysbench mutex --mutex-num=1 --threads=32 run
  sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

  Running the test with following options:
  Number of threads: 32
  Initializing random number generator from current time

  Initializing worker threads...
  Threads started!

  General statistics:
      total time:                          1.0415s  << lower is better
      total number of events:              32

  Latency (ms):
           min:                                940.57
           avg:                                959.24
           max:                               1041.05
           95th percentile:                    960.30
           sum:                              30695.84

  Threads fairness:
      events (avg/stddev):           1.0000/0.00
      execution time (avg/stddev):   0.9592/0.02

With patch:

  $> sysbench mutex --mutex-num=1 --threads=32 run
  sysbench 1.0.11 (using system LuaJIT 2.1.0-beta3)

  Running the test with following options:
  Number of threads: 32
  Initializing random number generator from current time

  Initializing worker threads...
  Threads started!

  General statistics:
      total time:                          0.9209s  << lower is better
      total number of events:              32

  Latency (ms):
           min:                                867.37
           avg:                                892.09
           max:                                920.70
           95th percentile:                    909.80
           sum:                              28546.84

  Threads fairness:
      events (avg/stddev):           1.0000/0.00
      execution time (avg/stddev):   0.8921/0.01

master: 'commit b3a9e3b9622a ("Linux 5.8-rc1")'

  $> perf bench futex wake -s -t 128 -w 1
  # Running 'futex/wake' benchmark:
  Run summary [PID 2414]: blocking on 128 threads (at [private] futex 0xaaaab663a154), waking up 1 at a time.
  Wokeup 128 of 128 threads in 0.2852 ms (+-1.86%)  << lower is better

With patch:

  $> perf bench futex wake -s -t 128 -w 1
  # Running 'futex/wake' benchmark:
  Run summary [PID 5057]: blocking on 128 threads (at [private] futex 0xaaaace461154), waking up 1 at a time.
  Wokeup 128 of 128 threads in 0.2705 ms (+-1.84%)  << lower is better

Signed-off-by: Peter Puhov <peter.pu...@linaro.org>
---
 kernel/sched/fair.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..abcbdf80ee75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
 
 	case group_has_spare:
 		/* Select group with most idle CPUs */
-		if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
+		if (idlest_sgs->idle_cpus > sgs->idle_cpus)
 			return false;
+
+		/* Select group with lowest group_util */
+		if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
+		    idlest_sgs->group_util <= sgs->group_util)
+			return false;
+
 		break;
 	}
 
--
2.20.1