The current load balancer implementation assumes that all scheduler groups within the same domain host the same number of CPUs. This is reflected in the condition that a scheduler group which is load balancing and classified as having spare capacity should pull work from the busiest group if the local group runs fewer tasks than the busiest one. The condition therefore drives both groups towards running the same number of tasks, which is problematic if the groups are not of the same size.
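For illustration only, the condition can be sketched in user-space C. This is not the kernel code: struct group_stats and both helper functions are hypothetical names that mirror just the sum_nr_running and group_weight fields, and group_weight is assumed to be the number of CPUs in a group. The second helper shows a variant that scales each task count by the other group's weight.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's per-group statistics. */
struct group_stats {
	unsigned int sum_nr_running;	/* tasks currently running in the group */
	unsigned int group_weight;	/* number of CPUs in the group (assumed) */
};

/* Plain check: compares raw task counts and ignores group size. */
static bool should_pull_plain(const struct group_stats *busiest,
			      const struct group_stats *local)
{
	return busiest->sum_nr_running > local->sum_nr_running + 1;
}

/*
 * Weight-scaled check: each task count is multiplied by the other
 * group's weight, so groups of unequal size are compared by relative
 * load rather than by absolute task count.
 */
static bool should_pull_weighted(const struct group_stats *busiest,
				 const struct group_stats *local)
{
	assert(busiest->group_weight > 0 && local->group_weight > 0);
	return busiest->sum_nr_running * local->group_weight >
	       local->sum_nr_running * busiest->group_weight + 1;
}
```

With a 6-CPU group running 4 tasks and a local 2-CPU group running 2 tasks, the plain check asks for a pull (4 > 3), while the weight-scaled check does not (4 * 2 = 8 is not greater than 2 * 6 + 1 = 13). For two groups of 4 CPUs each with the same task counts, both checks agree on pulling.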

The assumption that scheduler groups within the same scheduler domain host the same number of CPUs appears to be true for non-s390 architectures. s390, however, can have scheduler groups of unequal size.

This introduces a performance degradation in the following scenario:

Consider a system with 8 CPUs, where 6 CPUs are located on one CPU socket and the remaining 2 are located on another socket:

 Socket   -----1-----    -2-
 CPU      1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following scenarios:

The first 5 tasks are distributed evenly across the two groups.

 Socket   -----1-----    -2-
 CPU      1 2 3 4 5 6    7 8
          x x x          x x

Adding a 6th task yields the following distribution:

 Socket   -----1-----    -2-
 CPU      1 2 3 4 5 6    7 8
 SMT1     x x x          x x
 SMT2                    x

The task is added to the 2nd scheduler group because the scheduler assumes that scheduler groups are of the same size and should therefore host the same number of tasks. This makes CPU 7 use its second SMT thread, which comes with a performance penalty. As a result, in the window of 6-8 tasks, load balancing is done suboptimally: SMT is used although there is no reason to do so, as fully idle CPUs are still available.

Taking the weight of the scheduler groups into account ensures that a load balancing CPU within a smaller group will not try to pull tasks from a bigger group while the bigger group still has idle CPUs available.

Signed-off-by: Tobias Huschle <husc...@linux.ibm.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    busiest->sum_nr_running * local->group_weight >
+			local->sum_nr_running * busiest->group_weight + 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.34.1