On Mon, 15 May 2023 at 13:46, Tobias Huschle <husc...@linux.ibm.com> wrote:
>
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degradation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> x x x x x
>
> Adding a 6th task yields the following distribution:
>
> Socket -----1----- -2-
> CPU 1 2 3 4 5 6 7 8
> SMT1 x x x x x
> SMT2 x

Your description is a bit confusing to me. What you call a CPU above
should be called a core, shouldn't it? Could you share your scheduler
topology with us?

>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle <husc...@linux.ibm.com>
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>          * group's child domain.
>          */
>         if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -           busiest->sum_nr_running > local->sum_nr_running + 1)
> +           busiest->sum_nr_running * local->group_weight >
> +                       local->sum_nr_running * busiest->group_weight + 1)

This is the prefer_sibling path. Could it be that you should disable
prefer_sibling between your sockets for such a topology? The default
path compares the number of idle CPUs when groups have spare capacity.

>                 goto force_balance;
>
>         if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
>
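
For reference while discussing the hunk above, here is a minimal
userspace sketch of the two comparisons, the original one and the
weighted one from the patch. It is only an illustration: struct
group_stats and both helper names are hypothetical stand-ins (the
kernel keeps these fields in its own per-group statistics), and the
sds.prefer_sibling and group_has_spare parts of the condition are
left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-group load-balance stats. */
struct group_stats {
	unsigned int sum_nr_running;	/* tasks currently running in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* Original check: compares raw task counts, ignoring group size. */
static bool pull_unweighted(const struct group_stats *local,
			    const struct group_stats *busiest)
{
	return busiest->sum_nr_running > local->sum_nr_running + 1;
}

/* Check from the patch: each task count is scaled by the other group's weight. */
static bool pull_weighted(const struct group_stats *local,
			  const struct group_stats *busiest)
{
	return busiest->sum_nr_running * local->group_weight >
	       local->sum_nr_running * busiest->group_weight + 1;
}
```

As an assumed example (the task counts are mine, not taken from the
patch description): with local = { 2 tasks, weight 2 } and
busiest = { 4 tasks, weight 6 }, the original check gives
4 > 2 + 1, so the small group would pull, while the weighted check
gives 4 * 2 = 8 > 2 * 6 + 1 = 13, which is false, so it would not.
For groups of equal weight both checks agree in this case.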