On Tue, 1 Oct 2019 at 18:53, Dietmar Eggemann <dietmar.eggem...@arm.com> wrote:
>
> On 01/10/2019 10:14, Vincent Guittot wrote:
> > On Mon, 30 Sep 2019 at 18:24, Dietmar Eggemann <dietmar.eggem...@arm.com> wrote:
> >>
> >> Hi Vincent,
> >>
> >> On 19/09/2019 09:33, Vincent Guittot wrote:
> [...]
>
> >>> +             if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >>> +                     /*
> >>> +                      * When prefer sibling, evenly spread running tasks on
> >>> +                      * groups.
> >>> +                      */
> >>> +                     env->balance_type = migrate_task;
> >>> +                     env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >>> +                     return;
> >>> +             }
> >>> +
> >>> +             /*
> >>> +              * If there is no overload, we just want to even the number of
> >>> +              * idle cpus.
> >>> +              */
> >>> +             env->balance_type = migrate_task;
> >>> +             env->imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1);
> >>
> >> Why do we need a max_t(long, 0, ...) here and not for the 'if
> >> (busiest->group_weight == 1 || sds->prefer_sibling)' case?
> >
> > For env->imbalance = (busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;
> >
> > either we have sds->prefer_sibling && busiest->sum_nr_running >
> > local->sum_nr_running + 1
>
> I see, this corresponds to
>
> /* Try to move all excess tasks to child's sibling domain */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
>     busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
>         goto force_balance;
>
> in find_busiest_group, I assume.

yes.
But it seems that I missed a case:

prefer_sibling is set
busiest->sum_h_nr_running <= local->sum_h_nr_running + 1 so we skip
goto force_balance above
But env->idle != CPU_NOT_IDLE and local->idle_cpus >
(busiest->idle_cpus + 1) so we also skip goto out_balance and finally
call calculate_imbalance()

in calculate_imbalance with prefer_sibling set, imbalance =
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1;

so we probably want something similar to max_t(long, 0,
(busiest->sum_h_nr_running - local->sum_h_nr_running) >> 1)

>
> Haven't been able to recreate this yet on my arm64 platform since there
> is no prefer_sibling and in case local and busiest have
> group_type=group_has_spare they bailout in
>
> if (busiest->group_type != group_overloaded &&
>     (env->idle == CPU_NOT_IDLE ||
>      local->idle_cpus <= (busiest->idle_cpus + 1)))
>         goto out_balanced;
>
>
> [...]
>
> >>> -     if (busiest->group_type == group_overloaded &&
> >>> -         local->group_type == group_overloaded) {
> >>> -             load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE;
> >>> -             if (load_above_capacity > busiest->group_capacity) {
> >>> -                     load_above_capacity -= busiest->group_capacity;
> >>> -                     load_above_capacity *= scale_load_down(NICE_0_LOAD);
> >>> -                     load_above_capacity /= busiest->group_capacity;
> >>> -             } else
> >>> -                     load_above_capacity = ~0UL;
> >>> +     if (local->group_type < group_overloaded) {
> >>> +             /*
> >>> +              * Local will become overloaded so the avg_load metrics are
> >>> +              * finally needed.
> >>> +              */
> >>
> >> How does this relate to the decision_matrix[local, busiest] (dm[])? E.g.
> >> dm[overload, overload] == avg_load or dm[fully_busy, overload] == force.
> >> It would be nice to be able to match all allowed fields of dm to code
> >> sections.
> >
> > decision_matrix describes how it decides between balanced or unbalanced.
> > In case of dm[overload, overload], we use the avg_load to decide if it
> > is balanced or not
>
> OK, that's why you calculate sgs->avg_load in update_sg_lb_stats() only
> for 'sgs->group_type == group_overloaded'.
>
> > In case of dm[fully_busy, overload], the groups are unbalanced because
> > fully_busy < overload and we force the balance. Then
> > calculate_imbalance() uses the avg_load to decide how much will be
> > moved
>
> And in this case 'local->group_type < group_overloaded' in
> calculate_imbalance(), 'local->avg_load' and 'sds->avg_load' have to be
> calculated before using them in env->imbalance = min(...).
>
> OK, got it now.
>
> > dm[overload, overload]=force means that we force the balance and we
> > will compute later the imbalance. avg_load may be used to calculate
> > the imbalance
> > dm[overload, overload]=avg_load means that we compare the avg_load to
> > decide whether we need to balance load between groups
> > dm[overload, overload]=nr_idle means that we compare the number of
> > idle cpus to decide whether we need to balance. In fact this is no
> > more true with patch 7 because we also take into account the number of
> > nr_h_running when weight =1
>
> This becomes clearer now ... slowly.
>
> [...]
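
For illustration only, the missed prefer_sibling case discussed above can be
sketched in userspace C. This is a rough model, not the kernel code: it
assumes plain signed long counters (the real field types are not shown in
this thread), max_long() is a hypothetical stand-in for max_t(long, ...), and
the function names are made up for the example. It shows the half-difference
going below zero when busiest has fewer running tasks than local, and the
clamp bringing it back to 0.

```c
/* Hypothetical userspace stand-in for the kernel's max_t(long, a, b). */
static inline long max_long(long a, long b)
{
	return a > b ? a : b;
}

/*
 * Unclamped form quoted from the patch: half the difference in running
 * tasks. If busiest_nr < local_nr the difference is negative; with the
 * arithmetic right shift that common compilers use for signed values
 * (implementation-defined in ISO C) the result stays negative.
 */
long imbalance_unclamped(long busiest_nr, long local_nr)
{
	return (busiest_nr - local_nr) >> 1;
}

/* Clamped form along the lines of the max_t(long, 0, ...) suggestion. */
long imbalance_clamped(long busiest_nr, long local_nr)
{
	return max_long(0, (busiest_nr - local_nr) >> 1);
}
```

With busiest_nr = 1 and local_nr = 3, which satisfies
busiest <= local + 1 and so skips force_balance, the unclamped helper
returns a negative value while the clamped one returns 0. When busiest is
clearly ahead (say 5 versus 2), both return the same value.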