Hello, Vincent.

On Thu, Apr 27, 2017 at 10:28:01AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 02:30, Tejun Heo <t...@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> >> > This is from the follow-up patch.  I was confused.  Because we don't
> >> > propagate decays, we still should decay the runnable_load_avg;
> >> > otherwise, we end up accumulating errors in the counter.  I'll drop
> >> > the last patch.
> >>
> >> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But I
> >> see runnable_load_avg sometimes significantly higher than load_avg,
> >> which is normally not possible as load_avg = runnable_load_avg +
> >> sleeping tasks' load_avg
> >
> > So, while load_avg would eventually converge on runnable_load_avg +
> > blocked load_avg given a stable enough workload for long enough,
> > runnable_load_avg jumping above load_avg temporarily is expected,
> 
> No, it's not. Look at load_avg/runnable_load_avg at the root domain
> when only tasks are involved: runnable_load_avg will never be higher
> than load_avg because
>     load_avg = sum of load_avg of tasks attached to the cfs_rq
>     runnable_load_avg = sum of load_avg of tasks attached and enqueued
> on the cfs_rq
> so load_avg = runnable_load_avg + blocked tasks' load_avg, and as a
> result runnable_load_avg is always lower than (or equal to) load_avg.

We don't actually calculate it that way, tho.  But yeah, given the
calculations we do, runnable_load_avg shouldn't be significantly higher
than load_avg.  I often see runnable_load_avg going above load_avg by a
bit, but the margins are small.  I wonder whether you're seeing errors
accumulating from the incorrect min() capping.
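
Just to make sure we're talking about the same invariant, here's a toy
model of that accounting with made-up numbers.  Purely illustrative,
not the actual cfs_rq code:

/* Toy model: one cfs_rq's load summed two ways; numbers are made up. */
#include <stdio.h>

struct toy_task {
	long load_avg;		/* per-task load average */
	int queued;		/* 1 if currently enqueued, 0 if sleeping */
};

int main(void)
{
	struct toy_task tasks[] = {
		{ .load_avg = 400, .queued = 1 },
		{ .load_avg = 300, .queued = 0 },	/* blocked */
		{ .load_avg = 200, .queued = 1 },
	};
	long load_avg = 0, runnable_load_avg = 0;
	int i;

	for (i = 0; i < 3; i++) {
		load_avg += tasks[i].load_avg;		/* all attached tasks */
		if (tasks[i].queued)
			runnable_load_avg += tasks[i].load_avg;	/* enqueued only */
	}

	/* Prints 900 and 600: as long as both are pure sums over the same
	 * tasks, runnable_load_avg <= load_avg. */
	printf("load_avg=%ld runnable_load_avg=%ld\n",
	       load_avg, runnable_load_avg);
	return 0;
}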

> And with the propagate load/util_avg patchset, we can even reflect
> task migration directly at the root domain, whereas before we had to
> wait for util_avg/load_avg and runnable_load_avg to converge to the
> new value.
> 
> Just to confirm one of my assumptions: the latency regression was
> already there in previous kernel versions and is not a result of the
> propagate load/util_avg patchset, is it?

Yeah, the latency problem is independent of the propagation logic.
It's really that the meaning of the top-level runnable_load_avg is
different w/ and w/o cgroups.

> > AFAICS.  That's the whole point of it, a sum closely tracking what's
> > currently on the cpu so that we can pick the cpu which has the most on
> > it now.  It doesn't make sense to try to pick threads off of a cpu
> > which is generally loaded but doesn't have much going on right now,
> > after all.
> 
> The point of runnable_load_avg is to be null when a cfs_rq is idle
> whereas load_avg is not, not to be higher than load_avg. The root
> cause is that load_balance only looks at "load" but not at the number
> of tasks currently running, and that's probably the main problem:
> runnable_load_avg was added because load_balance fails to filter out
> idle groups and idle rqs. We would be better off adding a new type in
> group_classify to tag groups that are idle, and doing the same in
> find_busiest_queue.

I'll follow up in the other subthread, but there really is a
fundamental difference in how we calculate runnable_load_avg w/ and
w/o cgroups.  Independent of whether we can improve the load balancer
further, it is an existing bug.

> > * Can you please double check that the higher latencies w/ the patch
> >   are reliably reproducible?  The test machines that I use have
> >   variable management loads.  They never dominate the machines but
> >   are enough to disturb the results, so drawing out a reliable
> >   pattern takes a lot of repeated runs.  I'd really appreciate it if
> >   you could double check that the pattern is reliable with different
> >   run patterns (ie. interleaved instead of 10 consecutive runs one
> >   after another).
> 
> I always leave some time between 2 consecutive runs, and the 10
> consecutive runs take around 2min to execute.
> 
> Then I have run these 10 consecutive tests several times and the
> results stay the same.

Can you please try the patch at the end of this message?  I'm
wondering whether you're seeing the errors accumulating from the wrong
min().

> >   ones.  If there's something I can easily buy, please point me to it.
> >   If there's something I can loan, that'd be great too.
> 
> It looks like most web sites are currently out of stock

:(

> > * If not, I'll try to clean up the debug patches I have and send them
> >   your way to get more visibility, but given that these things tend
> >   to be very iterative, it might take quite a few rounds of back and
> >   forth.
> 
> Yes, that could be useful. Even a trace of the regression would be useful.

Yeah, will clean it up a bit and post.  Will also post some traces.

And here's the updated patch.  I didn't add the suggested
scale_down()'s.  I'll follow up on that in the other subthread.
Thanks.

 kernel/sched/fair.c |   47 ++++++++++++++++++++---------------------------
 1 file changed, 20 insertions(+), 27 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
 update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
        struct cfs_rq *gcfs_rq = group_cfs_rq(se);
-       long delta, load = gcfs_rq->avg.load_avg;
+       long load = 0, delta;
 
        /*
-        * If the load of group cfs_rq is null, the load of the
-        * sched_entity will also be null so we can skip the formula
+        * A cfs_rq's load avg contribution to the parent should be scaled
+        * to the sched_entity's weight.  Use freshly calculated shares
+        * instead of @se->load.weight as the latter may not reflect
+        * changes from the current scheduling operation.
+        *
+        * Note that the propagation source is runnable_load_avg instead of
+        * load_avg.  This keeps every cfs_rq's runnable_load_avg true to
+        * the sum of the scaled loads of all tasks queued under it, which
+        * is important for the correct operation of the load balancer.
+        *
+        * This can make the sched_entity's load_avg jumpier but that
+        * correctly reflects what would happen without cgroups if each
+        * task's load is scaled across nesting - the load is being
+        * averaged at the task and each cfs_rq.
         */
-       if (load) {
-               long tg_load;
+       if (gcfs_rq->load.weight) {
+               long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
 
-               /* Get tg's load and ensure tg_load > 0 */
-               tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
-               /* Ensure tg_load >= load and updated with current load*/
-               tg_load -= gcfs_rq->tg_load_avg_contrib;
-               tg_load += load;
-
-               /*
-                * We need to compute a correction term in the case that the
-                * task group is consuming more CPU than a task of equal
-                * weight. A task with a weight equals to tg->shares will have
-                * a load less or equal to scale_load_down(tg->shares).
-                * Similarly, the sched_entities that represent the task group
-                * at parent level, can't have a load higher than
-                * scale_load_down(tg->shares). And the Sum of sched_entities'
-                * load must be <= scale_load_down(tg->shares).
-                */
-               if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
-                       /* scale gcfs_rq's load into tg's shares*/
-                       load *= scale_load_down(gcfs_rq->tg->shares);
-                       load /= tg_load;
-               }
+               load = min_t(long, scale_load_down(shares),
+                            gcfs_rq->runnable_load_avg *
+                            shares / gcfs_rq->load.weight);
        }
 
        delta = load - se->avg.load_avg;
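
To sanity check the new scaling with made-up numbers (not taken from a
trace): with shares = 1024, gcfs_rq->runnable_load_avg = 300 and
gcfs_rq->load.weight = 2048, the propagated load becomes
min(1024, 300 * 1024 / 2048) = 150; the min() only kicks in once the
scaled sum would exceed the group's shares.  A stand-alone arithmetic
sketch of just that calculation:

/* Stand-alone check of the min-capped scaling above; numbers are made
 * up and "shares" is assumed to already be scale_load_down()'d. */
#include <stdio.h>

static long min_l(long a, long b)
{
	return a < b ? a : b;
}

int main(void)
{
	long shares = 1024;		/* assumed calc_cfs_shares() result */
	long runnable_load_avg = 300;	/* gcfs_rq->runnable_load_avg */
	long weight = 2048;		/* gcfs_rq->load.weight */
	long load = min_l(shares, runnable_load_avg * shares / weight);

	printf("propagated load = %ld\n", load);	/* 150: scaled, not capped */
	return 0;
}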
