On Tue, Jul 18, 2017 at 11:14:57AM +0800, Li, Aubrey wrote:
> On 2017/7/18 3:23, Peter Zijlstra wrote:
> > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
> >>> And as said; Daniel has been working on a better predictor -- now he's
> >>> probably not used it on the network workload you're looking at, so that
> >>> might be something to consider.
> >>
> >> Deriving a better idle predictor is a bit orthogonal to fast idle.
> > 
> > No. If you want a different C state selected we need to fix the current
> > C state selector. We're not going to tinker.
> > 
> > And the predictor is probably the most fundamental part of the whole C
> > state selection logic.
> > 
> > Now I think the problem is that the current predictor goes for an
> > average idle duration. This means that we, on average, get it wrong 50%
> > of the time. For performance that's bad.
> > 
> > If you want to improve the worst case, we need to consider a cumulative
> > distribution function, and staying with the Gaussian assumption already
> > present, that would mean using:
> > 
> >   CDF(x) = 1/2 * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ]
> > 
> > Where, per the normal convention mu is the average and sigma^2 the
> > variance. See also:
> > 
> >   https://en.wikipedia.org/wiki/Normal_distribution
> > 
> > We then solve CDF(x) = n% to find the x for which we get it wrong n% of
> > the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so).
> > 
> > This conceptually gets us better exit latency for the cases where we got
> > it wrong before, and practically pushes down the estimate which gets us
> > C1 longer.
> > 
> > Of course, this all assumes a Gaussian distribution to begin with, if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
> > 
> Maybe you are talking about applying some machine learning algorithm online
> to fit a multivariate normal distribution, :)

Nah, nothing that fancy..

Something that _could_ work and deals with arbitrary distributions is
buckets divided on the actual C-state selection boundaries and a
(cyclic) array of the N most recent idle times.

Something like so:

struct est_data {
        u8      array[64];      /* cyclic buffer of the N most recent samples */
        u8      *dist;          /* per C-state histogram of those samples */
        u8      idx;
};

DEFINE_PER_CPU(struct est_data, est_data);

void est_init(void)
{
        int size = drv->state_count;
        struct est_data *ed;
        int cpu;

        for_each_possible_cpu(cpu) {
                ed = &per_cpu(est_data, cpu);
                ed->dist = kzalloc(size, GFP_KERNEL);
                // handle errors

                /* the zeroed @array starts out as 64 samples of state 0 */
                ed->dist[0] = ARRAY_SIZE(ed->array);
        }
}

u8 est_duration_2_state(u64 duration)
{
        int i;

        /* @duration is in ns; /1024 is close enough to usec, like target_residency */
        for (i = 0; i < drv->state_count; i++) {
                if (duration/1024 < drv->states[i].target_residency)
                        return i;
        }

        return i-1;
}

void est_contemplate(u64 duration)
{
        struct est_data *ed = this_cpu_ptr(&est_data);
        int state = est_duration_2_state(duration);
        int idx = ed->idx++ % ARRAY_SIZE(ed->array);

        /* replace the oldest sample in the cyclic buffer and the histogram */
        ed->dist[ed->array[idx]]--;
        ed->array[idx] = state;
        ed->dist[ed->array[idx]]++;
}

int est_state(int pct)
{
        struct est_data *ed = this_cpu_ptr(&est_data);
        int limit = pct * ARRAY_SIZE(ed->array) / 100; /* XXX move div out of line */
        int cnt = 0, last = 0;
        int i;

        /* CDF: deepest state for which at most pct% of recent idles were too short */
        for (i = 0; i < drv->state_count; i++) {
                cnt += ed->dist[i];
                if (cnt > limit)
                        break;
                last = i;
        }

        return last;
}
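
Roughly how that could be wired up; the call sites below are made up for
illustration, only est_contemplate() and est_state() come from the sketch
above:

void example_idle_exit(u64 idle_duration_ns)
{
        /* ns in, so the duration/1024 above lands in the target_residency unit */
        est_contemplate(idle_duration_ns);
}

int example_idle_enter(void)
{
        /* deepest state that was too deep for at most ~5% of recent idles */
        return est_state(5);
}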


> Well, back to the problem: when the scheduler picks the idle thread, it does
> not look at the history, nor make a prediction. So it's possible it has to
> switch back to a task ASAP when it's going into idle (very common under some
> workloads).
> 
> That is, (idle_entry + idle_exit) > idle. If the system has multiple
> hardware idle states, then:
> 
> (idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep
> 
> So we eventually want the idle path to be lighter than what we currently have.
> A complex predictor may have high accuracy, but its cost could be high as well.

I never suggested anything complex. The current menu thing uses an
average; all I said is that if, instead of the average, you use something
lower, say 'avg - 2*stdev' (it already computes the stdev), you get an
estimate that, assuming a Gaussian distribution, is wrong on exit latency
less than ~5% of the time.
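
Concretely, something like the below (a userspace sketch with made-up
numbers, not the actual menu code; build with -lm):

#include <math.h>
#include <stdio.h>

/* the CDF quoted above: 1/2 * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ] */
static double gauss_cdf(double x, double mu, double sigma)
{
        return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))));
}

/* bias the predicted idle duration down by two standard deviations */
static double predict_low(double avg, double stddev)
{
        double est = avg - 2.0 * stddev;

        return est > 0.0 ? est : 0.0;   /* never predict a negative idle time */
}

int main(void)
{
        double avg = 100.0, stddev = 30.0;      /* made-up idle stats, in usec */
        double est = predict_low(avg, stddev);

        /* fraction of idle periods expected to be shorter than the estimate */
        printf("estimate %.0f usec, too short in %.1f%% of cases\n",
               est, 100.0 * gauss_cdf(est, avg, stddev));       /* ~2.3% */

        return 0;
}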

The histogram sketch further up, also a simple thing, uses a generic
distribution function, which works because it buckets on the exact
C-state boundaries we're limited to anyway.
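
For example, with the 64 most recent idles bucketed as { C0: 2, C1: 1,
C2: 10, C3: 51 } (made-up numbers), est_state(5) gets limit = 5*64/100 = 3,
accumulates 2, then 3, then hits 13 > 3 and stops, returning C1: at most
3 of the 64 recent idles (~4.7%) were too short for C1.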

Of course, the above needs to be augmented with IRQ bits etc..
