Re: CFS review

Ingo Molnar Thu, 02 Aug 2007 09:13:53 -0700

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> It would be better, I suspect, to make the scheduler clock totally 
> distinct from the other clock sources (many architectures have per-cpu 
> cycle counters), and *not* try to even necessarily force it to be a 
> "time-based" one.


yeah. Note that i largely detached sched_clock() from the GTOD 
clocksources already in CFS, so part of this is already implemented and 
the intention is clear. For example, when the following happens:

   Marking TSC unstable due to: possible TSC halt in C2.
   Clocksource tsc unstable (delta = -71630388 ns)

sched_clock() does _not_ stop using the TSC. It is very careful with the 
TSC value, it checks against wraps, jumping, etc. (the whole rq_clock() 
wrapper around sched_clock()), but still tries to use the highest 
resolution time source possible, even if that time source is not good 
enough for GTOD's purposes anymore. So the scheduler clock is already 
largely detached from other clocksources in the system.

> It would still be true that something that is purely based on timer 
> ticks will always be liable to have rounding errors that will 
> inevitably mean that you don't get good fairness - tuning threads to 
> run less than a timer tick at a time would effectively "hide" them 
> from the scheduler accounting. However, in the end, I think that's 
> pretty much unavoidable.

Note that there is a relatively easy way of reducing the effects of such 
intentional coupling: turn on CONFIG_HIGH_RES_TIMERS. That decouples the 
scheduler tick from the jiffy tick and works against such 'exploits' - 
_even_ if the scheduler clock is otherwise low resolution. Also enable 
CONFIG_NO_HZ and the whole thing (of when the scheduler tick kicks in) 
becomes very hard to predict.

[ So while in a low-res clock situation scheduling will always be less 
  precise, with hres-timers and dynticks we have a natural 'random 
  sampler' mechanism so that no task can couple to the scheduler tick - 
  accidentally or even intentionally.

  The only 'unavoidable coupling' scenario is when the hardware has only 
  a single, low-resolution time sampling method. (that is pretty rare 
  though, even in the ultra-embedded space. If a box has two independent 
  hw clocks, even if they are low resolution, the timer tick can be
  decoupled from the scheduler tick.) ]

> I have to say, it would be interesting to try to use 32-bit 
> arithmetic.

yeah. We tried to do as much of that as possible, please read on below 
for (many) more details. There's no short summary i'm afraid :-/

Most importantly, CFS _already_ includes a number of measures that act 
against too frequent math. So even though you can see 64-bit math code 
in it, it's only rarely called if your clock has a low resolution - and 
that happens all automatically! (see below the details of this buffered 
delta math)

I have not seen Roman notice and mention any of these important details 
(perhaps because he was concentrating on finding faults in CFS - which a 
reviewer should do), but those measures are still very important for a 
complete, balanced picture, especially if one focuses on overhead on 
small boxes where the clock is low-resolution.

As Peter has said it in his detailed review of Roman's suggested 
algorithm, our main focus is on keeping total complexity down - and we 
are (of course) fundamentally open to changing the math behind CFS, we 
ourselves tweaked it numerous times, it's not cast into stone in any 
way, shape or form.

> I also think it's likely a mistake to do a nanosecond resolution. 
> That's part of what forces us to 64 bits, and it's just not even an 
> *interesting* resolution.

yes, very much not interesting, and we did not pick nanoseconds because 
we find anything "interesting" in that timescale. Firstly, before i go 
into the thinking behind nanoseconds, microseconds indeed have 
advantages too, so the choice was not easy, see the arguments in favor 
of microseconds below at: [*].

There are two fundamental reasons for nanoseconds:

 1) they _automatically_ act as a 'carry-over' for fractional math and
    thus reduce rounding errors. As Roman has noticed we dont care much
    about math rounding errors in the algorithm: _exactly_ because we
    *dont have to* care about rounding errors: we've got extra 10
    "uninteresting" bits in the time variables to offset the effects of 
    rounding errors and to carry over fractionals.

    ( Sidenote: in fact we had some simple extra anti-rounding-error 
      code in CFS and removed it because it made no measurable 
      difference. So even in the current structure there's additional 
      design reserves that we could tap before having to go to another 
      math model. All we need is a testcase that demonstrates rounding 
      errors. Roman's testcase was _not_ a demonstration of math 
      rounding errors, it was about clock granularity! )

 2) i tried microseconds resolution once (it's easy) but on fast hw it 
    already showed visible accounting/rounding artifacts in 
    high-frequency scheduling workloads, which, if hw gets just a little 
    bit faster, will become pain.

    ( Sidenote: if a workload is rescheduling once every few 
      microseconds, then it very much matters to the balance of things 
      whether there's a fundamental +- 1 microsecond skew on who gets 
      accounted what. In fact the placement of sched_clock() call within 
      schedule() is already visible in practice in some testcases, 
      whether the runqueue-spinlock acquire spinning time is accounted 
      to the 'previous' or the 'next' task - despite that time being 
      sub-microsecond on average. Going to microseconds makes this too 
      coarse. )

I.e. microseconds are on the limit today on fast hardware, and 
nanoseconds give us an automatic buffer against rounding errors.

On _slow_ hardware, with a low-resolution clock, i very much agree that 
we should not do too frequent math, and there are already four 
independent measures that we did in CFS to keep the math overhead down:

Firstly, CFS fundamentally "buffers the math" via deltas, _everywhere_ 
in the fastpath:

        if (unlikely(curr->delta_exec > sysctl_sched_stat_granularity)) {
                __update_curr(cfs_rq, curr, now);
                curr->delta_exec = 0;
        }

I.e. we only call the math routines if there was any delta. The beauty 
is that this "math buffering" works _automatically_ if there is a 
low-resolution sched_clock() present, because with a low-resolution 
clock a delta is only observed if a tick happens. I.e. in a 
high-frequency scheduling workload (the only place where scheduler 
overhead would be noticeable) all the CFS math is in a rare _slowpath_, 
that gets triggered only every 10 msecs or so! (if HZ=1000)

I.e. we didnt have to go down the (very painful) path of ktime_t 
split-model math and we didnt have to introduce a variable precision 
model for "scheduler time" either, because the delta math itself 
automatically buffers on slow boxes.

Secondly, note that all the key fastpath variables are already 32-bit 
(on 32-bit platforms):

        long                    wait_runtime;
        unsigned long           delta_fair_run;
        unsigned long           delta_fair_sleep;
        unsigned long           delta_exec;

The _timestamps_ are still 64-bit, but most of the actual math goes on 
in 32-bit delta variables. That's one of the advantages of doing deltas 
instead of absolute values.

Thirdly, if even this amount of buffering is not enough for an 
architecture, CFS also includes the sched_stat_granularity_ns tunable 
that allows the further reduction of the sampling frequency (and the 
frequency of us having to do the math) - so if the math overhead is a 
problem an architecture can set it.

Fourthly, in CFS there is a _single_ 64-bit division, and even for that 
division, the actual values passed to it are typically in a 32-bit 
range. Hence we've introduced the following optimization:

  static u64 div64_likely32(u64 divident, unsigned long divisor)
  {
  #if BITS_PER_LONG == 32
          if (likely(divident <= 0xffffffffULL))
                  return (u32)divident / divisor;
          do_div(divident, divisor);

          return divident;
  #else
          return divident / divisor;
  #endif
  }

Which, if the divident is in 32-bit range, does a 32-bit division.

About math related rounding errors mentioned by Roman (not to be 
confused with clock granularity rounding), in our analysis and in our 
experience rounding errors of the math were never an issue in CFS, due 
to the extra buffering that nanosecs gives - i tweaked it a bit around 
CFSv10 but it was unprovable to have any effect. That's one of the 
advantages of working with nanoseconds: the fundamental time unit 
includes about 10 "low bits" that can carry over much of the error and 
reduce rounding artifacts. And even that math rounding errors we believe 
centers around 0.

In Roman's variant of CFS's algorithm the variables are 32-bit, but the 
error is rolled forward in separate fract_* (fractional) 32-bit 
variables, so we still have 32+32==64 bit of stuff to handle. So we 
think that in the end such a 32+32 scheme would be more complex (and 
anyone who took a look at fs2.c would i think agree - it took Peter a 
day to decypher the math!) - but we'll be happy to be surprised with 
patches of course :-)

        Ingo

[*] the main advantage of microseconds would be that we could use "u32"
    throughout to carry around the "now" variable (timestamp). That 
    property of microseconds _is_ tempting and would reduce the 
    task_struct footprint as well. But if we did that it would have
    ripple effects: we'd have to resort to (pretty complex and 
    non-obvious) math to act against rounding errors. We'd either have 
    to carry the rounding error with us in separate 32-bit variables (in 
    essence creating 32+32 bit 64-bit variable), or we'd have to shift 
    up the microseconds by say ... 10 binary positions ... in essence 
    bringing us back into the same nanoseconds range. And then the 
    wraparound problem of microseconds - 72 hours is not _that_ 
    unrealistic to trigger intentionally, so we have to do _something_
    about it to inhibit infinite starvation. We experimented
    around with all this and the overwhelming conclusion (so far) was 
    that trying to reduce timestamps back to 32 bits is just not worth 
    it. _Deltas_ should, can and _are_ 32-bit values already, even in 
    the nanosecond model. So all the buffering and delta logic gives us
    most of the 32-bit advantages already, without the disadvantages of
    microseconds.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: CFS review

Reply via email to