Re: [RFC] scheduler: improve SMP fairness in CFS

Tong Li Sat, 28 Jul 2007 12:24:15 -0700

On Fri, 27 Jul 2007, Chris Snook wrote:

I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more "free" cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration.

I think we are in agreement. To avoid confusion, I think we should be more precise on what fairness means. Lag (i.e., ideal fair time - actual service time) is the commonly used metric for fairness. The definition is that a scheduler is proportionally fair if for any task in any time interval, the task's lag is bounded by a constant (note it's in terms of absolute time). The knob here is this constant and can help trade off performance and fairness. The reason for a constant bound is that we want consistent fairness properties regardless of the number of tasks. For example, we don't want the system to be much less fair as the number of tasks increases. With DWRR, the lag bound is the max weight of currently running tasks, multiplied by sysctl_base_round_slice. So if all tasks are of nice 0, i.e., weight 1, and sysctl_base_round_slice equals 30 ms, then we are guaranteed each task is at most 30 ms off of the ideal case. This is a useful property. Just like what you mentioned about the migration cost, this property allows the scheduler or user to accurately reason about the tradeoffs. If we want to trade fairness for performance, we can increase sysctl_base_round_slice to, say, 100 ms; doing so we also know accurately the worst impact it has on fairness.
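To make the arithmetic above concrete, here is a minimal sketch in plain C. It is not code from the patch: dwrr_lag_bound_ms and its parameters are hypothetical names, and weights are assumed to be small integers with nice 0 mapped to weight 1, as in the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): worst-case lag in
 * milliseconds, i.e. the max weight of the currently running tasks
 * multiplied by the base round slice.
 */
static uint64_t dwrr_lag_bound_ms(uint64_t max_weight,
                                  uint64_t base_round_slice_ms)
{
    return max_weight * base_round_slice_ms;
}
```

With max_weight = 1, a 30 ms slice gives a 30 ms bound, and raising the slice to 100 ms gives a 100 ms bound. The bound does not depend on the number of tasks, which is the property I care about.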

Adding system calls, while great for research, is not something which is done lightly in the published kernel. If we're going to implement a user interface beyond simply interpreting existing priorities more precisely, it would be nice if this was part of a framework with a broader vision, such as a scheduler economy.

Agreed. I've seen papers on scheduler economy but am not familiar enough to comment on it.

Scheduling Algorithm:
The scheduler keeps a set of data structures, called Trio groups, to maintain the weight or reservation of each thread group (including one or more threads) and the local weight of each member thread. When scheduling a thread, it consults these data structures and computes (in constant time) a system-wide weight for the thread that represents an equivalent CPU share. Consequently, the scheduling algorithm, DWRR, operates solely based on the system-wide weight (or weight for short, hereafter) of each thread. Having a flat space of system-wide weights for individual threads avoids performing separate scheduling at each level of the group hierarchy and thus greatly simplifies the implementation for group scheduling.
Implementing a flat weight space efficiently is nontrivial. I'm curious to see how you reworked the original patch without global locking.

I simply removed the locking and changed a little bit in idle_balance(). The lock was there to prevent a thread from reading or writing the global highest round value while another thread is writing to it. For writes, it's simple to ensure without locking that only one write takes effect when multiple writes are concurrent. For the case where one write is in progress and multiple threads read, without locking, the only problem is that a reader may read a stale value and thus think the current highest round is X while it's actually X + 1. The end effect is that a thread can be at most two rounds behind the highest round. This changes DWRR's lag bound to 2 * (max weight of current tasks) * sysctl_base_round_slice, which is still constant.
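One way to get the "only one write takes effect" behavior without a lock is a compare-and-swap loop. The sketch below uses userspace C11 atomics purely for illustration; it is not the code in the patch, the names highest_round, advance_highest_round and read_highest_round are hypothetical, and a kernel version would use the kernel's own atomic primitives instead.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the global highest round value. */
static atomic_ulong highest_round;

/*
 * Raise the global round to new_round unless some other CPU has
 * already published a value at least that high. Concurrent callers
 * race on the compare-and-swap; losers retry against the fresh value
 * or give up once the global value has caught up.
 */
static void advance_highest_round(unsigned long new_round)
{
    unsigned long cur =
        atomic_load_explicit(&highest_round, memory_order_relaxed);

    while (cur < new_round &&
           !atomic_compare_exchange_weak_explicit(&highest_round, &cur,
                                                  new_round,
                                                  memory_order_release,
                                                  memory_order_relaxed)) {
        /* cur now holds the value that beat us; loop re-checks it. */
    }
}

/* Readers take no lock, so the value may already be one round stale. */
static unsigned long read_highest_round(void)
{
    return atomic_load_explicit(&highest_round, memory_order_acquire);
}
```

A reader that sees X just before another CPU publishes X + 1 keeps working against X, which is where the extra round in the 2 * (max weight) * sysctl_base_round_slice bound comes from.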

I had a feeling this patch was originally designed for the O(1) scheduler, and this is why. The old scheduler had expired arrays, so adding a round-expired array wasn't a radical departure from the design. CFS does not have an expired rbtree, so adding one *is* a radical departure from the design. I think we can implement DWRR or something very similar without using this implementation method. Since we've already got a tree of queued tasks, it might be easiest to basically break off one subtree (usually just one task, but not necessarily) and migrate it to a less loaded tree whenever we can reduce the difference between the load on the two trees by at least half. This would prevent both overcorrection and undercorrection.
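One possible reading of the "reduce the difference by at least half" rule, written as a small C sketch. The function name and the plain integer loads are assumptions for illustration only, not kernel code: moving a subtree of weight w from the busier tree to the idler one changes the gap d to |d - 2w|, so the move qualifies when that is at most d / 2.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: would moving a subtree of weight w from the
 * source tree to the destination tree cut the load gap between the
 * two trees at least in half?
 */
static bool migration_halves_imbalance(uint64_t src_load,
                                       uint64_t dst_load,
                                       uint64_t w)
{
    /* Only migrate from the busier tree, and only what it holds. */
    if (src_load <= dst_load || w > src_load)
        return false;

    uint64_t gap = src_load - dst_load;
    uint64_t new_src = src_load - w;
    uint64_t new_dst = dst_load + w;
    uint64_t new_gap = new_src > new_dst ? new_src - new_dst
                                         : new_dst - new_src;

    /* Accept only if the new gap is no more than half the old one. */
    return 2 * new_gap <= gap;
}
```

With loads 8 and 0, a subtree of weight 1 is rejected (the gap only drops to 6, undercorrection) and so is one of weight 7 (the gap flips to 6 the other way, overcorrection), while weights 2 through 6 are accepted.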

Yes, the description was based on O(1) and the intent was precisely not to depart much from its design. I totally agree the same philosophy should apply to an implementation based on CFS.

The idea of rounds was another implementation detail that bothered me. In the old scheduler, quantizing CPU time was a necessary evil. Now that we can account for CPU time with nanosecond resolution, doing things on an as-needed basis seems more appropriate, and should reduce the need for global synchronization.

Without the global locking, the global synchronization here is simply ping-ponging a cache line once in a while. This doesn't look expensive to me, but if benchmarking shows that it is, adjusting sysctl_base_round_slice can reduce the ping-pong frequency. There might also be a smart implementation that can alleviate this problem.

I don't understand why quantizing CPU time is a bad thing. Could you educate me on this?

I guess it's worth mentioning that although we now have nanosecond-level accounting, scheduling in the common case still occurs at timer tick granularity.

In summary, I think the accounting is sound, but the enforcement is sub-optimal for the new scheduler. A revision of the algorithm more cognizant of the capabilities and design of the current scheduler would seem to be in order.
I've referenced many times my desire to account for CPU/memory hierarchy in these patches. At present, I'm not sure we have sufficient infrastructure in the kernel to automatically optimize for system topology, but I think whatever design we pursue should have some concept of this hierarchy, even if we end up using a depth-1 tree in the short term while we figure out how to optimize this.

Agreed.

  tong
