Here are the results of schbench (scheduler latency benchmark) and uperf (networking benchmark).

Hardware config: 20-core (40 hyperthreaded CPUs) x86 box.
schbench config: message threads = 2, time = 180s, worker threads = variable
uperf config: ping-pong test on loopback interface with message size = 8k

Overall, both benchmarks seem to be happiest when the number of threads is close to the number of CPUs.
--------------------------------------------------------------------------------------------------------------------------
schbench maximum latency (lower is better):

Num                Base (4.14)                Base+pcpu
Workers       Mean        Stdev         Mean        Stdev     Improvement (%)
10          3026.8      4987.12          523       474.35      82.7210255055
18         13854.6      1841.61      12945.6       125.19       6.5609977913
19           16457      2046.51      12985.4        48.46      21.0949747828
20           14995      2368.84        15838      2038.82      -5.621873958
25         29952.2       107.72      29673.6       337.57       0.9301487036
30           30084       19.768      30096.2        7.782      -0.0405531179

-------------------------------------------------------------------------------------------------------------------

The proposed fix seems to improve the maximum latency for lower numbers
of threads. It also seems to reduce the variation (lower stdev) as well.

If the number of threads is equal to or higher than the number of CPUs, the benchmark produces significantly higher latencies by its nature. Results for the higher-thread use cases are presented to provide a complete picture, but it is difficult to conclude
anything from them.

Next, individual percentile results are presented for each use case. The proposed fix also improves latency across all percentiles for the configuration (19 worker threads) that
should saturate the system.
---------------------------------------------------------------------------------------------------------------------
schbench latency in usec (lower is better)

Num              Baseline (4.14)           Base+pcpu
Workers       Mean       Stdev        Mean       Stdev     Improvement (%)

50th percentile
10            64.2       2.039        63.6       1.743       0.934
18            57.6       5.388        57         4.939       1.041
19            63         4.774        58         4           7.936
20            59.6       4.127        60.2       5.153      -1.006
25            78.4       0.489        78.2       0.748       0.255
30            96.2       0.748        96.4       1.019      -0.207

75th percentile
10            72         3.033        71.6       2.939       0.555
18            78         2.097        77.2       2.135       1.025
19            81.6       1.2          79.4       0.8         2.696
20            81         1.264        80.4       2.332       0.740
25           109.6       1.019       110         0          -0.364
30           781.4      50.902       731.8      70.6382      6.3475

90th percentile
10            80.4       3.666        80.6       2.576      -0.248
18            87.8       1.469        88         1.673      -0.227
19            92.8       0.979        90.6       0.489       2.370
20            92.6       1.019        92         2           0.647
25          8977.6    1277.160      9014.4     467.857      -0.409
30          9558.4     334.641      9507.2     320.383       0.5356

95th percentile
10            86.8       3.867        87.6       4.409      -0.921
18            95.4       1.496        95.2       2.039       0.209
19           102.6       1.624        99         0.894       3.508
20           103.2       1.326       102.2       2.481       0.968
25         12400        78.383     12406.4      37.318      -0.051
30         12336        40.477     12310.4      12.8         0.207

99th percentile
10            99.2       5.418       103.4       6.887      -4.233
18           115.2       2.561       114.6       3.611       0.5208
19           126.25      4.573       120.4       3.872       4.6336
20           145.4       3.09        133         1.41        8.5281
25         12988.8      15.676     12924.8      25.6         0.4927
30         12988.8      15.676     12956.8      32.633       0.2463

99.50th percentile
10           104.4       5.161       109.8       7.909      -5.172
18           127.6       7.391       124.2       4.214       2.6645
19          2712.2    4772.883       133.6       5.571      95.074
20          3707.8    2831.954      2844.2    4708.345      23.291
25         14032      1283.834     13008        0            7.2976
30         16550.4     886.382     13840      1218.355      16.376
------------------------------------------------------------------------------------------------------------------------

Results from uperf
uperf config: loopback ping-pong test with message size = 8k

Num           Baseline (4.14)          Base+pcpu
Threads      Mean       Stdev        Mean       Stdev     Improvement (%)
1            9.056      0.02         8.966      0.083      -0.993
2           17.664      0.13        17.448      0.303      -1.222
4           32.03       0.22        31.972      0.129      -0.181
8           58.198      0.31        58.588      0.198       0.670
16         101.018      0.67       100.056      0.455      -0.952
32         148.1       15.41       164.494      2.312      11.069
64         203.66       1.16       203.042      1.348      -0.3073
128        197.12       1.04       194.722      1.174      -1.2165

The race window fix seems to help uperf for 32 threads (closest to the number of CPUs) as well.

Regards,
Atish

On 11/04/2017 07:58 PM, Joel Fernandes wrote:
Hi Peter,

On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <pet...@infradead.org> wrote:
On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
Currently, multiple tasks can wake up on the same cpu from the
select_idle_sibling() path in case they wake up simultaneously
and last ran on the same LLC. This happens because an idle cpu
is not updated until the idle task is scheduled out. Any task waking
during that period may potentially select that cpu as a wakeup
candidate.

Introduce a per-cpu variable that is set as soon as a cpu is
selected for wakeup for any task. This prevents other tasks
from selecting the same cpu again. Note: this does not close the
race window but minimizes it to accessing the per-cpu variable.
If two wakee tasks access the per-cpu variable at the same time,
they may select the same cpu again. But it minimizes the race window
considerably.
The very most important question: does it actually help? What
benchmarks, give what numbers?
I collected some numbers with an Android benchmark called Jankbench.
Most tests didn't show an improvement or degradation with the patch.
However, one of the tests, called "list view", consistently shows an
improvement. Particularly striking is the improvement at the mean and
the 25th percentile.

For the list_view test, Jankbench pulls up a list of text and scrolls
the list; this exercises the display pipeline in Android to render and
display the animation as the scroll happens. For Android, lower frame
times are considered quite important, as that means we are less likely
to drop frames and more likely to give the user a good experience
rather than a perceivably poor one.

For each frame, Jankbench measures the total time a frame takes and
stores it in a DB (the time from which the app starts drawing, to when
the rendering completes and the frame is submitted for display).
Following is the distribution of frame times in ms.

count    16304   (@60 fps, 4.5 minutes)

         Without patch   With patch
mean         5.196633   4.429641 (+14.75%)
std          2.030054   2.310025
25%          5.606810   1.991017 (+64.48%)
50%          5.824013   5.716631 (+1.84%)
75%          5.987102   5.932751 (+0.90%)
95%          6.461230   6.301318 (+2.47%)
99%          9.828959   9.697076 (+1.34%)

Note that although Android uses energy aware scheduling patches, I
turned those off to bring the test as close to mainline as possible. I
also backported Vincent's and Brendan's slow path fixes to the 4.4
kernel that the Pixel 2 uses.

Personally I am in favor of this patch considering this test data but
also that in the past, I remember that our teams had to deal with the
same race issue and used cpusets to avoid it (although they probably
tested with "energy aware" CPU selection kept on).

thanks,

- Joel
