Re: [PATCH v5 0/4] Scan for an idle sibling in a single pass

Li, Aubrey Sun, 31 Jan 2021 17:16:42 -0800

On 2021/1/27 21:51, Mel Gorman wrote:
> Changelog since v4
> o Avoid use of intermediate variable during select_idle_cpu
> 
> Changelog since v3
> o Drop scanning based on cores, SMT4 results showed problems
> 
> Changelog since v2
> o Remove unnecessary parameters
> o Update nr during scan only when scanning for cpus
> 
> Changelog since v1
> o Move extern declaration to header for coding style
> o Remove unnecessary parameter from __select_idle_cpu
> 
> This series of 4 patches reposts three patches from Peter entitled
> "select_idle_sibling() wreckage". It only scans the runqueues in a single
> pass when searching for an idle sibling.
> 
> Three patches from Peter were dropped. The first patch altered how scan
> depth was calculated. Scan depth selection is a random number generator
> with two major limitations. The avg_idle time is based on the time
> between a CPU going idle and being woken up clamped approximately by
> 2*sysctl_sched_migration_cost.  This is difficult to compare in a sensible
> fashion to avg_scan_cost. The second issue is that only the avg_scan_cost
> of scan failures is recorded and it does not decay.  This requires deeper
> surgery that would justify a patch on its own although Peter notes that
> https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is
> potentially useful for an alternative avg_idle metric.
> 
> The second dropped patch scanned based on cores instead of CPUs as it
> rationalised the difference between core scanning and CPU scanning.
> Unfortunately, Vincent reported problems with SMT4 so it's dropped
> for now until depth searching can be fixed.
> 
> The third dropped patch converted the idle core scan throttling mechanism
> to SIS_PROP. While this would unify the throttling of core and CPU
> scanning, it was not free of regressions and has_idle_cores is a fairly
> effective throttling mechanism with the caveat that it can have a lot of
> false positives for workloads like hackbench.
> 
> Peter's series tried to solve three problems at once; this subset addresses
> one problem.
> 
>  kernel/sched/fair.c     | 151 +++++++++++++++++++---------------------
>  kernel/sched/features.h |   1 -
>  2 files changed, 70 insertions(+), 82 deletions(-)
>


Four benchmarks were measured on an x86 4-socket system with 24 cores per
socket and 2 hyperthreads per core, 192 CPUs in total.

The load levels are 25%, 50%, 75% and 100%.

- hackbench shows an almost universal win.
- netperf shows notable changes at high load (75% and 100%), as does tbench
  at 50% load.

Details below:

hackbench: 10 iterations, 10000 loops, 40 fds per group
======================================================

- pipe process

        group   base    %std    v5      %std
        3       1       19.18   1.0266  9.06
        6       1       9.17    0.987   13.03
        9       1       7.11    1.0195  4.61
        12      1       1.07    0.9927  1.43

- pipe thread

        group   base    %std    v5      %std
        3       1       11.14   0.9742  7.27
        6       1       9.15    0.9572  7.48
        9       1       2.95    0.986   4.05
        12      1       1.75    0.9992  1.68

- socket process

        group   base    %std    v5      %std
        3       1       2.9     0.9586  2.39
        6       1       0.68    0.9641  1.3
        9       1       0.64    0.9388  0.76
        12      1       0.56    0.9375  0.55

- socket thread

        group   base    %std    v5      %std
        3       1       3.82    0.9686  2.97
        6       1       2.06    0.9667  1.91
        9       1       0.44    0.9354  1.25
        12      1       0.54    0.9362  0.6

netperf: 10 iterations x 100 seconds, transactions rate / sec
=============================================================

- tcp request/response performance

        thread  base    %std    v4      %std
        25%     1       5.34    1.0039  5.13
        50%     1       4.97    1.0115  6.3
        75%     1       5.09    0.9257  6.75
        100%    1       4.53    0.908   4.83



- udp request/response performance

        thread  base    %std    v4      %std
        25%     1       6.18    0.9896  6.09
        50%     1       5.88    1.0198  8.92
        75%     1       24.38   0.9236  29.14
        100%    1       26.16   0.9063  22.16

tbench: 10 iterations x 100 seconds, throughput / sec
=====================================================

        thread  base    %std    v4      %std
        25%     1       0.45    1.003   1.48
        50%     1       1.71    0.9286  0.82
        75%     1       0.84    0.9928  0.94
        100%    1       0.76    0.9762  0.59

schbench: 10 iterations x 100 seconds, 99th percentile latency
==============================================================

        mthread base    %std    v4      %std
        25%     1       2.89    0.9884  7.34
        50%     1       40.38   1.0055  38.37
        75%     1       4.76    1.0095  4.62
        100%    1       10.09   1.0083  8.03
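
For anyone who wants to reduce raw per-iteration results to rows like the
ones above, here is a minimal sketch. It assumes each row is the mean of the
iterations normalised to the baseline mean, and that %std is the sample
standard deviation as a percentage of that kernel's own mean; the exact
reduction is not spelled out in this mail, so treat both as assumptions. The
helper names summarize() and outside_noise() are hypothetical, and the noise
check is a crude rule of thumb, not a statistical test.

```python
from statistics import mean, stdev


def summarize(base_runs, patched_runs):
    """Reduce raw iterations to (base %std, normalised ratio, patched %std).

    Assumption: %std is the sample stddev as a percentage of the mean of
    the same kernel, and the ratio is patched mean / baseline mean.
    """
    base_mean = mean(base_runs)
    patched_mean = mean(patched_runs)
    base_pct_std = 100.0 * stdev(base_runs) / base_mean
    patched_pct_std = 100.0 * stdev(patched_runs) / patched_mean
    return base_pct_std, patched_mean / base_mean, patched_pct_std


def outside_noise(ratio, base_pct_std, patched_pct_std):
    """Crude check: is the change larger than the bigger of the two %std?"""
    change_pct = abs(ratio - 1.0) * 100.0
    return change_pct > max(base_pct_std, patched_pct_std)


# A few rows copied from the tables: (label, base %std, ratio, patched %std)
rows = [
    ("netperf tcp 100%", 4.53, 0.908, 4.83),
    ("netperf udp 100%", 26.16, 0.9063, 22.16),
    ("tbench 50%", 1.71, 0.9286, 0.82),
    ("schbench 50%", 40.38, 1.0055, 38.37),
]

for label, base_std, ratio, patched_std in rows:
    verdict = outside_noise(ratio, base_std, patched_std)
    print(f"{label:18s} ratio={ratio:.4f} outside_noise={verdict}")
```

By this criterion the tbench 50% and netperf tcp 100% rows fall outside the
run-to-run variation, while the netperf udp 100% and schbench 50% rows do
not, because their %std is larger than the measured change.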

Thanks,
-Aubrey
