On 2021/1/22 21:22, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 11:14, Mel Gorman <mgor...@techsingularity.net> wrote:
>>
>> On Fri, Jan 22, 2021 at 10:30:52AM +0100, Vincent Guittot wrote:
>>> Hi Mel,
>>>
>>> On Tue, 19 Jan 2021 at 13:02, Mel Gorman <mgor...@techsingularity.net> 
>>> wrote:
>>>>
>>>> On Tue, Jan 19, 2021 at 12:33:04PM +0100, Vincent Guittot wrote:
>>>>> On Tue, 19 Jan 2021 at 12:22, Mel Gorman <mgor...@techsingularity.net> 
>>>>> wrote:
>>>>>>
>>>>>> Changelog since v2
>>>>>> o Remove unnecessary parameters
>>>>>> o Update nr during scan only when scanning for cpus
>>>>>
>>>>> Hi Mel,
>>>>>
>>>>> I haven't looked at your previous version mainly because I'm chasing a
>>>>> performance regression on v5.11-rcx which prevents me from testing the
>>>>> impact of your patchset on my !SMT2 system.
>>>>> Will do this as soon as this problem is fixed
>>>>>
>>>>
>>>> Thanks, that would be appreciated as I do not have access to a !SMT2
>>>> system to do my own evaluation.
>>>
>>> I have been able to run tests with your patchset on both a large
>>> arm64 SMT4 system and a small arm64 !SMT system, and patch 3 is
>>> still a source of regression on both. Decreasing the min number of
>>> loops to 2 instead of 4, and scaling it with the SMT weight, doesn't
>>> seem to be a good option: the regressions disappear when I remove
>>> those changes, as tested with the patch below.
>>>
>>> hackbench -l 2560 -g 1 on 8 cores arm64
>>> v5.11-rc4 : 1.355 (+/- 7.96%)
>>> + sis improvement : 1.923 (+/- 25%)
>>> + the patch below : 1.332 (+/- 4.95%)
>>>
>>> hackbench -l 2560 -g 256 on 8 cores arm64
>>> v5.11-rc4 : 2.116 (+/- 4.62%)
>>> + sis improvement : 2.216 (+/- 3.84%)
>>> + the patch below : 2.113 (+/- 3.01%)
>>>

Results for 4 benchmarks collected over the weekend, with patch 3
applied, on an x86 4-socket system with 24 cores per socket and 2 HT
siblings per core, 192 CPUs in total.

It looks like the mid-load cases have notable changes on my side (all
results below are normalized to the unpatched baseline, i.e. base = 1,
and thread counts are given as a percentage of the 192 CPUs):
- netperf at 50% threads in TCP mode improved by 27.25%
- tbench at 50% threads regressed by 9.52%

Details below:

hackbench: 10 iterations, 10000 loops, 40 fds per group
======================================================

- pipe process

        group   base    %std    patch   %std
        6       1       5.27    1.0469  8.53
        12      1       1.03    1.0398  1.44
        24      1       2.36    1.0275  3.34

- pipe thread

        group   base    %std    patch   %std
        6       1       7.48    1.0747  5.25
        12      1       0.97    1.0432  1.95
        24      1       7.01    1.0299  6.81

- socket process

        group   base    %std    patch   %std
        6       1       1.01    0.9656  1.09
        12      1       0.35    0.9853  0.49
        24      1       1.33    0.9877  1.20

- socket thread

        group   base    %std    patch   %std
        6       1       2.52    0.9346  2.75
        12      1       0.86    0.9830  0.66
        24      1       1.17    0.9791  1.23
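
For reproduction, each data point above corresponds roughly to the
following invocations, reconstructed from the parameters in the
heading; the exact flags of my scripts may differ, and -f 40 is my
reading of "40 fds per group":

        # pipe transport
        hackbench -P -p -g <groups> -f 40 -l 10000   # process mode
        hackbench -T -p -g <groups> -f 40 -l 10000   # thread mode
        # socket transport (hackbench's default)
        hackbench -P -g <groups> -f 40 -l 10000      # process mode
        hackbench -T -g <groups> -f 40 -l 10000      # thread mode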

netperf: 10 iterations x 100 seconds, transaction rate / sec
=============================================================

- tcp request/response performance

        thread  base    %std    patch   %std
        50%     1       3.98    1.2725   7.52
        100%    1       2.73    0.9446   2.86
        200%    1       39.36   0.9955  29.45

- udp request/response performance

        thread  base    %std    patch   %std
        50%     1       6.18    1.0704  11.99
        100%    1       47.85   0.9637  45.83
        200%    1       45.74   1.0162  36.99
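
Roughly, per data point (reconstructed; N parallel netperf instances
are launched, where N is the listed percentage of the 192 CPUs, e.g.
50% = 96, each against a local netserver):

        netperf -t TCP_RR -l 100 -H 127.0.0.1        # tcp rr
        netperf -t UDP_RR -l 100 -H 127.0.0.1        # udp rr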

tbench: 10 iterations x 100 seconds, throughput / sec
=====================================================

        thread  base    %std    patch   %std
        50%     1       1.38    0.9048  2.46 
        100%    1       1.05    0.9640  0.68 
        200%    1       6.76    0.9886  2.86 
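
Roughly, per data point (reconstructed; nprocs is the listed
percentage of the 192 CPUs, e.g. 50% = 96 clients against a local
tbench_srv):

        tbench_srv &
        tbench -t 100 96 127.0.0.1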

schbench: 10 iterations x 100 seconds, 99th percentile latency
==============================================================

        mthread base    %std    patch   %std
        6       1       29.07   0.8714  25.73
        12      1       15.32   1.0000  12.39
        24      1        0.08   0.9996   0.01
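
Roughly, per data point (reconstructed; -m is the number of message
threads from the table, -r the runtime in seconds):

        schbench -m <mthreads> -r 100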

>>> So starting with a min of 2 loops instead of 4, and scaling the
>>> number of loops with the SMT weight, doesn't seem to be a good
>>> option, and we should remove it for now.
>>>
>> Note that this is essentially reverting the patch. As you remove "nr *=
>> sched_smt_weight", the scan is no longer proportional to cores, it's
> 
> Yes. My goal above was to narrow the changes down to the lines that
> generate the regressions, but I agree that removing patch 3 is the
> right solution.
>
>> proportional to logical CPUs and the rest of the patch and changelog becomes
>> meaningless. On that basis, I'll queue tests over the weekend that remove
>> this patch entirely and keep the CPU scan as a single pass.
>>
>> --
>> Mel Gorman
>> SUSE Labs
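
For anyone following the thread without the series applied, here is a
simplified sketch of the logic under discussion, i.e. the SIS_PROP
throttling in select_idle_cpu() in kernel/sched/fair.c with patch 3
applied; sis_min_cores and sched_smt_weight are the names used by the
series as I read it, and the surrounding code is reduced for
illustration:

        /*
         * Decide how many CPUs to scan for an idle one, based on the
         * average idle time of this runqueue vs the average cost of
         * a scan.  Patch 3 computes nr in units of cores (min 2
         * instead of min 4) and then multiplies by the SMT weight to
         * convert back to a CPU count; Vincent's test patch reverts
         * exactly those two changes.
         */
        if (sched_feat(SIS_PROP)) {
                u64 avg_idle = this_rq()->avg_idle / 512;
                u64 avg_cost = this_sd->avg_scan_cost + 1;
                u64 span_avg = sd->span_weight * avg_idle;

                if (span_avg > sis_min_cores * avg_cost)
                        nr = div_u64(span_avg, avg_cost);
                else
                        nr = sis_min_cores;     /* 2; was 4 */

                nr *= sched_smt_weight;         /* cores -> CPUs */
        }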
