On 2017.04.19 01:16 Mel Gorman wrote:
> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>> Hi Mel,
>>
>> Thanks for the "how to" information.
>> This is a very interesting use case.
>> From trace data, I see a lot of minimal durations with
>> virtually no load on the CPU, typically more consistent
>> with some type of light duty periodic (~100 Hz) workflow
>> (where we would prefer to not ramp up frequencies, or, more
>> accurately, to not keep them ramped up).
>
> This broadly matches my expectations in terms of behaviour. It is a
> low duty workload but, while I accept that a laptop may not want the
> frequencies to ramp up, it's not universally true.

Agreed.
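As an aside, in case anyone wants to look at the same thing on their
own machine: the durations I refer to come from the power:pstate_sample
trace event that the intel_pstate driver emits. A rough sketch of how
a window of samples can be captured (assuming debugfs is mounted at
/sys/kernel/debug; the post-processing that turns samples into
durations is not shown):

  cd /sys/kernel/debug/tracing
  echo 1 > events/power/pstate_sample/enable
  echo 1 > tracing_on
  sleep 60      # capture window while the workload runs
  echo 0 > tracing_on
  cp trace /tmp/pstate_samples.txt
  echo 0 > events/power/pstate_sample/enable

The "duration" is then just the time between consecutive samples for a
given CPU.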
> Long periods at low frequency to complete a workload is not
> necessarily better than using a high frequency to race to idle.

Agreed, but it is processor dependent. For example, with my older
i7-2700K processor I get the following package energies for one loop
(after the throw away loop) of the test (method 1):

intel_cpufreq, powersave (lowest energy reference):    5876 Joules
intel_cpufreq, conservative:                           5927 Joules
intel_cpufreq, ondemand:                               6525 Joules
intel_cpufreq, schedutil:                              6049 Joules
intel_pstate, performance (highest energy reference):  8105 Joules
intel_pstate, powersave:                               7044 Joules
intel_pstate, force the load based algorithm:          6390 Joules
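For reference, the package energies are from the processor's RAPL
package energy counter. One way to read it, shown only as a sketch of
the idea (I am not claiming this exact recipe; turbostat reports the
same counter), is via the powercap sysfs interface, assuming
intel-rapl:0 is the package domain and ignoring counter wraparound:

  E0=$(cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj)
  make test > /dev/null 2>&1    # one loop of the test
  E1=$(cat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj)
  # energy_uj is in microjoules; it wraps at max_energy_range_uj
  echo "package energy: $(( (E1 - E0) / 1000000 )) Joules"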
> Effectively, a low utilisation test suite could be considered as a
> "foreground task of high priority" and not a "background task of
> little interest".

I wouldn't know how to make the distinction.

>> My results (further below) are different than yours, sometimes
>> dramatically, but the trends are similar.
>
> It's inevitable there would be some hardware based differences. The
> machine I have appears to show an extreme case.

Agreed.

>> I have nothing to add about the control algorithm over what
>> Rafael already said.
>>
>> On 2017.04.11 09:42 Mel Gorman wrote:
>>> On Tue, Apr 11, 2017 at 08:41:09AM -0700, Doug Smythies wrote:
>>>> On 2017.04.11 03:03 Mel Gorman wrote:
>>>>> On Mon, Apr 10, 2017 at 10:51:38PM +0200, Rafael J. Wysocki wrote:
>>>>>> On Mon, Apr 10, 2017 at 10:41 AM, Mel Gorman wrote:
>>>>>>>
>>>>>>> It's far more obvious when looking at the git test suite and the
>>>>>>> length of time it takes to run. This is a shellscript and git
>>>>>>> intensive workload whose CPU utilisation is very low but is less
>>>>>>> sensitive to multiple factors than netperf and sockperf.
>>>>>>
>>>>
>>>> I would like to repeat your tests on my test computer (i7-2600K).
>>>> I am not familiar with, and have not been able to find,
>>>> "the git test suite" shellscript. Could you point me to it?
>>>>
>>>
>>> If you want to use the git source directly, do a checkout from
>>> https://github.com/git/git and build it. The core "benchmark" is
>>> "make test" and timing it.
>>
>> Because I had troubles with your method further below, I also did
>> this method. I did 5 runs, after a throw away run, similar to what
>> you do (and I could see the need for a throw away pass).
>>
>
> Yeah, at the very least IO effects should be eliminated.
>
>> Results (there is something wrong with user and system times and CPU%
>> in kernel 4.5, so I only calculated Elapsed differences):
>>
>
> In case it matters, the User and System CPU times are reported as
> standard for these classes of workload by mmtests even though they are
> not necessarily universally interesting. Generally, I consider the
> elapsed time to be the most important, but often a major change in
> system CPU time is interesting. That's not universally true, as there
> have been changes in how system CPU is calculated in the past, and
> it's sensitive to Kconfig options, with VIRT_CPU_ACCOUNTING_GEN being
> a notable source of confusion in the past.
>
>> Linux s15 4.5.0-stock #232 SMP Tue Apr 11 23:54:49 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> ... test_run: start 5 runs ...
>> 327.04user 122.08system 33:57.81elapsed (2037.81 : reference) 22%CPU
>> ... test_run: done ...
>>
>> Linux s15 4.11.0-rc6-stock #231 SMP Mon Apr 10 08:29:29 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>>
>> intel_pstate - powersave
>> ... test_run: start 5 runs ...
>> 1518.71user 552.87system 39:24.45elapsed (2364.45 : -16.03%) 87%CPU
>> ... test_run: done ...
>>
>> intel_pstate - performance (fast reference)
>> ... test_run: start 5 runs ...
>> 1160.52user 291.33system 29:36.05elapsed (1776.05 : 12.85%) 81%CPU
>> ... test_run: done ...
>>
>> intel_cpufreq - powersave (slow reference)
>> ... test_run: start 5 runs ...
>> 2165.72user 1049.18system 57:12.77elapsed (3432.77 : -68.45%) 93%CPU
>> ... test_run: done ...
>>
>> intel_cpufreq - ondemand
>> ... test_run: start 5 runs ...
>> 1776.79user 808.65system 47:14.74elapsed (2834.74 : -39.11%) 91%CPU
>>
>
> Nothing overly surprising there. It's been my observation that
> intel_pstate is generally better than acpi-cpufreq, which somewhat
> amuses me when I still see suggestions of disabling intel_pstate
> entirely, despite that advice being based on much older kernels.
>
>> intel_cpufreq - schedutil
>> ... test_run: start 5 runs ...
>> 2049.28user 1028.70system 54:57.82elapsed (3297.82 : -61.83%) 93%CPU
>> ... test_run: done ...
>>
>
> I'm mildly surprised at this. I had observed that schedutil is not
> great, but I don't recall seeing a result this bad.
>
>> Linux s15 4.11.0-rc6-revert #233 SMP Wed Apr 12 15:30:19 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>> ... test_run: start 5 runs ...
>> 1295.30user 365.98system 32:50.15elapsed (1970.15 : 3.32%) 84%CPU
>> ... test_run: done ...
>>
>
> And the revert does help, albeit not being an option for reasons
> Rafael covered.

New data point: kernel 4.11-rc7, intel_pstate, powersave, forcing the
load based algorithm: Elapsed 3178 seconds.

If I understand your data correctly, my load based results are the
opposite of yours:

Mel:  4.11-rc5 vanilla:    Elapsed mean: 3750.20 seconds
Mel:  4.11-rc5 load based: Elapsed mean: 2503.27 seconds
Or: 33.25%

Doug: 4.11-rc6 stock:            Elapsed total (5 runs): 2364.45 seconds
Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 seconds
Or: -34.4%

>>> The way I'm doing it is via mmtests so
>>>
>>> git clone https://github.com/gormanm/mmtests
>>> cd mmtests
>>> ./run-mmtests --no-monitor --config configs/config-global-dhp__workload_shellscripts test-run-1
>>> cd work/log
>>> ../../compare-kernels.sh | less
>>>
>>> and it'll generate a similar report to what I posted in this email
>>> thread. If you do multiple tests with different kernels then change
>>> the name of "test-run-1" to preserve the old data. compare-kernels.sh
>>> will compare whatever results you have.
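As an aside, in case our configurations differ: for the columns in the
table below I selected the driver mode and governor roughly as follows
(a sketch, run as root; the passive mode cases require booting with the
intel_pstate=passive kernel command line parameter, so they need a
reboot):

  # active mode (the default): pick one of intel_pstate's own governors
  for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo powersave > $f       # or performance
  done

  # passive mode: boot with "intel_pstate=passive" so that the
  # intel_cpufreq driver is used, then pick a governor the same way
  for f in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo schedutil > $f       # or ondemand, conservative, powersave
  done

Anyway, the method 2 results: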
>>               k4.5       k4.11-rc6          k4.11-rc6          k4.11-rc6          k4.11-rc6          k4.11-rc6          k4.11-rc6
>>                                             performance        pass-ps            pass-od            pass-su            revert
>> E min       388.71   456.51 (-17.44%)   342.81 ( 11.81%)   668.79 (-72.05%)   552.85 (-42.23%)   646.96 (-66.44%)   375.08 (  3.51%)
>> E mean      389.74   458.52 (-17.65%)   343.81 ( 11.78%)   669.42 (-71.76%)   553.45 (-42.01%)   647.95 (-66.25%)   375.98 (  3.53%)
>> E stddev      0.85     1.64 (-92.78%)     0.67 ( 20.83%)     0.41 ( 52.25%)     0.31 ( 64.00%)     0.68 ( 20.35%)     0.46 ( 46.00%)
>> E coeffvar    0.22     0.36 (-63.86%)     0.20 ( 10.25%)     0.06 ( 72.20%)     0.06 ( 74.65%)     0.10 ( 52.09%)     0.12 ( 44.03%)
>> E max       390.90   461.47 (-18.05%)   344.83 ( 11.79%)   669.91 (-71.38%)   553.68 (-41.64%)   648.75 (-65.96%)   376.37 (  3.72%)
>>
>> E = Elapsed (squished in an attempt to prevent line length wrapping
>> when I send)
>>
>>              k4.5    k4.11-rc6   k4.11-rc6     k4.11-rc6   k4.11-rc6   k4.11-rc6   k4.11-rc6
>>                                  performance   pass-ps     pass-od     pass-su     revert
>> User       347.26    1801.56     1398.76       2540.67     2106.30     2434.06     1536.80
>> System     139.01     701.87      366.59       1346.75     1026.67     1322.39      449.81
>> Elapsed   2346.77    2761.20     2062.12       4017.47     3321.10     3887.19     2268.90
>>
>> Legend:
>> blank       = active mode: intel_pstate - powersave
>> performance = active mode: intel_pstate - performance (fast reference)
>> pass-ps     = passive mode: intel_cpufreq - powersave (slow reference)
>> pass-od     = passive mode: intel_cpufreq - ondemand
>> pass-su     = passive mode: intel_cpufreq - schedutil
>> revert      = active mode: intel_pstate - powersave with commit
>>               ffb810563c0c reverted
>>
>> I deleted the user, system, and CPU rows, because they don't make any
>> sense.
>>
>
> User is particularly misleading. System can be very misleading between
> kernel versions due to accounting differences, so I'm OK with that.
>
>> I do not know why the tests run overall so much faster on my computer,
>
> Differences in CPU, I imagine. I know the machine I'm reporting on is
> a particularly bad example. I've seen other machines where the effect
> is less severe.

No, I meant that my overall run time was on the order of 3/4 of an
hour, whereas your tests were on the order of 3 hours. As far as I
could tell, our CPUs had similar capabilities.

>
>> I can only assume I have something wrong in my installation of your
>> mmtests.
>
> No, I've seen results broadly similar to yours on other machines, so I
> don't think you have a methodology error.
>
>> I do see mmtests looking for some packages which it cannot find.
>>
>
> That's not too unusual. The package names are based on openSUSE naming
> and that doesn't translate to other distributions. If you open
> bin/install-depends, you'll see a hashmap near the top that maps some
> of the names for redhat-based distributions and debian. It's not
> actively maintained. You can either install the packages manually
> before the test or update the mappings.

>> Mel wrote:
>>> The results show that it's not the only source, as a revert (last
>>> column) doesn't fix the damage, although it goes from 3750 seconds
>>> (4.11-rc5 vanilla) to 2919 seconds (with a revert).
>>
>> In my case, the reverted code ran faster than the kernel 4.5 code.
>>
>> The other big difference is that between kernel 4.5 and 4.11-rc5 you
>> got a -102.28% elapsed time difference, whereas I got -16.03% with
>> method 1 and -17.65% with method 2 (well, between 4.5 and 4.11-rc6 in
>> my case). I only get a -93.28% and -94.82% difference between my fast
>> and slow reference tests (albeit on the same kernel).
>>
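(A note on the arithmetic, since the convention is implicit: the
percentages throughout are (reference - result) / reference. For
example, for method 1 the fast reference elapsed 1776.05 seconds and
the slow reference 3432.77 seconds, so
(1776.05 - 3432.77) / 1776.05 = -93.28%; for method 2,
(2062.12 - 4017.47) / 2062.12 = -94.82%.)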
> I have no reason to believe this is a methodology error; it is due to
> a difference in CPU. Consider the following reports:
>
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/delboy/#gitsource
> http://beta.suse.com/private/mgorman/results/home/marvin/openSUSE-LEAP-42.2/global-dhp__workload_shellscripts-xfs/ivy/#gitsource
>
> The first one (delboy) shows a gain of 1.35% in one comparison but,
> for 4.11 (the kernel shown is 4.11-rc1 with vmscan-related patches on
> top that do not affect this test case), a regression of -17.51%,
> which is very similar to yours. The CPU there is a Xeon E3-1230 v5.
>
> The second report (ivy) is the machine I based the original complaint
> on, and it shows the large regression in elapsed time.
>
> So, different CPUs have different behaviours, which is no surprise at
> all considering that, at the very least, exit latencies will be
> different. While there may not be a universally correct answer to how
> to do this automatically, is it possible to tune intel_pstate such
> that it ramps up quickly regardless of recent utilisation and reduces
> relatively slowly? That would be better from a power consumption
> perspective than setting the "performance" governor.

As mentioned above, I don't know how to make the distinction in the
use cases.

... Doug
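P.S. On the question of making intel_pstate ramp up quickly and come
down slowly: as far as I know there is no direct knob for the ramp
rates. The closest crude approximation I can think of is to raise the
floor via min_perf_pct, which at least bounds how far there is to
ramp, at some energy cost. A sketch (the 50 is an arbitrary example
value, expressed as a percentage of the maximum supported performance,
turbo included):

  # note the default so it can be restored later
  cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
  # raise the minimum performance floor to 50%
  echo 50 > /sys/devices/system/cpu/intel_pstate/min_perf_pct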