Yes, I think the slowdown comes from the extra LEAQ that appears in the loop. 
Ideally it would be hoisted out of the loop. I think that is tracked in 
https://github.com/golang/go/issues/15808

On Friday, July 22, 2022 at 7:33:47 PM UTC-7 Taj Khattra wrote:

> I get similar results with 1.18 (inline slower than noinline),
> but different results with 1.16, 1.17, and 1.19rc2 (inline faster than 
> noinline).
>
> goos: linux
> goarch: amd64
> cpu: AMD Ryzen 5 5600X 6-Core Processor
>
> ======== 1.16.15
> BenchmarkNoInline-12        125717362            9.607 ns/op
> BenchmarkInline-12          150066394            8.721 ns/op
>
> BenchmarkNoInline-12        125476344            9.710 ns/op
> BenchmarkInline-12          133781608            8.851 ns/op
>
> ======== 1.17.10
> BenchmarkNoInline-12        100000000           10.14 ns/op
> BenchmarkInline-12          135818722            8.646 ns/op
>
> BenchmarkNoInline-12        123817206           10.61 ns/op
> BenchmarkInline-12          137691572            8.754 ns/op
>
> ======== 1.18.4
> BenchmarkNoInline-12        121646458           10.13 ns/op
> BenchmarkInline-12          81420973            14.65 ns/op
>
> BenchmarkNoInline-12        123927972           10.05 ns/op
> BenchmarkInline-12          81371038            14.64 ns/op
>
> ======== 1.19rc2
> BenchmarkNoInline-12        120799062            9.864 ns/op
> BenchmarkInline-12          147306990            8.579 ns/op
>
> BenchmarkNoInline-12        120426837           10.17 ns/op
> BenchmarkInline-12          129029052            8.621 ns/op
>
> On Friday, 22 July 2022 at 18:56:54 UTC-7 Kevin Chowski wrote:
>
>> Datapoint: same with windows/amd64 on Intel (running 1.19beta1):
>>
>> goos: windows
>> goarch: amd64
>> pkg: common/sandbox
>> cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
>> BenchmarkNoInline-4     77425848                14.34 ns/op
>> BenchmarkInline-4       59108932                20.58 ns/op
>> PASS
>> ok      common/sandbox  2.645s
>>
>> Looking at the disassembly, I noticed that in the Inline case there is a 
>> 7-byte `lea    0xXXXXXX(%rip),%rbx` in the tight inner loop, which I 
>> hypothesize is the result of aggressive constant propagation. If you 
>> defeat the propagation by storing the string in a global and copying it 
>> onto the stack by hand, the inlined version becomes faster than NoInline 
>> again: https://go.dev/play/p/VRgJP2y7joS
>>
>> goos: windows
>> goarch: amd64
>> pkg: common/sandbox
>> cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
>> BenchmarkNoInline-4                     81436539                14.08 ns/op
>> BenchmarkInline-4                       59255162                21.32 ns/op
>> BenchmarkInlineDefeatConstProp-4        97524828                12.57 ns/op
>> PASS
>> ok      common/sandbox  5.111s
>>
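The workaround described above can be sketched roughly as follows. This is not the code behind either playground link: `countByte`, `text`, `sink`, and both benchmark functions are hypothetical stand-ins, and whether this particular function shows the same slowdown depends on the Go version and CPU. The only point is the shape of the change: read the string from a package-level variable into a local before the loop instead of passing a constant at the call site.

```go
package main

import (
	"fmt"
	"testing"
)

// text is a package-level variable (hypothetical name). Because it is not a
// constant, the compiler generally does not treat its value as known at the
// call site.
var text = "hello, world"

// countByte is a hypothetical stand-in for the small function under test.
func countByte(s string, c byte) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] == c {
			n++
		}
	}
	return n
}

// sink keeps the results live so the calls are not optimized away.
var sink int

// benchInline passes a string constant directly to the inlinable function.
func benchInline(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink += countByte("hello, world", 'o')
	}
}

// benchDefeatConstProp copies the global into a local once, before the loop,
// and passes the local instead of the constant.
func benchDefeatConstProp(b *testing.B) {
	s := text
	for i := 0; i < b.N; i++ {
		sink += countByte(s, 'o')
	}
}

func main() {
	fmt.Println("Inline:               ", testing.Benchmark(benchInline))
	fmt.Println("InlineDefeatConstProp:", testing.Benchmark(benchDefeatConstProp))
}
```

To check whether the address computation is still inside the loop, compare the disassembly of the two benchmark functions rather than relying on the timings alone.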
>> On Friday, July 22, 2022 at 11:01:00 AM UTC-6 mpr...@google.com wrote:
>>
>>> I can reproduce similar behavior on linux-amd64:
>>>
>>> $ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x
>>> goos: linux                               
>>> goarch: amd64                                              
>>> pkg: example.com
>>> cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz     
>>> BenchmarkInline-12      100000000               16.78 ns/op             
>>>                                                                      
>>> PASS
>>>                                                                       
>>>  Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x':
>>>
>>>           1,691.95 msec task-clock:u              #    1.004 CPUs utilized
>>>                  0      context-switches:u        #    0.000 /sec
>>>                  0      cpu-migrations:u          #    0.000 /sec
>>>                352      page-faults:u             #  208.044 /sec
>>>      6,732,752,072      cycles:u                  #    3.979 GHz
>>>     22,405,823,428      instructions:u            #    3.33  insn per cycle
>>>      6,501,294,164      branches:u                #    3.842 G/sec
>>>            149,596      branch-misses:u           #    0.00% of all branches
>>>
>>>        1.684677260 seconds time elapsed
>>>
>>>        1.692474000 seconds user
>>>        0.004020000 seconds sys
>>>
>>>
>>>
>>> $ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x
>>> goos: linux
>>> goarch: amd64
>>> pkg: example.com
>>> cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
>>> BenchmarkNoInline-12            100000000               10.79 ns/op
>>> PASS
>>>
>>>  Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x':
>>>
>>>           1,091.71 msec task-clock:u              #    1.005 CPUs utilized
>>>                  0      context-switches:u        #    0.000 /sec
>>>                  0      cpu-migrations:u          #    0.000 /sec
>>>                363      page-faults:u             #  332.505 /sec
>>>      4,490,159,750      cycles:u                  #    4.113 GHz
>>>     20,205,764,499      instructions:u            #    4.50  insn per cycle
>>>      6,701,281,015      branches:u                #    6.138 G/sec
>>>            586,073      branch-misses:u           #    0.01% of all branches
>>>
>>>        1.086302272 seconds time elapsed
>>>
>>>        1.087710000 seconds user
>>>        0.008027000 seconds sys
>>>
>>> The non-inlined version actually executes fewer instructions to run the 
>>> same benchmark, which surprises me because, naively looking at the 
>>> disassembly, the inlined version seems much more compact.
>>>
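To put the two perf stat runs above on a per-iteration footing, the counter totals can be divided by the fixed iteration count from `-test.benchtime=100000000x`. This is a rough sketch: the totals cover the whole process, including startup and the Go runtime, so the per-op figures are approximations. The `perOp` helper and the `run` type are names made up for this example.

```go
package main

import "fmt"

// perOp divides a whole-process counter total by the iteration count.
func perOp(total, iters float64) float64 {
	return total / iters
}

// run holds the cycles:u and instructions:u totals from one perf stat run.
type run struct {
	name   string
	cycles float64
	insns  float64
}

func main() {
	const iters = 100_000_000 // from -test.benchtime=100000000x

	runs := []run{
		{"Inline", 6_732_752_072, 22_405_823_428},
		{"NoInline", 4_490_159_750, 20_205_764_499},
	}
	for _, r := range runs {
		fmt.Printf("%-9s %6.1f cycles/op %6.1f insns/op\n",
			r.name, perOp(r.cycles, iters), perOp(r.insns, iters))
	}
}
```

This works out to roughly 67.3 cycles and 224.1 instructions per op for the inlined run, against roughly 44.9 cycles and 202.1 instructions per op for the non-inlined run. The inlined version therefore executes about 22 more instructions per iteration and also retires them at a lower rate (3.33 versus 4.50 instructions per cycle).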
>>>
>>> On Fri, Jul 22, 2022 at 5:52 AM eric...@arm.com <eric...@arm.com> wrote:
>>>
>>>> In this piece of code the two test functions are identical, except that 
>>>> one is inlined and the other is not. However, the inlined version is about 
>>>> 25% slower than the non-inlined version on an Apple M1 chip. Why is that?
>>>>
>>>> The code is here https://go.dev/play/p/0NkLMtTZtv4
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "golang-nuts" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to golang-nuts...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/golang-nuts/527264d7-7cc1-4278-9a29-c04eb3ec4e86n%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/golang-nuts/527264d7-7cc1-4278-9a29-c04eb3ec4e86n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>
