Re: [go-nuts] noinline is 25% faster than inline on apple m1 ?

Kevin Chowski Fri, 22 Jul 2022 18:57:17 -0700

Data point: I see the same on windows/amd64 on Intel (running Go 1.19beta1):

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4     77425848                14.34 ns/op
BenchmarkInline-4       59108932                20.58 ns/op
PASS
ok      common/sandbox  2.645s


Looking at the disassembly, I noticed that in the Inline case there is a 
7-byte `lea    0xXXXXXX(%rip),%rbx` in the tight inner loop, which I 
hypothesize is due to some very aggressive constant propagation. If you 
manually defeat the propagation by storing the string in a global and 
copying it onto the stack, the inlined version becomes faster than 
NoInline again: https://go.dev/play/p/VRgJP2y7joS

goos: windows
goarch: amd64
pkg: common/sandbox
cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
BenchmarkNoInline-4                     81436539                14.08 ns/op
BenchmarkInline-4                       59255162                21.32 ns/op
BenchmarkInlineDefeatConstProp-4        97524828                12.57 ns/op
PASS
ok      common/sandbox  5.111s

On Friday, July 22, 2022 at 11:01:00 AM UTC-6 mpr...@google.com wrote:

> I can reproduce similar behavior on linux-amd64:
>
> $ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x
> goos: linux
> goarch: amd64
> pkg: example.com
> cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
> BenchmarkInline-12      100000000               16.78 ns/op
> PASS
>
>  Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x':
>
>           1,691.95 msec task-clock:u              #    1.004 CPUs utilized
>                  0      context-switches:u        #    0.000 /sec
>                  0      cpu-migrations:u          #    0.000 /sec
>                352      page-faults:u             #  208.044 /sec
>      6,732,752,072      cycles:u                  #    3.979 GHz
>     22,405,823,428      instructions:u            #    3.33  insn per cycle
>      6,501,294,164      branches:u                #    3.842 G/sec
>            149,596      branch-misses:u           #    0.00% of all branches
>
>        1.684677260 seconds time elapsed
>
>        1.692474000 seconds user
>        0.004020000 seconds sys
>
>
>
> $ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x
> goos: linux
> goarch: amd64
> pkg: example.com
> cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
> BenchmarkNoInline-12            100000000               10.79 ns/op
> PASS
>
>  Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x':
>
>           1,091.71 msec task-clock:u              #    1.005 CPUs utilized
>                  0      context-switches:u        #    0.000 /sec
>                  0      cpu-migrations:u          #    0.000 /sec
>                363      page-faults:u             #  332.505 /sec
>      4,490,159,750      cycles:u                  #    4.113 GHz
>     20,205,764,499      instructions:u            #    4.50  insn per cycle
>      6,701,281,015      branches:u                #    6.138 G/sec
>            586,073      branch-misses:u           #    0.01% of all branches
>
>        1.086302272 seconds time elapsed
>
>        1.087710000 seconds user
>        0.008027000 seconds sys
>
> The non-inlined version actually executes fewer instructions to run the 
> same benchmark, which surprises me because, naively looking at the 
> disassembly, the inlined version seems much more compact.
>
>
> On Fri, Jul 22, 2022 at 5:52 AM eric...@arm.com <eric...@arm.com> wrote:
>
>> In this piece of code, the two test functions are the same, except that 
>> one is inlined and the other is not. However, the inlined version is about 
>> 25% slower than the non-inlined version on an Apple M1 chip. Why is that?
>>
>> The code is here https://go.dev/play/p/0NkLMtTZtv4
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/059740f3-eaea-4d1d-bfa8-ee33ccaf9d97n%40googlegroups.com.
