Yes, I think this is the extra LEAQ that appears in the loop. Ideally it would be lifted out of the loop. I think that is tracked as https://github.com/golang/go/issues/15808.

On Friday, July 22, 2022 at 7:33:47 PM UTC-7 Taj Khattra wrote:

> i get similar results with 1.18 (inline slower than noinline)
> but different results with 1.16, 1.17, and 1.19rc2 (inline faster than
> noinline)
>
> goos: linux
> goarch: amd64
> cpu: AMD Ryzen 5 5600X 6-Core Processor
>
> ======== 1.16.15
> BenchmarkNoInline-12    125717362     9.607 ns/op
> BenchmarkInline-12      150066394     8.721 ns/op
>
> BenchmarkNoInline-12    125476344     9.710 ns/op
> BenchmarkInline-12      133781608     8.851 ns/op
>
> ======== 1.17.10
> BenchmarkNoInline-12    100000000    10.14 ns/op
> BenchmarkInline-12      135818722     8.646 ns/op
>
> BenchmarkNoInline-12    123817206    10.61 ns/op
> BenchmarkInline-12      137691572     8.754 ns/op
>
> ======== 1.18.4
> BenchmarkNoInline-12    121646458    10.13 ns/op
> BenchmarkInline-12       81420973    14.65 ns/op
>
> BenchmarkNoInline-12    123927972    10.05 ns/op
> BenchmarkInline-12       81371038    14.64 ns/op
>
> ======== 1.19rc2
> BenchmarkNoInline-12    120799062     9.864 ns/op
> BenchmarkInline-12      147306990     8.579 ns/op
>
> BenchmarkNoInline-12    120426837    10.17 ns/op
> BenchmarkInline-12      129029052     8.621 ns/op
>
> On Friday, 22 July 2022 at 18:56:54 UTC-7 Kevin Chowski wrote:
>
>> Datapoint: same with windows/amd64 on Intel (running 1.19beta1):
>>
>> goos: windows
>> goarch: amd64
>> pkg: common/sandbox
>> cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
>> BenchmarkNoInline-4    77425848    14.34 ns/op
>> BenchmarkInline-4      59108932    20.58 ns/op
>> PASS
>> ok      common/sandbox  2.645s
>>
>> Looking at the disassembly, I noticed that in the Inline case there was a
>> 7-byte `lea 0xXXXXXX(%rip),%rbx` in the tight inner loop due to some
>> really proactive constant propagation (I hypothesize). If you manually
>> defeat the propagation by storing the string in a global and manually
>> copying it into the stack, the inlined becomes faster than NoInline again:
>> https://go.dev/play/p/VRgJP2y7joS
>>
>> goos: windows
>> goarch: amd64
>> pkg: common/sandbox
>> cpu: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz
>> BenchmarkNoInline-4                 81436539    14.08 ns/op
>> BenchmarkInline-4                   59255162    21.32 ns/op
>> BenchmarkInlineDefeatConstProp-4    97524828    12.57 ns/op
>> PASS
>> ok      common/sandbox  5.111s
>>
>> On Friday, July 22, 2022 at 11:01:00 AM UTC-6 mpr...@google.com wrote:
>>
>>> I can reproduce similar behavior on linux-amd64:
>>>
>>> $ perf stat ./example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x
>>> goos: linux
>>> goarch: amd64
>>> pkg: example.com
>>> cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
>>> BenchmarkInline-12    100000000    16.78 ns/op
>>> PASS
>>>
>>> Performance counter stats for './example.com.test -test.bench=BenchmarkInline -test.benchtime=100000000x':
>>>
>>>          1,691.95 msec task-clock:u          #    1.004 CPUs utilized
>>>                 0      context-switches:u    #    0.000 /sec
>>>                 0      cpu-migrations:u      #    0.000 /sec
>>>               352      page-faults:u         #  208.044 /sec
>>>     6,732,752,072      cycles:u              #    3.979 GHz
>>>    22,405,823,428      instructions:u        #    3.33  insn per cycle
>>>     6,501,294,164      branches:u            #    3.842 G/sec
>>>           149,596      branch-misses:u       #    0.00% of all branches
>>>
>>>       1.684677260 seconds time elapsed
>>>
>>>       1.692474000 seconds user
>>>       0.004020000 seconds sys
>>>
>>> $ perf stat ./example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x
>>> goos: linux
>>> goarch: amd64
>>> pkg: example.com
>>> cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
>>> BenchmarkNoInline-12    100000000    10.79 ns/op
>>> PASS
>>>
>>> Performance counter stats for './example.com.test -test.bench=BenchmarkNoInline -test.benchtime=100000000x':
>>>
>>>          1,091.71 msec task-clock:u          #    1.005 CPUs utilized
>>>                 0      context-switches:u    #    0.000 /sec
>>>                 0      cpu-migrations:u      #    0.000 /sec
>>>               363      page-faults:u         #  332.505 /sec
>>>     4,490,159,750      cycles:u              #    4.113 GHz
>>>    20,205,764,499      instructions:u        #    4.50  insn per cycle
>>>     6,701,281,015      branches:u            #    6.138 G/sec
>>>           586,073      branch-misses:u       #    0.01% of all branches
>>>
>>>       1.086302272 seconds time elapsed
>>>
>>>       1.087710000 seconds user
>>>       0.008027000 seconds sys
>>>
>>> The non-inlined version is actually fewer instructions to run the same
>>> benchmark, which surprises me because naively looking at the disassembly it
>>> seems that the inlined version is much more compact.
>>>
>>> On Fri, Jul 22, 2022 at 5:52 AM eric...@arm.com <eric...@arm.com> wrote:
>>>
>>>> For this piece of code, two test functions are the same, but one is
>>>> inlined, the other is not. However the inlined version is about 25% slower
>>>> than the no inlined version on apple m1 chip. Why is it?
>>>>
>>>> The code is here https://go.dev/play/p/0NkLMtTZtv4
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "golang-nuts" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to golang-nuts...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/golang-nuts/527264d7-7cc1-4278-9a29-c04eb3ec4e86n%40googlegroups.com