T L,

ffff.go: https://play.golang.org/p/GPGO7YnLxh

Peter

On Monday, May 1, 2017 at 1:15:38 PM UTC-4, peterGo wrote:
>
> T L,
>
> First, keep things simple. Delete the main function; it's an irrelevant
> complication. Use shorter function names. Start simply and build one
> feature at a time. Function gggg is exactly like ffff except that variables
> are substituted for the function parameter. Function hhhh is exactly like
> gggg except that the parameter variables are local variables not file
> variables. Function iiii is exactly like hhhh except that variables are
> used directly instead of indirectly through pointers.
>
> For example: https://play.golang.org/p/TsPPKv14Va
>
> It shouldn't be too hard to see what is happening and why function ffff is
> fast (registers versus memory in loop):
>
> package tl
>
> const N = 4096
>
> type T int64
>
> func ffff(p *[N]T) T {
>     var sum T
>     for i := 0; i < len(p); i++ {
>         sum += T(p[i])
>     }
>     return sum
> }
>
> $ go tool compile -S ffff.go
>
> "".ffff t=1 size=49 args=0x10 locals=0x0
>     0x0000 00000 (ffff.go:7)   TEXT      "".ffff(SB), $0-16
>     0x0000 00000 (ffff.go:7)   FUNCDATA  $0, gclocals·aef1f7ba6e2630c93a51843d99f5a28a(SB)
>     0x0000 00000 (ffff.go:7)   FUNCDATA  $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
>     0x0000 00000 (ffff.go:9)   MOVQ      "".p+8(FP), AX
>     0x0005 00005 (ffff.go:9)   MOVQ      $0, CX
>     0x0007 00007 (ffff.go:7)   MOVQ      $0, DX
>     0x0009 00009 (ffff.go:9)   CMPQ      CX, $4096
>     0x0010 00016 (ffff.go:9)   JGE       $0, 43
>     0x0012 00018 (ffff.go:10)  TESTB     AL, (AX)
>     0x0014 00020 (ffff.go:9)   LEAQ      1(CX), BX
>     0x0018 00024 (ffff.go:10)  MOVQ      (AX)(CX*8), SI
>     0x001c 00028 (ffff.go:10)  ADDQ      SI, DX
>     0x001f 00031 (ffff.go:9)   MOVQ      BX, CX
>     0x0022 00034 (ffff.go:9)   CMPQ      CX, $4096
>     0x0029 00041 (ffff.go:9)   JLT       $0, 18
>     0x002b 00043 (ffff.go:12)  MOVQ      DX, "".~r1+16(FP)
>     0x0030 00048 (ffff.go:12)  RET
>     0x0000 48 8b 44 24 08 31 c9 31 d2 48 81 f9 00 10 00 00  H.D$.1.1.H......
>     0x0010 7d 19 84 00 48 8d 59 01 48 8b 34 c8 48 01 f2 48  }...H.Y.H.4.H..H
>     0x0020 89 d9 48 81 f9 00 10 00 00 7c e7 48 89 54 24 10  ..H......|.H.T$.
>     0x0030 c3                                               .
> gclocals·33cdeccccebe80329f1fdbee7f5874cb t=8 dupok size=8
>     0x0000 01 00 00 00 00 00 00 00                          ........
> gclocals·aef1f7ba6e2630c93a51843d99f5a28a t=8 dupok size=9
>     0x0000 01 00 00 00 02 00 00 00 01                       .........
>
> $ go tool objdump ffff.o
>
> TEXT %22%22.ffff(SB) ffff.go
>     ffff.go:9   0x325  488b442408      MOVQ 0x8(SP), AX
>     ffff.go:9   0x32a  31c9            XORL CX, CX
>     ffff.go:7   0x32c  31d2            XORL DX, DX
>     ffff.go:9   0x32e  4881f900100000  CMPQ $0x1000, CX
>     ffff.go:9   0x335  7d19            JGE 0x350
>     ffff.go:10  0x337  8400            TESTB AL, 0(AX)
>     ffff.go:9   0x339  488d5901        LEAQ 0x1(CX), BX
>     ffff.go:10  0x33d  488b34c8        MOVQ 0(AX)(CX*8), SI
>     ffff.go:10  0x341  4801f2          ADDQ SI, DX
>     ffff.go:9   0x344  4889d9          MOVQ BX, CX
>     ffff.go:9   0x347  4881f900100000  CMPQ $0x1000, CX
>     ffff.go:9   0x34e  7ce7            JL 0x337
>     ffff.go:12  0x350  4889542410      MOVQ DX, 0x10(SP)
>     ffff.go:12  0x355  c3              RET
>
> References:
>
> Names
> https://research.swtch.com/names
>
> A Quick Guide to Go's Assembler
> https://golang.org/doc/asm
>
> GopherCon 2016: Rob Pike - The Design of the Go Assembler
> https://www.youtube.com/watch?v=KINIAgRpkDA
>
> Command compile
> https://golang.org/cmd/compile/
>
> Command nm
> https://golang.org/cmd/nm/
>
> Command objdump
> https://golang.org/cmd/objdump/
>
> Command asm
> https://golang.org/cmd/asm/
>
> A Foray Into Go Assembly Programming
> https://blog.sgmansfield.com/2017/04/a-foray-into-go-assembly-programming/
>
> Intel® 64 and IA-32 Architectures Software Developer Manuals
> https://software.intel.com/en-us/articles/intel-sdm
>
> Peter
>
> On Sunday, April 30, 2017 at 2:00:03 AM UTC-4, T L wrote:
>>
>> On Sunday, April 30, 2017 at 7:33:38 AM UTC+8, Ian Lance Taylor wrote:
>>>
>>> On Sat, Apr 29, 2017 at 1:43 AM, T L <tapi...@gmail.com> wrote:
>>> >
>>> > package main
>>> >
>>> > import (
>>> >     "testing"
>>> > )
>>> >
>>> > const N = 4096
>>> > type T int64
>>> > var a [N]T
>>> >
>>> > var globalSum T
>>> >
>>> > func sumByLoopArray_a(p *[N]T) T {
>>> >     var sum T
>>> >     for i := 0; i < len(p); i++ {
>>> >         sum += T(p[i])
>>> >     }
>>> >     return sum
>>> > }
>>> >
>>> > func sumByLoopArray_b(p *[N]T) T {
>>> >     var sum T
>>> >     for i := 0; i < len(*p); i++ {
>>> >         sum += T((*p)[i])
>>> >     }
>>> >     return sum
>>> > }
>>> >
>>> > //============================================
>>> >
>>> > func Benchmark_LoopArray_0a(b *testing.B) {
>>> >     for i := 0; i < b.N; i++ {
>>> >         var sum T
>>> >         for i := 0; i < len(a); i++ {
>>> >             sum += T(a[i])
>>> >         }
>>> >         globalSum = sum
>>> >     }
>>> > }
>>> >
>>> > func Benchmark_LoopArray_0b(b *testing.B) {
>>> >     for i := 0; i < b.N; i++ {
>>> >         var sum T
>>> >         p := &a
>>> >         for i := 0; i < len(p); i++ {
>>> >             sum += T(p[i])
>>> >         }
>>> >         globalSum = sum
>>> >     }
>>> > }
>>> >
>>> > func Benchmark_LoopArray_1a(b *testing.B) {
>>> >     for i := 0; i < b.N; i++ {
>>> >         globalSum = sumByLoopArray_a(&a)
>>> >     }
>>> > }
>>> >
>>> > func Benchmark_LoopArray_1b(b *testing.B) {
>>> >     for i := 0; i < b.N; i++ {
>>> >         globalSum = sumByLoopArray_b(&a)
>>> >     }
>>> > }
>>> >
>>> > /* output:
>>> >
>>> > $ go test . -bench=.
>>> > Benchmark_LoopArray_0a-4    300000    5248 ns/op
>>> > Benchmark_LoopArray_0b-4    300000    5240 ns/op
>>> > Benchmark_LoopArray_1a-4    500000    3942 ns/op
>>> > Benchmark_LoopArray_1b-4    300000    3936 ns/op
>>> > */
>>> >
>>> > why?
>>>
>>> Benchmarking is hard.
>>>
>>> If you really want to know why, you will have to look at the generated
>>> assembly code. There are likely to be differences there. For
>>> example, your 1a and 1b loops use a pointer to a global variable, but
>>> your 0a and 0b loops use a global variable directly. In general
>>> references through a pointer are more efficient than references to a
>>> named global variable. I don't know if that is the difference here,
>>> but it could be.
>>>
>>> It could also be something difficult to control for, like loop
>>> alignment.
>>>
>>> Ian
>>>
>>
>> Thanks, Ian.
>>
>> Yes, for 0a, it is reasonable that the reference to a global array really
>> matters.
>> But for 0b, the p pointer is a local variable, so it shouldn't be much
>> different from 1a and 1b.
>>
>> If I create a local array for 0a and 0b, their results become
>> much better, but they are still a little slower than 1a and 1b.
>>
>> I wrote a small program and used "-gcflags -S" to check the assembly
>> output:
>>
>> package main
>>
>> const N = 4096
>> type T int64
>>
>> func fffffff(p *[N]T) T {
>>     var sum T
>>     for i := 0; i < len(p); i++ {
>>         sum += T(p[i])
>>     }
>>     return sum
>> }
>>
>> func ggggggg() T {
>>     var a [N]T
>>     var sum T
>>     for i := 0; i < len(a); i++ {
>>         sum += T(a[i])
>>     }
>>     return sum
>> }
>>
>> func main() {
>> }
>>
>> The assembly code for fffffff and ggggggg is very different.
>> I am not yet familiar with Go assembly, so I can't conclude why
>> ggggggg is slower.
>>

-- 
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
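
For anyone following along, the ffff/gggg/hhhh/iiii progression described in the thread could be sketched roughly as below. This is only one reading of that description, not the code behind the playground link, which may differ. The names gp, la and lp, the init function that fills the array, and the main function are assumptions added so the file runs on its own. It makes no claim about which variant is faster; it only sets the four loops side by side so their generated assembly can be compared.

```go
package main

import "fmt"

const N = 4096

type T int64

// File-level array and a file-level pointer to it.
// gp is a hypothetical name, not taken from the thread.
var a [N]T
var gp = &a

// init fills the array so the sums are non-trivial (an assumption for this sketch).
func init() {
	for i := range a {
		a[i] = T(i)
	}
}

// ffff receives the array pointer as a function parameter (as in the thread).
func ffff(p *[N]T) T {
	var sum T
	for i := 0; i < len(p); i++ {
		sum += p[i]
	}
	return sum
}

// gggg is like ffff, but reads a file-level pointer instead of a parameter.
func gggg() T {
	var sum T
	for i := 0; i < len(gp); i++ {
		sum += gp[i]
	}
	return sum
}

// hhhh is like gggg, but the array and the pointer are local variables.
func hhhh() T {
	var la [N]T
	lp := &la
	var sum T
	for i := 0; i < len(lp); i++ {
		sum += lp[i]
	}
	return sum
}

// iiii is like hhhh, but indexes the local array directly, without a pointer.
func iiii() T {
	var la [N]T
	var sum T
	for i := 0; i < len(la); i++ {
		sum += la[i]
	}
	return sum
}

func main() {
	// The local arrays in hhhh and iiii are zero-valued, so their sums are 0.
	fmt.Println(ffff(&a), gggg(), hhhh(), iiii()) // prints "8386560 8386560 0 0"
}
```

Building the file with "-gcflags -S", as in the message above, prints the assembly for each function, so the loop bodies can be checked for register versus memory operands.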