T L,

First, keep things simple. Delete the main function; it's an irrelevant
complication. Use shorter function names. Start simply and build up one
feature at a time. Function gggg is exactly like ffff except that variables
are substituted for the function parameter. Function hhhh is exactly like
gggg except that those variables are local variables, not file-level
(package-level) variables. Function iiii is exactly like hhhh except that
the variables are used directly instead of indirectly through pointers.

For example: https://play.golang.org/p/TsPPKv14Va
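
In case the playground link is unavailable, here is a rough sketch of what the four variants described above might look like. It is not a copy of the playground code: the variable names fileArray and filePtr are assumptions, and the main function is included only so that the sketch runs on its own.

```go
package main

import "fmt"

const N = 4096

type T int64

// Package-level ("file") variables. The names are assumptions for this sketch.
var (
	fileArray [N]T
	filePtr   = &fileArray
)

// ffff takes the array pointer as a function parameter.
func ffff(p *[N]T) T {
	var sum T
	for i := 0; i < len(p); i++ {
		sum += p[i]
	}
	return sum
}

// gggg is ffff with a package-level pointer variable in place of the parameter.
func gggg() T {
	var sum T
	for i := 0; i < len(filePtr); i++ {
		sum += filePtr[i]
	}
	return sum
}

// hhhh is gggg with local variables in place of the package-level ones.
func hhhh() T {
	var a [N]T
	p := &a
	var sum T
	for i := 0; i < len(p); i++ {
		sum += p[i]
	}
	return sum
}

// iiii is hhhh with the array used directly rather than through a pointer.
func iiii() T {
	var a [N]T
	var sum T
	for i := 0; i < len(a); i++ {
		sum += a[i]
	}
	return sum
}

func main() {
	// Fill the package-level array so ffff and gggg have something to add up.
	for i := range fileArray {
		fileArray[i] = T(i)
	}
	fmt.Println(ffff(&fileArray), gggg(), hhhh(), iiii())
}
```

Each function can then be compiled with -S on its own and compared against the previous one, so any change in the generated loop can be traced to a single change in the source.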

It shouldn't be too hard to see what is happening and why function ffff is 
fast (registers versus memory in loop):

package tl

const N = 4096

type T int64

func ffff(p *[N]T) T {
    var sum T
    for i := 0; i < len(p); i++ {
        sum += T(p[i])
    }
    return sum
}

$ go tool compile -S ffff.go

"".ffff t=1 size=49 args=0x10 locals=0x0
    0x0000 00000 (ffff.go:7)    TEXT    "".ffff(SB), $0-16
    0x0000 00000 (ffff.go:7)    FUNCDATA    $0, gclocals·aef1f7ba6e2630c93a51843d99f5a28a(SB)
    0x0000 00000 (ffff.go:7)    FUNCDATA    $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
    0x0000 00000 (ffff.go:9)    MOVQ    "".p+8(FP), AX
    0x0005 00005 (ffff.go:9)    MOVQ    $0, CX
    0x0007 00007 (ffff.go:7)    MOVQ    $0, DX
    0x0009 00009 (ffff.go:9)    CMPQ    CX, $4096
    0x0010 00016 (ffff.go:9)    JGE    $0, 43
    0x0012 00018 (ffff.go:10)    TESTB    AL, (AX)
    0x0014 00020 (ffff.go:9)    LEAQ    1(CX), BX
    0x0018 00024 (ffff.go:10)    MOVQ    (AX)(CX*8), SI
    0x001c 00028 (ffff.go:10)    ADDQ    SI, DX
    0x001f 00031 (ffff.go:9)    MOVQ    BX, CX
    0x0022 00034 (ffff.go:9)    CMPQ    CX, $4096
    0x0029 00041 (ffff.go:9)    JLT    $0, 18
    0x002b 00043 (ffff.go:12)    MOVQ    DX, "".~r1+16(FP)
    0x0030 00048 (ffff.go:12)    RET
    0x0000 48 8b 44 24 08 31 c9 31 d2 48 81 f9 00 10 00 00  H.D$.1.1.H......
    0x0010 7d 19 84 00 48 8d 59 01 48 8b 34 c8 48 01 f2 48  }...H.Y.H.4.H..H
    0x0020 89 d9 48 81 f9 00 10 00 00 7c e7 48 89 54 24 10  ..H......|.H.T$.
    0x0030 c3                                               .
gclocals·33cdeccccebe80329f1fdbee7f5874cb t=8 dupok size=8
    0x0000 01 00 00 00 00 00 00 00                          ........
gclocals·aef1f7ba6e2630c93a51843d99f5a28a t=8 dupok size=9
    0x0000 01 00 00 00 02 00 00 00 01                       .........

$ go tool objdump ffff.o

TEXT %22%22.ffff(SB) ffff.go
    ffff.go:9    0x325    488b442408    MOVQ 0x8(SP), AX    
    ffff.go:9    0x32a    31c9        XORL CX, CX        
    ffff.go:7    0x32c    31d2        XORL DX, DX        
    ffff.go:9    0x32e    4881f900100000    CMPQ $0x1000, CX    
    ffff.go:9    0x335    7d19        JGE 0x350        
    ffff.go:10    0x337    8400        TESTB AL, 0(AX)        
    ffff.go:9    0x339    488d5901    LEAQ 0x1(CX), BX    
    ffff.go:10    0x33d    488b34c8    MOVQ 0(AX)(CX*8), SI    
    ffff.go:10    0x341    4801f2        ADDQ SI, DX        
    ffff.go:9    0x344    4889d9        MOVQ BX, CX        
    ffff.go:9    0x347    4881f900100000    CMPQ $0x1000, CX    
    ffff.go:9    0x34e    7ce7        JL 0x337        
    ffff.go:12    0x350    4889542410    MOVQ DX, 0x10(SP)    
    ffff.go:12    0x355    c3        RET            
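
To make the register usage concrete, here is ffff again with each statement annotated with the matching instructions from the compile -S listing above. The annotations are my reading of that output, not compiler documentation. The name ffffAnnotated and the main function are hypothetical, added only so that the sketch runs on its own.

```go
package main

import "fmt"

const N = 4096

type T int64

// ffffAnnotated is ffff with comments mapping each statement to the
// instructions in the listing. Register roles: AX = p, CX = i, DX = sum.
func ffffAnnotated(p *[N]T) T {
	// MOVQ "".p+8(FP), AX: the pointer is loaded once, before the loop.
	var sum T // MOVQ $0, DX
	// MOVQ $0, CX initializes i.
	for i := 0; i < len(p); i++ { // CMPQ CX, $4096; LEAQ 1(CX), BX; MOVQ BX, CX
		// TESTB AL, (AX) is the nil check on p.
		// MOVQ (AX)(CX*8), SI loads p[i]; sum and i stay in registers.
		sum += p[i] // ADDQ SI, DX
	}
	return sum // MOVQ DX, "".~r1+16(FP): the result is stored after the loop
}

func main() {
	var a [N]T
	for i := range a {
		a[i] = 1
	}
	fmt.Println(ffffAnnotated(&a))
}
```

Inside the loop the pointer, the index, and the running sum all live in registers, so the only data loaded from memory on each iteration is the array element itself.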

References:

Names
https://research.swtch.com/names

A Quick Guide to Go's Assembler
https://golang.org/doc/asm

GopherCon 2016: Rob Pike - The Design of the Go Assembler
https://www.youtube.com/watch?v=KINIAgRpkDA

Command compile
https://golang.org/cmd/compile/

Command nm
https://golang.org/cmd/nm/

Command objdump
https://golang.org/cmd/objdump/

Command asm
https://golang.org/cmd/asm/

A Foray Into Go Assembly Programming
https://blog.sgmansfield.com/2017/04/a-foray-into-go-assembly-programming/

Intel® 64 and IA-32 Architectures Software Developer Manuals
https://software.intel.com/en-us/articles/intel-sdm

Peter

On Sunday, April 30, 2017 at 2:00:03 AM UTC-4, T L wrote:
>
>
>
> On Sunday, April 30, 2017 at 7:33:38 AM UTC+8, Ian Lance Taylor wrote:
>>
>> On Sat, Apr 29, 2017 at 1:43 AM, T L <tapi...@gmail.com> wrote: 
>> > 
>> > package main 
>> > 
>> > import ( 
>> >     "testing" 
>> > ) 
>> > 
>> > const N = 4096 
>> > type T int64 
>> > var a [N]T 
>> > 
>> > var globalSum T 
>> > 
>> > func sumByLoopArray_a(p *[N]T) T { 
>> >     var sum T 
>> >     for i := 0; i < len(p); i++ { 
>> >         sum += T(p[i]) 
>> >     } 
>> >     return sum 
>> > } 
>> > 
>> > func sumByLoopArray_b(p *[N]T) T { 
>> >     var sum T 
>> >     for i := 0; i < len(*p); i++ { 
>> >         sum += T((*p)[i]) 
>> >     } 
>> >     return sum 
>> > } 
>> > 
>> > //============================================ 
>> > 
>> > func Benchmark_LoopArray_0a(b *testing.B) { 
>> >     for i := 0; i < b.N; i++ { 
>> >         var sum T 
>> >         for i := 0; i < len(a); i++ { 
>> >             sum += T(a[i]) 
>> >         } 
>> >         globalSum = sum 
>> >     } 
>> > } 
>> > 
>> > func Benchmark_LoopArray_0b(b *testing.B) { 
>> >     for i := 0; i < b.N; i++ { 
>> >         var sum T 
>> >         p := &a 
>> >         for i := 0; i < len(p); i++ { 
>> >             sum += T(p[i]) 
>> >         } 
>> >         globalSum = sum 
>> >     } 
>> > } 
>> > 
>> > func Benchmark_LoopArray_1a(b *testing.B) { 
>> >     for i := 0; i < b.N; i++ { 
>> >         globalSum = sumByLoopArray_a(&a) 
>> >     } 
>> > } 
>> > 
>> > func Benchmark_LoopArray_1b(b *testing.B) { 
>> >     for i := 0; i < b.N; i++ { 
>> >         globalSum = sumByLoopArray_b(&a) 
>> >     } 
>> > } 
>> > 
>> > /* output: 
>> > 
>> > $ go test . -bench=. 
>> > Benchmark_LoopArray_0a-4         300000          5248 ns/op 
>> > Benchmark_LoopArray_0b-4         300000          5240 ns/op 
>> > Benchmark_LoopArray_1a-4         500000          3942 ns/op 
>> > Benchmark_LoopArray_1b-4         300000          3936 ns/op 
>> > */ 
>> > 
>> > why? 
>>
>> Benchmarking is hard. 
>>
>> If you really want to know why, you will have to look at the generated 
>> assembly code.  There are likely to be differences there.  For 
>> example, your 1a and 1b loops use a pointer to a global variable, but 
>> your 0a and 0b loops use a global variable directly.  In general 
>> references through a pointer are more efficient than references to a 
>> named global variable.  I don't know if that is the difference here, 
>> but it could be. 
>>
>> It could also be something difficult to control for, like loop alignment. 
>>
>> Ian 
>>
>
> Thanks, Ian.
>
> Yes, for 0a, it is reasonable that the reference to a global array really
> matters.
> But for 0b, the p pointer is a local variable, so it shouldn't be much
> different from 1a and 1b.
>
> If I create a local array for 0a and 0b, their results become
> much better, but are still a little slower than 1a and 1b.
>
> I wrote a small program and used "-gcflags -S" to check the assembly output:
>
> package main
>
> const N = 4096
> type T int64
>     
> func fffffff(p *[N]T) T {
>     var sum T
>     for i := 0; i < len(p); i++ {
>         sum += T(p[i])
>     }
>     return sum
> }
>
> func ggggggg() T {
>     var a [N]T
>     var sum T
>     for i := 0; i < len(a); i++ {
>         sum += T(a[i])
>     }
>     return sum
> }
>
> func main() {
> }
>
> The assembly code for fffffff and ggggggg is very different.
> I am not familiar with Go assembly yet, so I can't conclude why
> ggggggg is slower.
>
>
