[go-nuts] Re: [ANN] primegen.go: Sieve of Atkin prime number generator

alb . donizetti Wed, 15 Jun 2016 06:10:32 -0700

I suggest you to repeat your analysis on go1.7beta, the compiler has a new 
backend
for amd64 and the generated code should be better. On my machine:


go1.6.2:

Found 23000000 primes up to 262146 in 445.329735ms .
Found 23000000 primes up to 262146 in 447.491918ms .
Found 23000000 primes up to 262146 in 451.530021ms .
Found 23000000 primes up to 262146 in 445.181835ms .
Found 23000000 primes up to 262146 in 447.489657ms .
Found 23000000 primes up to 262146 in 451.568191ms .
Found 23000000 primes up to 262146 in 446.207366ms .
Found 23000000 primes up to 262146 in 449.992793ms .
Found 23000000 primes up to 262146 in 445.746567ms .
Found 23000000 primes up to 262146 in 449.102428ms .

on go1.7beta

Found 23000000 primes up to 262146 in 358.99806ms .
Found 23000000 primes up to 262146 in 358.755634ms .
Found 23000000 primes up to 262146 in 360.231645ms .
Found 23000000 primes up to 262146 in 358.93451ms .
Found 23000000 primes up to 262146 in 359.412598ms .
Found 23000000 primes up to 262146 in 363.33646ms .
Found 23000000 primes up to 262146 in 363.981767ms .
Found 23000000 primes up to 262146 in 366.880261ms .
Found 23000000 primes up to 262146 in 358.761698ms .
Found 23000000 primes up to 262146 in 358.025324ms .


Il giorno mercoledì 15 giugno 2016 14:37:59 UTC+2, gordo...@gmail.com ha 
scritto:
>
> While I am sure you are correct that some numeric code with golang can run 
> as fast as C/C++, it seems that bit-packed tight loops such as used for 
> culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin for 
> that matter) are not as optimized.  Of course one of the problems that 
> golang has is compulsory array bounds checks, but it doesn't run as fast as 
> even DotNet or JVM code, which also have those compulsory checks. 
>
> To be specific, following is a simple Sieve of Eratosthenes program in 
> golang that runs about twice as slow as the same program in C/C++ and about 
> half again as slow as in C# or Java.  The range of 262146 is chosen so that 
> the created bit-packed array is smaller than the size of most CPU L1 caches 
> (16 Kilobytes) so that memory access time isn't a bottleneck for most 
> modern desktop CPU's, and the program is run 1000 times to get a reasonable 
> length of time for more accurate timing  The commented out lines were an 
> experiment to use unsafe pointers (as in a C style array program) to try to 
> save time by avoiding the automatic array bounds checks, but the resulting 
> time was about the same: 
>
> package main 
>
> import ( 
>         "fmt" 
>         "math" 
>         "time" 
>         //    "unsafe" 
> ) 
>
> func mkCLUT() [65536]byte { 
>         var arr [65536]byte 
>         for i := 0; i < 65536; i++ { 
>                 var cnt byte = 0 
>                 for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 { 
>                         cnt++ 
>                 } 
>                 arr[i] = cnt 
>         } 
>         return arr 
> } 
>
> var cnstCLUT [65536]byte = mkCLUT() 
>
> func primesTest(top uint) int { 
>         lmtndx := (top - 3) >> 1 
>         lstw := lmtndx >> 5 
>         topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1 
>         cmpsts := make([]uint32, lstw+1) 
> //        start := uintptr(unsafe.Pointer(&cmpsts[0])) 
> //        step := unsafe.Sizeof(buf[0]) 
> //        limit := start + uintptr(len(cmpsts))*step 
>         for i := 0; i <= topsqrtndx; i++ { 
>                 if cmpsts[i>>5]&(uint32(1)<<uint(i)) == 0 { 
>                         p := (uint(i) << 1) + 3 
>                         for j := (p*p - 3) >> 1; j <= lmtndx; j += p { 
>                                 cmpsts[j>>5] |= 1 << (j & 31) 
>                         } 
> //                        p := uintptr((uint(i) << 1) + 3) 
> //                       lmt := uintptr(lmtndx) 
> //                       for j := (p*p - 3) >> 1; j <= lmt; j += p { 
> //                         *(*uint)(unsafe.Pointer(start + step*(j>>5))) 
> |= 1 << (j & 31) 
> //                       } 
>                 } 
>         } 
>         msk := uint32(0xFFFFFFFE) << (lmtndx & 31) 
>         cmpsts[lstw] |= msk 
>         cnt := 1 
>         for i := uint(0); i <= lstw; i++ { 
>                 v := cmpsts[i] 
>                 cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)]) 
>         } 
>         return cnt 
> } 
>
> func main() { 
>         n := uint(262146) 
>
>         strt := time.Now() 
>
>         sum := 0 
>         for i := 0; i < 1000; i++ { 
>                 sum += primesTest(n) 
>         } 
>
>         end := time.Now() 
>
>         fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), 
> ".") 
> } 
>
> The assembly listing revealed by "go tool compile -S primestest.go > 
> primestest.s" reveals why (I have added comments at the ends of the lines 
> to indicate what is going on): 
>
> line 74            for j := (p*p - 3) >> 1; j <= lmtndx; j += p { 
> line 75                cmpsts[j>>5] |= 1 << (j & 31) 
> line 76            } 
>
> has the following assembly code: 
>
>         0x011e 00286 (primestest.go:75)        MOVQ        AX, R9 ;; make 
> a copy of 'j' 
>         0x0121 00289 (primestest.go:75)        SHRQ        $5, R9  ;; turn 
> it into the uint32 word address 
>         0x0125 00293 (primestest.go:75)        CMPQ        R9, SI  ;; do 
> the array bounds check 
>         0x0128 00296 (primestest.go:75)        JCC        $1, 546       
>  ;; out to panic if it fails 
>         0x012e 00302 (primestest.go:75)        LEAQ        (DI)(R9*4), BX 
> ;; get address of right element of array; the di register contains the 
> pointer to the array base 
>         0x0132 00306 (primestest.go:75)        MOVL        (BX), R11       
>   ;; get the contents of that element 
>         0x0135 00309 (primestest.go:75)        CMPQ        R9, SI  ;; DO 
> ANOTHER ARRAY BOUNDS CHECK!!!!! 
>         0x0138 00312 (primestest.go:75)        JCC        $1, 539       
>  ;; OUT TO A DIFFERENT PANIC IF IT FAILS!!! 
>         0x013e 00318 (primestest.go:75)        LEAQ        (DI)(R9*4), BX 
> ;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!! 
>         0x0142 00322 (primestest.go:75)        MOVQ        CX, R8 ;;SAVE 
> THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!! 
>         0x0145 00325 (primestest.go:75)        MOVQ        AX, CX 
>         0x0148 00328 (primestest.go:75)        ANDQ        $31, CX ;; cx 
> register now contains 'j' & 31 
>         0x014c 00332 (primestest.go:75)        MOVL        $1, BP 
>         0x0151 00337 (primestest.go:75)        SHLL        CX, BP ;; bp 
> register now contains 1 << ('j' & 31) 
>         0x0153 00339 (primestest.go:75)        MOVQ        R8, CX 
>  ;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE 
> LOOP)!!! 
>         0x0156 00342 (primestest.go:75)        ORL        R11, BP  ;; r11 
> now contains cmpsts['j' >> 5] | (1 << ('j' & 31)) 
>         0x0159 00345 (primestest.go:75)        MOVL        BP, (BX)  ;; 
> move it back to the element via the address in BX 
>         0x015b 00347 (primestest.go:75)        NOP  ;; nop's to make the 
> addresses work out to even words 
>         0x015b 00347 (primestest.go:75)        NOP 
>         0x015b 00347 (primestest.go:74)        ADDQ        R10, AX ;; add 
> 'j' to the stored 'p' in r10 
>         0x015e 00350 (primestest.go:74)        NOP 
>         0x015e 00350 (primestest.go:74)        CMPQ        AX, R12  ;; 
> check if 'j' is up to the stored 'lmtndx' in r12 
>         0x0161 00353 (primestest.go:74)        JLS        $0, 286 ;; and 
> loop back to top if not done 
>
> The main problem is the lines I have commented in capitals where the 
> bounds check is done twice but also 
> where the contents of the cx register are saved and restored inside the 
> loop 
>
> A secondary problem as far as speed is concerned is that the C/C++ code 
> also takes advantage of the direct read/modify/write instruction, 
> equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)" so the code in this syntax 
> for C/C++ code would become: 
>
>         0x011e 00286 (primestest.go:75)        MOVQ        AX, R9 ;; make 
> a copy of 'j' 
>         0x0121 00289 (primestest.go:75)        SHRQ        $5, R9  ;; turn 
> it into the uint32 word address 
>         0x0125 00293 (primestest.go:75)        CMPQ        R9, SI  ;; do 
> the array bounds check ONCE IF ONE MUST 
>         0x0128 00296 (primestest.go:75)        JCC        $1, 546       
>  ;; out to panic if it fails 
>         0x0145 00325 (primestest.go:75)        MOVQ        AX, CX 
>         0x0148 00328 (primestest.go:75)        ANDQ        $31, CX ;; cx 
> register now contains 'j' & 31 
>         0x014c 00332 (primestest.go:75)        MOVL        $1, BP 
>         0x0151 00337 (primestest.go:75)        SHLL        CX, BP ;; bp 
> register now contains 1 << ('j' & 31) 
>         0x015b 00347 (primestest.go:75)        NOP ;; nop's to make the 
> addresses work out to even words AS NECESSARY 
>         0x015b 00347 (primestest.go:75)        NOP 
>         0x0156 00342 (primestest.go:75)        ORL        BP, (DI)(R9*4);; 
> the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31)) 
>         0x015b 00347 (primestest.go:74)        ADDQ        R10, AX ;; add 
> 'j' to the stored 'p' in r10 
>         0x015e 00350 (primestest.go:74)        NOP 
>         0x015e 00350 (primestest.go:74)        CMPQ        AX, R12  ;; 
> check if 'j' is up to the stored 'lmtndx' in r12 
>         0x0161 00353 (primestest.go:74)        JLS        $0, 286 ;; and 
> loop back to top if not done 
>
> Note that NOP's do not take execution time as they are optimized away in a 
> modern CPU and are there to 
> make destinations of jump instructions work to even word boundaries, which 
> can make a difference in speed. 
>
> The immediately above loop will take about 3.5 clock cycles without bounds 
> check and about 5.5 clock cycles with bounds check for a modern desktop CPU 
> whereas the former version will take about three clock cycles more. 
>
> This analysis is for 64-bit code, the difference is even worse for 32-bit 
> code as there are less registers available and 
> the golang compiler is even worse at making best use of them. 
>
> For most use cases, the C/C++ code will be about twice the speed of the 
> golang code. 
>
> C#/Java do the bounds checks, but not twice and are better at reducing 
> register operations within tight loops. 
>
> In summary, the golang compiler has a way to go for maximum efficiency in 
> register use, even if the array bounds checks are kept.

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[go-nuts] Re: [ANN] primegen.go: Sieve of Atkin prime number generator

Reply via email to