I suggest you to repeat your analysis on go1.7beta, the compiler has a new backend for amd64 and the generated code should be better. On my machine:
go1.6.2: Found 23000000 primes up to 262146 in 445.329735ms . Found 23000000 primes up to 262146 in 447.491918ms . Found 23000000 primes up to 262146 in 451.530021ms . Found 23000000 primes up to 262146 in 445.181835ms . Found 23000000 primes up to 262146 in 447.489657ms . Found 23000000 primes up to 262146 in 451.568191ms . Found 23000000 primes up to 262146 in 446.207366ms . Found 23000000 primes up to 262146 in 449.992793ms . Found 23000000 primes up to 262146 in 445.746567ms . Found 23000000 primes up to 262146 in 449.102428ms . on go1.7beta Found 23000000 primes up to 262146 in 358.99806ms . Found 23000000 primes up to 262146 in 358.755634ms . Found 23000000 primes up to 262146 in 360.231645ms . Found 23000000 primes up to 262146 in 358.93451ms . Found 23000000 primes up to 262146 in 359.412598ms . Found 23000000 primes up to 262146 in 363.33646ms . Found 23000000 primes up to 262146 in 363.981767ms . Found 23000000 primes up to 262146 in 366.880261ms . Found 23000000 primes up to 262146 in 358.761698ms . Found 23000000 primes up to 262146 in 358.025324ms . Il giorno mercoledì 15 giugno 2016 14:37:59 UTC+2, gordo...@gmail.com ha scritto: > > While I am sure you are correct that some numeric code with golang can run > as fast as C/C++, it seems that bit-packed tight loops such as used for > culling composites in the Sieve of Eratosthenes (or the Sieve of Atkin for > that matter) are not as optimized. Of course one of the problems that > golang has is compulsory array bounds checks, but it doesn't run as fast as > even DotNet or JVM code, which also have those compulsory checks. > > To be specific, following is a simple Sieve of Eratosthenes program in > golang that runs about twice as slow as the same program in C/C++ and about > half again as slow as in C# or Java. The range of 262146 is chosen so that > the created bit-packed array is smaller than the size of most CPU L1 caches > (16 Kilobytes) so that memory access time isn't a bottleneck for most > modern desktop CPU's, and the program is run 1000 times to get a reasonable > length of time for more accurate timing The commented out lines were an > experiment to use unsafe pointers (as in a C style array program) to try to > save time by avoiding the automatic array bounds checks, but the resulting > time was about the same: > > package main > > import ( > "fmt" > "math" > "time" > // "unsafe" > ) > > func mkCLUT() [65536]byte { > var arr [65536]byte > for i := 0; i < 65536; i++ { > var cnt byte = 0 > for v := (uint16)(i ^ 0xFFFF); v > 0; v &= v - 1 { > cnt++ > } > arr[i] = cnt > } > return arr > } > > var cnstCLUT [65536]byte = mkCLUT() > > func primesTest(top uint) int { > lmtndx := (top - 3) >> 1 > lstw := lmtndx >> 5 > topsqrtndx := (int(math.Sqrt(float64(top))) - 3) >> 1 > cmpsts := make([]uint32, lstw+1) > // start := uintptr(unsafe.Pointer(&cmpsts[0])) > // step := unsafe.Sizeof(buf[0]) > // limit := start + uintptr(len(cmpsts))*step > for i := 0; i <= topsqrtndx; i++ { > if cmpsts[i>>5]&(uint32(1)<<uint(i)) == 0 { > p := (uint(i) << 1) + 3 > for j := (p*p - 3) >> 1; j <= lmtndx; j += p { > cmpsts[j>>5] |= 1 << (j & 31) > } > // p := uintptr((uint(i) << 1) + 3) > // lmt := uintptr(lmtndx) > // for j := (p*p - 3) >> 1; j <= lmt; j += p { > // *(*uint)(unsafe.Pointer(start + step*(j>>5))) > |= 1 << (j & 31) > // } > } > } > msk := uint32(0xFFFFFFFE) << (lmtndx & 31) > cmpsts[lstw] |= msk > cnt := 1 > for i := uint(0); i <= lstw; i++ { > v := cmpsts[i] > cnt += int(cnstCLUT[v&0xFFFF] + cnstCLUT[0xFFFF&(v>>16)]) > } > return cnt > } > > func main() { > n := uint(262146) > > strt := time.Now() > > sum := 0 > for i := 0; i < 1000; i++ { > sum += primesTest(n) > } > > end := time.Now() > > fmt.Println("Found", sum, "primes up to", n, "in", end.Sub(strt), > ".") > } > > The assembly listing revealed by "go tool compile -S primestest.go > > primestest.s" reveals why (I have added comments at the ends of the lines > to indicate what is going on): > > line 74 for j := (p*p - 3) >> 1; j <= lmtndx; j += p { > line 75 cmpsts[j>>5] |= 1 << (j & 31) > line 76 } > > has the following assembly code: > > 0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make > a copy of 'j' > 0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn > it into the uint32 word address > 0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do > the array bounds check > 0x0128 00296 (primestest.go:75) JCC $1, 546 > ;; out to panic if it fails > 0x012e 00302 (primestest.go:75) LEAQ (DI)(R9*4), BX > ;; get address of right element of array; the di register contains the > pointer to the array base > 0x0132 00306 (primestest.go:75) MOVL (BX), R11 > ;; get the contents of that element > 0x0135 00309 (primestest.go:75) CMPQ R9, SI ;; DO > ANOTHER ARRAY BOUNDS CHECK!!!!! > 0x0138 00312 (primestest.go:75) JCC $1, 539 > ;; OUT TO A DIFFERENT PANIC IF IT FAILS!!! > 0x013e 00318 (primestest.go:75) LEAQ (DI)(R9*4), BX > ;; GET THE ADDRESS OF THE SAME ELEMENT AGAIN!!!! > 0x0142 00322 (primestest.go:75) MOVQ CX, R8 ;;SAVE > THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE LOOP)!!! > 0x0145 00325 (primestest.go:75) MOVQ AX, CX > 0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx > register now contains 'j' & 31 > 0x014c 00332 (primestest.go:75) MOVL $1, BP > 0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp > register now contains 1 << ('j' & 31) > 0x0153 00339 (primestest.go:75) MOVQ R8, CX > ;;RESTORE THE CONTENTS OF THE CX REGISTER (SHOULD BE DONE OUTSIDE THE > LOOP)!!! > 0x0156 00342 (primestest.go:75) ORL R11, BP ;; r11 > now contains cmpsts['j' >> 5] | (1 << ('j' & 31)) > 0x0159 00345 (primestest.go:75) MOVL BP, (BX) ;; > move it back to the element via the address in BX > 0x015b 00347 (primestest.go:75) NOP ;; nop's to make the > addresses work out to even words > 0x015b 00347 (primestest.go:75) NOP > 0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add > 'j' to the stored 'p' in r10 > 0x015e 00350 (primestest.go:74) NOP > 0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; > check if 'j' is up to the stored 'lmtndx' in r12 > 0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and > loop back to top if not done > > The main problem is the lines I have commented in capitals where the > bounds check is done twice but also > where the contents of the cx register are saved and restored inside the > loop > > A secondary problem as far as speed is concerned is that the C/C++ code > also takes advantage of the direct read/modify/write instruction, > equivalent to "cmpsts[j >> 5] |= 1 << (j & 31)" so the code in this syntax > for C/C++ code would become: > > 0x011e 00286 (primestest.go:75) MOVQ AX, R9 ;; make > a copy of 'j' > 0x0121 00289 (primestest.go:75) SHRQ $5, R9 ;; turn > it into the uint32 word address > 0x0125 00293 (primestest.go:75) CMPQ R9, SI ;; do > the array bounds check ONCE IF ONE MUST > 0x0128 00296 (primestest.go:75) JCC $1, 546 > ;; out to panic if it fails > 0x0145 00325 (primestest.go:75) MOVQ AX, CX > 0x0148 00328 (primestest.go:75) ANDQ $31, CX ;; cx > register now contains 'j' & 31 > 0x014c 00332 (primestest.go:75) MOVL $1, BP > 0x0151 00337 (primestest.go:75) SHLL CX, BP ;; bp > register now contains 1 << ('j' & 31) > 0x015b 00347 (primestest.go:75) NOP ;; nop's to make the > addresses work out to even words AS NECESSARY > 0x015b 00347 (primestest.go:75) NOP > 0x0156 00342 (primestest.go:75) ORL BP, (DI)(R9*4);; > the element now contains cmpsts['j' >> 5] | (1 << ('j' & 31)) > 0x015b 00347 (primestest.go:74) ADDQ R10, AX ;; add > 'j' to the stored 'p' in r10 > 0x015e 00350 (primestest.go:74) NOP > 0x015e 00350 (primestest.go:74) CMPQ AX, R12 ;; > check if 'j' is up to the stored 'lmtndx' in r12 > 0x0161 00353 (primestest.go:74) JLS $0, 286 ;; and > loop back to top if not done > > Note that NOP's do not take execution time as they are optimized away in a > modern CPU and are there to > make destinations of jump instructions work to even word boundaries, which > can make a difference in speed. > > The immediately above loop will take about 3.5 clock cycles without bounds > check and about 5.5 clock cycles with bounds check for a modern desktop CPU > whereas the former version will take about three clock cycles more. > > This analysis is for 64-bit code, the difference is even worse for 32-bit > code as there are less registers available and > the golang compiler is even worse at making best use of them. > > For most use cases, the C/C++ code will be about twice the speed of the > golang code. > > C#/Java do the bounds checks, but not twice and are better at reducing > register operations within tight loops. > > In summary, the golang compiler has a way to go for maximum efficiency in > register use, even if the array bounds checks are kept. -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.