golang version 1.7beta1 does indeed help: the time is now not much worse than C#/Java, but it is still not as good as C/C++ because of the one remaining array bounds check.
Using the same procedure to obtain an assembly listing (go tool compile -S PrimeTest.go > PrimeTest.s):

line 36    for j := (p*p - 3) >> 1; j <= lmtndx; j += p {
line 37        cmpsts[j>>5] |= 1 << (j & 31)
line 38    }

0x00f1 00241 (Main.go:37) MOVQ R8, CX             ;; copy 'j' from r8 to cx
0x00f4 00244 (Main.go:37) SHRQ $5, R8             ;; shift r8 right by 5 to get the word index
0x00f8 00248 (Main.go:37) CMPQ R8, DX             ;; array bounds check against the array length stored in dx
0x00fb 00251 (Main.go:37) JCC $0, 454             ;; panic if the bounds check fails
0x0101 00257 (Main.go:37) MOVL (AX)(R8*4), R10    ;; get the element into r10 in one step
0x0105 00261 (Main.go:37) MOVQ CX, R11            ;; save 'j' for later in r11
0x0108 00264 (Main.go:37) ANDQ $31, CX            ;; leave 'j' & 31 in cx
0x010c 00268 (Main.go:37) MOVL R9, R12            ;; save r9 to r12 to preserve the 1 it contains - WHY NOT JUST MAKE R12 CONTAIN 1 AT ALL TIMES IF USING IT IS QUICKER THAN AN IMMEDIATE LOAD
0x010f 00271 (Main.go:37) SHLL CX, R9             ;; R9 SHOULD JUST BE LOADED WITH 1 ABOVE - now r9 contains 1 << ('j' & 31)
0x0112 00274 (Main.go:37) ORL R10, R9             ;; r9 contains cmpsts[j >> 5] | (1 << ('j' & 31)) - the bit or is done here
0x0115 00277 (Main.go:37) MOVL R9, (AX)(R8*4)     ;; the element now contains the modified value
0x0119 00281 (Main.go:36) LEAQ 3(R11)(DI*2), R8   ;; tricky way to calculate 'j' + 2 * di + 3, where 2 * di + 3 is 'p' (di presumably holds the outer loop index); answer to r8, saves a register
0x011e 00286 (Main.go:37) MOVL R12, R9            ;; RESTORE R9 FROM R12 - SHOULD NOT BE NECESSARY, but doesn't really cost time as the CPU is waiting for the result of the LEAQ operation
0x0121 00289 (Main.go:36) CMPQ R8, BX             ;; check whether 'j' in r8 is up to the limit stored in bx
0x0124 00292 (Main.go:36) JLS $0, 241             ;; loop if not complete

This is much better than the 1.6.2 code in that it no longer does the array bounds check twice. There is still the minor inefficiency of tying up an extra register, r12, to hold the constant 1 instead of loading an immediate 1 into r9 each time around the loop. That register could instead have held 'p', which would save a slight amount of time over the tricky code that recalculates 'p' (quickly) on every loop (the tricky bit is still about half a cycle slower than just using a pre-calculated 'p' value).

The C/C++ code will still be quicker, mainly because it has no array bounds check, which saves a couple of CPU clock cycles, but also because it uses the single read/modify/write form of the ORL instruction instead of a MOVL from the array element to a register, an ORL with the bit mask, and then a MOVL from the register back to the array element.

It seems the compiler is now almost trying too hard to save registers at the cost of time in the tricky 'p' calculation, while spending a register for no gain, or even an actual loss, to keep the 1 in a register.

So it is good to see that golang compiler optimization is taking some steps forward, but it isn't quite there yet.
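
For anyone who wants to reproduce a listing like the one above, here is a minimal sketch of the kind of odds-only, bit-packed sieve that the inner loop could come from. The original PrimeTest.go is not shown in this message, so the function name countPrimes, the outer loop, and the counting pass are all assumptions; only the inner culling loop follows the quoted source lines, with an explicit uint conversion on the shift count so that it also compiles on older Go versions.

```go
package main

import "fmt"

// countPrimes counts the primes <= limit with an odds-only, bit-packed
// Sieve of Eratosthenes. Bit i of cmpsts stands for the odd number 2*i + 3;
// a set bit means "composite". This is an illustrative reconstruction,
// not the original PrimeTest.go.
func countPrimes(limit int) int {
	if limit < 2 {
		return 0
	}
	if limit < 3 {
		return 1 // only the prime 2
	}
	lmtndx := (limit - 3) >> 1 // index of the largest odd number <= limit
	cmpsts := make([]uint32, lmtndx/32+1)

	for i := 0; ; i++ {
		p := i + i + 3       // the odd number represented by index i
		j := (p*p - 3) >> 1  // index of p*p, the first multiple to cull
		if j > lmtndx {
			break
		}
		if cmpsts[i>>5]&(1<<uint(i&31)) != 0 {
			continue // p is already marked composite
		}
		// The inner loop discussed in the post: each store to cmpsts
		// is what carries the array bounds check.
		for ; j <= lmtndx; j += p {
			cmpsts[j>>5] |= 1 << uint(j&31)
		}
	}

	count := 1 // account for the prime 2
	for i := 0; i <= lmtndx; i++ {
		if cmpsts[i>>5]&(1<<uint(i&31)) == 0 {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println(countPrimes(1000000)) // prints 78498
}
```

Saving this to a file and running go tool compile -S on it should give a listing in which the culling loop can be compared against the one above, although register allocation and offsets will differ between compiler versions.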