On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote:

>> Further to the subject of compiler efficiency, the following is the
>> assembler output, with array bounds checking turned off (-B), for the
>> inner tight composite-culling loop of FasterEratspeed above (generated
>> with go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):
>>
>> 0x0051 00081 (main.go:426) MOVL R11, CX
>> 0x0054 00084 (main.go:426) SHRL $5, R11
>> 0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
>> 0x005c 00092 (main.go:430) MOVL $1, R14
>> 0x0062 00098 (main.go:430) SHLL CX, R14
>> 0x0065 00101 (main.go:430) ORL R13, R14
>> 0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
>> 0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
>> 0x0070 00112 (main.go:425) CMPL R11, R8
>> 0x0073 00115 (main.go:425) JCS $0, 81
>>
>> At 10 instructions, this is about as tight as it gets, short of using the
>> more complex read/modify/write form of the ORL instruction, and even that
>> doesn't seem to save much if any time given instruction latencies.
>>
>> Note that this code has eliminated the "k & 31" for the shift, seeming to
>> recognize that it isn't necessary since a 32-bit shift can't be greater
>> than 31
>
> Getting rid of the &31 is easy and I'll do that in 1.8.
>
>> anyway, that, unlike the simple PrimeSpeed program, it properly uses the
>> immediate load of '1',
>
> I don't know what the issue is yet, but it shouldn't be hard to fix in 1.8.
>
>> and that it cleverly uses the LEAL instruction to add the prime value 'q'
>> in R12 to the unmodified 'k' value in CX, producing the sum in the original
>> location of 'j' in R11 and saving another instruction to move the result
>> from CX to R11.
>
> The current SSA backend should do this also.

No, Keith, you seem to have misunderstood: I wasn't complaining about the
above assembler code as produced by the 1.7beta1 compiler; I was wondering
why it isn't always this good. This is about as good as it gets for this
loop: it already properly gets rid of the &31, does a proper immediate load
of 1, and makes clever use of the LEA instruction, without the misuse of the
LEA instruction to continuously recalculate 'p'.

The assembler code above is produced by either of the following loop
variations:

1) as it is in FasterEratspeed:

	for k < lngthb {
		pos := k >> 5
		data := k & 31
		bits := buf[pos]
		k += q
		bits |= 1 << data // two[data]
		buf[pos] = bits
	}

2) I get the same assembler code if I change this to the simpler:

	for ; k < lngthb; k += q {
		buf[k>>5] |= 1 << (k & 31)
	}

where all variables and buffers are uint32.

My question was: why does the compiler produce this very good code for both
variations, yet produce something much worse for the same variation-two loop
in the simple PrimeSpeed code, where the main difference is that PrimeSpeed
uses 64-bit uint for the loop variables and loop limit? Does that give you a
clue where the problem might be? Converting PrimeSpeed to use uint32s as here
fixed the continuous recalculation of 'p', but not the other problems.

It seems that the compiler sometimes erroneously tries to reduce register use
without weighing the cost in execution speed. It is inconsistent, sometimes
producing great code as here, and sometimes not so great as in PrimeSpeed.

I was looking for some general advice on how to write loops so that they
produce code as good as this. Do you plan to include SSA for the x86 version
as well?
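In case it helps to reproduce the comparison, here is a minimal self-contained
sketch of the two cases (the file name cull.go and the function names
cullBits32 and cullBits64 are only illustrative, not taken from FasterEratspeed
or PrimeSpeed); compiling it with go tool compile -B -S cull.go should show the
two inner loops side by side:

	package cull

	// cullBits32 mirrors loop variation 2) above: the bit index 'k', the
	// step 'q', and the limit 'lngthb' are all uint32, and each pass sets
	// bit (k & 31) of the 32-bit word buf[k>>5].
	func cullBits32(buf []uint32, k, q, lngthb uint32) {
		for ; k < lngthb; k += q {
			buf[k>>5] |= 1 << (k & 31)
		}
	}

	// cullBits64 is the same loop but with uint64 loop variables and limit,
	// resembling (though not copied from) the PrimeSpeed version that was
	// reported to compile to worse code.
	func cullBits64(buf []uint32, k, q, lngthb uint64) {
		for ; k < lngthb; k += q {
			buf[k>>5] |= 1 << (k & 31)
		}
	}

If the behaviour described above holds, the 32-bit function should compile to
a ten-instruction loop like the one at the top of this message, and the 64-bit
function can be diffed against it to see where the extra instructions come
from.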