> Do you plan to include SSA for the x86 version as well?

To answer this: yes, the plan appears to be to port every supported architecture to SSA. It was discussed here: https://groups.google.com/d/msg/golang-dev/fSIl5Sbr4ek/10sgOsnDEAAJ

On Tuesday, June 21, 2016 at 03:39:41 UTC+2, gordo...@gmail.com wrote:
>
> On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote:
> Further to the subject of compiler efficiency, the following is the
> assembler code output with array bounds checking turned off (-B) for the
> inner tight composite culling loop of FasterEratspeed above (generated with
> go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):
>
>     0x0051 00081 (main.go:426) MOVL R11, CX
>     0x0054 00084 (main.go:426) SHRL $5, R11
>     0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
>     0x005c 00092 (main.go:430) MOVL $1, R14
>     0x0062 00098 (main.go:430) SHLL CX, R14
>     0x0065 00101 (main.go:430) ORL R13, R14
>     0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
>     0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
>     0x0070 00112 (main.go:425) CMPL R11, R8
>     0x0073 00115 (main.go:425) JCS $0, 81
>
> At 10 instructions, this is about as tight as it gets other than for using
> the more complex read/modify/write version of the ORL instruction, but that
> doesn't seem to save much if any time given instruction latencies. Note
> that this code has eliminated the "k & 31" for the shift, seeming to
> recognize that it isn't necessary as a long shift can't be greater than 31
>
> Getting rid of the &31 is easy and I'll do that in 1.8.
>
> anyway, that unlike the simple PrimeSpeed program, this properly uses the
> immediate load of '1',
>
> I don't know what the issue is yet, but it shouldn't be hard to fix in
> 1.8.
>
> that it cleverly uses the LEAL instruction to add the prime value 'q' in
> R12 to the unmodified 'k' value in CX to produce the sum to the original
> location of 'j' in R11 to save another instruction to move the results from
> CX to R11.
>
> The current SSA backend should do this also.
>
> No, Keith, you seem to have misunderstood: I wasn't complaining about the
> above assembler code as produced by the 1.7beta1 compiler; I was
> wondering why it isn't always this good, which is about as good as it gets
> for this loop and already properly gets rid of &31, does a proper immediate
> load of 1, and makes clever use of the LEA instruction without the misuse of
> the LEA instruction to continuously recalculate 'p'. The assembler code
> above is produced by either of the below loop variations:
>
> 1) as it is in FasterEratspeed:
>
>     for k < lngthb {
>         pos := k >> 5
>         data := k & 31
>         bits := buf[pos]
>         k += q
>         bits |= 1 << data // two[data]
>         buf[pos] = bits
>     }
>
> 2) I get the same assembler code if I change this to the simpler:
>
>     for ; k < lngthb; k += q {
>         buf[k>>5] |= 1 << (k & 31)
>     }
>
> where all variables and buffers are uint32.
>
> My question was, why did the compiler produce this very good code for both
> variations, yet produce something much worse for the same variation two
> loop in the simple PrimeSpeed code, with the main difference that
> PrimeSpeed uses 64-bit uint for the loop variables and loop limit. Does
> that give you a clue where the problem might be? Converting PrimeSpeed to
> use uint32's as here fixed the continuous recalculation of 'p' but not the
> other problems.
>
> It seems that sometimes the compiler erroneously tries to reduce register
> use without applying the cost in execution speed to the decision. It is
> inconsistent, sometimes producing great code as here, and sometimes not so
> great as in PrimeSpeed.
>
> I was looking for some general advice on how to format loops so they
> produce code as good as this?
>
> Do you plan to include SSA for the x86 version as well?