On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote:
>
> Further to the subject of compiler efficiency, the following is the
> assembler code output with array bounds checking turned off (-B) for the
> inner tight composite culling loop of FasterEratspeed above (generated with
> go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm):
>
> 0x0051 00081 (main.go:426) MOVL R11, CX
> 0x0054 00084 (main.go:426) SHRL $5, R11
> 0x0058 00088 (main.go:428) MOVL (R9)(R11*4), R13
> 0x005c 00092 (main.go:430) MOVL $1, R14
> 0x0062 00098 (main.go:430) SHLL CX, R14
> 0x0065 00101 (main.go:430) ORL R13, R14
> 0x0068 00104 (main.go:431) MOVL R14, (R9)(R11*4)
> 0x006c 00108 (main.go:429) LEAL (R12)(CX*1), R11
> 0x0070 00112 (main.go:425) CMPL R11, R8
> 0x0073 00115 (main.go:425) JCS $0, 81
>
> At 10 instructions, this is about as tight as it gets, other than by using
> the more complex read/modify/write version of the ORL instruction, but that
> doesn't seem to save much if any time given instruction latencies. Note
> that this code has eliminated the "k & 31" for the shift, seeming to
> recognize that it isn't necessary as a 32-bit shift can't be greater than 31
Getting rid of the &31 is easy, and I'll do that in 1.8.

> anyway, and that, unlike the simple PrimeSpeed program, this properly uses
> the immediate load of '1',

I don't know what the issue is yet, but it shouldn't be hard to fix in 1.8.

> and that it cleverly uses the LEAL instruction to add the prime value 'q'
> in R12 to the unmodified 'k' value in CX, producing the sum in the original
> location of 'j' in R11, saving another instruction to move the result from
> CX to R11.

The current SSA backend should do this also.

> This uses a total of 7 registers: CX, R8, R9, R11, R12, R13, and R14,
> so it would be possible to make it work and run just as fast for x86 code.
> This includes one register to hold the prime value 'q' in R12 and one
> register to hold the loop limit 'lngthb' in R8. An array bounds check
> requires another register to hold the array upper bound, which won't be
> hard for x64 code, but will slow x86 code as there won't be enough
> available registers, unless the compiler can be made smart enough, for
> certain loop formats, to recognize that the check isn't necessary, being
> inherent in the loop condition. It would be pretty impressive if all loops
> could be compiled as well as this, as they would run at very close to
> C/C++ speed!
>
> When run within the L1 cache on a modern Intel CPU, without memory or
> execution-pipeline bottlenecks, this code runs in about 3.5 CPU cycles per
> loop, or about 2.86 instructions per cycle.
>
> I'm looking for the reason that this code is so efficient while the former
> simple PrimeSpeed code was so poor; one clue might be that I used native
> uint's and int's for PrimeSpeed, which are 64-bit with an amd64 build of
> the compiler, so that code used 'Q'-suffix instructions, which perhaps
> aren't handled as efficiently as the uint32's and int32's used here.
>
> Can the compiler guys contribute anything on how to format tight loop code
> so it compiles as well as this?
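For concreteness, the inner culling loop that produces assembly like the above presumably looks something like the sketch below. The actual FasterEratspeed source isn't shown in this thread, so treat this as a reconstruction; the names bits, j, q, and lngthb follow the register assignments described in the quoted message.

```go
package main

import "fmt"

// cullComposites marks every q-th bit of a packed uint32 bit array,
// starting at bit j and stopping at bit limit lngthb. This mirrors the
// 10-instruction loop quoted above.
func cullComposites(bits []uint32, j, q, lngthb uint32) {
	for ; j < lngthb; j += q {
		// j >> 5 selects the 32-bit word (the SHRL $5 / MOVL pair),
		// and 1 << (j & 31) selects the bit within it. Because j is a
		// uint32, shifting by j&31 and by j give the same SHLL result,
		// which is why the compiler can drop the mask.
		bits[j>>5] |= 1 << (j & 31)
	}
}

func main() {
	bits := make([]uint32, 4) // room for 128 bits
	cullComposites(bits, 9, 6, 128)
	fmt.Printf("%#08x %#08x\n", bits[0], bits[1])
}
```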
> If the crucial parts of the compiler and libraries were written to compile
> to code as efficient as this, I don't think anyone would have problems
> with golang's speed.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
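On the bounds-check question from the quoted message: one way to phrase the loop so the check is inherent in the loop condition is to derive the limit from len() of the slice itself, so that j < lngthb provably implies j>>5 < len(bits). This is only a sketch of the pattern, not a guarantee; whether a given compiler release actually proves the bound and drops the check can be inspected with go build -gcflags='-d=ssa/check_bce/debug=1'.

```go
package main

import "fmt"

// cullBCE is a bounds-check-friendly variant of the culling loop: the
// limit lngthb is computed from len(bits), so the loop condition itself
// bounds the index j>>5. (With an externally supplied limit, the compiler
// has no such relationship to work with.)
func cullBCE(bits []uint32, j, q uint32) {
	lngthb := uint32(len(bits)) * 32 // limit tied directly to the slice length
	for ; j < lngthb; j += q {
		bits[j>>5] |= 1 << (j & 31)
	}
}

func main() {
	bits := make([]uint32, 2) // 64 bits
	cullBCE(bits, 3, 10)      // mark bits 3, 13, 23, 33, 43, 53, 63
	fmt.Printf("%#08x %#08x\n", bits[0], bits[1])
}
```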