On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote:
>
> Further to the subject of compiler efficiency, the following is the 
> assembler code output, with array bounds checking turned off (-B), for the 
> inner tight composite culling loop of FasterEratspeed above (generated with 
> go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm): 
>
>         0x0051 00081 (main.go:426)        MOVL        R11, CX 
>         0x0054 00084 (main.go:426)        SHRL        $5, R11 
>         0x0058 00088 (main.go:428)        MOVL        (R9)(R11*4), R13 
>         0x005c 00092 (main.go:430)        MOVL        $1, R14 
>         0x0062 00098 (main.go:430)        SHLL        CX, R14 
>         0x0065 00101 (main.go:430)        ORL        R13, R14 
>         0x0068 00104 (main.go:431)        MOVL        R14, (R9)(R11*4) 
>         0x006c 00108 (main.go:429)        LEAL        (R12)(CX*1), R11 
>         0x0070 00112 (main.go:425)        CMPL        R11, R8 
>         0x0073 00115 (main.go:425)        JCS        $0, 81 
>
> At 10 instructions, this is about as tight as it gets, other than using 
> the more complex read/modify/write form of the ORL instruction, but that 
> doesn't seem to save much if any time given instruction latencies.  Note 
> that this code has eliminated the "k & 31" mask for the shift, seemingly 
> recognizing that it isn't necessary, as a 32-bit shift count can't be greater than 31


Getting rid of the &31 is easy and I'll do that in 1.8.
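For context, here is a minimal sketch of the kind of bit-packed culling inner loop being discussed (the names cmpsts, start, q, and lngthb are my own paraphrase, not the actual FasterEratspeed source):

```go
package main

import "fmt"

// cullComposites marks every q'th bit of the bit-packed sieve cmpsts as
// composite, starting at bit index start.  The body compiles to the
// load/shift/or/store sequence shown in the assembly above; on x86 the
// "& 31" mask is redundant because SHLL already masks the shift count
// to 5 bits, which is the elimination discussed here.
func cullComposites(cmpsts []uint32, start, q, lngthb uint32) {
	for j := start; j < lngthb; j += q {
		cmpsts[j>>5] |= 1 << (j & 31)
	}
}

func main() {
	cmpsts := make([]uint32, 4) // 128 sieve bits
	cullComposites(cmpsts, 9, 3, 128)
	// bits 9, 12, 15, ... are now set
	fmt.Println(cmpsts[0]&(1<<9) != 0, cmpsts[0]&(1<<10) != 0)
}
```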
 

> anyway; that, unlike the simple PrimeSpeed program, it properly uses the 
> immediate load of '1',


I don't know what the issue is yet, but it shouldn't be hard to fix in 1.8.
 

> and that it cleverly uses the LEAL instruction to add the prime value 'q' in 
> R12 to the unmodified 'k' value in CX, producing the sum directly in R11, the 
> original location of 'j', which saves another instruction to move the result 
> from CX to R11. 
>
>
The current SSA backend should do this also.
 

> This uses a total of 7 registers, being CX, R8, R9, R11, R12, R13, and R14, 
> so it would be possible to make it work and run just as fast for x86 code. 
> This includes one register to hold the prime value 'q' in R12 and one 
> register to hold the loop limit 'lngthb' in R8.  An array bounds check 
> requires another register to hold the array upper bound, which won't be 
> hard for x64 code, but will slow x86 code, as there won't be enough 
> available registers unless the compiler can be made smart enough, for 
> certain loop formats, to recognize that the check isn't necessary, being 
> inherent to the loop condition.  It would be pretty impressive if all loops 
> could be compiled as well as this, as the result would run at very close to 
> C/C++ speed! 
>
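On the bounds-check point: one loop format that can let the compiler prove the check redundant is deriving the limit from len() of the slice being indexed, since j < uint32(len(cmpsts))*32 implies j>>5 < uint32(len(cmpsts)). Whether a given Go release actually eliminates the check should be verified with the BCE debug output (e.g. go build -gcflags="-d=ssa/check_bce/debug=1"); this is a sketch with my own names, not the original source:

```go
package main

import "fmt"

// cull derives the loop limit from len(cmpsts), so the index cmpsts[j>>5]
// is provably in range: j < uint32(len(cmpsts))*32 implies
// j>>5 < uint32(len(cmpsts)).  A compiler that reasons about this can
// drop the bounds check without needing an extra register for the bound.
func cull(cmpsts []uint32, start, q uint32) {
	lngthb := uint32(len(cmpsts)) * 32
	for j := start; j < lngthb; j += q {
		cmpsts[j>>5] |= 1 << (j & 31)
	}
}

func main() {
	cmpsts := make([]uint32, 4)
	cull(cmpsts, 4, 2)
	fmt.Println(cmpsts[0]&(1<<4) != 0)
}
```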
> When run within L1 cache on a modern Intel CPU, without memory or execution 
> pipeline bottlenecks, this code will run in about 3.5 CPU cycles per 
> loop, or about 2.86 instructions per cycle. 
>
> I'm looking for the reason that this code is so efficient while the former 
> simple PrimeSpeed code was so poor; one clue might be that I used native 
> uint's and int's for PrimeSpeed, which are 64-bit on the amd64 version of the 
> compiler, so that code was using 'Q'-suffix instructions, which perhaps 
> it doesn't handle as efficiently as where uint32's and int32's are used 
> here. 
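One way to chase that clue is to write the same loop twice, once with native uint indices (64-bit, Q-suffix instructions on amd64) and once with uint32 (L-suffix), and compare the go tool compile -S output for each; a sketch with my own names, assuming the loop body from the discussion above:

```go
package main

import "fmt"

// cull64 indexes with native uint; on amd64 this compiles to 64-bit
// (Q-suffix) instructions, as in the PrimeSpeed case described above.
func cull64(cmpsts []uint32, start, q, limit uint) {
	for j := start; j < limit; j += q {
		cmpsts[j>>5] |= 1 << (j & 31)
	}
}

// cull32 is identical except for uint32 indices, which compile to
// 32-bit (L-suffix) instructions, as in the FasterEratspeed case.
func cull32(cmpsts []uint32, start, q, limit uint32) {
	for j := start; j < limit; j += q {
		cmpsts[j>>5] |= 1 << (j & 31)
	}
}

func main() {
	a := make([]uint32, 4)
	b := make([]uint32, 4)
	cull64(a, 9, 3, 128)
	cull32(b, 9, 3, 128)
	// The results are identical; only the generated instructions differ.
	fmt.Println(a[0] == b[0] && a[1] == b[1] && a[2] == b[2] && a[3] == b[3])
}
```

Diffing the two assembly listings isolates the cost of the 64-bit index arithmetic from everything else in the loop.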
>
> Can the compiler guys contribute anything on how to format tight loop code 
> so it compiles as well as this?  If the crucial parts of the compiler and 
> libraries were written to compile to code as efficient as this, I don't 
> think anyone would have problems with golang's speed. 
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.