>  Do you plan to include SSA for the x86 version as well?

To answer this: yes, the plan seems to be to port every supported
architecture to SSA. It was discussed here:
https://groups.google.com/d/msg/golang-dev/fSIl5Sbr4ek/10sgOsnDEAAJ

On Tuesday, 21 June 2016 at 03:39:41 UTC+2, gordo...@gmail.com wrote:
>
> On Monday, June 20, 2016 at 6:33:29 AM UTC-7, gordo...@gmail.com wrote: 
> Further to the subject of compiler efficiency, the following is the 
> assembler code output with array bounds checking turned off (-B) for the 
> inner tight composite culling loop of FasterEratspeed above (generated with 
> go tool compile -B -S FasterEratspeed.go > FasterEratspeed.asm): 
>
>         0x0051 00081 (main.go:426)        MOVL        R11, CX 
>         0x0054 00084 (main.go:426)        SHRL        $5, R11 
>         0x0058 00088 (main.go:428)        MOVL        (R9)(R11*4), R13 
>         0x005c 00092 (main.go:430)        MOVL        $1, R14 
>         0x0062 00098 (main.go:430)        SHLL        CX, R14 
>         0x0065 00101 (main.go:430)        ORL        R13, R14 
>         0x0068 00104 (main.go:431)        MOVL        R14, (R9)(R11*4) 
>         0x006c 00108 (main.go:429)        LEAL        (R12)(CX*1), R11 
>         0x0070 00112 (main.go:425)        CMPL        R11, R8 
>         0x0073 00115 (main.go:425)        JCS        $0, 81 
>
> At 10 instructions, this is about as tight as it gets other than for using 
> the more complex read/modify/write version of the ORL instruction, but that 
> doesn't seem to save much if any time given instruction latencies.  Note 
> that this code has eliminated the "k & 31" for the shift, seeming to 
> recognize that it isn't necessary as a long shift can't be greater than 31 
>
> Getting rid of the &31 is easy and I'll do that in 1.8. 
>   
> anyway, that unlike the simple PrimeSpeed program, this properly uses the 
> immediate load of '1', 
>
> I don't know what the issue is yet, but it shouldn't be hard to fix in 
> 1.8. 
>   
> that it cleverly uses the LEAL instruction to add the prime value 'q' in 
> R12 to the unmodified 'k' value in CX to produce the sum to the original 
> location of 'j' in R11 to save another instruction to move the results from 
> CX to R11. 
>
> The current SSA backend should do this also. 
>   
> No, Keith, you seem to have misunderstood: I wasn't complaining about the 
> above assembler code as produced by the 1.7beta1 compiler. I was 
> wondering why it isn't always this good, which is about as good as it gets 
> for this loop and already properly gets rid of &31, does a proper immediate 
> load of 1, and the clever use of the LEA instruction without the misuse of 
> the LEA instruction to continuously recalculate 'p'.  The assembler code 
> above is produced by either of the below loop variations: 
>
> 1) as it is in FasterEratspeed: 
>
>                                 for k < lngthb { 
>                                         pos := k >> 5 
>                                         data := k & 31 
>                                         bits := buf[pos] 
>                                         k += q 
>                                         bits |= 1 << data // two[data] 
>                                         buf[pos] = bits 
>                                 } 
>
> 2) I get the same assembler code if I change this to the simpler: 
>
>                                 for ; k < lngthb; k += q { 
>                                         buf[k>>5] |= 1 << (k & 31) 
>                                 } 
>
> where all variables and buffers are uint32. 
>
> My question was, why did the compiler produce this very good code for both 
> variations, yet produce something much worse for the same variation 2 
> loop in the simple PrimeSpeed code, with the main difference being that 
> PrimeSpeed uses 64-bit uint for the loop variables and loop limit.  Does 
> that give you a clue where the problem might be?  Converting PrimeSpeed to 
> use uint32's as here fixed the continuous recalculation of 'p' but not the 
> other problems. 
>
> It seems that sometimes the compiler erroneously tries to reduce register 
> use without applying the cost in execution speed to the decision.  It is 
> inconsistent, sometimes producing great code as here, and sometimes not so 
> great as in PrimeSpeed. 
>
> I was looking for some general advice on how to format loops so they 
> produce code as good as this? 
>
> Do you plan to include SSA for the x86 version as well?
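
For anyone who wants to reproduce the comparison, here is a minimal sketch that wraps the two quoted loop variations in standalone functions and checks that they set the same bits. The function names (cullSplit, cullCompact), the buffer size, and the starting values are assumptions made for illustration; they are not taken from FasterEratspeed itself. The file can be passed to go tool compile -B -S, as in the quoted message, to compare the assembler generated for each loop.

```go
package main

import "fmt"

// cullSplit mirrors variation 1 from the quoted message: the word index
// and bit index are computed separately before the word is updated.
// The function name is hypothetical.
func cullSplit(buf []uint32, k, q, lngthb uint32) {
	for k < lngthb {
		pos := k >> 5
		data := k & 31
		bits := buf[pos]
		k += q
		bits |= 1 << data
		buf[pos] = bits
	}
}

// cullCompact mirrors variation 2: the same work in a single statement.
// The function name is hypothetical.
func cullCompact(buf []uint32, k, q, lngthb uint32) {
	for ; k < lngthb; k += q {
		buf[k>>5] |= 1 << (k & 31)
	}
}

// equal reports whether two word buffers hold identical bits.
func equal(a, b []uint32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	const lngthb = 1 << 10 // assumed buffer size in bits, for illustration only
	a := make([]uint32, lngthb/32)
	b := make([]uint32, lngthb/32)

	// Cull every third bit starting at bit 9 (arbitrary example values).
	cullSplit(a, 9, 3, lngthb)
	cullCompact(b, 9, 3, lngthb)

	fmt.Println("identical:", equal(a, b))
}
```

Both functions take every value as uint32, matching the quoted note that all variables and buffers are uint32. Changing the parameters to uint or uint64 should make it possible to check whether the width of the loop variables is what changes the generated code.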
