OK, for the i386 port, I use uint32_t instead of uint64_t because otherwise the generated assembly code is a bit complicated (I'm on a 32-bit machine).
The tree dumps from final_cleanup are the same for the goo function:

    goo (i)
    {
    <bb 2>:
      return data[i + 13] + data[i];
    }

However, the first RTL dump from expand gives this for the i386 port:

    (insn 6 5 7 3 ld.c:17 (parallel [
                (set (reg:SI 61)
                    (plus:SI (reg/v:SI 59 [ i ])
                        (const_int 13 [0xd])))
                (clobber (reg:CC 17 flags))
            ]) -1 (nil))

    (insn 7 6 8 3 ld.c:17 (set (reg/f:SI 62)
            (symbol_ref:SI ("data") <var_decl 0xb7e7ce60 data>)) -1 (nil))

    (insn 8 7 9 3 ld.c:17 (set (reg/f:SI 63)
            (symbol_ref:SI ("data") <var_decl 0xb7e7ce60 data>)) -1 (nil))

    (insn 9 8 10 3 ld.c:17 (set (reg:SI 64)
            (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
                        (const_int 4 [0x4]))
                    (reg/f:SI 63)) [3 data S4 A32])) -1 (nil))

    (insn 10 9 11 3 ld.c:17 (set (reg:SI 65)
            (mem/s:SI (plus:SI (mult:SI (reg:SI 61)
                        (const_int 4 [0x4]))
                    (reg/f:SI 62)) [3 data S4 A32])) -1 (nil))

    (insn 11 10 12 3 ld.c:17 (parallel [
                (set (reg:SI 60)
                    (plus:SI (reg:SI 65)
                        (reg:SI 64)))
                (clobber (reg:CC 17 flags))
            ]) -1 (expr_list:REG_EQUAL (plus:SI (mem/s:SI (plus:SI (mult:SI (reg:SI 61)
                            (const_int 4 [0x4]))
                        (reg/f:SI 62)) [3 data S4 A32])
                (mem/s:SI (plus:SI (mult:SI (reg/v:SI 59 [ i ])
                            (const_int 4 [0x4]))
                        (reg/f:SI 63)) [3 data S4 A32]))
            (nil)))

As we can see, the compiler adds 13 to i, loads the address of data, multiplies each index by 4 (inside the memory address) to scale it to the element size, then performs the two loads and finally adds the results.
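For reference, the source of ld.c is not shown in this message. A minimal sketch of what the i386 variant of goo might look like, reconstructed only from the final_cleanup dump, is given here. The array length DATA_LEN, the uint32_t parameter type and the bounds assert are assumptions made for the sketch, not taken from the original file.

```c
#include <assert.h>
#include <stdint.h>

#define DATA_LEN 64  /* hypothetical size; the real array length is not shown */

/* Global array, matching the symbol_ref "data" seen in the RTL dumps. */
uint32_t data[DATA_LEN];

/* Reconstructed from the tree dump: return data[i + 13] + data[i]. */
uint32_t goo(uint32_t i)
{
    assert(i + 13 < DATA_LEN);  /* guard added for this sketch only */
    return data[i + 13] + data[i];
}
```

For the 64-bit port, the same sketch would use uint64_t for data and i, which is where the element size of 8 (and hence the shift by 3) comes from.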
In my port, I get:

    (insn 6 5 7 3 ld.c:17 (set (reg:DI 75)
            (plus:DI (reg/v:DI 73 [ i ])
                (const_int 13 [0xd]))) -1 (nil))

    (insn 7 6 8 3 ld.c:17 (set (reg/f:DI 76)
            (symbol_ref:DI ("data") <var_decl 0xb7c85bb0 data>)) -1 (nil))

    (insn 8 7 9 3 ld.c:17 (set (reg:DI 78)
            (const_int 3 [0x3])) -1 (nil))

    (insn 9 8 10 3 ld.c:17 (set (reg:DI 77)
            (ashift:DI (reg:DI 75)
                (reg:DI 78))) -1 (nil))

    (insn 10 9 11 3 ld.c:17 (set (reg/f:DI 79)
            (plus:DI (reg/f:DI 76)
                (reg:DI 77))) -1 (nil))

    (insn 11 10 12 3 ld.c:17 (set (reg/f:DI 80)
            (symbol_ref:DI ("data") <var_decl 0xb7c85bb0 data>)) -1 (nil))

    (insn 12 11 13 3 ld.c:17 (set (reg:DI 82)
            (const_int 3 [0x3])) -1 (nil))

    (insn 13 12 14 3 ld.c:17 (set (reg:DI 81)
            (ashift:DI (reg/v:DI 73 [ i ])
                (reg:DI 82))) -1 (nil))

    (insn 14 13 15 3 ld.c:17 (set (reg/f:DI 83)
            (plus:DI (reg/f:DI 80)
                (reg:DI 81))) -1 (nil))

    (insn 15 14 16 3 ld.c:17 (set (reg:DI 84)
            (mem/s:DI (reg/f:DI 79) [2 data S8 A64])) -1 (nil))

    (insn 16 15 17 3 ld.c:17 (set (reg:DI 85)
            (mem/s:DI (reg/f:DI 83) [2 data S8 A64])) -1 (nil))

    (insn 17 16 18 3 ld.c:17 (set (reg:DI 74)
            (plus:DI (reg:DI 84)
                (reg:DI 85))) -1 (nil))

This seems to be the same idea, except that the constant 3 gets loaded into a register and a shift is performed, with the address computed in separate instructions instead of inside the mem. Could that be what is causing my problem in code generation?

I'm trying to figure out why my port is generating a shift instead of simply a mult. I actually changed the cost of shift to a large value, and then it uses adds instead of simply a mult. So I think this is an rtx_cost problem, where I'm not telling the compiler that a multiplication is the right choice in this case. I've been playing with rtx_cost but have been unable to get it to generate the right code.
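To make the two address forms and the cost trade-off concrete, here is a small standalone C sketch. The first two functions compute the same address the way each dump does (multiply folded into the address versus load 3, shift, add). The last function is a toy model only: it assumes the expander simply takes the cheapest of one multiply, one shift, or three doubling adds for a scale of 8. The names toy_costs, scale_op and choose_scale_by_8 are hypothetical and are not GCC's rtx_cost interface or its actual selection logic.

```c
#include <assert.h>
#include <stdint.h>

/* i386-style form: the scale stays a multiply inside the address. */
uint64_t addr_mult(uint64_t base, uint64_t i)
{
    return base + i * 8;
}

/* Form seen in the port's dump: load 3, shift the index, add the base. */
uint64_t addr_shift(uint64_t base, uint64_t i)
{
    uint64_t three = 3;            /* (set (reg:DI 78) (const_int 3)) */
    uint64_t scaled = i << three;  /* (ashift:DI ...) */
    return base + scaled;          /* (plus:DI ...) */
}

/* Hypothetical cost table; NOT GCC's rtx_cost values or types. */
struct toy_costs {
    int mult;
    int shift;
    int add;
};

enum scale_op { SCALE_MULT, SCALE_SHIFT, SCALE_ADDS };

/* Toy model: pick the cheapest way to compute x * 8.
   Candidates: one multiply, one shift by 3, or three doubling adds.
   Ties prefer the multiply, then the shift. */
enum scale_op choose_scale_by_8(struct toy_costs c)
{
    int adds = 3 * c.add;
    if (c.mult <= c.shift && c.mult <= adds)
        return SCALE_MULT;
    if (c.shift <= adds)
        return SCALE_SHIFT;
    return SCALE_ADDS;
}
```

Under this toy model, raising only the shift cost makes the adds win unless the multiply is also cheaper than three adds, which is consistent with the behaviour described above. Whether the real expander behaves this way for this port is an assumption to check against the dumps.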
Thanks again for your help and insight,
Jc

On Fri, May 8, 2009 at 5:18 AM, Paolo Bonzini <bonz...@gnu.org> wrote:
>
>> It seems that when set in a loop, the program is able to perform some
>> type of optimization to actually get the use of the offsets where as
>> in the case of no loop, we have twice the calculations of instructions
>> for each address calculations.
>
> I suggest you look at the dumps for i386 to see which pass does the
> changes, and then see what happens in your port.
>
> Paolo
>