Hi, I'm trying to get the compiler to generate better code for a superscalar in-order machine, but I can't find the right way to do it.
Imagine the following code:

    long f(long* p, long a, long b)
    {
      long a1 = a << 2;
      long a2 = a1 + b;
      return p[a1] + p[a2];
    }

By default the compiler generates something like this, in some pseudo-asm:

    shl r3, r3, 2
    add r4, r3, r4
    ld8 r15, [r2 + r3 * 8]
    ld8 r2, [r2 + r4 * 8]
    { add r2, r2, r15 ; ret }

but it would be way better this way:

    { sh_add r4, r4, (r3 << 2) ; shl r3, r3, 2 }
    { ld8 r15, [r2 + r3 * 8] ; ld8 r2, [r2 + r4 * 8] }
    { add r2, r2, r15 ; ret }

The second sequence is 2 cycles shorter. The combine pass even forms patterns like this, but fails to transform them because the two sets are wrapped in a parallel:

    Failed to match this instruction:
    (parallel [
        (set (reg:DI 56)
            (plus:DI (mult:DI (reg:DI 3 r3 [ a ])
                              (const_int 4 [0x4]))
                     (reg:DI 4 r4 [ b ])))
        (set (reg/v:DI 40 [ a1 ])
            (ashift:DI (reg:DI 3 r3 [ a ])
                       (const_int 2 [0x2])))
    ])

What would be the proper way to perform reorganizations like this in a general way?

The same goes for the pointer increment:

    add r2, r2, 1
    ld  r3, [r2 + 0]

would be much better off like this:

    { ld r3, [r2 + 1] ; add r2, r2, 1 }

Are these kinds of things simply overlooked, or have I failed to set something up in the machine-dependent portion?

Thanks a lot for your thoughts.
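For the combine case above, the usual answer is to give the port a define_insn that matches exactly the two-set parallel combine builds when the intermediate shift result is still live (combine only emits the parallel form when `a1` has further uses; otherwise it substitutes the shift into the add). A sketch — the insn name, predicates, constraints, and output template are assumptions about your port, not real code from it; note that the top-level vector of a define_insn is implicitly a parallel, so no explicit `(parallel ...)` is written:

    ;; Shift-add where the shifted value is also needed on its own.
    ;; Matches the two-set parallel that combine builds when the
    ;; shift result (operand 1) is live after the shift-add.
    (define_insn "*shadd_keep_shift"
      [(set (match_operand:DI 0 "register_operand" "=r")
            (plus:DI (mult:DI (match_operand:DI 2 "register_operand" "r")
                              (const_int 4))
                     (match_operand:DI 3 "register_operand" "r")))
       (set (match_operand:DI 1 "register_operand" "=r")
            (ashift:DI (match_dup 2) (const_int 2)))]
      ""
      "{ sh_add %0, %3, (%2 << 2) ; shl %1, %2, 2 }")

Your RTL dump shows the shift-add as `(plus (mult ... (const_int 4)) ...)` rather than `(ashift ... (const_int 2))` inside the plus, which is combine's canonical form for address arithmetic, so the pattern has to spell it the same way or it will never match.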
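For the pointer-increment case, GCC's auto-inc-dec pass can form these combined update-and-access instructions automatically, but only if the target advertises the addressing mode and accepts it in its address predicates. A sketch under assumptions (QImode element so the step is 1, DImode pointers, hypothetical insn name and output template):

    /* target .h file: tell the auto-inc-dec pass the hardware supports it
       (use HAVE_PRE_MODIFY_DISP instead for arbitrary displacements).  */
    #define HAVE_PRE_INCREMENT 1

    ;; target .md file: an insn that accepts the (pre_inc ...) address
    ;; the pass generates -- load from r2+1 and leave r2 incremented.
    (define_insn "*load_qi_preinc"
      [(set (match_operand:QI 0 "register_operand" "=r")
            (mem:QI (pre_inc:DI
                      (match_operand:DI 1 "register_operand" "+r"))))]
      ""
      "{ ld %0, [%1 + 1] ; add %1, %1, 1 }")

You also have to make the port's TARGET_LEGITIMATE_ADDRESS_P hook accept `pre_inc` (and friends) as valid addresses, otherwise the pass never proposes them — forgetting that is a common reason this looks "overlooked" when it is really just disabled.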