Guys,
I'm trying to make the compiler generate better code for a superscalar
in-order machine, but I can't find the right way to do it.
Consider the following code:
long f(long* p, long a, long b)
{
long a1 = a << 2;
long a2 = a1 + b;
return p[a1] + p[a2];
}
By default the compiler generates something like this (in pseudo-asm):
shl r3, r3, 2
add r4, r3, r4
ld8 r15, [r2 + r3 * 8]
ld8 r2, [r2 + r4 * 8]
{ add r2, r2, r15 ; ret }
but it would be much better this way:
{ sh_add r4, r4, (r3 << 2) ; shl r3, r3, 2 }
{ ld8 r15, [r2 + r3 * 8] ; ld8 r2, [r2 + r4 * 8] }
{ add r2, r2, r15 ; ret }
The second sequence is two cycles shorter. The combine pass even finds
patterns like this, but it fails to transform them because the result is
wrapped in a parallel:
Failed to match this instruction:
(parallel [
(set (reg:DI 56)
(plus:DI (mult:DI (reg:DI 3 r3 [ a ])
(const_int 4 [0x4]))
(reg:DI 4 r4 [ b ])))
(set (reg/v:DI 40 [ a1 ])
(ashift:DI (reg:DI 3 r3 [ a ])
(const_int 2 [0x2])))
])
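As far as I understand, one option is to give the backend an insn pattern
that matches this parallel directly, so combine's 2->2 combination
succeeds and the two halves can issue in one bundle. A minimal sketch for
the port's .md file follows; the "sh_add" mnemonic, the DI-only modes,
and the plain "r" constraints are assumptions about this hypothetical
port, and combine has already canonicalized the shift into
(mult ... (const_int 4)) as the dump shows:

```lisp
;; Hypothetical define_insn: match the parallel that combine builds
;; when both the shift-add result and the shifted value stay live.
;; The RTL mirrors the dumped pattern above.
(define_insn "*shadd_keep_shift"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (plus:DI (mult:DI (match_operand:DI 2 "register_operand" "r")
                          (const_int 4))
                 (match_operand:DI 3 "register_operand" "r")))
   (set (match_operand:DI 1 "register_operand" "=r")
        (ashift:DI (match_dup 2)
                   (const_int 2)))]
  ""
  "{ sh_add %0, %3, (%2 << 2) ; shl %1, %2, 2 }")
```

Alternatively, a define_split on the same parallel could break it into a
separate sh_add insn and shl insn and let the scheduler bundle them,
which may compose better with the rest of the port.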
What would be the proper way to perform reorganizations like this in a
general way?
The same goes for pointer increments:
add r2, r2, 1
ld r3, [r2+0]
would be much better like this:
{ ld r3, [r2 + 1] ; add r2, r2, 1 }
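For this case, I believe the auto-inc-dec pass can fold the pointer
update into the memory access, but only if the target advertises the
addressing modes. A sketch of what the port's target header might
declare follows; which of these the hardware actually supports is an
assumption here:

```c
/* Hypothetical fragment for the port's target .h file: advertise
   pre/post-increment and pre/post-modify addressing so the
   auto-inc-dec pass can merge pointer updates into loads/stores.  */
#define HAVE_PRE_INCREMENT    1
#define HAVE_POST_INCREMENT   1
#define HAVE_PRE_MODIFY_DISP  1
#define HAVE_POST_MODIFY_DISP 1
```

The port's TARGET_LEGITIMATE_ADDRESS_P hook and the memory constraints
in the load/store patterns would also have to accept the corresponding
PRE_INC/POST_INC/PRE_MODIFY RTL addresses and print them correctly.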
Are these kinds of things simply overlooked, or have I failed to set
something up in the machine-dependent portion?
Thanks a lot for your thoughts.