On 01/14/2016 04:47 PM, Igor Shevlyakov wrote:
Guys,
I'm trying to make the compiler generate better code for a superscalar
in-order machine, but can't find the right way to do it.
Imagine the following code:
long f(long* p, long a, long b)
{
long a1 = a << 2;
long a2 = a1 + b;
return p[a1] + p[a2];
}
By default the compiler generates something like this, in pseudo-asm:
shl r3, r3, 2
add r4, r3, r4
ld8 r15, [r2 + r3 * 8]
ld8 r2, [r2 + r4 * 8]
{ add r2, r2, r15 ; ret }
but it would be much better this way:
{ sh_add r4, r4, (r3 << 2) ; shl r3, r3, 2 }
{ ld8 r15, [r2 + r3 * 8] ; ld8 r2, [r2 + r4 * 8] }
{ add r2, r2, r15 ; ret }
The second sequence is two cycles shorter. The combine pass even shows
patterns like this, but fails to transform them because they are wrapped
in a PARALLEL:
Failed to match this instruction:
(parallel [
(set (reg:DI 56)
(plus:DI (mult:DI (reg:DI 3 r3 [ a ])
(const_int 4 [0x4]))
(reg:DI 4 r4 [ b ])))
(set (reg/v:DI 40 [ a1 ])
(ashift:DI (reg:DI 3 r3 [ a ])
(const_int 2 [0x2])))
])
You can always write a pattern which matches the PARALLEL. You can then
either arrange to emit the assembly code from that pattern, or split the
pattern (after reload/LRA).
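A sketch of what such a pattern might look like — the RTL shape mirrors the
PARALLEL from the combine dump above, but the mode, predicates, constraints
and output template are placeholders for whatever the target actually uses:

```lisp
;; Sketch only: matches the shift-add/shift PARALLEL that combine builds.
;; DImode, "register_operand" and the "r" constraints are stand-ins.
(define_insn "*shadd_shl"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (plus:DI (mult:DI (match_operand:DI 2 "register_operand" "r")
                          (const_int 4))
                 (match_operand:DI 3 "register_operand" "r")))
   (set (match_operand:DI 1 "register_operand" "=r")
        (ashift:DI (match_dup 2)
                   (const_int 2)))]
  ""
  "{ sh_add %0, %3, (%2 << 2) ; shl %1, %2, 2 }")
```

Alternatively, a define_insn_and_split with the same RTL template could keep
the pair together through combine and split it back apart after reload.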
What would be the proper way to perform reorganizations like this in a
general way?
The same goes with the pointer increment:
add r2, r2, 1
ld r3, [r2+0]
would be much better off like this:
{ ld r3, [r2 + 1] ; add r2, r2, 1 }
Are those kinds of things overlooked, or did I fail to set something in
the machine-dependent portion?
Similarly. You may also get some mileage from LEGITIMIZE_ADDRESS,
though it may not see the add and the load together, which would hinder
its ability to generate the code you want.
Note that using a define_split to match these things prior to reload
likely won't work because combine will likely see the split pattern as
being the same cost as the original insns.
In general the combiner really isn't concerned with superscalar issues,
though you can tackle some superscalar things with creative patterns
that match parallels, or that match something more complex and then
split it up later.
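For the pointer-increment case, one such creative pattern might bundle the
load with the update in a single PARALLEL. Again a sketch with placeholder
modes, predicates and constraints; note that a real port would more likely
enable post-increment/pre-modify addressing (POST_INC / PRE_MODIFY) and let
the auto-inc-dec machinery form these instead:

```lisp
;; Sketch: load and pointer update issued together.  The load uses the
;; old base plus 1, and the base register ends up incremented by 1,
;; matching the { ld ; add } bundle from the mail.
(define_insn "*load_update"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (mem:DI (plus:DI (match_operand:DI 1 "register_operand" "+r")
                         (const_int 1))))
   (set (match_dup 1)
        (plus:DI (match_dup 1)
                 (const_int 1)))]
  ""
  "{ ld %0, [%1 + 1] ; add %1, %1, 1 }")
```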
Note that GCC supports a number of superscalar architectures -- they
were very common for workstations and high end embedded processors for
many years. MIPS, PPC, HPPA, even x86, etc. all have variants which were
tuned for superscalar code generation. I'm sure there's tricks you can
exploit in every one of those architectures to help generate code with
fewer data dependencies and thus more opportunities to exploit the
superscalar nature of your processor.
jeff