On 01/14/2016 04:47 PM, Igor Shevlyakov wrote:
Guys,

I'm trying to make the compiler generate better code for a superscalar
in-order machine, but I can't find the right way to do it.

Imagine the following code:

long f(long* p, long a, long b)
{
   long a1 = a << 2;
   long a2 = a1 + b;
   return p[a1] + p[a2];
}



By default the compiler generates something like this, in some pseudo-asm:

         shl     r3, r3, 2
         add     r4, r3, r4
         ld8     r15, [r2 + r3 * 8]
         ld8     r2,  [r2 + r4 * 8]
         {   add     r2, r2, r15   ;   ret   }

but it would be way better this way:

   {   sh_add  r4, r4, (r3 << 2)   ;   shl     r3, r3, 2          }
   {   ld8     r15, [r2 + r3 * 8]  ;   ld8     r2, [r2 + r4 * 8]  }
   {   add     r2, r2, r15         ;   ret                        }
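As a quick sanity check (plain C, not compiler output; the function names here are mine), the fused schedule computes the same result as the original: the sh_add evaluates (a << 2) + b directly, so a2 no longer has to wait for a1 to be written back.

```c
/* Original data flow, as in f() above: a2 depends on a1. */
static long f_orig(long *p, long a, long b)
{
    long a1 = a << 2;
    long a2 = a1 + b;
    return p[a1] + p[a2];
}

/* Fused variant: sh_add computes (a << 2) + b directly, in parallel
   with the shift, so neither index depends on the other. */
static long f_fused(long *p, long a, long b)
{
    long a2 = (a << 2) + b;   /* sh_add r4, r4, (r3 << 2) */
    long a1 = a << 2;         /* shl    r3, r3, 2 (same bundle) */
    return p[a1] + p[a2];
}
```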



The second sequence is 2 cycles shorter. The combine pass even shows patterns
like this, but fails to transform them because they are wrapped in a PARALLEL:

Failed to match this instruction:
(parallel [
         (set (reg:DI 56)
             (plus:DI (mult:DI (reg:DI 3 r3 [ a ])
                     (const_int 4 [0x4]))
                 (reg:DI 4 r4 [ b ])))
         (set (reg/v:DI 40 [ a1 ])
             (ashift:DI (reg:DI 3 r3 [ a ])
                 (const_int 2 [0x2])))
     ])
You can always write a pattern which matches the PARALLEL. You can then either arrange to emit the assembly code from that pattern, or split the pattern (after reload/lra).
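A minimal machine-description sketch of that first option (hypothetical: the insn name, constraints, and the sh_add/shl bundle syntax are invented for this target, not taken from a real port) — a define_insn whose pattern is exactly the PARALLEL combine tried to match:

```lisp
;; Hypothetical sketch for the target's .md file.  Note the first set
;; uses (mult ... 4) rather than (ashift ... 2), matching combine's
;; canonical form shown in the dump above.  The "=&r" earlyclobbers
;; are illustrative: both results are written in the same bundle, so
;; neither output may overlap an input still needed by the other set.
(define_insn "*shadd_and_shift"
  [(set (match_operand:DI 0 "register_operand" "=&r")
        (plus:DI (mult:DI (match_operand:DI 2 "register_operand" "r")
                          (const_int 4))
                 (match_operand:DI 3 "register_operand" "r")))
   (set (match_operand:DI 1 "register_operand" "=&r")
        (ashift:DI (match_dup 2)
                   (const_int 2)))]
  ""
  "{ sh_add %0, %3, (%2 << 2) ; shl %1, %2, 2 }")
```

For the second option, a define_insn_and_split with the same pattern could carry the two sets as one insn through the early passes and split them back apart after reload.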


What would be a proper way to perform reorganizations like this in a general way?

The same goes for pointer increments:

add r2, r2, 1
ld r3, [r2+0]

would be much better like this:

{ ld r3, [r2 + 1] ; add r2, r2, 1 }

Are those kinds of things overlooked, or did I fail to set something in the
machine-dependent portion?
Similarly. You may also get some mileage from LEGITIMIZE_ADDRESS, though it may not see the add/load together, which would hinder its ability to generate the code you want.
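For the increment/load case specifically, one avenue (a hedged sketch — the explicit pre_inc form, the QImode/DImode choices, and the ld mnemonic are all assumptions about this target) is to expose a pre-increment address pattern so GCC's auto-inc-dec pass can fuse the add and the load:

```lisp
;; Hypothetical sketch.  Requires the target to advertise
;; HAVE_PRE_INCREMENT in its .h file; the auto-inc-dec pass can then
;; rewrite  add r2, r2, 1 ; ld r3, [r2+0]  into a single
;; pre-increment access.  pre_inc bumps the register by the size of
;; the accessed mode (1 byte for QImode) before the load.
(define_insn "*load_qi_pre_inc"
  [(set (match_operand:QI 0 "register_operand" "=r")
        (mem:QI (pre_inc:DI (match_operand:DI 1 "register_operand" "r"))))]
  ""
  "{ ld %0, [%1 + 1] ; add %1, %1, 1 }")
```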

Note that using a define_split to match these things prior to reload likely won't work because combine will likely see the split pattern as being the same cost as the original insns.

In general the combiner isn't really concerned with superscalar issues, though you can tackle some of them with creative patterns that match parallels, or that match something more complex and then split it up later.

Note that GCC supports a number of superscalar architectures -- they were very common for workstations and high-end embedded processors for many years. MIPS, PPC, HPPA, even x86, etc. all have variants which were tuned for superscalar code generation. I'm sure there are tricks you can exploit in every one of those architectures to help generate code with fewer data dependencies, and thus more opportunities to exploit the superscalar nature of your processor.



jeff
