Thanks Jeff,

I was really hoping that I had missed something and there was a better answer.
But would it do any harm if the combiner tried to check every piece of a
parallel like that and, if every component is matchable and the total cost
is no worse, emitted them separately?
It would change nothing for single-issue machines beyond some reordering,
but it would help many multi-issue ones...
What are the pitfalls of this approach?

Thanks

On Thu, Jan 14, 2016 at 8:55 PM, Jeff Law <l...@redhat.com> wrote:
> On 01/14/2016 04:47 PM, Igor Shevlyakov wrote:
>>
>> Guys,
>>
>> I'm trying to make the compiler generate better code for a superscalar
>> in-order machine, but I can't find the right way to do it.
>>
>> Imagine the following code:
>>
>> long f(long* p, long a, long b)
>> {
>>    long a1 = a << 2;
>>    long a2 = a1 + b;
>>    return p[a1] + p[a2];
>> }
>
>>
>> by default the compiler generates something like this, in pseudo-asm:
>>
>>          shl     r3, r3, 2
>>          add     r4, r3, r4
>>          ld8     r15, [r2 + r3 * 8]
>>          ld8     r2, [r2 + r4 * 8]
>>         {    add     r2, r2, r15  ;    ret    }
>>
>> but it would be way better this way:
>>
>>    {   sh_add  r4, r4, (r3 << 2)   ;   shl    r3, r3, 2  }
>>    {   ld8     r15, [r2 + r3 * 8]  ;   ld8    r2, [r2 + r4 * 8] }
>>    {    add     r2, r2, r15  ;    ret    }
>
>>
>> The 2nd sequence is 2 cycles shorter. The combiner pass even shows patterns
>> like this, but it fails to transform them because they are wrapped in a parallel:
>>
>> Failed to match this instruction:
>> (parallel [
>>          (set (reg:DI 56)
>>              (plus:DI (mult:DI (reg:DI 3 r3 [ a ])
>>                      (const_int 4 [0x4]))
>>                  (reg:DI 4 r4 [ b ])))
>>          (set (reg/v:DI 40 [ a1 ])
>>              (ashift:DI (reg:DI 3 r3 [ a ])
>>                  (const_int 2 [0x2])))
>>      ])
>
> You can always write a pattern which matches the PARALLEL.  You can then
> either arrange to emit the assembly code from that pattern or split the
> pattern (after reload/lra).
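> For your example, a rough sketch might look like the following (this is for a
> hypothetical port; the predicates, constraints and operand modes are
> illustrative, not taken from a real backend).  It lets combine keep the
> parallel as a single recognized insn and then breaks it back into two
> independent insns after reload so the scheduler can place them:
>
>   ;; Match the parallel combine builds for "a * 4 + b" together with
>   ;; "a << 2".  The earlyclobbers keep both destinations disjoint from
>   ;; the inputs, so the order of the split insns does not matter.
>   (define_insn_and_split "*shadd_and_shift"
>     [(set (match_operand:DI 0 "register_operand" "=&r")
>           (plus:DI (mult:DI (match_operand:DI 1 "register_operand" "r")
>                             (const_int 4))
>                    (match_operand:DI 2 "register_operand" "r")))
>      (set (match_operand:DI 3 "register_operand" "=&r")
>           (ashift:DI (match_dup 1) (const_int 2)))]
>     ""
>     "#"
>     "&& reload_completed"
>     [(set (match_dup 0)
>           (plus:DI (mult:DI (match_dup 1) (const_int 4))
>                    (match_dup 2)))
>      (set (match_dup 3)
>           (ashift:DI (match_dup 1) (const_int 2)))]
>     "")
>
> Alternatively, instead of the "#"/split pair, the pattern can simply output
> the bundled assembly itself.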
>
>>
>> What would be a proper way to perform reorganizations like this in a
>> general way?
>>
>> The same goes for pointer increments:
>>
>> add r2, r2, 1
>> ld r3, [r2+0]
>>
>> would be much better off like this:
>>
>> { ld r3, [r2 + 1] ; add r2, r2, 1 }
>>
>> Are those kinds of things overlooked, or did I fail to set something in the
>> machine-dependent portion?
>
> Similarly.  You may also get some mileage from LEGITIMIZE_ADDRESS, though it
> may not see the add/load together, which would hinder its ability to generate
> the code you want.
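> The pointer-increment case can be handled the same way, with a pattern that
> matches the load/add parallel combine forms (similar in spirit to the "load
> with update" patterns in rs6000.md).  Again this is only a sketch for a
> hypothetical port, with made-up predicates, constraints and output syntax:
>
>   ;; Load from [base + offset] while adding the offset into the base.
>   ;; Operand 1 is tied to operand 0 so the add writes the base back;
>   ;; the earlyclobber keeps the loaded value out of the base register.
>   (define_insn "*load_update"
>     [(set (match_operand:DI 3 "register_operand" "=&r")
>           (mem:DI (plus:DI (match_operand:DI 1 "register_operand" "0")
>                            (match_operand:DI 2 "const_int_operand" "i"))))
>      (set (match_operand:DI 0 "register_operand" "=r")
>           (plus:DI (match_dup 1) (match_dup 2)))]
>     ""
>     "{ ld %3, [%1 + %2] ; add %0, %1, %2 }")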
>
> Note that using a define_split to match these things prior to reload likely
> won't work because combine will likely see the split pattern as being the
> same cost as the original insns.
>
> In general the combiner really isn't concerned with superscalar issues,
> though you can tackle some superscalar things with creative patterns that
> match parallels or which match something more complex, but then split it up
> later.
>
> Note that GCC supports a number of superscalar architectures -- they were
> very common for workstations and high end embedded processors for many
> years.  MIPS, PPC, HPPA, even x86, etc., all have variants which were tuned
> for superscalar code generation.  I'm sure there are tricks you can exploit in
> every one of those architectures to help generate code with fewer data
> dependencies and thus more opportunities to exploit the superscalar nature
> of your processor.
>
> jeff
