Reorder/combine insns on superscalar arch

2016-01-14 Thread Igor Shevlyakov
Guys,

I'm trying to make the compiler generate better code for a superscalar
in-order machine, but I can't find the right way to do it.

Imagine the following code:

long f(long* p, long a, long b)
{
  long a1 = a << 2;
  long a2 = a1 + b;
  return p[a1] + p[a2];
}

by default compiler generates something like this in some pseudo-asm:

shl  r3, r3, 2
add  r4, r3, r4
ld8  r15, [r2 + r3 * 8]
ld8  r2,  [r2 + r4 * 8]
   {add r2, r2, r15  ; ret}

but it would be way better this way:

  {   sh_add  r4, r4, (r3 << 2)   ; shl   r3, r3, 2  }
  {   ld8  r15, [r2 + r3 * 8]   ;   ld8  r2, [r2 + r4 * 8] }
  {add r2, r2, r15  ; ret}

The second sequence is 2 cycles shorter. The combine pass even forms patterns
like this, but fails to transform them because the pattern is wrapped in a parallel:

Failed to match this instruction:
(parallel [
(set (reg:DI 56)
(plus:DI (mult:DI (reg:DI 3 r3 [ a ])
(const_int 4 [0x4]))
(reg:DI 4 r4 [ b ])))
(set (reg/v:DI 40 [ a1 ])
(ashift:DI (reg:DI 3 r3 [ a ])
(const_int 2 [0x2])))
])
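For reference, a machine description can accept exactly such a parallel with a define_insn whose pattern vector lists both sets (the vector is implicitly a parallel). The sketch below is hypothetical: the pattern name, constraints, and bundle syntax are made up to match the pseudo-asm above.

```lisp
;; Hypothetical sketch only.  The pattern vector matches both sets from
;; the combine dump above and emits them as one dual-issue bundle.
;; In practice the outputs likely need "&r" early-clobber constraints,
;; since operand 1 must not overwrite operand 2 before it is read.
(define_insn "*shadd_shl"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (plus:DI (mult:DI (match_operand:DI 2 "register_operand" "r")
                          (const_int 4))
                 (match_operand:DI 3 "register_operand" "r")))
   (set (match_operand:DI 1 "register_operand" "=r")
        (ashift:DI (match_dup 2) (const_int 2)))]
  ""
  "{ sh_add\t%0, %3, (%2 << 2) ; shl\t%1, %2, 2 }")
```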

What would be a proper way to perform reorganizations like this in a general way?

The same goes with the pointer increment:

add r2, r2, 1
ld r3, [r2+0]

would be much better off like this:

{ ld r3, [r2 + 1] ; add r2, r2, 1 }
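The add/load case can sometimes be handled with a define_peephole2, which runs after register allocation. The sketch below is hedged: the replacement parallel still needs a matching define_insn for the bundle, and the mode and mnemonics are assumed from the pseudo-asm above.

```lisp
;; Hedged sketch: fold "add rN, rN, 1" followed by a load through rN
;; into one bundle.  In the replacement parallel both sets read the old
;; rN, so the load address is rN + 1, matching the bundled form above.
;; The condition excludes the case where the load clobbers the base reg.
(define_peephole2
  [(set (match_operand:DI 0 "register_operand" "")
        (plus:DI (match_dup 0) (const_int 1)))
   (set (match_operand:DI 1 "register_operand" "")
        (mem:DI (match_dup 0)))]
  "REGNO (operands[0]) != REGNO (operands[1])"
  [(parallel
     [(set (match_dup 1)
           (mem:DI (plus:DI (match_dup 0) (const_int 1))))
      (set (match_dup 0) (plus:DI (match_dup 0) (const_int 1)))])]
  "")
```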

Are these kinds of things overlooked, or did I fail to set something in the
machine-dependent portion?

Thanks a lot for your thoughts


Re: Reorder/combine insns on superscalar arch

2016-01-14 Thread Igor Shevlyakov
Thanks Jeff,

I really hoped that I had missed something and there was a better answer.
But would it do any harm if the combiner, on failing to match a parallel
like that, tried each component separately and, when every component is
matchable and the total cost is no worse, emitted them as separate insns?
It would change nothing for single-issue machines beyond some reordering,
but it would help many multi-issue ones...
What are the pitfalls of this approach?

Thanks

On Thu, Jan 14, 2016 at 8:55 PM, Jeff Law  wrote:
> On 01/14/2016 04:47 PM, Igor Shevlyakov wrote:
>>
>> Guys,
>>
>> I'm trying to make the compiler generate better code for a superscalar
>> in-order machine, but I can't find the right way to do it.
>>
>> Imagine the following code:
>>
>> long f(long* p, long a, long b)
>> {
>>long a1 = a << 2;
>>long a2 = a1 + b;
>>return p[a1] + p[a2];
>> }
>
>
>
>>
>> by default compiler generates something like this in some pseudo-asm:
>>
>>  shl  r3, r3, 2
>>  add  r4, r3, r4
>>  ld8  r15, [r2 + r3 * 8]
>>  ld8  r2,  [r2 + r4 * 8]
>> {add r2, r2, r15  ;ret}
>>
>> but it would be way better this way:
>>
>>    {   sh_add  r4, r4, (r3 << 2)   ; shl   r3, r3, 2  }
>>    {   ld8  r15, [r2 + r3 * 8]   ;   ld8  r2, [r2 + r4 * 8] }
>>{add r2, r2, r15  ;ret}
>
>
>
>>
>> The second sequence is 2 cycles shorter. The combine pass even forms patterns
>> like this, but fails to transform them because the pattern is wrapped in a parallel:
>>
>> Failed to match this instruction:
>> (parallel [
>>  (set (reg:DI 56)
>>  (plus:DI (mult:DI (reg:DI 3 r3 [ a ])
>>  (const_int 4 [0x4]))
>>  (reg:DI 4 r4 [ b ])))
>>  (set (reg/v:DI 40 [ a1 ])
>>  (ashift:DI (reg:DI 3 r3 [ a ])
>>  (const_int 2 [0x2])))
>>  ])
>
> You can always write a pattern which matches the PARALLEL.  You can then
> either arrange to emit the assembly code from that pattern or split the
> pattern (after reload/lra)
>
>>
>> What would be a proper way to perform reorganizations like this in a general
>> way?
>>
>> The same goes with the pointer increment:
>>
>> add r2, r2, 1
>> ld r3, [r2+0]
>>
>> would be much better off like this:
>>
>> { ld r3, [r2 + 1] ; add r2, r2, 1 }
>>
>> Are these kinds of things overlooked, or did I fail to set something in
>> the machine-dependent portion?
>
> Similarly.  You may also get some mileage from LEGITIMIZE_ADDRESS, though it
> may not see the add/load together which would hinder its ability to generate
> the code you want.
>
> Note that using a define_split to match these things prior to reload likely
> won't work because combine will likely see the split pattern as being the
> same cost as the original insns.
>
> In general the combiner really isn't concerned with superscalar issues,
> though you can tackle some superscalar things with creative patterns that
> match parallels or which match something more complex, but then split it up
> later.
>
> Note that GCC supports a number of superscalar architectures -- they were
> very common for workstations and high end embedded processors for many
> years.  MIPS, PPC, HPPA, even x86, etc all have variants which were tuned
> for superscalar code generation.  I'm sure there's tricks you can exploit in
> every one of those architectures to help generate code with fewer data
> dependencies and thus more opportunities to exploit the superscalar nature
> of your processor.
>
>
>
> jeff
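A minimal shape of what Jeff describes, matching the parallel so combine succeeds and then splitting it back into two independent insns after reload so the scheduler can dual-issue them, might look like this sketch (pattern name and constraints are hypothetical):

```lisp
;; Hedged sketch of a define_insn_and_split: the "#" template means the
;; insn is never output as-is; after reload it is split into the two
;; original sets, which the scheduler can then pair for dual issue.
(define_insn_and_split "*shadd_shl_split"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (plus:DI (mult:DI (match_operand:DI 2 "register_operand" "r")
                          (const_int 4))
                 (match_operand:DI 3 "register_operand" "r")))
   (set (match_operand:DI 1 "register_operand" "=r")
        (ashift:DI (match_dup 2) (const_int 2)))]
  ""
  "#"
  "&& reload_completed"
  [(set (match_dup 0)
        (plus:DI (mult:DI (match_dup 2) (const_int 4)) (match_dup 3)))
   (set (match_dup 1) (ashift:DI (match_dup 2) (const_int 2)))]
  "")
```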


Is this an FE bug or am I missing something?

2016-09-11 Thread Igor Shevlyakov
Guys,

The small sample below fails (at least on 6.1) for multiple targets. The
difference between the two functions starts at the very first tree pass...

Please confirm that I'm not crazy and it's not supposed to be like this...

Thanks

--
#include "limits.h"
#include "stdio.h"

int* __attribute__((noinline)) f1(int* p, int x)
{
  return &p[x + 1];
}

int* __attribute__((noinline)) f2(int* p, int x)
{
  return &p[1 + x];
}

int P[10];

int main()
{
  int x = INT_MAX;
  if (f1(P, x) != f2(P, x)) {
    printf("Error!\n");
    abort();
  }
}
--
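For context, with x == INT_MAX both `x + 1` and `1 + x` overflow signed int, which is undefined behaviour. A well-defined variant widens the index before the addition; this is a sketch, and `f_safe` is a hypothetical name, not part of the thread.

```c
#include <stddef.h>

/* Sketch: widen the int index to ptrdiff_t before adding, so the index
   computation itself cannot overflow even when x == INT_MAX.  (The
   resulting pointer must still point into the array, or the pointer
   arithmetic is undefined on its own.) */
int *f_safe(int *p, int x)
{
    return p + ((ptrdiff_t)x + 1);
}
```

With an in-bounds index this behaves like the original functions: `f_safe(a, 3)` yields `&a[4]`.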


Re: Is this an FE bug or am I missing something?

2016-09-12 Thread Igor Shevlyakov
Well, my concern is not what happens on overflow (which, in the second
case, -fsanitize=undefined will address), but rather the consistency of
the two cases.

p[x+1] generates RTL that leads to better code at the expense of a
possible overflow, while p[1+x] never overflows but leads to worse code.
It would be beneficial to make the behaviour consistent between the two cases.

Thanks for your input

On Mon, Sep 12, 2016 at 12:51 AM, Marc Glisse  wrote:
> On Sun, 11 Sep 2016, Igor Shevlyakov wrote:
>
>> The small sample below fails (at least on 6.1) for multiple targets. The
>> difference between the two functions starts at the very first tree pass...
>
>
> You are missing -fsanitize=undefined (and #include ).
>
> Please use the mailing list gcc-h...@gcc.gnu.org next time.
>
> --
> Marc Glisse