https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #8 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > 
> > --- Comment #6 from ktkachov at gcc dot gnu.org ---
> > (In reply to rguent...@suse.de from comment #5)
> > > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > > > 
> > > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > > (In reply to Richard Biener from comment #2)
> > > > > So any hint on whether the code after r257077 is better or worse than 
> > > > > before?
> > > > 
> > > > Looks worse unfortunately:
> > > > For aarch64 at -O2 it generates:
> > > > foo:
> > > >         mov     w3, 44
> > > >         mov     w2, 40
> > > >         mov     w5, 1
> > > >         mov     w4, 2
> > > >         smull   x3, w1, w3
> > > >         smull   x2, w1, w2
> > > >         str     w5, [x0, x3]
> > > >         add     x2, x2, 400
> > > >         add     x1, x2, x1, sxtw 2
> > > >         str     w4, [x0, x1]
> > > >         ret
> > > > 
> > > > whereas with r257077 it generates the shorter:
> > > > foo:
> > > >         mov     w3, 40
> > > >         sxtw    x2, w1
> > > >         mov     w4, 1
> > > >         smaddl  x0, w1, w3, x0
> > > >         mov     w3, 2
> > > >         add     x1, x0, x2, lsl 2
> > > >         str     w4, [x0, x2, lsl 2]
> > > >         str     w3, [x1, 400]
> > > >         ret
> > > 
> > > So shorter is worse?  Might be because I don't understand the
> > > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > > of the [x1, 400] addressing.
> > 
> > Sorry, I messed up the writeup. Let me try again.
> > The shorter sequence (with the smaddl) is the good one and is produced
> > *without* r257077. After r257077 we generate the longer and worse sequence
> > with two smull.
> 
> I see the shorter sequence with TOT, r257077 included.  The testcase
> explicitly checks for no widen-mult-plus but we now have two:
> 
>   <bb 2> [local count: 1073741825]:
>   _17 = Idx_6(D) w* 44;
>   _13 = Arr_7(D) + _17;
>   MEM[(int[10] *)_13] = 1;
>   _4 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 40, 400>;
>   _18 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 4, _4>;
>   _16 = Arr_7(D) + _18;
>   MEM[(int[10] *)_16] = 2;
>   return;
> 
> note the "shorter" sequence I see is
> 
> foo:
>         mov     x4, 400
>         mov     w3, 40
>         mov     w2, 44
>         mov     w5, 1
>         smaddl  x3, w1, w3, x4
>         mov     w4, 2
>         smull   x2, w1, w2
>         add     x1, x3, x1, sxtw 2
>         str     w5, [x0, x2]
>         str     w4, [x0, x1]
>         ret
> 
> which doesn't 1:1 match either of yours.

Hmm, the exact instruction mix will depend a lot on the CPU tuning in use,
because the RTX costs affect how the widening multiplication is expanded. At
the tree level, though, I see only one WIDEN_MULT_PLUS_EXPR with current ToT
(which includes r257077):

  <bb 2> [local count: 1073741825]:
  _1 = (long unsigned int) Idx_6(D);
  _2 = Idx_6(D) w* 40;
  _3 = Arr_7(D) + _2;
  _12 = Idx_6(D) w* 4;
  _11 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _11;
  MEM[(int[10] *)_13] = 1;
  _4 = _2 + 400;
  _5 = Arr_7(D) + _4;
  _14 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 4, _4>;
  _16 = Arr_7(D) + _14;
  MEM[(int[10] *)_16] = 2;
  return;
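
To make the offsets in the dump above easier to follow, here is a small C
sketch of the arithmetic. It is only a model under stated assumptions: the
source of foo is not shown in this thread, so the shape below is inferred from
the 44, 40, 400 and 4 constants in the dump; the helper names
(widen_mult_plus, offset_store1, offset_store2, offsets_match_source) are
hypothetical; and it assumes a 4-byte int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed from the (int[10] *) casts in the dump; not taken from the thread. */
typedef int ArrT[10][10];

/* Hypothetical model of WIDEN_MULT_PLUS_EXPR <a, b, c>: widen both 32-bit
   operands to 64 bits, multiply, then add c.  */
int64_t
widen_mult_plus (int32_t a, int32_t b, int64_t c)
{
  return (int64_t) a * (int64_t) b + c;
}

/* Byte offset of the first store: _11 = Idx w* 44.  */
int64_t
offset_store1 (int32_t idx)
{
  return (int64_t) idx * 44;
}

/* Byte offset of the second store: _2 = Idx w* 40; _4 = _2 + 400;
   _14 = WIDEN_MULT_PLUS_EXPR <Idx, 4, _4>.  */
int64_t
offset_store2 (int32_t idx)
{
  int64_t t4 = (int64_t) idx * 40 + 400;
  return widen_mult_plus (idx, 4, t4);
}

/* Assumed source shape that would yield those offsets (4-byte int).  */
void
foo (ArrT Arr, int Idx)
{
  Arr[Idx][Idx] = 1;
  Arr[Idx + 10][Idx] = 2;
}

/* Returns 1 if the modelled offsets agree with real pointer arithmetic on
   the assumed source.  idx must be in the range 0..9.  */
int
offsets_match_source (int idx)
{
  static int backing[20][10];   /* room for rows idx and idx + 10 */
  const char *base = (const char *) backing;

  foo (backing, idx);
  int64_t off1 = (const char *) &backing[idx][idx] - base;
  int64_t off2 = (const char *) &backing[idx + 10][idx] - base;

  return off1 == offset_store1 (idx) && off2 == offset_store2 (idx)
         && backing[idx][idx] == 1 && backing[idx + 10][idx] == 2;
}
```

Under these assumptions, Idx w* 40 has two uses in the dump (_3 and _4), while
the Idx w* 4 product feeding the one remaining WIDEN_MULT_PLUS_EXPR is only
needed for the second store.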
