https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
--- Comment #8 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #7)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> >
> > --- Comment #6 from ktkachov at gcc dot gnu.org ---
> > (In reply to rguent...@suse.de from comment #5)
> > > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> > >
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > > >
> > > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > > (In reply to Richard Biener from comment #2)
> > > > > So any hint on whether the code after r257077 is better or worse than
> > > > > before?
> > > >
> > > > Looks worse unfortunately:
> > > > For aarch64 at -O2 it generates:
> > > > foo:
> > > >         mov     w3, 44
> > > >         mov     w2, 40
> > > >         mov     w5, 1
> > > >         mov     w4, 2
> > > >         smull   x3, w1, w3
> > > >         smull   x2, w1, w2
> > > >         str     w5, [x0, x3]
> > > >         add     x2, x2, 400
> > > >         add     x1, x2, x1, sxtw 2
> > > >         str     w4, [x0, x1]
> > > >         ret
> > > >
> > > > whereas with r257077 it generates the shorter:
> > > > foo:
> > > >         mov     w3, 40
> > > >         sxtw    x2, w1
> > > >         mov     w4, 1
> > > >         smaddl  x0, w1, w3, x0
> > > >         mov     w3, 2
> > > >         add     x1, x0, x2, lsl 2
> > > >         str     w4, [x0, x2, lsl 2]
> > > >         str     w3, [x1, 400]
> > > >         ret
> > >
> > > So shorter is worse?  Might be because I don't understand the
> > > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > > of the [x1, 400] addressing.
> >
> > Sorry, I messed up the writeup. Let me try again.
> > The shorter sequence (with the smaddl) is the good one and is produced
> > *without* r257077. After r257077 we generate the longer and worse sequence
> > with two smull.
>
> I see the shorter sequence with TOT, r257077 included.  The testcase
> explicitely checks for no widen-mult-plus but we now have two:
>
>   <bb 2> [local count: 1073741825]:
>   _17 = Idx_6(D) w* 44;
>   _13 = Arr_7(D) + _17;
>   MEM[(int[10] *)_13] = 1;
>   _4 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 40, 400>;
>   _18 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 4, _4>;
>   _16 = Arr_7(D) + _18;
>   MEM[(int[10] *)_16] = 2;
>   return;
>
> note the "shorter" sequence I see is
>
> foo:
>         mov     x4, 400
>         mov     w3, 40
>         mov     w2, 44
>         mov     w5, 1
>         smaddl  x3, w1, w3, x4
>         mov     w4, 2
>         smull   x2, w1, w2
>         add     x1, x3, x1, sxtw 2
>         str     w5, [x0, x2]
>         str     w4, [x0, x1]
>         ret
>
> which doesn't 1:1 match either of yours.

Hmm, the exact instruction mix will depend a lot on the cpu tuning in question
because the RTX costs affect the widening multiplication expansion, but at the
tree level I see only one WIDEN_MULT_PLUS_EXPR with current ToT (with r257077):

  <bb 2> [local count: 1073741825]:
  _1 = (long unsigned int) Idx_6(D);
  _2 = Idx_6(D) w* 40;
  _3 = Arr_7(D) + _2;
  _12 = Idx_6(D) w* 4;
  _11 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _11;
  MEM[(int[10] *)_13] = 1;
  _4 = _2 + 400;
  _5 = Arr_7(D) + _4;
  _14 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 4, _4>;
  _16 = Arr_7(D) + _14;
  MEM[(int[10] *)_16] = 2;
  return;
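
For anyone comparing the two dumps: as far as I can tell they compute the same byte offsets and differ only in how the additions are folded into the widening multiplies. A rough C model of the arithmetic follows. It is only a sketch under my own assumptions: the function names are made up for illustration, and signed 64-bit arithmetic stands in for the `w*` and WIDEN_MULT_PLUS_EXPR operations rather than reproducing the exact GIMPLE types.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, named for illustration only; they are not part
   of GCC or of the testcase discussed in this PR.  */

/* First store: offset = Idx w* 44 (same in both dumps).  */
int64_t first_store_offset(int32_t idx)
{
    return (int64_t)idx * 44;
}

/* Second store as in the dump with one WIDEN_MULT_PLUS_EXPR:
     _2  = Idx w* 40;
     _4  = _2 + 400;
     _14 = WIDEN_MULT_PLUS_EXPR <Idx, 4, _4>;  */
int64_t second_store_offset_one_wmadd(int32_t idx)
{
    int64_t t2 = (int64_t)idx * 40;   /* plain widening multiply */
    int64_t t4 = t2 + 400;            /* separate add */
    return (int64_t)idx * 4 + t4;     /* widening multiply-accumulate */
}

/* Second store as in the dump with two WIDEN_MULT_PLUS_EXPRs:
     _4  = WIDEN_MULT_PLUS_EXPR <Idx, 40, 400>;
     _18 = WIDEN_MULT_PLUS_EXPR <Idx, 4, _4>;  */
int64_t second_store_offset_two_wmadd(int32_t idx)
{
    int64_t t4 = (int64_t)idx * 40 + 400;  /* first multiply-accumulate */
    return (int64_t)idx * 4 + t4;          /* second multiply-accumulate */
}

/* Both forms should agree, and both equal Idx * 44 + 400.  */
void check_offsets(int32_t idx)
{
    assert(second_store_offset_one_wmadd(idx)
           == second_store_offset_two_wmadd(idx));
    assert(second_store_offset_one_wmadd(idx)
           == first_store_offset(idx) + 400);
}
```

Calling `check_offsets` on a handful of indices (including negative ones) is enough to see that the two forms agree in this model, so the difference between the dumps is about which operations get fused, not about the addresses produced.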