https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to rguent...@suse.de from comment #5)
> On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> >
> > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > (In reply to Richard Biener from comment #2)
> > > So any hint on whether the code after r257077 is better or worse than
> > > before?
> >
> > Looks worse unfortunately:
> > For aarch64 at -O2 it generates:
> > foo:
> >         mov     w3, 44
> >         mov     w2, 40
> >         mov     w5, 1
> >         mov     w4, 2
> >         smull   x3, w1, w3
> >         smull   x2, w1, w2
> >         str     w5, [x0, x3]
> >         add     x2, x2, 400
> >         add     x1, x2, x1, sxtw 2
> >         str     w4, [x0, x1]
> >         ret
> >
> > whereas with r257077 it generates the shorter:
> > foo:
> >         mov     w3, 40
> >         sxtw    x2, w1
> >         mov     w4, 1
> >         smaddl  x0, w1, w3, x0
> >         mov     w3, 2
> >         add     x1, x0, x2, lsl 2
> >         str     w4, [x0, x2, lsl 2]
> >         str     w3, [x1, 400]
> >         ret
>
> So shorter is worse?  Might be because I don't understand the
> difference between the 'lsl 2' and the 'sxtw 2' or the cost
> of the [x1, 400] addressing.

Sorry, I messed up the writeup. Let me try again.
The shorter sequence (with the smaddl) is the good one and is produced
*without* r257077.
After r257077 we generate the longer and worse sequence with two smull.
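
To make the comparison easier to follow, here is a rough C sketch of the address arithmetic that each sequence appears to perform. The source for foo is only a guess reconstructed from the assembly (two int stores at byte offsets i*44 and i*44 + 400 from the first argument); it is not the test case attached to this PR, and the off_* helper names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical source, inferred from the assembly above.
   With 10 ints per row, a row is 40 bytes and an element is 4 bytes. */
void foo(int (*a)[10], int i)
{
    a[i][i] = 1;        /* byte offset: i*40 + i*4       */
    a[i + 10][i] = 2;   /* byte offset: i*40 + i*4 + 400 */
}

/* Offsets as the smaddl sequence forms them: a single multiply by 40
   is shared, and the second store reuses the first address plus an
   immediate offset of 400. */
static long off_first_shared(int i)
{
    long row = (long)i * 40;      /* smaddl folds this into the base */
    return row + (long)i * 4;     /* [x0, x2, lsl 2] */
}

static long off_second_shared(int i)
{
    return off_first_shared(i) + 400;   /* [x1, 400] */
}

/* Offsets as the two-smull sequence forms them: i*44 and i*40 are
   computed separately, so nothing is shared between the two stores. */
static long off_first_split(int i)
{
    return (long)i * 44;                       /* smull by 44 */
}

static long off_second_split(int i)
{
    return (long)i * 40 + 400 + (long)i * 4;   /* smull by 40, add 400, add sxtw 2 */
}
```

Both variants compute the same offsets. Under this reading, the difference is that the smaddl form needs one multiply and reuses the first address for the second store, while the other form needs two independent multiplies.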