https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709
--- Comment #2 from Marc Glisse <glisse at gcc dot gnu.org> ---
We have trouble with your VSET macro (known issue):

  two_28 = BIT_INSERT_EXPR <two_27(D), 2.0e+0, 0 (64 bits)>;
  two_29 = BIT_INSERT_EXPR <two_28, 2.0e+0, 64 (64 bits)>;
  two_30 = BIT_INSERT_EXPR <two_29, 2.0e+0, 128 (64 bits)>;
  two_31 = BIT_INSERT_EXPR <two_30, 2.0e+0, 192 (64 bits)>;

it is easier for gcc if you write:

  v4do two={2,2,2,2};

or you could even replace two with 2 in the expressions, gcc handles it just
fine.

In reload, subregs are extracted via the stack, whereas the low subreg should
already be available (NOP) and the high one can be extracted by a single insn.
That's probably the first thing to investigate. (-mtune doesn't change what
happens)

res could be kept in a register (or even better a pair of registers) through
the whole loop.
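
To make the VSET point concrete, here is a minimal sketch of the three ways of getting a vector of 2.0 discussed above. It is not the reporter's code: the VSET macro itself is not shown in this report, so two_lane_by_lane is only a hypothetical stand-in for a lane-at-a-time set, and the v4do typedef is assumed to be a 4 x double vector declared with the GCC vector extension. The helper names and the copy-out through plain double arrays are also assumptions, made only so the results are easy to inspect.

```c
#include <assert.h>
#include <string.h>

/* Assumption: v4do is a vector of 4 doubles (32 bytes) declared with the
   GCC vector extension; the reporter's actual typedef is not shown. */
typedef double v4do __attribute__((vector_size(32)));

/* Hypothetical stand-in for a VSET-style macro: writes one lane at a time.
   Lane-wise writes like these are the kind of code that can show up as a
   chain of BIT_INSERT_EXPR statements in the GIMPLE dump. */
void two_lane_by_lane(double out[4])
{
    v4do two = {0, 0, 0, 0};
    two[0] = 2.0;
    two[1] = 2.0;
    two[2] = 2.0;
    two[3] = 2.0;
    memcpy(out, &two, sizeof two);
}

/* The form suggested in the comment: a single vector initializer. */
void two_from_initializer(double out[4])
{
    v4do two = {2, 2, 2, 2};
    memcpy(out, &two, sizeof two);
}

/* The other suggestion: no "two" variable at all, just a scalar operand.
   GCC broadcasts the scalar to every lane of the vector operand. */
void times_two(const double in[4], double out[4])
{
    v4do x;
    memcpy(&x, in, sizeof x);
    v4do r = 2.0 * x;
    memcpy(out, &r, sizeof r);
}

/* Helper (hypothetical): checks that two 4-element results match. */
void check_same(const double a[4], const double b[4])
{
    for (int i = 0; i < 4; i++)
        assert(a[i] == b[i]);
}
```

All three forms should give the same values; what may differ is the code gcc generates. Compiling with something like -O2 -mavx -fdump-tree-optimized and comparing the dumps is one way to check whether the BIT_INSERT_EXPR chain is still there for the lane-by-lane version.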