https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97147

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #2)
> Disable (define_insn "*sse3_haddv2df3_low" and (define_insn
> "*sse3_hsubv2df3_low" seems to be ok.
> But for foo1.
>
> v2df foo1 (v2df x, v2df y)
> {
>   v2df a;
>   a[0] = x[0] + x[1];
>   a[1] = y[0] + y[1];
>   return a;
> }
>
> it's
>
>         vhaddpd %xmm1, %xmm0, %xmm0
>         ret
>
> vs
>
>         movapd  xmm2, xmm0
>         unpckhpd        xmm2, xmm2
>         addsd   xmm0, xmm2
>         movapd  xmm2, xmm1
>         unpckhpd        xmm1, xmm1
>         addsd   xmm1, xmm2
>         unpcklpd        xmm0, xmm1
>         ret
>
> and note w/o vhaddpd, codegen can be optimized to
>
>         movapd  xmm2, xmm0
>         unpcklpd        xmm2, xmm1
>         unpckhpd        xmm0, xmm1
>         addpd   xmm0, xmm2
>         ret
>
> Guess maybe it's better done in gimple level?

On GIMPLE we see the testcase basically unchanged from what the source does:

  _1 = BIT_FIELD_REF <x_7(D), 64, 0>;
  _2 = BIT_FIELD_REF <x_7(D), 64, 64>;
  _3 = _1 + _2;
  a_9 = BIT_INSERT_EXPR <a_8(D), _3, 0>;
  _4 = BIT_FIELD_REF <y_10(D), 64, 0>;
  _5 = BIT_FIELD_REF <y_10(D), 64, 64>;
  _6 = _4 + _5;
  a_11 = BIT_INSERT_EXPR <a_9, _6, 64>;
  return a_11;

Vectorization fails in SLP discovery because we essentially see two lanes
operating on different vectors, and we don't implement a way to shuffle
them together.

I think the full hadd define_insns are OK to keep; they really have special
arrangements (esp. the SFmode variants).  But the reductions to scalar
(*_low) seem unnecessary and penalizing (maybe we can guard use of those
with a -mtune-ctrl?).  I also see we're missing patterns for h{add,sub}ps
(not sure if we can manage to get combine to synthesize it).
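
For illustration only, here is a minimal sketch of what "shuffling the lanes
together" would amount to at the source level.  foo1 is the testcase quoted
above; foo1_shuffled is a hypothetical hand-written variant (not from this
report and not something GCC produces today) that first gathers the low
lanes of x and y into one vector and the high lanes into another, so that a
single vector add remains.  It assumes the GCC vector extension typedef for
v2df; whether it actually compiles to the unpcklpd/unpckhpd/addpd sequence
depends on the compiler version and flags.

```c
/* GCC vector extension: two doubles packed in a 16-byte vector.
   The typedef is an assumption; the report does not show it.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* The testcase from comment #2: each result lane is a horizontal
   add within a different input vector.  */
v2df
foo1 (v2df x, v2df y)
{
  v2df a;
  a[0] = x[0] + x[1];
  a[1] = y[0] + y[1];
  return a;
}

/* Hypothetical hand-shuffled equivalent: build { x[0], y[0] } and
   { x[1], y[1] } explicitly, then do one lane-wise vector add.  */
v2df
foo1_shuffled (v2df x, v2df y)
{
  v2df lo = { x[0], y[0] };   /* low lanes of x and y   */
  v2df hi = { x[1], y[1] };   /* high lanes of x and y  */
  return lo + hi;             /* one add per lane       */
}
```

Both functions compute the same values; the second just makes explicit the
permutation that SLP discovery would need to find on its own.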