https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167
--- Comment #15 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Andrew Pinski from comment #14)
> (In reply to Hongtao.liu from comment #13)
> > Fold shufps to VEC_PERM_EXPR, but 2 shufps are still generated.
> >
> > __m128 f (__m128 a, __m128 b)
> > {
> >   vector(4) float _3;
> >   vector(4) float _5;
> >   vector(4) float _6;
> >
> >   ;; basic block 2, loop depth 0
> >   ;; pred: ENTRY
> >   _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0, 0, 0 }>;
> >   _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0, 0, 0 }>;
> >   _6 = _3 * _5;
> >   return _6;
> >   ;; succ: EXIT
> >
> > }
>
> So this is a bit more complex as not all targets have a good extract/dup
> functionality for scalars. So maybe this should be done as a define_insn for
> x86.

No need for extract/dup: if both permutation indexes are the same, it can be
c = a * b followed by vec_perm_expr (c, c, index). It seems to be a quite
general optimization which could apply to all other operations.
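
To illustrate the equivalence claimed above, here is a minimal scalar sketch in plain C. It is not GCC code: the type v4f and the helpers perm, mul, perm_then_mul, mul_then_perm and same are hypothetical names made up for this example, and perm only models a VEC_PERM_EXPR whose two inputs are the same vector (indexes 0..3). Under those assumptions, permuting both operands with one index and then multiplying gives the same lanes as multiplying first and permuting once.

```c
#include <assert.h>

/* Hypothetical 4-lane float vector, a scalar stand-in for __m128. */
typedef struct { float lane[4]; } v4f;

/* Models VEC_PERM_EXPR <v, v, sel> when both inputs are the same vector,
   so every selector must be in the range 0..3. */
v4f perm(v4f v, const unsigned sel[4])
{
    v4f r;
    for (int i = 0; i < 4; i++) {
        assert(sel[i] < 4);
        r.lane[i] = v.lane[sel[i]];
    }
    return r;
}

/* Lane-wise multiply, like _3 * _5 in the GIMPLE dump. */
v4f mul(v4f a, v4f b)
{
    v4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Shape of the code generated today: two permutes, then one multiply. */
v4f perm_then_mul(v4f a, v4f b, const unsigned sel[4])
{
    return mul(perm(b, sel), perm(a, sel));
}

/* Shape proposed in this comment: one multiply, then one permute. */
v4f mul_then_perm(v4f a, v4f b, const unsigned sel[4])
{
    return perm(mul(a, b), sel);
}

/* Returns 1 if all four lanes compare equal, 0 otherwise. */
int same(v4f x, v4f y)
{
    for (int i = 0; i < 4; i++)
        if (x.lane[i] != y.lane[i])
            return 0;
    return 1;
}
```

Calling both functions with the index { 0, 0, 0, 0 } from the dump, or with any other index shared by both operands, should give lane-for-lane identical results, since lane i of either form is a[sel[i]] * b[sel[i]]. Whether this rewrite is also safe for every other operation (for example with respect to floating-point exceptions raised in lanes that the permute discards) is not checked by this sketch.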