https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98167
--- Comment #14 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #13)
> fold shufps to vec_perm_expr, but still 2 shufps are generated.
>
> __m128 f (__m128 a, __m128 b)
> {
>   vector(4) float _3;
>   vector(4) float _5;
>   vector(4) float _6;
>
>   ;; basic block 2, loop depth 0
>   ;;    pred:       ENTRY
>   _3 = VEC_PERM_EXPR <b_2(D), b_2(D), { 0, 0, 0, 0 }>;
>   _5 = VEC_PERM_EXPR <a_4(D), a_4(D), { 0, 0, 0, 0 }>;
>   _6 = _3 * _5;
>   return _6;
>   ;;    succ:       EXIT
>
> }

So this is a bit more complex, as not all targets have good extract/dup
functionality for scalars. So maybe this should be done as a define_insn for
x86.
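
For readers following along, here is a minimal sketch of the two shapes being weighed, as I read the comment. It is not taken from the bug report or from any GCC patch. It uses the generic GCC/Clang vector extension rather than `__m128`, so it builds on non-x86 targets too, and the names `v4sf`, `splat0`, `mul_of_splats`, `splat_of_mul` and `forms_agree` are all hypothetical. The first function mirrors the dump (two `{ 0, 0, 0, 0 }` permutes feeding a vector multiply); the second does one scalar multiply followed by a single broadcast, which is the extract/dup form that may not be cheap on every target.

```c
#include <assert.h>

/* Four-lane float vector via the GCC/Clang vector extension.
   Hypothetical stand-in for __m128, chosen so the sketch is target-neutral. */
typedef float v4sf __attribute__((vector_size(16)));

/* Broadcast lane 0: the { 0, 0, 0, 0 } permutation seen in the dump. */
v4sf splat0(v4sf v)
{
    return (v4sf){ v[0], v[0], v[0], v[0] };
}

/* Shape of the dump: two permutes, then one vector multiply. */
v4sf mul_of_splats(v4sf a, v4sf b)
{
    return splat0(b) * splat0(a);
}

/* Alternative shape: extract lane 0 of each input, multiply the scalars,
   then broadcast (dup) the single result. */
v4sf splat_of_mul(v4sf a, v4sf b)
{
    float s = b[0] * a[0];
    return (v4sf){ s, s, s, s };
}

/* Returns 1 when both shapes produce the same value in every lane. */
int forms_agree(v4sf a, v4sf b)
{
    v4sf x = mul_of_splats(a, b);
    v4sf y = splat_of_mul(a, b);
    for (int i = 0; i < 4; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}

/* Small self-check on arbitrary sample inputs. */
void check_example(void)
{
    v4sf a = { 3.0f, 1.0f, 2.0f, 4.0f };
    v4sf b = { 0.5f, 7.0f, 8.0f, 9.0f };
    assert(forms_agree(a, b));
}
```

Both functions compute the same lanes; the difference is only in how many permutes the target has to issue. Whether the second shape is actually a win depends on how cheaply the backend can extract and duplicate a scalar, which is the concern raised above.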