https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92819
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
So when we hoist the narrowing across the permute so that, for corge, instead of

  _1 = __MEM <double> (p_4(D));
  _5 = {_1, _1, _1, _1};
  _6 = __VEC_PERM (x_2(D), _5, { 3ul, 5ul, 6ul, 7ul });
  _7 = __BIT_FIELD_REF <v2df> (_6, 128u, 0u);

we do

  _1 = __BIT_FIELD_REF <v2df> (x_3(D), 128u, 128u);
  _2 = __MEM <double> (p_5(D));
  _7 = _Literal (v2df) {_2, _2};
  _8 = __VEC_PERM (_1, _7, _Literal (v2di) { 1ul, 3ul });

then we get

  vextractf128 $0x1, %ymm0, %xmm0
  vmovddup (%rdi), %xmm1
  vunpckhpd %xmm1, %xmm0, %xmm0

which would be OK, comparable to

  vextractf128 $0x1, %ymm0, %xmm0
  vunpckhpd %xmm0, %xmm0, %xmm0
  vmovhpd (%rdi), %xmm0, %xmm0

Doing the same for foo() gets us

  vmovddup (%rdi), %xmm1
  vunpckhpd %xmm1, %xmm0, %xmm0

which looks OK to me as well.
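For illustration, here is a minimal sketch of what the three-instruction sequence for corge computes, written with x86 intrinsics. The function name `corge_lowered` and the self-check helper are hypothetical (not from the report); the intrinsics map one-to-one onto vextractf128/vmovddup/vunpckhpd, and the expected result follows from the GIMPLE above: the low v2df of the permute is { x[3], *p }.

```c
#include <immintrin.h>

/* Hypothetical intrinsic equivalent of the lowered sequence for corge():
   take the upper v2df of x, broadcast *p, then combine the high lane of
   each, yielding { x[3], *p }. */
__attribute__((target("avx")))
static __m128d corge_lowered(__m256d x, const double *p)
{
    __m128d hi  = _mm256_extractf128_pd(x, 1); /* vextractf128 $0x1, %ymm0, %xmm0 */
    __m128d dup = _mm_loaddup_pd(p);           /* vmovddup (%rdi), %xmm1 */
    return _mm_unpackhi_pd(hi, dup);           /* vunpckhpd %xmm1, %xmm0, %xmm0 */
}

/* Self-check: returns 1 when the sequence produces { x[3], *p }. */
__attribute__((target("avx")))
static int corge_lowered_ok(void)
{
    double p = 5.0;
    /* _mm256_set_pd takes the highest lane first, so x = { 1, 2, 3, 4 }. */
    __m256d x = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    double out[2];
    _mm_storeu_pd(out, corge_lowered(x, &p));
    return out[0] == 4.0 && out[1] == 5.0;
}
```

The `target("avx")` attribute lets this compile without a global -mavx flag; running it still requires an AVX-capable CPU.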