https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102750
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 51609
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51609&action=edit
testcase

This testcase reproduces the vectorization difference with REV and REV^
when using -Ofast -march=znver2 [-fno-early-inlining].  Trunk still shows
similar behavior.

The issue is likely code like

  vect__325.61_64 = MEM <vector(2) double> [(double *)_80];
  vect__325.62_60 = MEM <vector(2) double> [(double *)_80 + 16B];
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.62_60, 64, 0>;

where it would have been better to emit two scalar loads and combine them
with a CTOR (or in asm use movsd + movhpd).  But we end up with

.L5:
        vmovupd (%rcx), %xmm3
        vmovupd 16(%rcx), %xmm2
...
        addq    $48, %rcx
...
        vunpckhpd       %xmm2, %xmm3, %xmm7
...
        vfnmadd132sd    -48(%rcx), %xmm8, %xmm15

also scattered around (but that's not GIMPLE's fault).  The vectorizer
generates

  vectp.60_65 = &_6->c[0].real;
  vect__325.61_64 = MEM <vector(2) double> [(double *)vectp.60_65];
  vectp.60_61 = vectp.60_65 + 16;
  vect__325.62_60 = MEM <vector(2) double> [(double *)vectp.60_61];
  vectp.60_57 = vectp.60_65 + 32;
  vect__325.64_53 = VEC_PERM_EXPR <vect__325.61_64, vect__325.62_60, { 0, 2 }>;
  _52 = BIT_FIELD_REF <vect__325.64_53, 64, 64>;

which is of course entirely reasonable in some sense (that's 12 + 12 + 4 + 4
cost - two scalar loads plus CTOR would cost 12 + 12 + 8 but we'd likely
still generate and cost the scalar extract).  Eventually it's better to
pattern-match the permute and demote the vector loads to scalar...
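
For illustration only, the two shapes discussed above can be written out in C using the GCC vector extensions.  This is a minimal sketch, not the attached testcase: the type v2df and the function names are hypothetical, the input is assumed to be four contiguous doubles, and which instructions the compiler actually picks for either function is not guaranteed.

```c
#include <assert.h>
#include <string.h>

/* GCC vector extension: two doubles in one 16-byte vector.
   The name v2df is chosen for this sketch only. */
typedef double v2df __attribute__((vector_size(16)));

/* Shape the vectorizer emits today: two 16-byte loads followed by a
   permute that keeps lane 0 of each input, modelling
   VEC_PERM_EXPR <lo, hi, { 0, 2 }>.  (Hypothetical helper.) */
static v2df gather_via_vector_loads(const double *p)
{
    v2df lo, hi;
    memcpy(&lo, p, sizeof lo);       /* reads p[0], p[1] */
    memcpy(&hi, p + 2, sizeof hi);   /* reads p[2], p[3] */
    return (v2df){ lo[0], hi[0] };   /* lanes 0 and 2 of the lo:hi pair */
}

/* Suggested shape: two scalar loads combined by a vector constructor
   (CTOR).  Only p[0] and p[2] are read.  (Hypothetical helper.) */
static v2df gather_via_scalar_loads(const double *p)
{
    return (v2df){ p[0], p[2] };
}

/* Self-check that both shapes produce the same vector. */
static void check_gather_equivalence(void)
{
    const double c[4] = { 1.0, 2.0, 3.0, 4.0 };
    v2df a = gather_via_vector_loads(c);
    v2df b = gather_via_scalar_loads(c);
    assert(a[0] == 1.0 && a[1] == 3.0);
    assert(b[0] == a[0] && b[1] == a[1]);
}
```

Calling check_gather_equivalence() from a main function confirms that the two forms agree on the result; the difference is only in how many bytes are loaded and whether a permute is needed afterwards.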

We also end up with

  vect__259.20_179 = .FNMA (_182, vect__326.17_185, _196);
  vect__368.19_180 = .FMA (_182, vect__326.17_185, _196);
  _178 = VEC_PERM_EXPR <vect__368.19_180, vect__259.20_179, { 0, 5, 2, 7 }>;

and unhandled

add_force_to_mom.c:72:60: note:   node 0x3ed37e0 (max_nunits=4, refcnt=2)
add_force_to_mom.c:72:60: note:   op template: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 0 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 1 _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:         stmt 2 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:         stmt 3 _399 = _6->c[2].real;
add_force_to_mom.c:72:60: note:         load permutation { 2 2 4 4 }
....
add_force_to_mom.c:72:60: note:   ==> examining statement: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: missed:   BB vectorization with gaps at the end of a load is not supported
add_force_to_mom.c:53:16: missed:   not vectorized: relevant stmt not supported: _362 = _6->c[1].real;
add_force_to_mom.c:72:60: note:   Building vector operands of 0x3ed37e0 from scalars instead