https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I tried the optimistic way of classifying a problematic load as
VMAT_ELEMENTWISE, which for BB vectorization results in not vectorizing the
SLP node but instead making it external, building it from scalars.  That
still makes vectorization profitable:

_7 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
*_6 1 times scalar_load costs 12 in body
*_3 1 times scalar_load costs 12 in body
node 0x3f1bf0b0 1 times vec_perm costs 4 in body
node 0x3f1bf020 1 times vec_construct costs 4 in prologue
_7 1 times unaligned_store (misalign -1) costs 12 in body
*_6 1 times vec_to_scalar costs 4 in epilogue
*_3 1 times vec_to_scalar costs 4 in epilogue
t.c:7:11: note: Cost model analysis for part in loop 2:
  Vector cost: 28
  Scalar cost: 48
t.c:7:11: note: Basic block will be vectorized using SLP

I think we falsely consider the permute node recording the corresponding
scalar lanes as covering the scalar loads here, not realizing we have to keep
them (and on the other side we think we have to extract both lanes from the
permute).  Fixing the first issue would reduce the scalar cost by 24, and
fixing both would also reduce the vector cost by 8, in the end still trading
a scalar store (12) for the vector construction and permute (8).

The result is

  insertion_sort => 1008

which is faster than with the STLF fails

  insertion_sort => 2333

but slower than without vectorization

  insertion_sort => 181

        movl    (%rax), %ecx
        movl    4(%rax), %edx
        cmpl    %ecx, %edx
        jnb     .L6
        movd    %edx, %xmm0
        movd    %ecx, %xmm1
        punpckldq       %xmm1, %xmm0
        movq    %xmm0, (%rax)
        cmpq    %rdi, %rax
        jne     .L7

In backend costing we do anticipate the vector construction happening by
loading from memory, though, so we don't account for the extra GPR->xmm
move penalty.
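
For reference, a minimal sketch of the kind of adjacent-element compare-and-swap
kernel involved; the actual t.c is not quoted in this comment, so the function
below is only a reconstruction assumed from the insertion_sort name, the cost
dump, and the quoted assembly, not the real testcase:

/* Hypothetical reduction of the insertion_sort testcase (the real t.c is not
   shown here).  The inner-loop swap is the pattern BB SLP vectorizes: two
   scalar loads (*_3, *_6 in the cost dump), a permute, and one 64-bit vector
   store replacing the two scalar stores -- what the movd/punpckldq/movq
   sequence above implements.  */
void
insertion_sort (int *a, int n)
{
  for (int i = 1; i < n; i++)
    for (int j = i; j > 0 && a[j - 1] > a[j]; j--)
      {
        int tmp = a[j - 1];
        a[j - 1] = a[j];
        a[j] = tmp;
      }
}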