https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115777

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I tried the optimistic way of classifying a problematic load as
VMAT_ELEMENTWISE, which for BB vectorization results in not vectorizing the
SLP node but instead making it external, building it from scalars.  That
still makes vectorization profitable:

_7 1 times scalar_store costs 12 in body
_4 1 times scalar_store costs 12 in body
*_6 1 times scalar_load costs 12 in body
*_3 1 times scalar_load costs 12 in body
node 0x3f1bf0b0 1 times vec_perm costs 4 in body
node 0x3f1bf020 1 times vec_construct costs 4 in prologue
_7 1 times unaligned_store (misalign -1) costs 12 in body
*_6 1 times vec_to_scalar costs 4 in epilogue
*_3 1 times vec_to_scalar costs 4 in epilogue
t.c:7:11: note: Cost model analysis for part in loop 2:
  Vector cost: 28
  Scalar cost: 48
t.c:7:11: note: Basic block will be vectorized using SLP

I think we falsely consider the permute node recording the corresponding
scalar lanes as covering the scalar loads here, not realizing we have to
keep them (also, on the other side, we think we have to extract both
lanes from the permute).  Fixing the first issue would reduce the scalar
cost by 24; fixing both would also reduce the vector cost by 8, in the
end still trading a scalar store (12) for vector construction and
permute (8).
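
Spelled out with the numbers from the dump above:

  scalar:  48 - 24 (the two scalar_loads stay, so they no longer count) = 24
  vector:  28 -  8 (the two vec_to_scalar extracts go away)             = 20
           with 20 = unaligned_store (12) + vec_construct (4) + vec_perm (4)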

The result is

insertion_sort  =>     1008

which is faster than the version with store-to-load forwarding (STLF) fails

insertion_sort  =>     2333

but slower than w/o vectorization

insertion_sort  =>      181
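
For reference, the kernel here is presumably the adjacent-element
compare-and-swap in the insertion sort inner loop; a minimal sketch
(hypothetical reconstruction, the actual t.c isn't quoted in this comment):

void insertion_sort (int *a, int n)
{
  for (int i = 1; i < n; i++)
    for (int j = i; j > 0 && a[j] < a[j-1]; j--)
      {
        /* Two scalar loads and two scalar stores; SLP turns the
           stores into one 8-byte vector store of the swapped pair.  */
        int tmp = a[j-1];
        a[j-1] = a[j];
        a[j] = tmp;
      }
}

With SLP vectorization the swap then compiles to: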

        movl    (%rax), %ecx
        movl    4(%rax), %edx
        cmpl    %ecx, %edx
        jnb     .L6
        movd    %edx, %xmm0
        movd    %ecx, %xmm1
        punpckldq       %xmm1, %xmm0
        movq    %xmm0, (%rax)
        cmpq    %rdi, %rax
        jne     .L7

In backend costing we do anticipate the vector construction to happen
by loading from memory though, so we don't account for the extra
GPR->xmm move penalty.
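
To make that concrete, a sketch in SSE2 intrinsics (illustration only,
not the actual costing or generated code): the construction is priced as
if the lanes came straight from memory, while here both lanes sit in
GPRs after the compare and each needs a movd first.

#include <emmintrin.h>

/* What the backend costing anticipates: the two lanes are built by
   loading directly from memory, a single movq.  */
static __m128i
construct_from_memory (const int *p)
{
  return _mm_loadl_epi64 ((const __m128i *) p);
}

/* What actually happens in the loop above: both lanes live in GPRs
   after the compare, so each one takes a GPR->xmm move (movd) before
   the punpckldq -- that extra penalty is not costed.  */
static __m128i
construct_from_gprs (int lo, int hi)
{
  return _mm_unpacklo_epi32 (_mm_cvtsi32_si128 (lo),
                             _mm_cvtsi32_si128 (hi));
}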
