https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-10-15
           Keywords|                            |missed-optimization
          Component|target                      |tree-optimization
             Target|                            |x86_64-*-* i?86-*-*
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Kind-of confirmed.  Note it's always worth looking at what -fopt-info says,
which here gives

> ./cc1 -quiet t.c -O3 -fopt-info-vec -march=skylake
t.c:6:3: optimized: loop vectorized using 32 byte vectors
t.c:6:3: optimized:  loop versioned for vectorization because of possible aliasing
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors
t.c:6:26: optimized: basic block part vectorized using 32 byte vectors
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors

so we apply versioning here because src[] and dst[] may alias (which
happens because AVX vectorization somehow triggers unrolling).  Note
that with standard SSE we see

> ./cc1 -quiet t.c -O3 -fopt-info-vec               
t.c:6:3: optimized: basic block part vectorized using 16 byte vectors

and a nicer vectorized kernel

.L3:
        movupd  (%rsi), %xmm3
        movupd  (%rax), %xmm1
        addq    $64, %rsi
        addq    $64, %rax
        movupd  -48(%rsi), %xmm7
        movupd  -32(%rsi), %xmm2
        subq    $-128, %rdi
        movupd  -16(%rsi), %xmm6
        movapd  %xmm3, %xmm8
        movupd  -48(%rax), %xmm5
        unpcklpd        %xmm7, %xmm8
        movupd  -32(%rax), %xmm0
        movupd  -16(%rax), %xmm4
        unpckhpd        %xmm7, %xmm3
        movups  %xmm8, -128(%rdi)
        movapd  %xmm2, %xmm8
        unpckhpd        %xmm6, %xmm2
        movups  %xmm2, -80(%rdi)
        movapd  %xmm1, %xmm2
        unpcklpd        %xmm6, %xmm8
        unpckhpd        %xmm5, %xmm1
        unpcklpd        %xmm5, %xmm2
        movups  %xmm8, -112(%rdi)
        movups  %xmm2, -64(%rdi)
        movapd  %xmm0, %xmm2
        unpckhpd        %xmm4, %xmm0
        unpcklpd        %xmm4, %xmm2
        movups  %xmm3, -96(%rdi)
        movups  %xmm2, -48(%rdi)
        movups  %xmm1, -32(%rdi)
        movups  %xmm0, -16(%rdi)
        cmpq    %rsi, %rdx
        jne     .L3

because the cost model tells us that applying an interleaving scheme is not
profitable, and the loop is later BB vectorized.

Also note that with -march=skylake the loops that are _not_ loop vectorized
are later BB vectorized and likewise produce a nice kernel

.L8:
        vmovupd (%rdi), %ymm1
        vmovupd 32(%rdi), %ymm4
        vmovupd (%rax), %ymm0
        vmovupd 32(%rax), %ymm3
        vunpcklpd       %ymm4, %ymm1, %ymm2
        vunpckhpd       %ymm4, %ymm1, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, 32(%r8)
        vunpcklpd       %ymm3, %ymm0, %ymm1
        vunpckhpd       %ymm3, %ymm0, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm0, %ymm0
        addq    $64, %rdi
        vmovupd %ymm2, (%r8)
        vmovupd %ymm1, 64(%r8)
        vmovupd %ymm0, 96(%r8)
        addq    $64, %rax
        subq    $-128, %r8
        cmpq    %rdi, %rdx
        jne     .L8

which is comparable.

But somehow the actual loop-vectorized variant is totally off: SLP
analysis fails, and with AVX the interleaving scheme suddenly appears
profitable.

One improvement to your code (not actually improving the loop kernels,
but eliding the versioning-for-alias check) is to make the function
signature

void foo_i2(dcmlx4_t * __restrict dst, const dcmlx_t * __restrict src, int n)

thus asserting that dst and src do not overlap.

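For reference, a minimal self-contained sketch of such a restrict-qualified
variant (the struct layouts and the loop body are my reconstruction from the
dumps below, not the original testcase, so details may differ):

typedef struct { double re, im; } dcmlx_t;
typedef struct { double re[4], im[4]; } dcmlx4_t;

/* Hypothetical reconstruction: two rows of four complex elements per
   iteration, the second row offset by n, matching the src[i*4+k+n]
   accesses quoted below.  With __restrict the vectorizer may assume
   dst and src do not overlap and can drop the runtime alias check.  */
void foo_i2 (dcmlx4_t *__restrict dst, const dcmlx_t *__restrict src, int n)
{
  for (int i = 0; i < n; i++)
    for (int k = 0; k < 4; k++)
      {
        dcmlx_t s0 = src[i*4 + k];
        dcmlx_t s1 = src[i*4 + k + n];
        dst[i*2].re[k] = s0.re;
        dst[i*2].im[k] = s0.im;
        dst[i*2 + 1].re[k] = s1.re;
        dst[i*2 + 1].im[k] = s1.im;
      }
}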

Now, for the vectorizer part the issue we run into is that for loop
vectorization we do not try to split up the large store group:

t.c:6:3: note:   Detected interleaving store of size 16
t.c:6:3: note:          _36->re[0] = s00$re_42;
t.c:6:3: note:          _36->re[1] = s01$re_65;
t.c:6:3: note:          _36->re[2] = s02$re_67;
t.c:6:3: note:          _36->re[3] = s03$re_69;
t.c:6:3: note:          _36->im[0] = s00$im_64;
t.c:6:3: note:          _36->im[1] = s01$im_66;
t.c:6:3: note:          _36->im[2] = s02$im_68;
t.c:6:3: note:          _36->im[3] = s03$im_70;
t.c:6:3: note:          _39->re[0] = s10$re_71;
t.c:6:3: note:          _39->re[1] = s11$re_73;
t.c:6:3: note:          _39->re[2] = s12$re_75;
t.c:6:3: note:          _39->re[3] = s13$re_77;
t.c:6:3: note:          _39->im[0] = s10$im_72;
t.c:6:3: note:          _39->im[1] = s11$im_74;
t.c:6:3: note:          _39->im[2] = s12$im_76;
t.c:6:3: note:          _39->im[3] = s13$im_78;

which eventually runs into one of the different interleaving load
groups

t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:          s11$re_73 = _22->re;
t.c:6:3: note:          s11$im_74 = _22->im;
t.c:6:3: note:          <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:          s12$re_75 = _27->re;
t.c:6:3: note:          s12$im_76 = _27->im;
t.c:6:3: note:          <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:          s13$re_77 = _32->re;
t.c:6:3: note:          s13$im_78 = _32->im;
t.c:6:3: note:          <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:          s10$re_71 = _17->re;
t.c:6:3: note:          s10$im_72 = _17->im;
t.c:6:3: note:          <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:          s00$re_42 = _4->re;
t.c:6:3: note:          s00$im_64 = _4->im;
t.c:6:3: note:          s01$re_65 = _7->re;
t.c:6:3: note:          s01$im_66 = _7->im;
t.c:6:3: note:          s02$re_67 = _10->re;
t.c:6:3: note:          s02$im_68 = _10->im;
t.c:6:3: note:          s03$re_69 = _13->re;
t.c:6:3: note:          s03$im_70 = _13->im;

which also shows we somehow get five groups instead of the desired two.
We're confused by the different base addresses computed for src[i*4+0+n] and
src[i*4+1+n], failing to split out the constant CST * 16 from the base.

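Concretely (my arithmetic, using the 16-byte element size visible in the
dumps): the address of src[i*4 + k + n] is

  src + (sizetype)(n + k) * 16 + i * 64

so the canonical decomposition would be the common base src + (sizetype)n * 16
plus a constant offset k * 16 (with step 64 for i).  Instead the constant k is
folded into the multiplication with n, yielding a distinct base per element,
as the data references show: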
Creating dr for _22->im
analyze_innermost: success.
        base_address: src_45(D) + (sizetype) (((long unsigned int) n_44(D) + 1) * 16)
...
Creating dr for _27->re
analyze_innermost: success.
        base_address: src_45(D) + (sizetype) (((long unsigned int) n_44(D) + 2) * 16)
...

that splits the group unnecessarily and contributes to the issue.  In the
end we would still need to consider splitting the store group.

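To illustrate, a hand-split sketch (using the reconstructed types from the
example above, so again an assumption rather than the actual testcase):
breaking the size-16 store group at the dcmlx4_t boundary gives two size-8
store groups, each of which pairs up with one row of loads.

/* Hypothetical source-level analogue of splitting the store group:
   each dcmlx4_t is assembled and stored on its own, yielding two
   size-8 store groups instead of one size-16 group.  */
for (int i = 0; i < n; i++)
  {
    dcmlx4_t a, b;
    for (int k = 0; k < 4; k++)
      {
        a.re[k] = src[i*4 + k].re;
        a.im[k] = src[i*4 + k].im;
        b.re[k] = src[i*4 + k + n].re;
        b.im[k] = src[i*4 + k + n].im;
      }
    dst[i*2] = a;      /* first size-8 store group  */
    dst[i*2 + 1] = b;  /* second size-8 store group */
  }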

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
