https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97428
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-10-15
           Keywords|                            |missed-optimization
          Component|target                      |tree-optimization
             Target|                            |x86_64-*-* i?86-*-*
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Kind-of confirmed.  Note it's always worth looking at what -fopt-info says,
which here gives

> ./cc1 -quiet t.c -O3 -fopt-info-vec -march=skylake
t.c:6:3: optimized: loop vectorized using 32 byte vectors
t.c:6:3: optimized: loop versioned for vectorization because of possible
aliasing
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors
t.c:6:26: optimized: basic block part vectorized using 32 byte vectors
t.c:6:3: optimized: basic block part vectorized using 32 byte vectors

so we apply versioning here because src[] and dst[] may alias (which is
because AVX vectorizing somehow triggers unrolling).  Note with standard SSE
we see

> ./cc1 -quiet t.c -O3 -fopt-info-vec
t.c:6:3: optimized: basic block part vectorized using 16 byte vectors

and a nicer vectorized kernel

.L3:
        movupd  (%rsi), %xmm3
        movupd  (%rax), %xmm1
        addq    $64, %rsi
        addq    $64, %rax
        movupd  -48(%rsi), %xmm7
        movupd  -32(%rsi), %xmm2
        subq    $-128, %rdi
        movupd  -16(%rsi), %xmm6
        movapd  %xmm3, %xmm8
        movupd  -48(%rax), %xmm5
        unpcklpd        %xmm7, %xmm8
        movupd  -32(%rax), %xmm0
        movupd  -16(%rax), %xmm4
        unpckhpd        %xmm7, %xmm3
        movups  %xmm8, -128(%rdi)
        movapd  %xmm2, %xmm8
        unpckhpd        %xmm6, %xmm2
        movups  %xmm2, -80(%rdi)
        movapd  %xmm1, %xmm2
        unpcklpd        %xmm6, %xmm8
        unpckhpd        %xmm5, %xmm1
        unpcklpd        %xmm5, %xmm2
        movups  %xmm8, -112(%rdi)
        movups  %xmm2, -64(%rdi)
        movapd  %xmm0, %xmm2
        unpckhpd        %xmm4, %xmm0
        unpcklpd        %xmm4, %xmm2
        movups  %xmm3, -96(%rdi)
        movups  %xmm2, -48(%rdi)
        movups  %xmm1, -32(%rdi)
        movups  %xmm0, -16(%rdi)
        cmpq    %rsi, %rdx
        jne     .L3

because the cost model tells us applying an interleaving scheme is not
profitable, and the loop is later BB vectorized.

Also note that with -march=skylake the loops that are _not_ loop vectorized
are later BB vectorized and also produce a nice

.L8:
        vmovupd (%rdi), %ymm1
        vmovupd 32(%rdi), %ymm4
        vmovupd (%rax), %ymm0
        vmovupd 32(%rax), %ymm3
        vunpcklpd       %ymm4, %ymm1, %ymm2
        vunpckhpd       %ymm4, %ymm1, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, 32(%r8)
        vunpcklpd       %ymm3, %ymm0, %ymm1
        vunpckhpd       %ymm3, %ymm0, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm1, %ymm1
        vpermpd $216, %ymm0, %ymm0
        addq    $64, %rdi
        vmovupd %ymm2, (%r8)
        vmovupd %ymm1, 64(%r8)
        vmovupd %ymm0, 96(%r8)
        addq    $64, %rax
        subq    $-128, %r8
        cmpq    %rdi, %rdx
        jne     .L8

which is comparable.  But somehow the actual loop-vectorized variant is
totally off: SLP analysis fails, yet with AVX the interleaving scheme
suddenly appears profitable.

One improvement to your code (not improving the loop kernels themselves, but
eliding the versioning-for-alias check) is to change the function signature
to

void foo_i2(dcmlx4_t * __restrict dst, const dcmlx_t * __restrict src, int n)

thus asserting that dst and src do not overlap.
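For reference, a minimal self-contained sketch of the restrict-qualified
variant; the dcmlx_t/dcmlx4_t definitions and the loop body here are
assumptions reconstructed from the dump below, not necessarily the exact
t.c testcase:

typedef struct { double re, im; } dcmlx_t;          /* assumed 16-byte layout */
typedef struct { double re[4], im[4]; } dcmlx4_t;   /* assumed 64-byte layout */

/* __restrict promises dst and src never overlap, so the vectorizer can
   drop the runtime alias check and the versioned copy of the loop.  */
void foo_i2(dcmlx4_t * __restrict dst, const dcmlx_t * __restrict src, int n)
{
  for (int i = 0; i < n; i++)
    for (int k = 0; k < 4; k++)          /* hypothetical kernel shape */
      {
        dcmlx_t s0 = src[i*4 + k];       /* first load group */
        dcmlx_t s1 = src[i*4 + k + n];   /* second load group, offset by n */
        dst[i*2].re[k]     = s0.re;
        dst[i*2].im[k]     = s0.im;
        dst[i*2 + 1].re[k] = s1.re;
        dst[i*2 + 1].im[k] = s1.im;
      }
}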
Now, for the vectorizer part, the issue we run into is that for loop
vectorization we do not try to split up the large store group:

t.c:6:3: note:   Detected interleaving store of size 16
t.c:6:3: note:  _36->re[0] = s00$re_42;
t.c:6:3: note:  _36->re[1] = s01$re_65;
t.c:6:3: note:  _36->re[2] = s02$re_67;
t.c:6:3: note:  _36->re[3] = s03$re_69;
t.c:6:3: note:  _36->im[0] = s00$im_64;
t.c:6:3: note:  _36->im[1] = s01$im_66;
t.c:6:3: note:  _36->im[2] = s02$im_68;
t.c:6:3: note:  _36->im[3] = s03$im_70;
t.c:6:3: note:  _39->re[0] = s10$re_71;
t.c:6:3: note:  _39->re[1] = s11$re_73;
t.c:6:3: note:  _39->re[2] = s12$re_75;
t.c:6:3: note:  _39->re[3] = s13$re_77;
t.c:6:3: note:  _39->im[0] = s10$im_72;
t.c:6:3: note:  _39->im[1] = s11$im_74;
t.c:6:3: note:  _39->im[2] = s12$im_76;
t.c:6:3: note:  _39->im[3] = s13$im_78;

which eventually runs into one of the different interleaving load groups

t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:  s11$re_73 = _22->re;
t.c:6:3: note:  s11$im_74 = _22->im;
t.c:6:3: note:  <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:  s12$re_75 = _27->re;
t.c:6:3: note:  s12$im_76 = _27->im;
t.c:6:3: note:  <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:  s13$re_77 = _32->re;
t.c:6:3: note:  s13$im_78 = _32->im;
t.c:6:3: note:  <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:  s10$re_71 = _17->re;
t.c:6:3: note:  s10$im_72 = _17->im;
t.c:6:3: note:  <gap of 6 elements>
t.c:6:3: note:   Detected interleaving load of size 8
t.c:6:3: note:  s00$re_42 = _4->re;
t.c:6:3: note:  s00$im_64 = _4->im;
t.c:6:3: note:  s01$re_65 = _7->re;
t.c:6:3: note:  s01$im_66 = _7->im;
t.c:6:3: note:  s02$re_67 = _10->re;
t.c:6:3: note:  s02$im_68 = _10->im;
t.c:6:3: note:  s03$re_69 = _13->re;
t.c:6:3: note:  s03$im_70 = _13->im;

which also shows we somehow get five load groups instead of the desired two.
We're confused by the differing base addresses for src[i*4+0+n] and
src[i*4+1+n], failing to split the constant CST * 16 out of the base:

Creating dr for _22->im
analyze_innermost: success.
        base_address: src_45(D) + (sizetype) (((long unsigned int) n_44(D) +
1) * 16)
...
Creating dr for _27->re
analyze_innermost: success.
        base_address: src_45(D) + (sizetype) (((long unsigned int) n_44(D) +
2) * 16)
...

That splits the group unnecessarily and contributes to the issue (the sketch
after the referenced-bugs list below makes the shared base concrete).  In the
end we would still need to consider splitting the store group.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
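To make the missed factoring concrete, a small stand-alone sketch (function
and variable names are illustrative, not from the testcase): with
sizeof(dcmlx_t) == 16, both accesses share the runtime-variant part
src + n*16 + i*64 and differ only by a compile-time constant, which is
exactly what the data-ref analysis fails to separate out:

#include <assert.h>
#include <stdint.h>

typedef struct { double re, im; } dcmlx_t;   /* assumed: sizeof == 16 */

/* Illustrative only: the two "different" bases computed above,
   src + (n+1)*16 and src + (n+2)*16, share the variant part src + n*16;
   the leftover difference is the constant 16.  */
void show_common_base(const dcmlx_t *src, int i, int n)
{
  uintptr_t a1 = (uintptr_t) &src[i*4 + 1 + n];  /* src + i*64 + (n+1)*16 */
  uintptr_t a2 = (uintptr_t) &src[i*4 + 2 + n];  /* src + i*64 + (n+2)*16 */
  assert (a2 - a1 == sizeof (dcmlx_t));          /* constant offset: 16 */
}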