[llvm-bugs] [Bug 31800] New: clang/llvm vectorize the sum of a complex array poorly

via llvm-bugs Mon, 30 Jan 2017 05:10:42 -0800

https://llvm.org/bugs/show_bug.cgi?id=31800


            Bug ID: 31800
           Summary: clang/llvm vectorize the sum of a complex array poorly
           Product: libraries
           Version: 3.9
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedb...@nondot.org
          Reporter: drr...@gmail.com
                CC: llvm-bugs@lists.llvm.org
    Classification: Unclassified

Consider this code:

#include <complex.h>
complex float f(complex float x[]) {
  complex float p = 1.0;
  for (int i = 0; i < 32; i++)
    p += x[i];
  return p;
}

clang 3.9.1 with -O3 -march=core-avx2 -ffast-math gives:

f:                                      # @f
        vmovq   xmm0, qword ptr [rdi]   # xmm0 = mem[0],zero
        vmovss  xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 8] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 16] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 24] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 32] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 40] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 48] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 56] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 64] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 72] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 80] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 88] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 96] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 104] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 112] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 120] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 128] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 136] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 144] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 152] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 160] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 168] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 176] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 184] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 192] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 200] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 208] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 216] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        vmovq   xmm1, qword ptr [rdi + 224] # xmm1 = mem[0],zero
        vmovq   xmm2, qword ptr [rdi + 232] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 240] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vmovq   xmm2, qword ptr [rdi + 248] # xmm2 = mem[0],zero
        vaddps  xmm1, xmm1, xmm2
        vaddps  xmm0, xmm0, xmm1
        ret

The only vectorization is that the real and the imaginary parts of each
element are added in parallel.  Each vmovq loads a single 8-byte complex
float, so every vaddps uses only the low half of a 128-bit xmm register.
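
To make the wasted width concrete, here is a rough scalar sketch of the
lane layout a full-width version could use.  C99 gives complex float the
same representation as float[2], so a 256-bit ymm register can hold four
complex values as eight floats (re, im, re, im, ...).  The name f_lanes
and the acc[] array are my own illustration, not something from either
compiler, and the result matches f() only up to the reassociation that
-ffast-math already allows:

```c
#include <complex.h>

/* Hypothetical helper, not from the report: models one 8-float (ymm-sized)
   accumulator holding four complex partial sums. */
complex float f_lanes(const complex float x[]) {
  float acc[8] = {0};   /* lanes: re0, im0, re1, im1, re2, im2, re3, im3 */

  /* Each outer step consumes four complex values, i.e. eight floats. */
  for (int i = 0; i < 32; i += 4) {
    for (int k = 0; k < 4; k++) {
      acc[2 * k]     += crealf(x[i + k]);
      acc[2 * k + 1] += cimagf(x[i + k]);
    }
  }

  /* Fold 8 lanes to 4, then 4 to 2: even lanes stay real, odd imaginary. */
  for (int lane = 0; lane < 4; lane++)
    acc[lane] += acc[lane + 4];
  float re = acc[0] + acc[2];
  float im = acc[1] + acc[3];

  return (1.0f + re) + im * I;   /* p starts at 1.0 in the original f() */
}
```

The two folding steps at the end play the same role as a horizontal
reduction from eight lanes down to one complex value.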

However, icc produces:

f:
        vmovups   ymm1, YMMWORD PTR [rdi]                       #5.10
        vmovups   ymm2, YMMWORD PTR [64+rdi]                    #5.10
        vmovups   ymm5, YMMWORD PTR [128+rdi]                   #5.10
        vmovups   ymm6, YMMWORD PTR [192+rdi]                   #5.10
        vmovsd    xmm0, QWORD PTR p.152.0.0.1[rip]              #3.19
        vaddps    ymm3, ymm1, YMMWORD PTR [32+rdi]              #3.19
        vaddps    ymm4, ymm2, YMMWORD PTR [96+rdi]              #3.19
        vaddps    ymm7, ymm5, YMMWORD PTR [160+rdi]             #3.19
        vaddps    ymm8, ymm6, YMMWORD PTR [224+rdi]             #3.19
        vaddps    ymm9, ymm3, ymm4                              #3.19
        vaddps    ymm10, ymm7, ymm8                             #3.19
        vaddps    ymm11, ymm9, ymm10                            #3.19
        vextractf128 xmm12, ymm11, 1                            #3.19
        vaddps    xmm13, xmm11, xmm12                           #3.19
        vmovhlps  xmm14, xmm13, xmm13                           #3.19
        vaddps    xmm15, xmm13, xmm14                           #3.19
        vaddps    xmm0, xmm15, xmm0                             #3.19
        vzeroupper                                              #6.10
        ret     

which is fully vectorized (and uses the wider ymm registers).

Another key difference seems to be that in the clang/llvm-produced assembly
subsequent additions depend on each other, whereas in the icc code the
additions work on separate items, so it benefits both from full
vectorization and from superscalar parallelism.
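
As an illustration of the dependency point only, the reduction can be
split by hand into independent partial sums.  This is a sketch under my
own assumptions: the name f_split and the choice of four accumulators are
hypothetical, and I have not checked what clang generates for it.  It
changes the order of the floating-point additions, which -ffast-math
already permits:

```c
#include <complex.h>

/* Hypothetical rewrite, not from the report: four partial sums that do
   not depend on each other inside the loop. */
complex float f_split(const complex float x[]) {
  complex float p0 = 1.0f;   /* carries the initial value of p from f() */
  complex float p1 = 0.0f, p2 = 0.0f, p3 = 0.0f;

  for (int i = 0; i < 32; i += 4) {
    p0 += x[i];
    p1 += x[i + 1];
    p2 += x[i + 2];
    p3 += x[i + 3];
  }

  /* Combine the partial sums pairwise at the end. */
  return (p0 + p1) + (p2 + p3);
}
```

Within one iteration none of the four additions waits on another, so they
can issue in parallel; only the final pairwise combination is serial.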

(This report is related to https://llvm.org/bugs/show_bug.cgi?id=31677 where I
incorrectly stated at the end of the problem report that llvm could vectorise
this additive reduction loop.)

-- 
You are receiving this mail because:
You are on the CC list for the bug.

_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
