https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844
Bug ID: 64844
Summary: Vectorization inhibited in gcc5 when loop starts with elem[1], aarch64 perf regression from 4.9.1
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: chris_s_jones at yahoo dot com

Created attachment 34611
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34611&action=edit
Simple test case

% ./trunk_aarch64/bin/aarch64-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=./trunk_aarch64/bin/aarch64-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/local/trunk_aarch64/libexec/gcc/aarch64-linux-gnu/5.0.0/lto-wrapper
Target: aarch64-linux-gnu
Configured with: /local/src/gcc-trunk/configure --prefix=/local/trunk_aarch64 --target=aarch64-linux-gnu --with-sysroot=/local/trunk_aarch64/sysroot --with-gmp=/local/trunk_aarch64 --with-mpc=/local/trunk_aarch64 --with-mpfr=/local/trunk_aarch64 --with-cloog=/local/trunk_aarch64 --with-isl=/local/trunk_aarch64 --enable-__cxa_atexit --with-gnu-as --with-gnu-ld --enable-shared --disable-libssp --disable-libmudflap --enable-languages=c,c++,fortran --disable-libsanitizer --disable-nls
Thread model: posix
gcc version 5.0.0 20150127 (experimental) (GCC)

For the following code sample, only the first inlined call to compute()
seems to get vectorized by GCC5 using the command line shown below. In
GCC 4.9.1, both calls get vectorized. This results in a nearly 50%
performance hit for the newer compiler.
File smpd.c:

    #include <stdint.h>
    #include <stdio.h>

    inline double compute(size_t n, double const * restrict a,
                          double const * restrict b)
    {
        double res = 0.0;
        for (size_t i = 0; i < n; ++i) {
            res += a[i] + b[i];
        }
        return res;
    }

    int main(int argc, char **argv)
    {
        double ary1[1024];
        double ary2[1024];

        // Initialize arrays
        for (size_t i = 0; i < 1024; ++i) {
            ary1[i] = argc / (double)(i + 1);
            ary2[i] = argc + argc / (double) (i + 1);
        }

        // Compute two results using different starting elements
        printf("Result 0 is %f\n", compute(512, &ary1[0], &ary2[0]));
        printf("Result 1 is %f\n", compute(512, &ary1[1], &ary2[1]));

        return 0;
    }

Command line:

% aarch64-linux-gnu-gcc -O3 -mcpu=cortex-a57 -ffast-math -g -std=c99 -o smdp.gcc5.test smdp.c

Code generated by GCC5:

Loop from first call to compute (vectorized):

    400460: 3ce06a60  ldr   q0, [x19,x0]
    400464: 3ce06a82  ldr   q2, [x20,x0]
    400468: 91004000  add   x0, x0, #0x10
    40046c: f140041f  cmp   x0, #0x1, lsl #12
    400470: 4e62d400  fadd  v0.2d, v0.2d, v2.2d
    400474: 4e60d421  fadd  v1.2d, v1.2d, v0.2d
    400478: 54ffff41  b.ne  400460 <main+0x50>

Loop from second call to compute (not vectorized):

    400494: fc607a81  ldr   d1, [x20,x0,lsl #3]
    400498: fc607a62  ldr   d2, [x19,x0,lsl #3]
    40049c: 91000400  add   x0, x0, #0x1
    4004a0: f108041f  cmp   x0, #0x201
    4004a4: 1e622821  fadd  d1, d1, d2
    4004a8: 1e612800  fadd  d0, d0, d1
    4004ac: 54ffff41  b.ne  400494 <main+0x84>

In GCC 4.9.1, I see the following code generated for the second call,
following a short prologue to handle the first data element:

    40048c: 3cc10402  ldr   q2, [x0],#16
    400490: 3cc10420  ldr   q0, [x1],#16
    400494: eb13001f  cmp   x0, x19
    400498: 4e62d400  fadd  v0.2d, v0.2d, v2.2d
    40049c: 4e60d421  fadd  v1.2d, v1.2d, v0.2d
    4004a0: 54ffff61  b.ne  40048c <main+0xbc>
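
To make the "short prologue" shape of the 4.9.1 output concrete, here is a rough hand-written equivalent in C. This is only a sketch: compute_peeled is a hypothetical name and is not part of the attached test case. It assumes 16-byte vector alignment and that a and b share the same misalignment, as they do in the test case above. I have not checked whether GCC 5 vectorizes this form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not in the attached test case: peel scalar
   iterations until a[i] sits on a 16-byte boundary, then run the
   remaining iterations in a separate main loop. */
double compute_peeled(size_t n, double const * restrict a,
                      double const * restrict b)
{
    double res = 0.0;
    size_t i = 0;

    /* Prologue: for 8-byte-aligned doubles this runs at most once. */
    while (i < n && ((uintptr_t)&a[i] % 16) != 0) {
        res += a[i] + b[i];
        ++i;
    }

    /* Main loop: a[i] now starts on a 16-byte boundary (assumption:
       b has the same misalignment as a). */
    for (; i < n; ++i) {
        res += a[i] + b[i];
    }
    return res;
}
```

The additions happen in the same order as in compute(), so without -ffast-math the result is identical to the original function for any starting element.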