https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64844

            Bug ID: 64844
           Summary: Vectorization inhibited in gcc5 when loop starts with
                    elem[1], aarch64 perf regression from 4.9.1
           Product: gcc
           Version: 5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chris_s_jones at yahoo dot com

Created attachment 34611
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34611&action=edit
Simple test case

% ./trunk_aarch64/bin/aarch64-linux-gnu-gcc -v
Using built-in specs.
COLLECT_GCC=./trunk_aarch64/bin/aarch64-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/local/trunk_aarch64/libexec/gcc/aarch64-linux-gnu/5.0.0/lto-wrapper
Target: aarch64-linux-gnu
Configured with: /local/src/gcc-trunk/configure --prefix=/local/trunk_aarch64
--target=aarch64-linux-gnu --with-sysroot=/local/trunk_aarch64/sysroot
--with-gmp=/local/trunk_aarch64 --with-mpc=/local/trunk_aarch64
--with-mpfr=/local/trunk_aarch64 --with-cloog=/local/trunk_aarch64
--with-isl=/local/trunk_aarch64 --enable-__cxa_atexit --with-gnu-as
--with-gnu-ld --enable-shared --disable-libssp --disable-libmudflap
--enable-languages=c,c++,fortran --disable-libsanitizer --disable-nls
Thread model: posix
gcc version 5.0.0 20150127 (experimental) (GCC)

For the following code sample, only the first inlined call to compute() seems
to get vectorized by GCC 5 using the command line shown below.  In GCC 4.9.1,
both calls get vectorized.  This results in a nearly 50% performance hit with
the newer compiler.

File smdp.c:
#include <stdint.h>
#include <stdio.h>

inline double compute(size_t n,
                      double const * restrict a, double const * restrict b)
{
    double res = 0.0;
    for (size_t i = 0; i < n; ++i) {
        res += a[i] + b[i];
    }
    return res;
}


int
main(int argc, char **argv) {

    double ary1[1024];
    double ary2[1024];

    // Initialize arrays
    for (size_t i = 0; i < 1024; ++i) {
        ary1[i] = argc / (double)(i + 1);
        ary2[i] = argc + argc / (double) (i + 1);
    }

    // Compute two results using different starting elements
    printf("Result 0 is %f\n", compute(512, &ary1[0], &ary2[0]));
    printf("Result 1 is %f\n", compute(512, &ary1[1], &ary2[1]));

    return 0;
}

Command line:

% aarch64-linux-gnu-gcc -O3 -mcpu=cortex-a57 -ffast-math -g -std=c99 -o smdp.gcc5.test smdp.c

Code generated by GCC5:

Loop from first call to compute (vectorized):
  400460:       3ce06a60        ldr     q0, [x19,x0]
  400464:       3ce06a82        ldr     q2, [x20,x0]
  400468:       91004000        add     x0, x0, #0x10
  40046c:       f140041f        cmp     x0, #0x1, lsl #12
  400470:       4e62d400        fadd    v0.2d, v0.2d, v2.2d
  400474:       4e60d421        fadd    v1.2d, v1.2d, v0.2d
  400478:       54ffff41        b.ne    400460 <main+0x50>

Loop from second call to compute (not vectorized):
  400494:       fc607a81        ldr     d1, [x20,x0,lsl #3]
  400498:       fc607a62        ldr     d2, [x19,x0,lsl #3]
  40049c:       91000400        add     x0, x0, #0x1
  4004a0:       f108041f        cmp     x0, #0x201
  4004a4:       1e622821        fadd    d1, d1, d2
  4004a8:       1e612800        fadd    d0, d0, d1
  4004ac:       54ffff41        b.ne    400494 <main+0x84>

In GCC 4.9.1, I see the following code generated for the second call, following
a short prologue to handle the first data element:
  40048c:       3cc10402        ldr     q2, [x0],#16
  400490:       3cc10420        ldr     q0, [x1],#16
  400494:       eb13001f        cmp     x0, x19
  400498:       4e62d400        fadd    v0.2d, v0.2d, v2.2d
  40049c:       4e60d421        fadd    v1.2d, v1.2d, v0.2d
  4004a0:       54ffff61        b.ne    40048c <main+0xbc>
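
For readers less familiar with the transformation, the shape of the 4.9.1 code for the second call (one peeled element, then two doubles per iteration) can be sketched in plain C. This is only an illustration, not compiler output: the names compute_scalar and compute_peeled are hypothetical, the two accumulators stand in for the two lanes of a 2d vector register, and the assumption that the single peeled element is there to bring &ary1[1] and &ary2[1] to a 16-byte boundary is a guess based on the listing above. The reordered summation is only legal because of -ffast-math.

```c
#include <stddef.h>

/* Hypothetical reference: same loop as compute() in the test case. */
static double compute_scalar(size_t n, const double *a, const double *b)
{
    double res = 0.0;
    for (size_t i = 0; i < n; ++i)
        res += a[i] + b[i];
    return res;
}

/* Hypothetical hand-written analogue of "peel one element, then vectorize".
 * lane0/lane1 model the two 64-bit lanes of one 128-bit accumulator. */
static double compute_peeled(size_t n, const double *a, const double *b)
{
    double res = 0.0;
    size_t i = 0;

    /* Prologue: handle the first element on its own (assumed to be an
     * alignment peel when a and b are 8 bytes past a 16-byte boundary). */
    if (n > 0) {
        res = a[0] + b[0];
        i = 1;
    }

    /* Main loop: two elements per iteration, one per lane. */
    double lane0 = 0.0, lane1 = 0.0;
    for (; i + 1 < n; i += 2) {
        lane0 += a[i] + b[i];
        lane1 += a[i + 1] + b[i + 1];
    }

    /* Epilogue: at most one leftover element. */
    for (; i < n; ++i)
        res += a[i] + b[i];

    /* Final horizontal reduction of the two lanes. */
    return res + lane0 + lane1;
}
```

With n = 512 and a start at element 1, this gives one peeled element, 255 two-element iterations, and one leftover element. The result can differ from compute_scalar() in the last few bits because the additions are reassociated.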
