https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83687

            Bug ID: 83687
           Summary: ARM NEON invalid optimisation for vabd/vabdl
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nicholas at nicholaswilson dot me.uk
  Target Milestone: ---

When compiling for ARM, the optimiser appears to perform an invalid
optimisation: it reduces the sequence subtract-then-abs to the single
absolute-difference instruction. Unfortunately the two are *not* equivalent,
because the subtract wraps to 8 bits before the abs is taken.

=== Test file (test.c) ===
// COMPILE WITH: gcc -c -O1 -o test.o test.c -mfpu=neon
#include <arm_neon.h>

int8_t testFunction1(int8_t a, int8_t b) {
  volatile int8x16_t sub = vsubq_s8(vdupq_n_s8(a), vdupq_n_s8(b));
  int8x16_t abs = vabsq_s8(sub);
  return vgetq_lane_s8(abs, 0);
}

int8_t testFunction2(int8_t a, int8_t b) {
  int8x16_t sub = vsubq_s8(vdupq_n_s8(a), vdupq_n_s8(b));
  int8x16_t abs = vabsq_s8(sub);
  return vgetq_lane_s8(abs, 0);
}

=== Result ===
$ objdump -d test.o
test.o:     file format elf32-littlearm
Disassembly of section .text:

00000000 <testFunction1>:
   0:   e24dd010        sub     sp, sp, #16
   4:   eee00b90        vdup.8  q8, r0
   8:   eee21b90        vdup.8  q9, r1
   c:   f34008e2        vsub.i8 q8, q8, q9
  10:   f44d0adf        vst1.64 {d16-d17}, [sp :64]
  14:   f46d0adf        vld1.64 {d16-d17}, [sp :64]
  18:   f3f10360        vabs.s8 q8, q8
  1c:   ee500b90        vmov.s8 r0, d16[0]
  20:   e28dd010        add     sp, sp, #16
  24:   e12fff1e        bx      lr

00000028 <testFunction2>:
  28:   eee00b90        vdup.8  q8, r0
  2c:   eee21b90        vdup.8  q9, r1
  30:   f24007e2        vabd.s8 q8, q8, q9
  34:   ee500b90        vmov.s8 r0, d16[0]
  38:   e12fff1e        bx      lr

As you can see, the vsub/vabs sequence is optimised to vabd unless "volatile"
is used to prevent it.

=== Second test, to show that the behaviour differs ===
// COMPILE WITH: gcc -o a.out test.o main.c && ./a.out
#include <stdint.h>
#include <stdio.h>

int8_t testFunction1(int8_t a, int8_t b);
int8_t testFunction2(int8_t a, int8_t b);

int main() {
  printf("vabs(vsub(-100,100)) = %u\n", (uint8_t)testFunction1(-100, 100));
  printf("vabd(-100,100) = %u\n", (uint8_t)testFunction2(-100, 100));
  return 0;
}

// Result, prints:
//   vabs(vsub(-100,100)) = 56  [ because vsub(-100,100) wraps to 56 ]
//   vabd(-100,100) = 200       [ because vabd does abs over the 9-bit diff ]
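
For reference, here is a rough scalar sketch of what each sequence computes
for a single 8-bit lane, which may make the discrepancy easier to see. The
helper names abs_of_wrapped_sub and abs_diff are made up for illustration and
are not part of arm_neon.h. The narrowing cast to int8_t is
implementation-defined in C; the sketch assumes the modulo-256 wrap that GCC
documents.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of vsub.i8 followed by vabs.s8:
   the difference is wrapped to 8 bits first, then the abs is taken. */
static uint8_t abs_of_wrapped_sub(int8_t a, int8_t b) {
  /* a - b is computed in int; the casts model the 8-bit wrap
     (assumes GCC's modulo-256 narrowing conversion). */
  int8_t diff = (int8_t)(uint8_t)(a - b);
  int wide = diff;
  return (uint8_t)(wide < 0 ? -wide : wide);
}

/* Hypothetical scalar model of one lane of vabd.s8: the abs is taken over
   the full-width difference, and only the result is narrowed to 8 bits. */
static uint8_t abs_diff(int8_t a, int8_t b) {
  int wide = a - b;
  return (uint8_t)(wide < 0 ? -wide : wide);
}
```

With a = -100 and b = 100, the first helper returns 56 and the second returns
200, matching the output of the two test functions above. The two models agree
whenever a - b fits in the int8_t range.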


=== Final observations ===

* The behaviour does not reproduce at -O0, but does at -O1 and -O2.
* The behaviour does not reproduce with the following set of options, but does
  reproduce if any one of them is removed:
    -O1 -fno-if-conversion -fno-forward-propagate -fno-tree-copy-prop \
    -fno-tree-copyrename -fno-tree-dominator-opts -fno-tree-ter

Currently, I'm using the "volatile" hack to prevent the vabd instruction from
being emitted. This hurts perf a bit (redundant store/load to the stack) but at
least it works.

Tested with: GCC 4.6.0, GCC 6.3.0, and GCC 7.2.0.
Hardware: Raspberry Pi, ARMv7l BCM2835 and BCM2709

=== GCC 6.3 configuration ===
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Raspbian 6.3.0-18+rpi1'
--with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-6 --program-prefix=arm-linux-gnueabihf- --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/
--enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm
--disable-libquadmath --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf/jre --enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf
--with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--with-target-system-zlib --enable-objc-gc=auto --enable-multiarch
--disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard
--enable-checking=release --build=arm-linux-gnueabihf
--host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1)

=== GCC 7.2.0 configuration ===
Using built-in specs.
COLLECT_GCC=/usr/local/gcc-7.2.0/bin/gcc-7.2.0
COLLECT_LTO_WRAPPER=/usr/local/gcc-7.2.0/libexec/gcc/arm-linux-gnueabihf/7.2.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../gcc-7.2.0/configure -v --enable-languages=c,c++,fortran
--prefix=/usr/local/gcc-7.2.0 --program-suffix=-7.2.0 --with-arch=armv6
--with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf
--host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 7.2.0 (GCC)
