https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83687
            Bug ID: 83687
           Summary: ARM NEON invalid optimisation for vabd/vabdl
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nicholas at nicholaswilson dot me.uk
  Target Milestone: ---

When compiling for ARM, the optimiser appears to do an invalid optimisation, by
attempting to reduce the sequence subtract-then-abs into the single
instruction absolute-difference. Unfortunately the two are *not* equivalent.

=== Test file (test.c) ===

// COMPILE WITH: gcc -c -O1 -o test.o test.c -mfpu=neon
#include <arm_neon.h>

int8_t testFunction1(int8_t a, int8_t b) {
  volatile int8x16_t sub = vsubq_s8(vdupq_n_s8(a), vdupq_n_s8(b));
  int8x16_t abs = vabsq_s8(sub);
  return vgetq_lane_s8(abs, 0);
}

int8_t testFunction2(int8_t a, int8_t b) {
  int8x16_t sub = vsubq_s8(vdupq_n_s8(a), vdupq_n_s8(b));
  int8x16_t abs = vabsq_s8(sub);
  return vgetq_lane_s8(abs, 0);
}

=== Result ===

$ objdump -d test.o

test.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <testFunction1>:
   0:   e24dd010        sub     sp, sp, #16
   4:   eee00b90        vdup.8  q8, r0
   8:   eee21b90        vdup.8  q9, r1
   c:   f34008e2        vsub.i8 q8, q8, q9
  10:   f44d0adf        vst1.64 {d16-d17}, [sp :64]
  14:   f46d0adf        vld1.64 {d16-d17}, [sp :64]
  18:   f3f10360        vabs.s8 q8, q8
  1c:   ee500b90        vmov.s8 r0, d16[0]
  20:   e28dd010        add     sp, sp, #16
  24:   e12fff1e        bx      lr

00000028 <testFunction2>:
  28:   eee00b90        vdup.8  q8, r0
  2c:   eee21b90        vdup.8  q9, r1
  30:   f24007e2        vabd.s8 q8, q8, q9
  34:   ee500b90        vmov.s8 r0, d16[0]
  38:   e12fff1e        bx      lr

As you can see, the vsub/vabs sequence is optimised to vabd unless "volatile"
is used to prevent it.
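
The non-equivalence can also be checked without NEON hardware. The following is a minimal scalar sketch of one 8-bit lane, not the actual NEON implementation: the helper names model_vabs_vsub and model_vabd are hypothetical, and the modelled semantics (vsub.i8 wraps modulo 256, vabs.s8 does not saturate, vabd.s8 takes the absolute value of the full-width difference) are my reading of the instruction behaviour.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of one lane of vsub.i8 followed by vabs.s8.
   The difference is wrapped to 8 bits BEFORE the absolute value is taken. */
uint8_t model_vabs_vsub(int8_t a, int8_t b)
{
    /* Wrap the difference modulo 256, as an 8-bit lane subtraction would. */
    uint8_t wrapped = (uint8_t)((unsigned)a - (unsigned)b);
    /* Reinterpret the wrapped byte as a signed 8-bit value. */
    int d = wrapped >= 128 ? (int)wrapped - 256 : (int)wrapped;
    /* Non-saturating abs: -128 stays 0x80 once truncated back to 8 bits. */
    return (uint8_t)abs(d);
}

/* Hypothetical scalar model of one lane of vabd.s8.
   The difference is computed at full width (range -255..255), so the
   absolute value (0..255) always fits in the 8-bit result. */
uint8_t model_vabd(int8_t a, int8_t b)
{
    int d = (int)a - (int)b;
    return (uint8_t)abs(d);
}
```

Under these assumptions, model_vabs_vsub(-100, 100) gives 56 and model_vabd(-100, 100) gives 200. The two models agree whenever a - b fits in a signed 8-bit value, and differ only when the subtraction overflows.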

=== Second test, to show that the behaviour differs ===

// COMPILE WITH: gcc -o a.out test.o main.c && ./a.out
#include <stdint.h>
#include <stdio.h>

int8_t testFunction1(int8_t a, int8_t b);
int8_t testFunction2(int8_t a, int8_t b);

int main() {
  printf("vabs(vsub(-100,100)) = %u\n", (uint8_t)testFunction1(-100, 100));
  printf("vabd(-100,100) = %u\n", (uint8_t)testFunction2(-100, 100));
  return 0;
}

// Result, prints:
// vabs(vsub(-100,100)) = 56   [ because vsub(-100,100) wraps to 56 ]
// vabd(-100,100) = 200        [ because vabd does abs over the 9-bit diff ]

=== Final observations ===

* Behaviour does not repro at -O0, does at -O1, -O2.
* Behaviour does not repro with the following set of options, but does repro
  if any of these options are removed:
    -O1 -fno-if-conversion -fno-forward-propagate -fno-tree-copy-prop \
    -fno-tree-copyrename -fno-tree-dominator-opts -fno-tree-ter

Currently, I'm using the "volatile" hack to prevent the vabd instruction from
being emitted. This hurts perf a bit (redundant store/load to the stack) but
at least it works.

Tested with: GCC 4.6.0, GCC 6.3.0, and GCC 7.2.0.
Hardware: Raspberry Pi, ARMv7l BCM2835 and BCM2709

=== GCC 6.3 configuration ===

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Raspbian 6.3.0-18+rpi1' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=arm-linux-gnueabihf- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1)

=== GCC 7.2.0 configuration ===

Using built-in specs.
COLLECT_GCC=/usr/local/gcc-7.2.0/bin/gcc-7.2.0
COLLECT_LTO_WRAPPER=/usr/local/gcc-7.2.0/libexec/gcc/arm-linux-gnueabihf/7.2.0/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../gcc-7.2.0/configure -v --enable-languages=c,c++,fortran --prefix=/usr/local/gcc-7.2.0 --program-suffix=-7.2.0 --with-arch=armv6 --with-fpu=vfp --with-float=hard --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 7.2.0 (GCC)