https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745

            Bug ID: 100745
           Summary: GCC generates suboptimal assembly from vector
                    extensions on AArch64
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ajidala at gmail dot com
  Target Milestone: ---

Created attachment 50861
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50861&action=edit
The profile.c file minimal benchmark/test case

As part of an attempt to make mpv's scaletempo2 audio filter faster, two
vectorised implementations were written:

The first one, mine, uses aarch64 intrinsics. It shows a 3.14x speedup on my
test system, and is referred to as "new" or "nicolas" in the code.

The second one, by haasn, also referred to as "niklas" in the code, uses GCC's
vector extensions to automatically generate vectorised code for a wide variety
of architectures. It shows a smaller speedup (1.45x) on my system and on another
aarch64 test system, but a much better speedup on x86_64 (>2x for a generic
build, >10x with -march=native on this Zen+ laptop thanks to AVX).

Clang, on the other hand, compiles the vector extension code down to something
more efficient than gcc does, beating my intrinsics SIMD (even in absolute terms
compared to gcc). I believe this is due to a bug in gcc making it produce
subpar vector assembly on aarch64 in this case.

Since we'd rather not keep platform specific vector code around in mpv, and
clang's codegen is overall worse in non-vector code, we'd much appreciate it if
someone could look into what gcc is tripping over here.

Attached is the minimal microbenchmark profile.c, which needs no special
options or includes aside from stdio, so I have not attached a .i file; I hope
that's alright. My test system is a Cortex-A53 in-order core, though passing
-mtune and -march for that core does not fix it, and the problem also shows up
on a Cortex-A55 in-order core.

The test was compiled with gcc -O3 -o profile profile.c, though it is worth
noting that the pure C implementation performs much better under -O2 (possibly
a separate bug) while both SIMD versions are largely unaffected by this.

GCC Version: 10.2.0
Distribution: Arch Linux ARM
Platform: ROCK64 with an RK3328 (4x Cortex-A53, 2GB RAM)

The options used for building gcc can be found here, in build():
https://archlinuxarm.org/packages/aarch64/gcc/files/PKGBUILD

I've looked at the disassembly from gcc trunk on godbolt, but it did not look
different enough to me to suggest that this has already been fixed in trunk.
If required, I can try building gcc trunk from source.
