https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100745
Bug ID: 100745
Summary: GCC generates suboptimal assembly from vector extensions on AArch64
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ajidala at gmail dot com
Target Milestone: ---

Created attachment 50861
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50861&action=edit
The profile.c file minimal benchmark/test case

As part of an attempt to make mpv's scaletempo2 audio filter faster, two vectorised implementations were written:

The first one, mine, uses aarch64 intrinsics. It shows a 3.14x speedup on my test system, and is referred to as "new" or "nicolas" in the code.

The second one, by haasn, also referred to as "niklas" in the code, uses GCC's vector extensions to automatically generate vectorised code for a wide variety of architectures. It shows a smaller speedup (1.45x) on my system and on another aarch64 test system, but a much better speedup on x86_64 (>2x for generic, >10x for -march=native on this zen+ laptop thanks to avx).

Clang, on the other hand, compiles the vector extension code down to something more efficient than gcc does, beating my intrinsics SIMD (even in absolute terms compared to gcc). I believe this is due to a bug in gcc making it produce subpar vector assembly on aarch64 in this case.

Since we'd rather not keep platform-specific vector code around in mpv, and clang's codegen is overall worse in non-vector code, we'd much appreciate it if someone could look into what gcc is tripping over here.

Attached is the minimal microbenchmark profile.c. It needs no special options or includes aside from stdio, so I have not attached a .i file, if that's alright.

My test system is a cortex-a53 in-order core, though passing -mtune and -march for that core does not fix it, and the problem also exhibits itself on a cortex-a55 in-order core.
The test was compiled with gcc -O3 -o profile profile.c, though it is worth noting that the pure C implementation performs much better under -O2 (possibly a separate bug), while both SIMD versions are largely unaffected by this.

GCC Version: 10.2.0
Distribution: Arch Linux ARM
Platform: ROCK64 with a RK3328 (4x Cortex-A53, 2GB RAM)

The options used for building gcc can be found here, in build():
https://archlinuxarm.org/packages/aarch64/gcc/files/PKGBUILD

I've looked at the disassembly from gcc trunk on godbolt, but it did not look different enough to me to suggest that this has already been fixed in trunk. If required, I can try building gcc trunk from source.
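
For readers unfamiliar with GCC's vector extensions, here is a minimal sketch of the style of code being discussed. It is not taken from the attached profile.c: the names v4f, dot_scalar and dot_vec are hypothetical, and a plain dot product is assumed purely for illustration. The point is only that the vector version is written once with ordinary operators on a vector type, and the compiler is left to pick the instructions for each target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats; vector_size is given in bytes (GCC/Clang extension).
 * The type name v4f is hypothetical, chosen for this sketch only. */
typedef float v4f __attribute__((vector_size(16)));

/* Scalar reference: plain dot product. */
static float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Same computation written with vector extensions. */
static float dot_vec(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    v4f acc = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        v4f va, vb;
        memcpy(&va, a + i, sizeof va);  /* load without assuming alignment */
        memcpy(&vb, b + i, sizeof vb);
        acc += va * vb;                 /* element-wise multiply and add */
    }

    /* Horizontal sum of the four lanes, then the scalar tail. */
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Because the lanes are accumulated separately, dot_vec adds the products in a different order than dot_scalar, so the two results can differ in the last bits for general floating-point input. Comparing the output of gcc -O3 -S and clang -O3 -S on a function like dot_vec is one way to see which instructions each compiler chooses for the loop.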