https://llvm.org/bugs/show_bug.cgi?id=27219
Bug ID: 27219 Summary: Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug! Product: libraries Version: 3.8 Hardware: All OS: Linux Status: NEW Severity: normal Priority: P Component: Backend: ARM Assignee: unassignedb...@nondot.org Reporter: jacob.benoi...@gmail.com CC: llvm-bugs@lists.llvm.org Classification: Unclassified Created attachment 16172 --> https://llvm.org/bugs/attachment.cgi?id=16172&action=edit testcase For some values of -mcpu, at least -mcpu=cortex-a8 and -mcpu=cortex-a7, LLVM replaces VMLA.F32 instructions by a (VMUL, VADD) pair. That much seems to be well-known: https://groups.google.com/d/msg/llvm-dev/N9u8Kv1m5do/GCyge4kZSnwJ Apparently, the idea is that on some old Cortex A8 CPUs, there was a performance problem with VMLA, so replacing it with (VMUL, VADD) was a work-around for that. However, that is missing two facts: Fact #1: A (VMUL, VADD) pair needs a register to hold the temporary result of the VMUL. In fully register-tight code making use of all NEON registers, that means spilling. Concretely, matrix multiplication (GEMM) kernels are an example of critical code using all available NEON registers and doing mostly VMLA. That's how I stumbled upon this bug: Eigen (http://eigen.tuxfamily.org) was generating unexplainably bad code, with massive register spillage, running 10x slower than normal. Eigen needs to know the number of available registers, and whether a single-instruction multiply-accumulate (thus not requiring an intermediate temporary register) is available. https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-23 The LLVM behavior of silently replacing VMLA by VMUL+VADD breaks what was supposed to be an architecture invariant, and breaks Eigen's assumptions. For now, Eigen works around this by reimplementing the vmlaq_f32 intrinsic in inline assembly: https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-192 One problem in the microbenchmarks that people have been discussing in the above llvm mailing list thread, is that they measured isolated VMLA instructions, not accounting for the side effects on register pressure, which quickly become dominant in real-world register-tight numerical code. Fact #2: Most software compiled with this VMLA rewriting, isn't actually intended to run on a cortex-a8 device specifically. I'm getting the VMLA rewriting even without passing any -mcpu flag, probably because -mcpu=cortex-a8 (or some such) is the default: ~/android/toolchains/arm-linux-androideabi-clang3.5/bin/arm-linux-androideabi-clang++ ~/vrac/vmlaq_f32_testcase.cc -S -o v.s -march=armv7-a -mfloat-abi=softfp -mfpu=neon -O3 In this command line, I didn't say that I was interested in cortex-a8, so why would I be getting a cortex-a8 workaround that's detrimental on every other device, and potentially catastrophic on register-tight code? -- You are receiving this mail because: You are on the CC list for the bug.
_______________________________________________ llvm-bugs mailing list llvm-bugs@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs